LanceDB leans on open source to build a database for multimodal AI
A columnar data format could become the AI standard
Welcome to Forkable’s COSS Corner column, where I profile startups and key figures from the commercial open source software (COSS) space.
In this edition, I chat with Chang She, one of the original co-authors of Pandas, the open source data analysis and manipulation library for Python. She also previously sold business intelligence startup DataPad to Cloudera and was VP of engineering at Tubi when Fox acquired it in 2020 for $440 million.
Today, She is co-founder and CEO of LanceDB, a startup he launched in 2021 with CTO and former Cloudera colleague Lei Xu (pictured together above).
LanceDB is setting out to be the “database for multimodal AI,” built atop a columnar open source data format of its own creation.
A ‘data lake for multimodal AI’
When I put the call out last month for open source-aligned startups to profile, I was overwhelmed with submissions (a good problem to have). But a common theme to emerge from those that I’ve looked at so far is that of managing data for AI applications — and LanceDB is very much in that vein.
With AI models swiftly becoming commoditized, it becomes all the more important to differentiate through data, domain expertise, and integrating with real-world workflows at the application level.
Y Combinator (YC) alum LanceDB is setting out to build a “data lake for multimodal AI,” making it easier for developers to search, train, pre-process, and explore all their AI data (text, images, audio, video, vector embeddings) in a single place — “at petabyte-scale,” She explained to me in an interview.
“Multimodal will continue to rise, but I think we'll see a commoditization of the models,” She said. “So what does that mean for enterprises adopting AI? The most valuable commodity is their data, so if they can successfully connect their own data to a model to build their AI applications, that's where the real value is for them. That's essentially what we're hoping to solve, whether they're building applications or core models.”
At the heart of all this is Lance, a columnar data format designed for multimodal data and machine learning — 100 times faster than Apache Parquet, according to She.
On top of Lance, the company has built the LanceDB vector database. For the core open source incarnation, this basically just means the embedded library, which is best for small-scale dabbling.
“This is perfect for experimentation: install it in about 10 seconds, and it runs everywhere,” She said.
Companies might start out there, but when they edge toward production, they’ll need something with a bit more oomph — a distributed service, operational expertise, automation, security, and maybe even hosting. And this is where its Cloud and Enterprise incarnations enter the fray.
“I think every company who's using the open source [product] today, as they go into production and go up in scale, the commercial product that we provide is a no-brainer,” She said. “It's not that they can't do it [themselves], but the amount of engineering time that they'd have to put in it makes us a no-brainer.”
LanceDB Cloud is its hosted, serverless offering, which means the user doesn’t have to manage any servers or infrastructure. This is probably best suited to individuals or smaller teams looking to get up and running quickly and scale from an initial proof-of-concept into production. LanceDB Enterprise, meanwhile, is for larger-scale deployments where data privacy, security, and dedicated support are paramount. It also allows the customer to deploy LanceDB on any cloud of their choosing.
The open source factor
LanceDB has already amassed a fairly impressive roster of customers, such as Midjourney, and as an open source company it has also attracted all manner of users. TikTok-owner ByteDance, for example, is using Lance for its cloud computing and AI unit Volcano Engine. Other notable Lance adopters include Fei-Fei Li’s World Labs, which exited stealth with $230 million in funding back in September; and Luma Labs, which creates 3D visuals from simple text prompts.
This kind of traction has also brought in some big-name investors. Last May, LanceDB announced it had raised $8 million in a seed round of funding led by CRV, a Silicon Valley VC firm that has previously backed the likes of Airtable, DoorDash, Dropbox, HubSpot, and Twitter.
“A very significant percentage of all the top generative AI companies doing image and video generation are now using Lance format,” She said. “I don't think it's an exaggeration to say that tomorrow's multimodal, cutting edge models are getting built on Lance today.”
And this gets to the thrust of why LanceDB has forged a path with open source at its core. In the same way that Apache Arrow has emerged as the standard for in-memory data, and Apache Parquet for business intelligence and analytical data, Lance could become the standard format for storing and processing AI-specific data.
The LanceDB team is well-equipped to make this happen: the founders, among other team members, have significant experience at the heart of some of the best-known open source projects. CTO and co-founder Lei Xu is a committer to Apache Hadoop (a framework for storing and processing large datasets), and he also sits on the Hadoop project management committee. And She, of course, was one of the original authors of Pandas.
“My background is basically open source data tooling; I've been doing this for quite some time,” She said. “Our team is stacked with core contributors from some of the most important open source data projects from the last decade. And we’re hoping that Lance will become one of the most important open source projects for the next decade. If we want it to be a data standard, it has to be open source.”