Learning - 2023-02-17
Vector Databases 101
Want to build the next great machine learning product or just sound cool at a cocktail party? Then look no further because vector databases are here for you. Let us begin.
Why are vector databases suddenly popular?
Traditional databases are good at working with structured data, i.e. data that can be stored in tables with rows and columns. Think of your typical spreadsheet or a relational database like MySQL. Querying these databases is pretty straightforward: "Find me all my users whose first name is Ronnie with a birthday in July." Easy enough.
But what if you wanted to understand how these Ronnies compare to other users in your database? Could you write a query to return their in-app behavior, such as a sequence of events like clicks, views, and purchases, and then contrast that behavior against other users? Not very easily, and depending on how your data is structured, maybe not at all.
This is where embeddings come in. With recent advancements in machine learning, we can create embeddings: numeric representations of things like user behavior. These embeddings are stored as vectors, which can be plotted as points in what's called a vector space.
Let's say we're comparing four users, Dave, Liv, Vibeke, and Lars, along three dimensions: clicks, views, and purchases. In a vector space, this would look something like:
[Figure: embeddings of user behaviors as vectors]
Vector DBs are optimized for this kind of data representation. They are used to calculate how close or far one vector is from another or, in the case of user behavior, how similar or dissimilar two users are.
Shoutout to Lars, the contrarian of the group!
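To make "how close or far" a bit more concrete, here is a minimal sketch in Python that compares the four users with cosine similarity, one common way to measure how alike two vectors are. The behavior counts are invented for illustration (they are not read off the figure), and the helper names cosine_similarity and most_similar are our own, not part of any vector DB's API.

```python
import math

# Hypothetical behavior counts per user: (clicks, views, purchases).
# These numbers are made up for illustration only.
users = {
    "Dave":   (40.0, 55.0, 5.0),
    "Liv":    (38.0, 60.0, 6.0),
    "Vibeke": (45.0, 50.0, 4.0),
    "Lars":   (3.0, 4.0, 30.0),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Return (other_name, similarity) pairs for `name`, most similar first."""
    target = vectors[name]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("Dave", users):
        print(f"Dave vs {other}: {score:.3f}")
```

With these made-up numbers, Liv and Vibeke score close to 1.0 against Dave, while Lars lands far lower, which is the contrarian pattern the picture is getting at.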
Finding the similarities between four users along three dimensions is a relatively simple example. In practice, machine learning and embeddings can take on much more data, with some cutting-edge models producing millions of vectors, each with over a thousand dimensions.
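At that scale, "find the users most like this one" becomes a nearest-neighbor search. The sketch below shows the naive version: scan every stored vector, measure its distance to the query, and keep the closest few. It uses random vectors as stand-in data, and the function names are hypothetical. Treat it as a baseline for intuition only, not as how any particular vector DB works; the cost of this full scan grows with both the number of vectors and the number of dimensions.

```python
import heapq
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k=3):
    """Return the k (index, distance) pairs closest to `query` via a full scan."""
    scored = ((i, euclidean_distance(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


def random_vectors(count, dims, seed=0):
    """Generate stand-in embeddings; real ones would come from a model."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dims)] for _ in range(count)]


if __name__ == "__main__":
    vectors = random_vectors(count=2000, dims=128)
    query = vectors[42]  # search for the neighbors of an existing vector
    for index, distance in nearest_neighbors(query, vectors, k=3):
        print(index, round(distance, 3))
```

Because the query here is itself one of the stored vectors, the first result is that same vector at distance zero, followed by its closest neighbors.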
There's gold in them hills...of unstructured data.
The general availability of vector DBs couldn't come at a better time because the amount of data they are built for is exploding. IDC estimates that unstructured data will account for 80-90% of all data growth by 2025. Images, tweets, audio files, videos and more are proliferating at an exponential rate, and there are no signs of it stopping.
This is why it's key to understand not only how this data is stored, but also how to use it. Organizations that are first to adopt technology like embeddings will have a competitive advantage: they will be able to understand things about their customers that their peers will not.
The unstructured data revolution will be live streamed
The technology industry is famous for creating new and innovative products, but every once in a while there are tectonic shifts that simply change the game. We believe we are witnessing one of those shifts.
If you are interested in learning more, we would love to help! You can sign up on our website and we'll drop you a line :)