
Learning - 2023-05-04

Nearest Neighbor Search


Search functions have been around almost as long as the internet itself. Don't believe me? Google it.

Now that we've opened with a joke, let's get serious. Search is one of the most useful tools ever built in software. From finding products and services to tracking down local files on your computer, search is how we retrieve digital information – full stop.

But search is a lot like making a grilled cheese sandwich – there's more than one way to do it. In this post we're going to talk about a type of search that is especially useful with embeddings: nearest neighbor search.

Nearest Neighbor Search

Remember that embeddings are spatial representations of data, effectively capturing the semantic meaning of words, images, user behaviors, and so on. But embeddings are only useful if there are other data points to compare against. And these comparisons are made by measuring the distance between points. This is the basic idea for nearest neighbor search:

  1. Represent the query as a datapoint
  2. Measure the distance from this datapoint to the others in the set
  3. Return the closest datapoints measured in step #2
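To make step #2 concrete, here's a minimal sketch in plain Python of one common distance measure, cosine similarity. The vectors below are made-up toy embeddings, not output from a real model:

```python
import math

def cosine_similarity(a, b):
    # Step #2: measure how close two embedding vectors are.
    # 1.0 means they point in the same direction; 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.4]  # hypothetical query embedding
doc = [0.8, 0.2, 0.5]    # hypothetical document embedding
print(round(cosine_similarity(query, doc), 3))  # → 0.985
```

Euclidean distance is the other measure you'll see most often; which one a system uses depends on how its embeddings were trained.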

As the name suggests, nearest neighbor search finds whatever data is closest in distance to a query. But there are different methods available depending on how much data you're searching through and what kind of insights you're trying to uncover. Two common approaches are K-Nearest Neighbor (KNN) and Approximate Nearest Neighbor (ANN).

K-Nearest Neighbor (KNN)

You can think of KNN as the "no stone left unturned" approach. It measures and compares the distance between the query and every single datapoint in a given set, then returns the k closest. While KNN scores high in the accuracy department, it can be slow and computationally expensive on large datasets.
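Brute-force KNN fits in a few lines of plain Python. The 2-D points here are toy stand-ins for real embeddings, which typically have hundreds of dimensions:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, points, k):
    # Exhaustive search: measure the distance to every point,
    # then keep the k closest – accurate, but O(n) work per query
    return sorted(points, key=lambda p: euclidean(query, p))[:k]

data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.1)]
print(knn((1.0, 1.0), data, k=2))  # → [(1.0, 1.0), (0.9, 1.1)]
```

That `sorted` over the whole dataset is exactly why KNN gets expensive: the work grows linearly with the number of datapoints, on every single query.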

Approximate Nearest Neighbor (ANN)

ANN is the "it's close enough" approach. Unlike KNN, it doesn't compare the query against every datapoint; instead, it narrows the search to the most promising candidates. For example, some ANN algorithms first index data based on similarity and then only search the indexes most relevant to a query. This makes ANN well equipped for speed and efficiency, but at the cost of some accuracy.
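Here's one deliberately simplified sketch of that index-then-search idea in plain Python. This is not Metal's implementation; real ANN libraries pick centroids far more carefully (e.g. with k-means) or use graph- and tree-based indexes instead:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, num_buckets):
    # Index step: treat the first few points as bucket "centroids"
    # and assign every point to its nearest centroid
    centroids = points[:num_buckets]
    buckets = [[] for _ in centroids]
    for p in points:
        nearest = min(range(num_buckets), key=lambda i: euclidean(p, centroids[i]))
        buckets[nearest].append(p)
    return centroids, buckets

def ann_search(query, centroids, buckets, k):
    # Query step: scan only the single most promising bucket,
    # trading a little accuracy for far fewer comparisons
    nearest = min(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    return sorted(buckets[nearest], key=lambda p: euclidean(query, p))[:k]

data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.1), (5.2, 4.8)]
centroids, buckets = build_index(data, num_buckets=2)
print(ann_search((5.0, 5.0), centroids, buckets, k=2))  # → [(5.0, 5.0), (5.2, 4.8)]
```

The accuracy trade-off is visible here: if the true nearest neighbor happens to land in a bucket we never scan, we'll miss it. Production systems tune how many buckets to probe per query for that reason.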

It's a Beautiful Day in the Neighborhood

The important takeaway from this introduction is that there's no one-size-fits-all approach to search. Think of KNN and ANN as tools, each suited to a given task. Knowing these concepts will help you understand how complex data representations like embeddings are put to use.

There is far more to read on this subject. But for those who learn by doing, you can see nearest neighbor search in action with Metal today 🤘