Learning - 2023-02-28
Use Case Spotlight: Clustering
Have you heard the old saying, "Similarities, similarities everywhere, but not an easy way to cluster them"? No? Not a thing? Well, given the amount of data out there that can benefit from clustering, it just might be soon.
What is clustering?
Clustering is one of the most popular use cases for embeddings. It's a way to group similar data together in order to discover insights that are not obvious at first glance.
Imagine walking into a bar with thirty people. Some commonalities would be immediately observable, such as height, beverage choice, or seating arrangements. Clustering the room by one of these traits would be easy. But people don't exist in single dimensions! They exhibit many traits at once. One person might be wearing a red shirt, drinking a beer and dancing, while another might be wearing a jacket, having a soda by the front door. Clustering the bar-goers by many traits would be a much more challenging task.
This is where embeddings can be put to work. By representing many dimensions of data and plotting them in a vector space, we can use techniques to compare and group the data by multiple dimensions of similarity.
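To make that concrete, here is a minimal sketch of comparing bar-goers once they are represented as vectors. The three patrons, their names, and the hand-made features are all hypothetical; real embeddings would come from a model and have far more dimensions. Euclidean distance is just one possible similarity measure.

```python
import math

# Hypothetical hand-made feature vectors for three bar-goers:
# [height in metres, drinks beer (1) or soda (0), dancing (1) or not (0)]
# Real embeddings would come from a model and have many more dimensions.
patrons = {
    "ana": [1.70, 1.0, 1.0],
    "ben": [1.75, 1.0, 1.0],
    "cai": [1.72, 0.0, 0.0],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name, people):
    """Return the other person whose vector is closest to `name`."""
    others = (p for p in people if p != name)
    return min(others, key=lambda p: euclidean(people[name], people[p]))
```

Calling `nearest("ana", patrons)` picks out the patron who is most similar across all three traits at once, rather than on height or drink choice alone.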
One of the most popular algorithms for clustering is *k-means*. If you want to skip the details, just know that the goal of k-means is to create clusters where the datapoints are as similar as possible to each other, while at the same time being as different as possible from the points in the other clusters. Here's a rundown of how it works.
To kick things off, you first determine how many clusters you want (this is represented by the variable k) and pick a starting centroid for each one, often by choosing k datapoints at random (a centroid is the average position of all the points in a cluster). Each datapoint is then assigned to the cluster whose centroid is closest.
Initial assignment of datapoints to a cluster
After each point is assigned to a cluster, the centroids are recalculated based on the new cluster members, and the process is repeated until the centroids stop shifting about so much. Once the clusters have stabilized they're ready for use!
Clusters are stabilized after many iterations
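The loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch under some assumptions, not a production implementation: the function names are made up for this post, the starting centroids are k randomly chosen datapoints, closeness is squared Euclidean distance, and "stabilized" means the centroids did not move at all between two iterations.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Centroid: the per-dimension average of a list of points."""
    return [sum(dim) / len(points) for dim in zip(*points)]

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct datapoints chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    # Labels come from the most recent assignment step.
    return labels, centroids
```

For real workloads a library implementation is usually the better choice, since it typically adds smarter initialization and multiple restarts. The sketch is only meant to show the assign-then-recalculate cycle.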
An important takeaway here is that clustering can be useful both for exploring your data and for determining production-ready groupings. It can be a really useful tool to simply learn what hidden insights your data may contain.
Clustering in the wild with Metal
Hopefully this post has helped you think of ways you could cluster your data in a real world setting. But if you're still unsure of how to get started, Metal can help! We love exploring different approaches with our users doing cluster analysis. A few examples:
- Clustering users by in-app behavior to provide more personalized marketing at scale
- Grouping similar images by their visual features in order to improve the search experience of a product catalogue
- Spotting anomalies in financial transactions by flagging events that fall outside of typical clusters
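
As a rough illustration of the last idea, one simple approach is to flag any point that sits far from every cluster centroid. The sketch below assumes you already have centroids from a clustering run. The function name, the use of Euclidean distance, and the fixed distance threshold are all hypothetical choices for this example, not a description of how Metal does it.

```python
import math

def flag_anomalies(points, centroids, threshold):
    """Return the points farther than `threshold` from every centroid."""
    flagged = []
    for p in points:
        # Distance from this point to its closest centroid.
        nearest = min(math.dist(p, c) for c in centroids)
        if nearest > threshold:
            flagged.append(p)
    return flagged
```

In practice the threshold would need tuning against your own data, for example by looking at how far typical points sit from their centroids.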
Please get in touch through our homepage to learn more :)