Learning - 2023-08-10
Advanced Image Retrieval with CLIP and Metal
by Pablo Rios

When it comes to computer vision, we're no strangers to the complexities and nuances involved in training machines to 'see' and 'understand' visual data. Image classification, a vital component of this field, serves as the foundation for numerous advanced applications, from augmented reality tools to real-time object detection.
The Rise of CLIP
While convolutional neural networks (CNNs) and deep learning have been staples in image classification, OpenAI's CLIP (Contrastive Language-Image Pre-Training) offers an exciting innovation. Its uniqueness lies in:
Merging Vision and Language: Rather than focusing solely on pixel analysis, CLIP interprets images in the context of natural language. This promotes a richer understanding of visuals within a context, offering a deeper, more holistic interpretation that mirrors human visual perception.
Zero-Shot Learning: Traditional models require extensive, specific training. In contrast, CLIP, having been pre-trained on a vast array of visual and linguistic data, can identify unfamiliar images using textual descriptions. This adaptability means it can cater to various applications without the need for constant retraining.
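To make the zero-shot idea concrete, here is a minimal sketch using the Hugging Face transformers implementation of CLIP. This is an illustration outside of Metal, and the image URL and candidate labels are placeholders you would swap for your own:

import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint (no task-specific fine-tuning needed)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image URL and candidate descriptions -- replace with your own
image = Image.open(requests.get("https://example.com/product.jpg", stream=True).raw)
labels = ["a red sneaker", "a blue t-shirt", "a leather handbag"]

# Score the image against each text description and normalize to probabilities
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))

The model has never been trained on these specific labels; it simply ranks whichever descriptions you hand it, which is what makes the approach so adaptable.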
Demo: Building a Semantic Text-to-Image Product Finder
In this demo, we will explore the application of CLIP within Metal, specifically to develop a Semantic Text-to-Image Product Finder tailored for e-commerce environments. The goal is to allow online shoppers to use natural language queries to retrieve products that closely match their preferences.
Let's get started.
Initial Setup in Metal
Navigate to the Metal Dashboard and initiate a new index. Here, specify OpenAI CLIP as the desired model. Remember, Metal also allows you to incorporate specific metadata for filtering – product ID, category, and gender, to name a few. This metadata will serve as valuable parameters to refine search results later on.
In your programming environment:
from metal_sdk.metal import Metal

api_key = 'METAL_API_KEY'
client_id = 'METAL_CLIENT_ID'
index_id = 'INDEX_ID'

metal = Metal(api_key, client_id, index_id)
Data Preparation
For this example, we will use the E-commerce Product Image dataset. It has images from 'Apparel' (for Boys and Girls) and 'Footwear' (for Men and Women).
Let's load our data and select a sample for model evaluation. To avoid overlapping sets, we'll use scikit-learn's 'train_test_split' function. While it divides the data into what seems like traditional "training" and "test" sets, we're essentially creating a "base" set (similar to training data) and an "evaluation" set for assessing the already trained CLIP model.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('fashion.csv')
base_df, evaluation_df = train_test_split(df, test_size=0.3, random_state=42)
Populating Metal with Images
Metal requires images to be provided as URLs to generate embeddings. Thus, the first step is mapping the downloaded images to their respective "ProductId" in the 'evaluation_df'. You can find the code in more detail in this notebook.
You can use an image hosting service like Postimage.org to host your images and copy their public URLs, then merge them into the 'evaluation_df' in your code.
# Extract the product IDs from the hosted image URLs
product_ids = [url.split('/')[-1].split('.')[0] for url in links]

# Convert product IDs to integers
product_ids = [int(pid) for pid in product_ids]

# Create a new DataFrame with URLs and product IDs
df_urls = pd.DataFrame({'ImageUrl': links, 'ProductId': product_ids})

# Merge on ProductId to attach each image URL to its row in evaluation_df
merged_df = pd.merge(evaluation_df, df_urls, on='ProductId', how='left')
Once our dataframe has the ImageUrl column ready, we can ingest the images into Metal, specifying the "metadata" fields we created earlier. Note that Metal accepts uploads in chunks of 100 documents, so we iterate over the dataframe in batches and use 'iterrows()' to build each payload.
import math

# Split the dataframe into chunks of 100 documents
number_of_chunks = math.ceil(len(merged_df) / 100)
payloads = []

for i in range(number_of_chunks):
    start_index = i * 100
    end_index = (i + 1) * 100

    # Build one payload per row: the image URL plus the metadata fields defined earlier
    chunk_payload = [
        {
            "imageUrl": row_data['ImageUrl'],
            "index": index_id,
            "metadata": {
                "product_id": row_data['ProductId'],
                "category": row_data['Category'],
                "gender": row_data['Gender'],
            },
        }
        for index, row_data in merged_df.iloc[start_index:end_index].iterrows()
    ]

    metal.index_many(chunk_payload)
    payloads.append(chunk_payload)
Query Execution and Analysis
Well done! Things should be looking good in our Dashboard by now and we are ready to test our embeddings.
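If you prefer to run queries from code rather than the Dashboard, the Python SDK also exposes a search call. The snippet below is a minimal sketch: the payload shape, the 'limit' argument, and the response structure are assumptions based on typical usage, so check the Metal docs for the exact signature.

# Query the index with a natural-language description.
# Assumption: search() accepts a text payload and a result limit -- verify against the SDK docs.
results = metal.search({"text": "red sneakers for boys"}, limit=5)

# Assumption: the response is JSON with a 'data' list of hits carrying our metadata fields.
for hit in results.get("data", []):
    print(hit["metadata"]["product_id"], hit["metadata"]["category"], hit["metadata"]["gender"])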
Let's try searching for "red sneakers for boys". The results should look something like this:

[Image: search results for "red sneakers for boys"]
While there might not be products explicitly categorized under "Boys," you'll find that the model retrieves products from the 'Men's' category. This nuanced retrieval is a testament to CLIP's semantic understanding. The model comprehends the contextual relationship between the terms "Men" and "Boy," ensuring that it fetches items most pertinent to the user's intent.
Moving on, let's search for "Minnie Mouse T-shirts" and see what shows up.

[Image: search results for "Minnie Mouse T-shirts"]
The results? A collection of products showcasing the iconic Minnie Mouse design. What's fascinating here is the absence of explicit training or labeling to recognize such designs. The model discerns the characteristics and patterns associated with "Minnie Mouse" and efficiently surfaces the relevant products.
Evaluating Performance Metrics
Now when evaluating a system, especially one as nuanced as an e-commerce product retrieval tool, the metrics should be aligned with its most pertinent use case. In our scenario, users typically seek quick, accurate results for their most immediate needs, often reflected in their queries. Therefore, understanding and assessing the system's performance based on the most frequent queries becomes crucial.
Precision at Top K
A practical approach is to isolate the most frequent queries made by users and evaluate the accuracy of the top K retrieved results for each. These queries represent the most common items or categories shoppers are looking for.
For instance, if we define K=5 and one frequent query is "red sneakers for boys", then Precision@5 is 100% when all of the top 5 results are relevant. If only 4 out of the top 5 are relevant, Precision@5 drops to 80%.
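As a quick sketch, here is how Precision@K could be computed given a ranked list of retrieved product IDs and a set of IDs judged relevant for the query (both inputs below are hypothetical):

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved items that are relevant
    top_k = retrieved_ids[:k]
    hits = sum(1 for pid in top_k if pid in relevant_ids)
    return hits / k

# 4 of the top 5 results are relevant -> Precision@5 = 0.8
print(precision_at_k([101, 102, 103, 104, 105], {101, 102, 103, 104}, k=5))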
Recall at Top K
Recall measures the fraction of all relevant instances that are retrieved within the top K positions. In the context of our e-commerce setup, if there are 6 relevant "Minnie Mouse" t-shirts and 3 of them are retrieved within the top 5 results, the recall would be 50%.
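A companion sketch for Recall@K, again with hypothetical product IDs:

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of all relevant items that appear in the top-k results
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids) if relevant_ids else 0.0

# 3 of the 6 relevant items show up in the top 5 -> Recall@5 = 0.5
print(recall_at_k([201, 202, 203, 204, 205], {201, 202, 203, 301, 302, 303}, k=5))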
Optimizing this metric is key to ensuring users are exposed to the broadest range of relevant products, maximizing the likelihood of satisfying their shopping needs.
Wrapping Up
As demonstrated in this demo, the capabilities of OpenAI's CLIP model underscore a significant evolution in the realm of image classification. By combining vision and language, and with its innate knack for zero-shot learning, CLIP presents a transformative approach to understanding and categorizing visual content. This is particularly valuable in applications like e-commerce, where precise product matching based on textual descriptions can greatly enhance user experience.
When integrated with platforms like Metal, the potential of CLIP becomes even more tangible, as illustrated in our Semantic Text-to-Image Product Finder. Not only does this pave the way for more efficient and user-friendly shopping experiences, but it also hints at the future of AI-driven solutions across various industries. As the lines between language and vision continue to blur, it's clear that models like CLIP are steering us toward a more semantically rich and intuitive digital future.
To put this tutorial into action, you can get started with a free Metal account here!