Learning - 2023-10-05

From Raw Data to LLM Apps

by Pablo Rios

Renaissance cats playing soccer, thinking about data

In our previous post, we broke down how to get to production with Metal in three steps: data in, data out, and checking your work. Today, we'll look at the first step in more detail and show how Metal's automated processing makes it easy to turn unstructured data into LLM-ready embeddings. We'll use data from the 2023 FIFA Women's World Cup as a case study and set up a chatbot enriched with this information!

For a walkthrough of the code discussed in this post, check out the accompanying notebook here.

Data sources: Complex Files, Simple Preprocessing

Unstructured data tends to be messy. Whether it's PDFs with embedded tables and charts, CSV files with many columns or sheets, or dense DOCX files with inconsistent formatting - this data needs to be preprocessed before it can be used by an LLM.

Consider the following screenshot, taken from the final match summary report:

Scattered data from the post match summary report

We see a real mix of data formats – there are tables with stats, charts visualizing trends, and other nested figures that might be harder to pin down. For an LLM to really grasp it, we need to separate each piece, and structure everything in a way that's clear and organized for extraction. Only then can we expect the LLM to make sense of it and be able to answer questions about it.  
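To make "separating each piece" a little more concrete, here is a minimal sketch of one way a stats table could be flattened into self-describing sentences before embedding. This is not Metal's actual pipeline, which the platform runs for you; the function name `table_to_sentences` and the sample values are placeholders made up for illustration.

```python
from typing import Dict, List


def table_to_sentences(title: str, rows: List[Dict[str, str]]) -> List[str]:
    """Turn each table row into a standalone line that keeps its context."""
    sentences = []
    for row in rows:
        # Repeat the table title and column names in every line so each
        # chunk still makes sense when it is retrieved on its own.
        facts = ", ".join(f"{column}: {value}" for column, value in row.items())
        sentences.append(f"{title} | {facts}")
    return sentences


# Placeholder values, not real match statistics.
sample_rows = [
    {"team": "Team A", "shots": "10", "possession": "55%"},
    {"team": "Team B", "shots": "7", "possession": "45%"},
]

for sentence in table_to_sentences("Match summary", sample_rows):
    print(sentence)
```

Carrying the title and column names into every line is the design choice that matters here: a retrieved chunk that only says "10, 55%" gives the LLM nothing to anchor on.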

Metal chatbot answering queries from complex files

This is where we introduce data sources in Metal. A data source is a collection of data that you want preprocessed before it is run through an embedding model. For our World Cup application, we want to create a single data source that houses all of our data.

Here's how to create a data source with the Python SDK:

# `metal` is an already-initialized Metal client (setup is not shown in this post)
payload = {
    "name": "Women World Cup 2023 Datasources",
    "sourcetype": "File",
    "autoExtract": True,
    "metadataFields": [],
}
datasource = metal.add_datasource(payload)

Indexes: Organizing and Accessing Your Data

The next step is to create an index and connect it to your data source. Indexes are where your preprocessed data is transformed and made queryable for your application. They are built to be flexible: you can use different indexes to power different features or functionality.

For instance, suppose you run a sports analytics platform. You could have individual indexes for player statistics, team rankings, and historical match data. A feature like "Player of the Month" could pull from the player statistics index, while a "Historical Match Lookup" feature could use the match data or team rankings index. Another way to think about indexes is that each one is specialized to serve a particular task in your application!
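As a rough sketch of the "one index per task" idea, an application could keep a small mapping from features to the indexes that back them. The feature names and index IDs below are hypothetical placeholders; real IDs come back from the API when an index is created.

```python
# Hypothetical index IDs, used only to illustrate the routing idea.
FEATURE_TO_INDEX = {
    "player_of_the_month": "player-statistics-index-id",
    "historical_match_lookup": "match-data-index-id",
    "team_rankings": "team-rankings-index-id",
}


def index_for_feature(feature: str) -> str:
    """Return the index ID that backs a given product feature."""
    try:
        return FEATURE_TO_INDEX[feature]
    except KeyError:
        raise ValueError(f"No index configured for feature: {feature!r}") from None


print(index_for_feature("player_of_the_month"))
```

Failing loudly on an unknown feature is deliberate: silently falling back to a default index would return plausible but unrelated results.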

This is how to add an index in Metal:

# Connect the new index to the data source created above
datasource_id = datasource['data']['id']
payload = {
    "model": "text-embedding-ada-002",
    "name": "Women World Cup 2023 Index",
    "datasource": datasource_id,
    "indexType": "HNSW",
    "dimensions": 1536,
}
wwc_index = metal.add_index(payload)
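Note that the `dimensions` value has to agree with the embedding model: `text-embedding-ada-002` returns 1536-dimensional vectors. The helper below is a hedged sketch of a pre-flight check you could run before creating an index; `build_index_payload` and `MODEL_DIMENSIONS` are illustrative names, not part of the Metal SDK, and the datasource ID is a placeholder.

```python
# Known output sizes of embedding models. Extend this mapping for other
# models you use; unknown models are passed through without a check.
MODEL_DIMENSIONS = {"text-embedding-ada-002": 1536}


def build_index_payload(name: str, datasource_id: str, model: str, dimensions: int) -> dict:
    """Build an index payload, refusing a dimension that does not match the model."""
    expected = MODEL_DIMENSIONS.get(model)
    if expected is not None and expected != dimensions:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, got {dimensions}"
        )
    return {
        "model": model,
        "name": name,
        "datasource": datasource_id,
        "indexType": "HNSW",
        "dimensions": dimensions,
    }


payload = build_index_payload(
    "Women World Cup 2023 Index", "example-datasource-id", "text-embedding-ada-002", 1536
)
print(payload["dimensions"])
```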

Data entities: From Raw Files to Ready-to-Use Embeddings

Once you've set up your data source and index, the next crucial step is to ingest your files. When files are stored in a data source, they are called data entities in Metal’s platform. 

Think of a data entity as a specific piece of information you want your application to understand and respond to. Now, depending on the source and complexity of your data, preprocessing can be a challenging task. But with Metal, it's automatic. The platform does all the heavy lifting: parsing tables, extracting key metadata, and augmenting your files so they will be converted into semantically rich embeddings.

Let's look at various data types from our World Cup data to see how this works:

  • A PDF with embedded tables and visual graphs
  • A DOCX with densely packed statistics from FIFA
  • An XLSX detailing the past World Cup winners
  • And a CSV capturing the final rounds' match outcomes

Here's a snippet showing how you can iterate over a directory of files and feed them into a data source:

import os

# Specify the directory path
directory = "wwc"

# Retrieve all files in the directory
files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]

# Add each file as a Data Entity
for file in files:
    file_path = os.path.join(directory, file)
    results = metal.add_data_entity(datasource_id, file_path)
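If the directory might also contain stray files (hidden system files, notes, and so on), you may want to filter by extension before uploading. The sketch below is one way to do that with `pathlib`; the allowed extensions are an assumption based on the four file types used in this post, not a list of what Metal supports, and `collect_uploadable_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Assumption: only the file types mentioned in this post are uploaded.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".csv"}


def collect_uploadable_files(directory: str) -> List[Path]:
    """Return files in `directory` with an allowed extension, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES
    )


if __name__ == "__main__":
    # Only runs the demo if the example directory from the post exists.
    if Path("wwc").is_dir():
        for path in collect_uploadable_files("wwc"):
            print(path)  # each path could then be passed to metal.add_data_entity
```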

We also want to pull in dynamic data, like:

  • The biography of Mary Earps, England's ace goalkeeper, straight from Wikipedia's API.

# Add the player bio via the Wikipedia API
from wikipediaapi import Wikipedia

wiki = Wikipedia('WorldCup23/0.0', 'en')
me_page = wiki.page('Mary_Earps').text
# Keep only the text that comes before the References section
me_page = me_page.split('\nReferences\n')[0]

# Push the text into Metal's index
metal.index({"text": me_page}, index_id=wwc_index['data']['id'])
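The `split` call above trims everything from the References heading onward. If you pull several pages, a slightly more general version of that step might look like the sketch below. The helper name `strip_trailing_sections` and the default heading list are assumptions for illustration; check the actual section names in the pages you ingest.

```python
from typing import Iterable


def strip_trailing_sections(
    text: str, headings: Iterable[str] = ("References", "External links")
) -> str:
    """Cut the text at the earliest occurrence of any listed section heading."""
    cut = len(text)
    for heading in headings:
        # Headings sit on their own line in the plain-text page content.
        position = text.find(f"\n{heading}\n")
        if position != -1:
            cut = min(cut, position)
    return text[:cut].rstrip()


sample = "Intro paragraph.\nCareer\nDetails.\nReferences\n[1] Source"
print(strip_trailing_sections(sample))
```

Text with none of the listed headings is returned unchanged (apart from trailing whitespace), so the helper is safe to apply to every page.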

Putting It All Together: Launching the App

After defining our data source, setting up our index, and integrating both static and dynamic data, it's time to see our LLM application in action.

Head over to the Metal Chatbot repository and follow the deployment steps there. In no time, you'll have a chatbot powered by the rich data from the 2023 FIFA Women's World Cup!

Chatbot in action

Wrapping Up

Everything we do at Metal is geared towards getting you to production as easily as possible. Our data model is built to maximize the power of LLMs while also giving you flexibility as a developer. The Women's World Cup data serves as a good example of this, as Metal's preprocessing and ingestion pipeline is built to handle many different file formats, fast!

Feel free to try this yourself by creating a free account and checking out our docs. And if you have questions or get stuck, we're here to help! 🤘