How to load a Hugging Face dataset into Qdrant?
Loading a Hugging Face dataset into Qdrant is easy. This post shows how to do it.
%pip install datasets qdrant-client --quiet
Loading our dataset
For this post we'll use the Cohere/wikipedia-22-12-simple-embeddings dataset, which already has embeddings generated for it. This dataset was created by Cohere and contains embeddings for millions of Wikipedia passages. See this post for more details.
We'll use the Hugging Face datasets library to load the dataset.
from datasets import load_dataset
dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
Let's take a quick look at the dataset.
dataset
We can see the dataset has an `emb` column which contains the embeddings for each article. Alongside this we see the `title` and `text` for the articles, plus some other metadata.
Let's also take a quick look at the features of the dataset. Hugging Face Dataset objects have a `features` attribute which describes the columns of the dataset. We can see that the `emb` column is a `Sequence` of `float32` values. We also have some other columns with `string`, `int32` and `float32` values.
dataset.features
Qdrant supports a fairly wide range of payload types. All of the types in our dataset are supported by Qdrant, so we don't need to do any conversion.
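As a rough illustration of why no conversion is needed: a Qdrant payload is essentially a JSON-like dict, so any row whose non-vector fields survive a JSON round trip is fine as-is. The field names and values below are hypothetical stand-ins, not the dataset's exact schema:

```python
import json

# Hypothetical row with the same mix of types as the dataset
# (strings, ints, floats, and a list-of-floats embedding).
row = {
    "title": "24-hour clock",
    "text": "A 24-hour clock is a way of telling the time ...",
    "wiki_id": 9985,          # int-valued field (illustrative value)
    "views": 120.5,           # float-valued field (illustrative value)
    "emb": [0.1, -0.2, 0.3],  # the embedding itself
}

# Everything except the vector becomes the payload; json.dumps
# fails loudly if a field has an unsupported type.
payload = {k: v for k, v in row.items() if k != "emb"}
print(json.loads(json.dumps(payload)) == payload)  # True
```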
Creating a Qdrant collection
We'll use the Qdrant Python client for this post. This client is really nice since it allows you to create a local collection using pure Python i.e. no need to run a Qdrant server. This is great for testing and development. Once you're ready to deploy your collection you can use the same client to connect to a remote Qdrant server.
from qdrant_client import QdrantClient
We first create a client, in this case using a local path for our DB.
client = QdrantClient(path="db") # Persists changes to disk
Configuring our Qdrant collection
Qdrant is very flexible, but we need to let Qdrant know a few things about our collection. These include the name and a config for the vectors we want to store. This config includes the dimensionality of the vectors and the distance metric we want to use. Let's first check the dimensionality of our vectors.
vector_size = len(dataset[0]['emb'])
We'll also store our collection name in a variable so we can use it later.
collection_name = "cohere_wikipedia"
from qdrant_client.models import Distance, VectorParams
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
)
Adding our data to Qdrant
Note this code can be heavily optimized but gives an idea of how easy adding data to Qdrant can be. For many datasets this naive approach will work fine.
The approach we'll take below is to loop through our dataset and yield each row as a `PointStruct`. This is a Qdrant object that contains the vector and any other data, referred to as the payload, that we want to store.
from qdrant_client.models import PointStruct
def yield_rows(dataset):
    for idx, row in enumerate(dataset, start=1):
        vector = row["emb"]  # grab the vector
        payload = {k: v for k, v in row.items() if k != "emb"}  # everything except the vector becomes the payload
        yield PointStruct(id=idx, vector=vector, payload=payload)
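To see what `yield_rows` produces without downloading the dataset, here's a minimal sketch that runs the same split on a couple of toy rows, with a plain namedtuple standing in for `PointStruct`:

```python
from collections import namedtuple

# Stand-in for qdrant_client.models.PointStruct, just for illustration.
Point = namedtuple("Point", ["id", "vector", "payload"])

def yield_rows(dataset):
    for idx, row in enumerate(dataset, start=1):
        vector = row["emb"]  # grab the vector
        payload = {k: v for k, v in row.items() if k != "emb"}  # everything else
        yield Point(id=idx, vector=vector, payload=payload)

toy = [
    {"title": "A", "text": "first", "emb": [0.0, 1.0]},
    {"title": "B", "text": "second", "emb": [1.0, 0.0]},
]
points = list(yield_rows(toy))
print(points[0].payload)  # {'title': 'A', 'text': 'first'}
```

Note that the vector is split out of the row and everything else travels along as the payload, so we keep the article text and metadata next to each embedding.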
For this post we'll use a smallish subset of the dataset: the first 100_000 rows. Big enough to be interesting but small enough to play around with quickly.
sample = dataset.select(range(100_000))
We'll use the `toolz` library's `partition_all` function to get batches from our `yield_rows` generator, and `tqdm` to show a progress bar.
from toolz import partition_all
from tqdm.auto import tqdm
%%time
bs = 100
for batch in tqdm(partition_all(bs, yield_rows(sample)), total=len(sample) // bs):
    client.upsert(collection_name=collection_name, points=list(batch), wait=False)
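`partition_all` just chunks a (possibly lazy) iterator into fixed-size pieces, consuming the generator batch by batch so we never hold all 100,000 points in memory at once. A stdlib-only equivalent, as a sketch:

```python
from itertools import islice

def batches(iterable, size):
    # Yield successive lists of up to `size` items,
    # like toolz.partition_all (which yields tuples).
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The final batch is allowed to be smaller than `size`, which is also how `partition_all` behaves.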
On my 2021 MacBook Pro with an M1 chip this takes about 90 seconds to run. As mentioned above this can be heavily optimized but this gives an idea of how easy it is to add data to Qdrant from a Hugging Face dataset.
from rich import print
print(client.get_collection(collection_name))
We can see a bunch of information about our collection, including the vector count, the dimensionality of the vectors and the distance metric we're using. You'll see that there are plenty of knobs to turn here to optimize your collection, but that's for another post.
We can use the `scroll` method to get the first point from our collection.
print(client.scroll(collection_name, limit=1)[0][0])
We can also grab items from the payload for each point.
print(client.scroll(collection_name, limit=1)[0][0].payload["text"])
We can see this article is about the 24-hour clock system. Let's see what other pages are similar to this one. To do that we need the vector for our query point, which we can get by passing `with_vectors=True` to `scroll`.
We can use this vector as a query to find similar vectors in our collection. We'll use the `search` method to do this.
query_vector = client.scroll(collection_name, limit=1, with_vectors=True)[0][0].vector
hits = client.search(
collection_name=collection_name,
query_vector=query_vector,
    limit=15,  # Return the 15 closest points
)
Let's look at some of the results. We can see that the first result is the same article. The rest also seem to be about time/24 hour clock systems!
for hit in hits:
    print(f"{hit.payload['title']} | {hit.payload['text']}")
    print("---")
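The collection uses cosine distance, so the query vector matches itself with a score of 1.0, which is why the top hit is the article we started from. A quick pure-Python sanity check of the metric itself (the definition, not Qdrant's implementation):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v = [0.3, -0.1, 0.8]
print(round(cosine_similarity(v, v), 6))        # 1.0 (a vector vs itself)
print(round(cosine_similarity([1, 0], [0, 1]), 6))  # 0.0 (orthogonal vectors)
```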
Conclusion
This post showed how easy it is to convert a Hugging Face dataset into a Qdrant collection. We then showed how we can use this collection to find similar articles.
There is a lot of scope for optimization here. For example, we could use a more efficient way to add data to Qdrant, and a more efficient way to search our collection. It would be very cool to have a `from_hf_datasets` method directly in the Qdrant Python client that would do all of this for us and include some optimizations!
I hope this post has shown how easy it is to use Qdrant with Hugging Face datasets. If you have any questions or comments please let me know on Twitter.