%pip install datasets qdrant-client toolz rich --quiet

Loading our dataset

For this post we'll use the Cohere/wikipedia-22-12-simple-embeddings dataset, which already has embeddings generated for it. The dataset was created by Cohere and contains embeddings for millions of Wikipedia articles. See this post for more details.

We'll use the Hugging Face datasets library to load the dataset.

from datasets import load_dataset

dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

Let's take a quick look at the dataset.

dataset
Dataset({
    features: ['id', 'title', 'text', 'url', 'wiki_id', 'views', 'paragraph_id', 'langs', 'emb'],
    num_rows: 485859
})

We can see the dataset has an emb column which contains the embeddings for each article. Alongside this we have the title and text of each article, plus some other metadata.

Let's also take a quick look at the features of the dataset. Hugging Face Dataset objects have a features attribute which describes the dataset's columns. We can see that the emb column is a Sequence of float32 values, while the other columns hold string, int32, and float32 values.

dataset.features
{'id': Value(dtype='int32', id=None),
 'title': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'url': Value(dtype='string', id=None),
 'wiki_id': Value(dtype='int32', id=None),
 'views': Value(dtype='float32', id=None),
 'paragraph_id': Value(dtype='int32', id=None),
 'langs': Value(dtype='int32', id=None),
 'emb': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}

Qdrant has support for a pretty varied range of types. All of the types in our dataset are supported by Qdrant, so we don't need to do any conversion.

Creating a Qdrant collection

We'll use the Qdrant Python client for this post. This client is really nice since it lets you create a local collection in pure Python, i.e. no need to run a Qdrant server. This is great for testing and development. Once you're ready to deploy your collection, you can use the same client to connect to a remote Qdrant server.

from qdrant_client import QdrantClient

We first create a client, in this case using a local path for our DB.

client = QdrantClient(path="db")  # Persists changes to disk
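
Once you're ready to move beyond local development, the same client class can connect to a running Qdrant server instead. A minimal sketch, assuming a server is reachable on the default port (e.g. via the official Docker image):

# Connect to a running Qdrant server rather than an embedded local DB
# (hypothetical local URL; swap in your own host or Qdrant Cloud credentials)
remote_client = QdrantClient(url="http://localhost:6333")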

Configuring our Qdrant collection

Qdrant is very flexible, but we need to let Qdrant know a few things about our collection. These include the name and a config for the vectors we want to store. This config includes the dimensionality of the vectors and the distance metric we want to use. Let's first check the dimensionality of our vectors.

vector_size = len(dataset[0]['emb'])
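
For these Cohere embeddings that's 768 dimensions, which we'll see confirmed in the collection config later on.

vector_size
768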

We'll also store our collection name in a variable so we can use it later.

collection_name = "cohere_wikipedia"

from qdrant_client.models import Distance, VectorParams

client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
)
True

Adding our data to Qdrant

Note this code can be heavily optimized but gives an idea of how easy adding data to Qdrant can be. For many datasets this naive approach will work fine.

The approach we'll take below is to loop through our dataset and yield each row as a PointStruct. This is a Qdrant object that contains the vector and any other data, referred to as the payload, that we want to store.

from qdrant_client.models import PointStruct

def yield_rows(dataset):
    for idx, row in enumerate(dataset, start=1):
        vector = row["emb"]  # grab the vector
        payload = {k: v for k, v in row.items() if k != "emb"}  # everything except the vector becomes the payload
        yield PointStruct(id=idx, vector=vector, payload=payload)
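
We can sanity-check the generator by grabbing the first point; its payload should contain the first article's fields (a quick check, reusing the dataset we loaded above):

point = next(yield_rows(dataset))
point.id, point.payload["title"]
(1, '24-hour clock')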

For this post we'll use a smallish subset of the dataset. We'll use the first 100_000 rows. Big enough to be interesting but small enough to play around with quickly.

sample = dataset.select(range(100_000))

We'll use the toolz library's partition_all function to get batches from our yield_rows generator, and tqdm to show a progress bar.

from toolz import partition_all
from tqdm.auto import tqdm

%%time
bs = 100
for batch in tqdm(partition_all(bs, yield_rows(sample)), total=len(sample) // bs):
    client.upsert(collection_name=collection_name, points=list(batch), wait=False)
CPU times: user 30.9 s, sys: 35.7 s, total: 1min 6s
Wall time: 1min 19s

On my 2021 MacBook Pro with an M1 chip this takes around 80 seconds. As mentioned above, this can be heavily optimized, but it gives an idea of how easy it is to add data to Qdrant from a Hugging Face dataset.
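
One easy optimization is to let the client handle batching and parallelism for us with its built-in upload_collection helper. A rough sketch, assuming the columns fit comfortably in memory (and a recent datasets release for to_list):

# Let qdrant-client batch and parallelise the upload
client.upload_collection(
    collection_name=collection_name,
    vectors=sample["emb"],  # the embedding column as a list of vectors
    payload=sample.remove_columns("emb").to_list(),  # remaining columns become payloads
    ids=list(range(1, len(sample) + 1)),  # match the ids used above
    batch_size=256,
    parallel=2,
)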

Searching our Qdrant collection

What can we do with our Qdrant collection? We can use our embeddings to find similar Wikipedia articles. Let's see how we can do that.

First we'll use the get_collection method to see some information about our collection.

from rich import print

print(client.get_collection(collection_name))
CollectionInfo(
    status=<CollectionStatus.GREEN: 'green'>,
    optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>,
    vectors_count=100000,
    indexed_vectors_count=0,
    points_count=100000,
    segments_count=1,
    config=CollectionConfig(
        params=CollectionParams(
            vectors=VectorParams(
                size=768,
                distance=<Distance.COSINE: 'Cosine'>,
                hnsw_config=None,
                quantization_config=None,
                on_disk=None
            ),
            shard_number=None,
            replication_factor=None,
            write_consistency_factor=None,
            read_fan_out_factor=None,
            on_disk_payload=None
        ),
        hnsw_config=HnswConfig(
            m=16,
            ef_construct=100,
            full_scan_threshold=10000,
            max_indexing_threads=0,
            on_disk=None,
            payload_m=None
        ),
        optimizer_config=OptimizersConfig(
            deleted_threshold=0.2,
            vacuum_min_vector_number=1000,
            default_segment_number=0,
            max_segment_size=None,
            memmap_threshold=None,
            indexing_threshold=20000,
            flush_interval_sec=5,
            max_optimization_threads=1
        ),
        wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0),
        quantization_config=None
    ),
    payload_schema={}
)

We can see a bunch of information about our collection, including the vector count, the dimensionality of the vectors, and the distance metric we're using. You'll see there are plenty of knobs to turn here to optimize your collection, but that's for another post.

We can use the scroll method to get the first record from our collection.

print(client.scroll(collection_name, limit=1)[0][0])
Record(
    id=1,
    payload={
        'id': 0,
        'title': '24-hour clock',
        'text': 'The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and
is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only
in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very 
rarely) as continental time. In some parts of the world, it is called railway time. Also, the international 
standard notation of time (ISO 8601) is based on this format.',
        'url': 'https://simple.wikipedia.org/wiki?curid=9985',
        'wiki_id': 9985,
        'views': 2450.62548828125,
        'paragraph_id': 0,
        'langs': 30
    },
    vector=None
)

We can also grab items from the payload for each point.

print(client.scroll(collection_name, limit=1)[0][0].payload['text'])
The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 
24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and 
the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as 
continental time. In some parts of the world, it is called railway time. Also, the international standard notation 
of time (ISO 8601) is based on this format.

We can see this article is about the 24-hour clock system. Let's see what other pages are similar to this one. First we grab the vector for this point by passing with_vectors=True to scroll.

query_vector = client.scroll(collection_name, limit=1, with_vectors=True)[0][0].vector

We can use this vector as a query to find similar vectors in our collection. We'll use the search method to do this.

hits = client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=15,  # return the 15 closest points
)

Let's look at some of the results. We can see that the first result is the same article. The rest also seem to be about time and 24-hour clock systems!

for hit in hits:
    print(f"{hit.payload['title']} | {hit.payload['text']}")
    print("---")
24-hour clock | The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and 
is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only
in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very 
rarely) as continental time. In some parts of the world, it is called railway time. Also, the international 
standard notation of time (ISO 8601) is based on this format.
---
24-hour clock | However, the US military prefers not to say 24:00 - they do not like to have two names for the same
thing, so they always say "23:59", which is one minute before midnight.
---
24-hour clock | 24-hour clock time is used in computers, military, public safety, and transport. In many Asian, 
European and Latin American countries people use it to write the time. Many European people use it in speaking.
---
24-hour clock | In railway timetables 24:00 means the "end" of the day. For example, a train due to arrive at a 
station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the 
day go at 00:00.
---
24-hour clock | A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or 
hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under 
the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and 
ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 
24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you 
would say "Tuesday at 24:00" and "Wednesday at 00:00" to mean exactly the same time.
---
12-hour clock | The 12-hour clock is a way of dividing the 24 hours of the day into two sections. The two halves 
are called ante meridiem (a.m.) and post meridiem (p.m.).
---
12-hour clock | Both names are from Latin, and numbered from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Time from 
midnight to noon is a.m. and from noon to midnight p.m. The table at right shows how it relates to the 24-hour 
clock.
---
Hour | An hour (abbreviation: h or hr) is a unit of measurement used to measure time. An hour is equal to 60 
minutes. 24 hours are equal to one day. Unlike the second, the hour is not a base SI unit.
---
Midnight | The time period "00:00 - 00:01" is midnight. On computer clocks, the day changes to the next day the 
minute(s) after midnight.
---
Chinese zodiac | In the old days, China and Japan used a 12-hour system to tell the time of day and night (unlike 
the 24 hour system used today). The 12 hour system divides the day of 24 hours into 12 hours, each of which has a 
sign of the zodiac:
---
Coordinated Universal Time | Note that UTC uses the 24-hour clock. That means there is no 'AM' or 'PM'. For 
example, 4:00PM would be 16:00 or 1600. UTC also does not use daylight savings time - that way the time stays 
consistent the entire year.
---
Midnight | In the world, midnight is the start of one day and the end of the last day. It's the dividing point 
between two days.
---
Noon | Noon is the time exactly halfway through the day (12.00-12:00 in the 24-hour clock and 12:00 PM-12:00 PM in 
the 12-hour clock). Midday also means noon, although this also means "around" noon, or very early afternoon.
---
Coordinated Universal Time | The standard before was Greenwich Mean Time (GMT). UTC and GMT are almost the same. In
fact, there is no practical difference which would be noticed by ordinary people.
---
Midnight | In the United States and Canada, digital clocks and computers usually show 12 a.m. right at midnight. 
However, people have to remember that any time is actually an instant. The "a.m." shown on clock displays means the
12-hour period after the instant of midnight. So when a clock says "12:00 a.m.", midnight has already passed and a 
new day has started. In other words, 11:59 p.m. shows until midnight; at the instant of midnight, it changes to 
12:00. At the same time, the p.m. changes to a.m., but a.m. does not mean the instant of midnight which separates 
p.m. and a.m.
---
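
Each hit also carries a similarity score. If we want to see how close each match is, a small tweak to the loop above prints it alongside the title (hit.score comes back with every search result):

for hit in hits:
    print(f"{hit.score:.3f} | {hit.payload['title']}")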

Conclusion

This post showed how easy it is to convert a Hugging Face dataset into a Qdrant collection. We then used this collection to find similar articles.

There is a lot of scope for optimization here, for example in how we add data to Qdrant and in how we search our collection. It would be very cool to have a from_hf_datasets method directly in the Qdrant Python client that did all of this for us and included some optimizations!
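
As a sketch, such a helper might look something like this. This is hypothetical, just wrapping the steps from this post; dataset_to_qdrant is made up and not part of any library:

from datasets import Dataset
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

def dataset_to_qdrant(
    dataset: Dataset,
    client: QdrantClient,
    collection_name: str,
    vector_column: str = "emb",
) -> None:
    """Create a collection sized to the dataset's vectors and upload everything."""
    vector_size = len(dataset[0][vector_column])
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
    )
    client.upload_collection(
        collection_name=collection_name,
        vectors=dataset[vector_column],
        payload=dataset.remove_columns(vector_column).to_list(),
        ids=list(range(1, len(dataset) + 1)),
    )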

I hope this post has shown how easy it is to use Qdrant with Hugging Face datasets. If you have any questions or comments please let me know on Twitter.