%pip install datasets qdrant-client --q
[notice] A new release of pip available: 22.3.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Daniel van Strien
November 8, 2023
For this post we’ll use the Cohere/wikipedia-22-12-simple-embeddings dataset, which already has embeddings generated for it. The dataset was created by Cohere, who generated embeddings for millions of Wikipedia articles. See this post for more details.
We’ll use the Hugging Face datasets library to load the dataset.
from datasets import load_dataset
dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
/Users/davanstrien/Documents/daniel/blog/venv/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
table = cls._concat_blocks(blocks, axis=0)
Let’s take a quick look at the dataset.
Dataset({
features: ['id', 'title', 'text', 'url', 'wiki_id', 'views', 'paragraph_id', 'langs', 'emb'],
num_rows: 485859
})
We can see the dataset has an emb column which contains the embeddings for each article. Alongside this we see the title and text for the articles, along with some other metadata.
Let’s also take a quick look at the features of the dataset. Hugging Face Dataset objects have a features attribute which contains the features of the dataset. We can see that the emb column is a Sequence of float32 values. The other columns hold string, int32 and float32 values.
{'id': Value(dtype='int32', id=None),
'title': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None),
'url': Value(dtype='string', id=None),
'wiki_id': Value(dtype='int32', id=None),
'views': Value(dtype='float32', id=None),
'paragraph_id': Value(dtype='int32', id=None),
'langs': Value(dtype='int32', id=None),
'emb': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}
Qdrant has support for a pretty varied range of types. All of these types in our dataset are supported by Qdrant so we don’t need to do any conversion.
We’ll use the Qdrant Python client for this post. This client is really nice since it allows you to create a local collection using pure Python i.e. no need to run a Qdrant server. This is great for testing and development. Once you’re ready to deploy your collection you can use the same client to connect to a remote Qdrant server.
We first create a client, in this case using a local path for our DB.
Qdrant is very flexible but we need to let Qdrant know a few things about our collection. These include the name, and a config for the vectors we want to store. This config includes the dimensionality of the vectors and the distance metric we want to use. Let’s first check out the dimensionality of our vectors.
We’ll also store our collection name in a variable so we can use it later.
Note this code can be heavily optimized but gives an idea of how easy adding data to Qdrant can be. For many datasets this naive approach will work fine.
The approach we’ll take below is to loop through our dataset and yield each row as a PointStruct. This is a Qdrant object that contains the vector and any other data, referred to as the payload, that we want to store.
For this post we’ll use a smallish subset of the dataset. We’ll use the first 100_000 rows. Big enough to be interesting but small enough to play around with quickly.
We’ll use the toolz library’s partition_all function to get batches from our yield_rows function. We’ll use tqdm to show a progress bar.
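If you haven't used partition_all before, this standard-library sketch mirrors what it does as I understand it: it cuts any iterable into tuples of at most a given size, with a shorter final tuple if the items don't divide evenly. The chunks helper is my own illustration, not part of toolz.

```python
from itertools import islice


def chunks(size, iterable):
    """Yield successive tuples of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    # islice pulls up to `size` items; an empty tuple means we are done.
    while batch := tuple(islice(iterator, size)):
        yield batch


print(list(chunks(2, range(5))))  # [(0, 1), (2, 3), (4,)]
```

Because the last batch can be shorter, math.ceil(n_items / size) is the exact number of batches when the row count is not a multiple of the batch size.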
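%%time
from toolz import partition_all
from tqdm.auto import tqdm

bs = 100  # batch size
# Upsert the points in batches of `bs`, showing a progress bar.
for batch in tqdm(partition_all(bs, yield_rows(sample)), total=len(sample) // bs):
    client.upsert(collection_name=collection_name, points=list(batch), wait=False)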
CPU times: user 30.9 s, sys: 35.7 s, total: 1min 6s
Wall time: 1min 19s
On my 2021 MacBook Pro with an M1 chip this takes about 80 seconds to run (1min 19s wall time in the run above). As mentioned above, this can be heavily optimized, but it gives an idea of how easy it is to add data to Qdrant from a Hugging Face dataset.
What can we do with our Qdrant collection? We can use our embeddings to find similar Wikipedia articles. Let’s see how we can do that.
First we’ll use the get_collection method to see some information about our collection.
CollectionInfo(
    status=<CollectionStatus.GREEN: 'green'>,
    optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>,
    vectors_count=100000,
    indexed_vectors_count=0,
    points_count=100000,
    segments_count=1,
    config=CollectionConfig(
        params=CollectionParams(
            vectors=VectorParams(
                size=768,
                distance=<Distance.COSINE: 'Cosine'>,
                hnsw_config=None,
                quantization_config=None,
                on_disk=None
            ),
            shard_number=None,
            replication_factor=None,
            write_consistency_factor=None,
            read_fan_out_factor=None,
            on_disk_payload=None
        ),
        hnsw_config=HnswConfig(
            m=16,
            ef_construct=100,
            full_scan_threshold=10000,
            max_indexing_threads=0,
            on_disk=None,
            payload_m=None
        ),
        optimizer_config=OptimizersConfig(
            deleted_threshold=0.2,
            vacuum_min_vector_number=1000,
            default_segment_number=0,
            max_segment_size=None,
            memmap_threshold=None,
            indexing_threshold=20000,
            flush_interval_sec=5,
            max_optimization_threads=1
        ),
        wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0),
        quantization_config=None
    ),
    payload_schema={}
)
We can see a bunch of information about our collection, including the vector count, the dimensionality of the vectors and the distance metric we’re using. You’ll see that there are plenty of knobs to turn here to optimize your collection but that’s for another post.
We can use the scroll method to get the first point from our collection. By default the vector itself isn’t returned, only the id and payload.
Record( id=1, payload={ 'id': 0, 'title': '24-hour clock', 'text': 'The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.', 'url': 'https://simple.wikipedia.org/wiki?curid=9985', 'wiki_id': 9985, 'views': 2450.62548828125, 'paragraph_id': 0, 'langs': 30 }, vector=None )
We can also grab items from the payload for each point.
The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.
We can see this article is about the 24-hour clock system. Let’s see what other pages are similar to this one. We can optionally get the vector for the query point.
We can use our vector as a query to find similar vectors in our collection. We’ll use the search method to do this.
Let’s look at some of the results. We can see that the first result is the same article. The rest also seem to be about time/24 hour clock systems!
24-hour clock | The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.
---
24-hour clock | However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say "23:59", which is one minute before midnight.
---
24-hour clock | 24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.
---
24-hour clock | In railway timetables 24:00 means the "end" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00.
---
24-hour clock | A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say "Tuesday at 24:00" and "Wednesday at 00:00" to mean exactly the same time.
---
12-hour clock | The 12-hour clock is a way of dividing the 24 hours of the day into two sections. The two halves are called ante meridiem (a.m.) and post meridiem (p.m.).
---
12-hour clock | Both names are from Latin, and numbered from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Time from midnight to noon is a.m. and from noon to midnight p.m. The table at right shows how it relates to the 24-hour clock.
---
Hour | An hour (abbreviation: h or hr) is a unit of measurement used to measure time. An hour is equal to 60 minutes. 24 hours are equal to one day. Unlike the second, the hour is not a base SI unit.
---
Midnight | The time period "00:00 - 00:01" is midnight. On computer clocks, the day changes to the next day the minute(s) after midnight.
---
Chinese zodiac | In the old days, China and Japan used a 12-hour system to tell the time of day and night (unlike the 24 hour system used today). The 12 hour system divides the day of 24 hours into 12 hours, each of which has a sign of the zodiac:
---
Coordinated Universal Time | Note that UTC uses the 24-hour clock. That means there is no 'AM' or 'PM'. For example, 4:00PM would be 16:00 or 1600. UTC also does not use daylight savings time - that way the time stays consistent the entire year.
---
Midnight | In the world, midnight is the start of one day and the end of the last day. It's the dividing point between two days.
---
Noon | Noon is the time exactly halfway through the day (12.00-12:00 in the 24-hour clock and 12:00 PM-12:00 PM in the 12-hour clock). Midday also means noon, although this also means "around" noon, or very early afternoon.
---
Coordinated Universal Time | The standard before was Greenwich Mean Time (GMT). UTC and GMT are almost the same. In fact, there is no practical difference which would be noticed by ordinary people.
---
Midnight | In the United States and Canada, digital clocks and computers usually show 12 a.m. right at midnight. However, people have to remember that any time is actually an instant. The "a.m." shown on clock displays means the 12-hour period after the instant of midnight. So when a clock says "12:00 a.m.", midnight has already passed and a new day has started. In other words, 11:59 p.m. shows until midnight; at the instant of midnight, it changes to 12:00. At the same time, the p.m. changes to a.m., but a.m. does not mean the instant of midnight which separates p.m. and a.m.
---
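The query article comes back first in the results above because, with the cosine metric, a vector's similarity with itself is 1, the highest possible score. A plain-Python sketch of that calculation, using made-up 3-dimensional vectors (Qdrant's own implementation is far more efficient than this):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.2, 0.7, 0.1]
other = [0.1, 0.1, 0.9]

print(cosine_similarity(query, query))  # similarity with itself
print(cosine_similarity(query, other))  # lower for a different direction
```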
This post showed how it’s possible to easily convert a Hugging Face dataset into a Qdrant collection. We then showed how we can use this collection to find similar articles.
There is a lot of scope for optimization here. For example, we could use a more efficient way to add data to Qdrant. We could also use a more efficient way to search our collection. It would be very cool to directly have a from_hf_datasets method in the Qdrant Python client that would do all of this for us and include some optimizations!
I hope this post has shown how easy it is to use Qdrant with Hugging Face datasets. If you have any questions or comments please let me know on Twitter.