Label Studio x Hugging Face datasets hub
Full stack deep learning: annotating data
I’m currently working through the Full Stack Deep Learning course. As part of this, we’ve been looking at tools for different parts of the machine learning pipeline. This post talks about data annotation and how we can combine Label Studio and the Hugging Face Datasets hub. I’ll use the example of annotating image data for an image classification task. The details of why I’m annotating this data will wait for a future post!
Note: this post assumes you already know roughly what the Hugging Face Hub is. If you don’t, it’s worth starting with an introduction to the Hub.
What is the goal?
We want an easy way of moving between the different stages of our machine learning project pipeline. For many projects, especially the weird stuff I’m likely to do, you will need to do some of your own annotating. It almost always makes sense to move quickly between annotating a first batch of data, trying to train a model, and iterating. This can help:
- flag issues with your data
- identify if you have ambiguous labels
- help you get some sense of how a model might perform on the task you are working on
- allow you to deploy a model early so you can begin iterating on the whole pipeline
- …
This approach does come with some challenges: how do you keep updating your annotations, and how do you version the changes?
A more mundane challenge
In the Full Stack Deep Learning course, one of the labs covered using Label Studio to annotate data. Label Studio is a great open source tool for annotating data across a range of domains and for a variety of tasks.
Label Studio has great support for annotating image data. One challenge we can face, however, is how to load images into Label Studio. This can be particularly tricky if you only have the images locally, since Label Studio prefers images to be available via a URL. There are various ways around this, but we may also be able to tackle this challenge using the datasets hub.
We’ll start by downloading the dataset we want to annotate. Warning: this dataset is pretty big (~44GB uncompressed).
%%bash
wget https://nlsfoundry.s3.amazonaws.com/data/nls-data-encyclopaediaBritannica.zip
unzip *.zip
We’ll import some standard libraries
import pandas as pd
from pathlib import Path
Create a new dataset on the Hub
Since we want to upload our data to the Hugging Face Hub, we’ll create a new dataset repository there via the CLI.
huggingface-cli repo create encyclopaedia_britannica --type dataset
Under the hood, Hugging Face hub datasets (and models) are Git repositories. We’ll clone this repo and move the downloaded dataset into this new Git repository.
git clone https://huggingface.co/datasets/davanstrien/encyclopaedia_britannica
mv nls-data-encyclopaediaBritannica encyclopaedia_britannica/
Since the number of examples in this dataset is well beyond what we’re likely to annotate, we’ll delete part of it. You could also take a sample of the original, but in this case I’m happy to reclaim some space on my hard drive!
import shutil
from tqdm.auto import tqdm
First, we get rid of some alto folders that we don’t need for the dataset we’re aiming to create.
for directory in tqdm(
    list(
        Path("encyclopaedia_britannica/nls-data-encyclopaediaBritannica").rglob(
            "*alto"
        )
    )
):
    shutil.rmtree(directory)
100%|██████████| 195/195 [00:34<00:00, 5.62it/s]
There are a few other *.xml files in this dataset which we also remove.
for xml_file in tqdm(
    list(
        Path("encyclopaedia_britannica/nls-data-encyclopaediaBritannica").rglob(
            "*xml"
        )
    )
):
    xml_file.unlink()
100%|██████████| 195/195 [00:00<00:00, 1464.47it/s]
Let’s take a look at how many images we have now
image_files = list(
    Path("encyclopaedia_britannica/nls-data-encyclopaediaBritannica").rglob("*jpg")
)
len(image_files)
155388
We’re not likely to annotate this many images, so let’s aim for a maximum of 10,000. This is also likely to be more than we’ll annotate, but we may use this smaller sample for unsupervised pre-training.
num_to_remove = len(image_files) - 10_000
We’ll now randomly remove the extra images beyond our sample.
import random

to_remove = random.sample(image_files, num_to_remove)
for file in tqdm(to_remove):
    file.unlink()
100%|██████████| 90000/90000 [00:33<00:00, 2659.02it/s]
len(
    list(
        Path("encyclopaedia_britannica/nls-data-encyclopaediaBritannica").rglob(
            "*jpg"
        )
    )
)
10000
Uploading our raw data to the hub
We can now upload this data to the Hugging Face Hub. Under the hood, the Hub uses Git, so everything you love (and hate) about Git should be familiar. The main difference between the Hub and GitHub or another Git hosting platform is that the Hugging Face Hub has built-in support for large files via Git LFS. This means we can more easily work with large files (like our images).
cd encyclopaedia_britannica
git lfs track "*.jpg"
git add .gitattributes
git add nls-data-encyclopaediaBritannica
git commit -m "add image files"
git push
Loading local files and metadata
The particular dataset we’re working with also has a metadata file associated with it. We can grab all of the images, put them in a DataFrame, and merge this with the metadata about these images. We may not end up using this extra metadata, but it’s nice to have it alongside our annotations; it can help us debug where our model is performing badly later on.
image_files = list(
    Path("encyclopaedia_britannica/nls-data-encyclopaediaBritannica").rglob("*.jpg")
)
df = pd.DataFrame(image_files, columns=["filename"])
This dataset also comes with some metadata. We’ll load that into another DataFrame.
metadata_df = pd.read_csv(
    "encyclopaedia_britannica/nls-data-encyclopaediaBritannica/encyclopaediaBritannica-inventory.csv",
    header=None,
    names=["id", "meta"],
    dtype={"id": "int64"},
)
df["id"] = df.filename.apply(lambda x: x.parts[-3]).astype("int64")
df = df.merge(metadata_df, on="id")
df
 | filename | id | meta |
---|---|---|---
0 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291 | Encyclopaedia Britannica - Third edition, Volu... |
1 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291 | Encyclopaedia Britannica - Third edition, Volu... |
2 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291 | Encyclopaedia Britannica - Third edition, Volu... |
3 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291 | Encyclopaedia Britannica - Third edition, Volu... |
4 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291 | Encyclopaedia Britannica - Third edition, Volu... |
... | ... | ... | ... |
9995 | encyclopaedia_britannica/nls-data-encyclopaedi... | 193696083 | Encyclopaedia Britannica - Seventh edition, Vo... |
9996 | encyclopaedia_britannica/nls-data-encyclopaedi... | 193696083 | Encyclopaedia Britannica - Seventh edition, Vo... |
9997 | encyclopaedia_britannica/nls-data-encyclopaedi... | 193696083 | Encyclopaedia Britannica - Seventh edition, Vo... |
9998 | encyclopaedia_britannica/nls-data-encyclopaedi... | 193696083 | Encyclopaedia Britannica - Seventh edition, Vo... |
9999 | encyclopaedia_britannica/nls-data-encyclopaedi... | 193696083 | Encyclopaedia Britannica - Seventh edition, Vo... |
10000 rows × 3 columns
Annotating using Label Studio
Now we have our images uploaded to the Hugging Face Hub, how do we go about annotating? As mentioned already, the Hugging Face Hub is essentially a Git repo. Since we uploaded our image files individually, i.e. not in a compressed folder, we can access each file from that repo. We mentioned before that Label Studio can load images from URLs. The Hub has an API that we can use to interact with our repository. Let’s see how we can use this to get our data ready for Label Studio.
from huggingface_hub import list_repo_files, hf_hub_url
files = list_repo_files("davanstrien/encyclopaedia_britannica", repo_type="dataset")
files[:2]
['.gitattributes',
'nls-data-encyclopaediaBritannica/144133901/image/188082865.3.jpg']
We’ll filter out some data we are not interested in annotating
files = [file for file in files if not file.startswith(".")]
len(files)
10002
hf_hub_url can be used to generate the URL for a particular file.
hf_hub_url("davanstrien/encyclopaedia_britannica",
"192866824.3.jpg",
="sample/nls-data-encyclopaediaBritannica/192547788/image",
subfolder="dataset",
repo_type )
'https://huggingface.co/datasets/davanstrien/encyclopaedia_britannica/resolve/main/sample/nls-data-encyclopaediaBritannica/192547788/image/192866824.3.jpg'
We can use this to grab all of the URLs we’re interested in
urls = []
for file in files:
    file = Path(file)
    urls.append(
        hf_hub_url(
            "davanstrien/encyclopaedia_britannica",
            file.name,
            subfolder=file.parents[0],
            repo_type="dataset",
        )
    )
We can now load these into a DataFrame, and save this to a CSV file.
=["image"]).to_csv("data.csv", index=False) pd.DataFrame(urls, columns
"data.csv") pd.read_csv(
 | image |
---|---
0 | https://huggingface.co/datasets/davanstrien/en... |
1 | https://huggingface.co/datasets/davanstrien/en... |
2 | https://huggingface.co/datasets/davanstrien/en... |
3 | https://huggingface.co/datasets/davanstrien/en... |
4 | https://huggingface.co/datasets/davanstrien/en... |
... | ... |
9997 | https://huggingface.co/datasets/davanstrien/en... |
9998 | https://huggingface.co/datasets/davanstrien/en... |
9999 | https://huggingface.co/datasets/davanstrien/en... |
10000 | https://huggingface.co/datasets/davanstrien/en... |
10001 | https://huggingface.co/datasets/davanstrien/en... |
10002 rows × 1 columns
Loading data into Label Studio
We can use this file to load our data into Label Studio.
From here, we need to define our annotation task. We can then begin annotating data.
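As a rough sketch, a Label Studio labelling config for a simple single-choice image classification task over these images might look something like the snippet below. Treat it as an assumption rather than the project’s exact setup: the $image variable points at the image column in our data.csv, and the choice names match the labels we use later.
# A minimal sketch (not necessarily the exact config used): a Label Studio
# labelling config for single-choice image classification. "$image" refers to
# the "image" column in data.csv; the choice values match the labels used later.
label_config = """
<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image" choice="single">
    <Choice value="text-only"/>
    <Choice value="illustrated"/>
  </Choices>
</View>
"""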
Export annotations
You could wait until you’ve finished all of the labelling; however, we may have a lot of data to annotate, so it’s more likely that we’ll want to export once we’ve either hit a reasonable number of labels or become too bored of annotating. There are various export formats available; in this case we’ll use JSON-MIN.
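To give a rough idea of what that export contains (this is an illustrative structure rather than a verbatim export), each JSON-MIN record holds the task data, here our image URL, plus one field per labelling control, so our choice appears as a top-level key alongside some Label Studio bookkeeping fields.
# Illustrative JSON-MIN record (values are made up); the real export also
# includes bookkeeping fields such as ids and timestamps.
example_export = [
    {
        "image": "https://huggingface.co/datasets/davanstrien/encyclopaedia_britannica/resolve/main/<path-to-image>.jpg",
        "choice": "text-only",
    },
]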
Load annotations
Now we have exported our annotations, let’s load them into a new DataFrame. We’ll only select the columns we’re interested in.
annotation_dataframe = pd.read_json("project-3-at-2022-09-08-15-16-4279e901.json")[
    ["image", "choice"]
]
annotation_dataframe
 | image | choice |
---|---|---
0 | https://huggingface.co/datasets/davanstrien/en... | text-only |
1 | https://huggingface.co/datasets/davanstrien/en... | text-only |
2 | https://huggingface.co/datasets/davanstrien/en... | text-only |
3 | https://huggingface.co/datasets/davanstrien/en... | text-only |
4 | https://huggingface.co/datasets/davanstrien/en... | text-only |
... | ... | ... |
1516 | https://huggingface.co/datasets/davanstrien/en... | text-only |
1517 | https://huggingface.co/datasets/davanstrien/en... | text-only |
1518 | https://huggingface.co/datasets/davanstrien/en... | text-only |
1519 | https://huggingface.co/datasets/davanstrien/en... | text-only |
1520 | https://huggingface.co/datasets/davanstrien/en... | text-only |
1521 rows × 2 columns
If we take a look at the URL for one of the annotations, we can see that we still have a nice path that mirrors the folder structure of the original data. This also means we can merge this annotations DataFrame with our previous metadata DataFrame.
0, "image"] annotation_dataframe.loc[
'https://huggingface.co/datasets/davanstrien/encyclopaedia_britannica/resolve/main/nls-data-encyclopaediaBritannica/192693396/image/192979378.3.jpg'
0, "image"].split("/")[-4:] annotation_dataframe.loc[
['nls-data-encyclopaediaBritannica', '192693396', 'image', '192979378.3.jpg']
"filename"] = annotation_dataframe["image"].apply(
annotation_dataframe[lambda x: "/".join(x.split("/")[-4:])
)
"filename"] = annotation_dataframe["filename"].astype(str) annotation_dataframe[
df = df.merge(annotation_dataframe, on="filename", how="outer")
df
 | filename | id | meta | image | choice |
---|---|---|---|---|---
0 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291.0 | Encyclopaedia Britannica - Third edition, Volu... | NaN | NaN |
1 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291.0 | Encyclopaedia Britannica - Third edition, Volu... | NaN | NaN |
2 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291.0 | Encyclopaedia Britannica - Third edition, Volu... | NaN | NaN |
3 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291.0 | Encyclopaedia Britannica - Third edition, Volu... | NaN | NaN |
4 | encyclopaedia_britannica/nls-data-encyclopaedi... | 190273291.0 | Encyclopaedia Britannica - Third edition, Volu... | NaN | NaN |
... | ... | ... | ... | ... | ... |
11516 | nls-data-encyclopaediaBritannica/144133901/ima... | NaN | NaN | https://huggingface.co/datasets/davanstrien/en... | text-only |
11517 | nls-data-encyclopaediaBritannica/144133901/ima... | NaN | NaN | https://huggingface.co/datasets/davanstrien/en... | text-only |
11518 | nls-data-encyclopaediaBritannica/144133901/ima... | NaN | NaN | https://huggingface.co/datasets/davanstrien/en... | text-only |
11519 | nls-data-encyclopaediaBritannica/144133901/ima... | NaN | NaN | https://huggingface.co/datasets/davanstrien/en... | text-only |
11520 | nls-data-encyclopaediaBritannica/144133901/ima... | NaN | NaN | https://huggingface.co/datasets/davanstrien/en... | text-only |
11521 rows × 5 columns
This means we can keep our nice original metadata intact while also adding our annotations where they exist. Let’s check how many annotations we have.
df.choice.value_counts()
text-only 1436
illustrated 70
Name: choice, dtype: int64
We can also see how much of our dataset we have coverage for
len(df[df.choice.notna()]) / len(df)
0.13071781963371235
How to use our annotations?
We now have some annotations inside a DataFrame. What should we do with these? We can also use the Hub for storing them. This comes with a few benefits:
- we keep our data and annotations in the same place
- since the Hub uses Git under the hood, we also get versioning for our dataset. We can use this version information to track, for example, how different models perform during training as we add more labels (see the sketch below).
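As a minimal sketch of what that could look like once we have a way of loading the dataset (we get to the loading script below): load_dataset accepts a revision argument pointing at a specific commit or tag on the Hub, so a training run can record exactly which version of the annotations it used. The revision value here is a placeholder, and the repo id is assumed to be the dataset repository we created earlier.
from datasets import load_dataset

# Sketch only: pin the dataset to a specific Hub revision (a commit hash or tag);
# "main" is just a placeholder value.
ds = load_dataset("davanstrien/encyclopaedia_britannica", revision="main")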
Another nice thing about the Hub is that we can create dataset loading scripts to load our data. This script can use the CSV we’ve just created and only load the data we have annotations for.
First we’ll save to a CSV file:
"annotations.csv", index=None) df.to_csv(
We can then copy this file into the same repository used to host our dataset.
cp annotations.csv encyclopaedia_britannica/
Once we’ve done this we can commit these and push our annotations to the hub:
cd encyclopaedia_britannica/
git add annotations.csv
git commit -m "update annotations"
git push
What next?
We now have a repository which contains a bunch of images, and a CSV file which contains annotations for some of these images. How do we use this for model training? From this point we can create a dataset loading script inside the same repository.
This dataset loading script will allow us to load the data from the Hub using the datasets library. Additionally, we can write this script so that it only loads data we have annotations for.
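To make this concrete, here is a rough sketch of what such a loading script could look like. This is an assumption about the implementation rather than the actual script in the repository: the class name and URL are illustrative, while the feature names and labels (metadata, image, label with text-only/illustrated) are taken from the output we see later, and annotations.csv is the file we just pushed.
# Hypothetical sketch of a dataset loading script (not the actual script in the
# repo). It reads annotations.csv from the Hub and only yields rows that have
# an annotation.
import datasets
import pandas as pd

_ANNOTATIONS_URL = (
    "https://huggingface.co/datasets/davanstrien/encyclopaedia_britannica"
    "/resolve/main/annotations.csv"
)


class EncyclopaediaBritannica(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.1.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "metadata": datasets.Value("string"),
                    "image": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["text-only", "illustrated"]),
                }
            )
        )

    def _split_generators(self, dl_manager):
        annotations_path = dl_manager.download(_ANNOTATIONS_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"annotations_path": annotations_path},
            )
        ]

    def _generate_examples(self, annotations_path):
        df = pd.read_csv(annotations_path)
        # only keep rows that actually have an annotation
        df = df[df.choice.notna()]
        for idx, row in enumerate(df.itertuples()):
            yield idx, {
                "metadata": str(row.meta),
                "image": row.image,
                "label": row.choice,
            }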
What does this mean?
- we have a dataset we can use to train our model
- the dataset is hosted on the Hugging Face Hub, which means it’s easy to share with other people
- we can keep adding new annotations to this dataset and pushing our changes to the Hub
- since the datasets library has nice caching support, it will only download the dataset if there are changes. This will be triggered by changes to our annotations.csv file.
Loading the dataset
Once we have our loading script, we can load our annotations using the datasets library:
from datasets import load_dataset
import datasets

ds = load_dataset('davanstrien/encyclopedia_britannica')
Using custom data configuration default
Reusing dataset encyclopedia_britannica (/Users/dvanstrien/.cache/huggingface/datasets/davanstrien___encyclopedia_britannica/default/1.1.0/8dd4d7982f31fd11ed71020b79b4b11a0068c8243080066e43b9fe3980934467)
ds['train'][0]
{'metadata': 'nan',
'image': 'https://huggingface.co/datasets/davanstrien/encyclopaedia_britannica/resolve/main/nls-data-encyclopaediaBritannica/192693396/image/192979378.3.jpg',
'label': 0}
# The loaded dataset stores image URLs as strings, so we fetch the actual images
# before casting the column to an Image feature.
from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent
from functools import partial
from concurrent.futures import ThreadPoolExecutor
import urllib.request
import io
import PIL.Image

USER_AGENT = get_datasets_user_agent()
def fetch_single_image(image_url, timeout=None, retries=0):
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image
def fetch_images(batch, num_threads, timeout=1, retries=0):
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image"]))
    return batch
num_threads = 16
ds = ds.map(
    fetch_images,
    batched=True,
    batch_size=64,
    fn_kwargs={"num_threads": num_threads},
    writer_batch_size=64,
)
Loading cached processed dataset at /Users/dvanstrien/.cache/huggingface/datasets/encyclopaedia_britannica/default/1.1.0/f7fb8d1f26daa72fbaf883bb1707e13d304414c1af16f02c00782c985971f87c/cache-fda9502ac5b20332.arrow
ds = ds.cast_column('image', datasets.Image())
ds['train'][0]['image']
Where won’t this work?
This workflow is based on the assumption that the dataset you are annotating is public from the start. This is usually possible for the domain I work in (libraries) but could be a major blocker for other people. This workflow might also break if you have lots of people annotating. There are probably ways around this but things could start becoming a bit hacky…
The loading script for this dataset does some slightly strange things to avoid loading images that don’t yet have annotations. I think it would make sense to rework this script if you get to a point where you’re unlikely to do any more annotation.