Combining Hugging Face datasets with dask
Hugging Face datasets is a super useful library for loading, processing and sharing datasets with other people.
For many pre-processing steps it works beautifully. One area where it can be a bit trickier to use is EDA (exploratory data analysis). This column-wise EDA is often an important early step when getting to know some data or preparing a data card.
Fortunately, combining datasets with another data library, dask, works pretty smoothly. This isn't intended to be a full intro to either datasets or dask, but hopefully it gives you a sense of how both libraries work and how they can complement each other.
First, make sure we have the required libraries. Rich is there for a little added visual flair ✨
%%capture
!pip install datasets toolz rich[jupyter] dask
%load_ext rich
For this example we will use the blbooksgenre dataset, which contains metadata about some digitised books from the British Library. This collection also includes annotations for the genre of each book, which we could use to train a machine learning model.
We can load a dataset hosted on the Hugging Face Hub using the load_dataset function.
from datasets import load_dataset
ds = load_dataset("blbooksgenre", "annotated_raw", split="train")
Since we requested only the train split we get back a Dataset
ds
We can see this has a bunch of columns. One that is of interest is the Date of publication column. Since we could use this dataset to train some type of classifier, we may want to check whether we have enough examples across different time periods.
ds[0]["Date of publication"]
One quick way to get the frequency count for a column is the wonderful toolz library.
If our data fits in memory, we can simply pass a column containing categorical values to the frequencies function to get a frequency count.
from toolz import frequencies, topk
dates = ds["Date of publication"]
frequencies(dates)
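Since we imported topk alongside frequencies, we can also rank those counts directly with toolz. A toy sketch with made-up dates (standing in for the real column):

```python
from toolz import frequencies, topk

# made-up dates standing in for the "Date of publication" column
dates = ["1850", "1850", "1860", "1851", "1850"]
counts = frequencies(dates)  # {"1850": 3, "1860": 1, "1851": 1}
# rank the (date, count) pairs by their count
top = topk(2, counts.items(), key=lambda kv: kv[1])
```

topk returns a tuple of the k largest items, so here the first entry is the most common date with its count.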
Make it parallel!
If our data doesn't fit in memory or we want to do things in parallel we might want to use a slightly different approach. This is where dask can play a role.
Dask offers a number of different collection abstractions that make it easier to do things in parallel. This includes dask bag.
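To give a feel for the bag workflow before applying it to the real data, here is a toy sketch (this uses the default local scheduler, so no client is needed):

```python
import dask.bag as db

# a small bag split across two partitions
b = db.from_sequence(range(10), npartitions=2)
squares = b.map(lambda x: x * x)  # lazy: this only builds a task graph
result = squares.compute()  # triggers the (potentially parallel) computation
```

The key idea is that operations on a bag are lazy until compute is called, at which point dask can run the work across partitions in parallel.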
First we'll create a dask client. I won't dig into the details of this here, but you can get a good overview in the getting started pages.
from distributed import Client
client = Client()
Since we don't want to load all of our data into memory, we can create a generator that will yield one row at a time. In this case we'll start by exploring the Title column.
def yield_titles():
    for row in ds:
        yield row["Title"]
We can see that this returns a generator
yield_titles()
next(iter(yield_titles()))
We can store this in a titles variable.
titles = yield_titles()
We'll now import dask bag.
import dask.bag as db
We can create a dask bag object using the from_sequence method.
bag = db.from_sequence(titles)
bag
We can look at an example using the take method.
bag.take(1)
dask bag has a bunch of handy methods for processing data (some of these we could also do in 🤗 datasets but others are not available as specific methods in datasets).
For example, we can make sure we only have unique titles using the distinct method.
unique_titles = bag.distinct()
unique_titles.take(4)
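distinct is just one of these handy methods; filter is another. A toy sketch with made-up titles:

```python
import dask.bag as db

# hypothetical titles to illustrate chaining distinct and filter
b = db.from_sequence(["A Tale", "A Tale", "Poems", "Essays"])
# drop duplicates, then keep only titles starting with "A"
a_titles = b.distinct().filter(lambda t: t.startswith("A")).compute()
```

As with the other bag methods, these chain together lazily and only run when compute is called.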
Similar to 🤗 datasets we have a map method that we can use to apply a function to all of our examples. In this case we split the title text into individual words.
title_words_split = unique_titles.map(lambda x: x.split(" "))
title_words_split.take(2)
We can see we now have all our words in lists. Helpfully, dask bag has a flatten method, which will consume our lists and put all the words in a single sequence.
flattend_title_words = title_words_split.flatten()
flattend_title_words.take(2)
We could now use the frequencies method to get the top words.
freqs = flattend_title_words.frequencies(sort=True)
freqs
Since dask bag methods are lazy by default nothing has actually been calculated yet. We could just grab the top 10 words.
top_10_words = freqs.topk(10, key=1)
If we want the results of something we call compute, which will run all of the chained methods on our bag.
top_10_words.compute()
We could also do the same with a lowercased version of each word.
lowered_title_words = flattend_title_words.map(lambda x: x.lower())
freqs = lowered_title_words.frequencies(sort=True)
The visualize method gives you some insights into how the computation is managed by dask.
freqs.visualize(engine="cytoscape", optimize_graph=True)
Make it a dataframe!
dask also offers a dataframe collection, which mirrors much of the pandas API. To use it, we first write our dataset out to a parquet file using the to_parquet method.
ds.to_parquet("genre.parquet")
We can then import dask dataframe
import dask.dataframe as dd
and load from this file.
ddf = dd.read_parquet("genre.parquet")
A dask dataframe works quite similarly to a pandas dataframe, but it is lazy by default, so if we just print it out
ddf
you'll see we don't actually get back any data. If we use the head method, we get back the number of rows we ask for.
ddf.head(3)
We also have some familiar methods from pandas available to us. For example, we can drop duplicate titles.
ddf = ddf.drop_duplicates(subset="Title")
As an example of something that would be a bit tricky in datasets, we can compute the mean title length grouped by year of publication. First we create a new column containing the title length.
ddf["title_len"] = ddf["Title"].map(lambda x: len(x))
We can then group by the date of publication
grouped = ddf.groupby("Date of publication")
and then calculate the mean title_len
mean_title_len = grouped["title_len"].mean()
To actually compute this value we call the compute method.
mean_title_len.compute()
We can also create a plot in the usual way
mean_title_len.compute().plot()
This was a very quick overview. The dask docs go into much more detail as do the Hugging Face datasets docs.