A (very brief) intro to exploring metadata on the Hugging Face Hub
How we can use the `huggingface_hub` library to explore metadata on the Hugging Face Hub.
The Hugging Face Hub has become the de facto place to share machine learning models and datasets.
As the number of models and datasets grows, finding the right one for your needs becomes more of a challenge.
There are various ways in which we can try to make it easier for people to find relevant models and datasets.
One of these is by associating metadata with datasets and models.
This blog post will (very briefly) begin to explore metadata on the Hugging Face Hub.
Often you'll want to explore models and datasets via the Hub website, but this isn't the only way. As part of exploring metadata on the Hugging Face Hub, we'll briefly look at how we can use the `huggingface_hub` library to programmatically interact with the Hub.
For this post we'll need a few libraries: `pandas`, `requests` and `matplotlib` are likely old friends (or foes...). The `huggingface_hub` library might be new to you but will soon become a good friend too! The `rich` library is fantastically useful for quickly getting familiar with a library (i.e. avoiding reading all the docs!) so we'll import that too.
import requests
from huggingface_hub import hf_api
import pandas as pd
import matplotlib.pyplot as plt
import rich
%matplotlib inline
plt.style.use("ggplot")
We'll create an instance of the `HfApi` class.
api = hf_api.HfApi()
We can use `rich.inspect` to get a better sense of what a function or class instance is all about. Let's see what methods the `api` object has.
rich.inspect(api, methods=True)
Looking through this, you'll see there are a bunch of different things we can now do programmatically via the Hub. For this post we're interested in the `list_datasets` and `list_models` methods. If we look at one of these, we can see it has a bunch of different options we can use when listing datasets or models.
rich.inspect(api.list_models)
For our use case we want everything, so we set `limit=None`. We don't want any filters, so we set `filter=None` (this is the default behaviour, but we set it explicitly here to make things clearer for our future selves). We also set `full=True` so we get back more verbose information about our datasets and models. Finally, we wrap the result in `iter` and `list`, since the behaviour of these methods will change in future versions to support paging.
hub_datasets = list(iter(api.list_datasets(limit=None, filter=None, full=True)))
hub_models = list(iter(api.list_models(limit=None, filter=None, full=True)))
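Listing every model and dataset makes a lot of API calls, so it can be worth caching the results locally between sessions. Here's a minimal sketch, assuming we just serialize each item's dictionary to a JSON file (the helper names and file name are my own, not part of `huggingface_hub`):

```python
import json
import pathlib
import tempfile

def cache_items(items: list[dict], path: pathlib.Path) -> None:
    # Serialize the (already dict-shaped) hub items to a JSON file.
    path.write_text(json.dumps(items))

def load_items(path: pathlib.Path) -> list[dict]:
    # Read the cached items back as the same list of dicts.
    return json.loads(path.read_text())

# Toy example standing in for e.g. [m.__dict__ for m in hub_models]
items = [{"modelId": "bert-base-uncased", "tags": ["en", "transformers"]}]
cache_path = pathlib.Path(tempfile.mkdtemp()) / "hub_models.json"
cache_items(items, cache_path)
print(load_items(cache_path)[0]["modelId"])  # bert-base-uncased
```

Since the Hub changes constantly, you'd want to refresh such a cache regularly; it's only a convenience while iterating on an analysis.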
Let's peek at an example of what we get back
hub_models[0]
hub_datasets[0]
Since we want both models and datasets, we'll create a dictionary that stores the type of each item, i.e. whether it is a dataset or a model.
hub_data = {"model": hub_models, "dataset": hub_datasets}
We'll be putting our data inside a pandas DataFrame, so we'll grab the `.__dict__` attribute for each hub item to make it more pandas-friendly.
hub_item_dict = []
for hub_type, hub_item in hub_data.items():
    for item in hub_item:
        data = item.__dict__
        data["type"] = hub_type
        hub_item_dict.append(data)
df = pd.DataFrame.from_dict(hub_item_dict)
How many hub items do we have?
len(df)
What info do we have?
df.columns
Let's look at the tags for one example item.
df.loc[30, "tags"]
We can see that tags can relate to tasks, e.g. `text-classification`, libraries supported, e.g. `tf`, or the licence associated with a model or dataset. As a starting point for exploring tags, we can take a look at how many tags models and datasets have. We'll add a new column to capture this number.
def calculate_number_of_tags(tags: list[str]) -> int:
    return len(tags)

df["number_of_tags"] = df["tags"].apply(calculate_number_of_tags)
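As an aside, pandas can compute the same column without a helper function: `Series.str.len` also works on list-valued columns. A quick sketch on toy data:

```python
import pandas as pd

# Toy frame with a list-valued "tags" column, like our hub DataFrame
toy = pd.DataFrame({"tags": [["en", "tf"], [], ["license:mit"]]})
toy["number_of_tags"] = toy["tags"].str.len()
print(toy["number_of_tags"].tolist())  # [2, 0, 1]
```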
We can now use `describe` to see the breakdown of this number.
df.number_of_tags.describe()
We can see quite a range in tag counts, from 0 all the way to 650! If your brain works anything like mine, you probably want to know what this high value is about!
df[df.number_of_tags > 640][["id", "tags"]]
df[df.number_of_tags > 640]["tags"].tolist()
We can see that in this case many of the tags relate to language. Since the dataset is Bible-related, and the Bible has been heavily translated, this might not be so surprising.
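If we wanted to check that hunch programmatically, one crude heuristic (my own, not anything the Hub provides) is to count tags that look like ISO 639 language codes:

```python
import re

def looks_like_language_code(tag: str) -> bool:
    # Crude heuristic: two- or three-letter lowercase tags are often ISO 639 codes.
    # This has false positives (e.g. "tf"), so treat it as a rough estimate only.
    return re.fullmatch(r"[a-z]{2,3}", tag) is not None

tags = ["en", "fr", "acu", "text-classification", "tf", "license:cc0-1.0"]
print([t for t in tags if looks_like_language_code(t)])  # ['en', 'fr', 'acu', 'tf']
```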
Although these high-level stats are somewhat interesting, we probably want to break these numbers down. At a high level, we can group by datasets vs. models.
df.groupby("type")["number_of_tags"].describe()
We can see that the mean number of tags for models is higher than for datasets. We can also see that at the 75th percentile models have more tags than datasets. The possible reasons for this (and whether or not it is a problem) are something we may wish to explore further...
Since the Hub hosts models from different libraries, we may also want to break down by library. First let's grab only the model part of our DataFrame.
models_df = df[df["type"] == "model"]
The `library_name` column contains info about the library. Let's see how many unique libraries we have.
models_df.library_name.unique().shape
This is quite a few! We can do a `groupby` on this column.
models_df.groupby("library_name")["number_of_tags"].describe()
We might find this a bit tricky to look at. We may want to include only the top n libraries, since some of these libraries may be less well used.
models_df.library_name.value_counts()[:15]
top_libraries = models_df.library_name.value_counts()[:9].index.to_list()
top_libraries_df = models_df[models_df.library_name.isin(top_libraries)]
top_libraries_df.groupby("library_name")["number_of_tags"].describe()
Let's take a quick look at some examples from the libraries with the highest and lowest number of tags.
top_libraries_df[top_libraries_df.library_name == "sentence-transformers"].sample(15)[
"tags"
]
top_libraries_df[top_libraries_df.library_name == "timm"].sample(15)["tags"]
We can see here that some tags for `sentence-transformers` are very closely tied to that library's purpose, e.g. the `sentence-similarity` tag. This tag might be useful when a user is looking for models to do sentence similarity, but might be less useful if you are trying to choose between models for this task, i.e. trying to find the `sentence-transformers` model that will be useful for you. We should be careful, therefore, about treating the number of tags as a proxy for quality.
The `pipeline_tag` column tells us which task a model is associated with.
models_df["pipeline_tag"].value_counts()
We may also want to see if some types of task tend to have more tags.
models_df.groupby("pipeline_tag")["number_of_tags"].mean().sort_values().plot.barh()
We can also look at the breakdown for a particular task.
text_classification_df = models_df[models_df["pipeline_tag"] == "text-classification"]
text_classification_df["number_of_tags"].describe()
Again, we have some extreme outliers
text_classification_df[text_classification_df.number_of_tags > 230][["tags", "modelId"]]
We see that these mostly seem to relate to language. Let's remove these outliers and look at the distribution in the number of tags without these.
text_classification_df_no_outliers = text_classification_df[
text_classification_df["number_of_tags"]
<= text_classification_df["number_of_tags"].quantile(0.95)
]
text_classification_df_no_outliers["number_of_tags"].plot.hist(bins=9)
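This quantile-based trim is a general pattern for any numeric column; on toy data it looks like this (the 0.95 cut-off is the same arbitrary choice as above):

```python
import pandas as pd

# Toy stand-in for a "number_of_tags" column with one extreme outlier
counts = pd.Series([1, 2, 3, 2, 1, 2, 3, 1, 2, 650])
cutoff = counts.quantile(0.95)
trimmed = counts[counts <= cutoff]
print(650 in trimmed.values)  # False: the outlier is dropped
```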
First we grab all the tags and put them in a single list, using `concat` from the `toolz` library.
from toolz import concat
all_tags = list(concat(df.tags.tolist()))
If we look at some examples, we'll see some tags are in the form `something:somethingelse`.
all_tags[:10]
For example, `dataset:wikipedia`. We should therefore avoid treating all tags as the same, since tags can have a particular purpose, e.g. indicating that a dataset is associated with a model.
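A natural next step is to count how often each tag appears; `collections.Counter` is enough for a sketch (the tags below are made-up stand-ins for the real `all_tags` list):

```python
from collections import Counter

# Made-up stand-in for the real all_tags list
all_tags = ["en", "tf", "en", "license:mit", "dataset:wikipedia", "en"]
print(Counter(all_tags).most_common(1))  # [('en', 3)]
```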
def is_special_tag(tag: str) -> bool:
    return ":" in tag
from toolz import countby, valmap
special_tag_vs_normal = countby(is_special_tag, all_tags)
special_tag_vs_normal
total = sum(special_tag_vs_normal.values())
valmap(lambda x: x / total, special_tag_vs_normal)
We can see that a good chunk of tags are 'special' tags, i.e. they have a 'type' associated with them. If we want to explore tags on the Hub more carefully, we'll need to take this into account...
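One way to take it into account is to group the special tags by their prefix, i.e. the part before the colon. A minimal sketch with `toolz.countby`, again on made-up stand-in tags:

```python
from toolz import countby

# Made-up stand-ins for the special tags found above
special_tags = ["dataset:wikipedia", "dataset:squad", "license:mit", "arxiv:2104.08691"]

def tag_prefix(tag: str) -> str:
    # Everything before the first colon, e.g. "dataset" for "dataset:wikipedia"
    return tag.split(":", 1)[0]

print(countby(tag_prefix, special_tags))  # {'dataset': 2, 'license': 1, 'arxiv': 1}
```

Counting by prefix like this gives a quick view of which *kinds* of special tag dominate, which would be a sensible starting point for a follow-up analysis.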