Dynamically updating a Hugging Face hub organization README
Using the huggingface_hub library and Jinja to update a README dynamically
- What are we aiming to do?
- How can we create a README that dynamically updates?
- Updating the README on the Hugging Face Hub
tl;dr: we can use the huggingface_hub library to auto-generate the organization README for BigLAM.
What are we aiming to do?
The Hugging Face Hub allows organizations to create a README card to describe themselves. Whilst you can create this card manually, some of its content might be nice to auto-populate. For example, the BigLAM organization is mainly focused on collecting datasets. Since these datasets support many different tasks, we might want to create a list of datasets organized by task, and ideally we don't want to have to update that list manually. Let's see how we can do this!
First we'll install the huggingface_hub library, which allows us to interact with the Hub. We'll also install Jinja2 for templating and toolz, because toolz makes Python infinitely more delightful!

```python
%pip install huggingface_hub toolz Jinja2
```
```python
import toolz
from huggingface_hub import list_datasets
```
We list all the datasets under this organization:

```python
big_lam_datasets = list(list_datasets(author="biglam", limit=None, full=True))
```
We want to check which tasks our organization currently has. If we look at an example of one dataset:

```python
big_lam_datasets[0]
```
We can see that the cardData attribute contains the tasks supported by a dataset:

```python
big_lam_datasets[0].cardData["task_categories"]
```
Not every dataset in the organization has task_categories defined in its metadata, so we'll write a small generator that yields the tasks a dataset supports and quietly skips datasets without any:

```python
def get_task_categories(dataset):
    try:
        yield from dataset.cardData["task_categories"]
    except (KeyError, TypeError):
        # the dataset either has no task_categories field or no card data at all
        return
```
We can use the toolz.frequencies function to get counts of these tasks across our org.

```python
task_frequencies = toolz.frequencies(
    toolz.concat(map(get_task_categories, big_lam_datasets))
)
task_frequencies
```
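If you haven't come across toolz before, here's a quick illustration of what concat and frequencies are doing for us; the input lists here are made up, standing in for the per-dataset task lists:

```python
import toolz

# concat flattens an iterable of iterables into a single stream of items
list(toolz.concat([["ocr"], ["ocr", "fill-mask"]]))
# ['ocr', 'ocr', 'fill-mask']

# frequencies counts how often each item occurs
toolz.frequencies(toolz.concat([["ocr"], ["ocr", "fill-mask"]]))
# {'ocr': 2, 'fill-mask': 1}
```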
Since we want to organize by task type, let's grab the names of all the tasks in the BigLAM organization.
```python
tasks = task_frequencies.keys()
tasks
```
We now want to group together datasets by the task(s) they support. We can use a defaultdict to create a dictionary where the keys are tasks and the values are lists of datasets supporting each task. Note that some datasets support multiple tasks, so they may appear under more than one task key.

```python
from collections import defaultdict

datasets_by_task = defaultdict(list)
for dataset in big_lam_datasets:
    for task in get_task_categories(dataset):
        datasets_by_task[task].append(dataset)
```
We now have a dictionary which allows us to look up all the datasets supporting a task, for example fill-mask:

```python
datasets_by_task["fill-mask"]
```
How can we create a README that dynamically updates?
We now have our datasets organized by task. However, at the moment this is in the form of a Python dictionary, and it would be much nicer to render it in a more pleasing format. This is where a templating engine can help; in this case we'll use Jinja.

A templating engine allows us to create a template which can be dynamically updated based on the values we pass in. We won't go in depth on templating engines/Jinja in this blog post because I'm not an expert in templating engines. This Real Python article is a nice introduction to Jinja.
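To give a flavour of the core idea, here's a tiny self-contained example (the template string and values are made up for illustration):

```python
from jinja2 import Template

# placeholders in {{ }} are filled in at render time
template = Template("BigLAM has {{ count }} datasets for {{ task }}")
print(template.render(count=3, task="fill-mask"))
# BigLAM has 3 datasets for fill-mask
```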
```python
from jinja2 import Environment, FileSystemLoader
```
We can start by taking a look at our template. Since much of the template I created is static, we'll use tail to look at the bottom of the template, which is the part that updates dynamically.

```python
!tail -n 12 templates/readme.jinja
```
Even if you aren't familiar with templating engines, you can probably see roughly what this does. We loop through all the keys and values in our dictionary, creating a section for each task based on the dictionary key. We then loop through the values (in this case a list of datasets) and create a link for each dataset. Since we're looping through DatasetInfo objects, we can grab things like the pretty_name for each dataset and dynamically create a URL.
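For reference, here is a rough sketch of what the dynamic part of such a template could look like; this is a reconstruction from the description above (using <details> for the collapsible sections), not the exact template used for BigLAM:

```jinja
{% for task, datasets in task_dictionary.items() %}
<details>
<summary>{{ task }}</summary>

{% for dataset in datasets %}
- [{{ dataset.cardData["pretty_name"] }}](https://huggingface.co/datasets/{{ dataset.id }})
{% endfor %}
</details>
{% endfor %}
```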
We can load this template as follows:

```python
environment = Environment(loader=FileSystemLoader("templates/"))
template = environment.get_template("readme.jinja")
```
Next we create a context dictionary, which is how we pass our datasets_by_task dictionary into the template:

```python
context = {
    "task_dictionary": datasets_by_task,
}
```
We can now render the template to see how it looks, and write the result to a temporary file:

```python
rendered = template.render(context)
print(rendered)

with open("/tmp/README.md", "w") as f:
    f.write(rendered)
```
Updating the README on the Hugging Face Hub

This looks pretty good! It would be nice to also update the org README without having to manually edit the file. The huggingface_hub library helps us out here once again. Since the organization README is actually a special type of Hugging Face Space, we can interact with it in the same way we would with models or datasets.
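As an aside (not needed for the rest of this post), this means we can, for example, download the current README from the Space just as we would any other file on the Hub:

```python
from huggingface_hub import hf_hub_download

# the org README lives in a Space repo called biglam/README
path = hf_hub_download(
    repo_id="biglam/README", filename="README.md", repo_type="space"
)
```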
```python
from huggingface_hub import HfApi, notebook_login
```
We'll create an HfApi instance:

```python
api = HfApi()
```
Since we're planning to write to a repo, we'll need to log in to the Hub:

```python
notebook_login()
```
We can now upload the rendered README file we created above to our biglam/README Space:

```python
api.upload_file(
    path_or_fileobj="/tmp/README.md",  # the file we rendered above
    path_in_repo="README.md",
    repo_id="biglam/README",
    repo_type="space",
)
```
If we look at our updated README, we'll see we now have some nice collapsible sections for each task type, containing the datasets for that task.
Next steps: whilst this is already quite useful, at the moment we still have to run this code whenever we want to regenerate our README. Webhooks make it possible to fully automate this by creating a webhook that monitors changes to repos under the BigLAM org. I'd love to hear from anyone who tries this out!
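As a rough sketch of what that could look like: recent versions of huggingface_hub ship a webhook_endpoint helper, and the regenerate_readme function below is hypothetical, standing in for the steps in this post:

```python
from huggingface_hub import WebhookPayload, webhook_endpoint

@webhook_endpoint
async def update_org_readme(payload: WebhookPayload) -> None:
    # re-render and re-upload the README whenever a biglam dataset changes
    if payload.repo.type == "dataset" and payload.repo.name.startswith("biglam/"):
        regenerate_readme()  # hypothetical wrapper around the code above
```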