tl;dr we can use the huggingface_hub library to auto-generate an organization README card for the BigLAM organization.

What are we aiming to do?

The Hugging Face Hub allows organizations to create a README card describing the organization.

Whilst you can create this card manually, there may be content that would be nice to auto-populate. For example, the BigLAM organization is mainly focused on collecting datasets. Since these datasets support many different tasks, we might want a list of datasets organized by task, and ideally we don't want to update that list by hand. Let's see how we can do this!

First we'll install the huggingface_hub library, which allows us to interact with the Hub. We'll also install Jinja2 for templating and toolz, because toolz makes Python infinitely more delightful!

%pip install huggingface_hub toolz Jinja2

import toolz
from huggingface_hub import list_datasets

We list all the datasets under the organization:

big_lam_datasets = list(list_datasets(author="biglam", limit=None, full=True))
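
As a quick sanity check, we could look at how many datasets came back; the exact count changes as the organization grows, so treat this as purely illustrative.

len(big_lam_datasets)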

We want to check which tasks the organization's datasets currently cover. Let's look at one dataset as an example:

big_lam_datasets[0]
DatasetInfo: {
	id: biglam/illustrated_ads
	sha: 688e7d96e99cd5730a17a5c55b0964d27a486904
	lastModified: 2023-01-18T20:38:15.000Z
	tags: ['task_categories:image-classification', 'task_ids:multi-class-image-classification', 'annotations_creators:expert-generated', 'size_categories:n<1K', 'license:cc0-1.0', 'lam', 'historic newspapers']
	private: False
	author: biglam
	description: The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection.
	citation: @dataset{van_strien_daniel_2021_5838410,
  author       = {van Strien, Daniel},
  title        = {{19th Century United States Newspaper Advert images 
                   with 'illustrated' or 'non illustrated' labels}},
  month        = oct,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {0.0.1},
  doi          = {10.5281/zenodo.5838410},
  url          = {https://doi.org/10.5281/zenodo.5838410}}
	cardData: {'annotations_creators': ['expert-generated'], 'language': [], 'language_creators': [], 'license': ['cc0-1.0'], 'multilinguality': [], 'pretty_name': "19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels", 'size_categories': ['n<1K'], 'source_datasets': [], 'tags': ['lam', 'historic newspapers'], 'task_categories': ['image-classification'], 'task_ids': ['multi-class-image-classification']}
	siblings: []
	_id: 62b9bb453b3301c319d5b53e
	disabled: False
	gated: False
	gitalyUid: 4a051da032bb27da0bc286b288384bb3362f56546a387b130121cd279db336e1
	likes: 3
	downloads: 11
}

We can see that the cardData attribute contains a task_categories entry listing the tasks supported by a dataset.

big_lam_datasets[0].cardData['task_categories']
['image-classification']

To handle this, let's write a small helper that yields the task categories declared in a dataset's card and yields nothing for datasets that don't declare any.

def get_task_categories(dataset):
    # Yield each task category declared in the dataset card.
    # Datasets without a 'task_categories' field contribute nothing.
    try:
        yield from dataset.cardData['task_categories']
    except KeyError:
        return
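
If you prefer working with plain lists rather than a generator, an equivalent helper might look like this (a sketch; the name get_task_categories_list is hypothetical, and it also guards against datasets with no card data at all):

def get_task_categories_list(dataset):
    # Return the declared task categories as a list, or an empty list if
    # the dataset has no card data or no 'task_categories' field.
    card_data = dataset.cardData or {}
    return card_data.get('task_categories', [])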

We can use the toolz.frequencies function to count how often each task appears across the datasets in our org.

task_frequencies = toolz.frequencies(
    toolz.concat(map(get_task_categories, big_lam_datasets))
)
task_frequencies
{'image-classification': 8,
 'text-classification': 6,
 'image-to-text': 2,
 'text-generation': 7,
 'object-detection': 5,
 'fill-mask': 2,
 'text-to-image': 1,
 'image-to-image': 1,
 'token-classification': 1}
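
As an aside, the standard library's collections.Counter would give us the same counts, along with a handy most_common ordering if we ever wanted to sort the README sections by how many datasets support each task. A minimal alternative sketch:

from collections import Counter

task_counter = Counter(toolz.concat(map(get_task_categories, big_lam_datasets)))
task_counter.most_common()  # e.g. [('image-classification', 8), ('text-generation', 7), ...]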

Since we want to organize by task type, let's grab the names of all the tasks in the BigLAM organization.

tasks = task_frequencies.keys()
tasks
dict_keys(['image-classification', 'text-classification', 'image-to-text', 'text-generation', 'object-detection', 'fill-mask', 'text-to-image', 'image-to-image', 'token-classification'])

We now want to group datasets by the task(s) they support. We can use a defaultdict to build a dictionary where the keys are tasks and the values are lists of datasets supporting that task. Note that some datasets support multiple tasks, so they may appear under more than one task key.

from collections import defaultdict

datasets_by_task = defaultdict(list)
for dataset in big_lam_datasets:
    for task in get_task_categories(dataset):
        datasets_by_task[task].append(dataset)
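
A quick way to check the grouping is to compare the number of datasets under each task with the frequencies we computed earlier; the two should agree (a hypothetical sanity check):

counts_by_task = {task: len(datasets) for task, datasets in datasets_by_task.items()}
assert counts_by_task == task_frequencies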

We now have a dictionary that lets us look up all the datasets supporting a given task, for example fill-mask:

datasets_by_task["fill-mask"]
[DatasetInfo: {
 	id: biglam/berlin_state_library_ocr
 	sha: a890935d5bd754ddc5b85f56b6f34f6d2bb4abba
 	lastModified: 2022-08-05T09:36:24.000Z
 	tags: ['task_categories:fill-mask', 'task_categories:text-generation', 'task_ids:masked-language-modeling', 'task_ids:language-modeling', 'annotations_creators:machine-generated', 'language_creators:expert-generated', 'multilinguality:multilingual', 'size_categories:1M<n<10M', 'language:de', 'language:nl', 'language:en', 'language:fr', 'language:es', 'license:cc-by-4.0', 'ocr', 'library']
 	private: False
 	author: biglam
 	description: None
 	citation: None
 	cardData: {'annotations_creators': ['machine-generated'], 'language': ['de', 'nl', 'en', 'fr', 'es'], 'language_creators': ['expert-generated'], 'license': ['cc-by-4.0'], 'multilinguality': ['multilingual'], 'pretty_name': 'Berlin State Library OCR', 'size_categories': ['1M<n<10M'], 'source_datasets': [], 'tags': ['ocr', 'library'], 'task_categories': ['fill-mask', 'text-generation'], 'task_ids': ['masked-language-modeling', 'language-modeling']}
 	siblings: []
 	_id: 62e0431281d9ca6484efac31
 	disabled: False
 	gated: False
 	gitalyUid: 3818ba9c8b624d79f1fcfb0c79bd197fb5b3a3f9de2452aed5028e8b6435f56a
 	likes: 3
 	downloads: 5
 },
 DatasetInfo: {
 	id: biglam/bnl_newspapers1841-1879
 	sha: 588db6c242ecae417b92830d5646121c15726fea
 	lastModified: 2022-11-15T09:25:43.000Z
 	tags: ['task_categories:text-generation', 'task_categories:fill-mask', 'task_ids:language-modeling', 'task_ids:masked-language-modeling', 'annotations_creators:no-annotation', 'language_creators:expert-generated', 'multilinguality:multilingual', 'size_categories:100K<n<1M', 'source_datasets:original', 'language:de', 'language:fr', 'language:lb', 'language:nl', 'language:la', 'language:en', 'license:cc0-1.0', 'newspapers', '1800-1900']
 	private: False
 	author: biglam
 	description: None
 	citation: None
 	cardData: {'annotations_creators': ['no-annotation'], 'language': ['de', 'fr', 'lb', 'nl', 'la', 'en'], 'language_creators': ['expert-generated'], 'license': ['cc0-1.0'], 'multilinguality': ['multilingual'], 'pretty_name': 'BnL Newspapers 1841-1879', 'size_categories': ['100K<n<1M'], 'source_datasets': ['original'], 'tags': ['newspapers', '1800-1900'], 'task_categories': ['text-generation', 'fill-mask'], 'task_ids': ['language-modeling', 'masked-language-modeling']}
 	siblings: []
 	_id: 6372286ce8891da06b2a5d2f
 	disabled: False
 	gated: False
 	gitalyUid: 039f217af964cfa1317f03d58c367ba6f0e415721b107a298cd4e75cbad50e8b
 	likes: 2
 	downloads: 3
 }]
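
If we only want the repository ids rather than the full DatasetInfo objects, a small comprehension gives a tidier view (illustrative):

[dataset.id for dataset in datasets_by_task["fill-mask"]]
# ['biglam/berlin_state_library_ocr', 'biglam/bnl_newspapers1841-1879']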

How can we create a README that dynamically updates?

We now have our datasets organized by task. However, at the moment this is just a Python dictionary. It would be much nicer to render it in a more pleasing format, and this is where a templating engine can help. In this case we'll use Jinja.

A templating engine allows us to create a template which is dynamically filled in with values we pass to it. We won't go into depth on templating engines or Jinja in this blog post because I'm not an expert in them. This Real Python article is a nice introduction to Jinja.
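
To give a flavour of what a templating engine does, here is a tiny standalone example that renders a template defined as a string (purely illustrative, nothing to do with our README yet):

from jinja2 import Template

Template("Hello {{ name }}!").render(name="BigLAM")  # 'Hello BigLAM!'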

from jinja2 import Environment, FileSystemLoader

We can start by taking a look at our template. Since much of the template I created is static, we'll use tail to look at the bottom of the template, which is the part that updates dynamically.

!tail -n 12 templates/readme.jinja
An overview of datasets currently made available via BigLam organised by task type.

{% for task_type, datasets in task_dictionary.items() %}

<details>
  <summary>{{ task_type }}</summary>
    {% for dataset in datasets %}
  - [{{dataset.cardData['pretty_name']}}](https://huggingface.co/datasets/{{ dataset.id }})
  {%- endfor %}

</details>
{% endfor %}

Even if you aren't familiar with templating engines, you can probably see roughly what this does. We loop through all the keys and values in our dictionary and create a collapsible section for each task based on the dictionary key. We then loop through the dictionary value (which in this case is a list of datasets) and create a link for each dataset. Since we're looping through DatasetInfo objects, we can grab things like the dataset's pretty_name and dynamically build the URL. Note that dataset.id already includes the biglam/ prefix, so the template doesn't need to add it again.

We can load this template as follows:

environment = Environment(loader=FileSystemLoader("templates/"))
template = environment.get_template("readme.jinja")

We create a context dictionary, which we use to pass our task dictionary into the template.

context = {
    "task_dictionary": datasets_by_task,
}
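
If we wanted the task sections to appear in a stable, alphabetical order rather than in whatever order the datasets were returned, we could sort the dictionary before passing it in. A possible (unused here) variant:

sorted_context = {
    "task_dictionary": dict(sorted(datasets_by_task.items())),
}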

We can now render the template and see how it looks:

print(template.render(context))
---
title: README
emoji: 📚
colorFrom: pink
colorTo: gray
sdk: static
pinned: false
---

BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions who collaborate on various projects within the natural language processing (NLP) space to broaden the accessibility of language datasets while working on challenging scientific questions around training language models.


BigLAM started as a [datasets hackathon](https://github.com/bigscience-workshop/lam) focused on making data from Libraries, Archives, and Museums (LAMS) with potential machine-learning applications accessible via the Hugging Face Hub.
We are continuing to work on making more datasets available via the Hugging Face hub to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine-learning datasets more closely reflect the richness of human culture.


## Dataset Overview

An overview of datasets currently made available via BigLam organised by task type.



<details>
  <summary>image-classification</summary>
    
  - [19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels](https://huggingface.co/datasets/biglam/illustrated_ads)
  - [Brill Iconclass AI Test Set ](https://huggingface.co/datasets/biglam/brill_iconclass)
  - [National Library of Scotland Chapbook Illustrations](https://huggingface.co/datasets/biglam/nls_chapbook_illustrations)
  - [Encyclopaedia Britannica Illustrated](https://huggingface.co/datasets/biglam/encyclopaedia_britannica_illustrated)
  - [V4Design Europeana style dataset](https://huggingface.co/datasets/biglam/v4design_europeana_style_dataset)
  - [Early Printed Books Font Detection Dataset](https://huggingface.co/datasets/biglam/early_printed_books_font_detection)
  - [Dataset of Pages from Early Printed Books with Multiple Font Groups](https://huggingface.co/datasets/biglam/early_printed_books_with_multiple_font_groups)
  - [DEArt: Dataset of European Art](https://huggingface.co/datasets/biglam/european_art)

</details>


<details>
  <summary>text-classification</summary>
    
  - [Annotated dataset to assess the accuracy of the textual description of cultural heritage records](https://huggingface.co/datasets/biglam/cultural_heritage_metadata_accuracy)
  - [Atypical Animacy](https://huggingface.co/datasets/biglam/atypical_animacy)
  - [Old Bailey Proceedings](https://huggingface.co/datasets/biglam/old_bailey_proceedings)
  - [Lampeter Corpus](https://huggingface.co/datasets/biglam/lampeter_corpus)
  - [Hansard Speeches](https://huggingface.co/datasets/biglam/hansard_speech)
  - [Contentious Contexts Corpus](https://huggingface.co/datasets/biglam/contentious_contexts)

</details>


<details>
  <summary>image-to-text</summary>
    
  - [Brill Iconclass AI Test Set ](https://huggingface.co/datasets/biglam/brill_iconclass)
  - [Old Book Illustrations](https://huggingface.co/datasets/biglam/oldbookillustrations)

</details>


<details>
  <summary>text-generation</summary>
    
  - [Old Bailey Proceedings](https://huggingface.co/datasets/biglam/old_bailey_proceedings)
  - [Hansard Speeches](https://huggingface.co/datasets/biglam/hansard_speech)
  - [Berlin State Library OCR](https://huggingface.co/datasets/biglam/berlin_state_library_ocr)
  - [Literary fictions of Gallica](https://huggingface.co/datasets/biglam/gallica_literary_fictions)
  - [Europeana Newspapers ](https://huggingface.co/datasets/biglam/europeana_newspapers)
  - [Gutenberg Poetry Corpus](https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus)
  - [BnL Newspapers 1841-1879](https://huggingface.co/datasets/biglam/bnl_newspapers1841-1879)

</details>


<details>
  <summary>object-detection</summary>
    
  - [National Library of Scotland Chapbook Illustrations](https://huggingface.co/datasets/biglam/nls_chapbook_illustrations)
  - [YALTAi Tabular Dataset](https://huggingface.co/datasets/biglam/yalta_ai_tabular_dataset)
  - [YALTAi Tabular Dataset](https://huggingface.co/datasets/biglam/yalta_ai_segmonto_manuscript_dataset)
  - [Beyond Words](https://huggingface.co/datasets/biglam/loc_beyond_words)
  - [DEArt: Dataset of European Art](https://huggingface.co/datasets/biglam/european_art)

</details>


<details>
  <summary>fill-mask</summary>
    
  - [Berlin State Library OCR](https://huggingface.co/datasets/biglam/berlin_state_library_ocr)
  - [BnL Newspapers 1841-1879](https://huggingface.co/datasets/biglam/bnl_newspapers1841-1879)

</details>


<details>
  <summary>text-to-image</summary>
    
  - [Old Book Illustrations](https://huggingface.co/datasets/biglam/oldbookillustrations)

</details>


<details>
  <summary>image-to-image</summary>
    
  - [Old Book Illustrations](https://huggingface.co/datasets/biglam/oldbookillustrations)

</details>


<details>
  <summary>token-classification</summary>
    
  - [Unsilencing Colonial Archives via Automated Entity Recognition](https://huggingface.co/datasets/biglam/unsilence_voc)

</details>

We also write the rendered template to a file so that we can upload it in the next step.

with open('/tmp/README.md', 'w') as f:
    f.write(template.render(context))

Updating the README on the Hugging Face Hub

This looks pretty good! It would be nice to also update the org README without having to manually edit the file. The huggingface_hub library helps us out here once again. Since the organization README is actually a special type of Hugging Face Space, we can interact with it in the same way we would with a model or dataset repository.

from huggingface_hub import HfApi, notebook_login

We'll create an HfApi instance.

api = HfApi()

Since we're planning to write to a repo, we'll need to log in to the Hub.

notebook_login()

We can now upload the rendered README file we created above to our biglam/README space.

api.upload_file(
    path_or_fileobj="/tmp/README.md",
    path_in_repo="README.md",
    repo_id="biglam/README",
    repo_type="space",
)
'https://huggingface.co/spaces/biglam/README/blob/main/README.md'

If we look at our updated README, we'll see we now have some nice collapsible sections for each task type containing the datasets for that task.

Next steps: whilst this is already quite useful, at the moment we still have to run this code whenever we want to regenerate our README. Webhooks make it possible to fully automate this by creating a webhook that monitors changes to repos under the BigLAM org and rebuilds the README whenever something changes. I'd love to hear from anyone who tries this out!
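
To make the idea concrete, here is a minimal sketch of a webhook receiver using FastAPI (not used elsewhere in this post). It assumes the steps above are wrapped in a hypothetical build_and_upload_readme() function and that a Hub webhook is configured to POST to this endpoint whenever a biglam repo changes:

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhook")
async def handle_hub_webhook(request: Request):
    payload = await request.json()
    # The exact payload fields depend on how the webhook is configured on the Hub;
    # here we assume it includes a 'repo' object with a 'type' and a 'name'.
    repo = payload.get("repo", {})
    if repo.get("type") == "dataset" and repo.get("name", "").startswith("biglam/"):
        build_and_upload_readme()  # hypothetical: re-runs the steps from this post
    return {"ok": True}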