Intro

Part of what flyswot should take care of is handling machine learning models. The flyswot tool is essentially just a command-line wrapper for the machine learning model. However, these two things, the command-line tool and the model, are kept separate for a number of reasons:

  • the flyswot tool might change independently of the model, e.g. when new functionality is added or a bug is fixed
  • we might want to update the model based on new training data or a change in the labels used

We want to be able to release a new model without having to create a new release of the flyswot tool, and vice versa. As a result, the two are versioned separately.

Another reason to keep the model separate is that flyswot is made available as a Python package. Since a computer vision model can be pretty large, we probably don't want to include it as part of the Python package.

How is this currently being done

Currently, models are stored in a GitHub repository separate from the one used to store the flyswot code. flyswot has some functionality for checking this GitHub repository to see if a more recent remote model has superseded a local model. If there is a more recent model available (and a CLI flag indicates that the latest model should be used), then flyswot downloads the new model and stores it in a new directory.
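The decision described above can be sketched roughly as follows. This is a minimal illustration and not the actual flyswot code: the function name `should_download`, its parameters, and the rule that a missing local model always triggers a download are all assumptions made for the example.

```python
from datetime import date
from typing import Optional


def should_download(
    local_version: Optional[date],
    remote_version: date,
    use_latest: bool,
) -> bool:
    """Decide whether the remote model should be fetched.

    Hypothetical helper for illustration, not taken from flyswot.
    """
    if local_version is None:
        # Assumption: with no local model at all we always need to download one.
        return True
    # Only replace an existing model when the user asked for the latest one
    # and the remote model is strictly newer than the local copy.
    return use_latest and remote_version > local_version


print(should_download(date(2021, 9, 22), date(2021, 12, 30), use_latest=True))
```

Even this small rule needs version parsing, directory handling and download code around it, which is where most of the existing code goes.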

What is wrong with this

Whilst this approach works okay, a surprising amount of code is needed to take care of it. There is also currently no option to pass in a specific snapshot of a model.

On the storage side, although GitHub is great for storing code, it has some limitations for storing large files. I've created a GitHub action that creates a release whenever a pull request updating the model is made. The release includes date information in the filename. Again, this works okay, but there might be a better way...

🤗 hub to the rescue?

I have already been using the huggingface hub when using other people's models and uploading fine-tuned transformer models. However, digging around the docs, it seemed like there are a few things in this ecosystem that could be useful for flyswot.

What is the 🤗 hub?

If you haven't come across the 🤗 hub before, it is essentially a place where models can be uploaded and explored by others. So, for example, if I want a language model trained on Arabic text, I might find it in the hub. The goal is to help avoid duplication of effort and allow other people to use or adapt existing work.

This video gives a helpful overview of navigating the hub and finding models you might be interested in using.

Part of the aim of sharing the flyswot models on GitHub (or the 🤗 hub) is to make them available for other people to use. The 🤗 hub supports this use case well. We can easily share models (including large ones) because the hub is built on git-lfs. However, our interest is not only in sharing a model for others to use but also in making it easier to grab the correct model for the flyswot CLI tool. Some other components might help here.

The hub vs huggingface_hub library

The 🤗 hub already provides a place to store the model. You can interact with a model through the web interface alone, but we also want our CLI to be able to download models from the hub. We already have a way to do this with GitHub, so ideally we want something that works better than our current approach.

This is where the huggingface_hub Python library might come in. This library provides us with various ways of interacting with the hub, possibly enough that we can delete some of the code that currently does this with GitHub (and there is nothing nicer than deleting code 😃).

I'll use the remainder of this blog to see if we can use the 🤗 hub and the [huggingface_hub](https://pypi.org/project/huggingface-hub/) library for this purpose as a replacement for the current approach.

We'll start by installing the huggingface_hub library

!pip install huggingface_hub

Getting information about a model

One of the things we need to be able to do is get the latest version of the model. One way we could try and do this is by grabbing metadata about the model. This is the current approach taken by flyswot. We can import model_info to do this:

from huggingface_hub import model_info
info = model_info("distilbert-base-cased")
info
ModelInfo: {
	modelId: distilbert-base-cased
	sha: 935ac13b473164bb9d578640e33d9f21144c365e
	lastModified: 2020-12-11T21:23:53.000Z
	tags: ['pytorch', 'tf', 'distilbert', 'en', 'dataset:bookcorpus', 'dataset:wikipedia', 'arxiv:1910.01108', 'transformers', 'license:apache-2.0', 'infinity_compatible']
	pipeline_tag: None
	siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='tf_model.h5'), ModelFile(rfilename='tokenizer.json'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='vocab.txt')]
	config: {'model_type': 'distilbert'}
	id: distilbert-base-cased
	private: False
	downloads: 3556770
	library_name: transformers
	mask_token: [MASK]
	likes: 4
	model-index: None
	cardData: {'language': 'en', 'license': 'apache-2.0', 'datasets': ['bookcorpus', 'wikipedia']}
}
type(info)
huggingface_hub.hf_api.ModelInfo

You can see this gives us back a bunch of information about the model. We could, for example, grab the date the model was last changed:

info.lastModified
'2020-12-11T21:23:53.000Z'

This already gives us what we need to check whether a model on the hub is newer than a local model already downloaded by the flyswot CLI. However, we might be able to cut out some of this checking work.
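As a rough sketch of what that check could look like, the timestamp string can be parsed and compared with the time the local copy was fetched. The helper names and the `local_downloaded_at` value below are made up for illustration, and I'm assuming the timestamp always has the format shown in the output above.

```python
from datetime import datetime, timezone


def parse_last_modified(value: str) -> datetime:
    """Parse a timestamp such as '2020-12-11T21:23:53.000Z' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing 'Z' means UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


def remote_is_newer(last_modified: str, local_downloaded_at: datetime) -> bool:
    """Return True if the hub model changed after our local copy was fetched."""
    return parse_last_modified(last_modified) > local_downloaded_at


# Hypothetical moment at which the local model was downloaded.
local_downloaded_at = datetime(2021, 9, 22, tzinfo=timezone.utc)
print(remote_is_newer("2020-12-11T21:23:53.000Z", local_downloaded_at))  # False
```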

Let's see if there are other ways we can do this in the library. Since parts of huggingface_hub rely on git-lfs, let's start by installing it.

!apt install git-lfs
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.

We also need to make sure we have git-lfs set up

!git init && git lfs install
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /Users/dvanstrien/Documents/daniel/blog/_notebooks/.git/
Updated git hooks.
Git LFS initialized.

Downloading files from the hub

We can use hf_hub_url to get the url for a specific file from a repository

from huggingface_hub import hf_hub_url
onnx_model_url = hf_hub_url("davanstrien/flyswot-test", "2021-09-22.onnx")
onnx_model_url
'https://huggingface.co/davanstrien/flyswot-test/resolve/main/2021-09-22.onnx'
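The URL above follows a visible pattern: repository id, then `resolve`, then a revision (here `main`), then the filename. The sketch below simply mirrors that pattern to make it explicit. `build_resolve_url` is a hypothetical helper written for this post, not part of huggingface_hub, and the real `hf_hub_url` should be preferred in practice (as far as I know it also accepts a `revision` argument).

```python
def build_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Mirror the URL pattern seen in the output above (hypothetical helper)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


# Default branch, matching the URL returned above.
print(build_resolve_url("davanstrien/flyswot-test", "2021-09-22.onnx"))
# The same file addressed through a different, purely illustrative, revision.
print(build_resolve_url("davanstrien/flyswot-test", "2021-09-22.onnx", revision="2021-12-30"))
```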

We can pass this url to cached_download, which will download the file for us if we don't have the latest version. We can also specify where to download the file. This is important because it lets us put the file somewhere flyswot can find it.

from huggingface_hub import cached_download
cached_download(onnx_model_url, cache_dir=".")
'./de9d2ce054e3e410e3fc61b5c2ad55da7861d30e3b90aa018615b7d902e6e51e.1300a5792e44de2c59f4d25c4f7efd447ef91d69971a121e6b4df8b95047ad7c'

If we try and download this again it won't download, and will instead return the path to the model

path = cached_download(onnx_model_url, cache_dir=".")
path
'./de9d2ce054e3e410e3fc61b5c2ad55da7861d30e3b90aa018615b7d902e6e51e.1300a5792e44de2c59f4d25c4f7efd447ef91d69971a121e6b4df8b95047ad7c'
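To make the "only download once" idea concrete, here is a heavily simplified sketch of a URL-keyed cache. It is not how huggingface_hub implements `cached_download`: the real function also checks whether the remote file has changed, which this sketch does not. `fetch_once`, `fake_fetch` and the example URL are all made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def fetch_once(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Derive a stable filename from the URL so repeat calls map to the same file.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        target.write_bytes(fetch(url))
    return target


URL = "https://example.com/model.onnx"  # placeholder URL
calls = []


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real download so the example runs offline."""
    calls.append(url)
    return b"model-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_once(URL, Path(tmp), fake_fetch)
    second = fetch_once(URL, Path(tmp), fake_fetch)
    same_path = first == second

print(same_path, len(calls))
```

The second call returns the same path without invoking the download function again, which is the behavior we saw from cached_download above.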

Downloading all files from the hub

This is quite close to what we want, but our current approach requires us to get several different files in a folder. To replicate this, we can use snapshot_download instead

from huggingface_hub import snapshot_download

Let's see what this does

?snapshot_download
Signature:
snapshot_download(
    repo_id: str,
    revision: Union[str, NoneType] = None,
    cache_dir: Union[str, pathlib.Path, NoneType] = None,
    library_name: Union[str, NoneType] = None,
    library_version: Union[str, NoneType] = None,
    user_agent: Union[Dict, str, NoneType] = None,
    proxies=None,
    etag_timeout=10,
    resume_download=False,
    use_auth_token: Union[bool, str, NoneType] = None,
    local_files_only=False,
) -> str
Docstring:
Downloads a whole snapshot of a repo's files at the specified revision.
This is useful when you want all files from a repo, because you don't know
which ones you will need a priori.
All files are nested inside a folder in order to keep their actual filename
relative to that folder.

An alternative would be to just clone a repo but this would require that
the user always has git and git-lfs installed, and properly configured.

Note: at some point maybe this format of storage should actually replace
the flat storage structure we've used so far (initially from allennlp
if I remember correctly).

Return:
    Local folder path (string) of repo snapshot
File:      ~/miniconda3/envs/blog/lib/python3.8/site-packages/huggingface_hub/snapshot_download.py
Type:      function

This does something similar to cached_download, but for a whole model repository. If we pass in our repository, it will download the directory only if we don't already have the latest version of the files, for example because the model has been updated.

model = snapshot_download("davanstrien/flyswot-test", cache_dir=".")
model
'./davanstrien__flyswot-test.main.e54a7421f5e5eb240783452ab734288f252bb402'

If we look inside this directory we can see we have the files from the repository.

!ls {model}
2021-09-22.onnx README.md       modelcard.md    vocab.txt

If we try and download it again we just get back the directory path without having to download the files again.

model = snapshot_download("davanstrien/flyswot-test", cache_dir=".")
model
'./davanstrien__flyswot-test.main.e54a7421f5e5eb240783452ab734288f252bb402'
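The folder name returned above appears to encode the repository id, the revision and a commit hash, separated by dots. If flyswot wanted to report which model version it is running, it could pull those parts out of the path. The sketch below assumes that naming scheme, which I've inferred only from the output above and which may differ between library versions. It also assumes the revision itself contains no dots. `parse_snapshot_dir` is a hypothetical helper.

```python
from pathlib import Path
from typing import NamedTuple


class SnapshotName(NamedTuple):
    repo_id: str
    revision: str
    commit: str


def parse_snapshot_dir(path: str) -> SnapshotName:
    """Split a folder name like 'user__repo.main.<sha>' into its parts."""
    name = Path(path).name
    # Split from the right so dots inside the repository name are preserved.
    repo_part, revision, commit = name.rsplit(".", 2)
    # The folder name uses '__' where the repository id has '/'.
    return SnapshotName(repo_part.replace("__", "/"), revision, commit)


print(parse_snapshot_dir(
    "./davanstrien__flyswot-test.main.e54a7421f5e5eb240783452ab734288f252bb402"
))
```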

This replicates what we currently have set up for flyswot in terms of downloading models. There are a few extra things we might want, though, to make flyswot more flexible. First, let's look at how we can upload to the model hub.

Uploading to the hub

At the moment flyswot models are uploaded to a GitHub repository which then creates a release. It would be nice to simplify this and upload directly at the end of model training. For this we can use the Repository class.

from huggingface_hub import Repository
?Repository
Init signature:
Repository(
    local_dir: str,
    clone_from: Union[str, NoneType] = None,
    repo_type: Union[str, NoneType] = None,
    use_auth_token: Union[bool, str] = True,
    git_user: Union[str, NoneType] = None,
    git_email: Union[str, NoneType] = None,
    revision: Union[str, NoneType] = None,
    private: bool = False,
    skip_lfs_files: bool = False,
)
Docstring:     
Helper class to wrap the git and git-lfs commands.

The aim is to facilitate interacting with huggingface.co hosted model or dataset repos,
though not a lot here (if any) is actually specific to huggingface.co.
Init docstring:
Instantiate a local clone of a git repo.

If specifying a `clone_from`:
will clone an existing remote repository, for instance one
that was previously created using ``HfApi().create_repo(name=repo_name)``.
``Repository`` uses the local git credentials by default, but if required, the ``huggingface_token``
as well as the git ``user`` and the ``email`` can be explicitly specified.
If `clone_from` is used, and the repository is being instantiated into a non-empty directory,
e.g. a directory with your trained model files, it will automatically merge them.

Args:
    local_dir (``str``):
        path (e.g. ``'my_trained_model/'``) to the local directory, where the ``Repository`` will be initalized.
    clone_from (``str``, `optional`):
        repository url (e.g. ``'https://huggingface.co/philschmid/playground-tests'``).
    repo_type (``str``, `optional`):
        To set when creating a repo: et to "dataset" or "space" if creating a dataset or space, default is model.
    use_auth_token (``str`` or ``bool``, `optional`, defaults to ``True``):
        huggingface_token can be extract from ``HfApi().login(username, password)`` and is used to authenticate against the hub
        (useful from Google Colab for instance).
    git_user (``str``, `optional`):
        will override the ``git config user.name`` for committing and pushing files to the hub.
    git_email (``str``, `optional`):
        will override the ``git config user.email`` for committing and pushing files to the hub.
    revision (``str``, `optional`):
        Revision to checkout after initializing the repository. If the revision doesn't exist, a
        branch will be created with that revision name from the default branch's current HEAD.
    private (``bool``, `optional`, defaults to ``False``):
        whether the repository is private or not.
    skip_lfs_files (``bool``, `optional`, defaults to ``False``):
        whether to skip git-LFS files or not.
File:           ~/miniconda3/envs/blog/lib/python3.8/site-packages/huggingface_hub/repository.py
Type:           type
Subclasses:     

I'll use flyswot-test as a way of playing around with this. To start with we can use Repository to clone the current version of the model.

repo = Repository(local_dir="flyswot-models", clone_from="davanstrien/flyswot-test")
Cloning https://huggingface.co/davanstrien/flyswot-test into local empty directory.
repo
<huggingface_hub.repository.Repository at 0x7fb10cccbc10>

We'll need to be logged in to push changes

from huggingface_hub import notebook_login

notebook_login()

To start with, let's mock making a change to some of the repo files and see how we can upload these changes. We can use the Repository class as a context manager to make changes and have them committed to our model repository. Here we update the vocab file to add a new label.

with Repository(
    local_dir="flyswot-models",
    clone_from="davanstrien/flyswot-test",
    git_user="Daniel van Strien",
).commit("update model"):
    with open("vocab.txt", "a") as f:
        f.write("new label")
/Users/dvanstrien/Documents/daniel/blog/_notebooks/flyswot-models is already a clone of https://huggingface.co/davanstrien/flyswot-test. Make sure you pull the latest changes with `repo.git_pull()`.
Pulling changes ...
To https://huggingface.co/davanstrien/flyswot-test
   e54a742..18d149e  main -> main

This could already be used at the end of our training script. Currently I have some util files that package up the model vocab, convert PyTorch to ONNX, etc. These could easily be adapted to also push the updated model to the hub. There is only one thing we might still want to add.

Adding more metadata: creating revision branches

Currently flyswot uses filenames to capture metadata about the model version. The models are versioned using calendar versioning. This works okay, but we might be able to manage this in a slightly better way. One nice feature that huggingface_hub (the Python library) offers, and that flyswot currently doesn't support well, is the ability to pass a specific revision to snapshot_download. This would allow someone to run a specific older version of the model, which might be useful in various scenarios. To do this, we'll create a revision branch named after the date the model was created. All we need to do is pass in a formatted date as the revision.

from datetime import datetime
date_now = datetime.now()
now = date_now.strftime("%Y-%m-%d")
now
'2021-12-30'
from pathlib import Path

with Repository(
    local_dir="flyswot-models",
    clone_from="davanstrien/flyswot-test",
    git_user="Daniel van Strien",
    revision=now,
).commit(f"update model {now}"):
    # Rename the ONNX model file in the repository so its name includes the date
    for model in Path(".").glob("*.onnx"):
        model.rename(f"{now}-model.onnx")
/Users/dvanstrien/Documents/daniel/blog/_notebooks/flyswot-models is already a clone of https://huggingface.co/davanstrien/flyswot-test. Make sure you pull the latest changes with `repo.git_pull()`.
Checked out 2021-12-30 from 2021-12-30.
Your branch is up to date with 'origin/2021-12-30'.

Pulling changes ...
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.
Everything up-to-date

This creates a new revision branch for the current date. Since I also want the default branch to hold the current model, we would also push the same model to the default branch. We then end up with a set of dated branches, each holding a model snapshot that can be passed in as a revision, while the default behavior of not specifying a revision still grabs the latest model.
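On the flyswot side, choosing between "latest" and a specific dated snapshot could then come down to a small amount of validation before calling snapshot_download. The sketch below is one way this might look. `resolve_revision` and the list of branch names are hypothetical, and how the available branches would be listed is left out.

```python
from datetime import datetime
from typing import Iterable, Optional


def resolve_revision(requested: Optional[str], available: Iterable[str]) -> Optional[str]:
    """Return the revision to pass on, or None to use the default branch."""
    if requested is None:
        # No revision requested: fall back to the default branch (latest model).
        return None
    # Revisions are expected to be calendar versions such as '2021-12-30';
    # strptime raises ValueError for anything else.
    datetime.strptime(requested, "%Y-%m-%d")
    if requested not in set(available):
        raise ValueError(f"No model snapshot found for revision {requested!r}")
    return requested


branches = ["main", "2021-09-22", "2021-12-30"]  # illustrative branch names
print(resolve_revision(None, branches))
print(resolve_revision("2021-12-30", branches))
```

The returned value could be passed straight through as the `revision` argument of snapshot_download, since that argument defaults to None.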