Using the 🤗 Hub for model storage
How I'm planning to use the huggingface hub for storing flyswot models
- Intro
- How is this currently being done
- What is wrong with this
- 🤗 hub to the rescue?
- The hub vs huggingface_hub library
- Getting information about a model
Intro
Part of what flyswot should take care of is handling machine learning models. The flyswot tool is essentially just a command-line wrapper for the machine learning model. However, these two things, the command-line tool and the model, are kept separate for a number of reasons:
- the flyswot tool might change independently of the model, e.g. some new functionality is added or some bug is fixed
- we might want to update the model based on new training data or a change in the labels used
We want to be able to release a new model without having to create a new release of the flyswot tool and vice-versa. As a result of this, both of these things are versioned separately.
We might also want to keep our model separate because flyswot is made available as a Python package. Since a computer vision model can be pretty large, we probably don't want it to be included as part of the Python package.
How is this currently being done
Currently, models are stored in a GitHub repository separate from the one used to store the flyswot code. flyswot has some functionality for checking this GitHub repository to see if a more recent remote model has superseded a local model. If a more recent model is available (and a CLI flag indicates that the latest model should be used), flyswot downloads the new model and stores it in a new directory.
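The "has a newer remote model superseded the local one?" check described above can be sketched roughly as follows. This is not flyswot's actual code: the function names are made up for illustration, and the sketch assumes every model file starts with a `YYYY-MM-DD` calendar version, as in `2021-09-22.onnx`.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Assumption: model files carry a leading calendar version,
# e.g. "2021-09-22.onnx". Files that don't match are ignored.
DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")


def model_date(filename: str) -> Optional[date]:
    """Return the date encoded at the start of a filename, or None."""
    match = DATE_PREFIX.match(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # not a real calendar date, e.g. "2021-13-40.onnx"
        return None


def newer_remote_model(local: Iterable[str], remote: Iterable[str]) -> Optional[str]:
    """Return the newest remote filename if it is newer than every local model."""
    dated_remote = [(model_date(name), name) for name in remote]
    dated_remote = [(d, name) for d, name in dated_remote if d is not None]
    if not dated_remote:
        return None
    newest_date, newest_name = max(dated_remote)
    local_dates = [d for d in map(model_date, local) if d is not None]
    if local_dates and max(local_dates) >= newest_date:
        return None  # the local copy is already up to date
    return newest_name
```

Even this stripped-down version needs date parsing and edge-case handling, before any of the downloading or GitHub API calls are added on top.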
What is wrong with this
Whilst this approach works okay, a surprising amount of code is needed to take care of it. The option to pass a specific snapshot of a model also doesn't currently exist.
On the storage side, although GitHub is great for storing code, it has some limitations when it comes to storing large files. I've created a GitHub Action that creates a release when a pull request updating the model is made, with date information in the release filename. Again, this works okay, but there might be a better way...
🤗 hub to the rescue?
I have already been using the huggingface hub when using other people's models and uploading fine-tuned transformer models. However, digging around the docs, it seemed like there are a few things in this ecosystem that could be useful for flyswot.
What is the 🤗 hub?
If you haven't come across the 🤗 hub before, it is essentially a place where models can be uploaded and explored by others. So, for example, if I want a language model trained on Arabic text, I might find it in the hub. The goal is to help avoid duplication of effort and allow other people to use or adapt existing work.
This video gives a helpful overview of navigating the hub and finding models you might be interested in using.
Part of the aim of sharing the flyswot models on GitHub (or the 🤗 hub) is to make them available for other people to use. The 🤗 hub supports this use case well. We can easily share models (including large ones) because of the underlying use of git-lfs. However, our interest is not only in sharing a model for others to use but also in making it easier to grab the correct model for the flyswot CLI tool. Some other components might help here.
The hub vs huggingface_hub library
The 🤗 hub already provides a place to store the model. You can interact with this model using only the web interface, but we also want to download models from the hub using our CLI. We already have a way to do this with GitHub, so ideally, we want something that works better than our current approach.
This is where the huggingface_hub Python library might come in. This Python library provides us with various ways of interacting with the hub. This could give us enough ways of interacting with the hub that we can delete some of the code that currently does this with GitHub (and there is nothing nicer than deleting code).
I'll use the remainder of this blog to see if we can use the 🤗 hub and the [huggingface_hub](https://pypi.org/project/huggingface-hub/) library for this purpose as a replacement for the current approach.
We'll start by installing the huggingface_hub library:
!pip install huggingface_hub
from huggingface_hub import model_info
info = model_info("distilbert-base-cased")
info
type(info)
You can see this gives us back a bunch of information about the model. We could for example grab the date the model was changed:
info.lastModified
This already gives us what we need to check whether a model on the hub is newer than a local model already downloaded by the flyswot CLI. However, we might be able to cut out some of this checking work.
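A comparison along those lines might look like the sketch below. It assumes the timestamp arrives either as an ISO 8601 string such as `2021-09-22T10:15:30.000Z` or as a `datetime`; which one you get may depend on the version of the library. `parse_hub_timestamp` and `remote_is_newer` are hypothetical helper names, not part of flyswot or huggingface_hub.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Union


def parse_hub_timestamp(value: Union[str, datetime]) -> datetime:
    """Normalise a hub timestamp to a timezone-aware UTC datetime."""
    if isinstance(value, str):
        # fromisoformat() on older Pythons rejects a trailing "Z"
        value = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if value.tzinfo is None:
        # Assumption: naive timestamps are already in UTC
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def remote_is_newer(last_modified: Union[str, datetime], local_model: Path) -> bool:
    """True if the hub copy changed after the local file was last written."""
    if not local_model.exists():
        return True  # nothing local yet, so the remote model always wins
    local_time = datetime.fromtimestamp(local_model.stat().st_mtime, tz=timezone.utc)
    return parse_hub_timestamp(last_modified) > local_time
```

In use this would be something like `remote_is_newer(info.lastModified, Path("model.onnx"))`, where the local path is just a placeholder.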
Let's see if there are other ways we can do this in the library. Since huggingface_hub requires git-lfs, let's start by installing this.
!apt install git-lfs
We also need to make sure we have git-lfs set up
!git init && git lfs install
We can use hf_hub_url to get the URL for a specific file from a repository
from huggingface_hub import hf_hub_url
onnx_model_url = hf_hub_url("davanstrien/flyswot-test", "2021-09-22.onnx")
onnx_model_url
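For orientation, the URL returned here follows a predictable pattern. The sketch below rebuilds that pattern by hand. It illustrates the shape as I understand it (`https://huggingface.co/<repo>/resolve/<revision>/<filename>`) and is not the library's implementation; `build_resolve_url` is a made-up name. In real code, `hf_hub_url` is the thing to call.

```python
from urllib.parse import quote


def build_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Hand-rolled approximation of the URL shape used for files on the hub."""
    return (
        "https://huggingface.co/"
        f"{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


print(build_resolve_url("davanstrien/flyswot-test", "2021-09-22.onnx"))
```

The `revision` segment is what makes it possible to point at something other than the default branch.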
We can pass this URL to cached_download. This will download the file for us if we don't have the latest version. We can also specify a place to download the file to, which is important so we can make sure we put the file somewhere flyswot can find it.
from huggingface_hub import cached_download
cached_download(onnx_model_url, cache_dir=".")
If we try and download this again it won't download, and will instead return the path to the model
path = cached_download(onnx_model_url, cache_dir=".")
path
This is quite close to what we want, but our current approach requires us to get a bunch of different files in a folder. To replicate this we can instead use snapshot_download
from huggingface_hub import snapshot_download
Let's see what this does
?snapshot_download
This will do something similar to cached_download but for a whole model repository. If we pass our repository, it will download the directory if we don't have the latest version of the files, for example because the model has been updated.
model = snapshot_download("davanstrien/flyswot-test", cache_dir=".")
model
If we look inside this directory we can see we have the files from the repository.
!ls {model}
If we try and download it again we just get back the directory path without having to download the files again.
model = snapshot_download("davanstrien/flyswot-test", cache_dir=".")
model
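Wrapped up for the CLI, this could replace most of the existing download code. Below is a rough sketch with some assumptions: `fetch_model_dir` is a hypothetical helper name, the `downloader` argument exists only so the function can be exercised without network access, and `snapshot_download` is imported lazily so the module still loads where huggingface_hub is not installed.

```python
from pathlib import Path
from typing import Callable, Optional


def fetch_model_dir(
    repo_id: str,
    cache_dir: Path,
    revision: Optional[str] = None,
    downloader: Optional[Callable[..., str]] = None,
) -> Path:
    """Return a local directory holding the files of a hub model repository.

    revision=None asks for the default branch; downloader defaults to
    huggingface_hub.snapshot_download.
    """
    if downloader is None:
        # Imported here so the real dependency is only needed at call time.
        from huggingface_hub import snapshot_download as downloader
    cache_dir.mkdir(parents=True, exist_ok=True)
    model_dir = downloader(repo_id, revision=revision, cache_dir=str(cache_dir))
    return Path(model_dir)
```

Calling `fetch_model_dir("davanstrien/flyswot-test", Path("models"))` twice should, given the caching behaviour shown above, only download the files once.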
This replicates what we currently have set up for flyswot in terms of downloading models. There are a few extra things we might want, though, to make flyswot more flexible. First, though, let's look at how we can upload to the model hub.
from huggingface_hub import Repository
?Repository
I'll use flyswot-test as a way of playing around with this. To start with, we can use Repository to clone the current version of the model.
repo = Repository(local_dir="flyswot-models", clone_from="davanstrien/flyswot-test")
repo
We'll need to be logged in to push changes
from huggingface_hub import notebook_login
notebook_login()
To start with, let's mock making a change to some of the repo files and see how we can upload these changes. We can use the Repository class as a context manager to make changes and have them committed to our model repository. Here we update the vocab file to add a new label.
with Repository(
    local_dir="flyswot-models",
    clone_from="davanstrien/flyswot-test",
    git_user="Daniel van Strien",
).commit("update model"):
    with open("vocab.txt", "a") as f:
        f.write("new label")
This could already be used at the end of our training script. Currently I have some util files that package up the model vocab, convert PyTorch to ONNX, etc. These could easily be adapted to also push the updated model to the hub. There is only one thing we might still want to add.
Adding more metadata: creating revision branches
Currently flyswot uses filenames to capture metadata about the model version. The models are versioned using calendar versioning. This works okay, but we might be able to manage this in a slightly better way. One of the nice features that huggingface_hub (the Python library) offers, and that flyswot currently doesn't support well, is being able to pass in a specific revision when using snapshot_download. This would allow someone to run a specific older version of the model, which might be useful in various scenarios. To do this we'll create a revision branch for the date the model was created. All that we'll do now is pass in a formatted date as the revision.
from datetime import datetime
date_now = datetime.now()
now = date_now.strftime("%Y-%m-%d")
now
from pathlib import Path

with Repository(
    local_dir="flyswot-models",
    clone_from="davanstrien/flyswot-test",
    git_user="Daniel van Strien",
    revision=now,
).commit(f"update model {now}"):
    # glob needs a wildcard to match files ending in .onnx
    for model in Path(".").glob("*.onnx"):
        model.rename(f"{now}-model.onnx")
This creates a new revision branch for the current date. Since I also want the default branch to hold the current model, we would also push the same model to the default branch. We would then end up with a bunch of different branches with model snapshots that could be passed in, while for the default behavior we can easily grab the latest model by not specifying a revision.
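On the CLI side, picking a snapshot then comes down to turning an optional user-supplied date into the `revision` argument. Here is a small sketch of that step, assuming revision branches are always named `YYYY-MM-DD` as above; `revision_from_option` is a hypothetical helper, not part of flyswot or huggingface_hub.

```python
from datetime import datetime
from typing import Optional


def revision_from_option(requested: Optional[str]) -> Optional[str]:
    """Validate a user-supplied model date and return it as a revision name.

    None means "no preference", which maps to the default branch and
    therefore the latest model.
    """
    if requested is None:
        return None
    try:
        parsed = datetime.strptime(requested, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"expected a date like 2021-09-22, got {requested!r}") from None
    if parsed.strftime("%Y-%m-%d") != requested:
        # Reject forms such as "2021-9-22" that would not match a branch name.
        raise ValueError(f"expected a zero-padded date, got {requested!r}")
    return requested
```

The result can be passed straight through, e.g. `snapshot_download("davanstrien/flyswot-test", revision=revision_from_option("2021-09-22"))`. Validating the format up front gives a clearer error than a failed lookup of a branch that was never going to exist.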