Training an object detection model using Hugging Face
- Training a Detr object detection model using Hugging Face transformers and datasets
- What is object detection?
- What is Detr?
- Using Hugging Face for object detection
- Loading the dataset
- Preparing the data
- Creating a Detr model
- Training the Detr model
- Conclusion
Training a Detr object detection model using Hugging Face transformers and datasets
The Hugging Face transformers library has increasingly expanded from its original focus on Natural Language Processing tasks to include more models covering a range of computer vision tasks. This blog post will look at how we can train an object detection model using the Hugging Face transformers and datasets libraries.
What is object detection?
Object detection is the task of predicting the location (usually as bounding boxes) and class of objects contained within an image.
Object detection can be helpful in several applications where you want to know not only whether a thing is in an image but also where it is (and how many instances there are). Various approaches have been developed over the years for this task, often relying on complex hand-crafted features.
As with other areas of computer vision, there has been an increasing adoption of transformer-based solutions to this task. One model using transformers is the Detr architecture.
What is Detr?
Detr (DEtection TRansformer) is a model architecture introduced in the paper End-to-End Object Detection with Transformers. We won't dig into the architecture in massive detail here since this post focuses on the practical use of the model. One thing that is important to note is that Detr still uses a CNN backbone. More recently, other models such as YOLOS use a transformer backbone too. Currently, however, these fully transformer-based approaches still show some performance gap compared to more traditional techniques (because this is deep learning, 'traditional' refers to stuff from last year, of course).
Using Hugging Face for object detection
There are existing examples of using the Hugging Face transformers and datasets libraries with the Trainer class to do image classification. There are also example notebooks showing how to fine-tune a Detr model on custom data. However, I didn't find examples that use the datasets library together with the Trainer class to manage training of an object detection model, which is what this blog post covers.
Why the datasets library?
You may ask why it is helpful to provide an example of using the datasets library for training an object detection model, i.e. why not use PyTorch datasets and dataloaders directly, for which there are already many object detection training examples?
There are a few reasons why using datasets for this can be helpful. A significant one for me is the close integration between the datasets library and the Hugging Face Hub. Loading a dataset from the Hub often involves just two lines of code (including the import).
Quickly loading a dataset and then using the same library to prepare the dataset for training an object detection model removes some friction. This becomes especially helpful when you are iterating on the process of creating training data, training a model, and creating more training data. In this iterative process, the hub can be used for storing models and datasets at each stage. Having a clear provenance of these changes (without relying on additional tools) is also a benefit of this workflow. This is the kind of pipeline hugit is intended to support (in this case, for image classification models).
Scope of this blog post
At the moment, this is mainly intended to give a quick overview of the steps involved. It isn't intended to be a proper tutorial. If I have time later, I may flesh this out (particularly if other projects I'm working on that use object detection progress further).
Enough talk, let's get started. First, we install the required libraries.
%%capture
!pip install datasets transformers timm wandb rich[jupyter]
I'm a big fan of the rich library, so I almost always have this extension loaded.
%load_ext rich
The next couple of lines get us authenticated with the Hugging Face Hub.
!git config --global credential.helper store
from huggingface_hub import notebook_login
notebook_login()
We'll use Weights and Biases for tracking our model training.
import wandb
wandb.login()
%env WANDB_PROJECT=chapbooks
%env WANDB_ENTITY=davanstrien
Loading the dataset
In this blog post, we'll use a dataset that is being added to the Hugging Face Hub as part of the BigLAM hackathon. This dataset has configurations for both object detection and image classification, so we'll need to specify which one we want. Since the dataset doesn't define train/test/valid splits for us, we'll grab the training split. I won't provide a full description of the dataset in this blog post since it is still in the process of being documented. The tl;dr summary is that the dataset includes images of digitized books with bounding boxes for illustrations.
from datasets import load_dataset
dataset = load_dataset(
    "biglam/nls_chapbook_illustrations", "illustration-detection", split="train"
)
Let's take a look at one example from this dataset to get a sense of how the data looks
dataset[0]
You will see we have some metadata for the image, the image itself, and an objects field which contains the annotations themselves. Looking at just one example of the annotations:
{
    "category_id": 0,
    "image_id": "4",
    "id": 1,
    "area": 110901,
    "bbox": [
        34.529998779296875,
        556.8300170898438,
        401.44000244140625,
        276.260009765625,
    ],
    "segmentation": [
        [
            34.529998779296875,
            556.8300170898438,
            435.9700012207031,
            556.8300170898438,
            435.9700012207031,
            833.0900268554688,
            34.529998779296875,
            833.0900268554688,
        ]
    ],
    "iscrowd": False,
}
We see here that we again have some metadata for each image. We also have a category_id and a bbox. Some of these fields should look familiar if you have come across the COCO format before. This will become relevant later, so don't worry if they aren't familiar to you.
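For reference, COCO-style bounding boxes are stored as [x_min, y_min, width, height] in absolute pixel coordinates. We can sanity-check this against the (rounded) values from the example above:

# COCO bounding boxes are [x_min, y_min, width, height] in pixels
x_min, y_min, width, height = 34.53, 556.83, 401.44, 276.26

# the segmentation polygon above runs from x_min to x_min + width
# and from y_min to y_min + height
print(x_min + width, y_min + height)  # ~435.97, ~833.09
print(round(width * height))  # ~110902, close to the stored area of 110901 (the difference is rounding)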
One issue we can run into when training object detection models is stray bounding boxes (i.e. ones that stretch beyond the edge of the image). We can check for and remove these quite easily. This is some ugly code and there is probably a better way, but it's a quick check, so I'll forgive myself.
from tqdm.auto import tqdm

# collect the indexes of examples containing boxes with negative coordinates
remove_idx = []
for idx, row in tqdm(enumerate(dataset)):
    objects_ = row["objects"]
    for ob in objects_:
        bbox = ob["bbox"]
        negative = [box for box in bbox if box < 0]
        if negative:
            remove_idx.append(idx)
len(remove_idx)
keep = [i for i in range(len(dataset)) if i not in remove_idx]
len(keep)
The above code has given us a list of indexes to keep, so we use the select method to grab those.
dataset = dataset.select(keep)
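Note that the quick check above only flags negative coordinates. A fuller check, just as a sketch (assuming the image field holds a PIL image and the boxes are COCO-style [x, y, width, height]), would also catch boxes that extend past the right or bottom edge of the image:

def has_stray_boxes(row):
    # width and height of the PIL image stored in the dataset
    img_width, img_height = row["image"].size
    for ob in row["objects"]:
        x, y, w, h = ob["bbox"]
        # flag negative coordinates or boxes extending beyond the image edges
        if x < 0 or y < 0 or x + w > img_width or y + h > img_height:
            return True
    return False

# we could then drop offending examples with dataset.filter
# dataset = dataset.filter(lambda row: not has_stray_boxes(row))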
We also create a test split. If we were doing this properly, we'd likely want to be a bit more thoughtful about how to do this split.
dataset = dataset.train_test_split(0.1)
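If we wanted the split to be reproducible between runs, we could also pass a fixed seed, e.g.:

# dataset = dataset.train_test_split(0.1, seed=42)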
Preparing the data
This section of the blog post focuses on getting data ready for an object detection model such as Detr via the datasets library. It is, therefore, also the section which differs most from other examples showing how to train these models using PyTorch data loaders.
The Feature Extractor
If you have used Hugging Face for natural language tasks, you are probably familiar with using a Tokenizer_for_blah_model when pre-processing text. Often, if you are using a pre-trained model, you will use AutoTokenizer.from_pretrained, passing in the ID of the model you want to fine-tune. This tokenizer then ensures that the tokenization matches the approach used for the pre-trained model.
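For example, for a text model you might write something like the following (bert-base-uncased is just an illustrative checkpoint here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer("an example sentence")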
The Feature Extractor performs a similar task. Let's look at this more closely. We'll use a pre-trained model for this example and fine-tune it. I also include commented-out code, which shows how you could use the same process with any CNN backbone. This may be useful if you have particular requirements about what backbone to use or if you have a CNN backbone that is already fine-tuned on your domain.
model_checkpoint = "facebook/detr-resnet-50"
from transformers import DetrFeatureExtractor
feature_extractor = DetrFeatureExtractor.from_pretrained(model_checkpoint)
If you wanted to use a different CNN backbone as your starting point you would instead define a config.
# from transformers import DetrConfig
# from transformers import DetrFeatureExtractor
# feature_extractor = DetrFeatureExtractor()
from rich import inspect
inspect(feature_extractor, methods=True, dunder=True)
The output of inspect can be pretty verbose, but I often find it a handy tool for quickly working out a new library or API.
We’ll look at the most critical parts in more detail, but I’ll point out a few things; you’ll see some attributes that will probably sound familiar.
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]
These are the mean and standard deviation used during the original model training. It's essential to replicate these when we're doing inference or fine-tuning, and having them all stored inside the feature_extractor means we don't have to go poking around in papers to try and work out what these values should be.
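Roughly speaking (and ignoring resizing and padding), the feature extractor rescales pixel values to [0, 1] and then normalizes each channel with these values. A rough sketch of that step, assuming an H x W x 3 RGB array:

import numpy as np

def normalize(image_array):
    # rescale from [0, 255] to [0, 1], then apply per-channel normalization
    scaled = image_array.astype(np.float32) / 255.0
    mean = np.array(feature_extractor.image_mean)
    std = np.array(feature_extractor.image_std)
    return (scaled - mean) / std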
Another thing to point out is the push_to_hub method. We can store feature_extractors on the Hub just as we can store models and tokenizers. Manually keeping track of the appropriate pre-processing steps for an image is super annoying. Storing this as we do other model components is much simpler and helps avoid errors resulting from tracking these things by hand.
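For example, pushing it to the Hub is a one-liner (the repository name here is just a placeholder):

# feature_extractor.push_to_hub("your-username/your-detr-model")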
The __call__ method of the DetrFeatureExtractor is what we'll use to prepare our images before passing them into the model. Let's dig into this more closely.
inspect(
    feature_extractor.__call__,
)
Understanding what the __call__ method expects, and how to make sure that is what's delivered by the datasets library, is the key thing I needed to work out. What does it expect?
- images: this can be a list or a single image (stored in different formats)
- annotations: this should be of type Union[List[Dict], List[List[Dict]]]
The images part is not too tricky to understand. We can pass in either a single image or a NumPy array representing an image, or a list of images or NumPy arrays. The annotations part is where Python type annotations don't always do us many favours: we only know we're expecting a list of dictionaries, but we can safely assume those dictionaries probably need to have a particular format. We can try and see what happens if we pass in an image and a list containing a random dictionary.
import io
import requests
from PIL import Image
im = Image.open(
    io.BytesIO(
        requests.get(
            "https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/cute-cat-photos-1593441022.jpg?crop=1.00xw:0.749xh;0,0.154xh&resize=980:*"
        ).content
    )
)
im
labels = [{"bbox": [0.0, 3, 3, 4]}]
feature_extractor(im, labels)
We can see that this raises a ValueError. We also get some more information here that gives us a clue about where we went wrong. Specifically, we can see that the annotations for a single image should be a Dict, or a List[Dict] if we're using a batch of images. We also see that we should pass in this data in the COCO format. Since our data is already in this format, we should be able to pass in an example.
image = dataset["train"][0]["image"]
image
annotations = dataset["train"][0]["objects"]
annotations
feature_extractor(images=image, annotations=annotations, return_tensors="pt")
Oh no! It still doesn't work. At this point, we probably want to dig into the source code to work out what we should be passing to the feature_extractor. The relevant function is prepare_coco_detection.
We also have another tutorial notebook to consult. In this tutorial, we see that the annotations are stored in a dictionary target with the keys image_id and annotations.
target = {'image_id': image_id, 'annotations': target}
encoding = self.feature_extractor(images=img, annotations=target, return_tensors="pt")
With a bit more wrangling let's see if this works.
target = {"image_id": 4, "annotations": annotations}
feature_extractor(images=image, annotations=target, return_tensors="pt")
This is looking more like it! Now that we have one example working, we can translate this into a function that can prepare a batch in the same format. Since we get a batch at a time, we need to refactor things slightly. In this example, I've just grabbed the relevant lists for the images, image_ids and annotations. We then use a list comprehension to store these in the dictionary format expected by the feature_extractor.
def transform(example_batch):
    images = example_batch["image"]
    ids_ = example_batch["image_id"]
    objects = example_batch["objects"]
    # build one {"image_id": ..., "annotations": ...} target dict per image
    targets = [
        {"image_id": id_, "annotations": object_}
        for id_, object_ in zip(ids_, objects)
    ]
    return feature_extractor(images=images, annotations=targets, return_tensors="pt")
We could apply this to our data using map, but it often makes more sense to apply it on the fly using the with_transform method.
dataset["train"] = dataset["train"].with_transform(transform)
Let's take a look at an example
dataset["train"][0]
The next thing we need to take care of is a collate function. 'Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.' (source)
def collate_fn(batch):
    # pad images in the batch to the same size and create the matching pixel mask
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = feature_extractor.pad_and_create_pixel_mask(
        pixel_values, return_tensors="pt"
    )
    labels = [item["labels"] for item in batch]
    batch = {}  # collated batch
    batch["pixel_values"] = encoding["pixel_values"]
    batch["pixel_mask"] = encoding["pixel_mask"]
    batch["labels"] = labels
    return batch
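As a quick sanity check (just a sketch), we can collate a couple of transformed examples and look at what comes back: the padded pixel_values and pixel_mask tensors plus a list of label dicts.

examples = [dataset["train"][i] for i in range(2)]
batch = collate_fn(examples)
batch["pixel_values"].shape, batch["pixel_mask"].shape, len(batch["labels"])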
Avoiding ambiguous labels
We're almost at the point where we can start training the model. We just need to do a little housekeeping to make sure our model knows what our encoded labels are. It's super annoying when you are trying out a model on the Hugging Face Hub and you get back labels like 0 or 3 with no clue what these refer to. We can avoid this by telling our model what labels we have. This mapping will then be bundled with the model when we push it to the Hub.
id2label = dict(enumerate(dataset["train"].features["objects"][0]["category_id"].names))
label2id = {v: k for k, v in id2label.items()}
label2id
Now we can create the DetrForObjectDetection model. This should all look familiar if you've used transformers for other tasks.
from transformers import DetrForObjectDetection
model = DetrForObjectDetection.from_pretrained(
    model_checkpoint,
    num_labels=1,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # swap the pre-trained classification head for one sized to our labels
)
If you wanted to use another backbone you could do something like:
# from transformers import DetrConfig, DetrForObjectDetection
# config = DetrConfig(backbone="regnetz_e8", id2label=id2label, label2id=label2id)
# model = DetrForObjectDetection(config)
We now specify our TrainingArguments
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="detr-resnet-50_fine_tuned_nls_chapbooks",
    per_device_train_batch_size=8,
    num_train_epochs=10,
    fp16=False,
    save_steps=200,
    logging_steps=50,
    learning_rate=1e-4,
    save_total_limit=2,
    remove_unused_columns=False,  # keep the image/objects columns so our transform can see them
    push_to_hub=True,
    hub_model_id="davanstrien/detr-resnet-50_fine_tuned_nls_chapbooks",
)
and create our Trainer
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=dataset["train"],
    tokenizer=feature_extractor,  # ensures the feature extractor is saved and pushed alongside the model
)
trainer.train()
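Once training has finished, because we set push_to_hub=True in the TrainingArguments (and passed the feature_extractor as the tokenizer), we can push the final model, feature extractor and an auto-generated model card to the Hub:

trainer.push_to_hub()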