Intro

I sometimes feel like the process of me writing working code is analogous to the thing about monkeys writing Shakespeare, although in my case the end result is probably more like a Donald Trump tweet. My process for debugging, or for working out how to implement something I'm not completely familiar with, is not always efficient. This is something I'm keen to work on.

This blog post is an attempt to do this 'in public'. I've used this post (which is also a notebook) as a place to record the steps I took while trying to implement a new callback in fastai. In particular, I try to record my thought process along the way. This probably makes for an unreadable mess but I thought it might be useful as a record for myself.

This blog post isn't intended to be a guide on how to best solve a problem. It's more a cry for help 😂

If anyone does read this, I would be really grateful to get any suggestions on:

  • whether you have suggestions for tackling a problem like this
  • if you think the 'come back to it later' approach is sensible or not

A callback to generate a rule of thumb about how much more data will help

I've recently been doing a lot more annotation of data for various computer vision projects. Since I'm doing all of the annotating myself I often want to have some sense of whether it is worth trying to get some more data.

Prodigy has a 'recipe' for doing something that tries to answer this question. train_curve trains:

a model component with different portions of the training examples and print the accuracy figures and accuracy improvements with more data. This recipe takes pretty much the same arguments as train. --n-samples sets the number of sample models to train at different stages. For instance, 10 will train models for 10% of the examples, 20%, 30% and so on. This recipe is useful to determine the quality of the collected annotations, and whether more training examples will improve the accuracy. As a rule of thumb, if accuracy improves within the last 25%, training with more examples will likely result in better accuracy. (Source: the Prodigy documentation.)

There is also a nice example of this in action in this YouTube video.
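To make the rule of thumb concrete, here is a rough sketch in plain Python. The function names (`sample_fractions`, `more_data_likely_helps`) and the accuracy figures are made up for illustration; this is my reading of the rule, not Prodigy's actual implementation.

```python
def sample_fractions(n_samples):
    """Cumulative fractions of the data to train on, e.g. 4 -> 25%, 50%, 75%, 100%."""
    return [(i + 1) / n_samples for i in range(n_samples)]


def more_data_likely_helps(fractions, accuracies, last=0.25):
    """Rule of thumb: did accuracy still improve within the last `last` of the data?"""
    cutoff = 1.0 - last
    baseline = None
    for frac, acc in zip(fractions, accuracies):
        # keep the accuracy of the largest fraction that is still outside the last chunk
        if frac <= cutoff + 1e-9:
            baseline = acc
    return baseline is not None and accuracies[-1] > baseline


fractions = sample_fractions(4)
still_improving = more_data_likely_helps(fractions, [0.50, 0.61, 0.66, 0.70])  # invented numbers
plateaued = more_data_likely_helps(fractions, [0.50, 0.61, 0.66, 0.66])        # invented numbers
```

With the invented figures above, the first run is still improving between 75% and 100% of the data, so collecting more annotations looks worthwhile; the second has flattened out.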

I wanted to implement something similar in fastai. My stretch goal was to make something that would be useful enough and be implemented cleanly enough that I could make a pull request to fastai proposing this. I am not sure if this will actually get there but I wanted to have this in mind as an end goal so that I:

  • force myself to not take hacky shortcuts
  • force myself to be exposed to more of the inner functionality of fastai
  • learn about the callbacks system in a lot more detail

Regarding the second point, I am particularly keen to do this 'inside' fastai rather than 'wrapping' something around fastai. If you are familiar with callbacks already this distinction will probably be clear already. If not, hopefully it will make sense later.

Part 1...

This part goes through most of the steps I took. Some of the notebook has been tidied so that this post isn't even longer than it already is...

from fastai.vision.all import *
from fastai.callback import *
from fastai.test_utils import *
import pandas as pd

Dynamically altering the training data size

What I want to be able to do is dynamically alter the training data, i.e. modify the data attached to a learner rather than recreating the dataloaders outside of the training loop. I'll try this with some example data first.

data = untar_data(URLs.IMAGEWANG_160)
dls = ImageDataLoaders.from_folder(data/'train', valid_pct=0.7, item_tfms=[Resize(64)])
len(dls.dataset)

I think train_ds is probably the thing I want to modify

dls.train_ds
(#4401) [(PILImage mode=RGB size=213x160, TensorCategory(8)),(PILImage mode=RGB size=213x160, TensorCategory(16)),(PILImage mode=RGB size=213x160, TensorCategory(6)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=202x160, TensorCategory(14)),(PILImage mode=RGB size=240x160, TensorCategory(18)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=160x457, TensorCategory(17)),(PILImage mode=RGB size=240x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(8))...]
learn = cnn_learner(dls, squeezenet1_0, metrics=accuracy, pretrained=False)
learn.fit(1)
epoch train_loss valid_loss accuracy time
0 3.584931 3.637918 0.131671 01:05

I think it will probably be safer to make a deepcopy of things.

import copy

Note: I’m not quite sure whether this is the best way of handling this, it seems to be very hacky.
d_original = copy.deepcopy(learn.dls.dataset)
d_original
(#4401) [(PILImage mode=RGB size=213x160, TensorCategory(8)),(PILImage mode=RGB size=213x160, TensorCategory(16)),(PILImage mode=RGB size=213x160, TensorCategory(6)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=202x160, TensorCategory(14)),(PILImage mode=RGB size=240x160, TensorCategory(18)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=160x457, TensorCategory(17)),(PILImage mode=RGB size=240x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(8))...]

Grab a smaller part of the dataset

len(dls.dataset) * 0.5
2200.5
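One thing to note is that multiplying the length by a fraction gives a float, while a slice needs an integer. A small sketch of turning a fraction into a usable slice, with a plain list standing in for the dataset (the helper name `subset_size` is hypothetical, not something from fastai):

```python
def subset_size(n_items, fraction):
    """Number of items to keep for `fraction` of the data (rounded down, at least one)."""
    return max(1, int(n_items * fraction))


items = list(range(4401))               # stand-in for dls.dataset
n_keep = subset_size(len(items), 0.5)   # 2200.5 rounded down to a whole number
subset = items[:n_keep]
```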

Pass this back in.

learn.dls.dataset = L(dls.dataset[:100])
learn.dls.dataset
(#100) [(PILImage mode=RGB size=213x160, TensorCategory(8)),(PILImage mode=RGB size=213x160, TensorCategory(16)),(PILImage mode=RGB size=213x160, TensorCategory(6)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=202x160, TensorCategory(14)),(PILImage mode=RGB size=240x160, TensorCategory(18)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=160x457, TensorCategory(17)),(PILImage mode=RGB size=240x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(8))...]
learn.fit(5)
epoch train_loss valid_loss accuracy time
0 3.035343 2.480581 0.215037 01:04
1 2.913207 2.479565 0.212115 01:01
2 2.817269 2.409176 0.240066 01:05
3 2.696751 2.442177 0.211044 01:02
4 2.664056 3.162136 0.254967 01:04

Check our data again (I'm paranoid).

learn.dls.dataset
(#100) [(PILImage mode=RGB size=213x160, TensorCategory(8)),(PILImage mode=RGB size=213x160, TensorCategory(16)),(PILImage mode=RGB size=213x160, TensorCategory(6)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=202x160, TensorCategory(14)),(PILImage mode=RGB size=240x160, TensorCategory(18)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=160x457, TensorCategory(17)),(PILImage mode=RGB size=240x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(8))...]

Now try and shove in the original data.

learn.dls.dataset = d_original
learn.dls.dataset
(#4401) [(PILImage mode=RGB size=213x160, TensorCategory(8)),(PILImage mode=RGB size=213x160, TensorCategory(16)),(PILImage mode=RGB size=213x160, TensorCategory(6)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=202x160, TensorCategory(14)),(PILImage mode=RGB size=240x160, TensorCategory(18)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=160x457, TensorCategory(17)),(PILImage mode=RGB size=240x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(8))...]

Check that this still fits

learn.fit(1)
epoch train_loss valid_loss accuracy time
0 2.512241 2.247095 0.284184 01:02
learn.dls.dataset
(#4401) [(PILImage mode=RGB size=213x160, TensorCategory(8)),(PILImage mode=RGB size=213x160, TensorCategory(16)),(PILImage mode=RGB size=213x160, TensorCategory(6)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=202x160, TensorCategory(14)),(PILImage mode=RGB size=240x160, TensorCategory(18)),(PILImage mode=RGB size=213x160, TensorCategory(15)),(PILImage mode=RGB size=160x457, TensorCategory(17)),(PILImage mode=RGB size=240x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(8))...]

This seemed to work over a few attempts, but it still seemed like a bad idea. I asked in the fastai Discord whether anyone could see a problem with this approach.

I kind of felt unsure about how best to approach the data issue so for now I'll move on to playing around with the callbacks.

Warning: this seemed like a bad idea (spoiler alert: it was). I’ll come back to this later...

Using callbacks

Since I'm not completely sure how to approach the dynamic resizing of the data, I'll move on to getting stuck into the callbacks and come back to this issue later.

I'm not sure if moving on when I get stuck is the best approach. I feel like I sometimes end up trying to 'brute force' a solution at some point rather than thinking it through carefully. I feel like coming back to it later sometimes helps but it could also be kicking the can down the road...

How to use callbacks basics

First, when and where can a callback be used? The docs for the callbacks gave me a good starting point. There are two things I need to work out: first, where callbacks can be called in general, and second, where I would want to use them in this example.

The events a callback can hook into can helpfully be found under event

[e for e in dir(event) if not e.startswith('__')]
['after_backward',
 'after_batch',
 'after_cancel_batch',
 'after_cancel_epoch',
 'after_cancel_fit',
 'after_cancel_train',
 'after_cancel_validate',
 'after_create',
 'after_epoch',
 'after_fit',
 'after_loss',
 'after_pred',
 'after_step',
 'after_train',
 'after_validate',
 'before_backward',
 'before_batch',
 'before_epoch',
 'before_fit',
 'before_train',
 'before_validate']
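To get my head around how these event names relate to the training loop, here is a toy sketch in plain Python. The classes `ToyCallback`, `ToyLearner` and `EventLog` are invented for this example and only cover a handful of events; fastai's real loop is more involved, so treat this as a mental model rather than a description of the library.

```python
class ToyCallback:
    """Base class: every event is a no-op unless a subclass overrides it."""
    def before_fit(self): pass
    def before_epoch(self): pass
    def after_batch(self): pass
    def after_epoch(self): pass
    def after_fit(self): pass


class ToyLearner:
    def __init__(self, n_batches, cbs):
        self.n_batches, self.cbs = n_batches, cbs

    def _call(self, event_name):
        # look up the method named after the event on every callback and call it
        for cb in self.cbs:
            getattr(cb, event_name)()

    def fit(self, n_epoch):
        self._call("before_fit")
        for _ in range(n_epoch):
            self._call("before_epoch")
            for _ in range(self.n_batches):
                # a real loop would run the forward/backward pass here
                self._call("after_batch")
            self._call("after_epoch")
        self._call("after_fit")


class EventLog(ToyCallback):
    """Records the events it chooses to override, in the order they fire."""
    def __init__(self): self.events = []
    def before_fit(self): self.events.append("before_fit")
    def after_batch(self): self.events.append("after_batch")
    def after_epoch(self): self.events.append("after_epoch")


log = EventLog()
ToyLearner(n_batches=2, cbs=[log]).fit(1)
print(log.events)
# ['before_fit', 'after_batch', 'after_batch', 'after_epoch']
```

The point of the sketch is just that a callback only needs to define methods for the events it cares about; everything else falls through to the no-op base class.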

Maybe we want to do our setup before fit?

My first thought was that I would likely want to do some stuff before fitting, so I wanted to start here. The docs page also helped me understand the attributes available inside callbacks. To get more familiar I'll try printing out some of these attributes. This will hopefully also confirm that they are what I think they are.

class ShowTrainInfo(Callback):
    def before_fit(self): 
        print(f"Number of Epochs:{self.n_epoch}")

This (hopefully) will print out the number of epochs before the model fit

learn.fit(1, cbs=ShowTrainInfo())
Number of Epochs:1
epoch train_loss valid_loss accuracy time
0 2.478273 2.763863 0.145014 01:04

This seems to work! Try to do something a bit closer to what I want to achieve...

One thing that I need to be able to do is manipulate the number of epochs. I want to basically run the epochs passed by the user multiple times with different subsets of the data. This will simulate running the training loop multiple times inside the training loop.

Maybe it's easier to call after_create so that instead of having to try and capture all of the info about the learner and recreating it we do the following:

after create:

  • multiply the number of epochs by the number of 'trials' (maybe with some added options later e.g. early stopping, terminate on NaN)
  • create first trial using (len(fulldata))/number of trials i.e. for 4 trials get 25% of data
  • pass first trial data to model
  • train model
  • if epoch number % epochs per trial==0:
    • record the information about best result, last result etc.
    • reset the model weights
    • create new updated size of training data i.e. next step for 4 trials = 50%
    • ? not sure how to best record that this is a different trial and record other info -> maybe store in an attribute of the callback?
    • ? not sure what to log/print at the end, will do some baby print statements for now, this is obviously where I want to sink loads of time into choosing emojis to print...
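The arithmetic in the plan above can be sketched in plain Python before touching any callbacks. `trial_schedule` is a hypothetical helper written for this post, not part of fastai; it only works out which trial each epoch belongs to, what fraction of the data that trial should see, and whether the model should be reset once the epoch finishes.

```python
def trial_schedule(n_epoch, n_trials):
    """Yield (epoch, trial, fraction_of_data, reset_after) for the extended run."""
    for epoch in range(n_epoch * n_trials):
        trial = epoch // n_epoch                      # which trial this epoch belongs to
        fraction = (trial + 1) / n_trials             # e.g. 4 trials -> 0.25, 0.5, 0.75, 1.0
        reset_after = (epoch + 1) % n_epoch == 0      # last epoch of a trial
        yield epoch, trial, fraction, reset_after


schedule = list(trial_schedule(n_epoch=3, n_trials=2))
for row in schedule:
    print(row)
```

For 3 epochs and 2 trials this gives six epochs in total, with the reset flag set after epochs 2 and 5 (counting from zero).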

Try out the manipulation of the number of epochs. I think it should be possible to just update this value. I'll do it in before_fit for now.

class ShowTrainInfo(Callback):
    def before_fit(self): 
        print(self.n_epoch)
        self.learn.n_epoch *=4
learn.fit(1, cbs=ShowTrainInfo())
1
100.00% [1/1 01:01<00:00]
epoch train_loss valid_loss accuracy time
0 2.445697 2.376423 0.281457 01:01
100.00% [161/161 00:27<00:00 2.3999]
200.00% [2/1 02:06<-1:58:57]
epoch train_loss valid_loss accuracy time
0 2.445697 2.376423 0.281457 01:01
1 2.399856 2.251869 0.281457 01:04
100.00% [161/161 00:27<00:00 2.3335]
epoch train_loss valid_loss accuracy time
0 2.445697 2.376423 0.281457 01:01
1 2.399856 2.251869 0.281457 01:04
2 2.333455 2.175574 0.308142 01:01
3 2.291610 2.286107 0.277854 01:04

Is this actually training four times? The progress bar makes things a little tricky to read. Let's print out the epochs to check, and get rid of the progress bar for now so it's easier to see what's going on

class ShowTrainInfo(Callback):
    def before_fit(self): 
        print(f"number epochs passed by user:{self.n_epoch}")
        self.learn.n_epoch *=4
    def after_epoch(self):
        print(f"just finished epoch:{self.epoch}")
with learn.no_bar():
    learn.fit(1, cbs=ShowTrainInfo())
number epochs passed by user:1
[0, 2.2561960220336914, 2.186687469482422, 0.31038177013397217, '01:02']
just finished epoch:0
[1, 2.263982057571411, 3.9907848834991455, 0.2859368920326233, '01:03']
just finished epoch:1
[2, 2.234278917312622, 2.299966812133789, 0.3186599016189575, '01:04']
just finished epoch:2
[3, 2.1652183532714844, 2.047651529312134, 0.36112192273139954, '01:26']
just finished epoch:3

That seems to be okay, we have 4 total epochs. I'll get back to the logging issue later...

Let's try the idea of breaking at each point in the trial, i.e. in this case after each epoch. Maybe I should also add an init to the callback and give it a more sensible name. I will also now add something that will 'reset' the model after each trial, i.e. after the original number of epochs passed by the user. For now, I'll just add some print statements to see where things are being called. I'm hoping this will let me confirm I'm planning to execute callbacks in the right place.

class DSetSizeTrials(Callback):
    def __init__(self, n_trials=4): self.n_trials = n_trials
    def before_fit(self):
        self.epoch_per_trial = self.n_epoch
        print(f"number epochs passed by user:{self.n_epoch}")
        self.learn.n_epoch *=self.n_trials
    def after_epoch(self):
        print(f"just finished epoch:{self.epoch}")
        if (self.epoch+1) % self.epoch_per_trial==0:
            print('reset model')
with learn.no_bar():
    learn.fit(3, cbs=DSetSizeTrials(2))
number epochs passed by user:3
[0, 2.108391761779785, 3.002286672592163, 0.30103233456611633, '01:22']
just finished epoch:0
[1, 2.0375003814697266, 2.582606792449951, 0.2694779932498932, '01:09']
just finished epoch:1
[2, 2.0219476222991943, 2.2244386672973633, 0.3190494775772095, '01:15']
just finished epoch:2
reset model
[3, 1.9379000663757324, 2.5143325328826904, 0.37748345732688904, '01:04']
just finished epoch:3
[4, 1.939262866973877, 2.207705020904541, 0.33180755376815796, '01:07']
just finished epoch:4
[5, 1.894189715385437, 10.37093448638916, 0.3005453944206238, '01:04']
just finished epoch:5
reset model

I'm not limited by my own brain...

Getting back to the issue of whether passing in a new subset of data by doing learn.dls.dataset = L(dls.dataset[:100]) was a bad idea: it turns out it was. Although this worked without issue for the first dataset I worked with, when I tried it with data loaded via a DataFrame I got errors even at the indexing stage. Since I had asked the question in the Discord I thought it was worth responding:

My initial response wasn't super helpful in hindsight. Saying something is a bad idea isn't very useful for anyone else. This is often super obvious when you see someone else do it but I think it's easy to forget. One of the things I love about fastai is the community around it, which is super focused on helping people out. I'm glad Zach followed up here since this exchange also ended up being super useful. My follow-up reply:

For context, what I am trying to work out is how to dynamically change my training data size during training, i.e. first use 25%, then 50% of the training data. I thought this might be possible by indexing the dls.dataset attribute and updating it to a slice of that. This worked out when the original dataloaders was defined via a from_folder. When I just tried the same thing using a dls originally created via a DataFrame I get a FileNotFoundError. I'm still trying to wrap my head around exactly why one works but not the other, but I'm assuming that the dataset attribute doesn't contain all the information that is needed to get to an item in all cases?

Tip: Say both what you are trying to do and why you are trying to do something. This gives a much better insight for people who might want to help.

Although my response doesn't say exactly what is going wrong, it offers my best guess and also gives some insight into my motivations. That matters in this case, since what I was doing seems like a slightly weird thing without the wider context/motivations outlined above.

This led to a super helpful exchange with a few people in the Discord which homed in on a few possible solutions to this problem.

I don't want to reproduce the whole exchange but what was super nice is that:

  • a bunch of people tried to help. Even if nothing else it gives you warm fuzzies that people are happy to help you with this kind of thing
  • people offered a bunch of different approaches all of which offered potential things for me to follow up
  • this fast tracked my progress, particularly since it might have got super annoying to be stuck on this problem on my own for ages.

Tip: other people are happy to help but you should try and make it easy for them to help you (and I should reciprocate where I can)

Following this input from other people I followed up on shuffle_fn as a way to achieve what I'm trying to do (inside of a callback). Again, I'll try modifying this outside of a callback to see how it works

dls = ImageDataLoaders.from_folder(data/'train', valid_pct=0.3, item_tfms=[Resize(64)])
dls.train_ds
(#10269) [(PILImage mode=RGB size=239x160, TensorCategory(19)),(PILImage mode=RGB size=160x240, TensorCategory(19)),(PILImage mode=RGB size=160x357, TensorCategory(17)),(PILImage mode=RGB size=213x160, TensorCategory(19)),(PILImage mode=RGB size=290x160, TensorCategory(0)),(PILImage mode=RGB size=213x160, TensorCategory(17)),(PILImage mode=RGB size=160x213, TensorCategory(17)),(PILImage mode=RGB size=160x263, TensorCategory(9)),(PILImage mode=RGB size=213x160, TensorCategory(14)),(PILImage mode=RGB size=213x160, TensorCategory(18))...]

Looking at what it does at the moment

??dls.train.shuffle_fn
Signature: dls.train.shuffle_fn(idxs)
Docstring: Returns a random permutation of `idxs`.
Source:        def shuffle_fn(self, idxs): return self.rng.sample(idxs, len(idxs))
File:      /usr/local/anaconda3/envs/blog/lib/python3.8/site-packages/fastai/data/load.py
Type:      method

We can easily patch the functionality, in this case we'll just return 128 items

@patch_to(DataLoader)
def shuffle_fn(self, idxs): return self.rng.sample(idxs, 128)

Check this has changed as expected

??dls.train.shuffle_fn
Signature: dls.train.shuffle_fn(idxs)
Docstring: <no docstring>
Source:   
@patch_to(DataLoader)
def shuffle_fn(self, idxs): return self.rng.sample(idxs, 128)
File:      ~/Documents/daniel/blog/_notebooks/<ipython-input-29-99a68b7fe249>
Type:      method
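The difference between the original and the patched version comes down to how many items `sample` is asked for. Here is a small sketch using Python's `random.Random` directly, on the assumption that the DataLoader's `rng` behaves the same way; the seed and the 1000-item index list are arbitrary choices for the example.

```python
import random

rng = random.Random(42)          # seed chosen only to make the example repeatable
idxs = list(range(1000))         # stand-in for the training set indices

full = rng.sample(idxs, len(idxs))   # a permutation: every index, in a new order
small = rng.sample(idxs, 128)        # 128 distinct indices, drawn without replacement
```

One thing to keep in mind is that `sample` raises a `ValueError` if it is asked for more items than the list contains, so a fixed size like 128 only works for datasets with at least that many items.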

Now try training...

learn = cnn_learner(dls, squeezenet1_0)
learn.fit(1)
epoch train_loss valid_loss time
0 4.615644 6.038196 00:17

This seemed to do the training step very quickly so it's probably only getting 128 items but I don't know that for sure. If only there was some way of getting access to the training loop in fastai 😜

class PrintItems(Callback):
    def __init__(self):
        self.items_done = 0
    def after_batch(self):
        print(self.iter)
        # add one batch of items per step; max(self.iter,1)*bs would over-count from the third batch on
        self.items_done += self.learn.dls.bs
    def before_validate(self):
        print(self.items_done)
        raise CancelFitException('stopped before valid')
learn.fit(1, cbs=PrintItems())
epoch train_loss valid_loss time
0 4.113161 None 00:00
0
1
128

With some more guidance I updated this function to accept a size parameter that can then easily be changed during training.

@patch
def my_shuffle_fn(self:DataLoader, idxs, size=128):
    return self.rng.sample(idxs, size)
from functools import partial
dls.train.shuffle_fn = dls.train.my_shuffle_fn
??dls.train.shuffle_fn
Signature: dls.train.shuffle_fn(idxs, size=128)
Docstring: <no docstring>
Source:   
@patch
def my_shuffle_fn(self:DataLoader, idxs, size=128):
    return self.rng.sample(idxs, size)
File:      ~/Documents/daniel/blog/_notebooks/<ipython-input-34-95358dde961e>
Type:      method
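Why the extra `size` parameter is handy: `functools.partial` can re-bind it for each trial while whatever calls `shuffle_fn(idxs)` stays unchanged. The sketch below uses an invented `ToyLoader` class in place of a real DataLoader, and the fractions are just the four-trial example from earlier.

```python
import random
from functools import partial


class ToyLoader:
    """Stand-in for a DataLoader: only the pieces needed for this example."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def my_shuffle_fn(self, idxs, size=128):
        return self.rng.sample(idxs, size)


dl = ToyLoader()
idxs = list(range(1000))

sizes = []
for frac in (0.25, 0.5, 0.75, 1.0):
    # re-bind `size` for this trial; callers still just call shuffle_fn(idxs)
    dl.shuffle_fn = partial(dl.my_shuffle_fn, size=int(len(idxs) * frac))
    sizes.append(len(dl.shuffle_fn(idxs)))
```

Each pass through the loop swaps in a new `shuffle_fn` that returns a larger random subset, which is roughly what I want the callback to do at the start of each trial.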

Defining roughly what I want to do with some print statements again...

class PrintItems(Callback):
    def __init__(self):
        self.items_done = 0
    def before_train(self): 
        print('calculating partial data sizes...')
        print('updating shuffle_fn....')
        self.learn.dls.train.shuffle_fn = partial(self.learn.dls.train.my_shuffle_fn, size=128)
    def after_batch(self):
        print(self.iter)
        # add one batch of items per step; max(self.iter,1)*bs would over-count from the third batch on
        self.items_done += self.learn.dls.bs

    def before_validate(self):
        print(self.items_done)
        raise CancelFitException('stopped before valid')  
learn.fit(1, cbs=PrintItems())
epoch train_loss valid_loss time
0 2.966318 None 00:00
calculating partial data sizes...
updating shuffle_fn....
0
1
128

To be continued...

To avoid this becoming even longer I'll pick up the next steps of trying to get this callback working in another blog post.

Summary so far

How did doing this help with my original goals?

  • I've definitely got a much better grasp on the callback system in fastai. Even though the specific callback I'm working on is still a work in progress, I now know much better where the entry points for callbacks are and what can be accessed/modified. I would feel more confident implementing callbacks than before.

  • My 'process' is still a work in progress, but keeping in mind that I should record the steps was actually super helpful, and it is something I'll try to do better next time.

  • Asking other people for help can move progress forward quickly. I think there is a skill in asking questions in the best way possible and this is something I'll continue to work on...