Introduction to my notes

These are my notes from the full-stack deep learning course.

They focus on the GLAM (galleries, libraries, archives and museums) setting, which might have different requirements, incentives and resources compared to business deployments of ML projects. These notes are primarily for my own use; I am posting them to make myself more accountable for writing semi-coherent notes.

These are my notes, written in a personal capacity; my employer does not endorse any of my views.

Why talk about debugging?

Supposedly 80-90% of project time can be spent on debugging and tuning, so it is useful to have some approaches for doing this work.

Why is deep learning debugging hard?

Small implementation details

Often a bug will produce weird results while the code still runs, so there is no exception to point you at the problem and you have to manually dig around to find the cause. Example:

import glob

images = glob.glob('path/to/images/*')  # file order is not guaranteed
labels = glob.glob('path/to/labels/*')  # so these may not line up with the images

glob does not guarantee any particular file ordering (it depends on the operating system), so the two lists may not line up with each other. This doesn't throw an error, but the model won't learn anything because the labels are shuffled relative to the images.
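
A minimal fix (continuing the snippet above, and assuming the image and label filenames pair up when sorted) is to sort both lists:

images = sorted(glob.glob('path/to/images/*'))
labels = sorted(glob.glob('path/to/labels/*'))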

Hyperparameters

Small changes to hyperparameters can make the difference between a model training well and not training at all.

Data/Model fit

Performance of a model on one dataset (e.g. ImageNet) might not translate to the data you are working with. Transfer learning will often help, but it can be unclear exactly how well the model and its original task transfer to new types of data and problems.

Models vs Dataset construction

In an academic setting, a lot of thought is given to choosing models/algorithms. This is often less of a focus when deploying machine learning in a production setting, where there will be many more challenges around constructing a dataset.

Notes: I think there is a shift here from training on a training/validation set to having in the back of your head that predictions will be made ‘in the wild’.

There can be a fair bit of variation in GLAM datasets, which also potentially makes data drift hard to track.

Architecture Selection

It can be overwhelming to pick between all the different types of model architectures. The suggestion is to start out with one of a few default flavours and change if needed:

Type        Model                                               Maybe move to?
Images      LeNet-like model                                    ResNet
Sequences   LSTM with one hidden layer                          Attention
Other       Fully connected neural net with one hidden layer    Problem dependent
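
As a concrete starting point for the image case, here is a rough PyTorch sketch of a small LeNet-style model (the layer sizes assume 28x28 single-channel inputs and are purely illustrative):

import torch.nn as nn

# small LeNet-style CNN: two conv/pool blocks followed by fully connected layers
lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)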

Sensible defaults

Version zero of your model could start with the following (a rough code sketch follows the list):

  • optimizer: Adam with learning rate 3e-4
  • activations: ReLU (fully connected and convolutional models), tanh (LSTMs)
  • initialization: He et al. normal (ReLU), Glorot normal (tanh)
  • regularization: none
  • data normalization: none
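
A rough PyTorch sketch of these defaults (the model itself is just a placeholder fully connected net, not a recommendation):

import torch
import torch.nn as nn

# placeholder fully connected model with ReLU activations
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# He et al. (Kaiming) normal initialisation for the ReLU layers
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

# Adam with learning rate 3e-4 and no regularisation (weight_decay=0)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.0)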

Consider simplifying the problem

  • start with a subset of the training data
  • use a fixed number of objects, classes etc.
  • create a simpler synthetic training set

Model evaluation

tl;dr: apply the bias-variance decomposition.

From Wikipedia: "In statistics and machine learning, the bias-variance tradeoff is the property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters. The bias-variance dilemma or bias-variance problem is the conflict in trying to simultaneously minimise these two sources of error that prevent supervised learning algorithms from generalising beyond their training set. The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting)."
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

  • Test error = irreducible error + bias + variance + validation set overfitting

Here we look at the difference in error between our training data and some kind of baseline (e.g. human performance) to see how much we are underfitting. We also compare validation error to training error to see how much we are overfitting.

This assumes training, validation, and test come from the same distribution.

Notes: tracking the impact of distribution shift on model performance is helpful here.

Example

Error source        Value
Goal                1%
Train error         20%
Validation error    27%
Test error          28%
  • The large gap between our goal (1%) and our train error (20%) shows underfitting
  • The gap between train and validation error shows we’re also overfitting
  • The small gap between validation and test error is okay
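
In numbers (a rough reading of the decomposition above, treating the 1% goal as a proxy for the irreducible error):

goal, train, val, test = 0.01, 0.20, 0.27, 0.28

underfitting = train - goal     # 0.19 -> big gap to the goal, so high bias / underfitting
overfitting = val - train       # 0.07 -> noticeable gap, so overfitting to the training set
val_overfitting = test - val    # 0.01 -> small gap, so little overfitting to the validation set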

🤷‍♂️ Both overfitting and underfitting - how to decide what to do next…

Prioritizing improvements

It seems hard to deal with underfitting and overfitting at the same time. The suggested process is to work through the following steps:

1) Address underfitting
2) Address overfitting
3) Address distribution shift
4) Re-balance the data (if applicable)

How to address under-fitting (i.e. reduce bias)

These suggestions are listed in order of what to try first (a short code sketch of the first suggestion follows the list).

  • Make the model bigger (e.g. ResNet-34 to ResNet-50)
  • Reduce regularisation
  • Error analysis
  • Choose a different model architecture (e.g. bump from LeNet to ResNet). This can introduce new problems, so be careful here
  • Add features 😬 sometimes this can really help, even in deep learning
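
A rough torchvision sketch of the first suggestion, "make the model bigger" (the weights argument assumes torchvision 0.13 or later, and the 10-class head is purely illustrative):

import torch.nn as nn
from torchvision import models

model = models.resnet34(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)  # replace the head for your own classes

# if training error stays high (underfitting), swap in a bigger backbone
model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)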

Addressing over-fitting

  • add more training data (if possible)
  • add normalisation
  • data augmentation (a short sketch follows this list)
  • regularisation
  • error analysis
  • choose better model architecture
  • tune hyperparameters
  • early stopping
  • remove features
  • reduce model size
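
A rough sketch of adding data augmentation with fastai's aug_transforms (the folder layout and parameter values here are illustrative assumptions):

from fastai.vision.all import ImageDataLoaders, aug_transforms

# flips, small rotations and zooms applied on the GPU at batch time
dls = ImageDataLoaders.from_folder(
    'path/to/images',
    valid_pct=0.2,
    batch_tfms=aug_transforms(max_rotate=10.0, max_zoom=1.1),
)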

Addressing distribution shift

a) Analyse test-val set errors: why might these errors be present in the validation set?
b) Analyse test-val set errors and synthesise more data to cover them
c) Domain adaptation

Error analysis

Try to find the different sources of error in the test/validation sets, see how easily each can be fixed, and how much each contributes to the overall error.

This could be particularly important where domain expertise can help track down why some of these errors occur. It can also be used to re-evaluate the target labels if they often cause confusion. Fastai has some nice methods for this, e.g. most_confused, which shows which labels are most often confused with each other (a short sketch follows). If the confusion is reasonable, you may a) not care too much about these mistakes or b) collapse the two labels into one.
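
A minimal fastai sketch of this kind of error analysis (learn is assumed to be an already-trained Learner):

from fastai.vision.all import ClassificationInterpretation

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
print(interp.most_confused(min_val=2))  # label pairs confused with each other at least twice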

Domain adaptation

  • supervised: fine-tune a pre-trained model (a short sketch follows this list)
  • unsupervised: more in the research domain at the moment?
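
A rough sketch of the supervised case using fastai's fine_tune (vision_learner is the fastai v2 name, cnn_learner in older releases; dls is an assumed DataLoaders built from your own domain's images):

from fastai.vision.all import vision_learner, resnet34, error_rate

learn = vision_learner(dls, resnet34, metrics=error_rate)  # starts from ImageNet weights
learn.fine_tune(5)  # trains the new head first, then unfreezes and fine-tunes the whole model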

Tuning Hyperparameters

Model and optimiser choices include:

  • how many layers
  • kernel size
  • etc.

How to choose what to tune? Some hyperparameters matter more than others, but it’s often hard to know in advance which will be the most important.

Rules of thumb for what is worth tuning:

Hyperparameter               Likely sensitivity
Learning rate                High
Learning rate schedule       High
Optimiser choice             Low
Other optimiser parameters   Low
Batch size                   Low (whatever fits on the GPU)
Weight initialisation        Medium
Loss function                High
Model depth                  Medium
Layer size                   High
Layer params                 Medium
Weight of regularisation     Medium
Nonlinearity                 Low

Approaches to hyperparameters

Manual:

  • focus on what is likely to make a difference conceptually
  • train and evaluate the model
  • guess better parameters and repeat
  • can be good as a starting point and combined with other approaches
  • easy to implement
  • but expensive, and you need a good sense of sensible parameter ranges
  • often better than grid search for the same number of runs
  • not very easy to interpret

Bayesian approaches

  • there are nice frameworks for doing this
  • generally the most efficient approach
  • can be hard to implement
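
For example, a minimal sketch with Optuna (one such framework; its default TPE sampler is a Bayesian-style sequential approach; train_and_validate is a hypothetical helper that trains a model and returns its validation error):

import optuna

def objective(trial):
    # illustrative search space: learning rate and hidden layer size
    lr = trial.suggest_float('lr', 1e-5, 1e-2, log=True)
    hidden = trial.suggest_int('hidden_size', 32, 512)
    return train_and_validate(lr, hidden)  # hypothetical: train and return validation error

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print(study.best_params)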

Overarching notes

Personally, I think some of these suggestions are better suited to settings where ML will be the product/service. In GLAMs this will sometimes be the case, but there is also low-hanging fruit in using ML as a ‘tool’. In that case, I think it makes more sense to start from an existing, well-tested implementation of ResNet rather than coding your own network from scratch.

The discussion on error analysis was excellent and is a valuable framework for working out how to prioritise improvements.