My previous series of posts/notes following the full stack course ended slightly abruptly. However, I didn’t give up on the dream of trying to get more experience of ‘deploying’ machine learning in a GLAM setting! One of the things I have been focusing on recently is a project which gave me the chance to use machine learning in somewhat of a production context.

This blog post will be a very short tl;dr on this project.

Detecting fake flysheets using computer vision

The Library has ‘legacy’ digitised content of manuscripts. Due to some limitations of the legacy system on which these manuscripts were shared many of the images have incorrect metadata. The metadata for these images is partially stored in the filename for the image. In particular many images which didn’t fit into an available category were given ‘end flysheet’ labels (this basically means part of the filename contains the string fse). These images may actually be other things like a frontcover, a scroll image, a box etc.

A screenshot of the digitised manuscript platform showing metadata about the page type of the manuscript

The library is moving to a new platform which won’t have these restrictions on possible page types. As a result there is a need/desire to find all of the images which have been given a ‘fake’ flysheet label and correct this label with the correct label.

This is a task where computer vision seems like it might be helpful.

The desired outcome

The desire of this project is to be able to use a computer vision model to check a bunch of image directories and see if there are any ‘fake’ flysheets. There are some additional constraints on the project:

  • $$$ this isn’t a funded project so can’t involve spending a bunch of money
  • related to the above, the approach to annotation has to be pragmatic - no Mechanical Turk here
  • the machine learning should fit into existing workflows (this is something we have/are spending a lot of time on)

Since this is intended to be a tl;dr I won’t go into more detail here about all of these requirements here.

flyswot

The approach we ended up with is to deploy a model using a command line tool that we’ve called flyswot. This tool can be pointed at a directory and it will recursively check for images which contain the fse pattern in the filename. These images are then checked using a computer vision model that looks check whether an image is a ‘real’ flysheet or a ‘fake’ flysheet.

What I have learned (so far)

This project has been a great way of turning some of the theory of ‘production’ ML into practice. In particular I have learned:

  • I’m super paranoid about domain drift.
  • (some) of how to use ONNX
  • More robust testing approaches
  • DVC (data version control)
  • and a bunch more things…

Most of these things are being documented elsewhere and will be available to share at some point in 2022. However, I will try and use this blog to document small things I’ve learned along the way too. These notes are mainly for myself. There are a lot of little things I’ve picked up from doing this project that I will forget if I don’t spend a bit of time writting up.