This page collects selected projects that I have worked on.

Machine learning projects

flyswot: using computer vision to detect ‘fake flysheets’

An increasing challenge for libraries is managing the scale of digitised material resulting from digitisation projects and ‘born digital’ materials. This project aims to detect mislabelled digitised manuscript pages.

[Image: A manuscript page with the correct label "fse.ivr"]

Because of limitations in a previous system for hosting digitised manuscript images, many images have incorrect page metadata associated with them. In the example above, the image has been correctly labelled as an ‘end flysheet’. This label is represented by fse, which is included in the filename for the image. However, other types of manuscript pages also have this label incorrectly assigned, e.g. a ‘cover’ with fse in its filename. There is around a petabyte of images to review before they can be ingested into a new library system. This project uses computer vision to support library staff in processing this collection. At the moment, the project does the following:

  • pulls in an updated dataset of training examples
  • trains a model on these images; the model architecture has multiple heads, allowing it to make both a ‘crude’ prediction of whether an image is incorrectly labelled and a ‘full’ prediction of the true label (see the sketch after this list)
  • pushes each newly trained version of the model to the 🤗 model hub
  • lets the end-user run the model through a command-line tool pointed at a directory of images to be checked
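
The multi-headed setup can be illustrated with a minimal PyTorch sketch: a shared backbone feeding two classification heads. The backbone, feature size, and label counts below are placeholders for illustration, not the actual flyswot configuration.

```python
import torch
from torch import nn

class TwoHeadClassifier(nn.Module):
    """A shared backbone feeding two heads: a binary 'crude' head
    (is the current label correct?) and a multi-class 'full' head
    (what is the true page label?)."""

    def __init__(self, backbone: nn.Module, n_features: int, n_full_labels: int):
        super().__init__()
        self.backbone = backbone
        self.crude_head = nn.Linear(n_features, 2)             # correct vs mislabelled
        self.full_head = nn.Linear(n_features, n_full_labels)  # true page label

    def forward(self, x: torch.Tensor):
        features = self.backbone(x)
        # Both heads share the same features, so both objectives
        # are trained at once.
        return self.crude_head(features), self.full_head(features)

# Example with a trivial placeholder backbone:
model = TwoHeadClassifier(nn.Flatten(), n_features=3 * 32 * 32, n_full_labels=10)
crude_logits, full_logits = model(torch.randn(1, 3, 32, 32))
```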

The code for the command-line tool is available here: github.com/davanstrien/flyswot/

Some of the tools used: fastai, DVC, Weights & Biases, 🤗 model hub, pytest, nox, poetry

Book Genre Detection

This project created machine learning models that predict whether a book is ‘fiction’ or ‘non-fiction’ based on its title:

  • The project was developed to address a gap in metadata in a large-scale digitised book collection.
  • The project used weak supervision to generate a more extensive training set beyond the initial human-generated annotations (see the sketch after this list).
  • Currently, two models are publicly available, one via the 🤗 Model Hub and one via Zenodo.
  • The process of creating the models is documented in a Jupyter Book. This documentation aims to communicate the critical steps in the machine learning pipeline to help other people in the sector develop similar models. The classifier can be tried on 🤗 Spaces: https://huggingface.co/spaces/BritishLibraryLabs/British-Library-books-genre-classifier-v2
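
To give a flavour of the weak supervision step, here is a minimal Snorkel sketch using keyword-based labelling functions over book titles. The specific keyword rules are invented for illustration; they are not the labelling functions used in the project.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

FICTION, NON_FICTION, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_novel(x):
    # Titles mentioning 'novel' are usually fiction.
    return FICTION if "novel" in x.title.lower() else ABSTAIN

@labeling_function()
def lf_history(x):
    # Titles mentioning 'history' are usually non-fiction.
    return NON_FICTION if "history" in x.title.lower() else ABSTAIN

df = pd.DataFrame({"title": ["A History of Yorkshire", "The Lost Novel", "Poems"]})

# Apply every labelling function to every title...
applier = PandasLFApplier(lfs=[lf_novel, lf_history])
L_train = applier.apply(df=df)

# ...then combine their noisy, overlapping votes into one label per title.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100)
preds = label_model.predict(L=L_train)
```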

Some of the tools used: fastai, transformers, blurr, Hugging Face Model Hub, Jupyter Book, Snorkel, Gradio
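
If the published model is compatible with the standard transformers text-classification pipeline (an assumption, since the export format isn't described here), using it could be as simple as the following; the model id is a placeholder, not the actual id on the Hub.

```python
from transformers import pipeline

# Placeholder model id; substitute the actual id from the 🤗 Model Hub.
classifier = pipeline("text-classification", model="<genre-model-id>")
print(classifier("A history of the county of York"))
```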

Datasets

British Library books

British Library Books Genre data

Datasets to support Programming Historian lessons

I think having more realistic datasets is important for teaching machine learning effectively. As a result, I created two datasets for two Programming Historian lessons currently under review.

Workshop datasets

Workshop materials

Tutorials

Code

You can view much of my code-related activity on GitHub.

Publications