Multi-model metadata generation
An experiment in combining text and tabular models to generate web archive metadata
Learning from multiple input types
Deep learning models usually take one type of input (image, text, etc.) to predict output labels (category, entities, etc.). This usually makes sense if the data you are using to make predictions contains a lot of information, e.g. a chunk of text from a movie review or an image.
Recently I have been playing around with a Website Classification Dataset from the UK web archive. The dataset is derived from a manually curated web archive and contains the manually assigned primary and secondary subject categories for each web page, alongside the page URL and the page title.
As part of playing around with this dataset I was keen to see if a multi-input model would work well, in this case a model that takes both text and tabular data as input. A preview of the data:
Based on this data the UK web archive are interested:
"in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives."
This is going to be fairly tricky but offers a nice excuse to try to use models with multiple inputs to predict our categories.
len(df['Primary Category'].unique())
len(df['Secondary Category'].unique())
Predicting 104 different labels is going to be pretty difficult, so I've only used 'Primary Category' as the `y` target. What is the distribution of these categories like?
😬 We also have a fairly skewed dataset. I could drop some of the rows whose categories don't occur often, but since the main objective here is to see if we can use a multi-input model we'll leave the data as it is for now.
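A quick way to quantify that skew is to count how often each label occurs. The sketch below uses made-up labels in place of the real 'Primary Category' column (the label names and counts are illustrative only); with the actual DataFrame, `df['Primary Category'].value_counts()` gives the same counts.

```python
from collections import Counter

# Hypothetical labels standing in for df['Primary Category']
labels = ["category_a"] * 6 + ["category_b"] * 3 + ["category_c"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset taken up by each label, most common first
for label, n in counts.most_common():
    print(f"{label}: {n} rows ({n / total:.0%})")

# Ratio between the most and least frequent label: a rough measure of skew
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio suggests that accuracy alone will flatter the model, which is one reason to track a per-class metric such as an F-score as well.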
Multi-input model
The rest of the notebook will describe some experiments with using fastai to create a model which takes tabular and text data as input. The aim here wasn't for me to create the best model but to get my head around how to combine models. I relied heavily on some existing notebooks, Kaggle write-ups and posts on the fastai forums.
Tabular model
In the dataset above we start off with two columns of data which can be used as inputs for the model. The title is fairly obviously something which we can treat like other text inputs. The URL is a little less obvious. It could be treated as a text input, but an alternative is to treat a URL as a set of parts, each of which contains some information that could be useful for our model.
Each part of the URL could be split into smaller parts
Whether a URL has '.org' or '.uk' or '.com' could be meaningful for predicting our categories (it might also not be meaningful). Splitting also offers us a way of taking the URLs and composing them into a format which looks more tabular.
So far I've only done this very crudely. I suspect tidying up this part of the data will help improve things. At this point though we have something a little more tabular-looking that we can pass to a `fastai.tabular` learner. Now we have some 'categories' rather than unique URLs.
print(len(df.url3.unique()))
print(len(df.url4.unique()))
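The notebooks contain the actual preprocessing; as a rough illustration of the idea, the sketch below splits a URL's host name on dots into a fixed number of columns. The function name `url_to_columns`, the `"missing"` padding value and the mapping of parts to `url1`...`url4` are assumptions made for this example, not necessarily how the columns above were produced.

```python
from urllib.parse import urlparse

def url_to_columns(url, n_parts=4):
    """Split a URL's host name into n_parts columns, padding short hosts."""
    host = urlparse(url).netloc.lower()
    parts = host.split(".")
    # Pad short hosts and truncate long ones so every row has the same columns
    parts = (parts + ["missing"] * n_parts)[:n_parts]
    return {f"url{i + 1}": part for i, part in enumerate(parts)}

print(url_to_columns("https://www.bl.uk/collection"))
```

Each resulting column can then be treated as a categorical variable. Aligning the parts from the right, so that the top-level domain always lands in the same column, might be a worthwhile refinement.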
How does this tabular model do?
Once some preprocessing of the URL has been done, we train a model using the tabular learner. I didn't do much to try to optimize this model. Tracking the best F2 score, we end up with:
Better model found at epoch 36 with f_beta value: 0.17531482875347137
and an accuracy of 0.334121
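For reference, the F-beta score combines precision and recall, and with beta = 2 recall counts for more than precision. The sketch below computes it for a single precision/recall pair; how the score is averaged across classes in the training runs above is not shown here, and the helper name `f_beta` is just for this example.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from a single precision/recall pair."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when the model gets nothing right
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 2, high recall is rewarded more than high precision
print(f_beta(precision=0.5, recall=1.0))
print(f_beta(precision=1.0, recall=0.5))
```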
How well does a text model do?
Next I tried training an NLP model on the title field. I tried a few things here.
SentencePiece tokenization
By default fastai uses spaCy to do tokenization, with a few additional special tokens added by fastai. I wanted to see if using SentencePiece would work better for processing title fields. SentencePiece supports several kinds of sub-word tokenization. This can be useful for agglutinative languages but could also be useful when you have a lot of out-of-vocabulary words in your corpus. I wanted to see if this was also useful for processing titles, since these may contain domain-specific terms. I only tried using SentencePiece with 'unigram' tokenization. The best score I got for this was:
Better model found at epoch 1 with f_beta value: 0.21195338666439056.
Default spaCy tokenization
I compared the above to using the default fastai tokenizer, which uses spaCy. In this case the default approach worked better. This is probably because we didn't have a large pre-trained model that used SentencePiece tokenization to use as a starting point. The best score I got for this model was:
Better model found at epoch 27 with f_beta value: 0.33327043056488037.
Using the URL as text input
I wanted to do a quick comparison to the tabular model and use the URL as a text input instead. In this case I used SentencePiece with byte-pair encoding (BPE). The best score in this case was:
Better model found at epoch 3 with f_beta value: 0.2568161189556122.
This might end up being a better approach compared to the tabular approach described above.
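To give a feel for why BPE suits URLs, here is a toy version of the merge-learning step: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified sketch, not SentencePiece's implementation, and the example URLs and function names are made up for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

urls = ["www.bbc.co.uk", "www.gov.uk", "www.nhs.uk"]  # hypothetical examples
merges = learn_bpe_merges(urls, num_merges=3)
print(merges)
print(apply_merges("www.example.uk", merges))
```

Frequent fragments such as a shared prefix quickly become single tokens, while rarer parts of the URL stay split into smaller pieces, so nothing is ever out of vocabulary.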
Combining inputs
Neither of these models is doing super well but my main question was whether combining the two would improve things at all. There are different approaches to combining these models. I followed existing examples and removed some layers from the text and tabular models which are then combined in a concat model. I won't cover all the steps here but all the notebooks can be found in this GitHub repo.
One of the things we need to do to create a model with multiple inputs is create a new PyTorch dataset which combines our text and tabular `x` inputs with our target. This is pretty straightforward:
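from torch.utils.data import Dataset

class ConcatDataset(Dataset):
    """Pairs two aligned inputs with a shared target."""
    def __init__(self, x1, x2, y):
        self.x1, self.x2, self.y = x1, x2, y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        # Return both inputs as a tuple so the model receives them together
        return (self.x1[i], self.x2[i]), self.y[i]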
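To make the indexing contract concrete, the sketch below uses plain Python lists as stand-ins for the processed tabular and text inputs. The class is redefined without the `Dataset` base class so that it runs without PyTorch installed, and all the values are made up.

```python
class ConcatDataset:
    """Same logic as above, minus the PyTorch base class (illustration only)."""
    def __init__(self, x1, x2, y):
        self.x1, self.x2, self.y = x1, x2, y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        return (self.x1[i], self.x2[i]), self.y[i]

# Hypothetical, already-numericalized inputs; the three lists must be aligned row by row
tab_inputs = [(0, 3), (1, 3), (0, 5)]            # categorical codes per row
text_inputs = [[2, 15, 7], [2, 9], [2, 4, 4, 8]]  # token ids per title
targets = [1, 0, 1]

ds = ConcatDataset(tab_inputs, text_inputs, targets)
(x_tab, x_text), y = ds[1]
print(len(ds), x_tab, x_text, y)
```

Because each item is `((x1, x2), y)`, the model's `forward` receives the two inputs together and can route each one to the matching sub-model.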
One of the other pieces was creating a ConcatModel:
import torch
from torch import nn
from fastai.layers import bn_drop_lin  # fastai v1 helper: https://docs.fast.ai/layers.html#bn_drop_lin

class ConcatModel(nn.Module):
    def __init__(self, model_tab, model_nlp, layers, drops):
        super().__init__()
        self.model_tab = model_tab
        self.model_nlp = model_nlp
        lst_layers = []
        # ReLU after every block except the last one
        activs = [nn.ReLU(inplace=True),] * (len(layers)-2) + [None]
        for n_in,n_out,p,actn in zip(layers[:-1], layers[1:], drops, activs):
            lst_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)
        self.layers = nn.Sequential(*lst_layers)

    def forward(self, *x):
        x_tab = self.model_tab(*x[0])         # unpack the tabular inputs
        x_nlp = self.model_nlp(x[1])[0]       # keep the first element of the text model's output
        x = torch.cat([x_tab, x_nlp], dim=1)  # join both feature sets along the feature dimension
        return self.layers(x)
`lst_layers` is dependent on the layers from the tabular and NLP models. The layer sizes are manually defined at the moment, so if changes are made to the number of layers in the tabular model they need to be changed by hand.
`bn_drop_lin` is a fastai helper function that returns a sequence of batch normalization, dropout and a linear layer (followed by the activation, if one is given). Here these blocks make up the final layers of the model.
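The way `layers`, `drops` and the activations are zipped together in `ConcatModel.__init__` is easier to see with concrete numbers. The sizes below are hypothetical (none of them come from the notebooks); the first entry of `layers` has to equal the combined width of the two sub-models' outputs, which is the value that needs updating by hand.

```python
# Hypothetical sizes: none of these numbers come from the original notebooks
n_tab_out = 100   # width of the tabular model's output after removing its head
n_nlp_out = 50    # width of the text model's output after removing its head
n_classes = 20    # number of target categories

layers = [n_tab_out + n_nlp_out, 128, n_classes]
drops = [0.5, 0.1]

# Strings stand in for nn.ReLU so that the sketch runs without PyTorch
activs = ["relu"] * (len(layers) - 2) + [None]

# One (n_in, n_out, dropout, activation) tuple per bn_drop_lin block
blocks = list(zip(layers[:-1], layers[1:], drops, activs))
for n_in, n_out, p, actn in blocks:
    print(f"BatchNorm({n_in}) -> Dropout({p}) -> Linear({n_in}, {n_out}) -> {actn}")
```

Note that `zip` stops at the shortest sequence, so `drops` needs one entry per block or the final block is silently dropped.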
How does this combined model do? 🤷‍♂️
The best result I got was `f_beta value: 0.39341238141059875` with an accuracy of `0.595348`. A summary of the scores for each model:
Model | F2 score |
---|---|
SentencePiece text | 0.211 |
spaCy text | 0.333 |
Tabular | 0.175 |
Concat | 0.393 |
This provides some improvement on the tabular or NLP models on their own. I found the combined model was fairly tricky to train and suspect that there could be some improvements in how the model is set up that might improve its performance. I am keen to try a similar approach with a dataset where there is more abundant information available to train with.
tl;dr
It wasn't possible to get a very good f2 score on this website classification dataset. As the UK web archive say:
We expect that a appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future. Options include: For each site, make the titles of every page on that site available. For each site, extract a set of keywords that summarise the site, via the full-text index.
I suspect that having either of these additional components would help improve the performance of the classifier.