Multi-model metadata generation

Experiment in combining text and tabular models to generate web archive metadata
Author

Daniel van Strien

Published

May 3, 2020

Learning from multiple input types

Deep learning models usually take one type of input (an image, text, etc.) to predict output labels (a category, entities, etc.). This usually makes sense when the input data contains a lot of information, e.g. a chunk of text from a movie review or an image.

Recently I have been playing around with a Website Classification Dataset from the UK Web Archive. The dataset is derived from a manually curated web archive in which each web page is assigned a primary and secondary category. The UK Web Archive has made a dataset available based on this archive which contains the manually classified subject categories alongside the page URL and the page title.

As part of playing around with this dataset I was keen to see whether a multi-input model would work well, in this case a model that takes both text and tabular data as input. A preview of the data:

#hide_input
import pandas as pd
tsv = 'https://gist.githubusercontent.com/davanstrien/5e22b725046eddc2f1ee06b108f27e48/raw/71426e6b92c7fa98140a95728a5ea55171b948cd/classification.tsv'
df = pd.read_csv(tsv, error_bad_lines=False, index_col=0)
df.head()
|   | Primary Category | Secondary Category | Title | URL |
|---|---|---|---|---|
| 0 | Arts & Humanities | Architecture | 68 Dean Street | http://www.sixty8.com/ |
| 1 | Arts & Humanities | Architecture | Abandoned Communities | http://www.abandonedcommunities.co.uk/ |
| 2 | Arts & Humanities | Architecture | Alexander Thomson Society | http://www.greekthomson.com/ |
| 3 | Arts & Humanities | Architecture | Arab British Centre, The | http://www.arabbritishcentre.org.uk/ |
| 4 | Arts & Humanities | Architecture | Architectural Association School of Architecture | http://www.aaschool.ac.uk/ |

Based on this data the UK Web Archive are interested:

> “in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives.”

This is going to be fairly tricky but offers a nice excuse to try to use models with multiple inputs to predict our categories.

Looking at the data

Taking a closer look at the data:

#hide_input
tsv = 'https://gist.githubusercontent.com/davanstrien/5e22b725046eddc2f1ee06b108f27e48/raw/71426e6b92c7fa98140a95728a5ea55171b948cd/classification.tsv'
df = pd.read_csv(tsv, error_bad_lines=False)

Unique primary categories

len(df['Primary Category'].unique())
24

Unique secondary categories

len(df['Secondary Category'].unique())
104

Predicting 104 different labels is going to be pretty difficult, so I've only used 'Primary Category' as the y target. What is the distribution of these categories like?

#hide_input
df['Primary Category'].value_counts()
Arts & Humanities                                              5299
Government, Law & Politics                                     4832
Business, Economy & Industry                                   2988
Society & Culture                                              2984
Science & Technology                                           2420
Medicine & Health                                              2164
Education & Research                                           2118
Company Web Sites                                               843
Digital Society                                                 737
Sports and Recreation                                           710
Religion                                                        417
Travel & Tourism                                                374
Social Problems and Welfare                                     270
Politics, Political Theory and Political Systems                123
Crime, Criminology, Police and Prisons                          101
Literature                                                       87
Law and Legal System                                             81
Computer Science, Information Technology and Web Technology      54
Libraries, Archives and Museums                                  52
Environment                                                      38
History                                                          34
Publishing, Printing and Bookselling                             26
Popular Science                                                  23
Life Sciences                                                    23
Name: Primary Category, dtype: int64

😬 We also have a fairly skewed dataset. I could drop some of the rows whose categories rarely occur, but since the main objective here is to see if we can use a multi-input model, we'll leave the data as it is for now.

Multi-input model

The rest of the notebook describes some experiments with using fastai to create a model which takes tabular and text data as input. The aim here wasn't to create the best possible model but to get my head around how to combine models. I relied heavily on some existing notebooks, a Kaggle writeup and posts on the fastai forums.

Tabular model

In the dataset above we start off with two columns of data which can be used as inputs for the model. The title is fairly obviously something we can treat like any other text input. The URL is a little less obvious. It could be treated as a text input, but an alternative is to treat a URL as a series of parts, each of which contains some information that could be useful for our model.

#hide_input
sample_urls = df.URL.sample(3).to_list()
for url in sample_urls:
    print(url)
http://www.specialschool.org/
http://www.bbc.co.uk/news/health-12668398
http://www.monarchit.co.uk/

Each URL can in turn be split into smaller parts:

#hide_input
print(df.URL.sample(1).to_list()[0].split('.'))
['http://www', 'darwincountry', 'org/']

Whether a URL contains '.org', '.uk' or '.com' could be meaningful for predicting our categories (it might also not be meaningful). It also offers us a way of decomposing the URLs into a format which looks more tabular, as sketched below.
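The exact preprocessing isn't shown in this post, but a minimal sketch of the idea might look like the following; the url_to_parts helper and the way the columns are named are my own illustration, not necessarily what the notebooks do:

from urllib.parse import urlparse

def url_to_parts(url):
    # e.g. 'http://www.greekthomson.com/' -> ['http', 'www', 'greekthomson', 'com']
    parsed = urlparse(url)
    return [parsed.scheme] + parsed.netloc.split('.')

# One column per URL part; shorter URLs are padded with missing values
parts = pd.DataFrame(df.URL.apply(url_to_parts).to_list(), index=df.index)
parts.columns = ['scheme'] + [f'url{i}' for i in range(1, parts.shape[1])]
df = df.join(parts)

The preprocessed data, loaded from a prepared csv below, looks like this: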

#hide_input
csv = 'https://gist.githubusercontent.com/davanstrien/5e22b725046eddc2f1ee06b108f27e48/raw/4c2a27772bf4d959bf3e58cfa8de9e0b9be69ca7/03_classification_valid_train.csv'
df = pd.read_csv(csv, index_col=0)
df[['scheme','url1','url3','url4','url5']].sample(5)
|   | scheme | url1 | url3 | url4 | url5 |
|---|---|---|---|---|---|
| 20011 | http | www | org | NaN | NaN |
| 15825 | http | www | com | NaN | NaN |
| 6068 | http | www | co | uk | NaN |
| 16507 | http | www | co | uk | NaN |
| 9723 | http | www | co | uk | NaN |

So far I've only done this very crudely. I suspect tidying up this part of the data would help improve things. At this point, though, we have something which looks a little more tabular and can be passed to the fastai tabular learner. We now have some 'categories' rather than unique URLs.

print(len(df.url3.unique()))
print(len(df.url4.unique()))
279
56

How does this tabular model do?

Once some preprocessing of the URL has been done, we train a model using the tabular learner. I didn't do much to try to optimize this model. Tracking the best F2 score, we end up with:

Better model found at epoch 36 with f_beta value: 0.17531482875347137 and an accuracy of 0.334121
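The training code isn't reproduced in this post (it lives in the notebooks), but a minimal fastai v1 setup would look roughly like this; the processors, validation split and layer sizes are illustrative rather than the exact values used:

from fastai.tabular import *

# Sketch of the tabular pipeline on the URL-part columns
cat_names = ['scheme', 'url1', 'url3', 'url4', 'url5']
data_tab = (TabularList.from_df(df, cat_names=cat_names,
                                procs=[FillMissing, Categorify])
            .split_by_rand_pct(0.2)
            .label_from_df(cols='Primary Category')
            .databunch())
learn_tab = tabular_learner(data_tab, layers=[200, 100], metrics=[accuracy])
learn_tab.fit_one_cycle(5)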

How well does a text model do?

Next I tried training an NLP model using the title field. I tried a few things here.

SentencePiece tokenization

By default fastai uses spaCy to do tokenization, with a few additional special tokens added by fastai. I wanted to see if using SentencePiece would work better for processing the title field. SentencePiece implements various kinds of sub-word tokenization. This can be useful for agglutinative languages, but could also be useful when you have a lot of out-of-vocabulary words in your corpus. I wanted to see if this was also useful for processing titles, since these may contain domain-specific terms. I only tried using SentencePiece with 'unigram' tokenization. The best score I got for this was:

Better model found at epoch 1 with f_beta value: 0.21195338666439056.
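As a sketch of the idea, a unigram SentencePiece model can be trained directly with the sentencepiece library; the file names and vocabulary size below are invented for illustration, and the original notebooks may wire this into fastai differently:

import sentencepiece as spm

# Train a unigram SentencePiece model on the raw titles
df['Title'].dropna().to_csv('titles.txt', index=False, header=False)
spm.SentencePieceTrainer.train(input='titles.txt', model_prefix='titles_sp',
                               vocab_size=8000, model_type='unigram')

# Tokenize a title into sub-word pieces
sp = spm.SentencePieceProcessor(model_file='titles_sp.model')
print(sp.encode('Alexander Thomson Society', out_type=str))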

Default spaCy tokenization

I compared the above to using the default fastai tokenizer, which uses spaCy. In this case the default approach worked better, probably because we didn't have a large pre-trained language model built on the SentencePiece tokenization to use as a starting point. The best score I got for this model was:

Better model found at epoch 27 with f_beta value: 0.33327043056488037.
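For comparison, the default spaCy-tokenized pipeline is sketched below; again the split and hyperparameters are illustrative, and the language-model fine-tuning steps of the usual ULMFiT recipe are omitted:

from fastai.text import *

# Sketch of the default (spaCy-tokenized) text pipeline on the Title column
data_text = (TextList.from_df(df, cols='Title')
             .split_by_rand_pct(0.2)
             .label_from_df(cols='Primary Category')
             .databunch())
learn_text = text_classifier_learner(data_text, AWD_LSTM, drop_mult=0.5,
                                     metrics=[accuracy])
learn_text.fit_one_cycle(5)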

Using the URL as text input

I wanted to do a quick comparison to the tabular model by using the URL as a text input instead. In this case I used SentencePiece with byte-pair encoding (BPE). The best score in this case was:

Better model found at epoch 3 with f_beta value: 0.2568161189556122.

This might end up being a better approach than the tabular treatment of the URL described above.
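In terms of the SentencePiece sketch above, the only changes are the training data and the model type; file names and vocabulary size are again invented for illustration:

# Byte-pair encoding over the raw URLs instead of unigram over titles
df['URL'].dropna().to_csv('urls.txt', index=False, header=False)
spm.SentencePieceTrainer.train(input='urls.txt', model_prefix='urls_sp',
                               vocab_size=4000, model_type='bpe')
sp_urls = spm.SentencePieceProcessor(model_file='urls_sp.model')
print(sp_urls.encode('http://www.sixty8.com/', out_type=str))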

Combining inputs

Neither of these models is doing super well, but my main question was whether combining the two would improve things at all. There are different approaches to combining models like this. Following existing examples, I removed the final layers from the text and tabular models and combined the truncated models in a concat model. I won't cover all the steps here, but all the notebooks can be found in this GitHub repo.

#hide
from pathlib import Path

import pandas as pd
from fastai.callbacks import *
from fastai.metrics import accuracy, MultiLabelFbeta
from fastai.tabular import *
from fastai.text import *

One of the things we need to do to create a model with multiple inputs is define a new PyTorch dataset which combines our text and tabular x inputs with our target. This is pretty straightforward:

#collapse_show
class ConcatDataset(Dataset):
    "Return an ((x1, x2), y) tuple, pairing both inputs with the shared target"
    def __init__(self, x1, x2, y):
        self.x1, self.x2, self.y = x1, x2, y
    def __len__(self):
        return len(self.y)
    def __getitem__(self, i):
        return (self.x1[i], self.x2[i]), self.y[i]
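As an illustration of how this might be wired up, assuming data_text and data_tab are the processed text and tabular DataBunches (these variable names are mine, not from the notebooks):

# Pair the processed text and tabular inputs with the shared labels
train_ds = ConcatDataset(data_text.train_ds.x, data_tab.train_ds.x,
                         data_tab.train_ds.y)
valid_ds = ConcatDataset(data_text.valid_ds.x, data_tab.valid_ds.x,
                         data_tab.valid_ds.y)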

The other piece was creating a ConcatModel:

#collapse_show
class ConcatModel(nn.Module):
    def __init__(self, model_tab, model_nlp, layers, drops):
        super().__init__()
        self.model_tab = model_tab
        self.model_nlp = model_nlp
        # ReLU between hidden layers; no activation on the final output layer
        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]
        lst_layers = []
        for n_in, n_out, p, actn in zip(layers[:-1], layers[1:], drops, activs):
            lst_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)  # https://docs.fast.ai/layers.html#bn_drop_lin
        self.layers = nn.Sequential(*lst_layers)

    def forward(self, *x):
        x_tab = self.model_tab(*x[0])    # tabular input is a (cats, conts) pair
        x_nlp = self.model_nlp(x[1])[0]  # text model returns a tuple; keep the output
        x = torch.cat([x_tab, x_nlp], dim=1)  # concatenate the two feature vectors
        return self.layers(x)

lst_layers depends on the sizes of the layers taken from the tabular and NLP models. These sizes are manually defined at the moment, so if changes are made to the number of layers in the tabular model, this needs to be changed by hand too.

bn_drop_lin is a fastai helper function that returns a sequence of batch normalization, dropout and a linear layer, which forms the final layers of the combined model.
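Putting it together might look something like this; model_tab, model_nlp and the feature sizes are placeholders rather than the values used in the notebooks:

# Illustrative only: suppose the trimmed tabular model emits 100 features
# and the trimmed text model emits 50, so the head maps 150 -> n_classes
n_classes = 24  # number of primary categories
model = ConcatModel(model_tab, model_nlp,
                    layers=[100 + 50, n_classes], drops=[0.2])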

How does this combined model do? 🤷‍♂️

The best result I got was an f_beta value of 0.39341238141059875 with an accuracy of 0.595348. A summary of the scores for each model:

| Model | F2 score |
|---|---|
| SentencePiece text | 0.211 |
| spaCy text | 0.333 |
| Tabular | 0.175 |
| Concat | 0.393 |

This provides some improvement over either the tabular or NLP model on its own. I found the combined model fairly tricky to train and suspect there could be some improvements in how the model is set up that would improve its performance. I am keen to try a similar approach on a dataset where more abundant information is available to train with.

tl;dr

It wasn't possible to get a very good F2 score on this website classification dataset. As the UK Web Archive say:

> We expect that an appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future. Options include:
>
> - For each site, make the titles of every page on that site available.
> - For each site, extract a set of keywords that summarise the site, via the full-text index.

I suspect that having either of these additional components would help improve the performance of the classifier.