Deep learning models usually take one type of input (image, text, etc.) to predict output labels (category, entities, etc.). This usually makes sense if the data you are using to make predictions contains a lot of information, e.g. a chunk of text from a movie review or an image.
Recently I have been playing around with a Website Classification Dataset from the UK web archive. The dataset is derived from a manually curated web archive which contains a primary and secondary category for each web page. The UK web archive has made a dataset available based on this archive which contains the manually classified subject categories alongside the page URL and the page title.
As part of playing around with this dataset I was keen to see if a multi-input model would work well, in this case a model that takes both text and tabular data as input. A preview of the data:
| | Primary Category | Secondary Category | Title | URL |
|---|---|---|---|---|
| 0 | Arts & Humanities | Architecture | 68 Dean Street | http://www.sixty8.com/ |
| 1 | Arts & Humanities | Architecture | Abandoned Communities | http://www.abandonedcommunities.co.uk/ |
| 2 | Arts & Humanities | Architecture | Alexander Thomson Society | http://www.greekthomson.com/ |
| 3 | Arts & Humanities | Architecture | Arab British Centre, The | http://www.arabbritishcentre.org.uk/ |
| 4 | Arts & Humanities | Architecture | Architectural Association School of Architecture | http://www.aaschool.ac.uk/ |
Based on this data the UK web archive are interested:
"in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives."
This is going to be fairly tricky but offers a nice excuse to try to use models with multiple inputs to predict our categories.
Predicting 104 different labels is going to be pretty difficult, so I've only used 'Primary Category' as the y target. What is the distribution of these categories like?
    Arts & Humanities                                              5299
    Government, Law & Politics                                     4832
    Business, Economy & Industry                                   2988
    Society & Culture                                              2984
    Science & Technology                                           2420
    Medicine & Health                                              2164
    Education & Research                                           2118
    Company Web Sites                                               843
    Digital Society                                                 737
    Sports and Recreation                                           710
    Religion                                                        417
    Travel & Tourism                                                374
    Social Problems and Welfare                                     270
    Politics, Political Theory and Political Systems                123
    Crime, Criminology, Police and Prisons                          101
    Literature                                                       87
    Law and Legal System                                             81
    Computer Science, Information Technology and Web Technology      54
    Libraries, Archives and Museums                                  52
    Environment                                                      38
    History                                                          34
    Publishing, Printing and Bookselling                             26
    Popular Science                                                  23
    Life Sciences                                                    23
    Name: Primary Category, dtype: int64
😬 We also have a fairly skewed dataset. I could drop some of the categories which don't occur often, but since the main objective here is to see if we can use a multi-input model, we'll leave the data as it is for now.
The rest of the notebook will describe some experiments with using fastai to create a model which takes tabular and text data as input. The aim here wasn't to create the best model but to get my head around how to combine models. I relied heavily on some existing notebooks, Kaggle write-ups and posts on the fastai forums.
In the dataset above we start off with two columns of data which can be used as inputs for the model. The title is fairly obviously something which we can treat like other text inputs. The URL is a little less obvious. It could be treated as a text input, but an alternative is to treat a URL as a set of parts, each of which contains some information that could be useful for our model.
    http://www.specialschool.org/
    http://www.bbc.co.uk/news/health-12668398
    http://www.monarchit.co.uk/
Each part of the URL could be split into smaller parts
['http://www', 'darwincountry', 'org/']
Whether a URL has '.org' or '.uk' or '.com' could be meaningful for predicting our categories (it might also not be meaningful). Splitting also offers us a way of taking the URLs and turning them into a format which looks more tabular.
So far I've only done this very crudely. I suspect tidying up this part of the data will help improve things. At this point, though, we have something a little more tabular-looking that we can pass to the fastai.tabular learner. Now we have some 'categories' rather than unique URLs.
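A minimal sketch of this kind of crude split, assuming the URL is simply split on '.' (which reproduces the example output above). The helper names, the fixed number of columns and the `#na#` padding value are all hypothetical choices for illustration, not taken from the original notebooks:

```python
from typing import Dict, List


def split_url(url: str) -> List[str]:
    """Crudely split a URL into parts on '.' characters."""
    return url.split(".")


def url_to_columns(url: str, n_cols: int = 4, pad: str = "#na#") -> Dict[str, str]:
    """Turn a URL into a fixed number of categorical columns.

    Assumption: extra parts beyond n_cols are dropped and short URLs are
    padded with a placeholder so every row has the same columns.
    """
    parts = split_url(url)[:n_cols]
    parts += [pad] * (n_cols - len(parts))
    return {f"url_part_{i}": part for i, part in enumerate(parts)}


print(split_url("http://www.darwincountry.org/"))
# ['http://www', 'darwincountry', 'org/']
print(url_to_columns("http://www.monarchit.co.uk/"))
```

Each resulting column can then be treated as a categorical variable by a tabular model. A tidier version would probably strip the scheme and trailing slashes before splitting.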
Once some preprocessing of the URL has been done we train a model using the tabular learner. I didn't do much to try to optimize this model. Tracking the best F2 score, we end up with:
Better model found at epoch 36 with f_beta value: 0.17531482875347137 and an accuracy of
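For readers unfamiliar with the metric, the F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below is a plain-Python illustration, not the fastai implementation; the macro averaging over classes is an assumption, since the averaging used in these experiments isn't stated here:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score: a weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_f_beta(y_true, y_pred, beta: float = 2.0) -> float:
    """Average the per-class F-beta scores (macro averaging, an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(f_beta(precision, recall, beta))
    return sum(scores) / len(scores)


# Always predicting the majority class gets half of these toy labels right,
# but the macro F2 score is pulled down by the class that is never predicted.
y_true = ["arts", "arts", "history", "history"]
y_pred = ["arts", "arts", "arts", "arts"]
print(macro_f_beta(y_true, y_pred))
```

With a skewed label distribution like the one above, a per-class average of this kind can be much lower than accuracy, which is one reason the two numbers are reported side by side.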
Next I tried training an NLP model on the title field. I tried a few things here.
By default fastai uses spaCy for tokenization, with a few additional special tokens added by fastai. I wanted to see if using SentencePiece would work better for processing title fields. SentencePiece supports several kinds of sub-word tokenization. This can be useful for agglutinative languages but could also be useful when you have a lot of out-of-vocabulary words in your corpus. I wanted to see if this was also useful for processing titles, since these may contain domain-specific terms. I only tried using SentencePiece with 'unigram' tokenization. The best score I got for this was:
Better model found at epoch 1 with f_beta value: 0.21195338666439056.
I compared the above to using the default fastai tokenizer which uses SpaCy. In this case the default approach worked better. This is probably because we didn't have a large pre-trained model using the SentencePiece tokenization to use as a starting point. The best score I got for this model was:
Better model found at epoch 27 with f_beta value: 0.33327043056488037.
I wanted to do a quick comparison to the tabular model and use the URL as a text input instead. In this case I used SentencePiece with byte-pair-encoding (BPE). The best score in this case was:
Better model found at epoch 3 with f_beta value: 0.2568161189556122.
This might end up being a better approach compared to the tabular approach described above.
None of these models is doing super well, but my main question was whether combining a text model with the tabular model would improve things at all. There are different approaches to combining these models. I followed existing examples and removed some layers from the text and tabular models, which are then combined in a concat model. I won't cover all the steps here but all the notebooks can be found in this GitHub repo.
One of the things we need to do to create a model with multiple inputs is create a new PyTorch dataset which combines our text and tabular x inputs with our target. This is pretty straightforward:
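    from torch.utils.data import Dataset

    class ConcatDataset(Dataset):
        """Pairs two inputs (x1, x2) with a shared target y."""
        def __init__(self, x1, x2, y):
            self.x1, self.x2, self.y = x1, x2, y

        def __len__(self):
            return len(self.y)

        def __getitem__(self, i):
            # Each item is ((input 1, input 2), target)
            return (self.x1[i], self.x2[i]), self.y[i]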
One of the other pieces was creating a ConcatModel:
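    import torch
    from torch import nn
    from fastai.layers import bn_drop_lin  # fastai v1 helper

    class ConcatModel(nn.Module):
        def __init__(self, model_tab, model_nlp, layers, drops):
            super().__init__()
            self.model_tab = model_tab
            self.model_nlp = model_nlp
            lst_layers = []
            # ReLU after every block except the last one
            activs = [nn.ReLU(inplace=True),] * (len(layers)-2) + [None]
            for n_in,n_out,p,actn in zip(layers[:-1], layers[1:], drops, activs):
                lst_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn) # https://docs.fast.ai/layers.html#bn_drop_lin
            self.layers = nn.Sequential(*lst_layers)

        def forward(self, *x):
            # Assumption: x[0] holds the tabular inputs and x[1] the text input,
            # matching the ((x1, x2), y) items returned by ConcatDataset.
            x_tab = self.model_tab(*x[0])
            # Assumption: the text model returns a tuple whose first element is its output.
            x_nlp = self.model_nlp(x[1])[0]
            x = torch.cat([x_tab, x_nlp], dim=1)
            return self.layers(x)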
lst_layers depends on the layers of the tabular and NLP models. The layer sizes are defined manually at the moment, so if changes are made to the number of layers in the tabular model they need to be changed by hand.
bn_drop_lin is a fastai helper function that returns a sequence of batch normalization, dropout and a linear layer, which is the final layer of the model.
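To make the dependency on the two sub-models a bit more concrete, here is a small sketch of how the layer sizes line up. It mirrors the zip over layers, drops and activations used in ConcatModel but stays in plain Python so it can be run without fastai. The function name and the example sizes (100 tabular features, 50 text features, one hidden layer of 64) are hypothetical; 24 is the number of primary categories listed earlier:

```python
from typing import List, Optional, Tuple


def concat_head_spec(
    tab_out: int,
    nlp_out: int,
    hidden: List[int],
    n_classes: int,
    drops: List[float],
) -> List[Tuple[int, int, float, Optional[str]]]:
    """Describe each (n_in, n_out, dropout, activation) block of the concat head.

    The first block's input size is the sum of the two sub-models' output
    sizes, because their outputs are concatenated along the feature dimension.
    """
    layers = [tab_out + nlp_out, *hidden, n_classes]
    if len(drops) != len(layers) - 1:
        raise ValueError("need one dropout value per block")
    # ReLU after every block except the last one
    activs = ["relu"] * (len(layers) - 2) + [None]
    return list(zip(layers[:-1], layers[1:], drops, activs))


for block in concat_head_spec(100, 50, hidden=[64], n_classes=24, drops=[0.5, 0.1]):
    print(block)
```

Computing the first size from the sub-models' output sizes, rather than typing it in, would avoid having to edit the head by hand whenever the tabular model changes.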
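The best result I got was an f_beta value of 0.39341238141059875 with an accuracy of 0.595348. A summary of the best F2 scores reported above for each model:

| Model | Input | F2 score |
|---|---|---|
| Tabular | URL parts | 0.175 |
| Text (SentencePiece, unigram) | Title | 0.212 |
| Text (default fastai tokenizer, SpaCy) | Title | 0.333 |
| Text (SentencePiece, BPE) | URL | 0.257 |
| Combined (concat model) | Title and URL parts | 0.393 |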
This provides some improvement over the tabular or NLP models on their own. I found the combined model was fairly tricky to train and suspect that there could be some improvements in how the model is set up that might improve its performance. I am keen to try a similar approach with a dataset where there is more abundant information available to train with.
It wasn't possible to get a very good f2 score on this website classification dataset. As the UK web archive say:
"We expect that a appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future. Options include:
- For each site, make the titles of every page on that site available.
- For each site, extract a set of keywords that summarise the site, via the full-text index."
I suspect that having either of these additional components would help improve the performance of the classifier.