Hygge Data - Cozy Content Filtering for a finer Scandinavian FineWeb

polars
huggingface
fineweb
datasets
Training lightweight, disposable web scale data curation models for Scandinavian language texts using the FineWeb-c dataset
Author

Daniel van Strien

Published

January 9, 2025

How and why to curate Web Scale Data

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. - FineWeb blog post

Whilst quality is important, the quantity of data we start with is also significant. For example, the original FineWeb dataset contains 15 trillion tokens (44TB on disk).

The challenge is then: how can we curate data for LLMs in a scalable way? One possible approach is to use LLMs to help with labelling text. Whilst LLMs can do well on this kind of task (especially when using Structured Generation), scaling LLM labeling to web scale is a challenge.

Last year, FineWeb-Edu showed that filtering data for educational quality could yield improvements for LLMs trained on the filtered data. The approach they took was to first use an LLM to label a subset of data and then fine-tune a much smaller BERT-based model on those labels to filter the full dataset.

The publication of ModernBERT has shown that there is still a lot of excitement around using smaller encoder-based models for labelling tasks. Indeed, one of the examples cited in the ModernBERT blog post is the cost of creating FineWeb-Edu had a decoder-only model been used for the filtering:

An interesting example is FineWeb-Edu, where model-based quality filtering had to be performed over 15 trillion tokens. The FineWeb-Edu team chose to generate annotations with a decoder-only model, Llama-3-70b-Instruct, and perform the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, which, at HuggingFace Inference Endpoints’ pricing of $10/hour, comes to a total of $60,000. On the other hand, feeding 15 trillion tokens to popular decoder-only models, even with the lowest-cost option of using Google’s Gemini Flash and its low inference cost of $0.075/million tokens, would cost over one million dollars! - ModernBERT blog post
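To make that comparison concrete, here is the back-of-the-envelope arithmetic behind the quoted numbers (a small sketch using only the prices cited in the ModernBERT post):

# Sketch: compare the two filtering costs quoted above.
h100_hours = 6_000
h100_price_per_hour = 10  # USD/hour, Inference Endpoints pricing cited above
encoder_cost = h100_hours * h100_price_per_hour

tokens = 15e12  # 15 trillion tokens
price_per_million_tokens = 0.075  # USD, Gemini Flash pricing cited above
decoder_cost = tokens / 1e6 * price_per_million_tokens

print(f"Encoder-based filtering: ${encoder_cost:,.0f}")  # $60,000
print(f"Decoder-only labelling: ${decoder_cost:,.0f}")  # $1,125,000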

The ideal approach then seems to be something like:

  • Use an LLM to label a subset of data and use these labels to train a smaller model
  • Use a smaller model to make predictions on a large dataset and use these predictions to filter the data
  • Profit!

While this may work well for English data, many papers have shown that LLM performance on these kinds of tasks can be much worse for other languages than it is for English.

This is where the FineWeb-c dataset comes in.

The FineWeb-C dataset

FineWeb-C is a community-driven initiative to help create high-quality datasets for training LLMs in multiple languages. Rather than relying solely on LLM-based labeling (which may perform poorly in non-English languages), the FineWeb-C project is based on the Hugging Face community submitting annotations for the educational quality of texts in different languages. The project has seen significant growth, achieving:

  • 46,457 total annotations
  • Coverage of 114 different languages
  • Contributions from 408 annotators

You can contribute yourself using your Hugging Face account here: https://huggingface.co/spaces/data-is-better-together/fineweb-c

The project has already released several versions of the dataset, currently covering 18 languages that have reached the 1,000-annotation threshold (this blog post is getting out of date even as I write it thanks to the speed of the community!). This community-driven approach helps ensure that data quality assessment isn’t limited by the capabilities of existing LLMs, particularly for lower-resource languages.

In a previous blog post I discussed how you could use Polars to filter out problematic content from the FineWeb-c dataset. That approach relied on rules and other heuristics, which can make sense if you already have good rules to hand. In that post I also took a look at the current annotated dataset, which showed that for some languages we may already have a sufficiently diverse set of annotations to train a model to label the data.

In this post we’ll see if we can already start using the FineWeb-c dataset to train a model to help curate the FineWeb-2 dataset.

Training a classifier to help curate the FineWeb-2 dataset

The long-term goal of FineWeb-C is to create a dataset that can help reproduce FineWeb-Edu for many languages. This basically means we need some data for training a model (or models) to label the educational quality of text. Let’s look at the data the community has created so far to see what we can already do with it.

First we install the necessary libraries.

%pip install polars datasets accelerate evaluate transformers torch huggingface_hub scikit-learn tensorboard wandb --upgrade 
import numpy as np
import polars as pl
from scipy.special import softmax
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    average_precision_score,
    confusion_matrix,
    precision_recall_curve
)
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification, 
    AutoTokenizer,
    TrainingArguments,
    Trainer, 
    EarlyStoppingCallback
)
from datasets import Dataset, DatasetDict
from huggingface_hub import list_repo_files

Understanding the data

We’ll load the dataset from the Hugging Face Hub using Polars which supports loading data from the Hub.

In this case, we’ll load the “default” config which contains all the languages that have reached the 1,000 annotations threshold.

df = pl.read_parquet("hf://datasets/data-is-better-together/fineweb-c/data/*.parquet")
df = df.lazy()
df.select("language_code").unique().collect().to_series().to_list()
['lvs_Latn',
 'arb_Arab',
 'asm_Latn',
 'swe_Latn',
 'dan_Latn',
 'vie_Latn',
 'gmh_Latn',
 'bar_Latn',
 'tat_Cyrl',
 'fas_Arab',
 'cmn_Hani',
 'slk_Latn',
 'ukr_Cyrl',
 'fin_Latn',
 'arz_Arab',
 'fra_Latn',
 'rus_Cyrl',
 'ary_Arab',
 'hin_Deva',
 'fil_Latn']

Can we train a model to work with a subset of languages?

So far we have 18 languages in the dataset. Currently over 100 languages have some level of annotation, so eventually we hope the community will create enough data for many more languages, ideally enough to train one model per language.

Claude gave me the following visual for the language groups in this dataset (which isn’t super accurate but gives a rough idea of the language families represented).

Figure 1: Color-coded grid visualization of 20 languages organized by 8 language families (Indo-European, Sino-Tibetan, Austronesian, Uralic, Turkic, Austro-Asiatic, Afroasiatic, Indo-Aryan) and their subgroups.

We can see a few possible language groups that it could make sense to train a model on. One potential group is the Germanic languages. Let’s look at the data for these languages.

Note

I focused on the Germanic languages to start with because, if I squint, I can get a rough understanding of them. I would be very excited to see the community begin to explore all of the languages in the dataset.

germanic_languages = ["gmh_Latn", "dan_Latn", "swe_Latn", "bar_Latn", "lvs_Latn"]

What do we want to label?

While the overarching goal of the FineWeb-C project is to create a dataset for training models to label the educational quality of text, in order to effectively train this kind of model we need a reasonable distribution of labels.

Problematic content?

The annotation interface for FineWeb-c looks something like this (using Scots as an example).

Annotation interface for Scots

For this example, the annotator can mostly focus on how educational the text is. However, this is the web we’re annotating, so we will sometimes come across “problematic” content. This is usually content in the wrong language, i.e. the language predicted during the FineWeb-2 extraction process is incorrect (this happens quite a lot for some languages and much less for others), or content that is garbled in some way. An example of this kind of content is shown below.

Problematic content example

We can see that the content is in the wrong language (English) and the text is also somewhat garbled.

Note

Why aren’t these just labeled as None, i.e. no educational value? Annotating data always comes with some ambiguity. In this case, we added a problematic label to make it easier for the community to flag content that was incorrect in some other way. We could have added a separate label for each of the different possible types of issue, but that adds extra cognitive load for the annotator. We can also deal with possible overlap in the usage of these labels in other ways.
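As a rough sense-check of that overlap, we can look at how strongly annotators agree when a text is flagged as problematic. This is a quick sketch (not part of the original analysis) using the problematic_content_label_agreement column that ships with the dataset:

# Sketch: distribution of annotator agreement on the problematic label
# for texts where at least one annotator flagged the text as problematic.
(
    df.filter(pl.col("problematic_content_label_present"))
    .select("problematic_content_label_agreement")
    .collect()
    .describe()
)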

Before we dive into training an educational quality classifier, let’s look at the data for the Germanic languages and see how often this problematic content is present. We’re using Polars in lazy mode, though it’s not really necessary at the dataset’s current size. In the future the dataset might grow large enough for this to matter more.

df_germanic = df.filter(pl.col("language_code").is_in(germanic_languages))

Let’s start by seeing the percentage of problematic content for each language in the “germanic” group.

(
    df_germanic.group_by("language_code")
    .agg(
        [
            (
                pl.col("problematic_content_label_present").sum()
                / pl.col("problematic_content_label_present").count()
                * 100
            ).alias("problematic_percentage")
        ]
    )
    .sort("problematic_percentage", descending=True)
).collect()
shape: (5, 2)
language_code problematic_percentage
str f64
"gmh_Latn" 98.3
"bar_Latn" 77.0
"dan_Latn" 19.4
"swe_Latn" 8.8
"lvs_Latn" 8.7

We can see that the percentage of problematic content is very high for some of the languages in this group, in particular Bavarian and Middle High German. The other languages have a much lower percentage of problematic content.

Making the lives of annotators easier and getting more educational content?

While we could jump straight to training a model to label the educational quality of the text, we may want to start in a more modest way. In my previous blog post I generated this plot showing the distribution of educational value labels for languages in the dataset.

Label distribution by language

While some languages have a fairly good distribution of educational value labels, most have very few examples of “Excellent” educational quality content. This is not surprising – the majority of the web is not educational…

For languages with very few examples of any educational content, training a classifier to label the educational quality of the text is not going to work well. For these languages we probably want to focus first on removing problematic content, so that we reduce the amount of “noise” annotators need to spend time on and they can instead focus on labelling content that is more likely to be educational.

For this we’ll train a model to label problematic content. We’ll start with the Scandinavian languages since we have completed Danish and Swedish datasets and these languages are somewhat similar (don’t come at me Danish and Swedish speakers – I’ve watched The Bridge!).

Note

This is not the only approach we could take but it could be a good starting point. Even if we don’t end up with sufficient data to train a model to label the educational quality of the text, we can still use this model to remove problematic content and improve the quality of the data we have for some languages.

Training our hygge model

Let’s start by loading the data for the Scandinavian languages.

scandinavian_languages = ["swe_Latn", "dan_Latn"]
df_scandinavian = df.filter(pl.col("language_code").is_in(scandinavian_languages))

We can see that we have 1,000 annotated texts for each language.

df_scandinavian.collect().shape
(2000, 8)
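To double-check that split (a quick sketch, not in the original post), we can count rows per language directly:

# Sketch: confirm we have 1,000 rows for each of the two languages.
(
    df_scandinavian.group_by("language_code")
    .agg(pl.len().alias("n_texts"))
    .collect()
)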

Now let’s get a better understanding of the data. Swedish and Danish both have multiple annotations for many of the texts. This means multiple annotators have looked at the same text and given their assessment of the educational value. Let’s take a look at some examples where the annotators disagree i.e. gave different labels.

df_scandinavian.filter(
    pl.col("educational_value_labels").list.unique().list.len() > 1
).collect()
shape: (524, 8)
id text educational_value_labels annotator_ids problematic_content_label_present problematic_content_label_agreement language_names language_code
str str list[str] list[str] bool f64 str str
"d0347a16-14a6-40c7-a1b8-3d27b9… "Virkelige får de vil ha sjæl o… ["❗ Problematic Content ❗", "None", "❗ Problematic Content ❗"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "85ac8d54-89c5-4473-95c4-797366f03cd0", "e9f72b47-2af5-4b06-90f2-7163de147a1d"] true 0.666667 "dan_Latn" "dan_Latn"
"ec7699c9-78e2-48ef-945e-9b0a71… "Alle drømmer om den store gevi… ["None", "Minimal", "Basic"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "85ac8d54-89c5-4473-95c4-797366f03cd0", "9987848b-debb-4ed3-a97b-14eb9b3c4322"] false 1.0 "dan_Latn" "dan_Latn"
"49de8369-2b33-47d2-a877-2fe32b… "Der er en elektrisk forbindels… ["Basic", "None", "None"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "b98b0144-391d-4e70-bae0-743ce94e6314", "85ac8d54-89c5-4473-95c4-797366f03cd0"] false 1.0 "dan_Latn" "dan_Latn"
"b20402d1-8250-410c-b177-966b8b… "Online shopping råd - Levering… ["Basic", "Minimal"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "85ac8d54-89c5-4473-95c4-797366f03cd0"] false 1.0 "dan_Latn" "dan_Latn"
"dc34ee93-74e2-48ba-9b5c-9d4dfb… "Morgenmad er det vigtigste mål… ["Basic", "Basic", "None"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "85ac8d54-89c5-4473-95c4-797366f03cd0", "4e0a264e-6445-495f-ae54-8e0755b8ebd0"] false 1.0 "dan_Latn" "dan_Latn"
"aee9b105-18c3-455d-b07c-545eb2… "Dronning Victoria afskyede sin… ["Minimal", "None", "❗ Problematic Content ❗"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "9987848b-debb-4ed3-a97b-14eb9b3c4322", "85ac8d54-89c5-4473-95c4-797366f03cd0"] true 0.333333 "dan_Latn" "dan_Latn"
"42abb527-d3b8-4b23-b88a-d3df06… "Smarte Opbevaringsløsninger ti… ["Minimal", "None"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "85ac8d54-89c5-4473-95c4-797366f03cd0"] false 1.0 "dan_Latn" "dan_Latn"
"7076624a-7b72-4534-bfdc-a4b6fc… "Power Automate er under udbred… ["Basic", "None", "Minimal"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "740270b9-61bf-4d85-a495-9e37270f7257", "82197ecd-6d0b-400a-834a-703da28164ae"] false 1.0 "dan_Latn" "dan_Latn"
"a97595d9-61b4-4ae7-953c-654a85… "Kl. 17.30 - 19.30 Home Concert… ["None", "Minimal", "None"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "9987848b-debb-4ed3-a97b-14eb9b3c4322", "85ac8d54-89c5-4473-95c4-797366f03cd0"] false 1.0 "dan_Latn" "dan_Latn"
"d3c4b487-6976-45bf-9a81-2c8208… "J. S. RASCH VIN & SPIRITUS J. … ["Basic", "Minimal", "None"] ["a0585a5c-b72f-4c3a-a2a3-17e8e0b4ea4f", "82197ecd-6d0b-400a-834a-703da28164ae", "85ac8d54-89c5-4473-95c4-797366f03cd0"] false 1.0 "dan_Latn" "dan_Latn"

Even with a quick eyeball we can see that “problematic” and None are often used to label the same text. Similarly, “Minimal” and “None” are often applied to the same text. This isn’t so surprising since educational quality is fairly subjective. The main thing we probably don’t want too much of is disagreement between very extreme labels, i.e. None vs Excellent. We can take a closer look at the combinations of labels that are used. Let’s take a look at the unique combinations of non-agreeing labels.
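As a quick sanity check on those extreme cases (a sketch that wasn’t in the original analysis), we can count how many texts received both a None and an Excellent label:

# Sketch: count texts labelled both "None" and "Excellent" by different annotators.
(
    df_scandinavian.filter(
        pl.col("educational_value_labels").list.contains("None")
        & pl.col("educational_value_labels").list.contains("Excellent")
    )
    .select(pl.len())
    .collect()
)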

Code
(
    df_scandinavian.filter(
        pl.col("educational_value_labels").list.unique().list.len() > 1
    )
    .select(
        pl.col("educational_value_labels").list.sort().alias("educational_value_labels")
    )
    .unique()
).collect()
shape: (37, 1)
educational_value_labels
list[str]
["Basic", "Good", "Minimal"]
["Good", "Minimal"]
["Basic", "Good"]
["Minimal", "Minimal", "❗ Problematic Content ❗"]
["Basic", "None", "❗ Problematic Content ❗"]
["Good", "None", "None"]
["None", "❗ Problematic Content ❗", "❗ Problematic Content ❗"]
["Good", "Good", "Minimal"]
["Basic", "Minimal", "❗ Problematic Content ❗"]
["Basic", "Basic", "Good"]

From a quick eyeball we don’t seem to have disagreement that is too extreme. Let’s get a better understanding of the co-occurrence of labels when annotators disagree.

Code
combinations = (
    (
        df_scandinavian.filter(
            pl.col("educational_value_labels").list.unique().list.len() > 1
        )
        .select(
            pl.col("educational_value_labels")
            .list.unique()
            .list.sort()
            .alias("educational_value_labels")
        )
        .collect()
    )
    .to_series()
    .to_list()
)
combinations[:10]
[['None', '❗ Problematic Content ❗'],
 ['Basic', 'Minimal', 'None'],
 ['Basic', 'None'],
 ['Basic', 'Minimal'],
 ['Basic', 'None'],
 ['Basic', 'None'],
 ['Basic', 'Minimal', 'None'],
 ['None', '❗ Problematic Content ❗'],
 ['None', '❗ Problematic Content ❗'],
 ['Minimal', 'None']]

We can summarise these co-occurrences using some code Claude gave me.

Code
import pandas as pd
import numpy as np

# First, let's get all unique labels that appear
all_labels = set()
for combo in combinations:  # combinations is your list of lists
    all_labels.update(combo)
all_labels = sorted(list(all_labels))

# Create a co-occurrence matrix
cooc_matrix = pd.DataFrame(0, index=all_labels, columns=all_labels)

# Fill the matrix
for combo in combinations:
    for label1 in combo:
        for label2 in combo:
            if label1 != label2:
                cooc_matrix.loc[label1, label2] += 1

# Convert to percentage of times labels co-occur
total_occurrences = cooc_matrix.sum().sum()
cooc_matrix_pct = cooc_matrix / total_occurrences * 100

# Print most common co-occurrences
pairs = []
for i in range(len(all_labels)):
    for j in range(i + 1, len(all_labels)):
        label1, label2 = all_labels[i], all_labels[j]
        count = cooc_matrix.loc[label1, label2]
        if count > 0:
            pairs.append((label1, label2, count))

# Sort by count
pairs.sort(key=lambda x: x[2], reverse=True)

# Print top co-occurrences
print("Most common label combinations:")
for label1, label2, count in pairs[:10]:
    print(f"{label1} + {label2}: {count} occurrences")
Most common label combinations:
Minimal + None: 305 occurrences
Basic + Minimal: 96 occurrences
None + ❗ Problematic Content ❗: 84 occurrences
Basic + None: 69 occurrences
Good + Minimal: 24 occurrences
Minimal + ❗ Problematic Content ❗: 23 occurrences
Good + None: 12 occurrences
Basic + Good: 11 occurrences
Basic + ❗ Problematic Content ❗: 7 occurrences
Basic + Excellent: 5 occurrences

We see here that Minimal and None are the most common labels when annotators disagree. We also see some “problematic” labels paired with None. In the FineWeb-c dataset, problematic_content_label_present is a boolean column that is True if any of the annotators labeled the text as problematic. We want to check that this wouldn’t capture too many examples that another annotator would rate highly. If we train a classifier to remove problematic content, it may also remove some examples that would otherwise be labelled None or possibly Minimal, but since we’re mostly seeking higher educational quality data this isn’t really a problem.
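To check this a little more directly, here is a rough sketch (not from the original post) that looks at which educational labels the other annotators gave to texts flagged as problematic by at least one annotator (the problematic label string is copied exactly as it appears in the dataset output above):

# Sketch: for texts where at least one annotator used the problematic label,
# tally the educational labels given by the other annotators.
(
    df_scandinavian.filter(pl.col("problematic_content_label_present"))
    .explode("educational_value_labels")
    .filter(pl.col("educational_value_labels") != "❗ Problematic Content ❗")
    .group_by("educational_value_labels")
    .agg(pl.len().alias("count"))
    .sort("count", descending=True)
    .collect()
)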

Preparing the data for training

Let’s remind ourselves of the percentage of problematic content for each language we’re working with.

Code
(
    df_scandinavian.group_by("language_code").agg(
        [
            (
                pl.col("problematic_content_label_present").sum()
                / pl.col("problematic_content_label_present").count()
                * 100
            ).alias("problematic_percentage")
        ]
    )
).collect()
shape: (2, 2)
language_code problematic_percentage
str f64
"swe_Latn" 8.8
"dan_Latn" 19.4

Let’s now convert our LazyFrame to a Polars DataFrame so it’s a bit easier to pass to other libraries.

df_scandinavian = df_scandinavian.collect()

Train / test split

Creating a good train/test split is important for making sure we train a model that generalizes well. We’ll use a stratified split to ensure that the train and test sets have a similar distribution of labels. Since we’re working with two languages, we probably want to stratify on language too.

Code
# Create stratification column
df_scandinavian = df_scandinavian.with_columns(
    strat_col=pl.col("language_code")
    + "_"
    + pl.col("problematic_content_label_present").cast(pl.Utf8)
)

# Convert to numpy for sklearn
X = df_scandinavian.select(["id", "text"]).to_numpy()  # Including id for tracking
y = df_scandinavian.select("problematic_content_label_present").to_numpy()
strat = df_scandinavian.select("strat_col").to_numpy()

# Create stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=strat, random_state=42
)

# Convert back to Polars DataFrames with all relevant columns
train_indices = set(X_train[:, 0])  # Assuming first column is id
test_indices = set(X_test[:, 0])

train_df = df_scandinavian.filter(pl.col("id").is_in(train_indices))
test_df = df_scandinavian.filter(pl.col("id").is_in(test_indices))

Let’s take a look at the distribution of labels in the train and test set by language and label (problematic or not).

Code
print("\nTrain Set:")
# Problematic content percentage by language
print("Label distribution within each language:")
print(
    (
        train_df.group_by("language_code")
        .agg(
            [
                (
                    pl.col("problematic_content_label_present").sum()
                    / pl.col("problematic_content_label_present").count()
                    * 100
                ).alias("problematic_percentage"),
                pl.col("problematic_content_label_present")
                .count()
                .alias("total_count"),
            ]
        )
        .sort("language_code")
        .with_columns(pl.col("problematic_percentage").round(2))
    )
)

# Language distribution
print("\nLanguage distribution in train set:")
print(
    (
        train_df.group_by("language_code")
        .agg(
            (pl.len() / train_df.height * 100).alias("percentage_of_split"),
            pl.len().alias("count"),
        )
        .sort("language_code")
        .with_columns(pl.col("percentage_of_split").round(2))
    )
)

print("\nTest Set:")
# Problematic content percentage by language
print("Label distribution within each language:")
print(
    (
        test_df.group_by("language_code")
        .agg(
            [
                (
                    pl.col("problematic_content_label_present").sum()
                    / pl.col("problematic_content_label_present").count()
                    * 100
                ).alias("problematic_percentage"),
                pl.col("problematic_content_label_present")
                .count()
                .alias("total_count"),
            ]
        )
        .sort("language_code")
        .with_columns(pl.col("problematic_percentage").round(2))
    )
)

# Language distribution
print("\nLanguage distribution in test set:")
print(
    (
        test_df.group_by("language_code")
        .agg(
            (pl.len() / test_df.height * 100).alias("percentage_of_split"),
            pl.len().alias("count"),
        )
        .sort("language_code")
        .with_columns(pl.col("percentage_of_split").round(2))
    )
)

Train Set:
Label distribution within each language:
shape: (2, 3)
┌───────────────┬────────────────────────┬─────────────┐
│ language_code ┆ problematic_percentage ┆ total_count │
│ ---           ┆ ---                    ┆ ---         │
│ str           ┆ f64                    ┆ u32         │
╞═══════════════╪════════════════════════╪═════════════╡
│ dan_Latn      ┆ 19.38                  ┆ 800         │
│ swe_Latn      ┆ 8.75                   ┆ 800         │
└───────────────┴────────────────────────┴─────────────┘

Language distribution in train set:
shape: (2, 3)
┌───────────────┬─────────────────────┬───────┐
│ language_code ┆ percentage_of_split ┆ count │
│ ---           ┆ ---                 ┆ ---   │
│ str           ┆ f64                 ┆ u32   │
╞═══════════════╪═════════════════════╪═══════╡
│ dan_Latn      ┆ 50.0                ┆ 800   │
│ swe_Latn      ┆ 50.0                ┆ 800   │
└───────────────┴─────────────────────┴───────┘

Test Set:
Label distribution within each language:
shape: (2, 3)
┌───────────────┬────────────────────────┬─────────────┐
│ language_code ┆ problematic_percentage ┆ total_count │
│ ---           ┆ ---                    ┆ ---         │
│ str           ┆ f64                    ┆ u32         │
╞═══════════════╪════════════════════════╪═════════════╡
│ dan_Latn      ┆ 19.5                   ┆ 200         │
│ swe_Latn      ┆ 9.0                    ┆ 200         │
└───────────────┴────────────────────────┴─────────────┘

Language distribution in test set:
shape: (2, 3)
┌───────────────┬─────────────────────┬───────┐
│ language_code ┆ percentage_of_split ┆ count │
│ ---           ┆ ---                 ┆ ---   │
│ str           ┆ f64                 ┆ u32   │
╞═══════════════╪═════════════════════╪═══════╡
│ dan_Latn      ┆ 50.0                ┆ 200   │
│ swe_Latn      ┆ 50.0                ┆ 200   │
└───────────────┴─────────────────────┴───────┘

Loading as HuggingFace Dataset

We’ll now load the data as a HuggingFace Dataset. We’ll first convert the problematic_content_label_present column to an integer column.


train_df = train_df.with_columns(
    pl.col("problematic_content_label_present").cast(pl.Int32)
)
test_df = test_df.with_columns(
    pl.col("problematic_content_label_present").cast(pl.Int32)
)
train_ds = Dataset.from_polars(train_df)
test_ds = Dataset.from_polars(test_df)

We rename the problematic_content_label_present column to labels to match the expected column name for the Transformers Trainer.

train_ds = train_ds.rename_column("problematic_content_label_present", "labels")
test_ds = test_ds.rename_column("problematic_content_label_present", "labels")

Fine-tuning a model

We’ll now fine-tune a model to predict the problematic_content_label_present column. To do this we’ll want a fill-mask (encoder) model which supports the languages we’re working with. We can find candidate models on the Hugging Face Hub using this URL:

https://huggingface.co/models?pipeline_tag=fill-mask&language=da,sv&sort=trending

We can try out a few options, but we’ll start with the FacebookAI/xlm-roberta-base model.
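If you prefer to do this search programmatically, here is a minimal sketch (assuming a reasonably recent huggingface_hub release, where list_models accepts task and language filters; the sort key is an illustrative choice):

from huggingface_hub import HfApi

api = HfApi()
# Sketch: list a handful of popular fill-mask models tagged with Danish.
for model in api.list_models(task="fill-mask", language="da", sort="downloads", limit=5):
    print(model.id)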

Defining metrics

We’ll define a function to compute the metrics we want to use to evaluate the model. Since we’re working with an imbalanced dataset, we’ll want a few different metrics. We’re probably going a bit overboard here, but since the dataset is small it can be useful to have a few extra metrics to look at to understand the model’s performance.

def compute_metrics(pred):
    """
    Compute metrics including AUC-ROC for the minority class.
    """
    # Get labels
    labels = pred.label_ids

    # Convert logits to probabilities using softmax 
    probs = softmax(pred.predictions, axis=1)
    # Get probability scores for the minority class (assuming it's label 1)
    minority_probs = probs[:, 1]

    # Get predicted class (argmax of logits)
    preds = np.argmax(pred.predictions, axis=1)

    # Calculate standard metrics
    precision = precision_score(labels, preds)
    recall = recall_score(labels, preds)
    f1 = f1_score(labels, preds)
    
    # Calculate additional metrics for imbalanced classification
    cm = confusion_matrix(labels, preds)
    tn, fp, fn, tp = cm.ravel()
    specificity = tn / (tn + fp)  # True negative rate
    balanced_acc = (recall + specificity) / 2  # Balanced accuracy
    auc_roc = roc_auc_score(labels, minority_probs)
    avg_precision = average_precision_score(labels, minority_probs)  # Area under PR curve

    return {
        "precision": precision,
        "recall": recall, 
        "f1": f1,
        "auc_roc": auc_roc,
        "balanced_accuracy": balanced_acc,
        "average_precision": avg_precision
    }
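To sanity-check the metric function outside the Trainer, we can call it on a tiny hand-made prediction object (a sketch; the numbers are made up and SimpleNamespace simply mimics the label_ids/predictions attributes the Trainer passes in):

from types import SimpleNamespace

# Sketch: run compute_metrics on fake logits to see the shape of its output.
fake_pred = SimpleNamespace(
    label_ids=np.array([0, 0, 1, 1]),
    predictions=np.array([[2.0, -1.0], [1.5, -0.5], [-1.0, 2.0], [0.5, 0.2]]),
)
print(compute_metrics(fake_pred))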

Setting up the training

I find it nice to have a mapping between the labels and the ids so later I don’t need to remember which label is which id.

possible_labels = (
    df_scandinavian.select("problematic_content_label_present")
    .unique()
    .to_series()
    .to_list()
)
possible_labels
[False, True]
label2id = {label: i for i, label in enumerate(possible_labels)}
id2label = {0: "not_problematic", 1: "problematic"}

Authenticating with HuggingFace

We’ll need to authenticate with HuggingFace to push the model to the Hub.

from huggingface_hub import login
login()

Logging with Weights & Biases

We’ll also log the training with Weights & Biases.

import wandb
wandb.login()
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
  ········
wandb: Appending key for api.wandb.ai to your netrc file: /home/user/.netrc
True

The training code is not super interesting or particularly elegant. I just wanted to get something working.

Code
def train_model(
    train_ds,
    test_ds,
    hub_model_id,
    pre_trained_model_name="distilbert/distilbert-base-multilingual-cased",
    num_epochs=20,
    batch_size=128,
    label2id=None,
    id2label=None,
):
    """
    Train and evaluate the model with additional metrics for imbalanced classification.

    Args:
        train_ds: Training dataset
        test_ds: Test dataset
        hub_model_id: Model ID for pushing to HuggingFace Hub
        pre_trained_model_name: Name of pretrained model to use
        num_epochs: Number of training epochs
        batch_size: Batch size for training
        label2id: Dictionary mapping labels to IDs
        id2label: Dictionary mapping IDs to labels
    """
    tokenizer = AutoTokenizer.from_pretrained(pre_trained_model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        pre_trained_model_name,
        num_labels=2,  # Binary classification
        label2id=label2id,
        id2label=id2label,
    )

    def tokenize_function(examples):
        """
        Tokenize the text data with proper padding and truncation.
        """
        return tokenizer(
            examples["text"], padding=True, truncation=True, max_length=512
        )

    split_dataset = DatasetDict({"train": train_ds, "test": test_ds})

    # Tokenize datasets
    tokenized_train = split_dataset["train"].map(tokenize_function, batched=True)
    tokenized_val = split_dataset["test"].map(tokenize_function, batched=True)

    print(f"Tokenized train dataset: {tokenized_train}")
    print(f"Tokenized val dataset: {tokenized_val}")

    # Set up training arguments
    training_args = TrainingArguments(
        output_dir="/data/results",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=2e-5,
        weight_decay=0.01,
        push_to_hub=True,
        eval_strategy="steps",
        eval_steps=100,
        logging_steps=100,
        save_strategy="steps",
        load_best_model_at_end=True,
        metric_for_best_model="auc_roc",  # Using AUC-ROC for model selection
        greater_is_better=True,
        save_total_limit=20,
        hub_model_id=hub_model_id,
        fp16=True,
        save_safetensors=False,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,  # Using tokenizer instead of processing_class
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        compute_metrics=compute_metrics,
        callbacks=[
            EarlyStoppingCallback(
                early_stopping_patience=8, early_stopping_threshold=0.001
            )
        ],
    )

    # Train the model
    trainer.train()

    # Evaluate the model
    eval_results = trainer.evaluate()

    return trainer, eval_results


def main(
    train_ds,
    test_ds,
    hub_model_id,
    pre_trained_model_name="distilbert/distilbert-base-multilingual-cased",
    num_epochs=20,
    batch_size=128,
    label2id=None,
    id2label=None,
):
    """
    Main training function that handles model training and evaluation.

    Args:
        train_ds: Training dataset
        test_ds: Test dataset
        hub_model_id: Model ID for pushing to HuggingFace Hub
        pre_trained_model_name: Name of pretrained model to use
        num_epochs: Number of training epochs
        batch_size: Batch size for training
        label2id: Dictionary mapping labels to IDs
        id2label: Dictionary mapping IDs to labels
    """
    # Train and evaluate the model
    trainer, eval_results = train_model(
        train_ds=train_ds,
        test_ds=test_ds,
        hub_model_id=hub_model_id,
        pre_trained_model_name=pre_trained_model_name,
        num_epochs=num_epochs,
        batch_size=batch_size,
        label2id=label2id,
        id2label=id2label,
    )

    # Print evaluation results with all metrics
    print("\nEvaluation Results:")
    print(f"F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"Precision: {eval_results['eval_precision']:.4f}")
    print(f"Recall: {eval_results['eval_recall']:.4f}")
    print(f"AUC-ROC (minority class): {eval_results['eval_auc_roc']:.4f}")
    print(
        f"Average Precision (minority class): {eval_results['eval_average_precision']:.4f}"
    )
    print(f"Balanced Accuracy: {eval_results['eval_balanced_accuracy']:.4f}")

    return trainer, eval_results
trainer, results = main(
    train_ds,
    test_ds,
    hub_model_id="davanstrien/scandi-fine-web-cleaner",
    pre_trained_model_name="FacebookAI/xlm-roberta-base",
    num_epochs=30,
    batch_size=16,
    label2id=label2id,
    id2label=id2label,
)
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Tokenized train dataset: Dataset({
    features: ['id', 'text', 'educational_value_labels', 'annotator_ids', 'labels', 'problematic_content_label_agreement', 'language_names', 'language_code', 'strat_col', 'input_ids', 'attention_mask'],
    num_rows: 1600
})
Tokenized val dataset: Dataset({
    features: ['id', 'text', 'educational_value_labels', 'annotator_ids', 'labels', 'problematic_content_label_agreement', 'language_names', 'language_code', 'strat_col', 'input_ids', 'attention_mask'],
    num_rows: 400
})
/home/user/miniconda/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/tmp/ipykernel_77/3658588944.py:156: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(
[1000/3000 02:00 < 04:01, 8.29 it/s, Epoch 10/30]
Step Training Loss Validation Loss Precision Recall F1 Specificity Npv Auc Roc Average Precision True Positives False Positives True Negatives False Negatives Minority Class Ratio Predicted Minority Ratio Best F1 Threshold Best F1 Score Default Precision Default Recall Default F1 Balanced Accuracy
100 0.282200 0.233291 0.851064 0.701754 0.769231 0.979592 0.951841 0.912562 0.808400 40 7 336 17 0.142500 0.117500 0.710879 0.769231 0.800000 0.701754 0.747664 0.840673
200 0.176000 0.255909 0.941176 0.561404 0.703297 0.994169 0.931694 0.932229 0.821460 32 2 341 25 0.142500 0.085000 0.060919 0.766355 0.942857 0.578947 0.717391 0.777786
300 0.155400 0.284004 0.948718 0.649123 0.770833 0.994169 0.944598 0.904225 0.812967 37 2 341 20 0.142500 0.097500 0.792384 0.778947 0.948718 0.649123 0.770833 0.821646
400 0.141300 0.293026 0.816327 0.701754 0.754717 0.973761 0.951567 0.930004 0.826913 40 9 334 17 0.142500 0.122500 0.938463 0.778947 0.816327 0.701754 0.754717 0.837758
500 0.113500 0.284364 0.972222 0.614035 0.752688 0.997085 0.939560 0.927190 0.824227 35 1 342 22 0.142500 0.090000 0.013480 0.784314 0.945946 0.614035 0.744681 0.805560
600 0.088200 0.365192 0.880952 0.649123 0.747475 0.985423 0.944134 0.923252 0.818148 37 5 338 20 0.142500 0.105000 0.989013 0.760870 0.880952 0.649123 0.747475 0.817273
700 0.078500 0.360336 0.972222 0.614035 0.752688 0.997085 0.939560 0.923047 0.819323 35 1 342 22 0.142500 0.090000 0.196286 0.770833 0.945946 0.614035 0.744681 0.805560
800 0.030700 0.381501 0.759259 0.719298 0.738739 0.962099 0.953757 0.918853 0.826950 41 13 330 16 0.142500 0.135000 0.998579 0.787234 0.745455 0.719298 0.732143 0.840699
900 0.029300 0.486526 0.971429 0.596491 0.739130 0.997085 0.936986 0.912894 0.812492 34 1 342 23 0.142500 0.087500 0.006314 0.770833 0.972222 0.614035 0.752688 0.796788
1000 0.015900 0.468468 1.000000 0.649123 0.787234 1.000000 0.944904 0.909672 0.804402 37 0 343 20 0.142500 0.092500 0.644980 0.787234 1.000000 0.649123 0.787234 0.824561

Could not locate the best model at /data/results/checkpoint-200/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.

Evaluation Results:
F1 Score: 0.7872
Precision: 1.0000
Recall: 0.6491
AUC-ROC (minority class): 0.9097
Average Precision (minority class): 0.8044
!pip install matplotlib
def analyze_thresholds(trainer, min_precision=0.9, min_threshold=0.5, fig_size=(15, 5)):
    """
    Analyze model performance across different thresholds using the evaluation dataset.
    Finds the lowest threshold that maintains the minimum precision requirement while
    staying above a minimum threshold floor.

    Args:
        trainer: HuggingFace Trainer instance
        min_precision: Minimum precision requirement (default: 0.9)
        min_threshold: Minimum allowed threshold for binary classification (default: 0.5)
        fig_size: Figure size for plots (default: (15, 5))

    Returns:
        dict: Dictionary containing optimal threshold metrics and probability statistics
    """
    import numpy as np
    from scipy.special import softmax
    from sklearn.metrics import (
        precision_score,
        recall_score,
        f1_score,
        precision_recall_curve,
    )
    import matplotlib.pyplot as plt

    def calculate_metrics_at_threshold(probs, true_labels, threshold):
        """Helper function to calculate metrics at a given threshold"""
        preds = (probs >= threshold).astype(int)
        prec = precision_score(true_labels, preds, zero_division=0)
        rec = recall_score(true_labels, preds, zero_division=0)
        f1 = 2 * (prec * rec) / (prec + rec) if (prec + rec) > 0 else 0
        return prec, rec, f1

    # Get predictions
    predictions = trainer.predict(trainer.eval_dataset)
    probs = softmax(predictions.predictions, axis=1)
    minority_probs = probs[:, 1]  # Probabilities for positive class
    true_labels = predictions.label_ids

    # Calculate precision-recall curve
    precisions, recalls, thresholds = precision_recall_curve(
        true_labels, minority_probs
    )

    # Find optimal threshold meeting both minimum precision and threshold requirements
    valid_indices = np.where(
        (precisions[:-1] >= min_precision) & (thresholds >= min_threshold)
    )[0]

    if len(valid_indices) > 0:
        # Take lowest threshold that meets both criteria
        optimal_idx = valid_indices[0]
        optimal_threshold = thresholds[optimal_idx]
        optimal_precision = precisions[optimal_idx]
        optimal_recall = recalls[optimal_idx]
    else:
        # If no threshold meets both criteria, find best precision among valid thresholds
        valid_thresholds_idx = np.where(thresholds >= min_threshold)[0]
        if len(valid_thresholds_idx) > 0:
            optimal_idx = valid_thresholds_idx[
                np.argmax(precisions[valid_thresholds_idx])
            ]
            optimal_threshold = thresholds[optimal_idx]
            optimal_precision = precisions[optimal_idx]
            optimal_recall = recalls[optimal_idx]
        else:
            # Fallback to minimum threshold if no valid thresholds found
            optimal_threshold = min_threshold
            optimal_preds = (minority_probs >= min_threshold).astype(int)
            optimal_precision = precision_score(
                true_labels, optimal_preds, zero_division=0
            )
            optimal_recall = recall_score(true_labels, optimal_preds, zero_division=0)

    # Create plots
    plt.figure(figsize=fig_size)

    # Plot 1: Precision-Recall curve
    plt.subplot(1, 2, 1)
    plt.plot(recalls, precisions, label="Precision-Recall Curve")
    plt.scatter(
        [optimal_recall],
        [optimal_precision],
        color="red",
        label=f"Threshold={optimal_threshold:.2f}\nPrecision={optimal_precision:.2f}\nRecall={optimal_recall:.2f}",
    )
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve")
    plt.grid(True)
    plt.legend()

    # Plot 2: Metrics vs Threshold
    max_prob = np.max(minority_probs)
    min_prob = np.min(minority_probs)

    # Create threshold range with denser sampling near optimal point
    margin = 0.1
    threshold_range = np.unique(
        np.concatenate(
            [
                np.linspace(min_threshold, optimal_threshold - margin, 40),
                np.linspace(optimal_threshold - margin, optimal_threshold + margin, 20),
                np.linspace(optimal_threshold + margin, max_prob, 40),
            ]
        )
    )
    threshold_range = np.clip(threshold_range, min_threshold, max_prob)

    # Calculate metrics for each threshold
    metrics = [
        calculate_metrics_at_threshold(minority_probs, true_labels, t)
        for t in threshold_range
    ]
    precisions_plot, recalls_plot, f1_scores = zip(*metrics)

    plt.subplot(1, 2, 2)
    plt.plot(threshold_range, precisions_plot, label="Precision")
    plt.plot(threshold_range, recalls_plot, label="Recall")
    plt.plot(threshold_range, f1_scores, label="F1", linestyle="--")
    plt.axvline(
        x=optimal_threshold,
        color="red",
        linestyle="--",
        label=f"Optimal Threshold={optimal_threshold:.2f}",
    )
    plt.axvline(
        x=min_threshold,
        color="gray",
        linestyle=":",
        label=f"Min Threshold={min_threshold:.2f}",
    )
    plt.xlabel("Threshold")
    plt.ylabel("Score")
    plt.title("Metrics vs Threshold")
    plt.grid(True)
    plt.legend()

    plt.tight_layout()
    plt.show()

    # Calculate final metrics and probability statistics
    optimal_preds = (minority_probs >= optimal_threshold).astype(int)
    f1 = f1_score(true_labels, optimal_preds)
    mean_prob = np.mean(minority_probs)

    print(f"\nProbability Distribution:")
    print(f"Min probability: {min_prob:.3f}")
    print(f"Max probability: {max_prob:.3f}")
    print(f"Mean probability: {mean_prob:.3f}")

    return {
        "optimal_threshold": optimal_threshold,
        "optimal_precision": optimal_precision,
        "optimal_recall": optimal_recall,
        "optimal_f1": f1,
        "min_prob": min_prob,
        "max_prob": max_prob,
        "mean_prob": mean_prob,
    }


# Example usage
results = analyze_thresholds(
    trainer,
    min_precision=0.9,
    min_threshold=0.5,  # Enforce minimum threshold of 0.5
)


Probability Distribution:
Min probability: 0.000
Max probability: 1.000
Mean probability: 0.091
# Get label distribution
from collections import Counter

label_counts = Counter(trainer.eval_dataset["labels"])
print(label_counts)  # Should show more 0s than 1s if 1 is minority class
Counter({0: 343, 1: 57})

Push the model to the hub

trainer.evaluate()
{'eval_loss': 0.28473082184791565,
 'eval_f1': 0.7578947368421053,
 'eval_precision': 0.9473684210526315,
 'eval_recall': 0.631578947368421,
 'eval_runtime': 0.6817,
 'eval_samples_per_second': 586.729,
 'eval_steps_per_second': 36.671,
 'epoch': 13.0}
trainer.push_to_hub(dataset=["data-is-better-together/fineweb-c"])
CommitInfo(commit_url='https://huggingface.co/davanstrien/scandi-fine-web-cleaner/commit/51487ef5c06440aa26c260621fdacaf5645cbfdd', commit_message='End of training', commit_description='', oid='51487ef5c06440aa26c260621fdacaf5645cbfdd', pr_url=None, repo_url=RepoUrl('https://huggingface.co/davanstrien/scandi-fine-web-cleaner', endpoint='https://huggingface.co', repo_type='model', repo_id='davanstrien/scandi-fine-web-cleaner'), pr_revision=None, pr_num=None)

Using the model to filter FineWeb-2

paths = list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
paths[:10]
['.gitattributes',
 'README.md',
 'data/aai_Latn/test/000_00000.parquet',
 'data/aai_Latn/train/000_00000.parquet',
 'data/aai_Latn_removed/train/000_00000.parquet',
 'data/aak_Latn/test/000_00000.parquet',
 'data/aak_Latn/train/000_00000.parquet',
 'data/aak_Latn_removed/train/000_00000.parquet',
 'data/aau_Latn/test/000_00000.parquet',
 'data/aau_Latn/train/000_00000.parquet']
danish = [
    f for f in paths if ("dan" in f and f.endswith("parquet") and "removed" not in f)
]
swedish = [
    f for f in paths if ("swe" in f and f.endswith("parquet") and "removed" not in f)
]
danish_lf = pl.scan_parquet(
    [f"hf://datasets/HuggingFaceFW/fineweb-2/{f}" for f in danish]
)
danish_df = danish_lf.head(10_000).collect()
danish_df
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="davanstrien/scandi-fine-web-cleaner",
    truncation=True,  # Enable truncation
    max_length=512,  # Set maximum length
    batch_size=32,
)
Device set to use cuda:0
texts = danish_df.select("text").to_series().to_list()
pipe(texts[0])
len(texts)

Let’s see how long it takes to predict on 10000 texts. While I used an A100 Hugging Face Jupyter Notebook Space for the model training, I’m using my 2021 MacBook Pro M1 for this part.

%%time
predictions = pipe(texts)
CPU times: user 26.6 s, sys: 16.1 s, total: 42.7 s
Wall time: 4min 22s
predictions[0]
{'label': 'LABEL_0', 'score': 0.9997074007987976}
df_results = pl.DataFrame(predictions).rename(
    {
        "label": "problematic_content_label_present",
        "score": "problematic_content_label_present_score",
    }
)
df_results
shape: (10_000, 2)
problematic_content_label_present problematic_content_label_present_score
str f64
"LABEL_0" 0.999707
"LABEL_0" 0.99975
"LABEL_0" 0.999737
"LABEL_0" 0.999724
"LABEL_0" 0.999745
"LABEL_0" 0.999757
"LABEL_0" 0.999758
"LABEL_0" 0.999744
"LABEL_0" 0.991189
"LABEL_0" 0.999684
df_with_labels = pl.concat([danish_df, df_results], how="horizontal")
df_with_labels.head(2)
shape: (2, 13)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs problematic_content_label_present problematic_content_label_present_score
str str str str str str str f64 str i64 str str f64
"Tema: Ankomster “Hele tiden åd… "<urn:uuid:0796b04c-c1bf-418b-b… "CC-MAIN-2014-42" "http://www.copenhagen.dk/dk/de… "2014-10-30T18:10:47Z" "s3://commoncrawl/crawl-data/CC… "dan" 0.999933 "Latn" 26 "{"dan_Latn_score": 0.999932765… "LABEL_0" 0.999707
"Hiddensees mangfoldige skønhed… "<urn:uuid:5f7751e9-981d-4cfe-9… "CC-MAIN-2016-07" "http://www.germany.travel/dk/f… "2016-02-07T03:49:50Z" "s3://commoncrawl/crawl-data/CC… "dan" 0.999974 "Latn" 116 "{"dan_Latn_score": 0.999974370… "LABEL_0" 0.99975
df_with_labels.select("problematic_content_label_present").to_series().value_counts(
    normalize=True
)
shape: (2, 2)
problematic_content_label_present proportion
str f64
"LABEL_1" 0.0698
"LABEL_0" 0.9302

Taking a look at the problematic texts, even with my imperfect Danish I can see why these have been labelled as problematic.

from rich import print as rprint

rprint(
    [
        text[:1000]
        for text in df_with_labels.filter(
            pl.col("problematic_content_label_present") == "LABEL_1"
        )
        .head(2)
        .select("text")
        .to_series()
        .to_list()
    ]
)
[
    'Layered haircuts ser altid elegant, ikke bare på lange hår, men også på kort håret. These haircuts ser godt ud
på kvinder og på mænd. Zac Efron og Keith Urban mænds frisurer kan blive henvist til klippe hår i lag for mænd. 
These frisurer kan lette at vedligeholde og stylet på forskellige måder. Du kan også prøve forskellige hårfarve 
ideer om lagdelte haircuts til at give et unikt look. Du kan nyde det med side feje pandehår, eller en stump 
frynser. Her er en guide til hvordan du klippe håret i lag, som vil hjælpe dig med at klippe hår i lag 
derhjemme.\nHvordan man kan skære Hår i Layers derhjemme\nTing du behøver, Her er en liste over almindelige ting, 
som du kan finde derhjemme selv:; Et godt saks, en kam, hår børste, føntørrer, to spejle, A dyse vand flaske, for 
at opretholde fugtigt hår; mange hår skruetvinger, forberede dit hår; Vask dit hår rene med en shampoo, og 
håndklæde tørre dem. Må ikke helt tør dem, holde dit hår fugtige, da det vil blive lettere at skære dem. Hvis du 
har tør',
    'Homo bordel herrer body to body massage sjællandFitness world forum åbningstider liderlige kællinger Posted by
fitness world forum åbningstider liderlige kællinger on Fodmassage frederiksberg thai massage i hjørring Posted by 
fodmassage frederiksberg thai massage i hjørring on Sex Film ålerne Jeg vil være diskret i Italien. Silkeborg 
karina hot and sexy Massage og Escort: Islington, London Thai traditional massage. Asian escort copenhagen dansk 
gay porn Posted by asian escort copenhagen dansk gay porn on\nJeg praktiserer den traditionelle thailandske massage
i min klinik i Hjørring. En såkaldt slikkelap er en ny form for prævention til kvinder, der beskytter mod 
kønssygdomme ved oralsex. Vi boede på landet og havde en del fjerkræ, som vi slagtede, gjorde i stand og spiste 
eller vi solgte dem til venner og bekendte. Thai-massage inkluderer ofte happy ending, og nogle steder er der 
mulighed for. Top thai massage vejle kiss porn - sammenlignede med Skriv en mail til ungtlahme gmail. Massag'
]

Since we have confidence scores, we can see how confident the model is in its predictions and potentially only use the predictions with a confidence score above a certain threshold.

df_with_labels.select("problematic_content_label_present_score").describe()
shape: (9, 2)
statistic problematic_content_label_present_score
str f64
"count" 10000.0
"null_count" 0.0
"mean" 0.996547
"std" 0.027983
"min" 0.503509
"25%" 0.999564
"50%" 0.999691
"75%" 0.999733
"max" 0.999781
df_with_labels.filter(pl.col("problematic_content_label_present_score") < 0.9).shape
(98, 13)
df_with_labels.filter(pl.col("problematic_content_label_present_score") < 0.8).shape
(54, 13)
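If we wanted to act on this, a minimal sketch of threshold-based filtering might look like the following (the 0.9 cut-off is purely illustrative): keep a text unless the model confidently predicts the problematic label.

# Sketch: drop rows only when the model confidently predicts the problematic
# label (LABEL_1); everything else is kept. The threshold is illustrative.
threshold = 0.9
df_kept = df_with_labels.filter(
    ~(
        (pl.col("problematic_content_label_present") == "LABEL_1")
        & (pl.col("problematic_content_label_present_score") >= threshold)
    )
)
print(df_kept.shape)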

Conclusion: Data curation using semi-disposable models

Since the goal for this kind of model is mostly to do some initial cleaning, it doesn’t need to be perfect. The beauty of these kinds of classifiers is that we can retrain them fairly cheaply and quickly with more and better data, so we don’t have to be too attached to any particular model checkpoint.

Appendix: Running on the full fineweb-2 dataset for Danish

Because this step is network-bound, I ran it on an A100 Space on Hugging Face, which has a very fast connection.

danish_lf = pl.scan_parquet(
    [f"hf://datasets/HuggingFaceFW/fineweb-2/{f}" for f in danish]
)
danish_lf.head(1).collect()
shape: (1, 11)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs
str str str str str str str f64 str i64 str
"Tema: Ankomster “Hele tiden åd… "<urn:uuid:0796b04c-c1bf-418b-b… "CC-MAIN-2014-42" "http://www.copenhagen.dk/dk/de… "2014-10-30T18:10:47Z" "s3://commoncrawl/crawl-data/CC… "dan" 0.999933 "Latn" 26 "{"dan_Latn_score": 0.999932765…
%%time
danish_lf.select("language_score").describe()

We don’t need all of the columns for doing inference, so let’s grab just the text and id.

danish_df_for_prediction = danish_lf.select(["id", "text"])
danish_df_for_prediction.sink_parquet("dan.parquet")
df_pred = pl.scan_parquet("dan.parquet")
df_pred.select(pl.len()).collect()
shape: (1, 1)
len
u32
43002078
from datasets import Dataset
ds = Dataset.from_parquet("dan.parquet")
ds
Dataset({
    features: ['id', 'text'],
    num_rows: 43002078
})
sample = ds.shuffle().take(10_000)
sample
Dataset({
    features: ['id', 'text'],
    num_rows: 10000
})
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset
results = []
for out in tqdm(pipe(KeyDataset(sample, "text")), total=len(sample)):
    results.append(out)
results[0]
{'label': 'not_problematic', 'score': 0.9998955726623535}
labels = [x["label"] for x in results]
scores = [x["score"] for x in results]
labels[:3], scores[:3]
(['not_problematic', 'not_problematic', 'not_problematic'],
 [0.9998955726623535, 0.9998942613601685, 0.9998948574066162])
sample = sample.add_column("problematic_label", labels)
sample = sample.add_column("problematic_label_score", scores)
sample[0]["problematic_label"]
'not_problematic'
clean_ds = sample.filter(lambda x: x["problematic_label"] == "not_problematic")
clean_ds.push_to_hub("davanstrien/fineweb2-danish-cleaned")
CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/fineweb2-danish-cleaned/commit/f0f86b883ee6ff82f91aed4d55ac1026e70dd473', commit_message='Upload dataset', commit_description='', oid='f0f86b883ee6ff82f91aed4d55ac1026e70dd473', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/davanstrien/fineweb2-danish-cleaned', endpoint='https://huggingface.co', repo_type='dataset', repo_id='davanstrien/fineweb2-danish-cleaned'), pr_revision=None, pr_num=None)