%pip install polars seaborn matplotlib altair huggingface_hub
How Fine is FineWeb2?
Recently FineWeb2 was released, adding multilingual data to the original FineWeb dataset. As part of the FineWeb-C community initiative, contributors have been annotating samples from this dataset to assess educational quality across different languages.
The goal of this effort is to reproduce something similar to the original FineWeb-Edu dataset for all languages. You can read more about why this is important in the FineWeb-C blog post.
Analyzing the FineWeb-C Community Annotations
This post examines the annotation data collected so far through the FineWeb-C project. For most languages, community members have annotated random samples of approximately 1,000 examples from FineWeb2. The annotations focus on:
- Educational value assessment
- Language identification verification
- Problematic content flagging
Let’s explore what the community annotations tell us about the quality and characteristics of content across different languages in FineWeb2.
We’ll start by installing and importing the necessary libraries.
import polars as pl
from huggingface_hub import list_repo_files
# bump up the number of rows in the polars table so we can see more data
pl.Config.set_tbl_rows(100)
We’ll use Polars to load the annotations, but we could also use Pandas, Dask, DuckDB, PySpark, or the Hugging Face Datasets library!
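For example, here’s a minimal sketch of loading a single language with the Hugging Face Datasets library (this assumes the per-language config names match the directory names we’ll list below, e.g. arb_Arab, and that the datasets package is installed):

from datasets import load_dataset

# Config name assumed to match the language directory, e.g. "arb_Arab"
ds = load_dataset("data-is-better-together/fineweb-c", "arb_Arab", split="train")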
The fineweb-c dataset has a config per language but also a general config. We’ll load all the parquet files for each language, which we can get using the list_repo_files function from the huggingface_hub library.
paths = list_repo_files("data-is-better-together/fineweb-c", repo_type="dataset")
paths = [f for f in paths if f.endswith(".parquet") and "data" not in f]
paths
['arb_Arab/train-00000-of-00001.parquet',
'ary_Arab/train-00000-of-00001.parquet',
'arz_Arab/train-00000-of-00001.parquet',
'asm_Latn/train-00000-of-00001.parquet',
'bar_Latn/train-00000-of-00001.parquet',
'cmn_Hani/train-00000-of-00001.parquet',
'dan_Latn/train-00000-of-00001.parquet',
'fas_Arab/train-00000-of-00001.parquet',
'fil_Latn/train-00000-of-00001.parquet',
'fin_Latn/train-00000-of-00001.parquet',
'fra_Latn/train-00000-of-00001.parquet',
'gmh_Latn/train-00000-of-00001.parquet',
'gsw_Latn/train-00000-of-00001.parquet',
'hin_Deva/train-00000-of-00001.parquet',
'lvs_Latn/train-00000-of-00001.parquet',
'rus_Cyrl/train-00000-of-00001.parquet',
'slk_Latn/train-00000-of-00001.parquet',
'spa_Latn/train-00000-of-00001.parquet',
'swe_Latn/train-00000-of-00001.parquet',
'tat_Cyrl/train-00000-of-00001.parquet',
'ukr_Cyrl/train-00000-of-00001.parquet',
'vie_Latn/train-00000-of-00001.parquet',
'yue_Hani/train-00000-of-00001.parquet',
'zsm_Latn/train-00000-of-00001.parquet']
keep_columns = [
    "id",
    "text",
    "educational_value_labels",
    "annotator_ids",
    "problematic_content_label_present",
    "problematic_content_label_agreement",
    "language_names",
    "language_code",
]

df = pl.scan_parquet(
    [f"hf://datasets/data-is-better-together/fineweb-c/{p}" for p in paths]
).select(keep_columns)
How many languages are included in the dataset so far?
"language_code")).unique().collect().count() df.select(pl.col(
language_code |
---|
u32 |
24 |
We can also see how many rows there are per language.
"language_code"]).agg(pl.col("id").len().alias("n_rows")).sort(
df.group_by(["n_rows", descending=True
).collect()
language_code | n_rows |
---|---|
str | u32 |
"tat_Cyrl" | 1557 |
"slk_Latn" | 1000 |
"gsw_Latn" | 1000 |
"vie_Latn" | 1000 |
"cmn_Hani" | 1000 |
"spa_Latn" | 1000 |
"swe_Latn" | 1000 |
"arz_Arab" | 1000 |
"bar_Latn" | 1000 |
"yue_Hani" | 1000 |
"fil_Latn" | 1000 |
"gmh_Latn" | 1000 |
"fin_Latn" | 1000 |
"rus_Cyrl" | 1000 |
"dan_Latn" | 1000 |
"zsm_Latn" | 1000 |
"asm_Latn" | 1000 |
"ary_Arab" | 1000 |
"hin_Deva" | 1000 |
"lvs_Latn" | 1000 |
"ukr_Cyrl" | 1000 |
"arb_Arab" | 1000 |
"fra_Latn" | 1000 |
"fas_Arab" | 1000 |
We could also directly use the datasets viewer to get this information! This can be a really useful way to explore the data very quickly without having to load it locally.
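As a sketch of that approach, the dataset viewer also exposes a public API we can query without downloading anything. The /size endpoint below is my assumption about the right endpoint for row counts; check the dataset viewer API docs if the response shape differs:

import requests

# Ask the dataset viewer API for per-config sizes (endpoint and response
# shape are assumptions; see the Hugging Face dataset viewer API docs)
resp = requests.get(
    "https://datasets-server.huggingface.co/size",
    params={"dataset": "data-is-better-together/fineweb-c"},
)
resp.raise_for_status()
print(resp.json())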
Number of annotators
Since starting this project we’ve found that some languages have more problematic data, including languages where a lot of the data was identified as being in the wrong language. Because of this, we’ve set a higher or lower “retirement” threshold for some languages, i.e. the number of annotations a sample needs before it is retired. This means some languages have only a single annotator labeling each example whilst others have had multiple people annotate each example.
Let’s look at the mean number of annotators per language.
"language_code"]).agg(
df.group_by(["educational_value_labels").list.len().mean().alias("n_annotators")
pl.col("n_annotators", descending=True).collect() ).sort(
language_code | n_annotators |
---|---|
str | f64 |
"vie_Latn" | 2.869 |
"dan_Latn" | 2.573 |
"spa_Latn" | 2.126 |
"ary_Arab" | 1.019 |
"fin_Latn" | 1.017 |
"fra_Latn" | 1.012 |
"swe_Latn" | 1.006 |
"arz_Arab" | 1.006 |
"slk_Latn" | 1.006 |
"rus_Cyrl" | 1.003 |
"lvs_Latn" | 1.003 |
"tat_Cyrl" | 1.002569 |
"asm_Latn" | 1.002 |
"arb_Arab" | 1.002 |
"fas_Arab" | 1.001 |
"gmh_Latn" | 1.0 |
"bar_Latn" | 1.0 |
"zsm_Latn" | 1.0 |
"fil_Latn" | 1.0 |
"ukr_Cyrl" | 1.0 |
"hin_Deva" | 1.0 |
"yue_Hani" | 1.0 |
"gsw_Latn" | 1.0 |
"cmn_Hani" | 1.0 |
Ideally, as we progress with the project, we’ll increase the overlap, at least for some time, to see how much annotators agree (more on this below). At the same time, we don’t want people to spend a lot of time annotating the same sample if it’s of low quality.
Look at the distribution of labels
Now we’ve got a sense of how many annotators there are per language, let’s look at the distribution of labels. Since we sometimes have multiple labels for each sample, we’ll simplify things a bit and just look at the first label for each sample. For some languages this will be the only label anyway. If we work with a specific language over a longer period, we’ll also want to consider the agreement between annotators.
df = df.with_columns(
    pl.col("educational_value_labels").list.first().alias("label")
).collect()
df.head(1)
id | text | educational_value_labels | annotator_ids | problematic_content_label_present | problematic_content_label_agreement | language_names | language_code | label |
---|---|---|---|---|---|---|---|---|
str | str | list[str] | list[str] | bool | f64 | str | str | str |
"78889051-5d7e-4d33-9d3f-6414ba… | "المؤتمر العالمي للغابات يجتمع … | ["None"] | ["9e779e60-6719-401a-ab47-05aafb94be65"] | false | 1.0 | "arb_Arab" | "arb_Arab" | "None" |
label_counts = (
    df.group_by(["language_code", "label"])
    .len()
    .pivot(
        values="len",
        index="language_code",
        on="label",
        aggregate_function="sum",
    )
    .fill_null(0)
)

# Calculate row totals
row_totals = label_counts.select(pl.exclude("language_code")).sum_horizontal()

# Calculate percentages
label_percentages = (
    label_counts.with_columns(
        pl.col(col) / row_totals * 100
        for col in label_counts.columns
        if col != "language_code"
    )
    .select(["language_code", pl.all().exclude("language_code").round(2)])
    .sort("language_code")
)
label_percentages
language_code | Minimal | Good | ❗ Problematic Content ❗ | None | Excellent | Basic | ❗ Wrong language ❗ |
---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"arb_Arab" | 15.6 | 2.5 | 25.3 | 51.0 | 0.7 | 4.9 | 0.0 |
"ary_Arab" | 1.2 | 1.4 | 94.6 | 1.1 | 1.0 | 0.7 | 0.0 |
"arz_Arab" | 3.7 | 4.0 | 73.5 | 12.7 | 2.0 | 4.1 | 0.0 |
"asm_Latn" | 5.0 | 0.0 | 19.1 | 64.7 | 0.0 | 0.3 | 10.9 |
"bar_Latn" | 2.4 | 0.0 | 77.0 | 16.5 | 0.0 | 4.1 | 0.0 |
"cmn_Hani" | 36.2 | 9.7 | 2.5 | 24.2 | 1.4 | 25.1 | 0.9 |
"dan_Latn" | 28.2 | 1.2 | 14.5 | 47.6 | 0.1 | 8.4 | 0.0 |
"fas_Arab" | 18.7 | 8.9 | 20.5 | 39.6 | 2.3 | 10.0 | 0.0 |
"fil_Latn" | 27.6 | 2.6 | 37.9 | 24.1 | 0.8 | 5.1 | 1.9 |
"fin_Latn" | 28.5 | 18.2 | 8.5 | 19.7 | 2.6 | 20.6 | 1.9 |
"fra_Latn" | 23.8 | 5.7 | 4.8 | 54.3 | 2.8 | 8.6 | 0.0 |
"gmh_Latn" | 0.5 | 0.0 | 98.3 | 1.0 | 0.0 | 0.2 | 0.0 |
"gsw_Latn" | 2.5 | 3.6 | 1.3 | 29.3 | 1.1 | 2.2 | 60.0 |
"hin_Deva" | 11.2 | 0.2 | 0.3 | 87.1 | 0.3 | 0.9 | 0.0 |
"lvs_Latn" | 23.8 | 1.4 | 8.7 | 54.7 | 0.0 | 11.4 | 0.0 |
"rus_Cyrl" | 13.9 | 3.0 | 7.6 | 68.3 | 1.1 | 6.1 | 0.0 |
"slk_Latn" | 21.6 | 3.8 | 4.7 | 60.4 | 1.5 | 8.0 | 0.0 |
"spa_Latn" | 24.5 | 2.9 | 3.9 | 60.7 | 0.7 | 7.3 | 0.0 |
"swe_Latn" | 24.1 | 3.3 | 8.5 | 54.5 | 1.1 | 8.5 | 0.0 |
"tat_Cyrl" | 21.97 | 3.85 | 2.5 | 36.61 | 1.28 | 33.78 | 0.0 |
"ukr_Cyrl" | 8.6 | 7.7 | 1.4 | 73.6 | 2.3 | 6.4 | 0.0 |
"vie_Latn" | 33.9 | 6.4 | 9.5 | 35.9 | 1.2 | 13.1 | 0.0 |
"yue_Hani" | 13.8 | 5.5 | 0.9 | 68.4 | 1.8 | 9.6 | 0.0 |
"zsm_Latn" | 37.3 | 4.5 | 0.4 | 44.9 | 2.7 | 10.1 | 0.1 |
This is a bit hard to read so let’s plot it as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

# Convert Polars DataFrame to pandas for seaborn
label_percentages_pd = label_percentages.to_pandas()

# Clean up column names by replacing the problematic symbols
label_percentages_pd.columns = [
    col.replace("❗", "!") for col in label_percentages_pd.columns
]

# Define the desired column order
column_order = [
    "! Problematic Content !",
    "! Wrong language !",
    "None",
    "Basic",
    "Minimal",
    "Good",
    "Excellent",
]

# Reorder the columns (excluding 'language_code', which will be the index)
label_percentages_pd = label_percentages_pd.set_index("language_code")[column_order]

plt.figure(figsize=(15, 10))

sns.heatmap(
    label_percentages_pd,
    annot=True,
    fmt=".1f",  # Keep one decimal place
    cmap="YlOrRd",
    cbar_kws={"label": "Percentage (%)"},
    square=False,  # allow rectangular cells
)

plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)

# Add more padding
plt.title("Distribution of Labels by Language (%)", pad=5)
plt.tight_layout()
plt.show()
We can also look at the distribution of labels as a bar chart.
import altair as alt

# Convert the wide-format data to long format for Altair
plot_df = label_percentages.unpivot(
    index=["language_code"],
    on=label_percentages.columns[1:],  # All columns except language_code
    variable_name="label",
    value_name="percentage",
)

# Create the Altair chart
chart = (
    alt.Chart(plot_df)
    .mark_bar()
    .encode(
        x="language_code",
        y=alt.Y("percentage", stack="zero"),
        color="label",
        tooltip=["language_code", "label", "percentage"],
    )
    .properties(
        width=1000,
        height=500,
        title="Label Distribution by Language",
    )
    .configure_axis(labelAngle=45)
)
chart
Refining FineWeb2 for more languages aka next steps
We can see that the distribution of labels varies a fair amount between languages. Some languages are dominated by problematic labels, whilst others have a much healthier distribution. This can also inform the best next step for each language as part of the FineWeb-C project.
Better starting data for annotators aka more filtering of problematic data
FineWeb2 already has a lot of filters to try and identify high-quality data; however, this is challenging to do well across so many languages. This is why the community is so important.
- If a language has a lot of problematic or None labels, it makes sense to focus on filtering higher-quality data for that language.
- In a previous blog post I showed some approaches to how this can be done for a single language. The best approach will depend on the language, but developing better heuristics for more languages can be very impactful (a minimal sketch of this kind of filter follows this list). This is another area where the community can help!
- Some languages have a lot of data that was incorrectly identified as being in that language. This is a problem for the community because annotators end up spending a lot of time labeling data that is not useful. For these languages, better heuristics or models for language identification would let us filter out more of this data.
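To make the heuristics idea more concrete, here is a minimal, hypothetical sketch of the kind of rule-based pre-filter that could run before data reaches annotators. The rules and thresholds are illustrative assumptions, not the filters FineWeb2 actually uses:

# A hypothetical rule-based pre-filter; rules and thresholds are illustrative only
def looks_problematic(text: str) -> bool:
    """Flag samples that are unlikely to be worth annotating."""
    if len(text) < 100:  # very short snippets are rarely educational
        return True
    letters = sum(ch.isalpha() for ch in text)
    if letters / max(len(text), 1) < 0.5:  # mostly digits, symbols, or markup
        return True
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:  # heavily repeated lines
        return True
    return False

filtered = df.filter(
    ~pl.col("text").map_elements(looks_problematic, return_dtype=pl.Boolean)
)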
Evaluating agreement between annotators
If a language has a good distribution of labels, the next step is to look at the agreement between annotators. This can help us understand how much we can trust the labels and where they are ambiguous.
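As a starting point, here’s a sketch of one simple measure: the share of multiply-annotated samples where every annotator chose the same label. More robust measures such as Krippendorff’s alpha would be a natural follow-up:

# Among samples with more than one annotation, how often do all
# annotators pick the same educational value label?
agreement = (
    df.filter(pl.col("educational_value_labels").list.len() > 1)
    .with_columns(
        (pl.col("educational_value_labels").list.n_unique() == 1).alias("unanimous")
    )
    .group_by("language_code")
    .agg(pl.col("unanimous").mean().alias("full_agreement_rate"))
    .sort("full_agreement_rate", descending=True)
)
agreement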
Evaluating LLMs to build training data for a classifier?
For languages where we have fairly good agreement (or at least a subset with good agreement), we can already start to evaluate how well LLMs do at identifying educational content. Remember, the overall goal of this effort is to reproduce something similar to FineWeb-Edu: a subset of the original FineWeb dataset filtered for educational content. That filtering was done by training a classifier on educational quality labels generated by Llama and then using the classifier to score and filter the original FineWeb dataset. For some languages it might be possible to take a similar approach. We can do something like:
- Evaluate how well an LLM does in comparison to our annotators. This might require experimenting with different LLMs and prompting approaches. I’m working on a notebook to do this, but it’s another area where the community can help (a rough sketch of the comparison step follows this list).
- If the LLM does a good job, we can then use it to generate enough data to start training a classifier.
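As a rough sketch of the first step, the comparison could look something like the following, using the InferenceClient from the huggingface_hub library. The model choice, prompt, and label parsing are all assumptions to experiment with, not a settled recipe:

from huggingface_hub import InferenceClient

LABELS = ["None", "Minimal", "Basic", "Good", "Excellent"]
client = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")  # model choice is illustrative

def llm_label(text: str) -> str:
    response = client.chat_completion(
        messages=[
            {
                "role": "user",
                "content": (
                    "Rate the educational value of the following text as one of: "
                    + ", ".join(LABELS)
                    + ". Reply with the label only.\n\n"
                    + text[:2000]  # truncate long documents
                ),
            }
        ],
        max_tokens=5,
    )
    answer = response.choices[0].message.content.strip()
    return answer if answer in LABELS else "None"

# Compare against the first annotator label on a small sample
sample = df.filter(pl.col("language_code") == "dan_Latn").head(50)
matches = [
    llm_label(text) == label
    for text, label in zip(sample["text"], sample["label"])
]
print(f"LLM/annotator agreement: {sum(matches) / len(matches):.0%}")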
Training educational quality classifiers
Some languages already have quite a lot of data with a fairly good distribution of labels. For these languages we can already check whether we have sufficient data to train a classifier. It’s likely that for most languages we’d need more data, but training an initial classifier can be a good next step. Again, this is another area where the community can help!
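As a sketch of what a first baseline might look like, here’s a simple TF-IDF plus logistic regression pipeline trained on a single language’s first labels. scikit-learn is an extra dependency here, and this is a deliberately modest stand-in for the kind of classifier used for FineWeb-Edu:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Pick a language with a reasonable label distribution, e.g. Tatar
lang_df = df.filter(pl.col("language_code") == "tat_Cyrl")
X_train, X_test, y_train, y_test = train_test_split(
    lang_df["text"].to_list(),
    lang_df["label"].to_list(),
    test_size=0.2,
    random_state=42,
)

clf = make_pipeline(
    TfidfVectorizer(max_features=20_000),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")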
As you may have noticed, I’ve been using the word “community” a lot in this post. This is because I think this is a really important part of the FineWeb-C project. We have a Discord channel where we can discuss the project and share ideas. If you’re interested in contributing to the project please join the channel and let’s chat!