Filtering FineWeb2 using Polars

polars
huggingface
Using Polars to filter the FineWeb2 dataset and other large Hugging Face datasets
Author

Daniel van Strien

Published

December 30, 2024

Recently, FineWeb2 was released. FineWeb2 builds on the previous FineWeb dataset, adding data for many languages. Building on this work, we recently launched a community effort to build educational quality filters for many languages. See this blog post for more details.

Filtering the FineWeb2 dataset to improve educational quality and/or filter for a language

One of the goals of the FineWeb-c project is to build educational quality filters for many languages. To do this, the community has been annotating the data with educational quality scores. So far, the majority of the datasets for each language consist of a random sample of 1,000 examples from FineWeb2 for that language. However, for some languages, the community has found that:

  • the language identification is not always correct
  • the educational quality of the sample is very low

For these languages, we want to enable the community to create extra filters, either to help with the language identification or to filter for educational quality. This blog post shows some ways you can use Polars to filter the FineWeb2 dataset to improve educational quality and/or filter for a language.

First we’ll install the necessary libraries. We’ll use polars for the data manipulation and huggingface_hub to interact with the Hugging Face Hub. The dask library is another good option for working with large datasets.

# %pip install polars huggingface_hub tld rich tqdm --upgrade
from huggingface_hub import list_repo_files, hf_hub_download
import polars as pl
from tld import get_tld
from pathlib import Path
from tqdm.auto import tqdm
import os
# increase amount of data polars shows
pl.Config.set_tbl_rows(100)

Many large datasets on the Hub are organised into different configurations, often named after the language they contain. For example, the FineWeb2 dataset is organised by language. Many large datasets will either have structured folders or file names that can be used to filter the dataset. Let’s look at the FineWeb2 dataset. We can use the wonderful huggingface_hub library to list the files in a repository.

paths = list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
paths[:10]
['.gitattributes',
 'README.md',
 'data/aai_Latn/test/000_00000.parquet',
 'data/aai_Latn/train/000_00000.parquet',
 'data/aai_Latn_removed/train/000_00000.parquet',
 'data/aak_Latn/test/000_00000.parquet',
 'data/aak_Latn/train/000_00000.parquet',
 'data/aak_Latn_removed/train/000_00000.parquet',
 'data/aau_Latn/test/000_00000.parquet',
 'data/aau_Latn/train/000_00000.parquet']

You can see we have a hidden .gitattributes file, a README.md, and a data directory containing parquet files organised into different subdirectories. Since these names are very clear, we can create a very simple filter to get the Scots language files we’re interested in. We’ll look for sco in the file name, make sure it ends with .parquet, and exclude file names containing removed, since those files were removed in the FineWeb2 filtering process.

scots = [
    f for f in paths if ("sco" in f and f.endswith("parquet") and "removed" not in f)
]
scots
['data/sco_Latn/test/000_00000.parquet',
 'data/sco_Latn/train/000_00000.parquet']

Loading the data in Polars

We can load data directly from the Hugging Face Hub using the hf:// protocol. In this case, we’ll just load the train file for Scots. We’ll use read_parquet to load the data for now, but we’ll see a better way to load the data below if you are working with large datasets.

df = pl.read_parquet(f"hf://datasets/HuggingFaceFW/fineweb-2/{scots[-1]}")

Let’s take a look at the data. We can see we have a number of columns, including the actual text but also some metadata fields that could be useful for filtering.

df.head(5)
shape: (5, 11)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs
str str str str str str str f64 str i64 str
"2010 All Ford Mustangs Car Sho… "<urn:uuid:06f10aff-f1da-4d33-b… "CC-MAIN-2013-20" "http://www.allfordmustangs.com… "2013-05-23T16:34:05Z" "s3://commoncrawl/crawl-data/CC… "sco" 0.764794 "Latn" 1258 "{"sco_Latn_score": 0.764793634…
"Interested in France? We'll se… "<urn:uuid:abc6bfe8-7af5-40b9-9… "CC-MAIN-2013-20" "http://www.tripadvisor.com/All… "2013-05-23T16:36:10Z" "s3://commoncrawl/crawl-data/CC… "sco" 0.651096 "Latn" 12 "{"sco_Latn_score": 0.651095628…
"Sherlock Holmes Sherlock Holme… "<urn:uuid:11ceff04-f5f5-418c-8… "CC-MAIN-2014-10" "http://sco.wikipedia.org/wiki/… "2014-03-08T05:12:30Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.000008 "Latn" 58 "{"sco_Latn_score": 1.000008225…
"Munster History[eedit | eedit … "<urn:uuid:5fd5fa85-72b1-43d3-b… "CC-MAIN-2014-15" "http://sco.wikipedia.org/wiki/… "2014-04-19T09:31:48Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 79 "{"sco_Latn_score": 1.000009536…
"Snawbuirdin Frae Wikipedia Sna… "<urn:uuid:72c97fcb-4820-4a52-b… "CC-MAIN-2014-15" "http://sco.wikipedia.org/wiki/… "2014-04-19T09:31:00Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 66 "{"sco_Latn_score": 1.000010013…

We can do some simple EDA-style analysis if we want. For example, we can look at the distribution of the language scores.

df.select(pl.col("language_score")).describe()
shape: (9, 2)
statistic language_score
str f64
"count" 75821.0
"null_count" 0.0
"mean" 0.537262
"std" 0.214123
"min" 0.300002
"25%" 0.371339
"50%" 0.465798
"75%" 0.634602
"max" 1.00001

Let’s group by the year of the dump, take the mean language score, and plot a bar chart to see if there is a trend.

df.with_columns(
    pl.col("dump").str.extract(r"(\d{4})").cast(pl.Utf8).alias("year")
).group_by("year").agg(pl.col("language_score").mean()).sort(
    "year", descending=True
).plot.bar(x="year", y="language_score")

Heuristics for filtering for higher educational quality in FineWeb2

Whilst the authors of FineWeb2 aimed to do general quality filtering, there are often additional heuristics that can be used to filter for higher educational quality. For example, we can use the TLD or the URL to identify websites that are more likely to contain higher quality text. Many of these heuristics will require some domain knowledge about a particular language and the web ecosystem for that language.

The top-level domain (TLD) is a good heuristic for filtering for higher quality websites. The TLD is the final part of a domain name. For example, the TLD of https://www.wikipedia.org/ is org. TLDs (and effective TLDs such as ac.uk, the UK’s higher education domain) often correspond to a country or type of organization, so we can use them to filter for higher quality websites.
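To make the idea concrete, here is a minimal sketch of effective-TLD extraction. This is not how the tld library works internally (it uses the full Public Suffix List); the suffix list below is a tiny hard-coded sample for illustration only.

```python
from urllib.parse import urlparse

# Illustrative only: a tiny sample of multi-part ("effective") TLDs.
# A real implementation should use the full Public Suffix List,
# as the tld library does.
MULTI_PART_SUFFIXES = {"ac.uk", "co.uk", "org.uk", "com.au", "co.za"}


def simple_tld(url: str) -> str:
    """Return the effective TLD of a URL using the toy suffix list above."""
    host = urlparse(url).netloc
    parts = host.split(".")
    last_two = ".".join(parts[-2:])
    # Prefer a known multi-part suffix; otherwise fall back to the last label.
    if last_two in MULTI_PART_SUFFIXES:
        return last_two
    return parts[-1]


print(simple_tld("https://www.wikipedia.org/"))        # org
print(simple_tld("https://www.abdn.ac.uk/kist/"))      # ac.uk
```

In practice we use the tld library rather than this sketch, since effective TLDs can only be determined reliably from the maintained Public Suffix List.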

We can do this by mapping the url column to the tld and then filtering for the tlds we’re interested in. Let’s add a new column with the tld and then filter for the tlds we’re interested in.

df = df.with_columns(
    pl.col("url").map_elements(get_tld, return_dtype=pl.Utf8).alias("tld")
)
import altair as alt

df.select("tld").to_series().value_counts(sort=True).sort(
    "count", descending=True
).head(20).plot.bar(
    x=alt.X("tld", sort="-y"),  # Sort x-axis based on y values in descending order
    y="count",
)

We may already have some knowledge or intuitions about which TLDs are more likely to be higher quality. For example, .us appears relatively often; this is likely partly because the domain is common on the web generally. We may also see some personal blogs using this domain. Let’s take a look at a few examples.

df.filter(pl.col("tld").str.contains("us")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]
['https://coremc.us/forno-microonde-incasso.html',
 'https://www.awesomedownloadfilestoday.us/1141-haircut-places-near-my-location.html',
 'https://www.awesomedownloadfilestoday.us/1376-haircut-near-my-location.html',
 'https://www.awesomedownloadfilestoday.us/1857-short-haircuts-for-fine-straight-hair.html',
 'https://www.awesomedownloadfilestoday.us/2081-twa-styles-4c-hair.html',
 'http://winserver.us/mid-century-modern-front-door-colors/mid-century-modern-front-door-colors-mid-century-modern-front-doors-door-colors-handles-mi-mid-century-modern-front-door-colours/',
 'https://www.awesomedownloadfilestoday.us/3450-hair-styles-for-thick-short-hair.html',
 'https://notwttodaytes.us/casa-mezcal-mexican-grill-cantina.html',
 'https://www.awesomedownloadfilestoday.us/1857-short-haircuts-for-fine-straight-hair.html',
 'https://www.awesomedownloadfilestoday.us/1737-short-haircuts-for-curly-thick-hair.html',
 'http://uggbootsclearanceoutlet.us/jaguar-xj-sport-2003-2003-jaguar-xj-car-for-sale-in.html',
 'https://www.awesomedownloadfilestoday.us/5489-hair-colour-and-styles-for-short-hair.html',
 'https://www.awesomedownloadfilestoday.us/6213-short-haircuts-for-thick-hair-pictures.html',
 'https://www.awesomedownloadfilestoday.us/4993-short-haircuts-for-thick-curly-frizzy-hair.html',
 'https://www.awesomedownloadfilestoday.us/1773-short-hair-styles-for-thick-wavy-hair.html',
 'https://www.awesomedownloadfilestoday.us/6129-short-haircuts-for-very-fine-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/1808-haircuts-for-straight-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/4478-best-short-haircuts-for-thick-coarse-hair.html',
 'https://www.awesomedownloadfilestoday.us/9513-short-haircuts-for-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/3414-short-sassy-haircuts-for-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/4630-very-short-haircuts-for-thin-hair.html',
 'https://joesstuff.us/search.php?mfr=141',
 'https://www.awesomedownloadfilestoday.us/3738-short-hair-styles-for-thick-curly-hair.html',
 'https://layartancep.us/twa-hairstyles.html',
 'https://www.awesomedownloadfilestoday.us/1101-best-haircuts-for-fine-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/4963-short-haircut-for-fine-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/4963-short-haircut-for-fine-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/5010-haircuts-for-coarse-wavy-hair.html',
 'https://www.awesomedownloadfilestoday.us/5201-best-short-haircuts-for-fine-thin-hair.html',
 'https://www.awesomedownloadfilestoday.us/1911-haircut-for-round-face-and-thin-hair.html']

These don’t look super promising! One domain where we might expect higher quality text for Scots is the .scot TLD, which is used by websites relating to Scotland.

df.filter(pl.col("tld").str.contains("sco")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]
['https://stormplay.scot/sco/aboot.html',
 'https://www.makforrit.scot/2020/08/29/anent-the-scots-wikipedia-an-sundays-editathon/',
 'https://www.makforrit.scot/2018/12/23/daein-it-yersel/',
 'https://www.makforrit.scot/2019/09/22/uisin-oor-vyce-hou-we-can-gar-political-action-on-scots-inevitable/',
 'https://www.makforrit.scot/2018/02/03/naewey-tae-bide/',
 'https://www.makforrit.scot/',
 'https://www.makforrit.scot/2018/01/27/than-an-nou-poverty-makkin-dae-an-leukin-out-for-ilk-ither/',
 'https://www.makforrit.scot/',
 'https://salvo.scot/the-scottis-constitutional-covin/',
 'https://amylord.scot/gd/hello-welcome/',
 'https://www.makforrit.scot/category/scotland/',
 'https://projects.handsupfortrad.scot/scotslanguageawards/gies-a-scots-phrase-day-2021/',
 'https://scoblog.stormplay.scot/t3ngist-is-gaunae-need-tae-be-delayed.html',
 'https://www.makforrit.scot/2019/10/29/halloween/',
 'https://www.makforrit.scot/category/history/',
 'https://www.makforrit.scot/2018/11/19/three-days-in-october/',
 'http://mindyerlanguage.scot/teachin',
 'http://mindyerlanguage.scot/category/video',
 'https://scoblog.stormplay.scot/rossies-3d-an-t3ngist-are-gaunae-be-delayed.html',
 'https://www.gov.scot/publications/consultation-scots-government-commitments-tae-gaelic-scots-scots-langages-bill/pages/2/',
 'https://projects.handsupfortrad.scot/scotslanguageawards/nominations-open-for-scots-language-awards-2020/',
 'https://scoblog.stormplay.scot/were-gaunae-mak-big-chynges-tae-wir-main-wabsteid-an-a-cynge-o-hou-we-dae-things.html',
 'https://newsnet.scot/archive/mono-or-stereo/',
 'https://scoblog.stormplay.scot/happy-birthday-gamerstorm.html',
 'http://mindyerlanguage.scot/aboot',
 'https://www.makforrit.scot/category/opinion/',
 'https://projects.handsupfortrad.scot/scotslanguageawards/gies-a-scots-phrase-day-2022/',
 'https://newsnet.scot/archive/scots-railweys-scots-leids-an-scots-cairtes/',
 'https://www.makforrit.scot/author/jamie/',
 'https://stormplay.scot/games/pinkeye/sco/scotland.html']

Even within these URLs we can see some Scots, so this looks promising.

One of the issues with some of the Scots data in FineWeb2 is that it is in the wrong language. One way to get a sense of where better language data might be in FineWeb2 is to look at the TLDs that have the highest language scores. We can do this by grouping by TLD and taking the mean of the language scores. We then filter for TLDs with more than 50 rows to make sure we’re only considering TLDs with a reasonable amount of data.

(
    df.group_by("tld")
    .agg(
        [
            pl.col("language_score").count().alias("count"),
            pl.col("language_score").mean().alias("language_score"),
        ]
    )
    .filter(pl.col("count") > 50)  # only keep TLDs with a reasonable amount of data
    .sort("language_score", descending=True)
)
shape: (41, 3)
tld count language_score
str u32 f64
"scot" 102 0.998978
"ac.uk" 255 0.95732
"org.uk" 267 0.926128
"org" 8806 0.814764
"co.uk" 659 0.770529
"blogspot.com" 561 0.65765
"top" 85 0.581157
"eu" 275 0.558302
"de" 362 0.544635
"club" 807 0.543638
"co" 2258 0.530558
"nl" 1247 0.52335
"ca" 75 0.521135
"com.co" 54 0.520628
"ie" 52 0.514327
"info" 15820 0.506197
"com.au" 88 0.505842
"com" 25240 0.500029
"me" 4114 0.490317
"online" 182 0.489941
"mobi" 183 0.484193
"pl" 78 0.475017
"net" 3416 0.473797
"es" 56 0.466831
"tk" 53 0.461307
"fr" 141 0.461128
"it" 79 0.460331
"site" 832 0.456072
"xyz" 122 0.450356
"co.za" 66 0.450212
"store" 119 0.443785
"in" 135 0.439045
"co.ke" 135 0.433248
"us" 4655 0.43162
"pro" 131 0.431553
"pages.dev" 92 0.422582
"ru" 2110 0.416347
"live" 64 0.404111
"edu" 102 0.386884
"website" 54 0.371444
"cn" 97 0.366857

We can see some other potentially promising TLDs. For example, ac.uk is the UK’s higher education domain. Let’s take a look at the URLs with this TLD.

df.filter(pl.col("tld").str.contains("ac.uk", literal=True)).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]
['https://www.scottishcorpus.ac.uk/document/?documentid=1699',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1759',
 'https://www.abdn.ac.uk/elphinstone/kist/search/display.php?sblk65.dat',
 'https://www.abdn.ac.uk/elphinstone/kist/display/folk-history/357/',
 'https://scotslanguagepolicy.ac.uk/warkshoaps/',
 'https://scotslanguagepolicy.ac.uk/survey-final-weekend/',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?fhrg01.dat',
 'https://www.abdn.ac.uk/elphinstone/kist/display/761/',
 'https://scotslanguagepolicy.glasgow.ac.uk/hae-yer-say/',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?lwee66.dat',
 'https://scotslanguagepolicy.ac.uk/jist-fir-burns-nicht/',
 'https://scotslanguagepolicy.ac.uk/aboot/',
 'https://www.scottishcorpus.ac.uk/document/?documentid=122',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?bgre04.dat',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?arob01.dat',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1713&highlight=athort',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1695&highlight=projeck',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1709&highlight=projeck',
 'https://www.abdn.ac.uk/elphinstone/kist/display/work/938/',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1714',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1739',
 'https://www.abdn.ac.uk/elphinstone/kist/display/work/341/',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?acru01.dat',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1697&highlight=aroon',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1742&highlight=projeck',
 'https://www.abdn.ac.uk/elphinstone/kist/search/display.php?kmac01.dat',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1704&highlight=projeck',
 'https://scottishcorpus.ac.uk/document/?documentid=1710',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1725&highlight=direck',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1715&highlight=projeck']

In this case, using some EDA and domain knowledge, we can filter for the TLDs which are likely to be:

  • in the scots language
  • higher quality educational websites

We can reduce the FineWeb2 dataset to only include the rows that have these tlds.

good_tlds = ["scot", "ac.uk", "org.uk", "org"]
df.filter(pl.col("tld").is_in(good_tlds)).sort("language_score", descending=True).head(
    5
)
shape: (5, 12)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs tld
str str str str str str str f64 str i64 str str
"Snawbuirdin Frae Wikipedia Sna… "<urn:uuid:72c97fcb-4820-4a52-b… "CC-MAIN-2014-15" "http://sco.wikipedia.org/wiki/… "2014-04-19T09:31:00Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 66 "{"sco_Latn_score": 1.000010013… "org"
"Banner o the Sahrawi Arab Demo… "<urn:uuid:67052692-6020-4870-9… "CC-MAIN-2014-15" "http://sco.wikipedia.org/wiki/… "2014-04-24T06:38:13Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 27 "{"sco_Latn_score": 1.000010013… "org"
"Potosí is a ceety an the caipi… "<urn:uuid:e49b07bb-d7c9-4905-b… "CC-MAIN-2014-15" "http://sco.wikipedia.org/wiki/… "2014-04-21T15:05:27Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 34 "{"sco_Latn_score": 1.000010013… "org"
"Port Moresby Port Moresby (Ing… "<urn:uuid:bb6b995d-b3e8-4dcd-9… "CC-MAIN-2014-35" "http://sco.wikipedia.org/wiki/… "2014-08-30T16:16:49Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 80 "{"sco_Latn_score": 1.000010013… "org"
"Seville Seville is a ceety in … "<urn:uuid:cdcca31a-693e-463b-a… "CC-MAIN-2014-42" "http://sco.wikipedia.org/wiki/… "2014-10-22T21:45:17Z" "s3://commoncrawl/crawl-data/CC… "sco" 1.00001 "Latn" 31 "{"sco_Latn_score": 1.000010013… "org"
filtered_df = df.filter(pl.col("tld").is_in(good_tlds)).sort(
    "language_score", descending=True
)

We can now save the result. We’ll write the ids of the rows in the filtered dataset to a file; these ids can then be used to upload additional filtered data to the Argilla dataset for the language we’re working on.

with open("good_ids", "w") as f:
    for id in filtered_df.select("id").to_series().to_list():
        f.write(f"{id}\n")
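Later on, that file can be read back to select the matching rows again. A minimal sketch of the round trip (the ids below are hypothetical placeholders):

```python
# Hypothetical ids standing in for the real FineWeb2 row ids.
ids = ["<urn:uuid:aaa>", "<urn:uuid:bbb>"]

# Write one id per line, mirroring the cell above.
with open("good_ids", "w") as f:
    for id_ in ids:
        f.write(f"{id_}\n")

# Read them back into a set for fast membership checks.
with open("good_ids") as f:
    good_ids = {line.strip() for line in f}

# With Polars, the matching rows can then be recovered with:
# df.filter(pl.col("id").is_in(list(good_ids)))
print(sorted(good_ids))
```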

Filtering other languages

We can also use the same techniques to filter other languages. Some languages have a lot of data, so we can use the scan_parquet function to create a LazyFrame; this avoids loading all the data into memory. In addition, Polars will perform query optimizations on the LazyFrame, making our filtering code more efficient without much work on our part.

def get_paths_for_language(language: str):
    return [
        path
        for path in list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
        if path.endswith("parquet")
        and "removed" not in path
        and "train" in path
        and language in path
    ]

Filtering with a higher language score

Some of the data in FineWeb2 is not identified as the correct language. Language identification is still not a “solved” problem, but we may be able to use a higher confidence threshold to get a set of data that is more likely to be in the correct language. We can then label this data for educational quality without having to remove as many examples for being in the incorrect language.

paths = get_paths_for_language("asm")
paths
['data/asm_Beng/train/000_00000.parquet',
 'data/asm_Latn/train/000_00000.parquet']

Let’s load the data for the Latin-script Assamese subset, using only the train file.

df = pl.read_parquet(f"hf://datasets/HuggingFaceFW/fineweb-2/{paths[-1]}")
df.shape
(1104, 11)

We can use the describe function to get a sense of the distribution of the language scores.

df.select("language_score").describe()
shape: (9, 2)
statistic language_score
str f64
"count" 1104.0
"null_count" 0.0
"mean" 0.829071
"std" 0.231866
"min" 0.303687
"25%" 0.660899
"50%" 0.970777
"75%" 0.995925
"max" 0.999965

You can see that, compared to some other languages, the mean language score is quite low. We might be able to get a better subset of data by filtering for a higher language score. Let’s take a look at some examples of text with a high language score. This can help give us a sense of which threshold might produce fewer false positives.

from rich import print as rprint

examples_to_show = 3

rprint(
    df.filter(pl.col("language_score") > 0.9)
    .head(examples_to_show)
    .select("text")
    .to_series()
    .to_list()
)
[
    'eitu ajir pora 2 bosor agor kotha moi NIT r pora pass out hoisu just. Vaal job nopowa baabe moi keidinmaanor 
babe temporary hisape eta national company t humai asilu. Tate moi taniya k log palu , tai tate as aadvise r hisape
humaise.Prothom dekhate taik kiba vaal lagi gol. Dekha t tai bor dhuniya, mihi gulopia gaal, dudu keita niomia 
akaror, khali olop sapor. jetiya tai r logot kotha pati thaku, tetiya moi issa koi pen tu tolot palai dio, aru tai 
jetiya pen tu uthaboloi hauli diye moi tair dudu keitar dorshon koru aru gharat jai tair kotha vabi pura haat 
maaru. Eitu mur kiba eta routine r dore hoi goisil .\nEdin moi duporia duty t kaam koi vagor logat olop rest lobor 
babe kahote thoka rest roomot goi bohilu, ami praye kaam koi koi boar hole rest roomot goi olop hui ahi fresh hoi 
kamot lagi jau. Restroom mane eta medium size r room aru ekhon soki aru bisona. Hadharonote ejon rest koi thoka 
homoyot oin manuh restroomoloi nahe. Heidinau moi tenekoi ahi restroomor bisonat bagor dilu, hui thakute ketiya 
tuponi ahi gol gomei napalu. Hothate kiba eta gaat loga jen pai soku khul khale, dekhilu taniyai mur pantor uporote
tika khon sui ase, tai gomei pua nasil moi har pua buli, lahe lahe tai haat khon mur jounangor uporot muharibo 
dhorile, tai pura habodhane moi jate haar napau tenekoi sui asil, mur bor moja lagil, vabilu aji hopun pura 
hobo.Moi suddenly huar pora uthi bohi golu, tai sok khai uthil aru lajot ronga poi gol, tai haat khon mur baridalor
uporor pora putkoi atorai dile aru muk sorry sorry buli kobo dhorile. Moi kolu sorry kole ki hobo tumi mur tika aru
lingo duita sula dekhun moiu tumar bostu bilaak soom.Tai sage bhobai nasil enekua situation ahi jabo pare buli , 
heye tai thor lagi mur pine saie thakil.Moi bisonat bohiye bhori duta bisonar pora nomai dilu aru dui hatere dhori 
taik mur usorot loi ahilu, tai kole ki koribo khujisa rubul. Moi kolu moi just badla lom tumi mur ji ji suisa moiu 
tumaar soom buli koi haat duta tair pithir pora nomai thoi dui tika r uporot tholu aru jure jure tika khon tipibo 
dhorilu, ki mast gand asil tair mur bari dall pantor vitorot ei for forai asil.Lahe lahe moi hat dukhonere tair 
tika r major angso tu anguli etare subo dhorilu aru anguli tu aru honmukhor fale koi diyat moi tair boos khon feel 
koibo dhoilu . Mor tair boos khon suar bohut mon hobo dhorile aru tair salwaror pant tu tolfale tanibo dhorilu 
kintu kokalot rosire gathi thua babe tolfale nahil. Moi eibaar tair kurta tu uporoloi uthai dilu aru tair salwaror 
rosi dal khuli dilu ,nije nije eibaar pant tu tololoi naami ahil. moi tair boga vorikhon sai aru thakibo nuarilu , 
bisonar pora nami tair thigh khonot pura suma khabo dhorilu, aru maje maje kamuribo dhorilu, laheke aru olop 
uporoloi juat tair heujiya rongor panty tu dekha palu , panty r uporote moi tair boos khonot suma khabo dhorilu, 
olop homoi teneke thokar pisot moi tair panty tu tolfale nomai dilu aru taniya r mukholoi salu , tai mur kando 
karkhana bilakor pora moja pai ase buli gam palu. Tair booskhon gulopia boosor uporor sulikhini shave kora kintu 
kahe kahe thoka suli khini thaki goisil. Bor moja gundho eta ahisil booskhonor pora, moi tairboos khon dui hatere 
meli dhori boosor uporor angso tu jivare subo dhorilu, tair gaat current loga r nisinake jopiai uthil , tair val 
loga buli gam pai moi aru jure jure boos khon supibo dhorilu, taik moi bisonar uporot uthai dilu, salwaror pant 
korobat poi rol , aru tair kurta tu petoloike uthai akou moi tair boos khonot mukh maribo dhorilu, jivare boos khon
seleki seleki , right handor anguli eta tair pokoror futat humai dilu, ki masti lagisil kintu tai bohut dukh pale 
karone moi pokoror pora anguli tu ulai dibo loga hol.\nEibaar moi mur pantu khuli bari dal ulai hatere sui sui 
boosot humuar babe ready koibo dhorilu. Moi tair kurta tu ekebare khuli dilu, aru boga bra tu tololoi tani dilu, 
tair dudu duta moi bhobatkoi olop horu, holeo moi dui hater pura tipibo dhorilu, aru nipple tu maje maje suhibo 
dhorilu. Hothate mur eta B.F t dekha pose monot poile aru moi tair bukut bohi lolu aru mur 7 inchi r bari daal tair
dudu dutar majot rakhi dui hatere dudu keitare sepi dhorilu, dudu dutar majot ji olop thai baki rol heikonokei boos
morar nisinake kokal maribo dhorilu , eibar aru olop uporoloi uthi bari daal taniya r mukhor vitorot humuai dilu, 
tai ready hua nasil jodiyo jur jobordosti humuai dilu mukhot, olop apsot tai adjust hol aru mur lund daal bor 
moromere ice cream khuar nisinake suhibo dhorile, tai mur guti duta eta hatere sui, eta hater bari dal muthi mari 
pura jure jure suhi asil.\nMur ulabo hua jen logat tair mukhor pora bari daal ulialu aru eibaar tair boosor kahot 
goi boosot baari daal thoi thela marilu. kintu tair boos khon pura tight thoka babe moi bohut sesta koiu humaabo 
nuarilu, Tetiya mur monot poril uth fota babe maie muk vaseline r tema eta disil. ajihe vaseline tur asol upojug 
hobo buli moi poketor pora vaseline tema tu uliyai olop maan vaseline mur baari dalor agtut aru tair boosot hani 
dilu, eibar eke thelate fir firkoi baari daal humai gol, taar pasot aru kune pai pura tika dangi dangi taik sudibo 
dhorilu. Dui hater bike r handle dhorar dore tair dudu keita dhori taik sudilu. 45 min maan sudar pasot maalpani 
ulai gol, boosor vitorote uliai dilu, vabilu ji hoi hobo sala imaan controll koibo nuari. Olop homoy teneke hui 
thaki dui jone dress thik koi rest room r pora ulai duty korat lagilu.\nEtiya ami prai hodayei restroomot sex koru,
ami duijone rest room tur naam sex room thoisu. Taniya mur girlfriend nohol jodiyo, mur best friend hoi thakil, 
etiya ami prayei kotha patu aru phonote bea bea kotha pati thaku.',
    "Front Page\nHome\nChat\nEntertainment Page\nAssamese Music\nAssamese Lyrics\nBest viewed @ 1024x768 
resolution\nXoixobote Dhemalite - Lyrics of Assamese Songs\nSinger : Dr. Bhupen Hazarika\nMusic Director : Dr. 
Bhupen Hazarika\nLyricist : Dr. Bhupen Hazarika\nXoixobote(Shoishobot) Dhemalite tumaare umola monot aase..\nBohag 
maahor luit khonot duyu xaature monot aase\nJowbonote duyure dehaar laajuki porox monot aase\nMur abihone sip jori 
loba buli kuwa mur monot aase..\nBohag maahor logote edin aahil bordoisila\nXei dhumuhat kaarubar xote tumi dekhu 
gusi gola\nMonor goraaki eri tumi Dhonor goraaki bolila\nDhon Dhon buli dhonor premere swarup prakax korila\nBohu 
din gol hothaate xidina tumak dekh monot aase..\naatoror pora dekhilu tumar swarna gohona jilikise\nTumar abihone 
sip jori lom buli tumi bhaabisa\nBhul korisa xunjoni mur\nAleek xapun dekhisa\nJiyaai thaakim..\nJiyaai thaaki 
ekhon xomaj gorhibor mur mon aase\nJot xunot koiu manuhor damm olop holeu besi aase!!\n*** 'X' stands for assamese 
phonetic letter whose pronounciation is in between 'H' and 'KH'.\nYou can listen to this song in the\nAssamese 
Songs Playlist\nDiscuss, query and comments on Assamese songs and music in this\nforum\n.\nCopyright © 2007-2008 
onlinesivasagar.com(Abhijit Borah)",
    '\n\n#1\n\n\nAssamese Sex Stories\nMOI ARU MAUSUMI (Assamese Sex Stories)\nHi friends aitu mur first post, akha
koru apunalukor val lagibo... mur nam Momi moi aji apunalukok mur nijor first sex\nexperienceor bikhoya jonabo 
bisarisu. Moi akhon grils collegeot TDC\n2nd year scienceot porhi asu. Ghorot khub strict babe muk porhat 
bhal\nholeo muk bahirot porhibo jabo nidile. Ghorot muk okole kotu jabo\nnidea, korbat gole maa nohole papa nohole 
dada logot jai, kintu mur\nchildhood friend Mausumi,k amar ghorot khub trust kore babe tair logot\nkorbat gole 
badha nidiya. Mausumi mur akai batchor kintu tai artsot\nakai collegeot porha. ami duyu khub bhal friend, ami duyu 
hokolu kotha\nshare koru, even tair boyfriendor logot kora romanceor bikhoye hokolu\nkotha. Tair boyfriend notunke 
Bankot join korise, tar nam Pallab.mur\nhihotor romanceor kotha huni khub bhal lage. Katiyaba bhabu muru 
jodi\nkunuba boyfriend thakil haten moyu sage tair nisina romance koribo\nparilu haten, kintu mur ja ulai juar 
chance nai. Mausumir kintu ghorot\nkunu restriction nai tai bindas ghuribo pare. Tai bohudin amar\nghororloi oha 
nasil haibabe adin collegot taik ghoroloi ahibo kolu\n(actually tair romanceor kotha hunibole bor mon goisil). Tai 
ata\nSunday ahibo buli kole. Mur montu bhal lagi gol, moi bor akhare Sunday\nloi wait koribo dhorilu..\nSunday 
hunkale uthi moi hokolu kam hekh kori tair babe wait koribo\ndhorilu tair dari hua babe abar taik fon kori hunkale 
matilu, tai ahi\nthoka buli kole aru thik 15 minitor pasot tai amar ghorot palehi, amar\nghorot coffee khai 
hokolure logot kotha pati ami mur roomot humalu…\nkisu homoi kotha potar pasot moi tair boyfriendor logot hua 
…[output truncated: a long romanized-Assamese adult forum post, an example of the very low educational quality content found in this sample]'
]

If a language-score threshold separates the good documents from the bad, we can filter on it. For example, we can keep only rows with a language score greater than 0.95.

df_filtered = df.filter(pl.col("language_score") > 0.95)
df_filtered.shape
(697, 11)
with open("good_ids", "w") as f:
    for id in df_filtered.select("id").to_series().to_list():
        f.write(f"{id}\n")

Filtering bigger languages

Some languages have a lot of data and so we can use the scan_parquet function to create a LazyFrame. Let’s see how we can do this for the Japanese language.

paths = get_paths_for_language("jpn")
len(paths)
148

You can see here we have many more files. If you have a lot of memory, you could use the standard read_parquet function. If you don't, scan_parquet is a better fit: it builds a lazy query plan and only reads the data a query actually needs, which is far more memory efficient. Even then, we might want to start with a subset of the data to experiment with, and only move to the full dataset once we're confident in our filtering.

import random

random.seed(42)

sample_paths = random.sample(paths, k=2)  # sample without replacement
# hf_transfer speeds up downloads from the Hub (requires the hf_transfer package)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

Path("temp_data").mkdir(exist_ok=True)

for path in tqdm(sample_paths):
    hf_hub_download(
        repo_id="HuggingFaceFW/fineweb-2",
        repo_type="dataset",
        filename=path,
        local_dir="temp_data",
    )
df = pl.scan_parquet("temp_data/**/*.parquet")
df.head(5).collect()
shape: (5, 11)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs
str str str str str str str f64 str i64 str
"欲しかった車を探せるサイト 独身時代は、ただ乗れればいいと思… "<urn:uuid:9221bbac-4ab3-4d7b-9… "CC-MAIN-2013-20" "http://careerspaceezine.com/" "2013-05-20T01:19:20Z" "s3://commoncrawl/crawl-data/CC… "jpn" 1.000009 "Jpan" 1 "{"jpn_Jpan_score": 1.000009059…
" ふくむすめどうわしゅう(Hukumusume fairy… "<urn:uuid:d03fc65f-99bb-4095-b… "CC-MAIN-2013-20" "http://hukumusume.com/douwa/En… "2013-05-20T01:18:14Z" "s3://commoncrawl/crawl-data/CC… "jpn" 0.992212 "Jpan" 2 "{"jpn_Jpan_score": 0.992212295…
"家電通信をお届けします 家電は一度購入したら、何年も使い続け… "<urn:uuid:89b3dae5-8a49-4d51-a… "CC-MAIN-2013-20" "http://wnclivehosting.com/inde… "2013-05-20T01:57:39Z" "s3://commoncrawl/crawl-data/CC… "jpn" 1.00001 "Jpan" 1 "{"jpn_Jpan_score": 1.000010013…
"出版社からのコメント MovableTypeの特徴のひとつと… "<urn:uuid:84019b07-0424-4d79-b… "CC-MAIN-2013-20" "http://www.amazon.co.jp/MOVABL… "2013-05-20T01:50:55Z" "s3://commoncrawl/crawl-data/CC… "jpn" 1.000009 "Jpan" 2 "{"jpn_Jpan_score": 1.000008940…
"FrontPage 私も結婚することで、今の保険に入ろうか考… "<urn:uuid:3fc5c2a5-c3a7-409c-b… "CC-MAIN-2013-20" "http://www.christian-louboutin… "2013-05-20T01:59:19Z" "s3://commoncrawl/crawl-data/CC… "jpn" 1.00001 "Jpan" 15 "{"jpn_Jpan_score": 1.000009894…
df.select("language_score").describe()
shape: (9, 2)
statistic language_score
str f64
"count" 3.3735e7
"null_count" 0.0
"mean" 0.999791
"std" 0.002776
"min" 0.886358
"25%" 0.999996
"50%" 1.000007
"75%" 1.000009
"max" 1.00001
df.filter(pl.col("url").str.contains("wikipedia")).count().collect(streaming=True)
shape: (1, 11)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs
u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
55053 55053 55053 55053 55053 55053 55053 55053 55053 55053 55053
japanese_edu_domains = [
    "http://www.asagaku.com/",
    "www3.nhk.or.jp/news/easy/",
    "http://kids.yahoo.co.jp/",
]
df.filter(pl.col("url").is_in(japanese_edu_domains)).count().collect(streaming=True)
shape: (1, 11)
text id dump url date file_path language language_score language_script minhash_cluster_size top_langs
u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
3 3 3 3 3 3 3 3 3 3 3

We’d obviously want to expand this list to include more domains. Note that is_in only keeps rows whose full URL exactly matches an entry in the list, which is why the count is so low; matching on the domain instead would catch every page from those sites. Either way, you can see how the same techniques let us filter very large datasets without running out of memory.