# %pip install polars huggingface_hub tld rich tqdm altair --upgrade
Filtering FineWeb2 using Polars
FineWeb2 was recently released. FineWeb2 builds on the previous FineWeb dataset to add data for many languages. Building on this work, we recently launched a community effort to build educational quality filters for many languages. See this blog post for more details.
Filtering the FineWeb2 dataset to improve educational quality and/or to filter for a language
One of the goals of the FineWeb-c project is to build educational quality filters for many languages. To do this, the community has been annotating the data with educational quality scores. So far, the majority of the datasets for each language consist of a random sample of 1,000 examples from FineWeb2 for that language. However, for some languages, the community has found that:
- the language identification is not always correct
- the educational quality of the sample is very low
For these languages, we want to enable the community to create extra filters, either to help with the language identification or to filter for educational quality. This blog post shows some ways you can use Polars to filter the FineWeb2 dataset to improve educational quality and/or to filter for a language.
First we’ll install the necessary libraries. We’ll use polars for the data manipulation and huggingface_hub to interact with the Hugging Face Hub. The dask library is another good option for working with large datasets.
from huggingface_hub import list_repo_files, hf_hub_download
import polars as pl
from tld import get_tld
from pathlib import Path
from tqdm.auto import tqdm
import os
# increase the amount of data Polars shows
pl.Config.set_tbl_rows(100)
polars.config.Config
Many large datasets on the Hub are organised into different configurations. These configurations are often named after the language they contain. For example, the FineWeb2 dataset is organised by language. Many large datasets will either have structured folders, or the names of files can be used to filter the dataset. Let’s look at the FineWeb2 dataset. We can use the wonderful huggingface_hub library to list the files in a repository.
paths = list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
paths[:10]
['.gitattributes',
'README.md',
'data/aai_Latn/test/000_00000.parquet',
'data/aai_Latn/train/000_00000.parquet',
'data/aai_Latn_removed/train/000_00000.parquet',
'data/aak_Latn/test/000_00000.parquet',
'data/aak_Latn/train/000_00000.parquet',
'data/aak_Latn_removed/train/000_00000.parquet',
'data/aau_Latn/test/000_00000.parquet',
'data/aau_Latn/train/000_00000.parquet']
You can see we have a hidden .gitattributes file, a README.md, and a data directory containing parquet files organised into different subdirectories. Since these names are very clear, we can create a very simple filter to get the Scots language files we’re interested in. We’ll look for sco in the file name, make sure the name ends with .parquet, and exclude files with removed in the name, since these are files that were removed in the FineWeb2 filtering process.
scots = [
    f for f in paths if ("sco" in f and f.endswith("parquet") and "removed" not in f)
]
scots
['data/sco_Latn/test/000_00000.parquet',
'data/sco_Latn/train/000_00000.parquet']
Loading the data in Polars
We can directly load the data from the Hugging Face Hub using the hf:// protocol. In this case we’ll just load the train file for the Scots language. We’ll use read_parquet to load the data for now, but we’ll see below a better way to load the data if you are working with large datasets.
df = pl.read_parquet(f"hf://datasets/HuggingFaceFW/fineweb-2/{scots[-1]}")
Let’s take a look at the data. We can see we have a number of columns including the actual text but also some other metadata fields that could be useful for filtering.
df.head(5)
text | id | dump | url | date | file_path | language | language_score | language_script | minhash_cluster_size | top_langs |
---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | f64 | str | i64 | str |
"2010 All Ford Mustangs Car Sho… | "<urn:uuid:06f10aff-f1da-4d33-b… | "CC-MAIN-2013-20" | "http://www.allfordmustangs.com… | "2013-05-23T16:34:05Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 0.764794 | "Latn" | 1258 | "{"sco_Latn_score": 0.764793634… |
"Interested in France? We'll se… | "<urn:uuid:abc6bfe8-7af5-40b9-9… | "CC-MAIN-2013-20" | "http://www.tripadvisor.com/All… | "2013-05-23T16:36:10Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 0.651096 | "Latn" | 12 | "{"sco_Latn_score": 0.651095628… |
"Sherlock Holmes Sherlock Holme… | "<urn:uuid:11ceff04-f5f5-418c-8… | "CC-MAIN-2014-10" | "http://sco.wikipedia.org/wiki/… | "2014-03-08T05:12:30Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.000008 | "Latn" | 58 | "{"sco_Latn_score": 1.000008225… |
"Munster History[eedit | eedit … | "<urn:uuid:5fd5fa85-72b1-43d3-b… | "CC-MAIN-2014-15" | "http://sco.wikipedia.org/wiki/… | "2014-04-19T09:31:48Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 79 | "{"sco_Latn_score": 1.000009536… |
"Snawbuirdin Frae Wikipedia Sna… | "<urn:uuid:72c97fcb-4820-4a52-b… | "CC-MAIN-2014-15" | "http://sco.wikipedia.org/wiki/… | "2014-04-19T09:31:00Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 66 | "{"sco_Latn_score": 1.000010013… |
We can do some simple EDA style analysis if we want. For example, we can look at the distribution of the language scores.
"language_score")).describe() df.select(pl.col(
statistic | language_score |
---|---|
str | f64 |
"count" | 75821.0 |
"null_count" | 0.0 |
"mean" | 0.537262 |
"std" | 0.214123 |
"min" | 0.300002 |
"25%" | 0.371339 |
"50%" | 0.465798 |
"75%" | 0.634602 |
"max" | 1.00001 |
We can also group by the year of the dump, take the mean language score, and plot a bar chart to see if there is a trend.
df.with_columns(
    pl.col("dump").str.extract(r"(\d{4})").cast(pl.Utf8).alias("year")
).group_by("year").agg(pl.col("language_score").mean()).sort(
    "year", descending=True
).plot.bar(x="year", y="language_score")
Heuristics for filtering for higher educational quality in FineWeb2
Whilst the authors of FineWeb2 aimed to do general quality filtering, there are often additional heuristics that can be used to filter for higher educational quality. For example, we can use the tld or the url to filter for higher quality websites. Many of these heuristics will require some domain knowledge about a particular language and the web ecosystem for that language.
The top level domain (TLD) is a good heuristic for filtering for higher quality websites. The top level domain is the part of the URL after the last dot (or, more precisely, its public suffix, which can span several labels). For example, the TLD of https://www.wikipedia.org/ is org. This often corresponds to a country or organization; for example, ac.uk is the UK’s higher education domain. We can use this to filter for higher quality websites.
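To see why the post uses the tld library rather than simple string splitting, here is a hypothetical naive_tld helper (not part of the original code; the URLs are illustrative) that takes everything after the last dot. It works for wikipedia.org but collapses multi-part suffixes like ac.uk:

```python
from urllib.parse import urlparse


def naive_tld(url: str) -> str:
    """Naively take everything after the last dot in the hostname."""
    host = urlparse(url).netloc
    return host.rsplit(".", 1)[-1]


print(naive_tld("https://www.wikipedia.org/"))    # org
print(naive_tld("https://www.ox.ac.uk/courses"))  # uk (loses the ac.uk suffix)
```

The tld library consults the public suffix list, so it can return ac.uk as a unit, which is exactly the granularity we want for this kind of filtering.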
We can do this by mapping the url
column to the tld and then filtering for the tlds we’re interested in. Let’s add a new column with the tld and then filter for the tlds we’re interested in.
df = df.with_columns(
    pl.col("url").map_elements(lambda x: get_tld(x), return_dtype=pl.Utf8).alias("tld")
)
import altair as alt

df.select("tld").to_series().value_counts(sort=True).sort(
    "count", descending=True
).head(20).plot.bar(
    x=alt.X("tld", sort="-y"),  # sort x-axis by count in descending order
    y="count",
)
We may already have some knowledge or intuitions about the TLDs that are more likely to be higher quality. For example, .us appears relatively often; this is likely partly because this domain is more common on the web generally. We may also see some personal blogs using this domain. Let’s take a look at a few examples.
df.filter(pl.col("tld").str.contains("us")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]
['https://coremc.us/forno-microonde-incasso.html',
'https://www.awesomedownloadfilestoday.us/1141-haircut-places-near-my-location.html',
'https://www.awesomedownloadfilestoday.us/1376-haircut-near-my-location.html',
'https://www.awesomedownloadfilestoday.us/1857-short-haircuts-for-fine-straight-hair.html',
'https://www.awesomedownloadfilestoday.us/2081-twa-styles-4c-hair.html',
'http://winserver.us/mid-century-modern-front-door-colors/mid-century-modern-front-door-colors-mid-century-modern-front-doors-door-colors-handles-mi-mid-century-modern-front-door-colours/',
'https://www.awesomedownloadfilestoday.us/3450-hair-styles-for-thick-short-hair.html',
'https://notwttodaytes.us/casa-mezcal-mexican-grill-cantina.html',
'https://www.awesomedownloadfilestoday.us/1857-short-haircuts-for-fine-straight-hair.html',
'https://www.awesomedownloadfilestoday.us/1737-short-haircuts-for-curly-thick-hair.html',
'http://uggbootsclearanceoutlet.us/jaguar-xj-sport-2003-2003-jaguar-xj-car-for-sale-in.html',
'https://www.awesomedownloadfilestoday.us/5489-hair-colour-and-styles-for-short-hair.html',
'https://www.awesomedownloadfilestoday.us/6213-short-haircuts-for-thick-hair-pictures.html',
'https://www.awesomedownloadfilestoday.us/4993-short-haircuts-for-thick-curly-frizzy-hair.html',
'https://www.awesomedownloadfilestoday.us/1773-short-hair-styles-for-thick-wavy-hair.html',
'https://www.awesomedownloadfilestoday.us/6129-short-haircuts-for-very-fine-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/1808-haircuts-for-straight-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/4478-best-short-haircuts-for-thick-coarse-hair.html',
'https://www.awesomedownloadfilestoday.us/9513-short-haircuts-for-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/3414-short-sassy-haircuts-for-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/4630-very-short-haircuts-for-thin-hair.html',
'https://joesstuff.us/search.php?mfr=141',
'https://www.awesomedownloadfilestoday.us/3738-short-hair-styles-for-thick-curly-hair.html',
'https://layartancep.us/twa-hairstyles.html',
'https://www.awesomedownloadfilestoday.us/1101-best-haircuts-for-fine-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/4963-short-haircut-for-fine-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/4963-short-haircut-for-fine-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/5010-haircuts-for-coarse-wavy-hair.html',
'https://www.awesomedownloadfilestoday.us/5201-best-short-haircuts-for-fine-thin-hair.html',
'https://www.awesomedownloadfilestoday.us/1911-haircut-for-round-face-and-thin-hair.html']
These don’t look super promising! One domain where we might expect higher quality Scots text is the .scot domain, a TLD for websites relating to Scotland.
df.filter(pl.col("tld").str.contains("sco")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]
['https://stormplay.scot/sco/aboot.html',
'https://www.makforrit.scot/2020/08/29/anent-the-scots-wikipedia-an-sundays-editathon/',
'https://www.makforrit.scot/2018/12/23/daein-it-yersel/',
'https://www.makforrit.scot/2019/09/22/uisin-oor-vyce-hou-we-can-gar-political-action-on-scots-inevitable/',
'https://www.makforrit.scot/2018/02/03/naewey-tae-bide/',
'https://www.makforrit.scot/',
'https://www.makforrit.scot/2018/01/27/than-an-nou-poverty-makkin-dae-an-leukin-out-for-ilk-ither/',
'https://www.makforrit.scot/',
'https://salvo.scot/the-scottis-constitutional-covin/',
'https://amylord.scot/gd/hello-welcome/',
'https://www.makforrit.scot/category/scotland/',
'https://projects.handsupfortrad.scot/scotslanguageawards/gies-a-scots-phrase-day-2021/',
'https://scoblog.stormplay.scot/t3ngist-is-gaunae-need-tae-be-delayed.html',
'https://www.makforrit.scot/2019/10/29/halloween/',
'https://www.makforrit.scot/category/history/',
'https://www.makforrit.scot/2018/11/19/three-days-in-october/',
'http://mindyerlanguage.scot/teachin',
'http://mindyerlanguage.scot/category/video',
'https://scoblog.stormplay.scot/rossies-3d-an-t3ngist-are-gaunae-be-delayed.html',
'https://www.gov.scot/publications/consultation-scots-government-commitments-tae-gaelic-scots-scots-langages-bill/pages/2/',
'https://projects.handsupfortrad.scot/scotslanguageawards/nominations-open-for-scots-language-awards-2020/',
'https://scoblog.stormplay.scot/were-gaunae-mak-big-chynges-tae-wir-main-wabsteid-an-a-cynge-o-hou-we-dae-things.html',
'https://newsnet.scot/archive/mono-or-stereo/',
'https://scoblog.stormplay.scot/happy-birthday-gamerstorm.html',
'http://mindyerlanguage.scot/aboot',
'https://www.makforrit.scot/category/opinion/',
'https://projects.handsupfortrad.scot/scotslanguageawards/gies-a-scots-phrase-day-2022/',
'https://newsnet.scot/archive/scots-railweys-scots-leids-an-scots-cairtes/',
'https://www.makforrit.scot/author/jamie/',
'https://stormplay.scot/games/pinkeye/sco/scotland.html']
Even inside these URLs we can see some Scots language, so this is promising.
One of the issues with some of the Scots data in FineWeb2 is that it is in the wrong language. One way we can try to get a sense of where better language data might be in FineWeb2 is to look at the TLDs that have the highest language scores. We can do this by grouping by TLD and taking the mean of the language scores. We then filter for TLDs with more than 50 rows, to make sure we’re only considering TLDs with a reasonable amount of data.
("tld")
df.group_by(
.agg(
["language_score").count().alias("count"),
pl.col("language_score").mean().alias("language_score"),
pl.col(
]
)filter(pl.col("count") > 50) # Replace n with your desired minimum count
."language_score", descending=True)
.sort( )
tld | count | language_score |
---|---|---|
str | u32 | f64 |
"scot" | 102 | 0.998978 |
"ac.uk" | 255 | 0.95732 |
"org.uk" | 267 | 0.926128 |
"org" | 8806 | 0.814764 |
"co.uk" | 659 | 0.770529 |
"blogspot.com" | 561 | 0.65765 |
"top" | 85 | 0.581157 |
"eu" | 275 | 0.558302 |
"de" | 362 | 0.544635 |
"club" | 807 | 0.543638 |
"co" | 2258 | 0.530558 |
"nl" | 1247 | 0.52335 |
"ca" | 75 | 0.521135 |
"com.co" | 54 | 0.520628 |
"ie" | 52 | 0.514327 |
"info" | 15820 | 0.506197 |
"com.au" | 88 | 0.505842 |
"com" | 25240 | 0.500029 |
"me" | 4114 | 0.490317 |
"online" | 182 | 0.489941 |
"mobi" | 183 | 0.484193 |
"pl" | 78 | 0.475017 |
"net" | 3416 | 0.473797 |
"es" | 56 | 0.466831 |
"tk" | 53 | 0.461307 |
"fr" | 141 | 0.461128 |
"it" | 79 | 0.460331 |
"site" | 832 | 0.456072 |
"xyz" | 122 | 0.450356 |
"co.za" | 66 | 0.450212 |
"store" | 119 | 0.443785 |
"in" | 135 | 0.439045 |
"co.ke" | 135 | 0.433248 |
"us" | 4655 | 0.43162 |
"pro" | 131 | 0.431553 |
"pages.dev" | 92 | 0.422582 |
"ru" | 2110 | 0.416347 |
"live" | 64 | 0.404111 |
"edu" | 102 | 0.386884 |
"website" | 54 | 0.371444 |
"cn" | 97 | 0.366857 |
We can see some other potentially promising TLDs. For example, ac.uk is the UK’s higher education domain. Let’s take a look at the URLs with this TLD.
df.filter(pl.col("tld").str.contains("ac.uk")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]
['https://www.scottishcorpus.ac.uk/document/?documentid=1699',
'https://www.scottishcorpus.ac.uk/document/?documentid=1759',
'https://www.abdn.ac.uk/elphinstone/kist/search/display.php?sblk65.dat',
'https://www.abdn.ac.uk/elphinstone/kist/display/folk-history/357/',
'https://scotslanguagepolicy.ac.uk/warkshoaps/',
'https://scotslanguagepolicy.ac.uk/survey-final-weekend/',
'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?fhrg01.dat',
'https://www.abdn.ac.uk/elphinstone/kist/display/761/',
'https://scotslanguagepolicy.glasgow.ac.uk/hae-yer-say/',
'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?lwee66.dat',
'https://scotslanguagepolicy.ac.uk/jist-fir-burns-nicht/',
'https://scotslanguagepolicy.ac.uk/aboot/',
'https://www.scottishcorpus.ac.uk/document/?documentid=122',
'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?bgre04.dat',
'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?arob01.dat',
'https://www.scottishcorpus.ac.uk/document/?documentid=1713&highlight=athort',
'https://www.scottishcorpus.ac.uk/document/?documentid=1695&highlight=projeck',
'https://www.scottishcorpus.ac.uk/document/?documentid=1709&highlight=projeck',
'https://www.abdn.ac.uk/elphinstone/kist/display/work/938/',
'https://www.scottishcorpus.ac.uk/document/?documentid=1714',
'https://www.scottishcorpus.ac.uk/document/?documentid=1739',
'https://www.abdn.ac.uk/elphinstone/kist/display/work/341/',
'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?acru01.dat',
'https://www.scottishcorpus.ac.uk/document/?documentid=1697&highlight=aroon',
'https://www.scottishcorpus.ac.uk/document/?documentid=1742&highlight=projeck',
'https://www.abdn.ac.uk/elphinstone/kist/search/display.php?kmac01.dat',
'https://www.scottishcorpus.ac.uk/document/?documentid=1704&highlight=projeck',
'https://scottishcorpus.ac.uk/document/?documentid=1710',
'https://www.scottishcorpus.ac.uk/document/?documentid=1725&highlight=direck',
'https://www.scottishcorpus.ac.uk/document/?documentid=1715&highlight=projeck']
In this case, using some EDA and domain knowledge, we can filter for the TLDs which are likely to be:
- in the Scots language
- higher quality educational websites
We can reduce the FineWeb2 dataset to only include the rows that have these tlds.
= ["sco", "ac.uk", "org.uk", "org"] good_tlds
df.filter(pl.col("tld").is_in(good_tlds)).sort("language_score", descending=True).head(
    5
)
text | id | dump | url | date | file_path | language | language_score | language_script | minhash_cluster_size | top_langs | tld |
---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | f64 | str | i64 | str | str |
"Snawbuirdin Frae Wikipedia Sna… | "<urn:uuid:72c97fcb-4820-4a52-b… | "CC-MAIN-2014-15" | "http://sco.wikipedia.org/wiki/… | "2014-04-19T09:31:00Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 66 | "{"sco_Latn_score": 1.000010013… | "org" |
"Banner o the Sahrawi Arab Demo… | "<urn:uuid:67052692-6020-4870-9… | "CC-MAIN-2014-15" | "http://sco.wikipedia.org/wiki/… | "2014-04-24T06:38:13Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 27 | "{"sco_Latn_score": 1.000010013… | "org" |
"Potosí is a ceety an the caipi… | "<urn:uuid:e49b07bb-d7c9-4905-b… | "CC-MAIN-2014-15" | "http://sco.wikipedia.org/wiki/… | "2014-04-21T15:05:27Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 34 | "{"sco_Latn_score": 1.000010013… | "org" |
"Port Moresby Port Moresby (Ing… | "<urn:uuid:bb6b995d-b3e8-4dcd-9… | "CC-MAIN-2014-35" | "http://sco.wikipedia.org/wiki/… | "2014-08-30T16:16:49Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 80 | "{"sco_Latn_score": 1.000010013… | "org" |
"Seville Seville is a ceety in … | "<urn:uuid:cdcca31a-693e-463b-a… | "CC-MAIN-2014-42" | "http://sco.wikipedia.org/wiki/… | "2014-10-22T21:45:17Z" | "s3://commoncrawl/crawl-data/CC… | "sco" | 1.00001 | "Latn" | 31 | "{"sco_Latn_score": 1.000010013… | "org" |
filtered_df = df.filter(pl.col("tld").is_in(good_tlds)).sort(
    "language_score", descending=True
)
We can now save the filtered data to a new file. We’ll save the ids of the rows that are in the filtered dataset to a file. These ids can then be used to upload additional filtered data to the Argilla dataset for the language we’re working on.
with open("good_ids", "w") as f:
for id in filtered_df.select("id").to_series().to_list():
f"{id}\n") f.write(
Filtering other languages
We can also use the same techniques to filter other languages. Some languages have a lot of data, so we can use the scan_parquet function to create a LazyFrame; this avoids loading all the data into memory. In addition, Polars will perform query optimizations on the LazyFrame, which makes our filtering code more efficient without much work on our part.
def get_paths_for_language(language: str):
    return [
        path
        for path in list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
        if path.endswith("parquet")
        and "removed" not in path
        and "train" in path
        and language in path
    ]
Filtering with a higher language score
Some documents in FineWeb2 are not identified as the correct language. Language identification is still not a “solved” problem, but we may be able to use a higher-confidence filter to get a set of data that is more likely to be in the correct language. We can then label this data for the educational quality of the text without having to discard as many examples for being in the incorrect language.
paths = get_paths_for_language("asm")
paths
['data/asm_Beng/train/000_00000.parquet',
'data/asm_Latn/train/000_00000.parquet']
Let’s load the data for the Assamese language using only the train file.
df = pl.read_parquet(f"hf://datasets/HuggingFaceFW/fineweb-2/{paths[-1]}")
df.shape
(1104, 11)
We can use the describe function to get a sense of the distribution of the language scores.
df.select("language_score").describe()
statistic | language_score |
---|---|
str | f64 |
"count" | 1104.0 |
"null_count" | 0.0 |
"mean" | 0.829071 |
"std" | 0.231866 |
"min" | 0.303687 |
"25%" | 0.660899 |
"50%" | 0.970777 |
"75%" | 0.995925 |
"max" | 0.999965 |
You can see that, compared to some other languages, the mean language score is quite low. We might be able to get a better subset of data by filtering for a higher language score. Let’s take a look at some examples of text with a high language score. This can help give us a sense of what threshold might produce fewer false positives.
from rich import print as rprint

examples_to_show = 3

rprint(
    df.filter(pl.col("language_score") > 0.9)
    .head(examples_to_show)
    .select("text")
    .to_series()
    .to_list()
)
[Output truncated: three long sample documents with language_score > 0.9, all romanised-Assamese web text, including adult-content forum stories and a song-lyrics page. The high scores look like true positives for the language, even though the content itself is far from educational quality.]
pantu khuli tai mur vori kaita fak kori panty tur uporote\nhatere muhari angulire pussy hole tu press kori muharibo dhorile…\nhosakoi… jibonot eman hukh moi katiyau pua nasilu… aibar tai muk\nbisonar pora uthibo kole … moi bisonar pora uthat tai mur toptu khuli\nbra aru panty khuli muk naked kori dile…. Mousumiru sex fellings\nhoisil.. muk naked kori tai nijor t-shirt tu muror uporere khuli loi\nbra tur huk khuli bratu khuli palala tar pasot tai jeans tu khulibo\ndhorile.. jeans tu khulat tai pindhi thoka light pink color panty tu\nmur chokut poril.. tair panty tu titi asil.. aibar tai panty tu khuli\npalai complete naked hoi gol…… tar pasot tai muk bisonat huai di mur\nvori kaita fak kori mur pussyt tair mukhon ni mur pussy suck koribo\ndhorile.. aibar uttajonat mur mukhere aahhhh… aahhhh…moaning ulabo\ndhorile.. tai suck kori kori tair jivakhon mur pussy fakot vorai diat\nmur uttajona barhi gol aru 2nd time pani ulai tair mukh tiaye dile…\ntai aidore kisu homoi korar pasot muk kole … momi moi aibar tumak\ncomplete satisfaction dim… tumar eat candle ase niki..? moi kolu\ntableor drawerot ase tai draweror pora adal candal ani mur pussyt\nlaheke logai press dibo dhorile… aru vitoriloi press koribo dhorile…\nhosai mur ane lagisil jan kunuba loraya muk fuck korise… first tai\nlahekoi candledal vitoroloi ni pasot jure ulua humua koribo dhorat moi\nrobo nuari 3rd time… pani uleai dilu…… already tairu uttajona barhi\ngoisil… aru tai robo nuari candle dal mur pussyr pora uleai tair nijor\npussyt vorai in-out kori pani uleai dile… ami duyu bahut tired hoi\ngoisilu, ami duyu bohu homoi naked hoi bisonat pori thokar pasot dress\nkori lolu…..\nAidore ji dhorone tai muk sex satisfaction dile… moi jibonot pahoribo\nnuaru…..\napunalukor hohari pale pasot aru potham..\nLast edited by sourav002 : 8th June 2013 at 02:15 PM.\n|Thread Tools||Search this Thread|\n\n' ]
If the language score is a useful signal, we can filter on it. For example, we can keep only rows with a language score greater than 0.95.
df_filtered = df.filter(pl.col("language_score") > 0.95)
df_filtered.shape
(697, 11)
with open("good_ids", "w") as f:
    for id in df_filtered.select("id").to_series().to_list():
        f.write(f"{id}\n")
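Once the ids are saved, later sessions can reuse them without recomputing the filter. A minimal sketch (the file name matches the cell above; the rows and id values are illustrative):

```python
from pathlib import Path

# Write some illustrative ids, mirroring the cell above
good_ids_path = Path("good_ids")
good_ids_path.write_text("id-1\nid-3\n")

# Read them back into a set for O(1) membership checks
good_ids = set(good_ids_path.read_text().splitlines())

# Filter any downstream records against the saved ids
rows = [{"id": "id-1"}, {"id": "id-2"}, {"id": "id-3"}]
kept = [row for row in rows if row["id"] in good_ids]
print([row["id"] for row in kept])  # ['id-1', 'id-3']
```

In Polars the same check is `df.filter(pl.col("id").is_in(list(good_ids)))`.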
Filtering bigger languages
Some languages have a lot of data and so we can use the scan_parquet
function to create a LazyFrame
. Let’s see how we can do this for the Japanese language.
paths = get_paths_for_language("jpn")
len(paths)
148
You can see here we have many more files. If you have a lot of memory, you could use the standard read_parquet
function. However, if you don’t have a lot of memory, you could use the scan_parquet
function. This builds a lazy query that only reads the data when collected (optionally in streaming chunks), which is much more memory efficient. Even so, we might want to start with a subset of the data to experiment with, and then move to the full dataset once we're confident in our filtering.
import random

random.seed(42)
sample_paths = random.choices(paths, k=2)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
Path("temp_data").mkdir(exist_ok=True)
for path in tqdm(sample_paths):
    hf_hub_download(
        repo_id="HuggingFaceFW/fineweb-2",
        repo_type="dataset",
        filename=path,
        local_dir="temp_data",
    )
df = pl.scan_parquet("temp_data/**/*.parquet")
df.head(5).collect()
text | id | dump | url | date | file_path | language | language_score | language_script | minhash_cluster_size | top_langs |
---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | f64 | str | i64 | str |
"欲しかった車を探せるサイト 独身時代は、ただ乗れればいいと思… | "<urn:uuid:9221bbac-4ab3-4d7b-9… | "CC-MAIN-2013-20" | "http://careerspaceezine.com/" | "2013-05-20T01:19:20Z" | "s3://commoncrawl/crawl-data/CC… | "jpn" | 1.000009 | "Jpan" | 1 | "{"jpn_Jpan_score": 1.000009059… |
" ふくむすめどうわしゅう(Hukumusume fairy… | "<urn:uuid:d03fc65f-99bb-4095-b… | "CC-MAIN-2013-20" | "http://hukumusume.com/douwa/En… | "2013-05-20T01:18:14Z" | "s3://commoncrawl/crawl-data/CC… | "jpn" | 0.992212 | "Jpan" | 2 | "{"jpn_Jpan_score": 0.992212295… |
"家電通信をお届けします 家電は一度購入したら、何年も使い続け… | "<urn:uuid:89b3dae5-8a49-4d51-a… | "CC-MAIN-2013-20" | "http://wnclivehosting.com/inde… | "2013-05-20T01:57:39Z" | "s3://commoncrawl/crawl-data/CC… | "jpn" | 1.00001 | "Jpan" | 1 | "{"jpn_Jpan_score": 1.000010013… |
"出版社からのコメント MovableTypeの特徴のひとつと… | "<urn:uuid:84019b07-0424-4d79-b… | "CC-MAIN-2013-20" | "http://www.amazon.co.jp/MOVABL… | "2013-05-20T01:50:55Z" | "s3://commoncrawl/crawl-data/CC… | "jpn" | 1.000009 | "Jpan" | 2 | "{"jpn_Jpan_score": 1.000008940… |
"FrontPage 私も結婚することで、今の保険に入ろうか考… | "<urn:uuid:3fc5c2a5-c3a7-409c-b… | "CC-MAIN-2013-20" | "http://www.christian-louboutin… | "2013-05-20T01:59:19Z" | "s3://commoncrawl/crawl-data/CC… | "jpn" | 1.00001 | "Jpan" | 15 | "{"jpn_Jpan_score": 1.000009894… |
df.select("language_score").describe()
statistic | language_score |
---|---|
str | f64 |
"count" | 3.3735e7 |
"null_count" | 0.0 |
"mean" | 0.999791 |
"std" | 0.002776 |
"min" | 0.886358 |
"25%" | 0.999996 |
"50%" | 1.000007 |
"75%" | 1.000009 |
"max" | 1.00001 |
df.filter(pl.col("url").str.contains("wikipedia")).count().collect(streaming=True)
text | id | dump | url | date | file_path | language | language_score | language_script | minhash_cluster_size | top_langs |
---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
55053 | 55053 | 55053 | 55053 | 55053 | 55053 | 55053 | 55053 | 55053 | 55053 | 55053 |
japanese_edu_domains = [
    "http://www.asagaku.com/",
    "www3.nhk.or.jp/news/easy/",
    "http://kids.yahoo.co.jp/",
]
df.filter(pl.col("url").is_in(japanese_edu_domains)).count().collect(streaming=True)
text | id | dump | url | date | file_path | language | language_score | language_script | minhash_cluster_size | top_langs |
---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
We'd obviously want to expand this list to include more domains, but this shows how the same techniques can filter very large datasets without running out of memory.
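One caveat with the `is_in` approach above: it only matches rows whose `url` exactly equals one of the listed strings, which explains the tiny count. Matching on the host name is usually closer to what we want. A sketch using the standard library's `urlparse` (the `tld` package imported earlier handles suffixes like `.co.jp` more robustly); the allow-list is illustrative:

```python
from urllib.parse import urlparse

# Illustrative allow-list of hosts (not a vetted list)
japanese_edu_hosts = {"www.asagaku.com", "www3.nhk.or.jp", "kids.yahoo.co.jp"}

def is_edu_host(url: str) -> bool:
    """True when the URL's host appears in the allow-list."""
    return urlparse(url).netloc in japanese_edu_hosts

print(is_edu_host("http://www3.nhk.or.jp/news/easy/article.html"))  # True
print(is_edu_host("http://example.com/news/easy/"))  # False
```

In Polars this predicate can be applied per row with `pl.col("url").map_elements(is_edu_host, return_dtype=pl.Boolean)`, though a vectorised `str.contains` on a host pattern will usually be faster.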