Distilling DeepSeek reasoning to ModernBERT classifiers

huggingface
datasets
arxiv
synthetic-data
deepseek
How can we use the reasoning ability of DeepSeek to generate synthetic labels for fine tuning a ModernBERT model?
Author

Daniel van Strien

Published

January 29, 2025

%pip install polars huggingface_hub datasets openai --upgrade

Updates

  • 2025-01-30: Sayak Paul has implemented this pipeline using Transformers (code). This will be a super good option if you already have sufficient GPUs for running these models.

How can we get the best of both worlds?

tl;dr, how can we use LLMs to generate labels to fine-tune a ModernBERT model?

It’s fair to say that DeepSeek-R1 has made quite an impact in the last few weeks. It’s a powerful reasoning model that excels at many tasks that require reasoning. One particularly exciting aspect of the release of this model, though, is the distilled versions of the model. These models are much smaller but still retain a lot of the reasoning ability of the larger models.

Classification often requires reasoning

While the interest in reasoning models often focuses on use cases like mathematics and coding, there are many other use cases where reasoning can be helpful. One example is classification. Although some classification problems are very simple and mostly require “pattern matching,” there are many other problems where reasoning is needed. This is where a reasoning model could be helpful.

Can we distil even smaller models?

While the distilled models are fairly small (the smallest is 1.5B), we may still prefer to have an even smaller model for many use cases. If you can remember all the way back to December 2024, the ModernBERT release introduced a new BERT model, which is a good candidate for this kind of efficient classification use case. The main challenge is that in order to train a classifier, we need labeled data. This is where we can use a reasoning model to generate synthetic labels.

The use case: classifying ArXiv papers that introduce a newly created dataset

As the Machine Learning Librarian at Hugging Face, I want to keep track of new datasets being shared on ArXiv. While you can search for “dataset” or “benchmark” in the title or abstract, this returns any papers that mention datasets or benchmarks. I’m only interested in papers that introduce a newly created dataset.

So the goal is: given an article, classify whether it introduces a newly created dataset.

I’ll use Polars to load the ArXiv dataset from the Hub, but you can use whichever data tool you want.

Note

Feel free to skip this section if you are not interested in the use case and just want to see how to do the labelling part.

import os
import polars as pl
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # turn on HF_TRANSFER
files = snapshot_download(
    repo_id="librarian-bots/arxiv-metadata-snapshot",
    allow_patterns=["*.parquet"],
    repo_type="dataset",
)
df = pl.scan_parquet(files)

Let’s look at the first row. We see a bunch of metadata about the paper and then the title and abstract. These are probably the columns we’ll want to use as input for our model.

df.head(1).collect()
shape: (1, 14)
id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed
str str str str str str str str str str str list[struct[2]] datetime[ms] list[list[str]]
"1004.3702" "Lizhi Du" "Lizhi Du" "A Polynomial time Algorithm fo… "26 pages. This time, I add a d… null null null "cs.DS" "http://arxiv.org/licenses/none… "  Based on the famous Rotation… [{"v1","Mon, 12 Apr 2010 04:39:27 GMT"}, {"v10","Mon, 5 Nov 2012 01:44:46 GMT"}, … {"v9","Wed, 29 Aug 2012 06:39:31 GMT"}] 2025-01-24 00:00:00 [["Du", "Lizhi", ""]]

You will see there is a categories column. This is a space-separated string of the categories the paper belongs to. We can grab a few examples of the categories.

df.head(10).collect().select("categories").to_series().to_list()
['cs.DS',
 'math.GM',
 'math.CA math.AT math.DG math.DS',
 'cond-mat.mtrl-sci',
 'cond-mat.mtrl-sci',
 'math.GT',
 'math.GT',
 'math.GT',
 'math.AP',
 'math.AP math-ph math.MP math.SP']

For my particular use case, I’m mostly interested in papers in the computer science category, i.e. those with a category starting with “cs.” in the categories column.

df = df.filter(
    pl.col("categories")
    .str.split(" ")
    .list.eval(pl.element().str.starts_with("cs."))
    .list.any()
)

We’ll filter papers to only include those that contain the word “dataset” in the title or abstract; again, you could easily change this to use other words.

Note

One thing to consider here is that ideally you want the distribution of data you use for training the model to be similar to the distribution of data you will use in practice. Since I will only check ArXiv papers that contain the word “dataset” in the title or abstract, I will filter out a lot of the data before it even gets passed to the model. For your use case, consider the distribution of data you’ll be using in practice and filter the data accordingly.

df = df.filter(
    pl.col("title").str.contains("dataset") | pl.col("abstract").str.contains("dataset")
)
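
Note that str.contains is case-sensitive by default, so a title containing “Dataset” with a capital “D” would only be caught if the lowercase word also appears in the abstract. If you want a case-insensitive match, a minimal variant might look like the sketch below (Polars uses Rust regex syntax, so the inline (?i) flag works; df_ci is not used in the rest of this post):

# Case-insensitive variant of the keyword filter (sketch, not used below)
df_ci = df.filter(
    pl.col("title").str.contains("(?i)dataset")
    | pl.col("abstract").str.contains("(?i)dataset")
)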

Since we’re using the Polars lazy API, we need to call collect() to actually execute the query and return the data.

df = df.collect()

Generate labels, not synthetic data

There has been significant growth in the use of LLMs for synthetic data generation over the past couple of years. While we could generate fully synthetic data, i.e., developing both the “input” and “target” columns, if we already have some data we want to work with, it makes more sense to generate labels. One of the significant challenges with synthetic data generation is that the data generated is often not representative of the data we want to use in practice. For generative tasks, this might matter slightly less. Since we’re focused on building classifiers, which will often target quite a narrow use case or domain, the data we use to train the model must be representative of the data we want to use in practice.

In this case, it might be more sensible to use a model’s reasoning ability to generate labels rather than generate synthetic data.

Let’s grab a few examples from the data to use as a starting point.

examples = df.head(4).select(pl.col(["abstract", "title"])).to_dicts()
examples[0]
{'abstract': '  This paper presents a new fuzzy k-means algorithm for the clustering of high\ndimensional data in various subspaces. Since, In the case of high dimensional\ndata, some features might be irrelevant and relevant but may have different\nsignificance in the clustering. For a better clustering, it is crucial to\nincorporate the contribution of these features in the clustering process. To\ncombine these features, in this paper, we have proposed a new fuzzy k-means\nclustering algorithm in which the objective function of the fuzzy k-means is\nmodified using two different entropy term. The first entropy term helps to\nminimize the within-cluster dispersion and maximize the negative entropy to\ndetermine clusters to contribute to the association of data points. The second\nentropy term helps to control the weight of the features because different\nfeatures have different contributing weights in the clustering process for\nobtaining the better partition of the data. The efficacy of the proposed method\nis presented in terms of various clustering measures on multiple datasets and\ncompared with various state-of-the-art methods.\n',
 'title': 'An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for\n  High Dimensional Data'}

Structured generation?

We’ll start by using a structured generation approach to generate the labels. This means we’ll define a schema for the model’s output and use it to constrain what the model generates. I’ve written more about this in a previous blog post, but the main benefit is that we don’t have to do a lot of work to parse the output of the model and can be sure it’s in a format we can easily train on.

In this case, we define a Pydantic model with a label and an explanation field.

from enum import Enum
from pydantic import BaseModel, constr


class DatasetLabel(str, Enum):
    NEW = "new_dataset"
    NOT_NEW = "no_new_dataset"


class IntroducesNewDataset(BaseModel):
    explanation: constr(min_length=40)
    label: DatasetLabel

We define a function to format the data as a prompt. This function takes a dictionary with the title and abstract and formats it as a prompt for the model.

def format_text_as_prompt(data: dict[str, str]):
    return f"""Look at the title and abstract for the following arXiv paper. Assess whether the paper is likely to introduce a newly created dataset.


Title: {data['title']}
Abstract: {data['abstract']}

Your role is to decide whether the paper introduces a newly created dataset. First you should think about whether the paper is likely to introduce a newly created dataset. You should then return your reasoning and the label you've chosen. 
You should choose out of the "new_dataset" or "no_new_dataset" labels.

Return your reasoning and the label you've chosen as a JSON object like this:
```json
{{
    "label": "new_dataset" | "no_new_dataset",
    "explanation": "The reasoning the model used to come to its conclusion"
}}
```
"""
print(format_text_as_prompt(examples[0]))
Look at the title and abstract for the following arXiv paper. Assess whether the paper is likely to introduce a newly created dataset.


Title: An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for
  High Dimensional Data
Abstract:   This paper presents a new fuzzy k-means algorithm for the clustering of high
dimensional data in various subspaces. Since, In the case of high dimensional
data, some features might be irrelevant and relevant but may have different
significance in the clustering. For a better clustering, it is crucial to
incorporate the contribution of these features in the clustering process. To
combine these features, in this paper, we have proposed a new fuzzy k-means
clustering algorithm in which the objective function of the fuzzy k-means is
modified using two different entropy term. The first entropy term helps to
minimize the within-cluster dispersion and maximize the negative entropy to
determine clusters to contribute to the association of data points. The second
entropy term helps to control the weight of the features because different
features have different contributing weights in the clustering process for
obtaining the better partition of the data. The efficacy of the proposed method
is presented in terms of various clustering measures on multiple datasets and
compared with various state-of-the-art methods.


Your role is to decide whether the paper introduces a newly created dataset. First you should think about whether the paper is likely to introduce a newly created dataset. You should then return your reasoning and the label you've chosen. 
You should choose out of the "new_dataset" or "no_new_dataset" labels.

Return your reasoning and the label you've chosen as a JSON object like this:
```json
{
    "label": "new_dataset" | "no_new_dataset",
    "explanation": "The reasoning the model used to come to its conclusion"
}
```

Using LM Studio to develop our approach

One of the powerful features of open source is that it makes it easier to run models in different places. While developing our approach, we can use a smaller version of the model to test it and then switch to a hosted version once we’re happy with it.

We’ll run the model using LM Studio. LM Studio is primarily known as a UI for running local LLMs, but it also has a server mode, which we’ll use here. We can interact with the server using the CLI. To start the server, we can run the following command.

!lms server start
Starting server...
Success! Server is now running on port 1234

We can use ls to see the models that are available; we’ll filter these to only show the DeepSeek models.

!lms ls | grep DeepSeek
lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF      1.12 GB          Qwen2           
lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF        4.68 GB          Qwen2           
lmstudio-community/DeepSeek-R1-Distill-Qwen-14B-GGUF       8.99 GB          Qwen2           
lmstudio-community/DeepSeek-R1-Distill-Llama-8B-GGUF       4.92 GB          Llama           
Note

Note that the output here shows models I already have locally. There are many more models LM Studio can download from the Hugging Face Hub.

We can load the model by running the following command. If the model is not already downloaded, LM Studio will download it. We’ll try and see how well the 7B model does.

!lms load DeepSeek-R1-Distill-Qwen-7B-GGUF

Loading model "lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"...
[LMStudioClient][LLM] Start loading model lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf...
Model loaded successfully in 4.20s. (4.68 GB)
To use the model in the API/SDK, use the identifier "deepseek-r1-distill-qwen-7b".
To set a custom identifier, use the --identifier <identifier> option.

Since LM Studio has an OpenAI compatible API, we can use the OpenAI Python client to interact with the server. We just need to set the base URL to the LM Studio server and set the API key to lm-studio.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

Once we’ve created the client, we can interact with it in the usual way, e.g. to see the available models we can run the following command.

client.models.list()
SyncPage[Model](data=[Model(id='deepseek-r1-distill-qwen-7b', created=None, object='model', owned_by='organization_owner')], object='list')

Generating labels

We can now generate labels for our examples. We’ll use the format_text_as_prompt function to format the data as a prompt and then pass it to the model. Since we’re using a structured output, we need to use the beta.chat.completions endpoint. We pass in our Pydantic model as the response_format argument.

messages = [
    {"role": "user", "content": format_text_as_prompt(examples[0])},
]


response = client.beta.chat.completions.parse(
    model="deepseek-r1-distill-qwen-7b",
    messages=messages,
    temperature=0.7,
    response_format=IntroducesNewDataset,
)

We can check that we can parse the output of the model into our Pydantic model.

IntroducesNewDataset.model_validate_json(response.choices[0].message.content)
IntroducesNewDataset(explanation="The paper discusses an entropy-based fuzzy k-means algorithm designed for high-dimensional data. While it mentions incorporating feature contributions into clustering, there's no information about introducing a new dataset.", label=<DatasetLabel.NOT_NEW: 'no_new_dataset'>)

We’ll wrap this in a function so we can easily use it for a lot of examples.

def predict_label(
    data: dict[str, str], model: str = "deepseek-r1-distill-qwen-1.5b", client=client
) -> IntroducesNewDataset | None:
    try:
        prompt = format_text_as_prompt(data)
        messages = [
            {"role": "user", "content": prompt},
        ]
        response = client.beta.chat.completions.parse(
            model=model,
            messages=messages,
            temperature=0.7,
            response_format=IntroducesNewDataset,
        )
        return IntroducesNewDataset.model_validate_json(
            response.choices[0].message.content
        )
    except Exception as e:
        print(e)
        return None

Before doing a big batch of predictions, let’s run the model on a few examples so we can see how it does.

from rich import print as rich_print

structured_results = []
for example in examples:
    title = example["title"]
    abstract = example["abstract"]
    prediction = predict_label(example)
    structured_results.append(prediction)
    rich_print(title)
    rich_print(abstract)
    rich_print(prediction)
    rich_print("---")
An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for
  High Dimensional Data
  This paper presents a new fuzzy k-means algorithm for the clustering of high
dimensional data in various subspaces. Since, In the case of high dimensional
data, some features might be irrelevant and relevant but may have different
significance in the clustering. For a better clustering, it is crucial to
incorporate the contribution of these features in the clustering process. To
combine these features, in this paper, we have proposed a new fuzzy k-means
clustering algorithm in which the objective function of the fuzzy k-means is
modified using two different entropy term. The first entropy term helps to
minimize the within-cluster dispersion and maximize the negative entropy to
determine clusters to contribute to the association of data points. The second
entropy term helps to control the weight of the features because different
features have different contributing weights in the clustering process for
obtaining the better partition of the data. The efficacy of the proposed method
is presented in terms of various clustering measures on multiple datasets and
compared with various state-of-the-art methods.

IntroducesNewDataset(
    explanation="The paper presents an algorithm for clustering high-dimensional data, focusing on feature 
weighting and entropy-based modifications to the fuzzy k-means method. The abstract mentions that their proposed 
method is evaluated against various datasets using different measures. Since the title doesn't suggest a new 
dataset but rather an improvement or variation in an existing one (fuzzy k-means), and the abstract emphasizes 
performance evaluation across multiple datasets without indicating the introduction of a new one, it's reasonable 
to assume that no new dataset was created in this paper.",
    label=<DatasetLabel.NOT_NEW: 'no_new_dataset'>
)
---
Identifying Influential Brokers on Social Media from Social Network
  Structure
  Identifying influencers in a given social network has become an important
research problem for various applications, including accelerating the spread of
information in viral marketing and preventing the spread of fake news and
rumors. The literature contains a rich body of studies on identifying
influential source spreaders who can spread their own messages to many other
nodes. In contrast, the identification of influential brokers who can spread
other nodes' messages to many nodes has not been fully explored. Theoretical
and empirical studies suggest that involvement of both influential source
spreaders and brokers is a key to facilitating large-scale information
diffusion cascades. Therefore, this paper explores ways to identify influential
brokers from a given social network. By using three social media datasets, we
investigate the characteristics of influential brokers by comparing them with
influential source spreaders and central nodes obtained from centrality
measures. Our results show that (i) most of the influential source spreaders
are not influential brokers (and vice versa) and (ii) the overlap between
central nodes and influential brokers is small (less than 15%) in Twitter
datasets. We also tackle the problem of identifying influential brokers from
centrality measures and node embeddings, and we examine the effectiveness of
social network features in the broker identification task. Our results show
that (iii) although a single centrality measure cannot characterize influential
brokers well, prediction models using node embedding features achieve F$_1$
scores of 0.35--0.68, suggesting the effectiveness of social network features
for identifying influential brokers.

IntroducesNewDataset(
    explanation="... reason ...”,... } To determine whether the paper introduces a newly created dataset, let's 
analyze the information provided in the title and abstract. The title is ",
    label=<DatasetLabel.NEW: 'new_dataset'>
)
---
Improving Performance of Automatic Keyword Extraction (AKE) Methods
  Using PoS-Tagging and Enhanced Semantic-Awareness
  Automatic keyword extraction (AKE) has gained more importance with the
increasing amount of digital textual data that modern computing systems
process. It has various applications in information retrieval (IR) and natural
language processing (NLP), including text summarisation, topic analysis and
document indexing. This paper proposes a simple but effective
post-processing-based universal approach to improve the performance of any AKE
methods, via an enhanced level of semantic-awareness supported by PoS-tagging.
To demonstrate the performance of the proposed approach, we considered word
types retrieved from a PoS-tagging step and two representative sources of
semantic information - specialised terms defined in one or more
context-dependent thesauri, and named entities in Wikipedia. The above three
steps can be simply added to the end of any AKE methods as part of a
post-processor, which simply re-evaluate all candidate keywords following some
context-specific and semantic-aware criteria. For five state-of-the-art (SOTA)
AKE methods, our experimental results with 17 selected datasets showed that the
proposed approach improved their performances both consistently (up to 100% in
terms of improved cases) and significantly (between 10.2% and 53.8%, with an
average of 25.8%, in terms of F1-score and across all five methods), especially
when all the three enhancement steps are used. Our results have profound
implications considering the ease to apply our proposed approach to any AKE
methods and to further extend it.

IntroducesNewDataset(
    explanation="The paper focuses on improving automatic keyword extraction methods using PoS-tagging and 
semantic-awareness. It mentions experiments with five state-of-the-art AKE methods across 17 datasets, but there's 
no indication of introducing a new dataset.",
    label=<DatasetLabel.NOT_NEW: 'no_new_dataset'>
)
---
LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy
  Accurate sound source localization (SSL) requires consistent multichannel
data for reliable degree of arrival (DoA) estimation. However, intermittently
powered batteryless systems often suffer from incomplete sensor data due to the
stochastic nature of energy harvesting. Existing methods struggle with missing
channels, leading to significant performance degradation. In this paper, we
propose $\textit{LOCUS}$, a novel deep learning-based system designed to
recover corrupted features for SSL in batteryless systems. $\textit{LOCUS}$
addresses missing data by leveraging information entropy estimation and
conditional interpolation, combining three modules: (1) Information-Weighted
Focus (InFo), which identifies and quantifies corrupted data elements, (2)
Latent Feature Synthesizer (LaFS), which synthesizes missing features, and (3)
Guided Replacement (GRep), which intelligently replaces missing elements while
preserving valid data. We demonstrate significant performance improvements
using two datasets: DCASE and LargeSet, where $\textit{LOCUS}$ achieves up to
$36.91\%$ lower DoA error compared to existing methods. Real-world evaluations
across three environments with intermittent power sources show a
$25.87-59.46\%$ improvement in performance when channels are stochastically
missing. Additionally, we release a 50-hour multichannel dataset to support
further research in SSL.

IntroducesNewDataset(
    explanation="... reason why you think it's a new dataset or not",
    label=<DatasetLabel.NEW: 'new_dataset'>
)
---

Room to think?

One of the features of the R1 model is that it has “reasoning”, which is delineated by <think> and </think> tags. Since our structured output doesn’t leave room for this, let’s see how well the model does without the structured output constraint.

def predict_label_without_structured_output(
    data: dict[str, str], model: str = "deepseek-r1-distill-qwen-1.5b", client=client
) -> str:
    prompt = format_text_as_prompt(data)
    messages = [
        {"role": "user", "content": prompt},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content

We’ll compare the results from the two approaches.

# compare the results vs structured output
for i, example in enumerate(examples):
    rich_print(example["title"])
    rich_print(example["abstract"])
    prediction = predict_label_without_structured_output(example)
    print(f"Previous: {structured_results[i].label}")
    print(f"New: {prediction}")
    rich_print("---")
An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for
  High Dimensional Data
  This paper presents a new fuzzy k-means algorithm for the clustering of high
dimensional data in various subspaces. Since, In the case of high dimensional
data, some features might be irrelevant and relevant but may have different
significance in the clustering. For a better clustering, it is crucial to
incorporate the contribution of these features in the clustering process. To
combine these features, in this paper, we have proposed a new fuzzy k-means
clustering algorithm in which the objective function of the fuzzy k-means is
modified using two different entropy term. The first entropy term helps to
minimize the within-cluster dispersion and maximize the negative entropy to
determine clusters to contribute to the association of data points. The second
entropy term helps to control the weight of the features because different
features have different contributing weights in the clustering process for
obtaining the better partition of the data. The efficacy of the proposed method
is presented in terms of various clustering measures on multiple datasets and
compared with various state-of-the-art methods.

Previous: DatasetLabel.NOT_NEW
New: <think>
Okay, so I need to figure out whether the paper introduces a newly created dataset. The title and abstract are provided.

The title is: "An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data." It mentions an algorithm related to clustering high-dimensional data using fuzzy k-means with some entropy terms and feature weighting.

Looking at the abstract, it says they've proposed a new fuzzy k-means algorithm. The focus is on modifying the objective function by adding two different entropy terms: one to minimize within-cluster dispersion and another to control feature weights because features have varying contributions in clustering.

The paper mentions that their method was tested against various datasets and compared with state-of-the-art methods, but there's no explicit mention of introducing a new dataset. They evaluate performance on multiple existing datasets without specifying any novel data creation here.

So, the key points are: they're improving an algorithm for high-dimensional clustering but don't indicate creating a new dataset; instead, they apply it to various datasets that already exist.
</think>

The paper does not introduce a newly created dataset as part of its methodology. It focuses on enhancing an existing fuzzy k-means algorithm and evaluates its performance across multiple existing datasets.

```json
{
    "label": "no_new_dataset",
    "explanation": "The paper discusses modifications to an existing fuzzy k-means algorithm for high-dimensional data, but it does not mention introducing a new dataset. Instead, it evaluates the proposed method on various datasets that likely already exist."
}
```
---
Identifying Influential Brokers on Social Media from Social Network
  Structure
  Identifying influencers in a given social network has become an important
research problem for various applications, including accelerating the spread of
information in viral marketing and preventing the spread of fake news and
rumors. The literature contains a rich body of studies on identifying
influential source spreaders who can spread their own messages to many other
nodes. In contrast, the identification of influential brokers who can spread
other nodes' messages to many nodes has not been fully explored. Theoretical
and empirical studies suggest that involvement of both influential source
spreaders and brokers is a key to facilitating large-scale information
diffusion cascades. Therefore, this paper explores ways to identify influential
brokers from a given social network. By using three social media datasets, we
investigate the characteristics of influential brokers by comparing them with
influential source spreaders and central nodes obtained from centrality
measures. Our results show that (i) most of the influential source spreaders
are not influential brokers (and vice versa) and (ii) the overlap between
central nodes and influential brokers is small (less than 15%) in Twitter
datasets. We also tackle the problem of identifying influential brokers from
centrality measures and node embeddings, and we examine the effectiveness of
social network features in the broker identification task. Our results show
that (iii) although a single centrality measure cannot characterize influential
brokers well, prediction models using node embedding features achieve F$_1$
scores of 0.35--0.68, suggesting the effectiveness of social network features
for identifying influential brokers.

Previous: DatasetLabel.NEW
New: <think>
Okay, so I need to figure out whether the paper titled "Identifying Influential Brokers on Social Media from Social Network Structure" introduces a new dataset. Let me break this down.

First, looking at the title, it's about identifying influential brokers in social media using network structure. The abstract mentions they used three social media datasets to study these influencers. They compared brokers with source spreaders and central nodes based on centrality measures.

The abstract also talks about tackling the problem of identifying brokers from both centrality measures and node embeddings. It evaluates the effectiveness of network features, getting some F1 scores as a result.

So, I'm trying to see if they created any new dataset or used existing ones. They mention using three datasets: Twitter in their experiments. The paper doesn't seem to introduce any entirely new type of data beyond what's commonly available, like Twitter datasets. They're analyzing these existing datasets with their methods.

Therefore, the paper probably uses existing social media datasets rather than creating a new one.
</think>

The paper does not introduce a newly created dataset; it utilizes existing social media datasets such as Twitter for its analysis.

```json
{
    "label": "no_new_dataset",
    "explanation": "The paper does not introduce any new datasets. It uses three social media datasets, including Twitter, which are already available data sources."
}
```
---
Improving Performance of Automatic Keyword Extraction (AKE) Methods
  Using PoS-Tagging and Enhanced Semantic-Awareness
  Automatic keyword extraction (AKE) has gained more importance with the
increasing amount of digital textual data that modern computing systems
process. It has various applications in information retrieval (IR) and natural
language processing (NLP), including text summarisation, topic analysis and
document indexing. This paper proposes a simple but effective
post-processing-based universal approach to improve the performance of any AKE
methods, via an enhanced level of semantic-awareness supported by PoS-tagging.
To demonstrate the performance of the proposed approach, we considered word
types retrieved from a PoS-tagging step and two representative sources of
semantic information - specialised terms defined in one or more
context-dependent thesauri, and named entities in Wikipedia. The above three
steps can be simply added to the end of any AKE methods as part of a
post-processor, which simply re-evaluate all candidate keywords following some
context-specific and semantic-aware criteria. For five state-of-the-art (SOTA)
AKE methods, our experimental results with 17 selected datasets showed that the
proposed approach improved their performances both consistently (up to 100% in
terms of improved cases) and significantly (between 10.2% and 53.8%, with an
average of 25.8%, in terms of F1-score and across all five methods), especially
when all the three enhancement steps are used. Our results have profound
implications considering the ease to apply our proposed approach to any AKE
methods and to further extend it.

Previous: DatasetLabel.NOT_NEW
New: <think>
Okay, I'm trying to figure out whether the paper "Improving Performance of Automatic Keyword Extraction (AKE) Methods Using PoS-Tagging and Enhanced Semantic-Awareness" introduces a newly created dataset. 

First, looking at the title suggests that it's about improving an existing AKE method, which implies they're working with existing datasets rather than creating new ones.

The abstract mentions experiments conducted on 17 selected datasets for five SOTA AKE methods. They used these datasets to test their approach but didn't mention any new data collection or creation here. 

So, there's no indication that the paper includes a newly created dataset in its methodology or results section.
</think>

The paper focuses on enhancing existing AKE methods using PoS-tagging and semantic-aware criteria without introducing new datasets.

```json
{
    "label": "no_new_dataset",
    "explanation": "The paper does not mention any new datasets being created. It evaluates existing 17 datasets with five SOTA AKE methods."
}
```
---
LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy
  Accurate sound source localization (SSL) requires consistent multichannel
data for reliable degree of arrival (DoA) estimation. However, intermittently
powered batteryless systems often suffer from incomplete sensor data due to the
stochastic nature of energy harvesting. Existing methods struggle with missing
channels, leading to significant performance degradation. In this paper, we
propose $\textit{LOCUS}$, a novel deep learning-based system designed to
recover corrupted features for SSL in batteryless systems. $\textit{LOCUS}$
addresses missing data by leveraging information entropy estimation and
conditional interpolation, combining three modules: (1) Information-Weighted
Focus (InFo), which identifies and quantifies corrupted data elements, (2)
Latent Feature Synthesizer (LaFS), which synthesizes missing features, and (3)
Guided Replacement (GRep), which intelligently replaces missing elements while
preserving valid data. We demonstrate significant performance improvements
using two datasets: DCASE and LargeSet, where $\textit{LOCUS}$ achieves up to
$36.91\%$ lower DoA error compared to existing methods. Real-world evaluations
across three environments with intermittent power sources show a
$25.87-59.46\%$ improvement in performance when channels are stochastically
missing. Additionally, we release a 50-hour multichannel dataset to support
further research in SSL.

Previous: DatasetLabel.NEW
New: <think>
Alright, so I'm trying to figure out whether the paper introduces a newly created dataset. Let's look at the information given.

First, the title is "LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy." It mentions an acronym LOCUS, which seems to be the main focus of the paper—sound source localization using deep learning methods to handle missing data in batteryless systems.

Now looking at the abstract. The authors mention that their method addresses missing channels by leveraging information entropy estimation and conditional interpolation through three modules: InFo, LaFS, and GRep. They demonstrate performance improvements on two datasets called DCASE and LargeSet, achieving up to 36.91% reduction in DoA error compared to existing methods.

The abstract also includes a real-world evaluation across three environments with intermittent power sources, showing significant performance improvements when channels are stochastically missing. Additionally, the authors mention releasing a 50-hour multichannel dataset called DCASE 2023 Challenge Set v1.0.

From this information, it's clear that the paper not only presents their method but also provides new datasets for evaluation—specifically the DCASE and LargeSet datasets, along with the newly released DCASE challenge set. These datasets are likely used to test and validate the performance of their proposed system, LOCUS.

Therefore, since the abstract explicitly states the release of these datasets as part of their contribution and they use them in experiments, this paper does introduce a newly created dataset.
</think>

The paper "LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy" introduces new datasets for evaluating its proposed method. The authors mention releasing two datasets (DCASE and LargeSet) and the DCASE 2023 Challenge Set v1.0, which are used to validate their system's performance.

```json
{
    "label": "new_dataset",
    "explanation": "The paper explicitly mentions the release of new datasets (DCASE, LargeSet, and DCASE 2023 Challenge Set v1.0) for evaluating the proposed method and demonstrates its effectiveness using these datasets."
}
```
---

While this is definitely a vibes-based assessment, it does seem like the model does better when it has room to think, so we’ll proceed with this approach.

Note

There are ways to allow for both structured generation and reasoning. I’ll post more on that in the future!

We’ll now write a function to extract the JSON from the model’s output.

Code
import contextlib
import re
import json

JSON_PATTERN = re.compile(r"```json\n(.*?)```", re.DOTALL)
DIRECT_JSON_PATTERN = re.compile(r"\{[^}]*\}", re.DOTALL)


def try_extract_json_from_text(text: str) -> tuple[str, dict | None]:
    if match := JSON_PATTERN.search(text):
        json_results = match.group(1)
        with contextlib.suppress(json.JSONDecodeError):
            return text, json.loads(json_results)
    if match := DIRECT_JSON_PATTERN.search(text):
        json_text = match.group(0)
        with contextlib.suppress(json.JSONDecodeError):
            return text, json.loads(json_text)
    return text, None
prediction = predict_label_without_structured_output(examples[0])
try_extract_json_from_text(prediction)
('<think>\nOkay, so I\'m trying to figure out whether the paper titled "An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data" introduces a newly created dataset. Let me go through this step by step.\n\nFirst, I look at the title. The title mentions an algorithm called fuzzy k-means that\'s been modified with entropy terms and feature weighting to handle high-dimensional data. It doesn\'t explicitly say anything about introducing new datasets, but it does focus on improving clustering in such data environments, which often involves dealing with irrelevant or less important features.\n\nNow, looking at the abstract. The paper discusses modifying the fuzzy k-means algorithm by incorporating two entropy terms: one for within-cluster dispersion and negative entropy to determine clusters, and another to control feature weights because different features have varying contributions. They compare their method\'s efficacy using various clustering measures on multiple datasets against state-of-the-art methods.\n\nHmm, so in both the title and abstract, I don\'t see any mention of a new dataset being created or introduced. The focus is more on improving an existing algorithm to better handle high-dimensional data rather than introducing entirely new data for analysis.\n\nI should also consider whether it\'s possible that the paper might use standard datasets without explicitly stating so. For example, many clustering algorithms are tested on common datasets like MNIST, CIFAR-10, etc., especially when dealing with high-dimensional data because these datasets have numerous features after preprocessing (like PCA or similar techniques). However, since the abstract doesn\'t specify which datasets they used or mention any novel datasets, it\'s more about comparing performance rather than introducing a new dataset.\n\nSo, putting this together: The paper presents an improved clustering algorithm but doesn\'t introduce a new dataset. It evaluates its performance on existing ones, hence likely "no_new_dataset."\n</think>\n\nThe paper discusses an enhanced fuzzy k-means algorithm designed for high-dimensional data, focusing on improving the clustering process by incorporating entropy terms and feature weighting. While it addresses challenges in handling irrelevant or less significant features, there is no mention of introducing a new dataset. Instead, it evaluates its method against existing datasets using various metrics.\n\n```json\n{\n    "label": "no_new_dataset",\n    "explanation": "The paper does not introduce a newly created dataset; instead, it focuses on improving an algorithm to handle high-dimensional data by modifying the objective function with entropy terms and feature weighting. It evaluates its performance on existing datasets without introducing new ones."\n}\n```',
 {'label': 'no_new_dataset',
  'explanation': 'The paper does not introduce a newly created dataset; instead, it focuses on improving an algorithm to handle high-dimensional data by modifying the objective function with entropy terms and feature weighting. It evaluates its performance on existing datasets without introducing new ones.'})

Let’s see how well this works on all the examples we had before.

results = [predict_label_without_structured_output(example) for example in examples]
parsed_results = [try_extract_json_from_text(result) for result in results]
[p for p in parsed_results if p[1] is None]
[]

In this case, every response contained a valid JSON object, which is why we get back an empty list of failed parses.

We might miss a few examples that don’t produce a valid JSON object when running over the full dataset, but let’s proceed with this approach since the model does much better when given room to reason.
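
If we’d rather recover some of those failed parses than drop them, one simple fallback is to look for the label strings directly in the raw output. Here is a minimal sketch (the label_from_text helper is hypothetical and isn’t used later in this post):

def label_from_text(text: str) -> str | None:
    """Fallback label extraction (sketch): scan the raw model output for a label string."""
    # Check the longer string first so "no_new_dataset" isn't misread as "new_dataset"
    if "no_new_dataset" in text:
        return "no_new_dataset"
    if "new_dataset" in text:
        return "new_dataset"
    return None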

We’ll now use the hosted version of the model to generate labels for a larger sample of the dataset. For this version, we’ll use a dedicated Hugging Face inference endpoint, but if we wanted to use the full R1 model, we could use the new Inference Providers feature on the Hub. See this blog post for more information.
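
As a rough illustration of that alternative, here is what calling the full R1 model through Inference Providers might look like with the huggingface_hub client. This is a sketch under assumptions: it assumes a recent huggingface_hub release with provider support, and the provider name and model availability are things you’d want to check on the model page.

from huggingface_hub import InferenceClient

# Sketch: route the chat completion through an Inference Provider
# (the "together" provider name is an assumption; check the Hub for availability)
providers_client = InferenceClient(provider="together", token=os.getenv("HF_TOKEN"))
completion = providers_client.chat_completion(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": format_text_as_prompt(examples[0])}],
    temperature=0.7,
)
print(completion.choices[0].message.content)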

Code
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url="https://tgtdz7g5h3sd1lov.us-east-1.aws.endpoints.huggingface.cloud/v1/",
    api_key=os.getenv("HF_TOKEN"),
)
rich_print(
    predict_label_without_structured_output(
        examples[0], model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", client=client
    )
)
<think>
Alright, I'm looking at this paper to determine if it introduces a newly created dataset. The title mentions 
"Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models." The word 
"dataset" isn't in the title, but the abstract gives more details.

In the abstract, the authors talk about introducing NuInstruct, which they describe as a novel dataset. It has 91K 
multi-view video-QA pairs across 17 subtasks. This indicates that they've created a new collection of data 
specifically for their research. They also mention a SQL-based method for generating instruction-response pairs 
automatically, which suggests they developed a systematic approach to build this dataset.

Furthermore, the paper introduces a new method called BEV-InMLLM, which uses this dataset. They report experiments 
on NuInstruct showing improvements, and they plan to release it for future research. This release plan is another 
indicator that they've created a new dataset intended for broader use.

Putting it all together, the paper clearly states the creation of NuInstruct, its characteristics, and their 
intention to share it. Therefore, it's introducing a new dataset.
</think>

```json
{
    "label": "new_dataset",
    "explanation": "The paper explicitly introduces NuInstruct, a novel dataset with 91K multi-view video-QA pairs 
across 17 subtasks. The authors describe its creation method and plan to release it for future research, clearly 
indicating the introduction of a new dataset."
}
```

We’ll now sample 3000 examples from the dataset and use the hosted model to generate labels for them.

sample_df = df.sample(3000, seed=42)
examples = sample_df.select(pl.col(["abstract", "title"])).to_dicts()

We create a function to predict the labels using the hosted model. We’ll use the stamina library to retry the request if it fails.

Code
import stamina
from openai import APIConnectionError, APIStatusError


@stamina.retry(on=(APIConnectionError, APIStatusError), attempts=3)
def predict_hf_endpoint(data: dict[str, str], model: str = "tgi", client=client):
    return predict_label_without_structured_output(data, model, client)


def predict(data):
    try:
        return predict_hf_endpoint(data)
    except Exception as e:
        print(e)
        return None

Get the results from the model.

from tqdm.contrib.concurrent import thread_map

results = thread_map(predict, examples, max_workers=5)

Let’s take a look at the first result.

rich_print(results[0])
<think>
Okay, so I need to figure out if the given arXiv paper introduces a newly created dataset. Let's look at the title 
and abstract carefully.

The title is "Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models." That
immediately suggests it's about a dataset related to autonomous driving. The abstract mentions that the paper 
introduces a dataset called NuInstruct, which has 91K multi-view video-QA pairs across 17 subtasks. Each task 
requires holistic information like temporal, multi-view, and spatial data, making the challenges higher.

The authors propose a method using SQL to generate instruction-response pairs automatically, inspired by human 
driving logic. They also introduce BEV-InMLLM, an end-to-end method that enhances large language models by 
integrating BEV features, language alignment, and tasks like multi-view, spatial awareness, and temporal semantics.
They note that their BEV injection module is plug-and-play for existing MLLMs.

Experiments on NuInstruct show significant improvements over existing MLLMs. The authors also mention releasing the
dataset for future research.

So, from the title, abstract, and details, it's clear that NuInstruct is a new dataset they created specifically 
for their research. They describe it in detail, including the structure and methods used, so it's definitely a 
newly created dataset aimed at advancing autonomous driving through language models.
</think>

```json
{
    "label": "new_dataset",
    "explanation": "The paper introduces a dataset called NuInstruct with 91K multi-view video-QA pairs across 17 
subtasks, each requiring holistic information for robust autonomous driving tasks. The authors detail its structure
and methods for creation, confirming it as a newly developed dataset."
}
```
try_extract_json_from_text(results[0])
('<think>\nOkay, so I need to figure out if the given arXiv paper introduces a newly created dataset. Let\'s look at the title and abstract carefully.\n\nThe title is "Holistic Autonomous Driving Understanding by Bird\'s-Eye-View Injected Multi-Modal Large Models." That immediately suggests it\'s about a dataset related to autonomous driving. The abstract mentions that the paper introduces a dataset called NuInstruct, which has 91K multi-view video-QA pairs across 17 subtasks. Each task requires holistic information like temporal, multi-view, and spatial data, making the challenges higher.\n\nThe authors propose a method using SQL to generate instruction-response pairs automatically, inspired by human driving logic. They also introduce BEV-InMLLM, an end-to-end method that enhances large language models by integrating BEV features, language alignment, and tasks like multi-view, spatial awareness, and temporal semantics. They note that their BEV injection module is plug-and-play for existing MLLMs.\n\nExperiments on NuInstruct show significant improvements over existing MLLMs. The authors also mention releasing the dataset for future research.\n\nSo, from the title, abstract, and details, it\'s clear that NuInstruct is a new dataset they created specifically for their research. They describe it in detail, including the structure and methods used, so it\'s definitely a newly created dataset aimed at advancing autonomous driving through language models.\n</think>\n\n```json\n{\n    "label": "new_dataset",\n    "explanation": "The paper introduces a dataset called NuInstruct with 91K multi-view video-QA pairs across 17 subtasks, each requiring holistic information for robust autonomous driving tasks. The authors detail its structure and methods for creation, confirming it as a newly developed dataset."\n}\n```',
 {'label': 'new_dataset',
  'explanation': 'The paper introduces a dataset called NuInstruct with 91K multi-view video-QA pairs across 17 subtasks, each requiring holistic information for robust autonomous driving tasks. The authors detail its structure and methods for creation, confirming it as a newly developed dataset.'})

We’ll do a bit of cleaning up of the results to get them in a format we can add to our existing dataframe.

Code
parsed_results = [try_extract_json_from_text(result) for result in results]
parsed_results[:3]
[('<think>\nOkay, so I need to figure out if the given arXiv paper introduces a newly created dataset. Let\'s look at the title and abstract carefully.\n\nThe title is "Holistic Autonomous Driving Understanding by Bird\'s-Eye-View Injected Multi-Modal Large Models." That immediately suggests it\'s about a dataset related to autonomous driving. The abstract mentions that the paper introduces a dataset called NuInstruct, which has 91K multi-view video-QA pairs across 17 subtasks. Each task requires holistic information like temporal, multi-view, and spatial data, making the challenges higher.\n\nThe authors propose a method using SQL to generate instruction-response pairs automatically, inspired by human driving logic. They also introduce BEV-InMLLM, an end-to-end method that enhances large language models by integrating BEV features, language alignment, and tasks like multi-view, spatial awareness, and temporal semantics. They note that their BEV injection module is plug-and-play for existing MLLMs.\n\nExperiments on NuInstruct show significant improvements over existing MLLMs. The authors also mention releasing the dataset for future research.\n\nSo, from the title, abstract, and details, it\'s clear that NuInstruct is a new dataset they created specifically for their research. They describe it in detail, including the structure and methods used, so it\'s definitely a newly created dataset aimed at advancing autonomous driving through language models.\n</think>\n\n```json\n{\n    "label": "new_dataset",\n    "explanation": "The paper introduces a dataset called NuInstruct with 91K multi-view video-QA pairs across 17 subtasks, each requiring holistic information for robust autonomous driving tasks. The authors detail its structure and methods for creation, confirming it as a newly developed dataset."\n}\n```',
  {'label': 'new_dataset',
   'explanation': 'The paper introduces a dataset called NuInstruct with 91K multi-view video-QA pairs across 17 subtasks, each requiring holistic information for robust autonomous driving tasks. The authors detail its structure and methods for creation, confirming it as a newly developed dataset.'}),
 ('<think>\nOkay, so I need to figure out whether the paper "BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis" introduces a newly created dataset. Let me start by reading the title and abstract carefully.\n\nThe title mentions "BRACE: The Breakdancing Competition Dataset." That suggests it\'s introducing a dataset named BRACE. The abstract goes into more detail about what the dataset is. It says that generative models for audio-conditioned dance motion synthesis are trained, but these models rely on certain assumptions, like strong music-dance correlation and controlled motion data. The paper points out that existing datasets have limitations and introduces BRACE to challenge these assumptions by providing complex human poses, specifically focusing on breakdancing which includes acrobatic moves and tangled postures.\n\nThe authors mention that they used data from the Red Bull BC One competition videos. They faced challenges like estimating human keypoints due to the complexity of the dance and multiple cameras. To address this, they used a hybrid labeling pipeline combining deep estimation models and manual annotations to get high-quality keypoint sequences. The result is a dataset with over 3 hours and 30 minutes of densely annotated poses. They tested state-of-the-art methods and found their limitations with complex sequences, arguing that their dataset can advance dance motion synthesis by forcing models to reason more about body structure and movements.\n\nSo, putting this together, the paper is introducing a new dataset called BRACE. They created it by compiling dynamic dance sequences from Red Bull BC One videos, faced challenges in data annotation, and developed a hybrid approach to overcome these issues. They then used this dataset to evaluate existing methods and highlight its potential to push the boundaries of dance motion synthesis.\n\nI don\'t see any indication that this dataset isn\'t new. The abstract clearly states the creation of BRACE, describes the data sources and processing methods, and the purpose of the dataset. Therefore, it\'s clear that the authors are introducing this dataset as part of their contribution.\n</think>\n\nThe paper introduces a new dataset called BRACE, which is specifically designed for dance motion synthesis. The authors detail the creation of this dataset, including the challenges faced and the methods used to compile and process the data. Therefore, the paper clearly introduces a newly created dataset.\n\n```json\n{\n    "label": "new_dataset",\n    "explanation": "The paper introduces a new dataset named BRACE, detailing its creation, challenges faced, and the purpose of providing complex dance sequences to advance motion synthesis."\n}\n```',
  {'label': 'new_dataset',
   'explanation': 'The paper introduces a new dataset named BRACE, detailing its creation, challenges faced, and the purpose of providing complex dance sequences to advance motion synthesis.'}),
 ('<think>\nOkay, I need to determine whether the arXiv paper titled "RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports" introduces a newly created dataset. Let me start by reading through the title and abstract carefully to extract relevant information.\n\nThe title mentions that RET-CLIP is a model pre-trained using clinical diagnostic reports. The abstract elaborates that RET-CLIP was trained on a dataset of 193,865 patients. It highlights that the model focuses on extracting features from color fundus photographs (CFPs) and uses a tripartite optimization strategy covering the left eye, right eye, and patient level.\n\nNow, the key point is whether this dataset of 193,865 patients is newly created or if it\'s referring to an existing dataset. The abstract doesn\'t explicitly state that this dataset is new; it simply describes it as a dataset used for training. The focus is on the model\'s architecture and training strategy rather than the dataset\'s origin.\n\nTypically, when a paper mentions training a model on a dataset, they might reference an existing one unless they specify that it\'s newly collected. Since the abstract doesn\'t provide details about the dataset\'s origin, such as whether it\'s publicly available, if it\'s proprietary, or if it\'s a new collection, it\'s safer to assume that the dataset might not be newly created. However, the exact nature of the dataset isn\'t clarified, so without explicit information, it\'s hard to confirm if it\'s new.\n\nBut considering the context, the paper is about medical imaging and clinical reports, which are fields where large datasets are often publicly available or shared. The mention of 193,865 patients could imply a significant dataset, possibly derived from existing public resources. Therefore, it\'s plausible that the dataset isn\'t newly created but rather an aggregation or expansion of existing data.\n\nIn conclusion, there\'s insufficient information to confirm that a new dataset was introduced, so the label should be "no_new_dataset".\n</think>\n\n```json\n{\n    "label": "no_new_dataset",\n    "explanation": "The paper describes a dataset of 193,865 patients used to train the RET-CLIP model but does not explicitly state that this dataset is newly created. It is possible that the dataset is derived from existing public resources or aggregated data in the medical field."\n}\n```',
  {'label': 'no_new_dataset',
   'explanation': 'The paper describes a dataset of 193,865 patients used to train the RET-CLIP model but does not explicitly state that this dataset is newly created. It is possible that the dataset is derived from existing public resources or aggregated data in the medical field.'})]
Code
labels_and_explanations = [
    (result[1].get("label"), result[1].get("explanation"))
    if result[1] is not None and isinstance(result[1], dict)
    else (None, None)
    for result in parsed_results
]

# Unzip the list of tuples into separate lists
labels, explanations = zip(*labels_and_explanations)
labels = list(labels)
explanations = list(explanations)
sample_df = sample_df.with_columns(
    pl.Series(labels).alias("labels"),
    pl.Series(explanations).alias("explanations"),
)
sample_df.head(1)
shape: (1, 16)
id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed labels explanations
str str str str str str str str str str str list[struct[2]] datetime[ms] list[list[str]] str str
"2401.00988" "Xinpeng Ding" "Xinpeng Ding and Jinahua Han a… "Holistic Autonomous Driving Un… null null null null "cs.CV" "http://arxiv.org/licenses/none… "  The rise of multimodal large… [{"v1","Tue, 2 Jan 2024 01:54:22 GMT"}] 2024-01-03 00:00:00 [["Ding", "Xinpeng", ""], ["Han", "Jinahua", ""], … ["Li", "Xiaomeng", ""]] "new_dataset" "The paper introduces a dataset…

Let’s take a look at the distribution of the labels.

sample_df.select(pl.col("labels").value_counts()).unnest("labels")
shape: (3, 2)
labels count
str u32
"new_dataset" 648
"no_new_dataset" 2350
null 2

We only get a couple of examples (the null rows) where the output doesn’t parse into the labels we want. We can filter these out.

sample_df = sample_df.filter(pl.col("labels").is_in(["new_dataset", "no_new_dataset"]))

We’ll now convert the dataframe to a Hugging Face dataset and push it to the Hub.

Code
from datasets import Dataset, Features, Value, ClassLabel
ds = Dataset.from_polars(
    sample_df.select(["id", "title", "abstract", "labels", "explanations"]),
)
large_string_columns = [
    k
    for k, v in ds.features.items()
    if isinstance(v, Value) and v.dtype == "large_string"
]
for column in large_string_columns:
    ds = ds.cast_column(column, Value("string"))
ds = ds.cast_column("labels", ClassLabel(names=["new_dataset", "no_new_dataset"]))
ds.push_to_hub("davanstrien/arxiv-new-datasets", token=os.getenv("HF_TOKEN"))

Here is the resulting dataset.

Fine tuning ModernBERT

Since the focus of this blog post is on the data generation part, I won’t go into too much detail here, but you can see the code and the final results below.

Code
%pip install datasets setfit transformers accelerate --upgrade
%pip install flash-attn --no-build-isolation

from datasets import load_dataset
from evaluate import load
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)
import numpy as np

# Load data
ds = load_dataset("davanstrien/arxiv-new-datasets", split="train")

# label info
labels = ds.features["labels"].names
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# prep a text column combining title and abstract
ds = ds.map(lambda x: {"text": x["title"] + " " + x["abstract"]})
ds = ds.train_test_split(test_size=0.2, stratify_by_column="labels")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")


# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)


# Tokenize datasets
tokenized_datasets = ds.map(tokenize_function, batched=True)

# Load metrics
accuracy = load("accuracy")
f1 = load("f1")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(
        predictions=predictions, references=labels, average="weighted"
    )

    return {
        "accuracy": accuracy_score["accuracy"],
        "f1": f1_score["f1"],
    }


# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    label2id=label2id,
    id2label=id2label,
)

# Define improved training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=3e-5,  # Slightly higher initial learning rate
    per_device_train_batch_size=8,  # Reduced batch size
    per_device_eval_batch_size=64,
    num_train_epochs=20,  # Reduced epochs
    # Learning rate schedule
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    # Evaluation and saving
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    # Regularization
    weight_decay=0.01,
    max_grad_norm=1.0,
    label_smoothing_factor=0.1,
    # Logging
    logging_dir="./logs",
    logging_strategy="epoch",
)

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize Trainer with early stopping
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.001)
    ],
)
# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print("\nFinal evaluation results:", eval_results)

# Save the best model
trainer.save_model("./best_model")
Final evaluation results: {'eval_loss': 0.32631951570510864, 'eval_accuracy': 0.945, 'eval_f1': 0.9442747450661002, 'eval_runtime': 5.8106, 'eval_samples_per_second': 103.26, 'eval_steps_per_second': 1.721, 'epoch': 10.0}
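
To sanity-check the classifier, we can load the saved model back and run it on a new title and abstract. Here is a minimal sketch using the transformers pipeline, assuming the ./best_model directory saved by the training run above:

from transformers import pipeline

# Sketch: load the fine-tuned ModernBERT classifier and score a new paper
classifier = pipeline("text-classification", model="./best_model")
text = (
    "LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy "
    "We release a 50-hour multichannel dataset to support further research in SSL."
)
print(classifier(text))  # a list with the predicted label and a confidence score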

Conclusion

In this blog post, we’ve seen how we can use the reasoning abilities of LLMs to generate labels for training effective classifiers. Since a lack of training data is one of the main reasons people reach for an LLM over a fine-tuned model, using an LLM’s reasoning ability to bootstrap that training data is a great way to get the best of both worlds.