Efficient batch inference for LLMs with vLLM + UV Scripts on HF Jobs

huggingface
uv-scripts
vllm
hf-jobs
Generate responses for thousands of dataset prompts using Qwen/Qwen3-30B-A3B-Instruct-2507 across 4 GPUs with automatic prompt filtering and tensor parallelism
Author

Daniel van Strien

Recently, we launched HF Jobs, a new way to run jobs on the Hugging Face platform. This post will show you how to use it to run large language model inference jobs with vLLM and uv Scripts, processing thousands of prompts with models that don’t fit on a single GPU.

HF Jobs can be a very powerful tool for running a variety of compute jobs, but I think it’s particularly useful for batched LLM inference workloads. For these, it often makes sense to bring the data close to the model (removing the latency of transferring data via an API) and to let vLLM’s powerful automatic batching get the most out of your GPUs.

The Challenge

Large language models like Qwen3-30B-A3B-Instruct (30 billion parameters) exceed single GPU memory limits. Running batch inference on datasets requires:

- Multi-GPU setup with tensor parallelism
- Handling prompts that exceed context limits
- Managing dependencies and environment setup

Traditional approaches involve complex Docker setups, manual dependency management, and custom scripts for GPU coordination. HF Jobs changes this.

What is vLLM?

vLLM is a widely used inference engine, known for its ability to scale LLM inference. While we can use vLLM via an OpenAI-compatible API, it also has a powerful batch inference mode that lets us process large datasets of prompts efficiently. This “offline inference” mode is particularly useful when we want to generate responses for a large number of prompts without the overhead of API calls.
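
To make that concrete, here’s a minimal sketch of vLLM’s offline inference API, using the same LLM and SamplingParams objects that the script later in this post uses (the model ID and prompts here are just placeholders):

from vllm import LLM, SamplingParams

# Load the model once; vLLM manages GPU memory and batching internally
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Pass a whole list of prompts; vLLM schedules and batches them automatically
prompts = ["Write a haiku about GPUs.", "Explain tensor parallelism in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)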

What are uv Scripts?

UV scripts are Python scripts with inline dependency metadata that automatically install and manage their requirements. Instead of separate requirements.txt files or complex setup instructions, everything needed to run the script is declared in the script itself:

  # /// script
  # requires-python = ">=3.10"
  # dependencies = [
  #     "vllm",
  #     "transformers",
  #     "datasets",
  # ]
  # ///

  # Your Python code here

When you run uv run script.py, UV automatically creates an isolated environment, installs dependencies, and executes your code. No virtual env setup, no pip install commands, no version conflicts.

The Solution: UV Scripts + vLLM + HF Jobs

HF Jobs provides managed GPU infrastructure. This is already very useful on its own, but combined with uv Scripts we can distribute scripts for a variety of ML/AI tasks and run them in a (more) reproducible way.

The uv scripts Hugging Face org

Since I’m so excited about uv Scripts, I created a Hugging Face org to host them: uv-scripts. This org contains a variety of uv Scripts that you can run as jobs on HF Jobs. For this example we’ll use a script that runs inference for a model using vLLM. The script exposes a number of parameters that control how the inference is run, including the model to use, the number of GPUs, and the sampling parameters.

In this case, the script expects as input a dataset with a column containing the input prompts (as messages). It will then run inference on the model using vLLM and return the generated responses in a new dataset.
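
Concretely, each row of the input dataset needs a column (by default messages) containing a chat-style list of role/content dicts. A hypothetical row might look like this:

# A single row of the input dataset (the column name is configurable via --messages-column)
example_row = {
    "messages": [
        {"role": "user", "content": "Summarise this dataset card: ..."},
    ]
}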

I’m personally quite excited to see people sharing more uv scripts for things that are not complex enough to justify a full repository but that are still useful to share and run on the Hugging Face platform!

If you are curious, you can check out the script here or below:

Code
import requests

print(requests.get("https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py").text)
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets",
#     "flashinfer-python",
#     "huggingface-hub[hf_transfer]",
#     "torch",
#     "transformers",
#     "vllm>=0.8.5",
# ]
#
# ///
"""
Generate responses for prompts in a dataset using vLLM for efficient GPU inference.

This script loads a dataset from Hugging Face Hub containing chat-formatted messages,
applies the model's chat template, generates responses using vLLM, and saves the
results back to the Hub with a comprehensive dataset card.

Example usage:
    # Local execution with auto GPU detection
    uv run generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --messages-column messages

    # With custom model and sampling parameters
    uv run generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --model-id meta-llama/Llama-3.1-8B-Instruct \\
        --temperature 0.9 \\
        --top-p 0.95 \\
        --max-tokens 2048

    # HF Jobs execution (see script output for full command)
    hf jobs uv run --flavor a100x4 ...
"""

import argparse
import logging
import os
import sys
from datetime import datetime
from typing import Optional

from datasets import load_dataset
from huggingface_hub import DatasetCard, get_token, login
from torch import cuda
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Enable HF Transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def check_gpu_availability() -> int:
    """Check if CUDA is available and return the number of GPUs."""
    if not cuda.is_available():
        logger.error("CUDA is not available. This script requires a GPU.")
        logger.error(
            "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
        )
        sys.exit(1)

    num_gpus = cuda.device_count()
    for i in range(num_gpus):
        gpu_name = cuda.get_device_name(i)
        gpu_memory = cuda.get_device_properties(i).total_memory / 1024**3
        logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")

    return num_gpus


def create_dataset_card(
    source_dataset: str,
    model_id: str,
    messages_column: str,
    sampling_params: SamplingParams,
    tensor_parallel_size: int,
    num_examples: int,
    generation_time: str,
    num_skipped: int = 0,
    max_model_len_used: Optional[int] = None,
) -> str:
    """Create a comprehensive dataset card documenting the generation process."""
    filtering_section = ""
    if num_skipped > 0:
        skip_percentage = (num_skipped / num_examples) * 100
        processed = num_examples - num_skipped
        filtering_section = f"""

### Filtering Statistics

- **Total Examples**: {num_examples:,}
- **Processed**: {processed:,} ({100 - skip_percentage:.1f}%)
- **Skipped (too long)**: {num_skipped:,} ({skip_percentage:.1f}%)
- **Max Model Length Used**: {max_model_len_used:,} tokens

Note: Prompts exceeding the maximum model length were skipped and have empty responses."""

    return f"""---
tags:
- generated
- vllm
- uv-script
---

# Generated Responses Dataset

This dataset contains generated responses for prompts from [{source_dataset}](https://huggingface.co/datasets/{source_dataset}).

## Generation Details

- **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
- **Messages Column**: `{messages_column}`
- **Model**: [{model_id}](https://huggingface.co/{model_id})
- **Number of Examples**: {num_examples:,}
- **Generation Date**: {generation_time}{filtering_section}

### Sampling Parameters

- **Temperature**: {sampling_params.temperature}
- **Top P**: {sampling_params.top_p}
- **Top K**: {sampling_params.top_k}
- **Min P**: {sampling_params.min_p}
- **Max Tokens**: {sampling_params.max_tokens}
- **Repetition Penalty**: {sampling_params.repetition_penalty}

### Hardware Configuration

- **Tensor Parallel Size**: {tensor_parallel_size}
- **GPU Configuration**: {tensor_parallel_size} GPU(s)

## Dataset Structure

The dataset contains all columns from the source dataset plus:
- `response`: The generated response from the model

## Generation Script

Generated using the vLLM inference script from [uv-scripts/vllm](https://huggingface.co/datasets/uv-scripts/vllm).

To reproduce this generation:

```bash
uv run https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
    {source_dataset} \\
    <output-dataset> \\
    --model-id {model_id} \\
    --messages-column {messages_column} \\
    --temperature {sampling_params.temperature} \\
    --top-p {sampling_params.top_p} \\
    --top-k {sampling_params.top_k} \\
    --max-tokens {sampling_params.max_tokens}{f" \\\\\\n    --max-model-len {max_model_len_used}" if max_model_len_used else ""}
```
"""


def main(
    src_dataset_hub_id: str,
    output_dataset_hub_id: str,
    model_id: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages_column: str = "messages",
    output_column: str = "response",
    temperature: float = 0.7,
    top_p: float = 0.8,
    top_k: int = 20,
    min_p: float = 0.0,
    max_tokens: int = 16384,
    repetition_penalty: float = 1.0,
    gpu_memory_utilization: float = 0.90,
    max_model_len: Optional[int] = None,
    tensor_parallel_size: Optional[int] = None,
    skip_long_prompts: bool = True,
    hf_token: Optional[str] = None,
):
    """
    Main generation pipeline.

    Args:
        src_dataset_hub_id: Input dataset on Hugging Face Hub
        output_dataset_hub_id: Where to save results on Hugging Face Hub
        model_id: Hugging Face model ID for generation
        messages_column: Column name containing chat messages
        output_column: Column name for generated responses
        temperature: Sampling temperature
        top_p: Top-p sampling parameter
        top_k: Top-k sampling parameter
        min_p: Minimum probability threshold
        max_tokens: Maximum tokens to generate
        repetition_penalty: Repetition penalty parameter
        gpu_memory_utilization: GPU memory utilization factor
        max_model_len: Maximum model context length (None uses model default)
        tensor_parallel_size: Number of GPUs to use (auto-detect if None)
        skip_long_prompts: Skip prompts exceeding max_model_len instead of failing
        hf_token: Hugging Face authentication token
    """
    generation_start_time = datetime.now().isoformat()

    # GPU check and configuration
    num_gpus = check_gpu_availability()
    if tensor_parallel_size is None:
        tensor_parallel_size = num_gpus
        logger.info(
            f"Auto-detected {num_gpus} GPU(s), using tensor_parallel_size={tensor_parallel_size}"
        )
    else:
        logger.info(f"Using specified tensor_parallel_size={tensor_parallel_size}")
        if tensor_parallel_size > num_gpus:
            logger.warning(
                f"Requested {tensor_parallel_size} GPUs but only {num_gpus} available"
            )

    # Authentication - try multiple methods
    HF_TOKEN = hf_token or os.environ.get("HF_TOKEN") or get_token()

    if not HF_TOKEN:
        logger.error("No HuggingFace token found. Please provide token via:")
        logger.error("  1. --hf-token argument")
        logger.error("  2. HF_TOKEN environment variable")
        logger.error("  3. Run 'huggingface-cli login' or use login() in Python")
        sys.exit(1)

    logger.info("HuggingFace token found, authenticating...")
    login(token=HF_TOKEN)

    # Initialize vLLM
    logger.info(f"Loading model: {model_id}")
    vllm_kwargs = {
        "model": model_id,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    if max_model_len is not None:
        vllm_kwargs["max_model_len"] = max_model_len
        logger.info(f"Using max_model_len={max_model_len}")

    llm = LLM(**vllm_kwargs)

    # Load tokenizer for chat template
    logger.info("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Create sampling parameters
    sampling_params = SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        min_p=min_p,
        max_tokens=max_tokens,
        repetition_penalty=repetition_penalty,
    )

    # Load dataset
    logger.info(f"Loading dataset: {src_dataset_hub_id}")
    dataset = load_dataset(src_dataset_hub_id, split="train")
    total_examples = len(dataset)
    logger.info(f"Dataset loaded with {total_examples:,} examples")

    # Validate messages column
    if messages_column not in dataset.column_names:
        logger.error(
            f"Column '{messages_column}' not found. Available columns: {dataset.column_names}"
        )
        sys.exit(1)

    # Get effective max length for filtering
    if max_model_len is not None:
        effective_max_len = max_model_len
    else:
        # Get model's default max length
        effective_max_len = llm.llm_engine.model_config.max_model_len
    logger.info(f"Using effective max model length: {effective_max_len}")

    # Process messages and apply chat template
    logger.info("Applying chat template to messages...")
    all_prompts = []
    valid_prompts = []
    valid_indices = []
    skipped_info = []

    for i, example in enumerate(tqdm(dataset, desc="Processing messages")):
        messages = example[messages_column]
        # Apply chat template
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        all_prompts.append(prompt)

        # Count tokens if filtering is enabled
        if skip_long_prompts:
            tokens = tokenizer.encode(prompt)
            if len(tokens) <= effective_max_len:
                valid_prompts.append(prompt)
                valid_indices.append(i)
            else:
                skipped_info.append((i, len(tokens)))
        else:
            valid_prompts.append(prompt)
            valid_indices.append(i)

    # Log filtering results
    if skip_long_prompts and skipped_info:
        logger.warning(
            f"Skipped {len(skipped_info)} prompts that exceed max_model_len ({effective_max_len} tokens)"
        )
        logger.info("Skipped prompt details (first 10):")
        for idx, (prompt_idx, token_count) in enumerate(skipped_info[:10]):
            logger.info(
                f"  - Example {prompt_idx}: {token_count} tokens (exceeds by {token_count - effective_max_len})"
            )
        if len(skipped_info) > 10:
            logger.info(f"  ... and {len(skipped_info) - 10} more")

        skip_percentage = (len(skipped_info) / total_examples) * 100
        if skip_percentage > 10:
            logger.warning(f"WARNING: {skip_percentage:.1f}% of prompts were skipped!")

    if not valid_prompts:
        logger.error("No valid prompts to process after filtering!")
        sys.exit(1)

    # Generate responses - vLLM handles batching internally
    logger.info(f"Starting generation for {len(valid_prompts):,} valid prompts...")
    logger.info("vLLM will handle batching and scheduling automatically")

    outputs = llm.generate(valid_prompts, sampling_params)

    # Extract generated text and create full response list
    logger.info("Extracting generated responses...")
    responses = [""] * total_examples  # Initialize with empty strings

    for idx, output in enumerate(outputs):
        original_idx = valid_indices[idx]
        response = output.outputs[0].text.strip()
        responses[original_idx] = response

    # Add responses to dataset
    logger.info("Adding responses to dataset...")
    dataset = dataset.add_column(output_column, responses)

    # Create dataset card
    logger.info("Creating dataset card...")
    card_content = create_dataset_card(
        source_dataset=src_dataset_hub_id,
        model_id=model_id,
        messages_column=messages_column,
        sampling_params=sampling_params,
        tensor_parallel_size=tensor_parallel_size,
        num_examples=total_examples,
        generation_time=generation_start_time,
        num_skipped=len(skipped_info) if skip_long_prompts else 0,
        max_model_len_used=effective_max_len if skip_long_prompts else None,
    )

    # Push dataset to hub
    logger.info(f"Pushing dataset to: {output_dataset_hub_id}")
    dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)

    # Push dataset card
    card = DatasetCard(card_content)
    card.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)

    logger.info("✅ Generation complete!")
    logger.info(
        f"Dataset available at: https://huggingface.co/datasets/{output_dataset_hub_id}"
    )


if __name__ == "__main__":
    if len(sys.argv) > 1:
        parser = argparse.ArgumentParser(
            description="Generate responses for dataset prompts using vLLM",
            formatter_class=argparse.RawDescriptionHelpFormatter,
            epilog="""
Examples:
  # Basic usage with default Qwen model
  uv run generate-responses.py input-dataset output-dataset
  
  # With custom model and parameters
  uv run generate-responses.py input-dataset output-dataset \\
    --model-id meta-llama/Llama-3.1-8B-Instruct \\
    --temperature 0.9 \\
    --max-tokens 2048
  
  # Force specific GPU configuration
  uv run generate-responses.py input-dataset output-dataset \\
    --tensor-parallel-size 2 \\
    --gpu-memory-utilization 0.95
  
  # Using environment variable for token
  HF_TOKEN=hf_xxx uv run generate-responses.py input-dataset output-dataset
            """,
        )

        parser.add_argument(
            "src_dataset_hub_id",
            help="Input dataset on Hugging Face Hub (e.g., username/dataset-name)",
        )
        parser.add_argument(
            "output_dataset_hub_id", help="Output dataset name on Hugging Face Hub"
        )
        parser.add_argument(
            "--model-id",
            type=str,
            default="Qwen/Qwen3-30B-A3B-Instruct-2507",
            help="Model to use for generation (default: Qwen3-30B-A3B-Instruct-2507)",
        )
        parser.add_argument(
            "--messages-column",
            type=str,
            default="messages",
            help="Column containing chat messages (default: messages)",
        )
        parser.add_argument(
            "--output-column",
            type=str,
            default="response",
            help="Column name for generated responses (default: response)",
        )
        parser.add_argument(
            "--temperature",
            type=float,
            default=0.7,
            help="Sampling temperature (default: 0.7)",
        )
        parser.add_argument(
            "--top-p",
            type=float,
            default=0.8,
            help="Top-p sampling parameter (default: 0.8)",
        )
        parser.add_argument(
            "--top-k",
            type=int,
            default=20,
            help="Top-k sampling parameter (default: 20)",
        )
        parser.add_argument(
            "--min-p",
            type=float,
            default=0.0,
            help="Minimum probability threshold (default: 0.0)",
        )
        parser.add_argument(
            "--max-tokens",
            type=int,
            default=16384,
            help="Maximum tokens to generate (default: 16384)",
        )
        parser.add_argument(
            "--repetition-penalty",
            type=float,
            default=1.0,
            help="Repetition penalty (default: 1.0)",
        )
        parser.add_argument(
            "--gpu-memory-utilization",
            type=float,
            default=0.90,
            help="GPU memory utilization factor (default: 0.90)",
        )
        parser.add_argument(
            "--max-model-len",
            type=int,
            help="Maximum model context length (default: model's default)",
        )
        parser.add_argument(
            "--tensor-parallel-size",
            type=int,
            help="Number of GPUs to use (default: auto-detect)",
        )
        parser.add_argument(
            "--hf-token",
            type=str,
            help="Hugging Face token (can also use HF_TOKEN env var)",
        )
        parser.add_argument(
            "--skip-long-prompts",
            action="store_true",
            default=True,
            help="Skip prompts that exceed max_model_len instead of failing (default: True)",
        )
        parser.add_argument(
            "--no-skip-long-prompts",
            dest="skip_long_prompts",
            action="store_false",
            help="Fail on prompts that exceed max_model_len",
        )

        args = parser.parse_args()

        main(
            src_dataset_hub_id=args.src_dataset_hub_id,
            output_dataset_hub_id=args.output_dataset_hub_id,
            model_id=args.model_id,
            messages_column=args.messages_column,
            output_column=args.output_column,
            temperature=args.temperature,
            top_p=args.top_p,
            top_k=args.top_k,
            min_p=args.min_p,
            max_tokens=args.max_tokens,
            repetition_penalty=args.repetition_penalty,
            gpu_memory_utilization=args.gpu_memory_utilization,
            max_model_len=args.max_model_len,
            tensor_parallel_size=args.tensor_parallel_size,
            skip_long_prompts=args.skip_long_prompts,
            hf_token=args.hf_token,
        )
    else:
        # Show HF Jobs example when run without arguments
        print("""
vLLM Response Generation Script
==============================

This script requires arguments. For usage information:
    uv run generate-responses.py --help

Example HF Jobs command with multi-GPU:
    # If you're logged in with huggingface-cli, token will be auto-detected
    hf jobs uv run \\
        --flavor l4x4 \\
        https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --messages-column messages \\
        --model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \\
        --temperature 0.7 \\
        --max-tokens 16384
        """)

An example: Running Qwen3-30B-A3B-Instruct to generate summaries of datasets from 2025

As an example, let’s run Qwen3-30B-A3B-Instruct to generate summaries of datasets from 2025. We’ll use the Jobs Python API (via huggingface_hub) to create a job that runs a uv Script on 4 GPUs with vLLM. First, we’ll quickly prepare the dataset and prompts, using Polars + datasets to load the dataset and filter it down to datasets created in 2025.

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="librarian-bots/dataset_cards_with_metadata",
    local_dir="data",
    repo_type="dataset",
    allow_patterns=["*.parquet"],
)
'/Users/davanstrien/Documents/daniel/blog/posts/2025/hf-jobs/data'

We’ll do some filtering to focus on datasets where the cards aren’t very short. We’ll also filter to datasets with more than two downloads and more than one like.

import polars as pl
df = pl.scan_parquet("data/**/*.parquet")
df.collect_schema()
Schema([('datasetId', String),
        ('author', String),
        ('last_modified', String),
        ('downloads', Int64),
        ('likes', Int64),
        ('tags', List(String)),
        ('task_categories', List(String)),
        ('createdAt', String),
        ('trending_score', Float64),
        ('card', String)])
df = df.filter(pl.col("card").str.len_chars() > 200)
df = df.filter(pl.col("downloads") > 2)
df = df.filter(pl.col("likes") > 1)

We convert the createdAt column to a datetime so we can filter by year.

df = df.with_columns(pl.col("createdAt").str.to_datetime())
from datetime import datetime

this_year = datetime.now().year
this_year
2025
df_2025 = df.filter(pl.col("createdAt").dt.year() == this_year)

Since we’re using the LazyFrame API, the query is only executed when we call the collect method, and Polars builds an optimized query plan for us. This is very nice since you can be quite lazy in how you filter and transform the data and Polars will optimize the query for you!

df_2025.show_graph(optimized=True, engine="streaming")
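
If you don’t want to render the graph (show_graph needs graphviz installed), the optimized plan can also be printed as text with LazyFrame’s explain method:

# Print the optimized query plan as text instead of rendering a graph
print(df_2025.explain(optimized=True))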

Polars and datasets play nicely together so we can easily convert between the two. Since we’ve done all the filtering we want, we can convert the Polars DataFrame to a Datasets Dataset.

from datasets import Dataset
ds = Dataset.from_polars(df_2025.collect())
ds
Dataset({
    features: ['datasetId', 'author', 'last_modified', 'downloads', 'likes', 'tags', 'task_categories', 'createdAt', 'trending_score', 'card'],
    num_rows: 2419
})

We’ll do one more filter to remove datasets whose card body is very short. We could also do this in Polars, but since the huggingface_hub library has a nice way of converting a string into a dataset card, where we can separate the YAML metadata from the main content, we’ll do it using the datasets library and the filter function.

from huggingface_hub import DatasetCard
# Keep rows whose card body (excluding the YAML metadata) is longer than min_length characters
def card_is_long_enough(row, min_length=200):
    card = DatasetCard(row['card']).text
    return len(card) > min_length
ds = ds.filter(card_is_long_enough, num_proc=4)

Preparing the prompts

Since the uv + vLLM script expects as input a dataset with a column of chat-formatted messages, we’ll use the map function to add a messages column we can use for inference. We’ll use the card field of the dataset to create a prompt that asks the model to summarize the dataset.

def format_prompt_for_card(row, max_length=8000):
    card = DatasetCard(row['card']).text
    datasetId = row['datasetId']
    return f"""You are a helpful assistant that provides concise summaries of dataset cards for datasets on the Hugging Face Hub.
The Hub ID of the dataset is: {datasetId}.
The dataset card is as follows:
{card[:max_length]}
Please write a one to two sentence summary of the dataset card.
The summary should be concise and informative, capturing the essence of the dataset.
The summary should be in English.
The goal of the summary is to provide a quick overview of the dataset's content and purpose. 
This summary will be used to help users quickly understand the dataset and as input for creating embeddings for the dataset card.
    """
print(format_prompt_for_card(ds[0]))
You are a helpful assistant that provides concise summaries of dataset cards for datasets on the Hugging Face Hub.
The Hub ID of the dataset is: agentlans/high-quality-multilingual-sentences.
The dataset card is as follows:
# High Quality Multilingual Sentences

- This dataset contains multilingual sentences derived from the [agentlans/LinguaNova](https://huggingface.co/datasets/agentlans/LinguaNova) dataset.
- It includes 1.58 million rows across 51 different languages, each in its own configuration.

Example row (from the `all` config):
```json
{
    "text": "امام جمعه اصفهان گفت: میزان نیاز آب شرب اصفهان ۱۱.۵ متر مکعب است که تمام استان اصفهان را پوشش میدهد و نسبت به قبل از انقلاب یکی از پیشرفتها در حوزه آب بوده است.",
    "fasttext": "fa",
    "gcld3": "fa"
}
```

Fields:
- **text**: The sentence in the original language.
- **fasttext**, **gcld3**: Language codes determined using fastText and gcld3 Python packages.

## Configurations

Each individual language is available as a separate configuration, such as `ar`, `en`. These configurations contain only sentences identified to be of that specific language by both the fastText and gcld3 models.

Example row (from a language-specific config):
```json
{
    "text": "Ne vienas asmuo yra apsaugotas nuo parazitų atsiradimo organizme."
}
```

## Methods

### Data Loading and Processing

The `all` split was downloaded from the [agentlans/LinguaNova](https://huggingface.co/datasets/agentlans/LinguaNova) dataset.
1. **Text Cleaning**: Raw text was cleaned by removing HTML tags, emails, emojis, hashtags, user handles, and URLs. Unicode characters and whitespace were normalized, and hyphenated words were handled to ensure consistency.
2. **Sentence Segmentation**: Text was segmented into individual sentences using ICU's `BreakIterator` class, which efficiently processed different languages and punctuation.
3. **Deduplication**: Duplicate entries were removed to maintain uniqueness and prevent redundancy in the dataset.

### Language Detection

Two methods were used for language identification:
1. **gcld3**: Google's Compact Language Detector 3 was used for fast and accurate language identification.
2. **fastText**: Facebook’s fastText model was employed, which improved accuracy by considering subword information.

### Quality Assessment

Text quality was assessed through batch inference using the [agentlans/multilingual-e5-small-aligned-quality](https://huggingface.co/agentlans/multilingual-e5-small-aligned-quality) model.
1. **Data Retrieval**: Entries with a quality score of 1 or higher and a minimum input length of 20 characters were retained.
2. **Text Refinement**: Leading punctuation and spaces were removed, and balanced quotation marks were validated using regular expressions.

### Dataset Configs

The filtered sentences and their annotated languages were written to the `all.jsonl` file. The file was then split into language-specific JSONL files, containing only those sentences that matched consistently with both gcld3 and fasttext in terms of language identification. Only languages with at least 100 sentences after filtering were included in these configs.

## Usage

### Loading the Dataset
```python
from datasets import load_dataset

dataset = load_dataset('agentlans/high-quality-multilingual-sentences', 'all')
```

For language-specific configurations:
```python
language_config = load_dataset('agentlans/high-quality-multilingual-sentences', 'en')  # Replace with desired language code.
```

### Example Usage in Python
```python
from datasets import load_dataset

# Load the dataset for all languages or a specific one
dataset_all = load_dataset("agentlans/high-quality-multilingual-sentences", "all")
print(dataset_all["train"][0])

language_config = load_dataset("agentlans/high-quality-multilingual-sentences", "en")  # Replace 'en' with desired language code.
print(language_config["train"][:5])
```

## Limitations

- **Multilingual content bias**: The quality classifier is biased towards educational and more formal content.
- **Language coverage**: Limited to the 50 written languages from LinguaNova. There's a lack of African and indigenous languages.
- **Short input issues**: Language identification accuracy can suffer when working with short inputs like single sentences.
- **Sentence segmentation challenges**: Some languages' delimiters might not be handled correctly.
- **Redundancy**: The filtering was only done on exact matches so some sentences may be similar (but not identical).

Additionally:
- **Thai data imbalance**: Fewer examples are available for `th` (Thai) than expected. Could be a sentence segmentation problem.
- **Malay and Indonesian**: There are few examples for the `ms` (Malay) subset. Consider also using the `id` (Indonesian) subset when training models.
- **Chinese written forms**: This dataset does not distinguish between different Chinese character variations.

## Licence

This dataset is released under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) licence, allowing for free use and distribution as long as proper attribution is given to the original source.
Please write a one to two sentence summary of the dataset card.
The summary should be concise and informative, capturing the essence of the dataset.
The summary should be in English.
The goal of the summary is to provide a quick overview of the dataset's content and purpose. 
This summary will be used to help users quickly understand the dataset and as input for creating embeddings for the dataset card.
    
def create_messages(row):
    return {"messages": [
        {
            "role": "user",
            "content": format_prompt_for_card(row),
        },
    ]}
ds = ds.map(create_messages, num_proc=4)
ds
Dataset({
    features: ['datasetId', 'author', 'last_modified', 'downloads', 'likes', 'tags', 'task_categories', 'createdAt', 'trending_score', 'card', 'messages'],
    num_rows: 2082
})

We remove columns we don’t need

ds = ds.remove_columns([c for c in ds.column_names if c not in ['messages', 'datasetId']])

And push to the Hub!

ds.push_to_hub("davanstrien/cards_with_prompts")
CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/cards_with_prompts/commit/8e32c041eba4fbf1729e3f5a4d1536365185f7d2', commit_message='Upload dataset', commit_description='', oid='8e32c041eba4fbf1729e3f5a4d1536365185f7d2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/davanstrien/cards_with_prompts', endpoint='https://huggingface.co', repo_type='dataset', repo_id='davanstrien/cards_with_prompts'), pr_revision=None, pr_num=None)
Note

Hugging Face recently moved most of its backend storage to Xet. The tl;dr is that datasets are deduplicated at a much more granular level, which makes working with datasets that change regularly much more efficient. See the Xet announcement for more details. Combined with Jobs, this could be a very powerful way of running jobs on datasets that change frequently.

Launching our job

We now have the dataset with the prompts we want to use for inference.

The interface for Jobs should look familiar if you’ve used Docker before. We can use Jobs via the CLI or the Python API. Via the CLI, a basic command to run a job looks like this:

hf jobs run python:3.12 python -c "print('Hello from the cloud!')"

There is also an experimental uv command that allows us to run uv scripts directly:

hf jobs uv run script-url

As an example, we can run another simple script from the uv scripts org and just print the help for the script:

hf jobs uv run https://huggingface.co/datasets/uv-scripts/deduplication/raw/main/semantic-dedupe.py --help
/Users/davanstrien/Library/Application Support/uv/tools/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/utils/_experimental.py:60: UserWarning: 'HfApi.run_uv_job' is experimental and might be subject to breaking changes in the future without prior notice. You can disable this warning by setting `HF_HUB_DISABLE_EXPERIMENTAL_WARNING=1` as environment variable.
  warnings.warn(
Job started with ID: 688a31096dcd97e42f8095e7
View at: https://huggingface.co/jobs/davanstrien/688a31096dcd97e42f8095e7
Downloading pygments (1.2MiB)
Downloading hf-xet (3.0MiB)
Downloading numpy (15.9MiB)
Downloading tokenizers (3.0MiB)
Downloading setuptools (1.1MiB)
Downloading aiohttp (1.6MiB)
Downloading pandas (11.4MiB)
Downloading pyarrow (40.8MiB)
Downloading usearch (2.0MiB)
Downloading hf-transfer (3.4MiB)
Downloading simsimd (1.0MiB)
 Downloading simsimd
 Downloading usearch
 Downloading tokenizers
 Downloading hf-xet
 Downloading hf-transfer
 Downloading aiohttp
 Downloading pygments
 Downloading setuptools
 Downloading numpy
 Downloading pyarrow
 Downloading pandas
Installed 50 packages in 116ms
usage: semantic-dedupekDOpug.py [-h] [--split SPLIT]
                                [--method {duplicates,outliers,representatives}]
                                [--threshold THRESHOLD]
                                [--batch-size BATCH_SIZE]
                                [--max-samples MAX_SAMPLES] [--private]
                                [--hf-token HF_TOKEN]
                                dataset column output_repo

Deduplicate a dataset using semantic similarity

positional arguments:
  dataset               Input dataset ID (e.g., 'imdb' or 'username/dataset')
  column                Text column to deduplicate on
  output_repo           Output dataset repository name

options:
  -h, --help            show this help message and exit
  --split SPLIT         Dataset split to process (default: train)
  --method {duplicates,outliers,representatives}
                        Deduplication method (default: duplicates)
  --threshold THRESHOLD
                        Similarity threshold for duplicates (default: 0.9)
  --batch-size BATCH_SIZE
                        Batch size for processing (default: 64)
  --max-samples MAX_SAMPLES
                        Maximum number of samples to process (for testing)
  --private             Create private dataset repository
  --hf-token HF_TOKEN   Hugging Face API token (defaults to HF_TOKEN env var)

Examples:
  # Basic usage
  uv run semantic-dedupe.py imdb text imdb-deduped

  # With options
  uv run semantic-dedupe.py squad question squad-deduped --threshold 0.85 --method duplicates

  # Test with small sample
  uv run semantic-dedupe.py large-dataset text test-dedup --max-samples 100
        

You’ll see that uv takes care of installing the dependencies and running the script. This is very convenient since we don’t have to worry about setting up a virtual environment or installing dependencies manually. This can also be very nice if you want to share a script with others and want to help them avoid getting stuck in dependency hell.

We can also run hf jobs via the Python API. This is very convenient if you want to run jobs programmatically or integrate Jobs into your existing Python code (e.g. to run one step that requires a GPU and another step that doesn’t).
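
As a rough sketch of that pattern (the dataset names below are placeholders): launch the GPU-heavy generation step as a Job, then pick up the results locally for CPU-only post-processing once the job has finished.

from datasets import load_dataset
from huggingface_hub import HfApi

api = HfApi()

# GPU step: run the uv script on Jobs hardware
job = api.run_uv_job(
    "https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py",
    script_args=["username/input-dataset", "username/output-dataset"],
    flavor="l4x4",
    image="vllm/vllm-openai:latest",
)
print(job.url)

# CPU step (run once the job has finished): load the generated responses locally
responses = load_dataset("username/output-dataset", split="train")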

Running our inference job via the huggingface_hub library

We can use the huggingface_hub library to run our inference job using the run_uv_job method.

We’ll grab a token to pass to our job

from huggingface_hub import HfApi, get_token


HF_TOKEN = get_token()

We’ll create an instance of the HfApi class and use the run_uv_job method to run our job. We’ll pass the URL of the script we want to run, the dataset we want to use, and the parameters we want to use for the job.

api = HfApi()

Let’s see what the run_uv_job method looks like:

?api.run_uv_job
Signature:
api.run_uv_job(
    script: 'str',
    *,
    script_args: 'Optional[List[str]]' = None,
    dependencies: 'Optional[List[str]]' = None,
    python: 'Optional[str]' = None,
    image: 'Optional[str]' = None,
    env: 'Optional[Dict[str, Any]]' = None,
    secrets: 'Optional[Dict[str, Any]]' = None,
    flavor: 'Optional[SpaceHardware]' = None,
    timeout: 'Optional[Union[int, float, str]]' = None,
    namespace: 'Optional[str]' = None,
    token: 'Union[bool, str, None]' = None,
    _repo: 'Optional[str]' = None,
) -> 'JobInfo'

Docstring:
Run a UV script Job on Hugging Face infrastructure.

Args:
    script (`str`):
        Path or URL of the UV script.
    script_args (`List[str]`, *optional*)
        Arguments to pass to the script.
    dependencies (`List[str]`, *optional*)
        Dependencies to use to run the UV script.
    python (`str`, *optional*)
        Use a specific Python version. Default is 3.12.
    image (`str`, *optional*, defaults to "ghcr.io/astral-sh/uv:python3.12-bookworm"):
        Use a custom Docker image with `uv` installed.
    env (`Dict[str, Any]`, *optional*):
        Defines the environment variables for the Job.
    secrets (`Dict[str, Any]`, *optional*):
        Defines the secret environment variables for the Job.
    flavor (`str`, *optional*):
        Flavor for the hardware, as in Hugging Face Spaces. See [`SpaceHardware`] for possible values.
        Defaults to `"cpu-basic"`.
    timeout (`Union[int, float, str]`, *optional*):
        Max duration for the Job: int/float with s (seconds, default), m (minutes), h (hours) or d (days).
        Example: `300` or `"5m"` for 5 minutes.
    namespace (`str`, *optional*):
        The namespace where the Job will be created. Defaults to the current user's namespace.
    token (`Union[bool, str, None]`, *optional*):
        A valid user access token. If not provided, the locally saved token will be used, which is the
        recommended authentication method. Set to `False` to disable authentication.
        Refer to: https://huggingface.co/docs/huggingface_hub/quick-start#authentication.

Example:

    ```python
    >>> from huggingface_hub import run_uv_job
    >>> script = "https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/trl/scripts/sft.py"
    >>> run_uv_job(script, dependencies=["trl"], flavor="a10g-small")
    ```

File:      ~/Documents/daniel/blog/.venv/lib/python3.12/site-packages/huggingface_hub/hf_api.py
Type:      method

We can use the run_uv_job method to run our job. We pass the URL of the script we want to run plus the script arguments: the input and output datasets and the generation parameters, which are passed to the script as command-line arguments. Since we’re using vLLM, we’ll also pass the vllm Docker image, which means our job runs inside that container.

Note

This Docker image already has uv installed. If you want to use a different image that doesn’t ship with uv, you’ll need to make sure uv is installed in it first. You can also not specify an image at all, in which case hf jobs will use the default uv image. This works well in many cases, but for LLM inference libraries, which can have quite specific requirements, it can be useful to use an image that already has the library installed.

We can now run our job using the run_uv_job method. This will start the job and return a job object that we can use to monitor the job’s progress.

job = api.run_uv_job(
    script="https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py",
    script_args=[
        "davanstrien/cards_with_prompts",  # Dataset with prompts
        "davanstrien/test-generated-responses",  # Where to store the generated responses
        "--model-id",  # Model to use for inference
        "Qwen/Qwen3-30B-A3B-Instruct-2507",  # Model to use for inference
        "--gpu-memory-utilization",  # GPU memory utilization
        "0.9",
        "--max-tokens",  # Maximum number of tokens
        "900",
        "--max-model-len",  # Maximum model length
        "8000",
    ],
    flavor="l4x4",  # What hardware to use
    image="vllm/vllm-openai:latest",  # Docker image to use
    secrets={"HF_TOKEN": HF_TOKEN},  # Pass as secret
    env={"UV_PRERELEASE": "if-necessary"},  # Pass as env var
)
/Users/davanstrien/Documents/daniel/blog/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_experimental.py:60: UserWarning: 'HfApi.run_uv_job' is experimental and might be subject to breaking changes in the future without prior notice. You can disable this warning by setting `HF_HUB_DISABLE_EXPERIMENTAL_WARNING=1` as environment variable.
  warnings.warn(

We can get a URL for our job. This gives us a page where we can monitor the job’s progress and see the logs (note that this URL won’t work for you unless you run the job yourself).

print(f"Job URL: {job.url}")
Job URL: https://huggingface.co/jobs/davanstrien/688a33391c97bc486de2a232

We can also print the status of the job

job.status
JobStatus(stage='RUNNING', message=None)

There are also a number of other attributes on the job object that can be useful when running jobs as part of a larger workflow. For example, we can get the job’s creation time, the hardware flavor it’s running on, and the command being run.

job.created_at
datetime.datetime(2025, 7, 30, 14, 59, 5, 648000, tzinfo=datetime.timezone.utc)
job.flavor
'l4x4'
job.command
['uv',
 'run',
 'https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py',
 'davanstrien/cards_with_prompts',
 'davanstrien/test-generated-responses',
 '--model-id',
 'Qwen/Qwen3-30B-A3B-Instruct-2507',
 '--gpu-memory-utilization',
 '0.9',
 '--max-tokens',
 '900',
 '--max-model-len',
 '8000']

We can also grab the logs

api.fetch_job_logs(
    job_id=job.id,
)
<generator object HfApi.fetch_job_logs at 0x1612df4c0>

This returns a generator. Let’s turn it into a list so we can print out the last few lines of the logs.

print(
    list(
        api.fetch_job_logs(
            job_id=job.id,
        )
    )[-10:]
)  # Print the last 10 lines of logs

We can also see the resulting dataset for the job here or below. You can see that we have the original prompts alongside the generated responses.
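
If you want to inspect the results programmatically, the output dataset can be loaded like any other once the job has finished (the response column is added by the script alongside the original columns):

from datasets import load_dataset

results = load_dataset("davanstrien/test-generated-responses", split="train")

# Each row keeps the original prompt messages plus the generated response
print(results[0]["messages"])
print(results[0]["response"])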