FineData

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

hynky new activity about 1 hour ago

HuggingFaceFW/finepdfs: How to use this dataset to extract PDFs by subject?

hynky new activity about 1 hour ago

HuggingFaceFW/finepdfs: Can additional corpuses further train this model?

hynky new activity about 1 hour ago

HuggingFaceFW/finepdfs: Decontamination against benchmarks?


Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale


hynky

in HuggingFaceFW/finepdfs about 1 hour ago

How to use this dataset to extract PDFs by subject?

👍 1

#14 opened 4 months ago by vgoklani

Can additional corpuses further train this model?

#13 opened 4 months ago by fenjamin

Decontamination against benchmarks?

#11 opened 4 months ago by jo-kn

MarCognity-AI for HuggingFaceFW/finepdfs

#23 opened 3 months ago by elly99

hynky

updated a Space about 7 hours ago

FinePDFs: Liberating 3T of the finest tokens from PDFs

📄

hynky

published a Space 1 day ago

FinePDFs: Liberating 3T of the finest tokens from PDFs

📄

guipenedo

updated a Space 1 day ago

FinePDFs: Liberating 3T of the finest tokens from PDFs

📄

craffel

authored a paper 13 days ago

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Paper • 2512.20757 • Published 15 days ago • 16

eliebak

submitted a paper to Daily Papers 20 days ago

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Paper • 2512.14080 • Published 23 days ago • 6

hynky

in HuggingFaceFW/finepdfs 24 days ago

Which language detector did you use

#28 opened 24 days ago by ming030890

hynky

in HuggingFaceFW/finepdfs 27 days ago

The "file_path" data field appears to primarily contain cc-index paths rather than WARC paths.

#16 opened 4 months ago by lnstrument

A Few Questions About the Implementation Details of the finepdfs Project

#24 opened 3 months ago by yoliax

hynky

in HuggingFaceFW/finepdfs about 1 month ago

Dataset broken by latest update?

#27 opened about 1 month ago by Rijgersberg

hynky

updated a dataset about 1 month ago

HuggingFaceFW/finepdfs

Viewer • Updated Dec 2, 2025 • 476M • 24.7k • 692

meg

posted an update 2 months ago

Post

3874

🤖 Did you know your voice might be cloned without your consent from just *one sentence* of audio?
That's not great. So with @frimelle , we brainstormed a new idea for developers who want to curb malicious use: ✨The Voice Consent Gate.✨
Details, code, here: https://huggingface.co/blog/voice-consent-gate

3 replies

thomwolf

authored a paper 3 months ago

Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14, 2025 • 120

lvwerra

authored a paper 3 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9, 2025 • 36

meg

posted an update 4 months ago

Post

2911

🤖 As AI-generated content is shared in movies/TV/across the web, there's one simple low-hanging fruit 🍇 to help know what's real: Visible watermarks. With the Gradio team, I've made sure it's trivially easy to add this disclosure to images, video, chatbot text. See how: https://huggingface.co/blog/watermarking-with-gradio
Thanks to the code collab in particular from @abidlabs and Yuvraj Sharma.

davanstrien

posted an update 4 months ago

Post

1541

I fine-tuned a smol VLM to generate specialized art history metadata!

https://huggingface.co/davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

eliebak

posted an update 4 months ago

Post

3896

Super excited to announce that our research team at Hugging Face will be doing an AMA on reddit r/LocalLLaMA.

Come ask any questions to the team behind SmolLM, FineWeb and more! And who knows, maybe there’ll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST 🤗


Team members: 18
