All HF Hub posts

Jaward 
posted an update about 15 hours ago
Our preprint is out!
We attempt to model human teaching behaviors in agents, yielding a unified framework that enables adaptive, personalized learning experiences.
LectūraAgents addresses prevailing limitations of current AI learning systems with three essential capabilities:
(1) A hierarchical multi-agent architecture modeled on academic standards. We observe that agents collaborating across hierarchies yield better learning outcomes.
(2) An adaptive embodied teaching mechanism, in which the instructor agent executes visible, pedagogically motivated teaching actions (e.g. handwriting, highlighting, circling) on content in a teaching environment while speaking.
(3) To achieve this, we propose a novel teaching action-speech alignment algorithm (TASA) that dynamically aligns speech with visual teaching actions. Specifically, TASA splits speech segments temporally into word-level tokens, performs heuristic salience analysis on the learning content (text, images, etc.), and then identifies relevant regions in which to apply pedagogical teaching actions that guide attention and aid understanding.
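
A rough sketch of the alignment idea in (3), for readers who want something concrete. It assumes word-level timestamps are already available and approximates salience with a plain keyword-to-region lookup. The names `Word`, `Region`, and `align_actions` are hypothetical and not taken from the paper; the actual TASA algorithm may differ substantially.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the speech segment
    end: float


@dataclass
class Region:
    label: str   # keyword this on-screen region is about (assumed salience cue)
    action: str  # teaching action to apply, e.g. "highlight" or "circle"


def align_actions(words, regions):
    """Return (time, action, label) for each spoken word that names a region."""
    by_label = {r.label.lower(): r for r in regions}
    schedule = []
    for w in words:
        key = w.text.lower().strip(".,;:!?")
        region = by_label.get(key)
        if region is not None:
            # Fire the action when the matching word starts being spoken.
            schedule.append((w.start, region.action, region.label))
    return schedule


words = [
    Word("The", 0.0, 0.2),
    Word("gradient", 0.2, 0.7),
    Word("points", 0.7, 1.0),
    Word("uphill.", 1.0, 1.5),
]
regions = [Region("gradient", "highlight"), Region("uphill", "circle")]
schedule = align_actions(words, regions)
print(schedule)
```

A real system would presumably replace the exact-match lookup with a learned or heuristic salience score, but the output shape (timestamped actions tied to content regions) is the part this sketch is meant to illustrate.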

We conducted several experiments to assess these capabilities: a pedagogical evaluation of the various components under frontier models, a comparative analysis with existing frameworks, and an efficacy study with real students.

Results show consistent gains in standard instructional metrics (curated by expert educators) spanning lecture content quality, embodied teaching quality, assessment, and personalization over baseline systems, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.

Paper: LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching (2606.16428)
Data: Jaward/lectura-agents-data
mmhamdy 
posted an update 2 days ago
What if you could train a model on just 10 images instead of 60,000 and still get close to the same performance?

Traditional machine learning requires thousands, even millions, of data points to achieve high accuracy. But what if we could "distill" the entire dataset into just a few synthetic samples?

This is what Dataset Distillation offers. Unlike traditional knowledge distillation, we keep the model fixed and distill the knowledge contained in a massive training set into a tiny set of synthetic distilled images.

The goal is to train a model on this ultra-small set and achieve performance that almost matches what the same model would get when trained on the massive original dataset.

For example, training on only 10 distilled MNIST images (this is equivalent to a single image per class) yields 94% accuracy, compared to 99% when training on the full 60,000 images.
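
To put those numbers in perspective, here is the arithmetic as a small sketch. It uses only the figures quoted above; the variable names are mine.

```python
full_size = 60_000        # MNIST training images
distilled_size = 10       # distilled synthetic images
num_classes = 10          # digits 0-9
full_acc_pct = 99         # accuracy (%) when training on the full set
distilled_acc_pct = 94    # accuracy (%) when training on the distilled set

images_per_class = distilled_size // num_classes  # 1
compression = full_size // distilled_size         # 6000
fraction = distilled_size / full_size             # share of the original data kept
gap_points = full_acc_pct - distilled_acc_pct     # 5

print(f"{images_per_class} image per class, {compression}x fewer images")
print(f"{fraction:.4%} of the data, {gap_points} accuracy points lost")
```

So the distilled set is 6,000 times smaller than the original and costs about 5 percentage points of accuracy.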

Interestingly, these distilled images look significantly different (as you can see in the image below) from natural images because they are optimized for model training rather than for matching the correct data distribution.

But that's not all.

Most importantly, this same method opens the door to a potent form of data poisoning. Because distilled images are specifically optimized for rapid learning, an attacker can create a tiny set of adversarial distilled images to cause a well-trained model to forget or misclassify a specific category.

What I find fascinating about dataset distillation is this: it mimics human-like learning by letting a model grasp a concept from a single example, but it does so using alien synthetic images that mean absolutely nothing to a human eye!

What about you? What are your thoughts on it?
KingNish 
posted an update 2 days ago
We trained an open-source, Mythos-like cybersecurity LLM for the Build Small Hackathon: meet OpenMythos.

Trained in two stages: SFT on ~1.84K filtered ArXiv cs.CR papers + real CVE data, then RLVR using GitHub repos paired with past vulnerabilities, with a verifier model checking outputs against ground truth.

Trained on: H100s from Modal

The RLVR stage made the biggest difference: responses got more precise and less prone to confusing similar vulnerability classes.
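
For anyone unfamiliar with RLVR, here is a minimal sketch of what a verifiable reward can look like. It is a rule-based stand-in, not the verifier model used for OpenMythos: it assumes the ground truth is a single CWE identifier and that the response names its guess in `CWE-<number>` form. `verifiable_reward` and the partial-credit values are hypothetical.

```python
import re

CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def verifiable_reward(response: str, ground_truth_cwe: str) -> float:
    """Score a response against a known vulnerability class.

    1.0 if the response names exactly the ground-truth CWE,
    0.5 if it names the right one but hedges with others (assumed penalty),
    0.0 if it names none or only wrong ones.
    """
    found = {f"CWE-{num}" for num in CWE_PATTERN.findall(response)}
    if not found:
        return 0.0
    if found == {ground_truth_cwe}:
        return 1.0
    if ground_truth_cwe in found:
        return 0.5
    return 0.0


print(verifiable_reward("This looks like CWE-89 (SQL injection).", "CWE-89"))
print(verifiable_reward("Could be CWE-79 or CWE-89.", "CWE-89"))
```

Penalizing responses that list several classes is one simple way to discourage confusing similar vulnerability classes, though the actual training setup may have handled this differently.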

Everything is open:
🤖 Demo → build-small-hackathon/OpenMythos
🧠 Model → build-small-hackathon/OpenMythos
📦 CVE Dataset → build-small-hackathon/CVE_Vulnerailities_Detailed
📄 ArXiv Dataset → himanshu17HF/ArvixImport-Filtered-Final

Try it out and let us know where it breaks 🙏
danielhanchen 
posted an update 2 days ago
owensong 
posted an update about 18 hours ago
I just released Inflect-Nano-v1, an ultra-small 4.63M-parameter text-to-speech model.

The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.

Quick facts:
- 4.63M total inference parameters
- 3.46M acoustic model
- 1.17M vocoder
- 24 kHz audio
- English-only
- Single male voice
- Runs locally with a simple PyTorch inference script
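
As a quick sanity check on the parameter budget above, and a rough estimate of weight storage. The parameter counts are the rounded figures from this post; the fp32/fp16 precisions are assumptions for illustration, and the estimate covers weights only (no activations or runtime overhead).

```python
acoustic_params = 3_460_000   # 3.46M acoustic model
vocoder_params = 1_170_000    # 1.17M vocoder
total_params = acoustic_params + vocoder_params


def weight_megabytes(params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1e6


fp32_mb = weight_megabytes(total_params, 4)  # assumed 32-bit floats
fp16_mb = weight_megabytes(total_params, 2)  # assumed 16-bit floats

print(f"total: {total_params / 1e6:.2f}M parameters")
print(f"weights: ~{fp32_mb:.2f} MB (fp32), ~{fp16_mb:.2f} MB (fp16)")
```

Under those assumptions the full text-to-waveform stack is roughly 18.5 MB of weights in fp32 and about half that in fp16.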

Why I made it:
Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.

It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.

What works:
It can generate arbitrary English speech locally, and the model is small enough to be interesting for:

- local voice assistants
- embedded/edge experiments
- browser or WASM-style TTS exploration
- efficient inference research
- tiny-model baselines

Limitations:
The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.

So I would frame this as a research/demo release, not a production TTS engine.

I’d love feedback from people interested in:
- tiny speech models
- vocoders
- local TTS
- efficient inference
- embedded speech synthesis
- improving small-model generalization

If people find it useful, I’m interested in putting more training budget into a stronger v2.

Model page:
owensong/Inflect-Nano-v1
ovi054 
posted an update 3 days ago
Qwen3-14B Manim Expert LoRA

For "Build Small Hackathon", I built a Gradio app that turns any concept into a Manim explainer video.

This is powered by Qwen3-14B + Manim LoRA I trained on a synthetic 10k dataset I generated.

👉 Try it now: build-small-hackathon/anim-vid-ai
kanaria007 
posted an update 2 days ago
✅ Article highlight: *Institutional Memory & Forgetting for Learning Worlds* (art-60-172, v0.1)

TL;DR:
This article argues that if a living world becomes training data, memory becomes infrastructure.

Logs, dialogue, labels, releases, feature stores, and model weights can turn a world into something that cannot honestly forget. art-60-172 turns deletion, redaction, exclusion, forgetting requests, SANITIZED/PUBLIC releases, and unlearning claims into receipted governance lifecycles.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents learning worlds from becoming “unforgettable worlds”
• separates deletion, redaction, and future extraction exclusion
• makes right-to-be-forgotten requests caseable and appealable
• preserves canon facts without preserving every memory surface
• blocks public promises like “guaranteed deletion everywhere”

What’s inside:
• retention policy contracts for what may be kept, copied, trained on, or released
• corpus segment manifests and propagation indexes for known controlled copies
• forgetting request, adjudication, remedy, deletion, redaction, and exclusion receipts
• tombstone manifests and semantic preservation receipts for canon-safe forgetting
• use eligibility receipts for deciding whether a segment may train a future run
• release contracts, redaction maps, and irreversibility disclosures for SANITIZED/PUBLIC releases
• bounded unlearning contracts and post-unlearning verification receipts
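
To make the receipt idea concrete, here is a minimal sketch of how a use eligibility check could be driven by recorded receipts. This is not the article's schema: `Receipt`, `Ledger`, `may_train`, and the rule that only deletion and exclusion receipts block training are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Receipt:
    segment_id: str
    kind: str       # assumed values: "deletion", "redaction", "exclusion"
    policy_id: str  # retention policy the action was taken under


@dataclass
class Ledger:
    receipts: list = field(default_factory=list)

    def record(self, receipt: Receipt) -> None:
        self.receipts.append(receipt)

    def may_train(self, segment_id: str) -> bool:
        """A segment is eligible unless a blocking receipt exists for it."""
        blocking = {"deletion", "exclusion"}  # assumption: redaction alone does not block
        return not any(
            r.segment_id == segment_id and r.kind in blocking
            for r in self.receipts
        )


ledger = Ledger()
ledger.record(Receipt("seg-001", "exclusion", "policy-A"))
ledger.record(Receipt("seg-002", "redaction", "policy-A"))
print(ledger.may_train("seg-001"), ledger.may_train("seg-002"), ledger.may_train("seg-003"))
```

The point of the sketch is that eligibility is derived from an auditable record of what was done and under which policy, rather than from the mere absence of data.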

Key idea:
Do not say:

*“we deleted it, so it is forgotten.”*

Say:

*“this subject was handled under this retention policy, propagation index, adjudication path, remedy contract, tombstone, semantic preservation receipt, extraction exclusion receipt, and bounded public claim.”*

Forgetting is not a button.

It is governance with receipts.
loay 
posted an update 3 days ago
I built EchoYard for the build-small-hackathon: a tiny listen-and-repeat language practice app.

Pick a language, level, and voice style, listen to a short reference voice, record yourself, then get simple speaking feedback and a next practice step.

Built with openbmb VoxCPM2 for multilingual reference audio and MiniCPM5-1B for friendly feedback.

Try it here: https://build-small-hackathon-echoyard.hf.space

Would love feedback, especially on the recording flow and how useful the speaking tips feel.
nevmenandr 
posted an update 3 days ago
🔥 New Russian Stylometry Dataset!

Russian Stylometric Dataset (RSD) — 322 texts from the 19th – early 20th centuries (16 million words), prepared for analysis in stylo (R) and machine learning (Python).

📚 What's inside?

Fiction, journalism, scientific texts, drama, poetry

Grouped by author, gender, age, genre, literary movements (Romanticism/Realism)

Character speech (Tolstoy, Gogol, Ostrovsky)

Generated texts (LSTM, GPT)

📊 Use cases: authorship attribution, clustering, classification, benchmarking methods.
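
A minimal sketch of the kind of feature extraction these use cases start from: relative frequencies of a fixed word list, compared with a plain Manhattan distance. This is not code from the RSD repository and it is simpler than Burrows's Delta (no z-scoring). The function names and the toy English sentences are placeholders.

```python
import re
from collections import Counter


def mfw_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    tokens = re.findall(r"\w+", text.lower())  # \w also matches Cyrillic letters
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def manhattan(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


vocabulary = ["the", "and"]
profile_a = mfw_profile("the cat and the dog", vocabulary)
profile_b = mfw_profile("a cat and a dog and a bird", vocabulary)
print(profile_a, profile_b, manhattan(profile_a, profile_b))
```

In practice the vocabulary would be the few hundred most frequent words of the whole corpus, and smaller distances between profiles suggest more similar style.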

🔓 Public domain + GPL-3.0 license.

👉 Learn more: https://github.com/nevmenandr/RSD

DOI: 10.5281/zenodo.20701309
Moneyparking 
posted an update 3 days ago