ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code Paper • 2506.02314 • Published Jun 2, 2025
Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning Paper • 2506.05256 • Published Jun 5, 2025 • 2
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing Paper • 2507.00769 • Published Jul 1, 2025 • 5
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models Paper • 2407.07086 • Published Jul 9, 2024
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought Paper • 2501.04682 • Published Jan 8, 2025 • 99