Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published 6 days ago • 49
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality Paper • 2512.10791 • Published 9 days ago • 5
Evaluating Gemini Robotics Policies in a Veo World Simulator Paper • 2512.10675 • Published 9 days ago • 15
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale Paper • 2512.10398 • Published 10 days ago • 6
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving Paper • 2512.10739 • Published 9 days ago • 45
RefineBench: Evaluating Refinement Capability of Language Models via Checklists Paper • 2511.22173 • Published 24 days ago • 12
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research Paper • 2511.19399 • Published 26 days ago • 59
OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists Paper • 2511.16931 • Published 30 days ago • 6
WorldGen: From Text to Traversable and Interactive 3D Worlds Paper • 2511.16825 • Published about 1 month ago • 21
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents Paper • 2511.13593 • Published Nov 17 • 24
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Paper • 2511.16334 • Published about 1 month ago • 91
Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs Paper • 2511.16664 • Published about 1 month ago • 25
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark Paper • 2511.13853 • Published Nov 17 • 34