Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization Paper • 2605.26457 • Published 13 days ago • 6
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published 7 days ago • 53
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published 7 days ago • 53
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published 7 days ago • 53
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published 19 days ago • 12
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published 19 days ago • 12
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published 19 days ago • 12
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design Paper • 2605.10978 • Published 26 days ago • 19
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19 • 6
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 30 days ago • 80
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 30 days ago • 80
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19 • 6
Measuring Sycophancy of Language Models in Multi-turn Dialogues Paper • 2505.23840 • Published May 28, 2025 • 3
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Paper • 2507.00432 • Published Jul 1, 2025 • 79
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs Paper • 2508.13141 • Published Aug 18, 2025