MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence Paper • 2512.10863 • Published 14 days ago • 21
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations Paper • 2406.09401 • Published Jun 13, 2024
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models Paper • 2505.17015 • Published May 22 • 9
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence Paper • 2505.23764 • Published May 29 • 3
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Paper • 2507.07984 • Published Jul 10 • 42
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization Paper • 2508.05211 • Published Aug 7 • 1
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning Paper • 2511.21688 • Published 29 days ago • 8
Seedream 4.0: Toward Next-generation Multimodal Image Generation Paper • 2509.20427 • Published Sep 24 • 81