Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
This repository contains the model weights for the DSR Suite, which introduces advancements in dynamic spatial reasoning for Vision Language Models (VLMs), as presented in the paper Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models.
Introduction
Vision-language models (VLMs) typically excel at general understanding but struggle with Dynamic Spatial Reasoning (DSR): reasoning about how object geometry and inter-object relationships evolve in 3D space over time. To address this gap, we introduce DSR Suite, which comprises:
- Automated Data Generation Pipeline: A system that constructs multiple-choice question-answer pairs from in-the-wild videos for DSR.
- DSR-Train: A training dataset of 50K QAs generated by the pipeline.
- DSR-Bench: A human-refined benchmark with 1484 QAs for rigorous evaluation.
- Geometry Selection Module (GSM): A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a Qwen2.5-VL-7B backbone, without compromising general understanding capabilities.
Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.
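To make the idea behind GSM more concrete, below is a minimal, hypothetical PyTorch sketch of fusing geometric features from a 3D foundation model into a VLM's visual token stream. This is not the actual DSR Suite implementation: the module structure, the gated cross-attention, and all dimensions (e.g. a 3584-dim VLM hidden size, 512-dim geometry features) are illustrative assumptions; please refer to the GitHub repository for the real code.
```python
# Illustrative sketch only (not the paper's GSM architecture): fuse per-frame
# geometric features from a 3D foundation model into a VLM's visual tokens.
import torch
import torch.nn as nn


class GeometrySelectionSketch(nn.Module):
    """Select and inject geometric cues into visual tokens via cross-attention."""

    def __init__(self, vlm_dim: int = 3584, geo_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, vlm_dim)           # map geometry features to the VLM width
        self.cross_attn = nn.MultiheadAttention(vlm_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))               # zero-init: no influence at the start of training

    def forward(self, visual_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_vis, vlm_dim) from the VLM vision encoder
        # geo_tokens:    (B, N_geo, geo_dim) from a 3D foundation model (assumed shape)
        geo = self.geo_proj(geo_tokens)
        selected, _ = self.cross_attn(query=visual_tokens, key=geo, value=geo)
        # Gated residual keeps the pretrained VLM behavior intact at initialization.
        return visual_tokens + torch.tanh(self.gate) * selected


if __name__ == "__main__":
    fuser = GeometrySelectionSketch(vlm_dim=3584, geo_dim=512)
    vis = torch.randn(1, 256, 3584)   # dummy visual tokens for one clip
    geo = torch.randn(1, 128, 512)    # dummy geometric features
    print(fuser(vis, geo).shape)      # torch.Size([1, 256, 3584])
```
The zero-initialized gate mirrors a common recipe for injecting a new modality into a pretrained model without disturbing its original behavior, which is in the spirit of GSM's goal of adding geometric priors without compromising general understanding.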
Resources
- Paper: Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
- GitHub Repository: https://github.com/TencentARC/DSR_Suite
- Hugging Face Dataset: TencentARC/DSR_Suite-Data
- Hugging Face Collection: TencentARC/dsr-suite
Usage and Evaluation
For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official DSR_Suite GitHub repository.
The evaluation framework is based on VLMEvalKit. An example command for evaluating a trained model (like Qwen2.5-VL-7B-Instruct-ForVideo-Spatial) on the Spatial-Reasoning task is:
```bash
cd VLMEvalKit_mine
CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
```
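VLMEvalKit also supports multi-GPU evaluation by launching `run.py` through `torchrun`; consult the VLMEvalKit documentation and the DSR_Suite repository for the exact launch options supported by the pinned version.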
Citation
If you find our work useful, please consider citing:
```bibtex
@misc{zhou2025learning,
  title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models},
  author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
  year={2025},
  eprint={2512.20557},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.20557},
}
```
Acknowledgement
This work builds upon the following projects:
- Qwen2.5-VL: The model codebase we built upon.
- VLMEvalKit: The evaluation framework we built upon.
- Grounded SAM2, Orient Anything, π^3: Models used in our data generation pipeline to extract 3D cues.
- Koala-36M: The video database our QAs are built upon.