Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
This repository contains the model weights for the DSR Suite, which introduces advancements in dynamic spatial reasoning for Vision Language Models (VLMs), as presented in the paper Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models.
Introduction
Vision-language models (VLMs) typically excel at general understanding but struggle with Dynamic Spatial Reasoning (DSR): reasoning about how object geometry and inter-object relationships evolve in 3D space over time. To address this gap, we introduce DSR Suite, which comprises:
- Automated Data Generation Pipeline: A system that constructs multiple-choice question-answer pairs from in-the-wild videos for DSR.
- DSR-Train: A training dataset of 50K QAs generated by the pipeline.
- DSR-Bench: A human-refined benchmark with 1484 QAs for rigorous evaluation.
- Geometry Selection Module (GSM): A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a Qwen2.5-VL-7B backbone, without compromising general understanding capabilities.
Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.
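To make the idea behind GSM more concrete, below is a minimal, hypothetical PyTorch sketch of fusing geometric features from a 3D foundation model into a VLM's visual token stream. This is not the actual DSR Suite implementation: the module structure, the gated cross-attention, and all dimensions (e.g. a 3584-dim VLM hidden size, 512-dim geometry features) are illustrative assumptions; please refer to the GitHub repository for the real code.
```python
# Illustrative sketch only (not the paper's GSM architecture): fuse per-frame
# geometric features from a 3D foundation model into a VLM's visual tokens.
import torch
import torch.nn as nn


class GeometrySelectionSketch(nn.Module):
    """Select and inject geometric cues into visual tokens via cross-attention."""

    def __init__(self, vlm_dim: int = 3584, geo_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, vlm_dim)           # map geometry features to the VLM width
        self.cross_attn = nn.MultiheadAttention(vlm_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))               # zero-init: no influence at the start of training

    def forward(self, visual_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_vis, vlm_dim) from the VLM vision encoder
        # geo_tokens:    (B, N_geo, geo_dim) from a 3D foundation model (assumed shape)
        geo = self.geo_proj(geo_tokens)
        selected, _ = self.cross_attn(query=visual_tokens, key=geo, value=geo)
        # Gated residual keeps the pretrained VLM behavior intact at initialization.
        return visual_tokens + torch.tanh(self.gate) * selected


if __name__ == "__main__":
    fuser = GeometrySelectionSketch(vlm_dim=3584, geo_dim=512)
    vis = torch.randn(1, 256, 3584)   # dummy visual tokens for one clip
    geo = torch.randn(1, 128, 512)    # dummy geometric features
    print(fuser(vis, geo).shape)      # torch.Size([1, 256, 3584])
```
The zero-initialized gate mirrors a common recipe for injecting a new modality into a pretrained model without disturbing its original behavior, which is in the spirit of GSM's goal of adding geometric priors without compromising general understanding.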
Resources
- Paper: Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
- GitHub Repository: https://github.com/TencentARC/DSR_Suite
- Hugging Face Dataset: TencentARC/DSR_Suite-Data
- Hugging Face Collection: TencentARC/dsr-suite
Usage and Evaluation
For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official DSR_Suite GitHub repository.
The evaluation framework is based on VLMEvalKit. An example command for evaluating a trained model (like Qwen2.5-VL-7B-Instruct-ForVideo-Spatial) on the Spatial-Reasoning task is:
```bash
cd VLMEvalKit_mine
CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
```
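VLMEvalKit also supports multi-GPU evaluation by launching `run.py` through `torchrun`; consult the VLMEvalKit documentation and the DSR_Suite repository for the exact launch options supported by the pinned version.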
Citation
If you find our work useful, please consider citing:
```bibtex
@misc{zhou2025learning,
  title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models},
  author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
  year={2025},
  eprint={2512.20557},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.20557},
}
```
Acknowledgement
This work builds upon the following projects:
- Qwen2.5-VL: The model codebase we built upon.
- VLMEvalKit: The evaluation framework we built upon.
- Grounded SAM2, Orient Anything, π^3: Models used in our data generation pipeline to extract 3D cues.
- Koala-36M: The video database our QAs are built upon.