ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
Abstract
Visual Foresight Planning (ForeAct) enhances Vision-Language-Action models by generating future observations and subtask descriptions to improve decision-making in real-world environments.
Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction within only 0.33s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on over 1 million multi-task, cross-embodiment episodes, enabling it to learn robust embodied dynamics. We evaluate our framework on a benchmark that consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the π_0 baseline (46.5%) and a +30.3% absolute improvement over π_0 augmented with textual subtask guidance (57.1%).
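The control flow described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name `run_episode`, the callables `describe_subtask`, `generate_foresight`, `vla_policy`, `observe`, and `act`, the `done_token` stopping convention, and the `Image` type are all hypothetical names introduced here. The only points taken from the abstract are that a vision-language model produces a subtask description, a generator predicts a future observation from the current image and that description, and the VLA receives the imagined image as an extra visual input.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a camera frame (e.g. a 640x480 image).
Image = List[List[int]]


def run_episode(
    instruction: str,
    observe: Callable[[], Image],
    act: Callable[[str], None],
    describe_subtask: Callable[[Image, str], str],
    generate_foresight: Callable[[Image, str], Image],
    vla_policy: Callable[[List[Image], str], str],
    max_steps: int = 20,
    done_token: str = "done",
) -> List[Tuple[str, str]]:
    """Guide a VLA step by step; return the (subtask, action) history."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        observation = observe()
        # 1. The VLM reasons over the task and names the next subtask.
        subtask = describe_subtask(observation, instruction)
        if subtask == done_token:  # assumed stopping convention
            break
        # 2. The generator imagines the observation for that subtask.
        foresight = generate_foresight(observation, subtask)
        # 3. The VLA itself is unchanged; its visual input is augmented
        #    with the imagined frame, and it also receives the subtask text.
        action = vla_policy([observation, foresight], subtask)
        act(action)
        history.append((subtask, action))
    return history
```

Passing the components in as callables keeps the sketch independent of any particular VLA or image generator, which mirrors the abstract's claim that the planner plugs in by augmenting visual inputs rather than by changing the policy architecture.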