S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Haodong Yan1 Zhide Zhong1 Jiaguan Zhu1 Junjie He1 Weilin Yuan1 Wenxuan Song1 Xin Gong1 Yingjie Cai2 Guanyi Zhao2 Xu Yan2 Bingbing Liu2 Ying-Cong Chen1 Haoang Li1

1 The Hong Kong University of Science and Technology (Guangzhou) | 2 Huawei Foundation Model Department

Abstract

Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify action prediction. To enable this efficient shortcut, we introduce a self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model representations extracted from the diffusion model's own multi-step generated videos provide teacher targets, while lightweight decouplers learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments.

Introduction

As illustrated in the teaser figure, current VAMs mainly fall into two paradigms. The first relies on generating future videos to guide control, but the latency of multi-step denoising makes it unsuitable for high-frequency closed-loop control. The second paradigm, exemplified by one-step feature extraction, is efficient but yields noisy and highly entangled representations with poor temporal coherence. These representations lack the geometry-oriented cues needed to compensate for monocular depth ambiguity and the semantic distinctiveness required to distinguish task-relevant objects from irrelevant elements. To overcome this dilemma between inference latency and foresight fidelity, S-VAM foresees coherent geometric and semantic representations via a single forward pass.

Overview figure for S-VAM showing the shortcut from one-step features to geometric and semantic foresight.
Motivation and overview of our shortcut video-action model. One-step feature extraction is fast but yields noisy and entangled representations, whereas multi-step video generation predicts precise future states but is too slow for real-time control. S-VAM addresses this by foreseeing coherent geometric and semantic representations via a single forward pass, with teacher supervision extracted from the diffusion model's own multi-step generated videos during training.

Method

As shown in the method figure, S-VAM establishes a shortcut that bypasses the prohibitive latency of iterative video generation. We extract one-step denoising features, use specialized decouplers to disentangle them into geometric and semantic foresight, and supervise these branches with DPAv3 and DINOv2 representations extracted from the model's own multi-step generated videos. The resulting foresight is then aggregated with the original diffusion features by a Uni-Perceiver, providing a holistic conditioning context for the downstream diffusion policy to predict precise robot actions.
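The self-distillation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the decouplers are reduced to linear projections, the alignment objective is assumed to be a cosine-similarity loss, and the teacher features (which in S-VAM come from DPAv3 and DINOv2 run on the model's own multi-step generated videos) are replaced by random stand-ins. All variable names are illustrative.

```python
import numpy as np

def cosine_align_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) between student and teacher features."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    tgt_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return 1.0 - (pred_n * tgt_n).sum(axis=-1).mean()

rng = np.random.default_rng(0)
T, D_diff, D_geo, D_sem = 8, 64, 32, 32  # frames and feature dims (illustrative)

# One-step denoising features from the video diffusion backbone (stand-in).
one_step_feats = rng.normal(size=(T, D_diff))

# "Decouplers" reduced to linear projections for this sketch.
W_geo = rng.normal(size=(D_diff, D_geo)) * 0.1
W_sem = rng.normal(size=(D_diff, D_sem)) * 0.1
geo_pred = one_step_feats @ W_geo   # geometric foresight branch
sem_pred = one_step_feats @ W_sem   # semantic foresight branch

# Teacher targets: in S-VAM, DPAv3 / DINOv2 features of the model's own
# multi-step generated videos; random stand-ins here.
geo_teacher = rng.normal(size=(T, D_geo))
sem_teacher = rng.normal(size=(T, D_sem))

# Total distillation loss over both foresight branches.
loss = (cosine_align_loss(geo_pred, geo_teacher)
        + cosine_align_loss(sem_pred, sem_teacher))
print(geo_pred.shape, sem_pred.shape)
```

Because the teacher videos are produced offline by multi-step denoising, this supervision is paid only at training time; at inference the decouplers run on the single-step features alone.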

S-VAM method architecture.
Architecture of S-VAM. The core technical novelty lies in establishing a shortcut that bypasses the prohibitive latency of iterative video generation. Specialized decouplers disentangle highly entangled one-step diffusion features into coherent geometric and semantic foresight, which is then aggregated with original features by a Uni-Perceiver before downstream diffusion-policy action prediction.

Experiments

Simulation Benchmarks

Category                         Method            1      2      3      4      5     Avg. Len. ↑
------------------------------------------------------------------------------------------------
Direct Action Learning Methods   OpenVLA           91.3   77.8   62.0   52.1   43.5   3.27
                                 CLOVER            96.0   83.5   70.8   57.5   45.4   3.53
                                 π0                93.7   83.2   74.0   62.9   51.0   3.65
                                 Spatial Forcing   93.6   85.8   78.4   72.0   64.6   3.94
Predictive Methods               SuSIE             87.0   69.0   49.0   38.0   26.0   2.69
                                 VPP               90.9   81.5   71.3   62.0   51.8   3.58
                                 Uni-VLA           95.5   85.8   74.8   66.9   56.5   3.80
                                 HiF-VLA           93.5   87.4   81.4   75.9   69.4   4.08
                                 S-VAM (ours)      95.8   90.7   83.7   77.0   68.9   4.16
Quantitative comparison on CALVIN. We report the success rate at each stage of the 5-task sequence (columns 1-5) and the average sequence length, with best and second-best results highlighted accordingly.
Category                         Method            Easy        Middle      Hard        Average ↑
                                                   (28 tasks)  (11 tasks)  (11 tasks)  (50 tasks)
-------------------------------------------------------------------------------------------------
Direct Action Learning Methods   RT-1              0.605       0.042       0.015       0.346
                                 Diffusion Policy  0.442       0.062       0.095       0.279
                                 Spatial Forcing   0.737       0.436       0.451       0.609
Predictive Methods               SuSIE             0.560       0.196       0.255       0.410
                                 GR-1              0.725       0.327       0.451       0.574
                                 HiF-VLA           0.729       0.364       0.404       0.577
                                 VPP               0.818       0.493       0.526       0.682
                                 S-VAM (ours)      0.793       0.607       0.684       0.728
Quantitative comparison on MetaWorld. We report success rates on easy, middle, and hard tasks, together with the overall average, with best and second-best results highlighted accordingly.

Qualitative Comparison

Qualitative comparison on CALVIN.
Qualitative comparison on CALVIN. VPP utilizes entangled one-step features, resulting in an erratic attention trajectory that explicitly contradicts the language instruction and leads to failed actuation. In contrast, S-VAM foresees geometric and semantic representations that enable a coherent attention trajectory aligned with the language instruction, ensuring successful execution.
Qualitative comparison on MetaWorld.
Qualitative comparison on MetaWorld. VPP utilizes entangled one-step features, resulting in a diverging attention trajectory that completely misses the target object. In contrast, S-VAM foresees explicit geometric and semantic representations, enabling the action expert to anchor a coherent attention trajectory for accurate target-oriented grasping.

Real-World Experiments

Real-world experimental results for S-VAM.
Multi-task real-world experiments. We deploy S-VAM on a dual-arm Cobot using only monocular front-camera observations. S-VAM demonstrates a significant success-rate improvement over VPP on all four tasks without compromising real-time control capabilities.
