Abstract
Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify action prediction. To enable this efficient shortcut, we introduce a self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model representations extracted from the diffusion model's own multi-step generated videos provide teacher targets, while lightweight decouplers learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments.
Introduction
As illustrated in the teaser figure, current VAMs mainly fall into two paradigms. The first relies on generating future videos to guide control, but the latency of multi-step denoising makes it unsuitable for high-frequency closed-loop control. The second paradigm, exemplified by one-step feature extraction, is efficient but yields noisy and highly entangled representations with poor temporal coherence. These representations lack the geometry-oriented cues needed to compensate for monocular depth ambiguity and the semantic distinctiveness required to distinguish task-relevant objects from irrelevant elements. To overcome this dilemma between inference latency and foresight fidelity, S-VAM foresees coherent geometric and semantic representations via a single forward pass.
Method
As shown in the method figure, S-VAM establishes a shortcut that bypasses the prohibitive latency of iterative video generation. We extract one-step denoising features, use specialized decouplers to disentangle them into geometric and semantic foresight, and supervise these branches with DPAv3 and DINOv2 representations extracted from the model's own multi-step generated videos. The resulting foresight is then aggregated with the original diffusion features by a Uni-Perceiver, providing a holistic conditioning context for the downstream diffusion policy to predict precise robot actions.
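The self-distillation step above can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the decouplers are stand-in linear maps, the cosine-distance loss, feature shapes, and random "teacher" targets (standing in for DPAv3/DINOv2 features of multi-step generated videos) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoupler(feats, W):
    # Toy "lightweight decoupler": a linear map from noisy one-step
    # denoising features into the teacher's representation space.
    return feats @ W

def distill_loss(student, teacher):
    # Cosine-distance distillation loss, averaged over feature tokens.
    # The teacher targets are treated as fixed (no gradient flows to them).
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy shapes: 16 spatial tokens, 32-dim one-step features, 24-dim teacher features.
one_step_feats = rng.normal(size=(16, 32))  # noisy features from a single denoising pass
teacher_geo = rng.normal(size=(16, 24))     # stand-in for DPAv3 (geometric) targets
teacher_sem = rng.normal(size=(16, 24))     # stand-in for DINOv2 (semantic) targets

# One decoupler per branch, so geometry and semantics are disentangled.
W_geo = 0.1 * rng.normal(size=(32, 24))
W_sem = 0.1 * rng.normal(size=(32, 24))

loss = (distill_loss(decoupler(one_step_feats, W_geo), teacher_geo)
        + distill_loss(decoupler(one_step_feats, W_sem), teacher_sem))
```

In training, `W_geo` and `W_sem` would be learned networks and the teacher features would be recomputed from the diffusion model's own multi-step rollouts, so the one-step branch is pulled toward the multi-step foresight without paying its inference cost.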
Experiments
Simulation Benchmarks
*Success rate for the i-th task in a sequence (columns 1–5) and average completed length (Avg. Len.).*

| Category | Method | 1 | 2 | 3 | 4 | 5 | Avg. Len. ↑ |
|---|---|---|---|---|---|---|---|
| Direct Action Learning Methods | OpenVLA | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| | CLOVER | 96.0 | 83.5 | 70.8 | 57.5 | 45.4 | 3.53 |
| | π0 | 93.7 | 83.2 | 74.0 | 62.9 | 51.0 | 3.65 |
| | Spatial Forcing | 93.6 | 85.8 | 78.4 | 72.0 | 64.6 | 3.94 |
| Predictive Methods | SuSIE | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| | VPP | 90.9 | 81.5 | 71.3 | 62.0 | 51.8 | 3.58 |
| | Uni-VLA | 95.5 | 85.8 | 74.8 | 66.9 | 56.5 | 3.80 |
| | HiF-VLA | 93.5 | 87.4 | 81.4 | 75.9 | 69.4 | 4.08 |
| | **S-VAM (ours)** | 95.8 | 90.7 | 83.7 | 77.0 | 68.9 | **4.16** |
| Category | Method | Easy (28 tasks) | Middle (11 tasks) | Hard (11 tasks) | Average ↑ (50 tasks) |
|---|---|---|---|---|---|
| Direct Action Learning Methods | RT-1 | 0.605 | 0.042 | 0.015 | 0.346 |
| | Diffusion Policy | 0.442 | 0.062 | 0.095 | 0.279 |
| | Spatial Forcing | 0.737 | 0.436 | 0.451 | 0.609 |
| Predictive Methods | SuSIE | 0.560 | 0.196 | 0.255 | 0.410 |
| | GR-1 | 0.725 | 0.327 | 0.451 | 0.574 |
| | HiF-VLA | 0.729 | 0.364 | 0.404 | 0.577 |
| | VPP | 0.818 | 0.493 | 0.526 | 0.682 |
| | **S-VAM (ours)** | 0.793 | 0.607 | 0.684 | **0.728** |
Qualitative Comparison
Real-World Experiments
BibTeX