Abstract
Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify action prediction. To enable this efficient shortcut, we introduce a self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model representations extracted from the diffusion model's own multi-step generated videos provide teacher targets, while lightweight decouplers learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments.
Introduction
As illustrated in the teaser figure, current VAMs mainly fall into two paradigms. The first relies on generating future videos to guide control, but the latency of multi-step denoising makes it unsuitable for high-frequency closed-loop control. The second paradigm, exemplified by one-step feature extraction, is efficient but yields noisy and highly entangled representations with poor temporal coherence. These representations lack the geometry-oriented cues needed to compensate for monocular depth ambiguity and the semantic distinctiveness required to distinguish task-relevant objects from irrelevant elements. To overcome this dilemma between inference latency and foresight fidelity, S-VAM foresees coherent geometric and semantic representations via a single forward pass.
Method
As shown in the method figure, S-VAM establishes a shortcut that bypasses the prohibitive latency of iterative video generation. We extract one-step denoising features, use specialized decouplers to disentangle them into geometric and semantic foresight, and supervise these branches with DPAv3 and DINOv2 representations extracted from the model's own multi-step generated videos. The resulting foresight is then aggregated with the original diffusion features by a Uni-Perceiver, providing a holistic conditioning context for the downstream diffusion policy to predict precise robot actions.
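The self-distillation step above can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the decouplers are stand-in linear maps, the cosine-distance loss, feature shapes, and random "teacher" targets (standing in for DPAv3/DINOv2 features of multi-step generated videos) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoupler(feats, W):
    # Toy "lightweight decoupler": a linear map from noisy one-step
    # denoising features into the teacher's representation space.
    return feats @ W

def distill_loss(student, teacher):
    # Cosine-distance distillation loss, averaged over feature tokens.
    # The teacher targets are treated as fixed (no gradient flows to them).
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy shapes: 16 spatial tokens, 32-dim one-step features, 24-dim teacher features.
one_step_feats = rng.normal(size=(16, 32))  # noisy features from a single denoising pass
teacher_geo = rng.normal(size=(16, 24))     # stand-in for DPAv3 (geometric) targets
teacher_sem = rng.normal(size=(16, 24))     # stand-in for DINOv2 (semantic) targets

# One decoupler per branch, so geometry and semantics are disentangled.
W_geo = 0.1 * rng.normal(size=(32, 24))
W_sem = 0.1 * rng.normal(size=(32, 24))

loss = (distill_loss(decoupler(one_step_feats, W_geo), teacher_geo)
        + distill_loss(decoupler(one_step_feats, W_sem), teacher_sem))
```

In training, `W_geo` and `W_sem` would be learned networks and the teacher features would be recomputed from the diffusion model's own multi-step rollouts, so the one-step branch is pulled toward the multi-step foresight without paying its inference cost.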
Experiments
Simulation Benchmarks
*Success rate for the i-th task in a sequence (columns 1–5) and average completed length (Avg. Len.).*

| Category | Method | 1 | 2 | 3 | 4 | 5 | Avg. Len. ↑ |
|---|---|---|---|---|---|---|---|
| Direct Action Learning Methods | OpenVLA | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| | CLOVER | 96.0 | 83.5 | 70.8 | 57.5 | 45.4 | 3.53 |
| | π0 | 93.7 | 83.2 | 74.0 | 62.9 | 51.0 | 3.65 |
| | Spatial Forcing | 93.6 | 85.8 | 78.4 | 72.0 | 64.6 | 3.94 |
| Predictive Methods | SuSIE | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| | VPP | 90.9 | 81.5 | 71.3 | 62.0 | 51.8 | 3.58 |
| | Uni-VLA | 95.5 | 85.8 | 74.8 | 66.9 | 56.5 | 3.80 |
| | HiF-VLA | 93.5 | 87.4 | 81.4 | 75.9 | 69.4 | 4.08 |
| | **S-VAM (ours)** | 95.8 | 90.7 | 83.7 | 77.0 | 68.9 | **4.16** |
| Category | Method | Easy (28 tasks) | Middle (11 tasks) | Hard (11 tasks) | Average ↑ (50 tasks) |
|---|---|---|---|---|---|
| Direct Action Learning Methods | RT-1 | 0.605 | 0.042 | 0.015 | 0.346 |
| | Diffusion Policy | 0.442 | 0.062 | 0.095 | 0.279 |
| | Spatial Forcing | 0.737 | 0.436 | 0.451 | 0.609 |
| Predictive Methods | SuSIE | 0.560 | 0.196 | 0.255 | 0.410 |
| | GR-1 | 0.725 | 0.327 | 0.451 | 0.574 |
| | HiF-VLA | 0.729 | 0.364 | 0.404 | 0.577 |
| | VPP | 0.818 | 0.493 | 0.526 | 0.682 |
| | **S-VAM (ours)** | 0.793 | 0.607 | 0.684 | **0.728** |
Qualitative Comparison
Real-World Experiments
BibTeX