Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

ECCV 2026
Xingtong Ge (1, 2), Yi Zhang (2), Yushi Huang (1), Dailan He (2), Xiahong Wang (2), Bingqi Ma (2), Guanglu Song (2), Yu Liu (2), Jun Zhang (1)
(1) Hong Kong University of Science and Technology   (2) Vivix Group Limited
Figure: Qualitative results from Salt for fast video generation

Salt improves few-step video generation quality and temporal stability for both non-autoregressive and autoregressive settings.

Abstract

Distilling video generation models to extremely low inference budgets is crucial for real-time deployment, yet remains difficult. Trajectory-style consistency distillation can become conservative under complex video dynamics, while Distribution Matching Distillation (DMD) produces sharper samples but lacks an explicit constraint on how denoising updates compose across timesteps. Salt addresses this gap with Self-Consistent Distribution Matching Distillation (SC-DMD), which regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, Salt further treats the KV cache as a quality-parameterized condition and introduces Cache-Distribution-Aware training with multi-step rollouts and cache-conditioned feature alignment. Across non-autoregressive and autoregressive backbones, Salt improves low-NFE video generation while adding no inference overhead.

Compositional Regularization

SC-DMD penalizes the semigroup defect between direct and composed denoising updates, improving multi-step rollout coherence.
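
As a minimal illustration of what a semigroup defect measures, the sketch below compares a direct update from time t to s with the composition of two updates, t to u and then u to s, on a toy scalar ODE. The function names (`exact_flow`, `euler_flow`, `semigroup_defect`), the toy dynamics, and the squared-error form are assumptions for illustration; this page does not state the exact form of the SC-DMD penalty.

```python
import math


def exact_flow(x, t, s, a=-1.0):
    """Exact solution map of the toy ODE dx/dt = a * x, taking x at time t to time s."""
    return x * math.exp(a * (s - t))


def euler_flow(x, t, s, a=-1.0):
    """Single Euler step for the same toy ODE; only an approximation of the flow."""
    return x * (1.0 + a * (s - t))


def semigroup_defect(flow, x, t, u, s):
    """Squared gap between the direct update t -> s and the composed update t -> u -> s."""
    direct = flow(x, t, s)
    composed = flow(flow(x, t, u), u, s)
    return (direct - composed) ** 2


if __name__ == "__main__":
    # Hypothetical timesteps: start at t = 1.0, pass through u = 0.5, end at s = 0.0.
    print("exact flow defect:", semigroup_defect(exact_flow, 2.0, 1.0, 0.5, 0.0))
    print("euler flow defect:", semigroup_defect(euler_flow, 2.0, 1.0, 0.5, 0.0))
```

The exact flow composes with a defect of essentially zero, while the one-step Euler approximation does not. A regularizer of this kind would push a few-step generator toward the former behavior.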

Distribution Matching

The method keeps DMD as the primary alignment signal, preserving sharpness and avoiding over-smoothed trajectory averages.

Cache-Aware Training

Mixed-step autoregressive rollouts expose the model to varying cache quality and transfer high-step features to low-step outputs.
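
A rough sketch of how a mixed-step rollout might be scheduled: each autoregressive chunk draws its own denoising step count, so later chunks are conditioned on cache entries of varying quality. The helper names, the step choices `(2, 4)`, and the uniform sampling are hypothetical; the page does not specify how Salt samples step counts.

```python
import random


def sample_mixed_step_schedule(num_chunks, step_choices=(2, 4), seed=None):
    """Draw one denoising step count per autoregressive chunk (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    return [rng.choice(step_choices) for _ in range(num_chunks)]


def cache_contexts(schedule):
    """For each chunk, list the step counts that produced the cache entries it conditions on."""
    cache = []
    contexts = []
    for steps in schedule:
        contexts.append(tuple(cache))  # what this chunk sees in the KV cache
        cache.append(steps)            # this chunk's output then joins the cache
    return contexts


if __name__ == "__main__":
    schedule = sample_mixed_step_schedule(num_chunks=4, seed=0)
    for chunk_idx, (steps, ctx) in enumerate(zip(schedule, cache_contexts(schedule))):
        print(f"chunk {chunk_idx}: {steps} steps, cache built with steps {ctx}")
```

Here the step count stands in as a proxy for cache quality: a chunk conditioned on entries produced with fewer steps sees a noisier context than one conditioned on higher-step entries.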

Method Overview

Salt combines SC-DMD for compositional few-step denoising with cache-aware mixed-step autoregressive training.

SC-DMD: Semigroup-Defect Regularization

Figure: Conceptual comparison between consistency distillation, shortcut models, DMD, and SC-DMD

SC-DMD keeps distribution matching as the primary objective while adding a self-consistency regularizer on composed denoising updates.
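
One way to write this combination, offered as a sketch and not as the paper's exact formulation: let $f_\theta(x_t, t, s)$ denote the generator's denoising update from timestep $t$ to timestep $s$, and let $u$ be an intermediate timestep with $t > u > s$. The squared-norm distance, the symbol names, and the omission of details such as stop-gradients are assumptions. Only the weight $\lambda_{\mathrm{SC}}$ is named on this page (as lambda_SC in the ablations).

```latex
\mathcal{L}_{\mathrm{SC}}(\theta)
  = \mathbb{E}_{x_t,\; t > u > s}
    \left\| f_\theta(x_t, t, s)
          - f_\theta\big(f_\theta(x_t, t, u),\, u,\, s\big) \right\|_2^2,
\qquad
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{DMD}}(\theta)
  + \lambda_{\mathrm{SC}}\, \mathcal{L}_{\mathrm{SC}}(\theta).
```

Under this reading, $\mathcal{L}_{\mathrm{DMD}}$ remains the primary signal and $\lambda_{\mathrm{SC}}$ controls how strongly composed updates are pulled toward the direct update.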

Salt: Cache-Aware Mixed-Step Training

Figure: Overview of Salt cache-aware mixed-step autoregressive training

Salt samples few-step trajectories, applies SC-DMD over denoising chunks, and aligns low-quality cache rollouts with high-quality references.
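
The alignment step can be pictured as a distance between features from a low-step rollout and features from a high-step reference. The sketch below uses a plain mean squared error over flat feature lists. The function name, the MSE choice, and the example values are hypothetical, and in a real training loop the reference features would typically be held fixed (no gradient), which is also an assumption here.

```python
def feature_alignment_loss(student_feats, reference_feats):
    """Mean squared error between low-step (student) and high-step (reference) features."""
    if len(student_feats) != len(reference_feats):
        raise ValueError("feature vectors must have the same length")
    squared_gaps = [(s - r) ** 2 for s, r in zip(student_feats, reference_feats)]
    return sum(squared_gaps) / len(squared_gaps)


if __name__ == "__main__":
    # Made-up feature values, purely for illustration.
    low_step_feats = [0.2, 0.9, -0.4]    # produced under a low-quality cache
    high_step_feats = [0.0, 1.0, -0.5]   # produced under a high-quality cache
    print("alignment loss:", feature_alignment_loss(low_step_feats, high_step_feats))
```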

Video Comparisons

Qualitative comparisons between the Causal Forcing baseline and Salt under the same prompts.

Case 1: A couple in formal evening wear caught in a heavy downpour. [Videos: Causal Forcing | Salt]
Case 2: A stage performance with a trumpet under dramatic lighting. [Videos: Causal Forcing | Salt]
Case 3: A watercolor-style reading scene with consistent identity. [Videos: Causal Forcing | Salt]
Case 4: Macro fruit rotation with water drops and stable object layout. [Videos: Causal Forcing | Salt]

Quantitative Results

Salt improves few-step video generation across image-to-video, short autoregressive, and long-horizon settings.

93.90: Best I2V Score on Wan 2.1 I2V 14B, 4 NFE
85.08: VBench Total for Ours - Causal Forcing, 4 NFE
79.27: VBench-Long Total for Ours - LongLive, 30-second rollouts

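Table 1. VBench-I2V Results on Wan 2.1 14B

| Group | Method | NFE | I2V Score | Quality | Background Consistency | Motion Smoothness | Dynamic Degree | Imaging Quality | Temporal Flicker |
|---|---|---|---|---|---|---|---|---|---|
| Consistency distillation | PCM | 8 | 93.63 | 78.52 | 97.34 | 98.24 | 30.98 | 70.42 | 97.67 |
| Distribution matching | DMD | 4 | 93.09 | 78.89 | 92.79 | 97.99 | 58.46 | 70.35 | 95.21 |
| Distribution matching | LightX2V | 4 | 93.50 | 80.92 | 95.87 | 97.89 | 60.33 | 71.67 | 96.30 |
| Distribution matching | Ours | 4 | 93.90 | 80.86 | 95.97 | 98.37 | 52.85 | 72.16 | 97.41 |
| Distribution matching | Ours-alpha | 4 | 93.88 | 81.71 | 95.46 | 98.30 | 68.13 | 72.08 | 96.48 |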

Higher is better. Bold indicates best and underline indicates second best within the table.

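Table 2. VBench Results for 5-Second Video Generation

| Type | Model | #Params | Resolution | NFE | Total | Quality | Semantic |
|---|---|---|---|---|---|---|---|
| Diffusion | rCM | 1.3B | 832 x 480 | 4 | 82.73 | 83.65 | 79.04 |
| Diffusion | DMD | 1.3B | 832 x 480 | 4 | 82.78 | 84.39 | 76.36 |
| Diffusion | DMD | 1.3B | 832 x 480 | 2 | 82.41 | 83.49 | 78.06 |
| Diffusion | Ours (SC-DMD) | 1.3B | 832 x 480 | 4 | 83.19 | 84.42 | 78.30 |
| Diffusion | Ours (SC-DMD) | 1.3B | 832 x 480 | 2 | 82.85 | 84.06 | 78.01 |
| Autoregressive | CausVid | 1.3B | 832 x 480 | 4 | 81.20 | 84.05 | 69.80 |
| Autoregressive | Self Forcing | 1.3B | 832 x 480 | 4 | 84.20 | 84.74 | 82.05 |
| Autoregressive | Ours - Self Forcing | 1.3B | 832 x 480 | 4 | 84.47 | 85.27 | 81.28 |
| Autoregressive | LongLive | 1.3B | 832 x 480 | 4 | 84.40 | 85.12 | 81.53 |
| Autoregressive | Ours - LongLive | 1.3B | 832 x 480 | 4 | 84.93 | 85.41 | 83.00 |
| Autoregressive | Causal Forcing | 1.3B | 832 x 480 | 4 | 84.62 | 85.41 | 81.47 |
| Autoregressive | Ours - Causal Forcing | 1.3B | 832 x 480 | 4 | 85.08 | 85.96 | 81.59 |
| Autoregressive | Ours - Causal Forcing | 1.3B | 832 x 480 | 2 | 84.80 | 85.63 | 81.49 |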
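
As a reading aid for the Total column, the sketch below recomputes it from Quality and Semantic using a 4:1 weighted average, which is the weighting VBench conventionally uses for its total score. Treat the weights as an assumption here, since the page does not state them. The rows are copied from Table 2, and the helper names are hypothetical.

```python
QUALITY_WEIGHT = 4.0   # assumed VBench convention, not stated on this page
SEMANTIC_WEIGHT = 1.0  # assumed VBench convention, not stated on this page


def vbench_total(quality, semantic):
    """Weighted average of the Quality and Semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


# (model, NFE, reported Total, Quality, Semantic), copied from Table 2.
ROWS = [
    ("Causal Forcing", 4, 84.62, 85.41, 81.47),
    ("Ours - Causal Forcing", 4, 85.08, 85.96, 81.59),
    ("Ours - Causal Forcing", 2, 84.80, 85.63, 81.49),
]


def max_abs_gap(rows):
    """Largest difference between a reported Total and the recomputed one."""
    return max(abs(vbench_total(q, s) - total) for _, _, total, q, s in rows)


def total_gain(rows, baseline_idx, ours_idx):
    """Difference in reported Total between two rows, rounded to two decimals."""
    return round(rows[ours_idx][2] - rows[baseline_idx][2], 2)


if __name__ == "__main__":
    for name, nfe, total, quality, semantic in ROWS:
        print(f"{name} ({nfe} NFE): reported {total}, recomputed {vbench_total(quality, semantic):.3f}")
    print("gain over baseline at 4 NFE:", total_gain(ROWS, 0, 1))
```

For the rows listed, the recomputed totals agree with the reported ones to within rounding, and the 4-NFE gain of Ours - Causal Forcing over Causal Forcing comes almost entirely from the Quality score.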
Table 3. VBench-Long Results for 30-Second Autoregressive Generation

| Family | Model | Total | Quality | Semantic | Subject Consistency | Temporal Flicker | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality |
|---|---|---|---|---|---|---|---|---|---|---|
| Causal Forcing | Causal Forcing | 78.11 | 82.57 | 60.25 | 96.29 | 95.47 | 97.67 | 76.95 | 56.50 | 71.12 |
| Causal Forcing | Ours - Causal Forcing | 78.28 | 82.15 | 62.77 | 97.16 | 96.95 | 98.32 | 54.02 | 57.42 | 70.99 |
| LongLive | LongLive | 79.03 | 82.82 | 63.88 | 97.36 | 96.62 | 98.16 | 63.45 | 58.26 | 70.90 |
| LongLive | Ours - LongLive | 79.27 | 82.90 | 64.74 | 97.28 | 96.92 | 98.29 | 63.68 | 58.35 | 70.07 |

Ablation Study

Ablations isolate the effect of self-consistency regularization and cache-aware autoregressive training.

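SC-DMD Ablation on Wan 2.1 T2V

| Method | Quality | Semantic | Total | Spatial Relationship | Multiple Objects | Object Class | Imaging Quality |
|---|---|---|---|---|---|---|---|
| DMD-8 | 84.05 | 76.50 | 82.54 | 67.13 | 78.98 | 93.47 | 65.46 |
| DMD-4 | 84.39 | 76.36 | 82.78 | 69.49 | 79.37 | 93.23 | 64.73 |
| SC-DMD (Ours) | 84.42 | 78.30 | 83.19 | 72.32 | 85.72 | 95.16 | 66.40 |
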
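Rebuttal Ablation: SC-DMD vs. Shortcut-Style Alternatives

| Group | Method | Quality | Semantic | Total |
|---|---|---|---|---|
| SC vs. SC-DMD at 4 NFEs | SCFM | 79.47 | 70.63 | 77.70 |
| SC vs. SC-DMD at 4 NFEs | Naive DMD + SCFM | 77.17 | 66.64 | 75.06 |
| SC vs. SC-DMD at 4 NFEs | DMD-4 | 84.39 | 76.36 | 82.78 |
| SC vs. SC-DMD at 4 NFEs | SC-DMD | 84.76 | 77.77 | 83.36 |
| Sensitivity to lambda_SC | lambda_SC = 0.4 | 84.40 | 78.11 | 83.27 |
| Sensitivity to lambda_SC | lambda_SC = 1.0 | 84.21 | 78.29 | 83.02 |
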
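Autoregressive Training Component Ablation on Causal Forcing

| Group | Method | NFE | Quality | Semantic | Total |
|---|---|---|---|---|---|
| 4-NFE main ablation | Causal Forcing official baseline | 4 | 85.41 | 81.47 | 84.62 |
| 4-NFE main ablation | CF + naive L_SC | 4 | 84.35 | 81.77 | 83.83 |
| 4-NFE main ablation | CF + mixed-step + L_SC | 4 | 85.91 | 81.48 | 85.02 |
| 4-NFE main ablation | Ours (full) | 4 | 85.96 | 81.59 | 85.08 |
| 2-NFE low-budget transfer | CF + mixed-step + L_SC | 2 | 85.62 | 80.65 | 84.63 |
| 2-NFE low-budget transfer | Ours (full) | 2 | 85.63 | 81.49 | 84.80 |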

Self-Consistency Across Euler Steps

SC-DMD keeps the generated structure more stable than DMD as the number of Euler sampling steps changes from 2 to 4 to 8.

Euler 2-Step: DMD vs SC-DMD under the same prompt and sampler. [Panels: DMD | SC-DMD]
Euler 4-Step: Compositional regularization reduces structure drift in few-step rollouts. [Panels: DMD | SC-DMD]
Euler 8-Step: SC-DMD maintains cleaner visual details under longer composed denoising trajectories. [Panels: DMD | SC-DMD]
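
To make the step-count comparison concrete, here is a minimal Euler sampler sketch on a toy scalar problem. The velocity field `toy_velocity`, the time convention (integrating from t = 1 down to t = 0), and all names are assumptions for illustration, not Salt's sampler.

```python
def toy_velocity(x, t):
    """Hypothetical velocity field dx/dt = x; a trained model would go here."""
    return x


def euler_sample(velocity, x, num_steps):
    """Integrate from t = 1 down to t = 0 with `num_steps` equal Euler steps."""
    dt = 1.0 / num_steps
    t = 1.0
    for _ in range(num_steps):
        x = x - dt * velocity(x, t)  # step backward in time by dt
        t -= dt
    return x


if __name__ == "__main__":
    for num_steps in (2, 4, 8):
        print(f"{num_steps} steps -> {euler_sample(toy_velocity, 1.0, num_steps):.6f}")
```

On this toy field the endpoint shifts as the step count changes, so 2, 4, and 8 steps each land on a different value. This is the kind of step-dependent drift that the comparison above visualizes.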

BibTeX

@article{ge2026salt,
  title={Salt: Self-consistent distribution matching with cache-aware training for fast video generation},
  author={Ge, Xingtong and Zhang, Yi and Huang, Yushi and He, Dailan and Wang, Xiahong and Ma, Bingqi and Song, Guanglu and Liu, Yu and Zhang, Jun},
  journal={arXiv preprint arXiv:2604.03118},
  year={2026}
}