Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Qualitative results from Salt for fast video generation

Salt improves few-step video generation quality and temporal stability in both non-autoregressive and autoregressive settings.

Abstract

Distilling video generation models to extremely low inference budgets is crucial for real-time deployment, yet remains difficult. Trajectory-style consistency distillation can become conservative under complex video dynamics, while Distribution Matching Distillation (DMD) produces sharper samples but lacks an explicit constraint on how denoising updates compose across timesteps. Salt addresses this gap with Self-Consistent Distribution Matching Distillation (SC-DMD), which regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, Salt further treats the KV cache as a quality-parameterized condition and introduces Cache-Distribution-Aware training with multi-step rollouts and cache-conditioned feature alignment. Across non-autoregressive and autoregressive backbones, Salt improves video generation at a low number of function evaluations (NFE) while adding no inference overhead.

Compositional Regularization

SC-DMD penalizes the semigroup defect between direct and composed denoising updates, improving multi-step rollout coherence.

Distribution Matching

The method keeps DMD as the primary alignment signal, preserving sharpness and avoiding over-smoothed trajectory averages.

Cache-Aware Training

Mixed-step autoregressive rollouts expose the model to varying cache quality and transfer high-step features to low-step outputs.
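
The semigroup defect named under Compositional Regularization can be illustrated on a toy problem. The sketch below is not Salt's training code: it assumes a scalar ODE, and the names `exact_update`, `euler_update`, and `semigroup_defect` are hypothetical. It only shows what "direct versus composed update" means: an exact flow map composes perfectly, while a one-step approximation does not.

```python
import math

def exact_update(x, t, s, a=-1.0):
    """Exact flow of the toy ODE dx/dt = a*x from time t to time s."""
    return x * math.exp(a * (s - t))

def euler_update(x, t, s, a=-1.0):
    """Single explicit Euler step from t to s; only approximates the flow."""
    return x + (s - t) * a * x

def semigroup_defect(update, x, t, u, s):
    """Squared gap between the direct jump t->s and the composed jump t->u->s."""
    direct = update(x, t, s)
    composed = update(update(x, t, u), u, s)
    return (direct - composed) ** 2

# Assumed toy values: start at x0, jump from t=1 to s=0 via the midpoint u=0.5.
x0, t, u, s = 2.0, 1.0, 0.5, 0.0
defect_exact = semigroup_defect(exact_update, x0, t, u, s)
defect_euler = semigroup_defect(euler_update, x0, t, u, s)
print(defect_exact, defect_euler)
```

With these values the exact flow has a defect at the level of floating-point rounding, while the single Euler step has a defect of 0.25. A regularizer of this kind pushes a few-step generator toward the first behavior.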

Method Overview

Salt combines SC-DMD for compositional few-step denoising with cache-aware mixed-step autoregressive training.

SC-DMD: Semigroup-Defect Regularization

Conceptual comparison between consistency distillation, shortcut models, DMD, and SC-DMD

SC-DMD keeps distribution matching as the primary objective while adding a self-consistency regularizer on composed denoising updates.
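
One plausible way to write this objective is sketched below. The notation is an assumption made for illustration: $\Phi_\theta(x, t, s)$ denotes the student's denoising update from timestep $t$ to $s$, and $\lambda_{\mathrm{SC}}$ is the weight that appears as lambda_SC in the ablations. The norm, the timestep sampling, and any stop-gradient placement used in the paper are not given on this page and may differ.

```latex
\mathcal{L}_{\mathrm{SC}}(\theta)
  = \mathbb{E}_{x_t,\; t > u > s}
    \left\| \Phi_\theta(x_t, t, s)
          - \Phi_\theta\big(\Phi_\theta(x_t, t, u),\, u,\, s\big) \right\|_2^2,
\qquad
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{DMD}}(\theta)
  + \lambda_{\mathrm{SC}}\, \mathcal{L}_{\mathrm{SC}}(\theta).
```

The first term measures the gap between the direct update and the composed update; the second line keeps the DMD loss as the primary signal and adds the defect as a weighted regularizer.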

Salt: Cache-Aware Mixed-Step Training

Overview of Salt cache-aware mixed-step autoregressive training

Salt samples few-step trajectories, applies SC-DMD over denoising chunks, and aligns low-quality cache rollouts with high-quality references.
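
The control flow of a mixed-step rollout with an alignment target can be sketched with a toy scalar model. Everything here is a hypothetical stand-in: `STEP_CHOICES`, `denoise_chunk`, `rollout`, and `alignment_loss` are illustrative names, the "cache" is a list of scalars rather than a KV cache, and the actual Salt losses operate on network features. The sketch only shows the pattern of sampling a step budget per chunk, tagging each cache entry with that budget, and comparing against a full-budget reference rollout that uses the same noise.

```python
import random

STEP_CHOICES = (1, 2, 4)  # hypothetical per-chunk step budgets

def denoise_chunk(noise, cache, num_steps):
    """Toy stand-in for a few-step denoiser conditioned on cached chunks.

    The target is the mean of the cached values (0.0 for the first chunk);
    each step moves the sample halfway to it, so more steps end closer.
    """
    target = sum(c["value"] for c in cache) / len(cache) if cache else 0.0
    x = noise
    for _ in range(num_steps):
        x += 0.5 * (target - x)
    return x

def rollout(noises, step_schedule):
    """Generate chunks one after another, each conditioned on the cache so far."""
    cache = []
    for noise, num_steps in zip(noises, step_schedule):
        value = denoise_chunk(noise, cache, num_steps)
        # Keep the step count next to the entry as a cache-quality tag.
        cache.append({"value": value, "steps": num_steps})
    return cache

def alignment_loss(low_values, high_values):
    """Mean squared gap between low-budget outputs and fixed references."""
    pairs = list(zip(low_values, high_values))
    return sum((lo - hi) ** 2 for lo, hi in pairs) / len(pairs)

rng = random.Random(0)
noises = [rng.gauss(0.0, 1.0) for _ in range(6)]
mixed_schedule = [rng.choice(STEP_CHOICES) for _ in noises]
reference_schedule = [max(STEP_CHOICES)] * len(noises)

mixed = rollout(noises, mixed_schedule)
reference = rollout(noises, reference_schedule)
loss = alignment_loss([c["value"] for c in mixed],
                      [c["value"] for c in reference])
print(mixed_schedule, loss)
```

Because later chunks depend on earlier cache entries, a low-budget chunk early in the rollout changes everything generated after it, which is the exposure to varying cache quality that the mixed-step schedule is meant to provide.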

Video Comparisons

Qualitative comparisons between the Causal Forcing baseline and Salt under the same prompts.

Case 1

A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, featuring a steady and smooth perspective.

Causal Forcing

Salt

Case 2

A raccoon dressed in suit playing the trumpet, stage background.

Causal Forcing

Salt

Case 3

Gwen Stacy reading a book, watercolor painting.

Causal Forcing

Salt

Case 4

Few big purple plums rotating on the turntable. water drops appear on the skin during rotation. isolated on the white background. close-up. macro

Causal Forcing

Salt

Quantitative Results

Salt improves few-step video generation across image-to-video, short autoregressive, and long-horizon settings.

93.90: Best I2V Score on Wan 2.1 I2V 14B, 4 NFE

85.08: VBench Total for Ours - Causal Forcing, 4 NFE

79.27: VBench-Long Total for Ours - LongLive, 30-second rollouts

Table 1. VBench-I2V Results on Wan 2.1 14B

Method	NFE	I2V Score	Quality	Background Consistency	Motion Smoothness	Dynamic Degree	Imaging Quality	Temporal Flicker
Consistency distillation
PCM	8	93.63	78.52	97.34	98.24	30.98	70.42	97.67
Distribution matching
DMD	4	93.09	78.89	92.79	97.99	58.46	70.35	95.21
LightX2V	4	93.50	80.92	95.87	97.89	60.33	71.67	96.30
Ours	4	93.90	80.86	95.97	98.37	52.85	72.16	97.41
Ours-alpha	4	93.88	81.71	95.46	98.30	68.13	72.08	96.48

Higher is better. Bold indicates best and underline indicates second best within the table.

Table 2. VBench Results for 5-Second Video Generation

Model	#Params	Resolution	NFE	Total	Quality	Semantic
Diffusion models
rCM	1.3B	832 x 480	4	82.73	83.65	79.04
DMD	1.3B	832 x 480	4	82.78	84.39	76.36
DMD	1.3B	832 x 480	2	82.41	83.49	78.06
Ours (SC-DMD)	1.3B	832 x 480	4	83.19	84.42	78.30
Ours (SC-DMD)	1.3B	832 x 480	2	82.85	84.06	78.01
Autoregressive models
CausVid	1.3B	832 x 480	4	81.20	84.05	69.80
Self Forcing	1.3B	832 x 480	4	84.20	84.74	82.05
Ours - Self Forcing	1.3B	832 x 480	4	84.47	85.27	81.28
LongLive	1.3B	832 x 480	4	84.40	85.12	81.53
Ours - LongLive	1.3B	832 x 480	4	84.93	85.41	83.00
Causal Forcing	1.3B	832 x 480	4	84.62	85.41	81.47
Ours - Causal Forcing	1.3B	832 x 480	4	85.08	85.96	81.59
Ours - Causal Forcing	1.3B	832 x 480	2	84.80	85.63	81.49
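
The gains over each autoregressive baseline in Table 2 can be read off with a few lines of arithmetic. The snippet below copies the 4-NFE Total scores from the table; the helper name `total_gain` is only for illustration.

```python
# (baseline Total, Salt Total) pairs copied from the 4-NFE rows of Table 2.
table2_totals = {
    "Self Forcing": (84.20, 84.47),
    "LongLive": (84.40, 84.93),
    "Causal Forcing": (84.62, 85.08),
}

def total_gain(baseline, ours):
    """Absolute VBench Total gain, rounded to the table's two decimals."""
    return round(ours - baseline, 2)

gains = {name: total_gain(*pair) for name, pair in table2_totals.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The resulting gains are +0.27 (Self Forcing), +0.53 (LongLive), and +0.46 (Causal Forcing). The same table also shows the 2-NFE Salt variant of Causal Forcing (84.80) scoring above the 4-NFE baseline (84.62).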

Table 3. VBench-Long Results for 30-Second Autoregressive Generation

Model	Total	Quality	Semantic	Subject Consistency	Temporal Flicker	Motion Smoothness	Dynamic Degree	Aesthetic Quality	Imaging Quality
Causal Forcing family
Causal Forcing	78.11	82.57	60.25	96.29	95.47	97.67	76.95	56.50	71.12
Ours - Causal Forcing	78.28	82.15	62.77	97.16	96.95	98.32	54.02	57.42	70.99
LongLive family
LongLive	79.03	82.82	63.88	97.36	96.62	98.16	63.45	58.26	70.90
Ours - LongLive	79.27	82.90	64.74	97.28	96.92	98.29	63.68	58.35	70.07

Ablation Study

Ablations isolate the effect of self-consistency regularization and cache-aware autoregressive training.

SC-DMD Ablation on Wan 2.1 T2V

Method	Quality	Semantic	Total	Spatial Relationship	Multiple Objects	Object Class	Imaging Quality
DMD-8	84.05	76.50	82.54	67.13	78.98	93.47	65.46
DMD-4	84.39	76.36	82.78	69.49	79.37	93.23	64.73
SC-DMD (Ours)	84.42	78.30	83.19	72.32	85.72	95.16	66.40

Rebuttal Ablation: SC-DMD vs. Shortcut-Style Alternatives

Method	Quality	Semantic	Total
SC vs. SC-DMD at 4 NFEs
SCFM	79.47	70.63	77.70
Naive DMD + SCFM	77.17	66.64	75.06
DMD-4	84.39	76.36	82.78
SC-DMD	84.76	77.77	83.36
Sensitivity to lambda_SC
lambda_SC = 0.4	84.40	78.11	83.27
lambda_SC = 1.0	84.21	78.29	83.02

Autoregressive Training Component Ablation on Causal Forcing

Method	NFE	Quality	Semantic	Total
4-NFE main ablation
Causal Forcing official baseline	4	85.41	81.47	84.62
CF + naive L_SC	4	84.35	81.77	83.83
CF + mixed-step + L_SC	4	85.91	81.48	85.02
Ours (full)	4	85.96	81.59	85.08
2-NFE low-budget transfer
CF + mixed-step + L_SC	2	85.62	80.65	84.63
Ours (full)	2	85.63	81.49	84.80

Self-Consistency Across Euler Steps

SC-DMD keeps the generated structure more stable when Euler sampling is run with 2, 4, or 8 denoising steps.
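
Why the Euler step count can change the result at all is easiest to see on a toy example. The sketch below uses scalar velocity fields rather than a video model, and all names (`euler_sample`, `straight_field`, `curved_field`, `step_count_drift`) are hypothetical. It measures how far the Euler endpoint moves as the step count changes, which is the kind of drift the comparisons in this section show visually.

```python
def euler_sample(velocity, x_start, num_steps, t_start=1.0, t_end=0.0):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with uniform Euler steps."""
    x = x_start
    dt = (t_end - t_start) / num_steps
    for i in range(num_steps):
        t = t_start + i * dt
        x = x + dt * velocity(x, t)
    return x

def straight_field(x, t):
    """Constant velocity: trajectories are straight, so every step count agrees."""
    return 1.5

def curved_field(x, t):
    """State-dependent velocity: trajectories bend, so the endpoint depends on step count."""
    return x

def step_count_drift(velocity, x_start, step_counts=(2, 4, 8)):
    """Spread (max - min) of Euler endpoints across the given step counts."""
    endpoints = [euler_sample(velocity, x_start, n) for n in step_counts]
    return max(endpoints) - min(endpoints)

drift_straight = step_count_drift(straight_field, x_start=2.0)
drift_curved = step_count_drift(curved_field, x_start=2.0)
print(drift_straight, drift_curved)
```

In this toy setting the straight field gives the same endpoint for 2, 4, and 8 steps, while the curved field's endpoint shifts with the step count. A generator whose composed updates agree with its direct updates behaves more like the first case, though the analogy to SC-DMD is only illustrative.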

Euler 2-Step

DMD vs SC-DMD under the same prompt and sampler.

DMD

SC-DMD

Euler 4-Step

Compositional regularization reduces structure drift in few-step rollouts.

DMD

SC-DMD

Euler 8-Step

SC-DMD maintains cleaner visual details under longer composed denoising trajectories.

DMD

SC-DMD

BibTeX

@article{ge2026salt,
  title={Salt: Self-consistent distribution matching with cache-aware training for fast video generation},
  author={Ge, Xingtong and Zhang, Yi and Huang, Yushi and He, Dailan and Wang, Xiahong and Ma, Bingqi and Song, Guanglu and Liu, Yu and Zhang, Jun},
  journal={arXiv preprint arXiv:2604.03118},
  year={2026}
}