SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

ICLR 2026
Xingtong Ge1,2, Xin Zhang3, Tongda Xu4, Yi Zhang3, Xinjie Zhang1, Yan Wang4, Jun Zhang1
1HKUST   2SenseTime Research   3Vivix AI   4Tsinghua AIR
1024 × 1024 samples from our 4-step generator distilled on FLUX.1 dev.

Abstract

Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models such as SD 3.5 and FLUX. In this paper, we first analyze the issues that arise when applying vanilla DMD to large-scale models. Then, to overcome the scalability challenge, we propose Implicit Distribution Alignment (IDA) to regularize the distance between the generator and the fake distribution. Furthermore, we propose Intra-Segment Guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; with both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Together with other improvements such as scaled-up discriminator models, our final model, dubbed SenseFlow, achieves superior distillation performance for both diffusion-based (SDXL) and flow-matching models (SD 3.5 Large and FLUX).

Key Contributions

Implicit Distribution Alignment (IDA)

Softly aligns fake-model parameters toward the generator so that the fake distribution stays close to the generator distribution. This eliminates the training oscillation and divergence that plague vanilla DMD on 8B–12B scale models, enabling stable convergence even with efficient TTUR schedules.
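One plausible reading of this soft alignment is a periodic interpolation of the fake model's weights toward the generator's, similar in spirit to an EMA update. The sketch below illustrates that idea with plain numpy arrays standing in for parameter tensors; the function name and `align_rate` strength are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def implicit_distribution_alignment(fake_params, gen_params, align_rate=0.1):
    """Softly pull fake-model parameters toward the generator's.

    NOTE: `align_rate` is a hypothetical knob for the interpolation
    strength; the actual IDA rule/schedule may differ.
    """
    return [(1.0 - align_rate) * f + align_rate * g
            for f, g in zip(fake_params, gen_params)]

# Toy example: two "parameter tensors" per model.
fake = [np.zeros(3), np.ones(2)]
gen = [np.ones(3), np.zeros(2)]
aligned = implicit_distribution_alignment(fake, gen, align_rate=0.1)
```

Applied after each fake-model update, this keeps the fake distribution from drifting far from the generator distribution, which is what stabilizes the DMD objective at scale.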

Intra-Segment Guidance (ISG)

Samples intermediate timesteps within each coarse segment, lets the teacher denoise there, then has the generator match that local trajectory. This compresses the teacher's fine-grained denoising behavior into a few anchor steps, resulting in faster early convergence and a higher final quality ceiling.
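The core mechanic can be sketched as drawing a timestep strictly inside one coarse segment of the few-step schedule, where the teacher is then queried. Below is a minimal sketch assuming a hypothetical 4-step flow schedule over [0, 1] and a uniform draw inside the segment; the paper's actual anchor values and sampling rule may differ.

```python
import random

def sample_intra_segment_timestep(anchors, segment_idx, rng=random):
    """Draw an intermediate timestep inside one coarse segment.

    `anchors` are the few generator timesteps (high noise -> low noise).
    The uniform draw is an illustrative assumption.
    """
    hi, lo = anchors[segment_idx], anchors[segment_idx + 1]
    return lo + rng.random() * (hi - lo)

anchors = [1.0, 0.75, 0.5, 0.25, 0.0]  # hypothetical 4-step schedule
t = sample_intra_segment_timestep(anchors, segment_idx=1)  # t in [0.5, 0.75)
```

At the sampled `t`, the teacher denoises the intermediate state, and the generator is trained to match that local trajectory, compressing fine-grained teacher behavior into the anchor steps.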

Vision Foundation Model (VFM) Discriminator

Uses frozen VFM features (CLIP, DINOv2) with lightweight trainable heads and timestep-aware adversarial weighting to inject stronger semantic and aesthetic supervision. This brings consistent gains on human-preference metrics (HPSv2, PickScore, ImageReward) across all three teacher backbones.
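The timestep-aware weighting can be illustrated as down-weighting the adversarial signal at very noisy timesteps, where frozen VFM features are least reliable. The weighting function and `gamma` exponent below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def timestep_adv_weight(t, t_max=1.0, gamma=1.0):
    """Hypothetical timestep-aware adversarial weight: 1 at clean
    timesteps (t=0), decaying to 0 at the noisiest (t=t_max)."""
    return (1.0 - t / t_max) ** gamma

# Weights at a clean, a mid, and a fully-noised timestep.
weights = np.array([timestep_adv_weight(t) for t in (0.0, 0.5, 1.0)])
```

In the full discriminator, these weights would scale the per-sample adversarial loss computed from lightweight heads on top of frozen CLIP/DINOv2 features, so the VFM supervision is concentrated where its semantic judgments are trustworthy.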

Together, these three components let SenseFlow achieve strong 4-step generation across SDXL, SD 3.5 Large, and FLUX.1 dev, often matching or surpassing the teacher on human-preference metrics while being >20× faster.

Visual Comparisons (4-Step Generation)

FLUX comparison

4-step generation comparison on FLUX.1 dev. SenseFlow-FLUX produces richer details, better text rendering, and more coherent compositions.

SD 3.5 Large comparison

Comparison on SD 3.5 Large. SenseFlow-SD3.5 shows superior fine-grained detail and prompt adherence at both 2-step and 4-step settings.

SDXL comparison

Comparison on SDXL. SenseFlow-SDXL delivers stronger visual quality and structure against LCM, DMD2, Hyper-SDXL, and other baselines.

1-Step Generation

1-step generation results

1024 × 1024 samples from SenseFlow with only 1 sampling step.

Poster

SenseFlow ICLR 2026 Poster


BibTeX

@article{ge2025senseflow,
  title={SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation},
  author={Ge, Xingtong and Zhang, Xin and Xu, Tongda and Zhang, Yi and Zhang, Xinjie and Wang, Yan and Zhang, Jun},
  journal={arXiv preprint arXiv:2506.00523},
  year={2025}
}