SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

ICLR 2026
Xingtong Ge1,2, Xin Zhang3, Tongda Xu4, Yi Zhang3, Xinjie Zhang1, Yan Wang4, Jun Zhang1
1HKUST   2SenseTime Research   3Vivix AI   4Tsinghua AIR
1024 × 1024 samples from our 4-step generator distilled on FLUX.1 dev.

Abstract

Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models such as SD 3.5 and FLUX. In this paper, we first analyze the issues that arise when applying vanilla DMD to large-scale models. Then, to overcome the scalability challenge, we propose Implicit Distribution Alignment (IDA) to regularize the distance between the generator and the fake distribution. Furthermore, we propose Intra-Segment Guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; with both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Together with other improvements such as scaled-up discriminator models, our final model, dubbed SenseFlow, achieves superior distillation performance for both diffusion-based (SDXL) and flow-matching models (SD 3.5 Large and FLUX).

Key Contributions

Implicit Distribution Alignment (IDA)

Softly aligns fake-model parameters toward the generator so that the fake distribution stays close to the generator distribution. This eliminates the training oscillation and divergence that plague vanilla DMD on 8B–12B scale models, enabling stable convergence even with efficient TTUR schedules.
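One plausible reading of this soft alignment is a periodic interpolation of the fake model's weights toward the generator's, similar in spirit to an EMA update. The sketch below illustrates that idea with plain numpy arrays standing in for parameter tensors; the function name and `align_rate` strength are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def implicit_distribution_alignment(fake_params, gen_params, align_rate=0.1):
    """Softly pull fake-model parameters toward the generator's.

    NOTE: `align_rate` is a hypothetical knob for the interpolation
    strength; the actual IDA rule/schedule may differ.
    """
    return [(1.0 - align_rate) * f + align_rate * g
            for f, g in zip(fake_params, gen_params)]

# Toy example: two "parameter tensors" per model.
fake = [np.zeros(3), np.ones(2)]
gen = [np.ones(3), np.zeros(2)]
aligned = implicit_distribution_alignment(fake, gen, align_rate=0.1)
```

Applied after each fake-model update, this keeps the fake distribution from drifting far from the generator distribution, which is what stabilizes the DMD objective at scale.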

Intra-Segment Guidance (ISG)

Samples intermediate timesteps within each coarse segment, lets the teacher denoise there, then has the generator match that local trajectory. This compresses the teacher's fine-grained denoising behavior into a few anchor steps, resulting in faster early convergence and a higher final quality ceiling.
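The core mechanic can be sketched as drawing a timestep strictly inside one coarse segment of the few-step schedule, where the teacher is then queried. Below is a minimal sketch assuming a hypothetical 4-step flow schedule over [0, 1] and a uniform draw inside the segment; the paper's actual anchor values and sampling rule may differ.

```python
import random

def sample_intra_segment_timestep(anchors, segment_idx, rng=random):
    """Draw an intermediate timestep inside one coarse segment.

    `anchors` are the few generator timesteps (high noise -> low noise).
    The uniform draw is an illustrative assumption.
    """
    hi, lo = anchors[segment_idx], anchors[segment_idx + 1]
    return lo + rng.random() * (hi - lo)

anchors = [1.0, 0.75, 0.5, 0.25, 0.0]  # hypothetical 4-step schedule
t = sample_intra_segment_timestep(anchors, segment_idx=1)  # t in [0.5, 0.75)
```

At the sampled `t`, the teacher denoises the intermediate state, and the generator is trained to match that local trajectory, compressing fine-grained teacher behavior into the anchor steps.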

Vision Foundation Model (VFM) Discriminator

Uses frozen VFM features (CLIP, DINOv2) with lightweight trainable heads and timestep-aware adversarial weighting to inject stronger semantic and aesthetic supervision. This brings consistent gains on human-preference metrics (HPSv2, PickScore, ImageReward) across all three teacher backbones.
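The timestep-aware weighting can be illustrated as down-weighting the adversarial signal at very noisy timesteps, where frozen VFM features are least reliable. The weighting function and `gamma` exponent below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def timestep_adv_weight(t, t_max=1.0, gamma=1.0):
    """Hypothetical timestep-aware adversarial weight: 1 at clean
    timesteps (t=0), decaying to 0 at the noisiest (t=t_max)."""
    return (1.0 - t / t_max) ** gamma

# Weights at a clean, a mid, and a fully-noised timestep.
weights = np.array([timestep_adv_weight(t) for t in (0.0, 0.5, 1.0)])
```

In the full discriminator, these weights would scale the per-sample adversarial loss computed from lightweight heads on top of frozen CLIP/DINOv2 features, so the VFM supervision is concentrated where its semantic judgments are trustworthy.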

Together, these three components let SenseFlow achieve strong 4-step generation across SDXL, SD 3.5 Large, and FLUX.1 dev, often matching or surpassing the teacher on human-preference metrics while being >20× faster.

Visual Comparisons (4-Step Generation)

FLUX comparison

4-step generation comparison on FLUX.1 dev. SenseFlow-FLUX produces richer details, better text rendering, and more coherent compositions.

SD 3.5 Large comparison

Comparison on SD 3.5 Large. SenseFlow-SD3.5 shows superior fine-grained detail and prompt adherence at both 2-step and 4-step settings.

SDXL comparison

Comparison on SDXL. SenseFlow-SDXL delivers stronger visual quality and structure against LCM, DMD2, Hyper-SDXL, and other baselines.

1-Step Generation

1-step generation results

1024 × 1024 samples from SenseFlow with only 1 sampling step.

Poster

SenseFlow ICLR 2026 Poster


BibTeX

@article{ge2025senseflow,
  title={SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation},
  author={Ge, Xingtong and Zhang, Xin and Xu, Tongda and Zhang, Yi and Zhang, Xinjie and Wang, Yan and Zhang, Jun},
  journal={arXiv preprint arXiv:2506.00523},
  year={2025}
}