Selective Underfitting in Diffusion Models

A new lens for understanding how diffusion models learn, generalize, and scale.

*Equal contribution. ¹MIT ²Harvard

Introduction

Diffusion models have emerged as the principal paradigm for generative modeling across many domains. During training, they learn the score function, which is then used to generate samples at inference. Despite their success, they face an apparent paradox: if training were perfect, they would learn the empirical score function, i.e., the score of the distribution defined by the finite training set, and would simply replicate the training data at inference, never generating novel samples.
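For concreteness, consider a Gaussian forward process that corrupts a training point x_i to α_t x_i + σ_t ε (this parameterization and the notation are assumptions of this sketch, chosen to match common practice, not taken from the page above). The empirical score over a training set {x_1, …, x_n} then has a closed form: a softmax-weighted pull toward the scaled training points, which is why fitting it perfectly would only reproduce the training data.

\[
p_t^{\mathrm{emp}}(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathcal{N}\!\left(x;\; \alpha_t x_i,\; \sigma_t^2 I\right),
\qquad
\nabla_x \log p_t^{\mathrm{emp}}(x) \;=\; \sum_{i=1}^{n} w_i(x)\,\frac{\alpha_t x_i - x}{\sigma_t^2},
\]
\[
w_i(x) \;=\; \frac{\exp\!\big(-\lVert x - \alpha_t x_i \rVert^2 / 2\sigma_t^2\big)}{\sum_{j=1}^{n} \exp\!\big(-\lVert x - \alpha_t x_j \rVert^2 / 2\sigma_t^2\big)}.
\]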

Recent work has sought to resolve this paradox by arguing that diffusion models “globally” underfit the empirical score, attributing this underfitting to the inductive bias or smoothness of neural networks (Kamb & Ganguli 2024; Niedoba et al. 2024; Scarvelis et al. 2023; Pidstrigach 2022; Yoon et al. 2023; Yi et al. 2023). We argue instead that understanding diffusion models requires investigating not just how they underfit the empirical score, but where they underfit. We show that diffusion models do not underfit in a certain region of the space and underfit selectively outside it. This provides a new lens for explaining their behavior.


Selective Underfitting

We first distinguish two regions of the space: the supervision region and the extrapolation region. As models improve (i.e., reach better FID), the error between the learned score and the empirical score decreases in the supervision region (no underfitting; blue line) but increases in the extrapolation region (underfitting; red line). The two regions can be characterized as follows:

  • Supervision region: where noisy training samples concentrate with high probability. The model is directly supervised to learn the empirical score here (blue shells).
  • Extrapolation region: where the model is actually queried during inference (red region).
  • In the supervision region, the learned and empirical scores are close enough that sampling initialized there reproduces the training images.

Therefore, we claim that the right way to understand diffusion models is to see them as operating in two steps: (1) approximating the empirical score within the supervision region, and (2) extrapolating it into the extrapolation region, which is what is ultimately used at inference time.
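As a deliberately toy illustration of this comparison, the sketch below contrasts a stand-in "learned" score with the closed-form empirical score at two kinds of query points: noisy training samples (a proxy for the supervision region) and points away from individual training samples (a crude proxy for the extrapolation region). All names, the stand-in model, and the region proxies are assumptions made for illustration, not the paper's setup.

    import torch

    def empirical_score(x, data, alpha_t, sigma_t):
        # Closed-form score of the Gaussian mixture centered at scaled training points.
        diff = alpha_t * data[None, :, :] - x[:, None, :]        # (B, N, D)
        logw = -(diff ** 2).sum(-1) / (2 * sigma_t ** 2)         # (B, N) unnormalized log-weights
        w = torch.softmax(logw, dim=1)                           # posterior over training points
        return (w[:, :, None] * diff).sum(dim=1) / sigma_t ** 2  # (B, D)

    torch.manual_seed(0)
    data = torch.randn(64, 2)                                    # toy "training set"
    alpha_t, sigma_t = 1.0, 0.5
    # Stand-in for a trained network: the population score of N(0, (1 + sigma_t^2) I).
    learned_score = lambda x: -x / (1.0 + sigma_t ** 2)

    # Supervision-region proxy: noisy versions of the training points themselves.
    x_sup = alpha_t * data + sigma_t * torch.randn_like(data)
    # Extrapolation-region proxy: noise placed around the data mean instead of any sample.
    x_ext = data.mean(dim=0, keepdim=True) + sigma_t * torch.randn(64, 2)

    for name, x in [("supervision", x_sup), ("extrapolation", x_ext)]:
        err = (learned_score(x) - empirical_score(x, data, alpha_t, sigma_t)).norm(dim=-1).mean()
        print(f"{name:>13} region: mean score error = {err.item():.3f}")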


Generalization

  • This perspective is crucial for understanding generalization in diffusion models: the size of the supervision region strongly affects a model's ability to generalize. Enlarging this region degrades generalization, a trend that persists across architectures (e.g., SiT, U-Net); a heuristic sketch of measuring supervision-region membership follows this list.
  • This behavior cannot be explained by previous works, which focus on how inductive bias shapes underfitting of the empirical score. Our perspective adds a complementary dimension: to fully understand diffusion model generalization, one must consider where the model underfits, not only how.
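The test below is one possible operationalization of supervision-region membership, offered only as a hypothetical illustration rather than the paper's definition: a noisy query counts as supervised if it lies within roughly the typical noise radius of some scaled training point, which is where noisy training samples concentrate.

    import torch

    def in_supervision_region(x, data, alpha_t, sigma_t, k=1.5):
        # Heuristic membership test (an assumption of this sketch, not the paper's
        # definition): x is "supervised" if it lies within k times the typical noise
        # radius sigma_t * sqrt(D) of some scaled training point.
        radius = k * sigma_t * data.shape[-1] ** 0.5
        dists = torch.cdist(x, alpha_t * data)      # (B, N) pairwise distances
        return dists.min(dim=1).values <= radius    # (B,) boolean mask

    # Hypothetical usage: fraction of sampler-visited points that fall in the supervision region.
    # coverage = in_supervision_region(x_visited, data, alpha_t, sigma_t).float().mean()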


Scaling Law

Next, we decompose a model's FID into two components:

  • How well it approximates the empirical score in the supervision region; this corresponds to the training loss.
  • How well that approximation translates into generative performance (training loss → FID), i.e., its extrapolation efficiency.

This provides a quantitative and comprehensive lens for understanding why recent diffusion training recipes work well.
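One simple way to summarize extrapolation efficiency from training logs is sketched below; the log-linear fit and the function name are assumptions made for illustration, not the paper's estimator.

    import numpy as np

    def extrapolation_efficiency(train_loss, fid):
        # Summarize one recipe's loss -> FID mapping by fitting log(FID) as an affine
        # function of the training loss (the log-linear form is an assumption of this sketch).
        slope, intercept = np.polyfit(np.asarray(train_loss), np.log(np.asarray(fid)), deg=1)
        return slope, intercept

    # Hypothetical usage: compare two recipes at the same training loss L.
    # s1, b1 = extrapolation_efficiency(loss_baseline, fid_baseline)
    # s2, b2 = extrapolation_efficiency(loss_repa, fid_repa)
    # Recipe 2 extrapolates more efficiently at L if b2 + s2 * L < b1 + s1 * L (lower predicted log-FID).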

  • RePA barely alters the model's behavior in the supervision region but greatly changes its extrapolation behavior, endowing it with better extrapolation efficiency.
  • Transformers (DiT) trail U-Nets in extrapolation efficiency but excel at converting compute into a lower supervision loss (FLOPs → training loss). Combining both factors, transformers achieve higher overall compute efficiency (FLOPs → FID).
  • We further argue that recent diffusion training methods such as RePA and ReDi align the score network's output with a learned representation, improving generative performance by better handling the underfitting (extrapolation) regions; a minimal sketch of such an alignment term follows this list.
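A minimal sketch of a representation-alignment term in this spirit is given below, assuming a frozen pretrained encoder that provides target features and a learnable projection head. The names (proj, target_feats, h_mid) and the 0.5 weighting are hypothetical choices for illustration, not the official RePA or ReDi implementation.

    import torch.nn.functional as F

    def alignment_loss(hidden, target_feats, proj):
        # Project an intermediate activation of the score network and pull it toward
        # frozen features from a pretrained encoder via negative cosine similarity.
        z = proj(hidden)                                              # (B, T, D_enc)
        return -F.cosine_similarity(z, target_feats.detach(), dim=-1).mean()

    # Hypothetical training step: combine with the usual denoising objective.
    # loss = denoising_loss + 0.5 * alignment_loss(h_mid, encoder_features, proj_head)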

Takeaways

  1. Diffusion models do not underfit the empirical score everywhere; they underfit selectively, in distinct regions of data space.
  2. The relative sizes of these regions affect generalization.
  3. Decomposing FID into supervision-region fit and extrapolation efficiency reveals why recipes like RePA and architectures like DiT/SiT work well.

Last but not least, our lens of selective underfitting opens the door to new questions. A principled question is how the score is shaped in extrapolation regions: the mechanism by which a model extrapolates toward a favorable score, one that generates perceptually good samples, remains open. A more practical question is how to design training recipes that better balance supervision and extrapolation.

BibTeX

@article{song2025selective,
  title   = {Selective Underfitting in Diffusion Models},
  author  = {Song, Kiwhan and Kim, Jaeyeon and Chen, Sitan and Du, Yilun and Kakade, Sham and Sitzmann, Vincent},
  journal = {arXiv preprint arXiv:2510.01378},
  year    = {2025}
}