Throughout my first year as a PhD student working on advancing discrete diffusion language models (DLMs), I’ve accumulated plenty of observations, lessons, and insights. Some of them haven’t yet made their way into a paper, being either too speculative or not yet fleshed out enough. Taken together, however, an interesting picture is starting to emerge, one that is worth sharing even if it’s a bit rough around the edges. As such, this post is to be viewed as an opinion piece containing lessons learned, hot takes, and more or less bold predictions about the future of large language models.

# Diffusion LMs scale better

The first, most interesting, and perhaps most speculative prediction is that diffusion LMs will have a lower loss floor than autoregressive models [1]. The claim is big: if true, then the “irreducible loss” of LLMs may not be so irreducible after all, and could in fact be caused by too strong of an inductive bias holding back the model’s ability to fit the data at very large scales. This inductive bias is, of course, the strict left-to-right generation order of current autoregressive LLMs.

At first glance, it sounds surprising—preposterous, even—to suggest that next-token prediction imposes such a strong inductive bias that it limits the model’s scalability. After all, scaling up autoregressive next-token-predictors has been the primary driver behind recent advances in AI. But if you think about it a little deeper, it’s not too hard to come up with problems where a strict left-to-right order actually makes the generative task more difficult than it needs to be: simple arithmetic (addition, multiplication, etc.) is just one example where it’s much easier to predict the tokens in reverse order, right-to-left, than in the canonical order [2]. Sudoku is another example where filling in the left-most digit can be exponentially hard, potentially requiring solving the whole puzzle just to fill in the first digit. As it turns out, diffusion models indeed excel at Sudoku, while autoregressive models struggle [3, 4]. In general, autoregressive models trained with teacher forcing (i.e., all current LLMs) struggle to learn problems where the first step is hard but subsequent steps, conditioned on the first being correct, are easy [5]. One caveat here is that “thinking” models, which are trained with reinforcement learning, can potentially circumvent this problem by generating a solution path via a non-autoregressive chain-of-thought. Still, there ought to be some difference between learning a skill during pre-training and having to learn it during post-training, where the learning process is often constrained to simply sharpening the distribution that was learned during pre-training.

While this motivates the argument for a weaker inductive bias in diffusion LMs, if and to what extent this translates to a lower loss floor is still somewhat speculative. As of today, the evidence is still quite preliminary and based on extrapolating from small(ish)-scale experiments ($\leq\!10^{20}$ FLOPs) to very large compute budgets ($>\!10^{23}$ FLOPs) [1]. Nevertheless, it is certainly an exciting possibility that deserves more attention and investigation.

# Uniform diffusion is the future

Uniform-state diffusion has been around since the advent of discrete diffusion, and for the longest time, reports have been consistent about masked diffusion being strictly superior to uniform diffusion. Compared to masked diffusion, where the model is trained to fill in missing tokens starting from a completely blank sequence, uniform diffusion models are trained to revise tokens starting from a sequence of completely random tokens. The generative process thus becomes one of token replacement, where the model is tasked to replace noisy-looking tokens with more fitting alternatives. Conceptually, this has always had a lot going for it: uniform diffusion models are literally and entirely pre-trained to spot and fix mistakes, a desirable property that is sorely missing from both autoregressive and masked diffusion models. Still, the observed likelihood gap between masked and uniform diffusion was too big to look past, which is why many people went all-in on masked diffusion. However, evidence is starting to accumulate indicating that uniform diffusion models aren’t as bad as we thought: yes, the likelihood gap is real, but it shrinks with scale [1]. More importantly, uniform diffusion models excel at generating high-quality, high-accuracy samples—which is ultimately what we care about—thanks to their self-correction abilities [1, 6].

Going out on a limb, I would go as far as to argue that the skills that uniform diffusion models pick up during pre-training are qualitatively different from those learned by autoregressive models and masked diffusion models. To see why, let’s think about what “skills” are required to do well on various language modeling paradigms. For autoregressive models, it is now well-understood that predicting the next token with high accuracy given all prior tokens is actually richer than it may first appear [7]. It requires not only a good understanding of syntax and semantics, but also extensive factual knowledge, common sense, the ability to reason about the world, and so on. For masked diffusion, the required skills are largely similar, but now with the added difficulty (flexibility) of predicting a missing token given its partially observed past and future. This already results in a more holistic understanding of language and helps with things like the reversal curse [8] and arithmetic. Uniform diffusion, on the other hand, changes the game entirely: the task is no longer to fill in missing tokens, but to spot mistakes, inaccuracies, and inconsistencies in the input, and to find plausible ways to resolve them. This still requires all the knowledge about syntax, semantics, facts, common sense, etc., but it’s fundamentally a different paradigm; one of iterative refinement and improvement. It’s not hard to imagine that a (large) model becoming really good at this would pick up qualitatively different skills compared to existing autoregressive or even masked diffusion LLMs.

Excitingly, compared to masked diffusion, uniform diffusion models are still understood relatively poorly, as most efforts focus on masked diffusion models and do not trivially transfer to uniform diffusion. There may be breakthroughs just waiting to be made—a tantalizing possibility. In the following, I will go into detail on what I think are some promising avenues.

# Beyond naive uniform diffusion

Perhaps the most exciting fact about uniform diffusion models is that they are actually a very crude approximation of the detect-and-revise task. To intuitively understand this, consider replacing some words in a sentence with words chosen uniformly at random from all possible words. More often than not, the randomly chosen word will be obviously wrong and very easy to spot in context (e.g., a rare word that doesn’t fit at all), especially if the number of random words is rather low. In those cases, the random words are just a convoluted way of masking, so we may as well stick to masking. But sometimes, albeit perhaps rarely, the randomly chosen word will not be so obviously wrong and will actually fit in the context. Perhaps it’s even the kind of mistake that a trained model would make at inference time. In those cases, the detect-and-revise task is very meaningful and teaches the model about the subtleties of natural language, including syntax, semantics, correctness, and so on.

To make the training task maximally meaningful, we should therefore aim to create as many of the latter cases as possible and prevent obvious, low-signal replacements if we can help it. If uniform diffusion is uninformed and adds noise in a data-agnostic way, such a semantically meaningful diffusion process needs to be data-informed and add noise in a way that is more likely to (approximately) preserve meaning instead of completely destroying it in a single transition. Such a process will be more complex than the simple masked and uniform diffusion processes, of course. But the huge design space also presents an opportunity for innovation. What follows is a list of ideas that have been floating around in my head, starting with what is likely the simplest possible version of a data-informed diffusion process.

Unigram diffusion. Instead of replacing tokens uniformly at random, we could simply resample them according to their unigram distribution (word frequency). This would make most of the replaced tokens common words that occur naturally more often and are therefore generally more difficult to spot and more likely to “blend in”. Unigram noise would also assign a higher prior likelihood to rare words being correct. While it’s not so clear whether this is a good or a bad thing, it could help “anchor” the generation trajectory early on through a few information-heavy keywords that are unlikely to change throughout the denoising process.
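To make the contrast concrete, here is a minimal sketch (toy vocabulary, all names and numbers made up for illustration) of uniform vs. unigram replacement noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus statistics: token frequencies define the unigram prior.
vocab_size = 8
counts = np.array([50.0, 30.0, 10.0, 4.0, 3.0, 1.0, 1.0, 1.0])
unigram = counts / counts.sum()
uniform = np.full(vocab_size, 1.0 / vocab_size)

def corrupt(tokens, noise_prob, dist, rng):
    """Independently replace each token with probability `noise_prob`,
    drawing the replacement from `dist` (uniform or unigram)."""
    out = np.asarray(tokens).copy()
    hit = rng.random(out.shape) < noise_prob
    out[hit] = rng.choice(len(dist), size=int(hit.sum()), p=dist)
    return out

seq = np.array([0, 1, 0, 2, 1, 0])
noisy_uni = corrupt(seq, 0.5, uniform, rng)  # replacements often rare tokens, easy to spot
noisy_1gm = corrupt(seq, 0.5, unigram, rng)  # replacements skew common, more likely to blend in
```

One could also interpolate between the two distributions over the course of the noising process.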

Beyond unigram: $N$-gram diffusion. To go from unigram statistics to more semantic transitions, one obvious step is to incorporate more local context. If uniform and unigram distributions are, respectively, zeroth-order (no information about the data) and first-order (single-token frequencies, but no context) statistics, then $N$-gram distributions provide access to higher-order semantic information. Formally, the (bidirectional) $N$-gram distribution of a token $x_i$ as a function of its bidirectional context is defined as

$$ p(x_i | x_{i-k:i-1}, x_{i+1:i+k}) = \frac{\mathrm{count}(x_{i-k:i+k})}{\mathrm{count}(x_{i-k:i-1}, x_{i+1:i+k})}, \quad k = \frac{N-1}{2}, $$

where $\mathrm{count}(\cdot)$ denotes the count of the given $N$-gram in the training corpus and $N$ is assumed to be odd. The higher the value of $N$, the more context is taken into account, and the more semantically informed the predictions about $x_i$ become. One can imagine a diffusion process that starts by replacing tokens in a very informed manner (i.e., large $N$) and gradually transitions to lower values of $N$, becoming less and less informed until a simple prior is reached at $N=1$ (unigram) or $N=0$ (uniform).¹
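As a sketch of how such context-conditioned replacement distributions could be estimated, the formula above amounts to two count tables over the corpus (practical details such as smoothing and backoff for unseen contexts are omitted; names are mine):

```python
from collections import Counter

def bidir_ngram(corpus, k):
    """Estimate p(x_i | k tokens of left context, k tokens of right context)
    as count(full window) / count(context), per the formula above."""
    full, ctx = Counter(), Counter()
    for i in range(k, len(corpus) - k):
        left = tuple(corpus[i - k:i])
        right = tuple(corpus[i + 1:i + k + 1])
        full[(left, corpus[i], right)] += 1
        ctx[(left, right)] += 1

    def prob(left, x, right):
        denom = ctx.get((tuple(left), tuple(right)), 0)
        return full.get((tuple(left), x, tuple(right)), 0) / denom if denom else 0.0

    return prob

p = bidir_ngram(list("abcabcabdabc"), k=1)  # N = 3
p("a", "b", "c")  # 1.0: 'b' always appears between 'a' and 'c' in this toy corpus
p("b", "c", "a")  # 2/3: between 'b' and 'a' we see 'c' twice and 'd' once
```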

Model-based diffusion. At this point, the pattern is clear: eventually, we want the noising process to not only be informed by data but to be directly optimized (or adaptive in some other way) to facilitate efficient and effective learning. For example, can we use the model itself and, in particular, the mistakes it makes to find even more meaningful training examples? Can we train a small, capacity-limited model to come up with crude infills/corrections and then train a larger model to refine those proposals (ELECTRA-style [9])? The possibilities are endless, and unlike for continuous-state diffusion with Gaussian noise, there may not be “one noise to rule them all”. In that timeline, optimizing the noising process end-to-end with the denoising model will certainly be a worthwhile endeavor.

# The importance of the jump schedule

Shifting gears a bit, let’s discuss the theory that powers diffusion LMs. Discrete diffusion models are generally trained in continuous time, meaning that the underlying noising/denoising process is a continuous-time Markov chain (CTMC). While the discrete diffusion literature often defines these in terms of their transition rates (infinitesimal probability of jumping from one state to another) [10], there is another, arguably more intuitive and insightful way to define them, which is in terms of their holding distribution and jump chain. The intuition behind this perspective is captured by the following mental model. Picture that each token in the sequence has a timer that counts down to the next jump (state transition). The timer is set to a random initial time, potentially depending on the current state. Once the timer hits zero (when it “rings”), the token jumps to a new state following some distribution of transition probabilities. This perfectly describes what happens in a continuous-time Markov process—in fact, you can think of it as a game of alternating between waiting and jumping. The distribution of waiting times is referred to as the holding distribution, which is typically an exponential distribution, and the jump probabilities are described by the jump chain, a discrete-time Markov chain which specifies the transition probabilities between two states conditioned on the fact that a jump is actually occurring. These holding times and transition probabilities may, in general, depend on both the current noise level and the current state. In the forward noising direction, they typically don’t, but in the reverse denoising direction, they often do.

More formally, a CTMC with state $X_t$ is fully defined by the jump chain $P \in \mathbb{R}^{N \times N}$, with $P_{ij}$ denoting the probability of jumping from state $i$ to state $j$, and the holding distribution, whose parameter $\lambda_i$ specifies how long state $i$ is held before jumping elsewhere. If $X_t = i$, we have that

$$ \text{holding time} \sim \mathrm{Exp}(\lambda_i), \quad \text{next state} \sim P_{i}, $$

where $\mathrm{Exp}(\lambda_i)$ is the exponential distribution with rate $\lambda_i$.² Pishro-Nik [11] has a nice introduction to this perspective on CTMCs.
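This wait-then-jump view translates directly into a simulation procedure. A minimal sketch for a toy chain (all rates and transition probabilities are made up for illustration):

```python
import numpy as np

def simulate_ctmc(P, lam, x0, t_max, rng):
    """Simulate a CTMC by alternating waiting and jumping: hold state i
    for an Exp(lam[i])-distributed time, then jump according to row P[i]."""
    t, x = 0.0, x0
    path = [(0.0, x0)]
    while True:
        t += rng.exponential(1.0 / lam[x])     # holding time ~ Exp(rate lam[x])
        if t >= t_max:
            break
        x = int(rng.choice(len(lam), p=P[x]))  # next state ~ jump chain row P[x]
        path.append((t, x))
    return path

# Toy 3-state chain; rows of P are the jump-chain transition probabilities.
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
lam = np.array([1.0, 2.0, 0.5])  # per-state holding rates
path = simulate_ctmc(P, lam, x0=0, t_max=5.0, rng=np.random.default_rng(1))
```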

It can be shown that, in continuous time, if the forward diffusion process is a CTMC, then the backward denoising process is also a CTMC, albeit with potentially different holding times and transition probabilities. This, in turn, makes it easy to see that in order to reverse a forward CTMC, the denoising model has to predict two things: the previous state (i.e., the reverse jump chain) and the holding time between the current and previous state (i.e., the reverse holding distribution). As it turns out, for masked diffusion, the forward and backward holding distributions are exactly the same, since each token experiences exactly one jump (from unmasked to masked or vice-versa) whether the direction of the chain is forward or backward. The backward jump chain is, of course, still highly non-trivial: even if we know the time at which each token becomes unmasked, it is very difficult to fill them in correctly. A masked diffusion model, therefore, does not need to fit the holding times, and it suffices to learn how to infill the missing tokens. In other words, it only has to fit where to jump, but not when to jump [12], which makes its job somewhat easier.

For uniform diffusion, however, we are not so lucky. Clearly, the backward holding time should depend on how “correct” or “noisy” the current token is: if the token has a low probability of being noise (“looks correct”), then the backward holding time should be long since the token is unlikely to change again. On the other hand, if the token has a high probability of being noise (“looks wrong”), then the backward holding time should be shorter since the token still needs at least one more transition to reach the clean state.³ Current uniform diffusion models implicitly learn the backward holding distribution and jump chain jointly through a single categorical distribution over the clean token $x$ given some noisy observation $z$, but there are clues that the denoising objective may actually consist of two distinct tasks.

# Denoising in two parts: learning the when and the where

Specifically, the GIDD (Generalized Interpolating Discrete Diffusion) ELBO [13] has two divergence terms: a KL (Kullback-Leibler) divergence between the true and the model-predicted marginals, and an IS (Itakura-Saito) divergence between the same distributions at the current token. We have

$$ \mathcal{L}_{\mathrm{GIDD}}(x, \theta) = \mathbb{E}_{t, z \sim q_t(\cdot | x)} \big[ w_t(z, x) \{ \underbrace{D_{KL}(q_t(\cdot | x) \| q_t(\cdot | \mathbf{x}_\theta))}_{\text{ jump chain?}} + \underbrace{D_{IS}(q_t(z | x) \| q_t(z | \mathbf{x}_\theta))}_{\text{ holding distribution?}} \} \big]. $$

Here, $q_t(z | x)$ denotes the marginal distribution of the forward process at time $t$ given the clean data $x$, and $q_t(z | \mathbf{x}_\theta)$ denotes the model-predicted equivalent. Please refer to the original paper for further details [13]. While both divergence terms are minimal if the model $\mathbf{x}_\theta$ perfectly predicts the data $x$, they are not minimized in the same way: the KL term incentivizes matching the entire distribution, while the IS term is only concerned with the probability mass on the current token. In other words, the IS term asks the model to quantify how likely the current token $z$ is to be correct, i.e., to approximate $q(x = z | z_t = z)$. In particular, it does not care about how the mass is distributed over other tokens, which is instead regulated by the KL divergence.
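To see how differently the two terms behave, consider a toy example in which the model puts the right amount of mass on the current token but distributes the remainder wrongly (the divergences follow their standard definitions; the numbers are illustrative, not from the paper):

```python
import numpy as np

def kl_div(p, q):
    """KL divergence between two categorical distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def is_div(a, b):
    """Itakura-Saito divergence between two positive scalars (here: the
    probability mass each distribution puts on the current token)."""
    r = a / b
    return float(r - np.log(r) - 1.0)

q_true  = np.array([0.7, 0.2, 0.1])  # true marginal (toy numbers)
q_model = np.array([0.7, 0.1, 0.2])  # model marginal, same mass on token 0

# Suppose the current token z is index 0: the IS term is exactly zero
# (the mass on z matches), while the KL term still penalizes the
# mismatch on the remaining tokens.
is_div(q_true[0], q_model[0])  # 0.0
kl_div(q_true, q_model)        # > 0
```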

A curious fact about these divergences is that the IS-divergence is the loss function that arises from doing a maximum likelihood estimation (MLE) on an exponential distribution, which matches the holding distribution of our CTMC. Moreover, the KL-divergence results from MLE on a categorical distribution, which matches the jump chain. Coincidence? I think not.
(Credit goes to @HessianFree for first pointing this out.)
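One way to make the correspondence concrete: writing two exponential distributions in terms of their means $\mu_1$ and $\mu_2$ (i.e., rates $1/\mu_1$ and $1/\mu_2$), their KL divergence is exactly the IS divergence between the means:

$$ D_{KL}\big(\mathrm{Exp}(1/\mu_1) \,\|\, \mathrm{Exp}(1/\mu_2)\big) = \log\frac{\mu_2}{\mu_1} + \frac{\mu_1}{\mu_2} - 1 = \frac{\mu_1}{\mu_2} - \log\frac{\mu_1}{\mu_2} - 1 = D_{IS}(\mu_1 \,\|\, \mu_2). $$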

Waving our hands a little bit, we can conjecture that there is a way to parameterize our denoising model such that it outputs a holding distribution and a jump chain separately, disentangling the IS-loss and KL-loss components via two separate predictions.⁴ Elegance and beauty aside, this would admit a more interpretable model of the denoising process: the backward holding distribution gives a direct estimate of how likely each token is to be correct or to need revision, and the backward jump chain gives a direct prediction of the most likely clean token given that a revision occurs. Among other things, this would certainly simplify the design and implementation of adaptive sampling strategies for uniform diffusion models.

# Adaptive sampling for uniform diffusion

Adaptive, confidence-based decoding has been a major driver behind state-of-the-art benchmark scores achieved by masked diffusion models. The core idea is that some unmasking orders are easier than others, and that we can use the model’s predictions (e.g., via confidence or entropy) to identify tokens that are easy to fill in and therefore unlikely to incur a large error. This is particularly important for masked diffusion models, since they have no way to correct mistakes if an erroneous token ever gets sampled.⁵
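As a concrete sketch, one simple variant of confidence-based decoding uses the top-1 predicted probability as the confidence measure (function and variable names are mine, not from any specific paper):

```python
import numpy as np

def confident_unmask(probs, masked, k):
    """One decoding step: among still-masked positions, commit the k
    positions where the model's top prediction is most confident."""
    conf = probs.max(axis=-1)               # top-1 confidence per position
    conf = np.where(masked, conf, -np.inf)  # only masked slots compete
    pick = np.argsort(conf)[-k:]            # the k "easiest" positions
    tokens = probs[pick].argmax(axis=-1)    # commit the argmax token there
    return pick, tokens

# Toy model output: 4 positions, vocab of 3; the last position is already unmasked.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.30, 0.30],
                  [0.20, 0.70, 0.10],
                  [0.34, 0.33, 0.33]])
masked = np.array([True, True, True, False])
pick, tokens = confident_unmask(probs, masked, k=2)  # positions 0 and 2 go first
```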

However, this adaptive sampling paradigm does not trivially transfer to uniform diffusion: while masked diffusion models are guaranteed to update every token in the sequence exactly once (from the masked to some unmasked state), in uniform diffusion, we need to be prepared to revise every token in every step. Instead of “infilling confidence”, we need to think about “revision probability” and “revision confidence”: how likely is the current token to be correct, and how confident are we in our ability to revise it to a more correct version? A relatively naive heuristic capturing this idea, proposed in our recent paper [1], is to take, for any token $z$ in the sequence, the maximum confidence on any token $z' \neq z$ minus the confidence on the current token $z$, i.e.

$$ \mathrm{score}(z) = \max_{z' \neq z} p_\theta(x = z' | z_t = z) - p_\theta(x = z | z_t = z). $$

Then, we just take the top-$k$ positions with the highest scores and update those tokens to the highest-confidence token as predicted by the model, i.e., $z \gets \arg\max_{z' \neq z} p_\theta(x = z' | z_t = z)$.
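This heuristic can be sketched in a few lines (the scoring rule is from the formula above; the array layout and names are mine):

```python
import numpy as np

def revision_scores(probs, current):
    """score(z) = max_{z' != z} p(x = z' | z_t) - p(x = z | z_t), per position."""
    idx = np.arange(len(current))
    p_cur = probs[idx, current]
    alt = probs.copy()
    alt[idx, current] = -np.inf  # exclude the current token from the max
    return alt.max(axis=-1) - p_cur, alt.argmax(axis=-1)

def revise_topk(probs, current, k):
    """Replace the k highest-scoring positions with the model's best alternative."""
    scores, best_alt = revision_scores(probs, current)
    pick = np.argsort(scores)[-k:]
    out = current.copy()
    out[pick] = best_alt[pick]
    return out

# Toy model output: 3 positions, vocab of 3.
probs = np.array([[0.80, 0.10, 0.10],
                  [0.10, 0.20, 0.70],
                  [0.45, 0.45, 0.10]])
current = np.array([0, 0, 2])
revise_topk(probs, current, k=1)  # position 1 is revised first (score 0.6)
```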

While this works reasonably well, there are at least two obvious things wrong with this:

  • Not a good measure of revision confidence. The proposed heuristic, being a difference between two probabilities, is ungrounded and hard to interpret at best. It may not accurately reflect the likelihood of the current token being correct, much less the likelihood of the new token being more correct than before. Ideally, an adaptive sampler could provide guarantees on this.
  • No randomness. The sampling algorithm is entirely deterministic, which limits exploration and may lead to low-diversity or even low-quality outcomes. Uniform diffusion strongly resembles an MCMC/simulated annealing process, and a good sampling algorithm may need to reflect this.

# Diffusion LMs make more of your data

It has been shown, in my opinion quite convincingly, that diffusion models are able to make more of the training data than autoregressive models. This is because they are able to train on the same data for many more epochs before overfitting [14, 15]. The intuition behind this is simple: each training sample is heavily augmented with random noise, so even if repeated many times, the odds that we see the same version twice are vanishingly small.
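The collision argument is easy to check empirically. A small sketch with uniform replacement noise over a toy vocabulary (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_view(tokens, vocab_size, noise_prob, rng):
    """Draw one randomly corrupted view of a training sample."""
    out = tokens.copy()
    hit = rng.random(len(out)) < noise_prob
    out[hit] = rng.integers(0, vocab_size, size=int(hit.sum()))
    return out

seq = np.arange(64) % 50  # one 64-token "sample" over a 50-token vocabulary
views = {tuple(noisy_view(seq, 50, 0.3, rng)) for _ in range(10_000)}
len(views)  # almost surely 10_000 distinct views: every epoch sees a fresh version
```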

This is significant, especially in settings where we care a lot about data quality (e.g., small LMs, SFT/post-training). Having the option to repeat the data many times without overfitting opens up new possibilities. For example, we can be more aggressive with quality filtering and just repeat the remaining data more often to compensate, both during pre- and post-training. What we are still missing, however, are scaling laws: how much filtering and how many repetitions are ideal? There is an inherent tradeoff between the two, and the size of the training corpus relative to model capacity also plays a role (smaller data and larger models lead to overfitting more quickly).

Uniform diffusion models may have an advantage in this respect, too. Since uniform noise is an inherently stronger/richer augmentation, it may allow for even more data repetition before the model starts overfitting. Existing papers on this topic unfortunately only investigate masked diffusion models, so this is an exciting direction for future work.

# Conclusion

The history of diffusion LLMs is only just starting to be written, and while the arguments made here are without a doubt an optimistic look into the crystal ball, the future is bright and full of potential. On the other hand, there are a number of more fundamental questions that still need to be answered: What is the optimal (de)noising process? What representations and capabilities do discrete diffusion models learn during pre-training, and how do they differ from autoregressive models? And are there any fundamental limitations to what diffusion LMs can and, more importantly, cannot do? Any such limitations may also heavily depend on the diffusion process, putting additional emphasis on finding the right type of noise. For example, the inability of masked diffusion models to revise already-generated tokens is resolved simply by injecting some uniform noise or switching to uniform diffusion entirely. Another relevant limitation is the rigidity of token positions: while token revision is readily possible in uniform diffusion, inserting new tokens or deleting existing tokens is not, which in some situations may still result in unrecoverable errors. Beyond simple insertion and deletion, it’s not hard to imagine more complex operations like token swapping or sequence splicing. However, all of these operations make the training process and, in particular, the derivation of a computationally efficient ELBO more challenging. As far as insertion goes, there has already been some recent progress on insertion-based diffusion processes [16, 17], and developing these methods further ought to be a fruitful endeavor.

Finally, it is good to recall that extraordinary claims require extraordinary evidence, and this post certainly leans heavily on the former. Hopefully, though, the arguments and ideas presented here will compel people to start investigating and eventually uncover the latter, too.

# Citation

Please cite this post as:

Dimitri von Rütte. Why Diffusion Language Models Are the Future, 2026. https://dimitri.ml/posts/why-diffusion-language-models-are-the-future/ (visited on {{today}}).

For academic contexts, feel free to use the following BibTeX entry:

@misc{vonrütte2026why,
  author = {Dimitri {von Rütte}},
  title = {Why Diffusion Language Models Are the Future},
  url = {https://dimitri.ml/posts/why-diffusion-language-models-are-the-future/},
  year = {2026},
  urldate = {{{today}}}
}

  1. A remaining detail is the question of how to formulate a tractable ELBO for this diffusion process, especially one in continuous time. Let’s just say this is left as an exercise to the reader :) ↩︎

  2. It is worth noting that a CTMC defined in this way has time running from $0$ to $\infty$, whereas diffusion models typically require time to be constrained to the interval $(0, 1)$. This can easily be resolved by warping time through some appropriate monotone transformation, e.g., $t' = 1 - e^{-\lambda t}$, which maps $t \in [0, \infty)$ to $t' \in [0, 1)$. ↩︎

  3. Note how this argument also applies to masked diffusion, with the major difference being that we, a priori, have perfect knowledge about which tokens are noisy and noise-free: masked tokens are always noisy and unmasked tokens are always (assumed to be) noise-free. ↩︎

  4. In fact, we can be a bit more concrete: Taking the KL-divergence between two exponential distributions parameterized by their means actually results in exactly the IS-divergence between their means. We can therefore write the entire loss as a sum of two KL-divergences, which points towards a factorization of the underlying model, e.g. $p(z_t) \cdot p(z' | z_t)$ where $p(z_t)$ models the holding time (exponential distribution) and $p(z' | z_t)$ models the jump chain (categorical distribution). ↩︎

  5. The caveat here is that people have tried, increasingly more successfully, to remedy this limitation through remasking, where we sometimes decide to remask a token (e.g., based on heuristics or, more recently, learned policies) so that it can later be unmasked again with additional context and, hopefully, higher accuracy. Still, this is a workaround to patch around a fundamental limitation of masked diffusion, and resolving the problem at its core would certainly be more elegant and, ideally, also more effective. ↩︎

# References

  1. D von Rütte, J Fluri, O Pooladzandi, B Schölkopf, T Hofmann, A Orvieto. Scaling Behavior of Discrete Diffusion Language Models, 2025. The Fourteenth International Conference on Learning Representations.
  2. N Lee, K Sreenivasan, JD Lee, K Lee, D Papailiopoulos. Teaching Arithmetic to Small Transformers, 2024. The Twelfth International Conference on Learning Representations.
  3. J Kim, K Shah, V Kontonis, SM Kakade, S Chen. Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions, 2025. Forty-second International Conference on Machine Learning.
  4. J Ye, J Gao, S Gong, L Zheng, X Jiang, Z Li, L Kong. Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning, 2025. The Thirteenth International Conference on Learning Representations.
  5. G Bachmann, V Nagarajan. The Pitfalls of Next-Token Prediction, 2024. International Conference on Machine Learning, 2296-2318 (PMLR).
  6. SS Sahoo, J-M Lemercier, Z Yang, J Deschenaux, J Liu, J Thickstun, A Jukic. Scaling Beyond Masked Diffusion Language Models, 2026. arXiv preprint arXiv:2602.15014.
  7. T Brown, B Mann, N Ryder, M Subbiah, JD Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language Models Are Few-Shot Learners, 2020. Advances in Neural Information Processing Systems 33, 1877-1901.
  8. L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, O Evans. The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”, 2024. The Twelfth International Conference on Learning Representations.
  9. K Clark, M-T Luong, QV Le, CD Manning. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020. International Conference on Learning Representations.
  10. A Campbell, J Benton, V De Bortoli, T Rainforth, G Deligiannidis, A Doucet. A Continuous Time Framework for Discrete Denoising Models, 2022. Proceedings of the 36th International Conference on Neural Information Processing Systems, 28266-28279.
  11. H Pishro-Nik. Introduction to Probability, Statistics, and Random Processes, 2014. Kappa Research LLC, Section 11.3.
  12. AN Amin, N Gruver, AG Wilson. Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion, 2025. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  13. D von Rütte, J Fluri, Y Ding, A Orvieto, B Schölkopf, T Hofmann. Generalized Interpolating Discrete Diffusion, 2025. Proceedings of the 42nd International Conference on Machine Learning.
  14. M Prabhudesai, M Wu, A Zadeh, K Fragkiadaki, D Pathak. Diffusion Beats Autoregressive in Data-Constrained Settings, 2025. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  15. J Ni, Q Liu, L Dou, C Du, Z Wang, H Yan, T Pang, MQ Shieh. Diffusion Language Models Are Super Data Learners, 2025. arXiv preprint arXiv:2511.03276.
  16. J Kim, LC Kit, C Domingo-Enrich, Y Du, SM Kakade, T Ngotiaoco, S Chen, MS Albergo. Any-Order Flexible Length Masked Diffusion, 2026. The Fourteenth International Conference on Learning Representations.
  17. F Ding, D Ding, S Chen, K Wang, P Xu, Z Feng, H Bai, K Han, Y Yan, B Yuan, J Sun. Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes, 2026. The Fourteenth International Conference on Learning Representations.