Sequential Diffusion Language Models

Yangzhou Liu*, Yue Cao*, Hao Li*, Gen Luo*, Zhe Chen, Weiyun Wang, Xiaobo Liang,
Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li,
Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Shanghai AI Laboratory · Nanjing University · Tsinghua University
Fudan University · The Chinese University of Hong Kong
Soochow University · Donghua University
* Equal contribution.

Abstract

Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1× higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm.

Sequential Diffusion Language Model

We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. The method builds on the observation that tokens at earlier positions within a block are predicted from more reliable context, and therefore exhibit lower deviation and higher accuracy.

  • (a) Training pipeline. Reordered input enables structured mask with causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right).
  • (b) Sampling pipeline. Confidence-based dynamic block decoding with KV cache reuse. At each step, a block of D tokens is predicted with D - 1 padding masks. The longest high-confidence prefix is selected as the dynamic output, and cached KV states enable efficient decoding (see the sketch below).
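The following is a minimal sketch of the confidence-based prefix selection described in (b), not the released implementation: the function name `dynamic_block_decode_step` and the greedy token choice are assumptions for illustration. Given logits for the D masked positions of one block, the longest prefix whose per-token confidence stays above a threshold τ is committed; the remaining positions are discarded and re-predicted at the next step.

```python
import torch

def dynamic_block_decode_step(logits: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Select the longest high-confidence prefix from one predicted block.

    Args:
        logits: [D, vocab_size] logits for the D masked positions of the block.
        tau: confidence threshold; positions after the first are accepted
             while their confidence stays >= tau.

    Returns:
        1-D tensor of committed token ids (length between 1 and D).
    """
    probs = torch.softmax(logits, dim=-1)   # [D, V]
    conf, tokens = probs.max(dim=-1)        # per-position confidence and greedy token

    # The first token is always accepted; if nothing else clears the threshold,
    # the step reduces to standard next-token prediction (NSP with length 1).
    accept_len = 1
    for i in range(1, conf.shape[0]):
        if conf[i] >= tau:
            accept_len += 1
        else:
            break                            # stop at the first low-confidence position

    return tokens[:accept_len]
```

With τ close to 1 the model behaves almost autoregressively, while lowering τ commits longer prefixes per forward pass; this is the speed/performance trade-off examined below.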

Performance

Long-Form Benchmarks

SDLM delivers strong performance with significantly faster decoding. It decodes approximately 2× faster than comparable autoregressive models while matching their accuracy, and achieves up to a 5× speedup over other diffusion language models, as shown on the MATH-500 benchmark.

General Multiple-Choice Benchmarks

Block Size & Self-Speculative Decoding

Trade-off Between Performance and Speed

Trade-off between performance and speed under different confidence thresholds τ for SDLM-3B (D=4) and SDLM-3B (D=8). By adjusting τ, a controllable trade-off between speed and performance can be achieved. SpeedUp denotes the average number of tokens output per forward pass.
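As a rough illustration of how SpeedUp is measured, the sketch below counts committed tokens per forward pass while generating at a fixed τ; sweeping τ traces the trade-off curve. It reuses `dynamic_block_decode_step` from the earlier sketch and assumes a hypothetical `model.predict_block(prefix, block_size)` interface that returns logits for the next block of masked positions.

```python
def measure_speedup(model, prompts, tau, max_new_tokens=256, block_size=8):
    """Average number of tokens committed per forward pass at threshold tau.

    `model.predict_block` is a hypothetical interface returning [D, V] logits
    for the next `block_size` masked positions given the current token prefix.
    """
    total_tokens, total_steps = 0, 0
    for prompt in prompts:
        generated = list(prompt)
        while len(generated) - len(prompt) < max_new_tokens:
            logits = model.predict_block(generated, block_size)      # [D, V]
            accepted = dynamic_block_decode_step(logits, tau)        # dynamic prefix
            generated.extend(accepted.tolist())
            total_tokens += len(accepted)
            total_steps += 1
    return total_tokens / total_steps
```

A SpeedUp of k at a given τ means the model emits, on average, k tokens per forward pass; higher τ pushes k toward 1 and recovers autoregressive-level accuracy.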

BibTeX

@article{liu2025sdlm,
  title={Sequential Diffusion Language Models},
  author={Liu, Yangzhou and Cao, Yue and Li, Hao and Luo, Gen and Chen, Zhe and Wang, Weiyun and Liang, Xiaobo and Qi, Biqing and Wu, Lijun and Tian, Changyao and Zhang, Yanting and Li, Yuqiang and Lu, Tong and Qiao, Yu and Dai, Jifeng and Wang, Wenhai},
  journal={arXiv preprint arXiv:2509.24007},
  year={2025}
}