dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning

Published: arXiv:2512.21446v1
Authors

Shirui Chen Jiantao Jiao Lillian J. Ratliff Banghua Zhu

Abstract

Masked diffusion language models (MDLMs) offer the potential for parallel token generation, but most open-source MDLMs decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies. As a result, their sampling speeds are often comparable to AR + speculative decoding schemes, limiting their advantage over mainstream autoregressive approaches. Existing distillation-based accelerators (dParallel, d3LLM) finetune MDLMs on trajectories generated by a base model, which can become off-policy during finetuning and restrict performance to the quality of the base model's samples. We propose dUltra, an on-policy reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. dUltra introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. We jointly optimize the base diffusion LLM and the unmasking order planner using reward signals combining verifiable reward, distillation reward, and the number of unmasking steps. Across mathematical reasoning and code generation tasks, dUltra improves the accuracy-efficiency trade-off over state-of-the-art heuristic and distillation baselines, moving towards achieving "diffusion supremacy" over autoregressive models.
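The abstract describes a planner head that assigns each still-masked token a probability of being unmasked in the next forward pass, with the unmask set drawn from independent Bernoulli distributions. The sketch below illustrates that idea only; the module name, the single linear projection, and the tensor shapes are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an unmasking planner head on top of a diffusion LLM.
# Assumed names and shapes; not the authors' code.
import torch
import torch.nn as nn

class UnmaskingPlannerHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # One scalar logit per position, predicted from the model's hidden states.
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor, masked: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        # masked: (batch, seq_len) bool, True where the token is still masked
        logits = self.proj(hidden_states).squeeze(-1)   # (batch, seq_len)
        probs = torch.sigmoid(logits)
        return probs * masked                           # only masked positions are eligible

def sample_unmask_set(probs: torch.Tensor) -> torch.Tensor:
    # Independent Bernoulli draw per position: the set of tokens to decode
    # in this forward pass, enabling variable-width parallel decoding.
    return torch.bernoulli(probs).bool()
```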

Paper Summary

Problem
The main challenge addressed in this research paper is the slow sampling speed of masked diffusion language models (MDLMs), which limits their advantage over mainstream autoregressive approaches. Even with sophisticated sampling strategies, most open-source MDLMs can only decode fewer than 5 tokens per model forward pass, making their sampling speeds comparable to AR + speculative decoding schemes.
Key Innovation
The key innovation of this work is dUltra, an on-policy reinforcement learning framework that learns unmasking strategies for efficient parallel decoding. dUltra adds an unmasking planner head that predicts per-token unmasking probabilities under independent Bernoulli distributions, and uses Group Relative Policy Optimization (GRPO) to jointly optimize the base diffusion LLM and the unmasking order planner with reward signals that combine a verifiable reward, a distillation reward, and the number of unmasking steps. A sketch of this reward and advantage computation follows below.
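The following is a minimal sketch of how the combined reward and GRPO-style group-relative advantages described above might be computed. The reward weights, step penalty, and function names are illustrative assumptions, not values or code from the paper.

```python
# Hedged sketch: combined reward and group-relative advantage for GRPO.
import torch

def combined_reward(verifiable: torch.Tensor,
                    distill: torch.Tensor,
                    num_steps: torch.Tensor,
                    w_verify: float = 1.0,
                    w_distill: float = 0.5,
                    w_steps: float = 0.01) -> torch.Tensor:
    # Per-rollout scalars: higher verifiable/distillation reward is better;
    # more unmasking steps (less parallelism) is penalized. Weights are
    # illustrative hyperparameters, not the paper's settings.
    return w_verify * verifiable + w_distill * distill - w_steps * num_steps

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO normalizes each rollout's reward against the mean and standard
    # deviation of its group of rollouts for the same prompt.
    # rewards: (group_size,)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```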
Practical Impact
This research has the potential to improve the accuracy-efficiency trade-off of masked diffusion language models, making them more competitive with autoregressive models. The learned unmasking policies, guided by mode filtering and conditional independence principles, enable progress towards "diffusion supremacy" over autoregressive LLMs. This could lead to faster and more efficient language generation in latency-sensitive applications such as mathematical reasoning, code generation, and general text generation.
Analogy / Intuitive Explanation
Imagine you're trying to solve a puzzle with many missing pieces. A traditional approach would be to try to find each piece one by one, but this can be slow and inefficient. dUltra is like a clever puzzle solver that learns to prioritize which pieces to find first, based on their relevance and connection to other pieces. This allows it to find the solution much faster and more efficiently, while still ensuring that the solution is accurate and complete.
Paper Information
Categories: cs.LG, cs.AI
Published Date:
arXiv ID: 2512.21446v1