$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion

Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Keze Wang, Liang Lin, Guangrun Wang

TL;DR

E0 introduces a continuized discrete diffusion framework for action generation in Vision-Language-Action (VLA) robotics, using a flexible, high-resolution discrete action vocabulary that remains compatible with pretrained VLM/VLA backbones. By applying Gaussian noise to one-hot action embeddings and performing iterative denoising, E0 achieves strong semantic grounding and fine-grained control, while preserving the discrete action structure through a Bayes-optimal denoiser. The approach is augmented with a spherical viewpoint perturbation mechanism to improve cross-view robustness. It is validated on LIBERO, VLABench, and ManiSkill, where it outperforms strong baselines by 10.7% on average across 14 environments, and in real-world experiments on a Franka arm. Theoretical analyses in the supplementary material argue for tighter generalization with discrete tokens and show how Bayes-optimal denoisers preserve action support, reinforcing the practical advantages of discrete diffusion for generalizable VLA policies.
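
The corruption and denoising steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the vocabulary size `K = 8`, and the noise parameters `alpha` and `sigma` are assumptions chosen for illustration, and the closed-form posterior stands in for what a trained network would approximate. For one-hot data under Gaussian noise, the posterior over tokens reduces to a softmax of the noisy vector scaled by `alpha / sigma**2`.

```python
import numpy as np

def add_gaussian_noise(one_hot, alpha, sigma, rng):
    """Corrupt one-hot rows: x_t = alpha * x_0 + sigma * eps (assumed form)."""
    eps = rng.standard_normal(one_hot.shape)
    return alpha * one_hot + sigma * eps

def bayes_posterior(x_t, alpha, sigma, log_prior=None):
    """Posterior p(k | x_t) over K tokens for one-hot x_0 and Gaussian noise.

    ||x_t - alpha * e_k||^2 differs across k only by -2 * alpha * x_t[k],
    so the posterior is softmax(log_prior + alpha * x_t / sigma^2).
    """
    logits = alpha * x_t / sigma**2
    if log_prior is not None:
        logits = logits + log_prior
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K = 8                                  # hypothetical vocabulary size
tokens = np.array([1, 5, 2])           # three action tokens
x0 = np.eye(K)[tokens]                 # one-hot embeddings, shape (3, K)
x_t = add_gaussian_noise(x0, alpha=0.9, sigma=0.1, rng=rng)
posterior = bayes_posterior(x_t, alpha=0.9, sigma=0.1)
decoded = posterior.argmax(axis=-1)    # most likely token per row
```

Because the denoiser outputs a distribution over the vocabulary rather than a point in continuous space, any decoded action is by construction one of the `K` tokens.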

Abstract

Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control, whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals, and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions, yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
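
To make the link between vocabulary size and fine-grained control concrete, here is a minimal sketch of uniform action quantization. The abstract does not specify E0's tokenizer, so the uniform binning, the action range `[-1, 1]`, and the bin counts 256 and 2048 are all assumptions for illustration. The point is only that the worst-case round-trip error is half a bin width, so it shrinks as the vocabulary grows.

```python
import numpy as np

def quantize(actions, low, high, num_bins):
    """Map continuous actions in [low, high] to integer tokens in [0, num_bins - 1]."""
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)            # now in [0, 1]
    # Truncation equals floor here because scaled is non-negative.
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)

def dequantize(tokens, low, high, num_bins):
    """Map tokens back to the centre of their bin."""
    width = (high - low) / num_bins
    return low + (tokens + 0.5) * width

actions = np.array([-1.0, -0.3, 0.0, 0.42, 1.0])       # toy normalized actions
errors = {}
for num_bins in (256, 2048):                            # hypothetical vocabulary sizes
    tokens = quantize(actions, -1.0, 1.0, num_bins)
    recon = dequantize(tokens, -1.0, 1.0, num_bins)
    errors[num_bins] = float(np.abs(recon - actions).max())
```

In this toy setting the reconstruction error is bounded by `(high - low) / (2 * num_bins)`, which is one way to read the claim that a larger vocabulary supports finer control.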

Paper Structure

This paper contains 37 sections, 2 theorems, 26 equations, 19 figures, 13 tables.

Key Result

Lemma 1

Let $\mathcal{V}_L$ be a finite language token set and let the action space be represented either as a finite discrete set $\mathcal{V}_A$ of size $K = |\mathcal{V}_A|$ or as a continuous space $\mathbb{R}^p$. Consider hypothesis classes where $\Delta^{K-1}$ is the probability simplex over $K$ actions. Then the following hold: Consequently, discrete action tokens admit a structurally simpler and

Figures (19)

  • Figure 1: Overview of action modeling paradigms. (a) Discrete modeling: Traditional autoregressive (AR) approaches [brohan2022rt, zitkovich2023rt, kim2024openvla] and recent mask-based discrete diffusion methods [liang2025discrete], which operate over a small discrete action vocabulary. (b) Continuous modeling: Continuous diffusion-based policies [liu2024rdt, xu2025a0] and AR-diffusion hybrids [black2024pi_0, intelligence2025pi05, lin2025onetwovla] that regress continuous actions. (c) Our approach: $\mathcal{E}_0$ integrates AR-style conditioning with continuized discrete diffusion, enabling efficient action generation while preserving compatibility with pretrained vision-language backbones and supporting fine-grained action control.
  • Figure 2: Overview and detailed illustration of $\mathcal{E}_0$. (a) Overall architecture of the proposed model. (b) Training and inference pipeline, showing how inputs are encoded, diffused, and decoded into executable action sequences.
  • Figure 3: Benchmarks for evaluation. (a) LIBERO [liu2023libero]: tasks with varying objects, layouts, and goals, including long-horizon settings. (b) ManiSkill [tao2024maniskill3]: diverse fine-grained manipulation skills (push, pick, stack, insert, plug). (c) VLABench [zhang2024vlabench]: open-ended tasks requiring language grounding and commonsense reasoning (select toy/fruit/painting/poker/mahjong).
  • Figure 4: Comparison on the VLABench benchmark. In the task “pick up the spade 3”, our $\mathcal{E}_0$ correctly identifies and precisely grasps the target card, showing superior multimodal reasoning and control.
  • Figure 5: Performance on real-world robotic experiments. (a) Short-horizon tasks (press button, close door, pull drawer, pick block, stack block, stack block unseen). (b) Long-horizon tasks (pick block twice, pull drawer and put in block, and put in plate and close door).
  • ...and 14 more figures

Theorems & Definitions (4)

  • Lemma 1: Discrete action tokens and hypothesis complexity
  • proof
  • Lemma 2: Support Preservation of Discrete Diffusion vs. Off-Support Averaging of Continuous Diffusion
  • proof
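
The statement of Lemma 2 is not reproduced on this page, so the following toy example only illustrates one plausible reading of its title and is not taken from the paper. It assumes a scalar action with two valid values, a uniform prior, and additive Gaussian noise; all names and numbers are hypothetical. A posterior-mean denoiser, which is what a mean-squared-error regressor targets, can return a value between the valid actions, whereas sampling a token from the posterior always returns a valid one.

```python
import numpy as np

support = np.array([-1.0, 1.0])   # the only valid actions (toy assumption)

def posterior_over_support(x_t, sigma):
    """p(a | x_t) for a uniform prior on `support` and x_t = a + sigma * eps."""
    logits = -(x_t - support) ** 2 / (2.0 * sigma**2)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def mean_denoiser(x_t, sigma):
    """Posterior mean E[a | x_t]; it may lie off the support."""
    return float(posterior_over_support(x_t, sigma) @ support)

def categorical_denoiser(x_t, sigma, rng):
    """Sample a token from the posterior, so the output stays on the support."""
    return float(rng.choice(support, p=posterior_over_support(x_t, sigma)))

rng = np.random.default_rng(0)
x_t, sigma = 0.0, 0.5                           # ambiguous noisy observation
averaged = mean_denoiser(x_t, sigma)            # midpoint of the two actions
sampled = [categorical_denoiser(x_t, sigma, rng) for _ in range(5)]
```

At the ambiguous point `x_t = 0`, the averaged output is the midpoint of the two valid actions and is itself invalid, while every sampled output is either `-1.0` or `1.0`.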