UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

Jianke Zhang, Yucheng Hu, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, Jianyu Chen

TL;DR

UniCoD tackles the problem of learning generalist robot policies by unifying discrete vision–language understanding with continuous future-state prediction through a Mixture-of-Transformers framework. It pre-trains on large-scale embodied VQA and TI2E data, then fine-tunes with an action expert to map predictions to actions, achieving state-of-the-art results in both simulation and real-world robotics. The key contributions are the dual-representation (discrete and continuous) learning, the two-stage training regime, and extensive validation showing superior generalization to novel objects and tasks. This approach enables robust, scalable embodied AI with practical impact on flexible manipulation across diverse robots and environments.

Abstract

Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work (VLA) has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning, and continuous future representation learning. Building on this insight, we introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniCoD is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show that our approach consistently outperforms baseline methods, by 9% in simulation environments and by 12% on real-world out-of-distribution tasks.

Paper Structure

This paper contains 43 sections, 3 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of UniCoD. Our proposed UniCoD, which utilizes both understanding and prediction tasks under discrete and continuous representation space, demonstrates strong semantic generalization capabilities on real-world manipulation tasks, particularly in its ability to handle completely novel objects not seen during training. The upper right displays benchmark evaluations across several simulations and 2 real-world robots.
  • Figure 2: Illustration of the UniCoD framework. UniCoD adopts a MoT framework to handle text understanding and planning, continuous visual prediction, and action execution. The continuous features are derived from future observations using a frozen vision encoder.
  • Figure 3: Our evaluation environments, including 2 simulation benchmarks and 2 real-world embodiments.
  • Figure 4: Results of the real-world 7-DOF robot arm experiment. More detailed quantitative results are provided in Table \ref{tab:app-franka}.
  • Figure 5: Results of the real-world 12-DOF dexterous hand experiment. More detailed quantitative results can be found in Table \ref{tab:app-xhand}.
  • ...and 5 more figures
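
The Figure 2 caption describes a MoT framework in which text understanding and planning, continuous visual prediction, and action execution are handled together. The following is a minimal sketch of that general idea, not the authors' implementation: all tokens share one self-attention pass, and each token is then processed by the weights of its own expert. The expert names, dimensions, and helper functions (`make_linear`, `shared_attention`, `mot_layer`) are hypothetical, and the attention is simplified to have no learned projections.

```python
import math
import random

# Hypothetical expert names mirroring the three roles in the Figure 2 caption.
EXPERTS = ("text", "visual", "action")


def make_linear(dim, rng):
    """Create a random dim x dim weight matrix (stand-in for an expert's FFN)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_linear(weight, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def shared_attention(tokens):
    """Single-head dot-product self-attention over ALL tokens.

    Simplification (assumption): no learned query/key/value projections.
    """
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(w / total * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return mixed


def mot_layer(tokens, modalities, expert_weights):
    """Mix all tokens with shared attention, then route each to its expert."""
    mixed = shared_attention(tokens)
    return [
        apply_linear(expert_weights[modality], hidden)
        for modality, hidden in zip(modalities, mixed)
    ]


rng = random.Random(0)
dim = 4
expert_weights = {name: make_linear(dim, rng) for name in EXPERTS}

# A toy sequence: two text tokens, two visual tokens, one action token.
modalities = ["text", "text", "visual", "visual", "action"]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in modalities]

outputs = mot_layer(tokens, modalities, expert_weights)
```

In this sketch, replacing the `"action"` weights changes only the action token's output, which is the property that would let an action expert be added or fine-tuned without altering how the other token types are processed.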