X-IL: Exploring the Design Space of Imitation Learning Policies

Xiaogang Jia, Atalay Donat, Xi Huang, Xuan Zhao, Denis Blessing, Hongyi Zhou, Han A. Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, Gerhard Neumann

TL;DR

X-IL introduces a modular, open-source framework for systematic exploration of imitation learning policies, decomposing the pipeline into observation representations, backbones, architectures, and policy representations. By supporting multi-modal inputs (RGB, point clouds, language), diverse backbones (Transformer, Mamba, xLSTM), and policy forms (BC, diffusion, flow), it enables controlled ablations and rapid prototyping. Empirical results on LIBERO and RoboCasa show state-of-the-art performance and data efficiency, with insights that sequence models like Mamba and xLSTM can outperform Transformers under comparable budgets and that robot-specific encoders and well-designed multi-modal fusion are crucial. Overall, X-IL provides a practical, scalable resource for practitioners and researchers to design, compare, and generalize IL policies across varied robotic tasks.

Abstract

Designing modern imitation learning (IL) policies requires making numerous decisions, including the selection of feature encoding, architecture, policy representation, and more. As the field rapidly advances, the range of available options continues to grow, creating a vast and largely unexplored design space for IL policies. In this work, we present X-IL, an accessible open-source framework designed to systematically explore this design space. The framework's modular design enables seamless swapping of policy components, such as backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques (e.g., score matching, flow matching). This flexibility facilitates comprehensive experimentation and has led to the discovery of novel policy configurations that outperform existing methods on recent robot learning benchmarks. Our experiments not only demonstrate significant performance gains but also provide valuable insights into the strengths and weaknesses of various design choices. This study serves as both a practical reference for practitioners and a foundation for guiding future research in imitation learning.

Paper Structure

This paper contains 31 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of X-IL framework. X-IL supports multi-modal inputs (Language, RGB, and Point Cloud) and two architectures: Decoder-Only and Encoder-Decoder. Inside each architecture, the Backbone serves as the core computational unit, offering support for Transformer, Mamba, and xLSTM. For policy representations, X-IL supports Behavior Cloning (BC), Diffusion-based, and Flow-based Policies, enabling diverse learning paradigms for imitation learning. Notably, each component—input modality, architecture, backbone, and policy—can be easily swapped to efficiently explore various model configurations.
  • Figure 2: Network details of X-Block. X-Layer is the core part, which is used to process sequence tokens; AdaLN conditioning is used to inject the context information. Details can be found in the appendix (X-Block subsection).
  • Figure 3: Illustration of LIBERO and RoboCasa. While LIBERO shows only minimal variation within the same task (e.g., in LIBERO-Spatial), RoboCasa provides diversity along several aspects. The RoboCasa task shown in the figure is CoffeeServeMug.
  • Figure 4: Mamba
  • Figure 5: xLSTM
  • ...and 5 more figures
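
The component axes named in the Figure 1 caption (architecture, backbone, policy representation) can be read as a small grid of configurations. The following is a minimal sketch of that idea only; `PolicyConfig`, `enumerate_configs`, and the string identifiers are hypothetical names chosen for illustration and are not taken from the X-IL codebase. Input modalities are left out to keep the example short.

```python
from dataclasses import dataclass
from itertools import product

# Options as listed in the Figure 1 caption; the identifiers are illustrative.
ARCHITECTURES = ("decoder_only", "encoder_decoder")
BACKBONES = ("transformer", "mamba", "xlstm")
POLICIES = ("bc", "diffusion", "flow")


@dataclass(frozen=True)
class PolicyConfig:
    """One point in the design space (hypothetical container, not X-IL's API)."""
    architecture: str
    backbone: str
    policy: str


def enumerate_configs():
    """Return every architecture/backbone/policy combination."""
    return [
        PolicyConfig(arch, backbone, policy)
        for arch, backbone, policy in product(ARCHITECTURES, BACKBONES, POLICIES)
    ]


configs = enumerate_configs()
print(len(configs))  # 2 architectures * 3 backbones * 3 policies = 18
```

Keeping each axis as an independent field is what makes a controlled ablation possible in such a setup: two configurations can differ in exactly one component while everything else stays fixed.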