CIVIL: Causal and Intuitive Visual Imitation Learning

Yinlong Dai, Robert Ramirez Sanchez, Ryan Jeronimus, Shahabedin Sagheb, Cara M. Nunez, Heramb Nemlekar, Dylan P. Losey

TL;DR

This work tackles causal confusion in visual imitation by arguing that demonstrations should include not only what actions to imitate but also why those actions are chosen. The authors introduce CIVIL, a two-phase method that uses physical markers and natural language prompts during training to extract task-relevant, low-dimensional causal features from high-dimensional observations, and then trains a transformer-based policy to act on these features. A separate causal network learns to recover the same features from unmasked test imagery, enabling autonomous execution without markers or prompts. Across simulations, real-world robot experiments, and a user study, CIVIL demonstrates improved sample efficiency, robustness to distractors, and better generalization to unseen task configurations, while reducing user-teaching time. The approach shows practical potential for faster, more reliable robot learning in manipulation tasks by aligning robot perception with human reasoning.

Abstract

Today's robots attempt to learn new tasks by imitating human examples. These robots watch the human complete the task, and then try to match the actions taken by the human expert. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding which features of the system or environment factor into the human's decisions, robot learners often misinterpret the human's examples. In practice, this results in causal confusion, inefficient learning, and robot policies that fail when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to intuitively indicate why they made those decisions. Under our paradigm, human teachers attach markers to task-relevant objects and use natural language prompts to describe their state representation. Our proposed algorithm, CIVIL, leverages this augmented demonstration data to filter the robot's visual observations and extract a feature representation that aligns with the human teacher. CIVIL then applies these causal features to train a transformer-based policy that -- when tested on the robot -- is able to emulate human behaviors without being confused by visual distractors or irrelevant items. Our simulations and real-world experiments demonstrate that robots trained with CIVIL learn both what actions to take and why to take those actions, resulting in better performance than state-of-the-art baselines. From the human's perspective, our user study reveals that this new training paradigm actually reduces the total time required for the robot to learn the task, and also improves the robot's performance in previously unseen scenarios. See videos at our project website: https://civil2025.github.io

Paper Structure

This paper contains 18 sections, 29 equations, 8 figures, 1 table, and 1 algorithm.

Figures (8)

  • Figure 1: Human teaching a robot arm to prepare a cup of coffee. The robot must learn to grasp the cup and place it under a coffee machine based on visual observations. In traditional approaches, the human demonstrates what actions to take, and the robot learns to emulate these demonstrated actions. However, this approach is inefficient because the robot is not taught why the human chooses a specific behavior (i.e., what features of the environment factored into the human's decisions). Without this causal information that links features to actions, the robot can misinterpret the human: for instance, if a bowl is always placed to the left of the cup during the demonstrations, the robot might learn to move beside the bowl instead of going to the cup. We hypothesize that robots can learn more efficient and robust control policies when the human teacher communicates the features behind their decisions (i.e., why they are choosing the actions they demonstrate). CIVIL shifts imitation learning towards holistic demonstrations with physical markers and natural language instructions.
  • Figure 2: Augmented data collection procedure for CIVIL. In Step 1, we enable humans to mark task-relevant positions (e.g., the coffee maker) with ArUco markers. In Step 2, as the human demonstrates the task, they can provide natural language prompts that mention task-relevant objects (e.g., the cup). The resulting dataset for offline learning includes states $x$, images $y$, actions $u$, marker data $b$, and language prompts $l$. After providing data, the human removes the markers from the environment, and the robot processes its images to inpaint those markers so that they are not required at test time.
  • Figure 3: Network architecture of CIVIL. The model consists of encoder networks that map environment observations (images) to a compact feature representation $\phi$, and a policy transformer that takes a sequence of robot states and features as input and predicts the task action. The training of our model is split into two phases. (Left) In the first phase we supervise a subset of the features using a marker network $h$ to explicitly encode the relevant poses $b$ marked by the human expert. At the same time, we train the remaining features to implicitly capture other task-relevant information by masking the input images to highlight the relevant objects conveyed by the human through natural language instructions $l$. The features are trained together with the policy transformer by optimizing a dual loss function that aligns the robot's representation with human reasoning (the why) and minimizes the error between predicted and ground truth actions (the what). (Right) In the second phase we freeze the encoder network and policy network, and train a causal network $c$ to map the original images to the same features as those learned by the robot from the masked images in the first phase. This step ensures that the robot can extract the task-relevant features without needing the human to place markers or provide language prompts at runtime.
  • Figure 4: Manipulation tasks in the CALVIN environment: (1) Picking up a red block. The block is initialized on the left or right side of the table during training. Some of the possible block positions are shown using transparent overlays. (2) Opening the drawer or moving the sliding door based on the light bulb state. The bulb is located in the top right corner and appears yellow when on or white when off. (3) Stacking on the blue or pink block based on the light bulb state and block positions. The task starts with the red block in the robot's gripper and the blue and pink blocks in random positions on the table. In all tasks, the irrelevant objects are also initialized randomly.
  • Figure 5: Results from our ablation study. In Explicit the system is trained on the position data of the marked objects, and in Implicit the system is trained on the masked images. CIVIL takes advantage of the human's explicit and implicit guidance. We find that both components contribute to the overall effectiveness of CIVIL. Each policy is trained with $40$ demonstrations.
  • ...and 3 more figures
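
The dataset in the Figure 2 caption and the two training phases in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `DemoStep` record, the function names, the use of mean squared error for every term, and the `marker_weight` parameter are all assumptions made for the example, since the paper's actual loss equations are not reproduced in this listing.

```python
from dataclasses import dataclass
from typing import Any, Sequence

Vector = Sequence[float]


@dataclass
class DemoStep:
    """One timestep of an augmented demonstration (hypothetical layout)."""
    state: Vector   # robot state x
    image: Any      # camera image y (left opaque in this sketch)
    action: Vector  # demonstrated action u
    marker: Vector  # marker data b (e.g., a marked pose)
    prompt: str     # natural language prompt l


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two non-empty, equal-length vectors."""
    if len(pred) == 0 or len(pred) != len(target):
        raise ValueError("vectors must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def phase1_loss(pred_action: Vector, pred_marker: Vector,
                step: DemoStep, marker_weight: float = 1.0) -> float:
    """Assumed form of the phase-one dual loss.

    The action term is the "what" (match the demonstrated action); the
    marker term is the "why" (the supervised features should encode the
    poses marked by the human). MSE and the weighting are assumptions.
    """
    action_term = mse(pred_action, step.action)
    marker_term = mse(pred_marker, step.marker)
    return action_term + marker_weight * marker_term


def phase2_loss(causal_features: Vector, masked_features: Vector) -> float:
    """Assumed form of the phase-two objective.

    The causal network sees the original images and is trained to
    reproduce the features that the frozen encoder produced from the
    masked images in phase one.
    """
    return mse(causal_features, masked_features)
```

In this sketch the second phase is a plain feature-matching problem: only the causal network would receive gradients, while the encoder and policy from the first phase stay fixed, as the caption describes.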

Theorems & Definitions (1)

  • proof