CIVIL: Causal and Intuitive Visual Imitation Learning
Yinlong Dai, Robert Ramirez Sanchez, Ryan Jeronimus, Shahabedin Sagheb, Cara M. Nunez, Heramb Nemlekar, Dylan P. Losey
TL;DR
This work tackles causal confusion in visual imitation by arguing that demonstrations should include not only what actions to imitate but also why those actions are chosen. The authors introduce CIVIL, a two-phase method that uses physical markers and natural language prompts during training to extract task-relevant, low-dimensional causal features from high-dimensional observations, and then trains a transformer-based policy to act on these features. A separate causal network learns to recover the same features from unmasked test imagery, enabling autonomous execution without markers or prompts. Across simulations, real-world robot experiments, and a user study, CIVIL demonstrates improved sample efficiency, robustness to distractors, and better generalization to unseen task configurations, while reducing user-teaching time. The approach shows practical potential for faster, more reliable robot learning in manipulation tasks by aligning robot perception with human reasoning.
Abstract
Today's robots attempt to learn new tasks by imitating human examples. These robots watch the human complete the task, and then try to match the actions taken by the human expert. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding which features of the system or environment factor into the human's decisions, robot learners often misinterpret the human's examples. In practice, this results in causal confusion, inefficient learning, and robot policies that fail when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to intuitively indicate why they made those decisions. Under our paradigm, human teachers attach markers to task-relevant objects and use natural language prompts to describe their state representation. Our proposed algorithm, CIVIL, leverages this augmented demonstration data to filter the robot's visual observations and extract a feature representation that aligns with the human teacher. CIVIL then applies these causal features to train a transformer-based policy that, when tested on the robot, is able to emulate human behaviors without being confused by visual distractors or irrelevant items. Our simulations and real-world experiments demonstrate that robots trained with CIVIL learn both what actions to take and why to take those actions, resulting in better performance than state-of-the-art baselines. From the human's perspective, our user study reveals that this new training paradigm reduces the total time required for the robot to learn the task and also improves the robot's performance in previously unseen scenarios. See videos at our project website: https://civil2025.github.io
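The idea of filtering visual observations down to task-relevant regions can be illustrated with a toy sketch. This is not the CIVIL implementation: the function name `mask_to_relevant_regions`, the bounding-box format, and the tiny grayscale image are all hypothetical, and the boxes are assumed to come from some upstream step (for example, detecting a marker or grounding a language prompt) that is not shown here.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def mask_to_relevant_regions(image: Image, boxes: List[Box], fill: int = 0) -> Image:
    """Keep pixels inside any box and replace every other pixel with `fill`.

    Hypothetical helper for illustration only; boxes are clipped to the image.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    keep = [[False] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for r in range(max(0, top), min(height, bottom)):
            for c in range(max(0, left), min(width, right)):
                keep[r][c] = True
    return [
        [image[r][c] if keep[r][c] else fill for c in range(width)]
        for r in range(height)
    ]


# Toy 4x4 observation: the 9s stand in for distractor background,
# and the central 2x2 patch stands in for a task-relevant object.
image = [
    [9, 9, 9, 9],
    [9, 5, 6, 9],
    [9, 7, 8, 9],
    [9, 9, 9, 9],
]
marker_box = (1, 1, 3, 3)  # assumed output of a marker or language-based detector
masked = mask_to_relevant_regions(image, [marker_box])
```

In this sketch, a feature extractor trained on `masked` rather than `image` never sees the distractor pixels during training, which is the intuition behind using markers and prompts to tell the robot what the teacher is attending to.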
