AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li, Erdem Bıyık

TL;DR

AutoFocus-IL tackles data inefficiency and causal confusion in visual imitation learning by using vision-language models (VLMs) to automatically identify and track task-relevant objects, then generating temporal saliency maps that regularize behavior cloning (BC). The method has three components: context-aware VLM filtering, temporal saliency modeling, and saliency-guided regularization. Together they provide dense attention guidance without any additional human annotations. In CARLA simulation and real-robot experiments, AutoFocus-IL outperforms both standard BC and gaze-based baselines, with improved data efficiency, generalization, and robustness to visual distractors. By using pre-trained VLMs for object-centric saliency instead of explicit human supervision, the approach offers a scalable path toward practical, data-efficient imitation learning in real-world robotics.
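To make the idea of a temporal saliency map concrete, the sketch below turns per-frame bounding boxes of tracked objects into normalized saliency maps. Everything specific here is an assumption for illustration and is not taken from the paper: the Gaussian blob shape, the exponential moving average used for temporal smoothing, the `decay` parameter, and the function names. The boxes are assumed to come from an upstream VLM and tracker, which are not modeled. Plain Python lists are used so that the example runs without dependencies; an array library would be the natural choice in practice.

```python
import math


def gaussian_saliency(height, width, boxes):
    """Sum one Gaussian blob per box; boxes are (x_min, y_min, x_max, y_max) in pixels."""
    sal = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Assumption: the blob spread is the box half-extent, floored at one pixel.
        sx, sy = max((x1 - x0) / 2.0, 1.0), max((y1 - y0) / 2.0, 1.0)
        for y in range(height):
            for x in range(width):
                sal[y][x] += math.exp(
                    -0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2)
                )
    return sal


def normalize(sal):
    """Scale a map so its entries sum to 1; fall back to uniform if it is all zeros."""
    total = sum(sum(row) for row in sal)
    if total <= 0.0:
        n = len(sal) * len(sal[0])
        return [[1.0 / n] * len(sal[0]) for _ in sal]
    return [[v / total for v in row] for row in sal]


def temporal_saliency(height, width, boxes_per_frame, decay=0.5):
    """Exponential moving average over frames (hypothetical smoothing choice).

    An object that is missed in one frame fades gradually instead of vanishing.
    """
    maps = []
    running = [[0.0] * width for _ in range(height)]
    for boxes in boxes_per_frame:
        current = gaussian_saliency(height, width, boxes)
        running = [
            [decay * r + (1.0 - decay) * c for r, c in zip(run_row, cur_row)]
            for run_row, cur_row in zip(running, current)
        ]
        maps.append(normalize(running))
    return maps
```

For example, `temporal_saliency(8, 8, [[(2, 2, 4, 4)], []])` returns two maps. The second frame has no detections, yet its map still highlights the earlier object location because of the moving average.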

Abstract

AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
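The abstract states that the saliency maps regularize the behavior cloning policy but does not give the loss here. As a minimal sketch of one common formulation, the code below adds a KL-divergence term between a target saliency map and a policy attention map to a mean-squared-error imitation loss. The KL form, the MSE action loss, the `weight` coefficient, and all function names are assumptions, not the paper's definitions. How `policy_attention` is obtained (for instance, from a spatial activation map of the policy's visual encoder) is likewise assumed and not modeled.

```python
import math


def _to_distribution(grid, eps=1e-8):
    """Flatten a 2D map into a probability vector; eps keeps the logarithm finite."""
    flat = [max(v, 0.0) + eps for row in grid for v in row]
    total = sum(flat)
    return [v / total for v in flat]


def saliency_kl(target_saliency, policy_attention):
    """KL(target || policy) over pixels; zero when the two maps agree."""
    p = _to_distribution(target_saliency)
    q = _to_distribution(policy_attention)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bc_loss(pred_action, expert_action):
    """Mean squared error between predicted and expert action vectors."""
    return sum(
        (a - b) ** 2 for a, b in zip(pred_action, expert_action)
    ) / len(expert_action)


def regularized_loss(pred_action, expert_action, target_saliency,
                     policy_attention, weight=0.1):
    """Imitation loss plus a weighted saliency-alignment penalty (hypothetical weight)."""
    return bc_loss(pred_action, expert_action) + weight * saliency_kl(
        target_saliency, policy_attention
    )
```

With this form, a policy that predicts the expert action perfectly but attends to the wrong region still incurs a positive loss, which is the kind of pressure against spurious visual cues that the abstract describes.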

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of three approaches. Traditional imitation learning suffers from causal confusion. Gaze-based IL addresses this, but at the cost of collecting human eye-gaze data. AutoFocus-IL instead obtains saliency maps from a VLM, retaining the benefits of gaze-based IL without the extra data collection cost.
  • Figure 2: Overview of the AutoFocus-IL pipeline.
  • Figure 3: Confounded overlay visualization in the CARLA simulator. Action-conditioned icons are rendered along the top margin of each frame to induce spurious correlations, while leaving the underlying dynamics and expert labels unchanged. The red circle simulates a brake light, while arrows denote the steering direction, with their thickness indicating the throttle applied in the previous timestep.
  • Figure 4: CARLA Driving Score (mean $\pm$ standard error) on seen and unseen routes in both original and confounded environments. All baselines except BC use human gaze to improve imitation learning, while AutoFocus-IL uses VLM-generated saliency maps.
  • Figure 5: Saliency fraction sweep. Performance across Seen/Unseen and Original/Confounded splits.
  • ...and 1 more figure