Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues

Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Sethu Vijayakumar, Alexandros Kouris, Oisin Mac Aodha, Chris Xiaoxuan Lu

TL;DR

The paper addresses the fragility of visuomotor policies that rely on pre-trained visual representations (PVRs) when faced with out-of-domain visual perturbations. It introduces Attentive Feature Aggregation (AFA), a cross-attention-based pooling mechanism that trains a query token to focus on task-relevant visual cues while ignoring distractors, without updating the PVR or relying on dataset augmentation. Across 14 PVRs and multiple pooling baselines in simulation and a real-world planar pushing task, AFA substantially improves out-of-domain performance while preserving in-domain accuracy, and the authors show that attention mass and attention entropy are strong predictors of OOD success. The findings suggest that effective feature pooling is a critical lever for deploying robust, generalizable visuomotor policies in visually dynamic environments.

Abstract

The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
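
To make the pooling idea concrete, the following is a minimal sketch of single-head cross-attention pooling, in which one query vector attends over the patch tokens produced by a frozen PVR. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attentive_pool`, `matvec`, `softmax`), the identity projection matrices, and the toy token values are all hypothetical, and the actual AFA module may differ (for example in the number of heads, normalisation layers, or output projections). In training, the query and projection weights would be learned while the PVR stays fixed.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attentive_pool(query, tokens, w_key, w_value):
    """Pool N patch tokens into one vector using a single query (hypothetical sketch)."""
    keys = [matvec(w_key, t) for t in tokens]
    values = [matvec(w_value, t) for t in tokens]
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Attention-weighted sum of the value vectors.
    pooled = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return pooled, weights


# Toy example: three 2-D "patch tokens" and identity projections (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [4.0, 0.0]
pooled, weights = attentive_pool(query, tokens, identity, identity)
```

In this toy setup the query is aligned with the first token, so that token receives the largest attention weight and the second token, which is orthogonal to the query, receives the smallest. The pooled vector is what would be passed on to the policy network in place of, say, a CLS token or a channel average.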

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of attention heatmaps with and without AFA (PVR: DINO). AFA learns to attend to focused, task-relevant regions, ignoring scene changes (e.g., distractors).
  • Figure 2: Visualisation of the tasks used for evaluation. The first row illustrates representative scenes, as seen in the frames from the expert demonstrations (in-domain). The second row shows how the scenes are modified by randomly altering the brightness, orientation and position of the light source. Similarly, the third row presents changes to the tabletop texture.
  • Figure 3: Visualisation of (a) the standard visuomotor policy learning approach and (b) the proposed approach with AFA.
  • Figure 4: Success rate (%) of policies trained with features from 14 PVRs, in-domain and out-of-domain. For PVR, the raw output features of the PVR are used (i.e., the CLS token for ViTs and the channel average for ResNets). PVR+TL, PVR+SS, and PVR+AFA stack a TokenLearner, a Spatial Softmax, and an Attentive Feature Aggregation pooling module after the PVR, respectively.
  • Figure 5: Correlation plots for the OOD performance predictors. On the left, we visualise how a greater percentage of attention mass falling within the masks of task-relevant areas (e.g., robot, object, target location, etc.) makes it more likely for the corresponding PVR to lead to a higher OOD policy success rate. Similarly, on the right, the entropy of the attention (i.e., how diffuse the attention is, with lower entropy meaning more targeted attention) is strongly and negatively correlated with OOD performance. Both plots show the results from raw PVR features in red, the results from AFA-filtered features in blue, and the overall trend in gray.
  • ...and 1 more figure
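
The two OOD predictors described in the Figure 5 caption, attention mass inside task-relevant masks and attention entropy, can be sketched as follows. This is a hedged illustration: the helper names (`attention_mass_in_mask`, `attention_entropy`), the use of the natural logarithm, the normalisation of the weights, and the toy attention values are assumptions, since the exact definitions used in the paper are not given in this summary.

```python
import math


def attention_mass_in_mask(weights, mask):
    """Percentage of total attention that lands on patches flagged as task-relevant."""
    total = sum(weights)
    inside = sum(w for w, relevant in zip(weights, mask) if relevant)
    return 100.0 * inside / total


def attention_entropy(weights):
    """Shannon entropy (in nats) of the normalised attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy attention maps over four patches; only the first patch is task-relevant (assumed).
mask = [True, False, False, False]
focused = [0.7, 0.1, 0.1, 0.1]
diffuse = [0.25, 0.25, 0.25, 0.25]

focused_mass = attention_mass_in_mask(focused, mask)
diffuse_mass = attention_mass_in_mask(diffuse, mask)
focused_entropy = attention_entropy(focused)
diffuse_entropy = attention_entropy(diffuse)
```

Here the focused map places more of its mass inside the mask and has lower entropy than the uniform map, which matches the direction of the correlations described in the caption: more mass on task-relevant regions and lower entropy go together with higher OOD success.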