Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna

TL;DR

The paper tackles the problem that general-purpose visual encoders inject task-irrelevant information into Embodied AI policies, hindering learning. It proposes a task-conditioned codebook bottleneck, with $K=256$ latent codes of dimension $D_c=10$, to filter visual representations and produce a compact embedding $\hat{E}$ from the fused input $E$, trained end-to-end with PPO and dropout to prevent collapse. Empirically, the approach achieves state-of-the-art zero-shot performance on Object Navigation and Object Displacement across five benchmarks and shows faster convergence and improved generalization to Habitat, with analyses indicating a focus on goal-relevant cues and smoother exploration. The method is representation-agnostic, improving performance even when paired with different pretrained visual encoders such as CLIP or DINOv2, and it decouples cue learning from policy optimization to enhance transfer across domains.

Abstract

Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans (the process through which people filter their perception based on their experiences, knowledge, and the task at hand), we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments showcase state-of-the-art performance for object goal navigation and object displacement across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook also generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and that their representations retain task-relevant information, like target object recognition, while ignoring superfluous information about other objects. Code and pretrained models are available at our project website: https://embodied-codebook.github.io.

Paper Structure (21 sections, 14 figures, 9 tables)


Figures (14)

  • Figure 1: Selective Attention. Imagine an agent tasked with locating a key in an environment. Standard visual encoders such as the CLIP encoder capture general-purpose scene information, which includes details not relevant to the task, such as the color of the sofa or the texture of the floor. This mirrors the concept of bottom-up processing, where perception is influenced by external stimuli in the environment. To address this, we equip the encoder with a codebook bottleneck that retains only the most task-relevant information, such as flat surfaces likely to hold the key and the walkable paths to these surfaces. This represents top-down selective processing, where perception is guided by internal goals and expectations.
  • Figure 2: An overview of EmbCLIP-Codebook. The three representations corresponding to the input frame, the goal, and the previous action are concatenated to form $E \in \mathbb{R}^{1568}$. The codebook module takes $E$ and generates a probability distribution $\mathcal{P} \in \mathbb{R}^{256}$ over the latent codes. The hidden compact representation $h \in \mathbb{R}^{10}$ is a convex combination of the codes weighted by $\mathcal{P}$. The final task-bottlenecked codebook representation $\hat{E}$ is derived by upsampling $h$ and is subsequently passed to the recurrent state encoder and the policy to produce an action.
  • Figure 3: Sample Trajectory. The EmbCLIP agent makes many redundant rotations, resulting in a high average curvature, whereas ours navigates more smoothly.
  • Figure 4: Lightweight Finetuning of the Adaptation Module. We only finetune a few CNN layers, the action/goal embedders, and the codebook scoring function when moving to a new visual domain.
  • Figure 5: GradCAM Attention Visualization. While EmbCLIP is distracted by different objects and other visual cues even though the target object is visible in the frame, EmbCLIP-Codebook is able to effectively ignore such distractions and only focus on the object goal.
  • ...and 9 more figures
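
The data flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the sizes 1568, 256, and 10 come from the caption, while the linear scoring layer, the softmax, the linear upsampling layer, the output size of $\hat{E}$ (assumed here to match $E$), the random initialization, and all names (`CodebookBottleneck`, `W_score`, `W_up`) are assumptions made for the example. The dropout mentioned in the TL;DR and the PPO training loop are omitted.

```python
import math
import random

E_DIM, K, D_C = 1568, 256, 10  # sizes quoted in the Figure 2 caption


def softmax(scores):
    """Map raw scores to a probability vector (non-negative, sums to 1)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols, scale):
    """Random Gaussian matrix; stands in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class CodebookBottleneck:
    """Hypothetical sketch: score codes, mix them convexly, then upsample."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Assumption: a single linear layer scores each of the K codes.
        self.W_score = rand_matrix(rng, K, E_DIM, 0.02)
        # The codebook: K latent codes of dimension D_C.
        self.codes = rand_matrix(rng, K, D_C, 1.0)
        # Assumption: a single linear layer upsamples h back to E_DIM.
        self.W_up = rand_matrix(rng, E_DIM, D_C, 0.02)

    def forward(self, E):
        # P: probability distribution over the K codes, conditioned on E.
        p = softmax(matvec(self.W_score, E))
        # h: convex combination of the codes, weighted by P.
        h = [sum(p[k] * self.codes[k][d] for k in range(K)) for d in range(D_C)]
        # E_hat: upsampled bottleneck representation.
        E_hat = matvec(self.W_up, h)
        return p, h, E_hat


# Usage: push one random "fused" embedding through the bottleneck.
rng = random.Random(1)
E = [rng.gauss(0.0, 1.0) for _ in range(E_DIM)]
p, h, E_hat = CodebookBottleneck().forward(E)
print(len(p), len(h), len(E_hat))  # 256 10 1568
```

Because $h$ is a convex combination, each of its coordinates lies between the smallest and largest value that coordinate takes across the codes, so everything the downstream policy sees must pass through a 10-dimensional summary of the 1568-dimensional input.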