Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation

Dennis Hoftijzer, Gertjan Burghouts, Luuk Spreeuwers

TL;DR

This work addresses shortcut learning in ObjectNav caused by simulator biases. It introduces an out-of-distribution (o.o.d.) test in which training environments are biased by tying each room type to a specific wall color. It proposes Language-Based (L-B) augmentation, which operates in the feature space of a Vision-Language Model (CLIP/EmbCLIP) and adds a single layer to the model, using text descriptions of bias variations to improve domain generalization without simulator changes. Empirically, the baseline EmbCLIP method suffers a large performance drop when wall colors change (up to a 69% decrease in success rate), while L-B augmentation reduces this drop to about 23%, showing improved robustness to biased visual cues. The approach provides a lightweight, post-hoc strategy to mitigate shortcut learning in embodied AI and offers a foundation for extending language-driven generalization to biases more complex than wall color.

Abstract

Deep Reinforcement Learning (DRL) has shown great potential in enabling robots to find certain objects (e.g., "find a fridge") in environments like homes or schools. This task is known as Object-Goal Navigation (ObjectNav). DRL methods are predominantly trained and evaluated using environment simulators. Although DRL has shown impressive results, the simulators may be biased or limited. This creates a risk of shortcut learning, i.e., learning a policy tailored to specific visual details of training environments. We aim to deepen our understanding of shortcut learning in ObjectNav and its implications, and to propose a solution. We design an experiment for inserting a shortcut bias in the appearance of training environments. As a proof of concept, we associate room types with specific wall colors (e.g., bedrooms with green walls), and observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to environments where this is not the case (e.g., bedrooms with blue walls). We find that shortcut learning is the root cause: the agent learns to navigate to target objects by simply searching for the wall color associated with the target object's room. To solve this, we propose Language-Based (L-B) augmentation. Our key insight is that we can leverage the multimodal feature space of a Vision-Language Model (VLM) to augment visual representations directly at the feature level, requiring no changes to the simulator and only the addition of one layer to the model. Where the SOTA ObjectNav method's success rate drops 69%, our proposal has a drop of only 23%.
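The wall-color bias experiment described in the abstract can be pictured with a small sketch. This is not the authors' code: the room list, the color names, the pool of unseen colors, and the function names `make_test_colors` and `is_deceptive` are all hypothetical, chosen only to illustrate how a training bias (one fixed wall color per room type) can be broken for a chosen number of rooms. The `deceptive` flag mirrors the case where the wall color tied to the target room during training shows up in a different room type at test time.

```python
import random

# Hypothetical training bias: each room type always has the same wall color.
# The assignment is illustrative, not the paper's actual one.
TRAIN_COLORS = {
    "kitchen": "red",
    "bedroom": "green",
    "living_room": "blue",
    "bathroom": "yellow",
}

# Hypothetical colors that never appear during training.
EXTRA_COLORS = ["purple", "orange", "white", "gray"]


def make_test_colors(train_colors, target_room, n_changed, deceptive=False, seed=0):
    """Return a wall-color assignment in which n_changed rooms differ from training.

    The target room is changed first, then further rooms are picked at random.
    With deceptive=True, the target room's training color is moved to one of
    the other changed rooms (assumed here to need at least two changed rooms).
    """
    if not 0 <= n_changed <= len(train_colors):
        raise ValueError("n_changed must be between 0 and the number of rooms")
    if deceptive and n_changed < 2:
        raise ValueError("a deceptive change needs at least two changed rooms")

    rng = random.Random(seed)
    test_colors = dict(train_colors)
    if n_changed == 0:
        return test_colors

    others = [room for room in train_colors if room != target_room]
    changed = [target_room] + rng.sample(others, n_changed - 1)

    # Give every changed room a color that was never seen in training.
    for room, color in zip(changed, rng.sample(EXTRA_COLORS, n_changed)):
        test_colors[room] = color

    if deceptive:
        # Move the target room's training color to another changed room.
        test_colors[changed[1]] = train_colors[target_room]
    return test_colors


def is_deceptive(train_colors, test_colors, target_room):
    """True if the target room's training color appears in a different room."""
    target_color = train_colors[target_room]
    return any(
        color == target_color
        for room, color in test_colors.items()
        if room != target_room
    )


if __name__ == "__main__":
    for n in range(4):
        print(n, make_test_colors(TRAIN_COLORS, "kitchen", n))
    print("deceptive", make_test_colors(TRAIN_COLORS, "kitchen", 2, deceptive=True))
```

An agent that relies on the shortcut "red walls mean kitchen" would be expected to struggle as soon as the target room is recolored, and to be actively misled when the red walls move to another room.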

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We propose Language-Based (L-B) augmentation to generalize better to scenes with different wall colors. In this example, we interchange the wall color of the bedroom and living room, causing the SOTA ObjectNav method EmbCLIP to look for the sofa in the blue bedroom (wrong). With our augmentations, this is countered.
  • Figure 2: Setup for our o.o.d. generalization test. In this example, the target room is the kitchen (red walls in test set 0-room). We change the target room first (test set 1-room) and incrementally change more rooms (test set 2/3-room). The bottom row shows two examples of deceptive changes, where the wall color associated with the target room (red wall color) is moved to a different room type. The top row only shows nondeceptive changes.
  • Figure 3: Language-Based (L-B) augmentation via the feature space of a vision-language model. Our key insight is that we can augment the agent's visual representations ($I_t$) using differences ($\Delta$) between encoded text descriptions of variations of the dataset bias ($T_{1,...,n}$). The augmented embedding of an image 'A living room with green walls' resembles a 'living room with red or blue walls'. The RL model (RNN) is not able to use a shortcut strategy even if living rooms always have green walls during training.
  • Figure 4: Degradation for o.o.d. cases. Performance of EmbCLIP when generalizing to scenes with different wall colors. When only changing the wall color of the target object's room (1 wall color change), we already observe a large decrease in performance in all metrics.
  • Figure 5: Errors and shortcuts by the SOTA ObjectNav method. We show example trajectories from 2 different starting positions (left vs right column). Notice how nondeceptive episodes (middle) are much longer than deceptive episodes (bottom), whilst both are unsuccessful. Also note the absolute lack of search in the bedroom when changing wall colors deceptively.
  • ...and 1 more figures
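
The idea in the Figure 3 caption, shifting an image embedding $I_t$ by a difference $\Delta$ between encoded text descriptions $T_{1,...,n}$, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the update rule (add $T_j - T_i$, then renormalize), the random choice of the target description, the `scale` parameter, and the hand-built 4-dimensional vectors that stand in for real CLIP embeddings. The paper's actual equations and the role of its single added layer are not reproduced here.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def lb_augment(image_emb, text_embs, source_idx, rng, scale=1.0):
    """Shift an image embedding along a text-derived bias direction.

    image_emb:  stand-in for I_t, the embedding of the current observation.
    text_embs:  stand-ins for T_1..T_n, embeddings of bias descriptions.
    source_idx: index of the description that matches the training bias.
    Returns the augmented embedding and the index of the chosen description.
    """
    candidates = [i for i in range(len(text_embs)) if i != source_idx]
    target_idx = rng.choice(candidates)
    # Delta: direction from the training-bias description to another variation.
    delta = [t - s for t, s in zip(text_embs[target_idx], text_embs[source_idx])]
    shifted = [x + scale * d for x, d in zip(image_emb, delta)]
    return normalize(shifted), target_idx


# Toy space (assumption): axis 0 = "living room", axes 1-3 = green, red, blue.
TEXT_EMBS = [
    normalize([1.0, 1.0, 0.0, 0.0]),  # "a living room with green walls"
    normalize([1.0, 0.0, 1.0, 0.0]),  # "a living room with red walls"
    normalize([1.0, 0.0, 0.0, 1.0]),  # "a living room with blue walls"
]
# Toy image embedding of a green-walled living room, close to TEXT_EMBS[0].
IMAGE_EMB = normalize([1.0, 0.9, 0.1, 0.0])

if __name__ == "__main__":
    rng = random.Random(0)
    augmented, j = lb_augment(IMAGE_EMB, TEXT_EMBS, source_idx=0, rng=rng)
    print("chosen description index:", j)
    print("similarity to green before:", cosine(IMAGE_EMB, TEXT_EMBS[0]))
    print("similarity to green after: ", cosine(augmented, TEXT_EMBS[0]))
    print("similarity to chosen after:", cosine(augmented, TEXT_EMBS[j]))
```

In this toy setting the augmented embedding ends up closer to the chosen description than to the green-wall description, which is the behavior the caption describes: the policy no longer sees a stable "green means living room" signal in its visual features.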