Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation
Dennis Hoftijzer, Gertjan Burghouts, Luuk Spreeuwers
TL;DR
This work addresses shortcut learning in ObjectNav caused by simulator biases. It introduces an out-of-distribution test in which training environments are biased by tying room types to specific wall colors. It proposes Language-Based (L-B) augmentation, which operates in the feature space of a Vision-Language Model (CLIP/EmbCLIP) and adds only a single layer to the model, using text descriptions of bias variations to improve domain generalization without simulator changes. Empirical results show that the baseline EmbCLIP method suffers large performance drops when wall colors change (up to a 69% decrease in success), while L-B augmentation reduces this drop to about 23%, demonstrating improved robustness to biased visual cues. The approach provides a lightweight strategy to mitigate shortcut learning in embodied AI and offers a foundation for extending language-driven generalization to more complex biases beyond simple wall-color cues.
Abstract
Deep Reinforcement Learning (DRL) has shown great potential in enabling robots to find certain objects (e.g., 'find a fridge') in environments like homes or schools. This task is known as Object-Goal Navigation (ObjectNav). DRL methods are predominantly trained and evaluated using environment simulators. Although DRL has shown impressive results, the simulators may be biased or limited. This creates a risk of shortcut learning, i.e., learning a policy tailored to specific visual details of training environments. We aim to deepen our understanding of shortcut learning in ObjectNav and its implications, and to propose a solution. We design an experiment for inserting a shortcut bias in the appearance of training environments. As a proof-of-concept, we associate room types with specific wall colors (e.g., bedrooms with green walls), and observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to environments where this is not the case (e.g., bedrooms with blue walls). We find that shortcut learning is the root cause: the agent learns to navigate to target objects by simply searching for the wall color associated with the target object's room. To solve this, we propose Language-Based (L-B) augmentation. Our key insight is that we can leverage the multimodal feature space of a Vision-Language Model (VLM) to augment visual representations directly at the feature level, requiring no changes to the simulator and the addition of only one layer to the model. Where the success rate of the SOTA ObjectNav method drops by 69%, that of our proposal drops by only 23%.
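To make the idea of feature-level augmentation concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the exact operation, so the sketch assumes one plausible form: an image feature is shifted along the direction between the text embeddings of two bias descriptions (e.g., "a green wall" and "a blue wall") and then renormalized. The function names `normalize`, `cosine`, and `lb_augment`, the scale `alpha`, and the hand-picked 4-dimensional vectors are all hypothetical stand-ins for real CLIP embeddings, and the single added layer mentioned above is not modeled.

```python
import math


def normalize(v):
    """Scale a vector to unit length, as is common for CLIP-style features."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def lb_augment(image_feat, text_src, text_tgt, alpha=1.0):
    """Assumed augmentation: move the image feature from the source
    description towards the target description in the shared space."""
    direction = [t - s for s, t in zip(text_src, text_tgt)]
    shifted = [f + alpha * d for f, d in zip(image_feat, direction)]
    return normalize(shifted)


# Toy stand-ins for text embeddings of two wall-color descriptions.
text_green = normalize([1.0, 0.0, 0.2, 0.1])
text_blue = normalize([0.0, 1.0, 0.2, 0.1])

# Toy stand-in for the visual feature of a room with green walls.
image_feat = normalize([0.9, 0.1, 0.3, 0.0])

augmented = lb_augment(image_feat, text_green, text_blue)

print("similarity to 'blue' before:", round(cosine(image_feat, text_blue), 3))
print("similarity to 'blue' after: ", round(cosine(augmented, text_blue), 3))
```

In this toy setting the augmented feature is closer to the "blue wall" description than the original one, which is the kind of variation a policy could be exposed to during training without re-rendering any environment in the simulator.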
