ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Ye Won Byun; Cathy Jiao; Shahriar Noroozizadeh; Jimin Sun; Rosa Vitiello

TL;DR

The paper tackles generalization to unseen environments in embodied instruction following (EIF) by using CLIP as an auxiliary module rather than as a replacement for the visual encoder. It introduces ET-CLIP, a simple, architecture-agnostic approach that augments the Episodic Transformer (ET) with an auxiliary CLIP-based object-detection loss. The joint loss is $L(\mathrm{obj}) = \alpha L_{\mathrm{CLIP}}(\mathrm{obj}) + (1-\alpha) L_{\mathrm{ET}}(\mathrm{obj})$ with $\alpha \in [0,1]$; both modules are trained end-to-end, while inference uses ET alone. Experiments on ALFRED show improved performance on the unseen validation set, with larger gains for objects with fine-grained properties, small objects, and rare semantic terms, which points to better generalization and multimodal alignment.
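As a rough illustration of how the weighted objective above combines the two terms, the following minimal sketch computes the joint loss for a single object prediction. It assumes, purely for illustration, that both the ET term and the CLIP term are softmax cross-entropy losses over the same set of object classes; the summary does not state the exact form of either term, and the names `cross_entropy`, `joint_object_loss`, `et_logits`, and `clip_logits` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_object_loss(et_logits, clip_logits, target, alpha):
    """L(obj) = alpha * L_CLIP(obj) + (1 - alpha) * L_ET(obj).

    Assumption: both terms are cross-entropy over object classes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l_et = cross_entropy(et_logits, target)
    l_clip = cross_entropy(clip_logits, target)
    return alpha * l_clip + (1.0 - alpha) * l_et


# Hypothetical scores over three object classes; class 2 is correct.
et_logits = [0.2, 1.0, 0.5]
clip_logits = [0.1, 0.3, 2.0]
loss = joint_object_loss(et_logits, clip_logits, target=2, alpha=0.5)
```

Setting `alpha` to 0 recovers the ET object loss on its own, and setting it to 1 uses only the CLIP term. Because the CLIP term only affects training, dropping it at inference time leaves the ET forward pass unchanged.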

Abstract

We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.

Paper Structure (10 sections, 1 equation, 1 figure, 2 tables)

Introduction
Proposed Approach
Preliminary Experiments & Results
Experimental setting
Results
Analysis
Object properties
Small objects
Rare semantics
Conclusion

Figures (1)

Figure 1: ET-CLIP model, modified from Pashevich et al. (2021).
