ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello

TL;DR

The paper tackles generalization to unseen environments in embodied instruction following (EIF) by using CLIP as an auxiliary module rather than as a replacement for the visual encoder. It introduces ET-CLIP, a simple, architecture-agnostic approach that augments the Episodic Transformer (ET) with an auxiliary CLIP-based object-detection loss. The joint loss is $L(\mathrm{obj}) = \alpha L_{\mathrm{CLIP}}(\mathrm{obj}) + (1-\alpha) L_{\mathrm{ET}}(\mathrm{obj})$ with $\alpha \in [0,1]$; both modules are trained end-to-end, while inference uses ET alone. Experiments on ALFRED show improved performance on the unseen validation set, with larger gains for objects with fine-grained properties, small objects, and rare semantic terms, indicating better generalization and multimodal alignment.
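The weighting in the joint loss can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that both object losses are per-example cross-entropies over object-class scores; the summary above does not state the exact form of either loss. The function names `cross_entropy` and `joint_object_loss`, the list-of-logits inputs, and the default `alpha=0.5` are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw class scores.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_object_loss(et_logits, clip_logits, target, alpha=0.5):
    """Convex combination alpha * L_CLIP + (1 - alpha) * L_ET.

    Assumption: both terms are cross-entropies over object classes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l_et = cross_entropy(et_logits, target)
    l_clip = cross_entropy(clip_logits, target)
    return alpha * l_clip + (1.0 - alpha) * l_et
```

With `alpha=0` the expression reduces to the ET object loss alone, and with `alpha=1` to the CLIP loss alone; intermediate values trade off the two training signals. Because the CLIP term appears only in the training objective, dropping it at inference leaves the ET forward pass unchanged.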

Abstract

We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.

Paper Structure (10 sections, 1 equation, 1 figure, 2 tables)

This paper contains 10 sections, 1 equation, 1 figure, 2 tables.

Figures (1)

  • Figure 1: ET-CLIP model, modified from Pashevich et al. (2021)