Feature Extractor or Decision Maker: Rethinking the Role of Visual Encoders in Visuomotor Policies

Ruiyu Wang, Zheyu Zhuang, Shutong Jin, Nils Ingelhag, Danica Kragic, Florian T. Pokorny

TL;DR

The paper investigates whether visual encoders in visuomotor policies act merely as feature extractors or actively influence control decisions. It introduces Visual Alignment Testing (VAT) to quantify encoder involvement by contrasting end-to-end (E2E) trained encoders with frozen out-of-domain (OOD) pretrained encoders, using a controlled benchmark across robotic manipulation tasks. Policies with OOD-pretrained encoders show an average performance drop of 42% relative to E2E policies, and VAT, together with Full-Gradient saliency maps, provides evidence that E2E encoders contribute to decision-making by focusing on task-relevant regions. The findings suggest that the functional separation assumed by OOD pretraining does not hold in E2E-trained models and motivate developing task-conditioned or context-aware encoders for better integration in visuomotor policies.

Abstract

An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches using out-of-domain (OOD) data to pretrain the visual encoder have cleanly separated the visual encoder from the network, with the remainder referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, visual encoders actively contribute to decision-making resulting from motor data supervision, contradicting the assumed functional separation. In contrast, OOD-pretrained models, where encoders lack this capability, experience an average performance drop of 42% in our benchmark results, compared to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of visual encoders' role can provide a first step towards guiding future pretraining methods to address their decision-making ability, such as developing task-conditioned or context-aware encoders.

Paper Structure (16 sections, 7 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: End-to-end formulation.
  • Figure 2: OOD-pretrain formulation.
  • Figure 4: Visual Alignment Testing. The experimental framework we propose provides quantitative evidence that E2E-trained visual encoders play an active role in decision-making, detailed in Section III.C.
  • Figure 5: Mug → Spam
  • Figure 6: Square* → Nut
  • ...and 6 more figures