DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

Jianxiong Li; Jinliang Zheng; Yinan Zheng; Liyuan Mao; Xiao Hu; Sijie Cheng; Haoyi Niu; Jihao Liu; Yu Liu; Jingjing Liu; Ya-Qin Zhang; Xianyuan Zhan

TL;DR

This paper shows that, under implicit preferences, the popular Bradley-Terry model can be turned into a representation learning objective through proper reward reparameterization. It proposes a unified objective that simultaneously extracts meaningful task progression information from image sequences and aligns it with language instructions.

Abstract

Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representation; 3) capturing trajectory-level language grounding. Most existing methods approach these via separate objectives, which often reach sub-optimal solutions. In this paper, we propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences and seamlessly align it with language instructions. We discover that via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than with mismatched ones, the popular Bradley-Terry model can be transformed into a representation learning objective through proper reward reparameterizations. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks, providing an embodied representation learning framework that elegantly extracts both local and global task progression features, with temporal consistency enforced through implicit time contrastive learning, while ensuring trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/
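The link the abstract draws between Bradley-Terry preferences and an InfoNCE-style loss can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes, for illustration only, that a segment's reward for an instruction is the cosine similarity between the change in image embedding (end frame minus start frame) and the instruction embedding, and that each instruction prefers its own segment over every other segment in the batch. The function names, the temperature value, and the embedding shapes are all hypothetical.

```python
import numpy as np


def segment_scores(start_emb, end_emb, lang_emb, eps=1e-8):
    """Score every (segment, instruction) pair in a batch.

    Assumption (not taken from the source): the segment-level reward is the
    cosine similarity between the embedding change (end - start) and the
    instruction embedding.
    Inputs have shape (B, D); the result has shape (B, B), where entry
    [i, j] scores segment i against instruction j.
    """
    delta = end_emb - start_emb
    delta = delta / (np.linalg.norm(delta, axis=1, keepdims=True) + eps)
    lang = lang_emb / (np.linalg.norm(lang_emb, axis=1, keepdims=True) + eps)
    return delta @ lang.T


def implicit_preference_loss(scores, temperature=0.1):
    """Bradley-Terry preference loss extended to a whole batch.

    For instruction j, the matched segment j should be preferred over all
    other segments. With exponentiated rewards this is a softmax over the
    segments in column j, i.e. an InfoNCE-style cross-entropy whose positive
    pairs sit on the diagonal.
    """
    logits = scores / temperature
    # Numerically stable log-softmax over segments (axis 0) per instruction.
    logits = logits - logits.max(axis=0, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

In a training loop, the three embedding arrays would come from the vision and language encoders being trained, and the loss would be computed in an autodiff framework rather than plain NumPy. A symmetric variant that also takes a softmax over instructions for each segment is a common design choice for contrastive objectives, but whether DecisionNCE uses it is not stated here.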

Paper Structure (40 sections, 26 equations, 19 figures, 8 tables)

Introduction
Preliminaries
DecisionNCE
Implicit Preference Annotations
Random Segment Sampling
Implicit Preference Learning via Reward Reparameterization
Practical Implementation
Analyses and Insights
From Full Segments to Start-end Transitions
Mirroring Time Contrastive Learning
Positioning Task-Irrelevant Image Embeddings
Advanced Local/Global Trajectory-level Grounding
Experiments
Language-conditioned Behavior Cloning Results
Universal Reward Learning
...and 25 more sections

Figures (19)

Figure 1: Implicit Preference Learning: Matched segments and instructions are preferred to mismatches. Thus, implicit preference learning inherently performs a trajectory-level contrastive learning that compares segments rather than single images.
Figure 2: Overview of the DecisionNCE framework. DecisionNCE jointly trains vision and language encoders to achieve trajectory-level representation alignment. The learned representations can be applied to various downstream decision-making tasks.
Figure 3: Implicit Preference. A segment is near-optimal for its associated language instruction but sub-optimal for other instructions.
Figure 4: Illustration of DecisionNCE-P and DecisionNCE-T.
Figure 5: Ablation on different numbers of frames used for preference learning.
...and 14 more figures

Theorems & Definitions (3)

Definition 3.1: DecisionNCE-P
Definition 3.2: DecisionNCE-T
proof
