Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

Fanqi Yu; Matteo Tiezzi; Tommaso Apicella; Cigdem Beyan; Vittorio Murino

TL;DR

This work introduces a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. It also introduces an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness.

Abstract

We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot-state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art on the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
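To make the idea of replaying stored latent representations concrete, the following is a minimal sketch of a fixed-capacity buffer that keeps latent features rather than raw observations. It is an illustration under stated assumptions, not the authors' implementation: the class name `LatentReplayBuffer`, the `(latent, action, task_id)` entry layout, and the use of reservoir sampling as the storage rule are all hypothetical choices made here. The abstract says only that compact multimodal representations are stored and reused.

```python
import random


class LatentReplayBuffer:
    """Hypothetical fixed-size buffer of latent features (not the paper's code)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []   # stored (latent, action, task_id) tuples
        self.seen = 0     # total number of entries offered so far
        self.rng = random.Random(seed)

    def add(self, latent, action, task_id):
        """Offer one entry; reservoir sampling (an assumption) decides if it is kept."""
        self.seen += 1
        entry = (latent, action, task_id)
        if len(self.items) < self.capacity:
            self.items.append(entry)
        else:
            # Every entry seen so far ends up stored with equal probability.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = entry

    def sample(self, batch_size):
        """Draw a replay batch without replacement, capped at the buffer size."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


# Usage: offer 100 toy latents from 3 tasks to a buffer of capacity 4.
buf = LatentReplayBuffer(capacity=4)
for i in range(100):
    buf.add([float(i)], [0.0], i % 3)
replay_batch = buf.sample(2)
```

Because only low-dimensional features are kept, the memory cost per stored sample is governed by the latent size rather than by image resolution, which is the practical motivation for replaying in latent space.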

Paper Structure (28 sections, 12 equations, 6 figures, 11 tables)

Introduction
Related Work
Methodology
Problem Formulation
Our Approach
Implementation Details
Details for Base Policy.
Experimental Analysis
Benchmark Suite and Datasets.
Results
Comparison with SOTA Methods
Ablation Study
Conclusion
Extended Implementation Details
Our Framework
...and 13 more sections

Figures (6)

Figure 1: Illustration of Incremental Feature Adjustment (IFA). The figure displays a 2D projection of the global latent representations $\mathbf{g}$ during policy rollout for two related tasks, $T_j$ (previously learned) and $T_k$ (newly learned). The stars ($\mathbf{\color{red}\star}$ for $T_j$ and $\mathbf{\color{green}\star}$ for $T_k$) represent the stable language reference embeddings $\mathbf{h}^{(r)}$, while the circles ($\mathbf{\color{red}\bullet}$ for $T_j$ and $\mathbf{\color{green}\bullet}$ for $T_k$) are the global embeddings $\mathbf{g}$. (Left) Without IFA, the new task's embeddings ($\mathbf{\color{green}\bullet}$) exhibit representation drift by clustering close to the old task's embeddings ($\mathbf{\color{red}\bullet}$). (Right) With IFA, the loss $\mathcal{L}_{\text{IFA}}$ enforces a constraint on the distances: the distance to a task's own reference ($D_{\text{own}}$) plus a margin $\delta$ must be less than or equal to the distance to the other task's reference ($D_{\text{other}}$). This mechanism pushes the $\mathbf{g}(T_k)$ cluster ($\mathbf{\color{green}\bullet}$) away from $\mathbf{h}^{(r)}(T_j)$ ($\mathbf{\color{red}\star}$) and closer to $\mathbf{h}^{(r)}(T_k)$ ($\mathbf{\color{green}\star}$), achieving inter-task disentanglement.
Figure 2: Our method is a general multimodal architecture composed of modality-specific encoders (language, vision, and state), a modulation network, a temporal decoder, and a policy head. During the pretraining phase, all architecture modules are trained. In the lifelong learning phase, only the temporal decoder and policy head are updated, using both the new task data and the samples stochastically stored in the replay buffer. The buffer stores the multimodal features output by the modulation layer. The model is jointly supervised by the Behavior Cloning loss and the Incremental Feature Adjustment loss, applied to both the current task and previously stored tasks.
Figure 3: Comparison between cosine distance and angle-based IFA loss calculation.
Figure 4: UMAP visualization of the global latent representation $g(T_k)$ obtained when not enforcing the IFA loss (left) or when enforcing it (right). Each color represents the global latent representations of one task.
Figure 5: UMAP visualization of the global latent representations $g(T_k)$ obtained when not enforcing the IFA loss. Each color represents the global latent representations of one task.
...and 1 more figures
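
The margin constraint described in the Figure 1 caption, $D_{\text{own}} + \delta \le D_{\text{other}}$, can be illustrated with a simple hinge penalty that is zero whenever the constraint holds. This is a sketch under stated assumptions rather than the paper's definition of $\mathcal{L}_{\text{IFA}}$: the hinge form, the use of the angle between vectors as the distance, and the function names `angle` and `margin_hinge` are hypothetical choices made here for illustration.

```python
import math


def angle(u, v):
    """Angle in radians between two non-zero vectors (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard acos against floating-point overshoot.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def margin_hinge(g, h_own, h_other, delta):
    """Penalty for violating D_own + delta <= D_other (hypothetical hinge form)."""
    d_own = angle(g, h_own)      # distance to the task's own reference
    d_other = angle(g, h_other)  # distance to the other task's reference
    return max(0.0, d_own + delta - d_other)


# Toy 2D references for two tasks.
h_k = [1.0, 0.0]   # reference of the new task T_k
h_j = [0.0, 1.0]   # reference of the old task T_j

aligned = margin_hinge([1.0, 0.0], h_k, h_j, delta=0.5)    # constraint satisfied
ambiguous = margin_hinge([1.0, 1.0], h_k, h_j, delta=0.5)  # equidistant, violated
```

In the toy usage, an embedding aligned with its own reference incurs no penalty, while an embedding equidistant from both references is penalized by roughly the margin, which corresponds to the left and right panels the caption contrasts.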
