Information-Theoretic Constraints for Continual Vision-Language-Action Alignment

Libang Zhao, Qixin Zeng, Hongyin Zhang, Donglin Wang

Abstract

When deployed in open-ended robotic environments, Vision-Language-Action (VLA) models must continually acquire new skills, yet they suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure: dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. Existing continual learning methods fail to preserve these cross-modal information dependencies. We therefore propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves the dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Experiments on the LIBERO benchmark show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.

Paper Structure (19 sections, 8 equations, 5 figures, 3 tables, 1 algorithm)


Figures (5)

  • Figure 1: Deterioration of cross-modal information structure. Once training on Task 1 ends, attention measured on that same task begins to diffuse.
  • Figure 2: The proposed architecture consists of two main components. Left: the model’s data flow, covering input feature composition, prediction generation, and the structured action space representation. Right: the key approach, which uses replay-anchored contrastive learning to retain historical task knowledge and vision–language mutual information regularization to enforce cross-modal structural consistency.
  • Figure 3: The evolution of success rates for our method compared with five baseline methods on the LIBERO-Long $B5\text{-}5N1$ benchmark. Solid curves represent the average success rates across three runs with different random seeds, and the shaded areas show the standard deviation.
  • Figure 4: Ablation Study on Mutual Information Structure Preservation. BASE denotes the representation structure immediately after learning the target task. ER and Ours illustrate the representation structure of the same task after subsequent training on a new task. Red boxes indicate similarity, while orange boxes indicate dissimilarity.
  • Figure 5: Ablation study. $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the contributions of the RAC and CMI losses. When $\lambda_1$ is varied, $\lambda_2$ is fixed at 0.1; when $\lambda_2$ is varied, $\lambda_1$ is fixed at 0.1. The plot reports final average accuracy for varying values of $\lambda$.
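
To make the role of $\lambda_1$ and $\lambda_2$ concrete, the sketch below shows one way a task loss could be combined with two weighted regularizers. It is an illustration under stated assumptions, not the paper's implementation: the abstract and captions do not specify the exact form of the RAC and CMI losses, so both are approximated here with a generic InfoNCE contrastive loss (minimizing InfoNCE maximizes a lower bound on mutual information). The function names `info_nce` and `total_loss`, the temperature value, and the additive form of the objective are all hypothetical; only the default weights of 0.1 come from the Figure 5 caption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair.

    All other keys in the batch act as negatives. The temperature is an
    assumed value, not one taken from the paper.
    """
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(q)


def total_loss(task_loss, student_feats, teacher_anchors,
               vision_feats, language_feats,
               lambda_1=0.1, lambda_2=0.1):
    """Hypothetical combined objective: task + lambda_1 * RAC + lambda_2 * CMI."""
    # Stand-in for RAC: pull student features on replayed samples toward
    # the anchors produced by a frozen teacher (assumed formulation).
    rac = info_nce(student_feats, teacher_anchors)
    # Stand-in for CMI: contrast paired vision and language features,
    # which maximizes an InfoNCE lower bound on their mutual information.
    cmi = info_nce(vision_feats, language_feats)
    return task_loss + lambda_1 * rac + lambda_2 * cmi
```

In this sketch, setting `lambda_1` or `lambda_2` to zero removes the corresponding term, which mirrors the one-at-a-time sweep described in the Figure 5 caption.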