The Temporal Trap: Entanglement in Pre-Trained Visual Representations for Visuomotor Policy Learning
Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Chris Xiaoxuan Lu, Oisin Mac Aodha
TL;DR
This work identifies temporal entanglement as a core limitation of time-invariant pre-trained visual representations (PVRs) in visuomotor policy learning. It introduces probes that quantify short-range and long-range temporal entanglement and shows that these measures correlate strongly with policy performance. A simple temporal disentanglement baseline, which embeds the task-progression signal, outperforms conventional feature-augmentation methods and yields substantial policy gains. The findings argue for temporally aware PVRs and provide practical baselines and insights to guide the development of temporally structured representations for robotic control.
Abstract
The integration of pre-trained visual representations (PVRs) has significantly advanced visuomotor policy learning. However, effectively leveraging these models remains a challenge. We identify temporal entanglement as a critical, inherent issue when using these time-invariant models in sequential decision-making tasks. This entanglement arises because PVRs, optimised for static image understanding, struggle to represent the temporal dependencies crucial for visuomotor control. In this work, we quantify the impact of temporal entanglement, demonstrating a strong correlation between a policy's success rate and the ability of its latent space to capture task-progression cues. Based on these insights, we propose a simple yet effective disentanglement baseline designed to mitigate temporal entanglement. Our empirical results show that traditional methods for enriching features with temporal components are insufficient on their own, highlighting the need to address temporal disentanglement explicitly for robust visuomotor policy learning.
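To make the idea of a task-progression signal concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how the signal is embedded, so everything here is an assumption made for illustration: the sinusoidal encoding, the frequency schedule, the names `progress_embedding` and `augment_features`, and the use of a normalised timestep `t / horizon` as the progress signal. The point it illustrates is that two visually identical frames, which a time-invariant PVR maps to the same features, become distinguishable once a progress embedding is appended.

```python
import math


def progress_embedding(t, horizon, dim=8):
    """Sinusoidal encoding of normalised task progress (hypothetical choice).

    t        current timestep in the episode
    horizon  assumed episode length used to normalise t into [0, 1]
    dim      size of the embedding; must be even (half sines, half cosines)
    """
    if dim % 2 != 0 or horizon <= 0:
        raise ValueError("dim must be even and horizon must be positive")
    progress = min(max(t / horizon, 0.0), 1.0)
    # Assumed frequency schedule: pi, 2*pi, 4*pi, ...
    angles = [progress * math.pi * (2 ** k) for k in range(dim // 2)]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def augment_features(features, t, horizon, dim=8):
    """Concatenate frozen per-frame features with the progress embedding."""
    return list(features) + progress_embedding(t, horizon, dim)


# Stand-in for a frozen PVR embedding of one frame (real ones are much larger).
frame_features = [0.1, -0.4, 0.7]

# The same visual features at the start and at the end of an episode.
start = augment_features(frame_features, t=0, horizon=100)
end = augment_features(frame_features, t=100, horizon=100)
```

In this sketch the visual part of `start` and `end` is identical while the appended part differs, so a downstream policy head would receive distinct inputs for the two timesteps. Whether concatenation or some other fusion is preferable is a design choice the abstract leaves open.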
