Token Bottleneck: One Token to Remember Dynamics

Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun

TL;DR

Token Bottleneck (ToBo) is a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. This encourages the vision backbone to embed temporal dependencies, enabling it to understand dynamic transitions across scenes.

Abstract

Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the reconstruction step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with a few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments, demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales. Code is available at https://github.com/naver-ai/tobo.
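The squeeze and reconstruction steps described in the abstract can be illustrated with a small data-flow sketch. This is not the authors' implementation: the function names, the 4x4 patch size, and the 0.9 mask ratio are assumptions made for illustration, and mean pooling stands in for the learned encoder that would produce the real bottleneck token. The sketch only shows what the decoder would receive: one bottleneck vector from the reference scene, a handful of visible target patches as hints, and the indices of the target patches it must predict.

```python
import random


def patchify(frame, patch):
    """Split an H x W frame (list of rows) into flattened, non-overlapping patches."""
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([frame[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def squeeze(patches):
    """Placeholder squeeze step: average all reference patches into one vector.

    ASSUMPTION: the real method uses a learned vision backbone here;
    mean pooling is only a stand-in so the sketch runs without a model.
    """
    count = len(patches)
    return [sum(p[i] for p in patches) / count for i in range(len(patches[0]))]


def pick_hints(num_patches, mask_ratio, rng):
    """Return sorted indices of the few target patches that stay visible."""
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_visible))


def build_decoder_input(reference, target, patch=4, mask_ratio=0.9, seed=0):
    """Assemble what a decoder would see: bottleneck vector, hints, masked indices.

    patch and mask_ratio are hypothetical defaults, not values from the paper.
    """
    rng = random.Random(seed)
    ref_patches = patchify(reference, patch)
    tgt_patches = patchify(target, patch)
    bottleneck = squeeze(ref_patches)                    # one vector for the whole reference scene
    visible = pick_hints(len(tgt_patches), mask_ratio, rng)
    hints = {i: tgt_patches[i] for i in visible}         # scarce target patches
    masked = [i for i in range(len(tgt_patches)) if i not in hints]
    return bottleneck, hints, masked
```

With a 16x16 frame and 4x4 patches there are 16 target patches, so a 0.9 mask ratio leaves only 2 visible hints and 14 patches to predict. Because so little of the target is visible, a decoder in this setup would have to draw most of its information from the single bottleneck vector, which is the pressure the abstract describes.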

Paper Structure

This paper contains 48 sections, 1 equation, 11 figures, 10 tables.

Figures (11)

  • Figure 1: (a) We describe the underlying mechanism of our Token Bottleneck (ToBo) pipeline during pre-training, which conservatively encodes a reference scene into a bottleneck token and predicts the subsequent target scene from a few target patches and the bottleneck token. ToBo facilitates learning to recognize temporal progression and to preserve observed information (top). Therefore, using bottleneck tokens from the current and recent past observations enables the robot to better understand its current state (bottom). (b) Our method significantly surpasses previous self-supervised visual representation learning methods designed for static scenes (DINO, SimCLR, MoCo v3, MAE) and dynamic scenes (SiamMAE, RSP, CroCo) on various robot manipulation and locomotion tasks.
  • Figure 2: Comparative analysis for motivation. We compare robot manipulation performance using MAE and SiamMAE as visual backbones. While SiamMAE employs temporal correspondence to address the limitation of MAE, its improvement over MAE remains limited.
  • Figure 3: Overview of our Token Bottleneck (ToBo). Our ToBo reconstructs the masked patches from the bottleneck token representation of the reference scene $\mathbf{x}^{t}$ and extremely scarce patches from the target scene $\mathbf{x}^{t+k}$. Such extreme scarcity leads the decoder $d_{\phi}$ to rely heavily on the reference scene $\mathbf{x}^{t}$, facilitating the preservation of observed information in the bottleneck token.
  • Figure 4: Performance on real-world vision-based robot policy learning. Success rates (%) of imitation learning agents on three manipulation tasks: Cabinet Opening, Drawer Closing, and Cup Stacking. Agents are trained with ViT-S/16 representations pre-trained on Kinetics-400 (Kay et al., 2017) for 400 epochs. The results demonstrate the generalizability of ToBo to real-world settings.
  • Figure 5: Semantic Part Propagation
  • ...and 6 more figures