Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

Abstract

For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.
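
The input side of the global-to-local objective can be illustrated with a small NumPy sketch. It is not the authors' implementation: the image size, crop scale, patch grid, and the helper names `sample_local_crop` and `random_patch_mask` are assumptions for illustration, and the 90% mask ratio follows the example value given in the Figure 3 caption. The sketch only builds the source view, the local target crop, and the patch mask; the encoder and decoder are omitted.

```python
import numpy as np


def sample_local_crop(height, width, scale, rng):
    """Sample a crop box (top, left, crop_h, crop_w) lying fully inside the image.

    `scale` is the crop side length as a fraction of the image side
    (a hypothetical parameter, not taken from the paper).
    """
    crop_h = max(1, int(round(height * scale)))
    crop_w = max(1, int(round(width * scale)))
    top = int(rng.integers(0, height - crop_h + 1))
    left = int(rng.integers(0, width - crop_w + 1))
    return top, left, crop_h, crop_w


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    # Always keep at least one patch visible as a cue for the decoder.
    num_masked = min(num_masked, num_patches - 1)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


rng = np.random.default_rng(0)

# Global source view x^g: a random array stands in for a video frame (assumed 224x224).
x_g = rng.random((224, 224, 3))

# Local target view x^l: a crop taken from inside the source view.
top, left, crop_h, crop_w = sample_local_crop(224, 224, scale=0.5, rng=rng)
x_l = x_g[top:top + crop_h, left:left + crop_w]

# Heavy masking over an assumed 14x14 patch grid on the target view.
mask = random_patch_mask(num_patches=14 * 14, mask_ratio=0.9, rng=rng)
num_visible = int((~mask).sum())
```

With so few target patches left visible, most of the information needed for reconstruction has to come from the source view's bottleneck token, which is the pressure the objective relies on.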

Paper Structure (10 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Understanding what-moves-where requires knowing what-is-where. Robots interact with dynamic environments through compact state representations derived from visual observations. To reason about scene dynamics, the robot must understand which objects move and detect how their locations change between observations. This requires each state to capture what-is-where information in the scene so that subtle spatial differences between observations can be recognized.
  • Figure 2: Learning a visual state that captures pixel-level scene composition. We present a state representation learning (SRL) method that captures the entire scene composition, including object identities, locations, and their spatial relationships. Conceptually, the global state serves as a bottleneck token that holds the contextual information of the scene. By enforcing reconstruction of arbitrary cropped views from this contextual information, the model is encouraged to encode pixel-level scene composition in the global state.
  • Figure 3: Overview of CroBo. Given a global source view $\mathbf{x}^g$ and a local target view $\mathbf{x}^l$ cropped from $\mathbf{x}^g$, the encoder maps the source view to a single bottleneck token (i.e., the [CLS] token) and the target view to a few visible patch tokens under heavy masking (e.g., 90%). The decoder reconstructs the masked target patches from the visible target patch tokens together with the source bottleneck token. Because only a few target patches remain visible, the decoder must rely on the bottleneck token to provide the missing scene context, encouraging it to capture fine-grained what-is-where scene composition.
  • Figure 4: Reconstruction of CroBo. We visualize image reconstructions on CLEVR (Johnson et al., 2017) (columns 1-3), DAVIS (Pont-Tuset et al., 2017) (columns 4-5), and Franka Kitchen (Gupta et al., 2019) (column 6). Using the bottleneck token of the source view as context, CroBo reconstructs a highly masked (90%) target view cropped from the source view. The results show that CroBo representations effectively capture the global scene structure, including object identities, locations, and their spatial relationships.
  • Figure 5: Perceptual straightness of representation dynamics in video. (a) Local curvature of representation trajectories measured on DAVIS videos. Curvature is computed as the angle between consecutive representation differences, averaged across videos. Lower curvature indicates smoother temporal dynamics. CroBo consistently exhibits lower curvature than prior models such as DINOv2 and CropMAE, suggesting more temporally coherent representations that better preserve what-moves-where across frames. (b) Representation trajectories across video frames visualized using PCA. Each point corresponds to the representation of a frame, and colors indicate temporal progression. CroBo produces smooth and locally linear trajectories that follow the natural evolution of the scene, whereas DINOv2 and CropMAE produce irregular and tangled trajectories.
  • ...and 2 more figures
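
The curvature measure described in the Figure 5 caption (the angle between consecutive representation differences, averaged across videos) can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the authors' evaluation code: the function names `mean_curvature` and `dataset_curvature` are hypothetical, angles are reported in radians, and each video is assumed to be given as an array of per-frame representations with shape `(T, D)`.

```python
import numpy as np


def mean_curvature(z):
    """Mean angle (radians) between consecutive difference vectors.

    z: array of shape (T, D), one representation per frame, with T >= 3.
    """
    z = np.asarray(z, dtype=float)
    if z.ndim != 2 or z.shape[0] < 3:
        raise ValueError("expected an array of shape (T, D) with T >= 3")
    v = np.diff(z, axis=0)  # (T-1, D) frame-to-frame differences
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    # Guard against division by zero; a zero-length step becomes a zero
    # vector, so its angle with its neighbours evaluates to pi/2.
    v_hat = v / np.clip(norms, 1e-12, None)
    cos = np.sum(v_hat[:-1] * v_hat[1:], axis=1)  # (T-2,) cosines
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    return float(angles.mean())


def dataset_curvature(trajectories):
    """Average the per-video mean curvature over a collection of videos."""
    return float(np.mean([mean_curvature(z) for z in trajectories]))


# A straight trajectory has (numerically) zero curvature.
straight = np.outer(np.arange(5), np.ones(3))
# A trajectory with a single right-angle turn has curvature pi/2.
turn = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

curv_straight = mean_curvature(straight)
curv_turn = mean_curvature(turn)
curv_both = dataset_curvature([straight, turn])
```

Under this definition, lower values correspond to straighter and smoother representation trajectories, which is how the caption reads the comparison against DINOv2 and CropMAE.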