Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring

Abhiroop Ajith, Constantinos Chamzas

Abstract

Learning abstractions directly from data is a core challenge in robotics. Humans naturally operate at an abstract level, reasoning over high-level subgoals while delegating execution to low-level motor skills -- an ability that enables efficient problem solving in complex environments. In robotics, abstractions and hierarchical reasoning have long been central to planning, yet they are typically hand-engineered, demanding significant human effort and limiting scalability. Automating the discovery of useful abstractions directly from visual data would make planning frameworks more scalable and more applicable to real-world robotic domains. In this work, we focus on rearrangement tasks where the state is represented with raw images, and propose a method to induce discrete, graph-structured abstractions by combining structural constraints with an attention-guided visual distance. Our approach leverages the inherent bipartite structure of rearrangement problems, integrating structural constraints and visual embeddings into a unified framework. This enables the autonomous discovery of abstractions from vision alone, which can subsequently support high-level planning. We evaluate our method on two rearrangement tasks in simulation and show that it consistently identifies meaningful abstractions that facilitate effective planning, and that it outperforms existing approaches.

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Image-To-Plan real-robot pipeline. We learn an action-labeled task graph from observed transitions $\mathcal{D}$. At test time, given start and goal RGB observations (top-left), we localize each image to an abstract node in the learned graph. We then plan between the localized nodes via discrete graph search (e.g., BFS) to obtain a high-level pick/place sequence, which the robot executes to transition the scene from start to goal (bottom).
  • Figure 2: Observation coloring with the action-uniqueness constraint. Each small circle corresponds to an observation. Observations with the same color belong to the same abstract node (same cluster). Top: invalid. Between the same pair of colored clusters we observe two different actions (e.g., $p_1$ and $p_2$), which violates our rule that there must be a unique action between any fixed pair of abstract states. Bottom: valid. Only one action connects the pair; any second action must lead to a different destination cluster.
  • Figure 3: Vision-guided graph coloring. Stage 1 (top-left): seed the pick side by greedy, vision-only grouping using $K_{\mathrm{vis}}$ (no relational checks). Stage 2 (top-middle): build the place conflict graph whose edges indicate merges that would violate Exactly-One; color with DSATUR, choosing among admissible colors by visual affinity. Stage 3 (bottom-middle): freeze place and recolor the pick side the same way; the bipartite graph now satisfies the constraints.
  • Figure 4: Grid sweep. Evaluating $(k_{\text{pick}},k_{\text{place}})$. Lower $\mathcal{J}$ is better. For Fruit-2$\times$3, the best is $(3,3)$. Blank cells indicate no solution was found for that pair.
  • Figure 5: Scalability on Fruit-4$\times$6. Top: TransCov/TransPrec vs. $G^\star$ as $|\mathcal{D}|$ increases. Bottom: runtime breakdown (distance vs. search).
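
The action-uniqueness rule in Figure 2 and the graph-search planning step in Figure 1 can be illustrated with a small sketch. This is not the authors' implementation: the observation IDs, action names, cluster labels, and function names below are all hypothetical, and the coloring is supplied by hand rather than learned from images. The sketch only shows how a coloring can be checked against the rule and how BFS over the resulting action-labeled graph yields a high-level action sequence.

```python
from collections import defaultdict, deque

def find_uniqueness_violations(transitions, color):
    """Return cluster pairs that are connected by more than one action.

    transitions: iterable of (src_obs, action, dst_obs) triples (hypothetical data).
    color: dict mapping each observation to its abstract node (cluster label).
    """
    actions_between = defaultdict(set)
    for src, action, dst in transitions:
        actions_between[(color[src], color[dst])].add(action)
    return {pair: acts for pair, acts in actions_between.items() if len(acts) > 1}

def build_abstract_graph(transitions, color):
    """Collapse observation-level transitions into an action-labeled cluster graph."""
    graph = defaultdict(dict)
    for src, action, dst in transitions:
        graph[color[src]][action] = color[dst]
    return graph

def bfs_plan(graph, start, goal):
    """Breadth-first search; returns a list of actions, or None if unreachable."""
    if start == goal:
        return []
    parent = {start: None}  # node -> (previous node, action taken)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for action, nxt in graph.get(node, {}).items():
            if nxt in parent:
                continue
            parent[nxt] = (node, action)
            if nxt == goal:
                plan, cur = [], goal
                while parent[cur] is not None:
                    prev, act = parent[cur]
                    plan.append(act)
                    cur = prev
                return plan[::-1]
            queue.append(nxt)
    return None

# Hypothetical observed transitions between six observations.
transitions = [
    ("o0", "pick_1", "o1"),
    ("o2", "pick_1", "o3"),
    ("o2", "pick_2", "o5"),
    ("o1", "place_2", "o4"),
]

# Valid coloring: the second action out of cluster A leads to a different cluster.
valid_color = {"o0": "A", "o2": "A", "o1": "B", "o3": "B", "o4": "C", "o5": "D"}
# Invalid coloring: o5 is merged into B, so A -> B carries two distinct actions.
invalid_color = dict(valid_color, o5="B")

violations_valid = find_uniqueness_violations(transitions, valid_color)
violations_invalid = find_uniqueness_violations(transitions, invalid_color)

graph = build_abstract_graph(transitions, valid_color)
plan = bfs_plan(graph, "A", "C")
```

With the valid coloring no cluster pair carries two actions, while merging `o5` into `B` makes the pair `("A", "B")` carry both `pick_1` and `pick_2`, which is the invalid case in the top half of Figure 2. In the full pipeline the start and goal clusters would come from localizing the start and goal images, a step this sketch omits.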