PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement

Shutong Jin, Ruiyu Wang, Kuangyi Chen, Florian T. Pokorny

TL;DR

PACA addresses zero-shot scene rearrangement by deriving object-level representations directly from diffusion-based goal generation, enabling perspective-controlled 6-DoF manipulation. It introduces a joint cross-attention representation that fuses generation, segmentation, and feature encoding into one step, which mitigates the error accumulation of multi-step pipelines. Perspective control is provided through Hough Transform–driven alignment and ControlNet, extending the task beyond traditional 3-DoF top-down settings. Zero-shot experiments on real robots report competitive human-satisfaction scores, an average matching accuracy of 87%, and an execution success rate of 67%, suggesting that web-scale generative models can support practical robotic manipulation.

Abstract

Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.
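
As a rough illustration of the matching and pose-transformation step mentioned in the abstract, the following Python sketch pairs objects in a real scene with objects in a generated goal and computes a planar (3-DoF) pose change. It is not PACA's implementation: the per-object feature vectors, the cosine-similarity score, the brute-force assignment, and the names `cosine`, `match_objects`, and `planar_transform` are all assumptions made for illustration.

```python
import math
from itertools import permutations

# Hypothetical helpers for illustration; not taken from the PACA paper.

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_objects(real_feats, goal_feats):
    """Pair each real object with one goal object, maximizing total similarity.

    Brute force over permutations, which is only practical for a handful
    of tabletop objects.
    """
    real_names = sorted(real_feats)
    goal_names = sorted(goal_feats)
    if len(real_names) != len(goal_names):
        raise ValueError("this sketch assumes equal object counts")
    best_perm, best_score = None, -math.inf
    for perm in permutations(goal_names):
        score = sum(cosine(real_feats[r], goal_feats[g])
                    for r, g in zip(real_names, perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return dict(zip(real_names, best_perm))

def planar_transform(real_pose, goal_pose):
    """Return (dx, dy, dtheta) that moves a pose (x, y, theta) onto the goal."""
    dx = goal_pose[0] - real_pose[0]
    dy = goal_pose[1] - real_pose[1]
    # Wrap the rotation difference into [-pi, pi).
    dtheta = (goal_pose[2] - real_pose[2] + math.pi) % (2 * math.pi) - math.pi
    return dx, dy, dtheta

# Made-up 2-D features; real features would be far higher dimensional.
real = {"fork": [1.0, 0.1], "plate": [0.1, 1.0]}
goal = {"obj_a": [0.2, 0.9], "obj_b": [0.9, 0.2]}
pairs = match_objects(real, goal)
```

Once objects are paired, `planar_transform` gives the translation and rotation a robot would need to apply in the top-down case; the 6-DoF setting targeted by PACA needs a full 3-D pose and is outside the scope of this sketch.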

Paper Structure

This paper contains 30 sections, 17 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: These images were generated with Stable Diffusion and the prompt "One apple, one orange, and one banana in the plate". The generated images exhibit several issues: (a) distortion, (b) introduction of new objects, and (c) unmatched image perspectives. An illustration of multi-step design: (d) Generation, segmentation, and feature encoding are performed using three different models. During segmentation, repetitive cropping occurs due to the distortion in generated goals, introducing errors to the subsequent feature encoding process. The segmented patches were produced by Mask R-CNN (He et al., 2017). Our integrated design: (e) Segmentation and feature encoding are conducted concurrently with the generation.
  • Figure 2: Pipeline of the proposed PACA. Generation, segmentation and feature encoding are integrated into a single step. The prompt is directly utilized to produce the generated goal. The real image is first inverted to its noisy counterpart and then reconstructed to extract segmentation and features for matching. The cross-attention maps highlight segmentation and encoded features specific to each object's descriptors, thereby facilitating the matching process and transformation calculations.
  • Figure 3: Controlled generation with Hough transform.
  • Figure 4: (a) Reconstructed images and cross-attention maps $M(\text{"Plate"}, t)$, $M(\text{"Fork"}, t)$ at different timesteps $t$ during denoising. (b) Conceptual embedding containing object region information at timestep $t = T/2$. (c) Perceptual embedding containing object feature information at timestep $t = 1$. The joint cross-attention representation combines the cross-attention maps from these two timesteps (the paper's "joint" equation).
  • Figure 5: (a) Experimental setup; (b) Examples of 3-DoF top-down rearrangement; (c) Examples of 6-DoF rearrangement. In (b) and (c), the first row shows the goal and the second row shows the final rearrangement. Because generated goals are stochastic and may contain distortions or newly introduced objects, the objects relevant to the prompt are outlined in red. Objects not specified in the prompt do not affect the pipeline, as they lack an object-level representation. Full executions can be found in the supplementary material.
  • ...and 1 more figures
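
The Hough transform named in Figure 3 can be sketched generically as follows. This summary does not say how PACA feeds the detected lines into ControlNet, so the block only shows the standard line-voting step in pure Python; the function names `hough_accumulator` and `strongest_line`, the one-degree angular resolution, and the synthetic edge points are assumptions for illustration.

```python
import math

# Generic Hough line transform; hypothetical names, not PACA's code.

def hough_accumulator(edge_points, width, height, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel (x, y).

    A line is parameterized as rho = x*cos(theta) + y*sin(theta),
    with theta sampled uniformly in [0, pi).
    """
    diag = int(math.ceil(math.hypot(width, height)))
    thetas = [math.pi * k / n_theta for k in range(n_theta)]
    # Row index is rho + diag, so negative rho values fit in the table.
    acc = [[0] * n_theta for _ in range(2 * diag + 1)]
    for x, y in edge_points:
        for k, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + diag][k] += 1
    return acc, thetas, diag

def strongest_line(edge_points, width, height, n_theta=180):
    """Return (rho, theta, votes) for the accumulator cell with most votes."""
    acc, thetas, diag = hough_accumulator(edge_points, width, height, n_theta)
    votes, row, k = max(
        (acc[r][c], r, c) for r in range(len(acc)) for c in range(n_theta)
    )
    return row - diag, thetas[k], votes

# Synthetic edge map: a horizontal line y = 5 across a 100 x 20 image.
points = [(x, 5) for x in range(100)]
rho, theta, votes = strongest_line(points, width=100, height=20)
```

For the synthetic horizontal line, the winning cell has a normal angle of 90 degrees and an offset equal to the line's y coordinate. On real images the edge points would come from an edge detector, and a library routine would usually replace this loop.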