PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement

Shutong Jin; Ruiyu Wang; Kuangyi Chen; Florian T. Pokorny

TL;DR

PACA addresses the challenge of zero-shot scene rearrangement by deriving object-level representations directly from diffusion-based goal generation, enabling perspective-controlled 6-DoF manipulation. It introduces a joint cross-attention representation that fuses generation, segmentation, and feature encoding into a single step, mitigating the error accumulation of multi-step pipelines. The method supports perspective control through Hough Transform–driven alignment and ControlNet, extending beyond traditional 3-DoF top-down settings. Real-robot experiments demonstrate competitive human-satisfaction scores and zero-shot performance across various scenes, with an average matching accuracy of 87% and an execution success rate of 67%, highlighting the practical value of web-scale generative models for scalable robotic manipulation.

Abstract

Scene rearrangement, such as table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages a perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and going beyond earlier approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.
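The abstract describes matching objects in the real scene to objects in the generated goal and then computing per-object pose transformations. The sketch below illustrates that general idea only, and is not the authors' implementation. It assumes each object is already summarized by a feature vector (a hypothetical stand-in for the cross-attention representation), matches objects by maximizing total cosine similarity with a brute-force search, and computes a simplified planar (x, y, theta) transform rather than the 6-DoF case the paper targets. All function names and the example data are illustrative.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_objects(real_feats, goal_feats):
    """For each real object, return the index of its matched goal object.

    Brute-force search over all one-to-one assignments; only practical for a
    handful of objects. (Assumption: both scenes contain the same number of
    objects.)
    """
    n = len(real_feats)
    if n != len(goal_feats):
        raise ValueError("expected the same number of objects in both scenes")
    sim = [[cosine(r, g) for g in goal_feats] for r in real_feats]
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


def planar_transform(real_pose, goal_pose):
    """Translation and rotation taking (x, y, theta) to the goal pose.

    The rotation is wrapped to [-pi, pi]. Simplified planar stand-in for a
    full 6-DoF pose transformation.
    """
    dx = goal_pose[0] - real_pose[0]
    dy = goal_pose[1] - real_pose[1]
    d = goal_pose[2] - real_pose[2]
    dtheta = math.atan2(math.sin(d), math.cos(d))
    return dx, dy, dtheta


# Illustrative data: three objects with made-up 3-D features and poses.
real_feats = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
goal_feats = [[0.0, 0.9, 0.1], [0.2, 0.1, 1.0], [0.9, 0.0, 0.1]]
real_poses = [(0.0, 0.0, 0.0), (0.3, 0.1, 0.5), (0.6, 0.2, -0.2)]
goal_poses = [(0.1, 0.4, 0.0), (0.5, 0.4, 0.0), (0.3, 0.4, 0.0)]

assignment = match_objects(real_feats, goal_feats)
moves = [
    planar_transform(real_poses[i], goal_poses[j])
    for i, j in enumerate(assignment)
]
```

Here `assignment[i]` gives the goal object paired with real object `i`, and `moves[i]` is the planar displacement a robot would need to apply to it. For larger scenes, the brute-force search would typically be replaced by a polynomial-time assignment solver.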
