Co-Scale Cross-Attentional Transformer for Rearrangement Target Detection

Haruka Matsuo, Shintaro Ishikawa, Komei Sugiura

TL;DR

This paper tackles Rearrangement Target Detection (RTD): given a goal state and a current state, detect the objects that must be rearranged. It introduces a Co-Scale Cross-Attentional Transformer with a Serial Encoder that processes two parallel streams (goal and current) and a Cross-Attentional Encoder that models the relationship between them; a decoder then produces the change mask. The approach outperforms baselines on a newly built RTD dataset in AI2-THOR, achieving an $F_1$-score of $80.3\%$ and a mean IoU (mIoU) of $48.6\%$, and it handles changes in door and drawer angles and objects with complex shapes. The work aims at more reliable RTD for domestic service robots; future work includes real-world robot deployment and extension to broader Scene Change Detection (SCD) tasks.

Abstract

Rearranging objects (e.g., a vase or a door) back to their original positions is one of the most fundamental skills for domestic service robots (DSRs). In rearrangement tasks, it is crucial to detect the objects that need to be rearranged by comparing the goal and current states. In this study, we focus on Rearrangement Target Detection (RTD), where the model generates a change mask for objects that should be rearranged. Although many studies have been conducted in the field of Scene Change Detection (SCD), most SCD methods fail to segment objects with complex shapes and fail to detect changes in the angle of objects that can be opened or closed. In this study, we propose a Co-Scale Cross-Attentional Transformer for RTD. We introduce the Serial Encoder, which consists of a sequence of serial blocks, and the Cross-Attentional Encoder, which models the relationship between the goal and current states. We built a new dataset consisting of RGB images of the goal and current states and the corresponding change masks. We validated our method on this dataset, and the results demonstrated that it outperformed baseline methods on $F_1$-score and mean IoU.
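To make the two reported metrics concrete, here is a minimal sketch of how an $F_1$-score and a mean IoU can be computed for a pair of binary change masks. The exact evaluation protocol is not given on this page, so several choices are assumptions: masks are nested lists of 0/1, $F_1$ is computed on the "change" class, mean IoU averages the IoU of the "change" and "no change" classes, and an empty denominator is scored as 1.0. The function names are hypothetical.

```python
def confusion_counts(pred, target):
    """Count (tp, fp, fn, tn) over two binary masks given as nested lists of 0/1."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def f1_score(pred, target):
    """F1 on the 'change' class: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    # Assumed convention: no change pixels in either mask counts as a perfect score.
    return 2 * tp / denom if denom else 1.0


def mean_iou(pred, target):
    """Mean of the IoU for the 'change' class and the 'no change' class (assumed definition)."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    ious = []
    for intersection, union in ((tp, tp + fp + fn), (tn, tn + fp + fn)):
        if union:  # skip a class that is absent from both masks
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 1.0


# Toy 2x3 masks, purely illustrative (not data from the paper).
target_mask = [[0, 1, 1],
               [0, 0, 1]]
predicted_mask = [[0, 1, 0],
                  [0, 1, 1]]

print(f1_score(predicted_mask, target_mask))
print(mean_iou(predicted_mask, target_mask))
```

In the toy example there are two true positives, one false positive, one false negative, and two true negatives, so both class-wise IoUs are 2/4 and the $F_1$-score is 4/6.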

Paper Structure (19 sections, 12 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Our method overview: Given images of the goal and current states, our method generates a change mask for objects that should be rearranged.
  • Figure 2: An example of RTD. From left to right: goal state, current state, and change mask.
  • Figure 3: The framework of our method. Our model consists of three main modules: Serial Encoder, Cross-Attentional Encoder, and Decoder.
  • Figure 4: Qualitative results of successful samples. From left to right: $\bm{x}_{\mathrm{goal}}$, $\bm{x}_{\mathrm{cur}}$, $\bm{y}$, $\hat{\bm{y}}$ obtained by the CSCDNet, and $\hat{\bm{y}}$ obtained by the proposed method. (a) The proposed method generated an almost complete segmentation of the drawer shape. (b) The proposed method successfully generated masks of all small objects.
  • Figure 5: Qualitative results of a failed sample. From left to right: $\bm{x}_{\mathrm{goal}}$, $\bm{x}_{\mathrm{cur}}$, $\bm{y}$, $\hat{\bm{y}}$ obtained by the CSCDNet, and $\hat{\bm{y}}$ obtained by the proposed method. Both the CSCDNet and the proposed method incorrectly segmented the shadows caused by the opening of the drawers.
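
The Cross-Attentional Encoder named in Figure 3 is described only as modeling the relationship between the goal and current states; its exact formulation is not given on this page. As a rough illustration of the general idea, the sketch below implements plain scaled dot-product cross-attention, in which tokens from one stream (queries) attend to tokens from the other stream (keys and values). It is an assumption-laden toy, not the authors' module: there are no learned projections, no multiple heads, and no co-scale mechanism, and all names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as nested lists."""
    return [list(row) for row in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends to all key tokens.

    Returns (attended, weights), where weights[i][j] is how strongly
    query token i attends to key token j.
    """
    dim = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, values), weights


# Toy token features (hypothetical values): 2 goal tokens, 3 current tokens, dimension 2.
goal_tokens = [[1.0, 0.0],
               [0.0, 1.0]]
current_tokens = [[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]]

# Goal tokens query the current-state tokens; swapping the arguments gives the other direction.
attended, weights = cross_attention(goal_tokens, current_tokens, current_tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every goal token receives a convex combination of current-state features; a real encoder would apply learned query, key, and value projections before this step.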