Co-Scale Cross-Attentional Transformer for Rearrangement Target Detection
Haruka Matsuo, Shintaro Ishikawa, Komei Sugiura
TL;DR
This paper tackles Rearrangement Target Detection (RTD): given a goal state and a current state, identify the objects that must be rearranged. It introduces a Co-Scale Cross-Attentional Transformer in which a Serial Encoder processes two parallel streams (goal and current), a Cross-Attentional Encoder models the relationship between them, and a decoder produces a change mask. On a newly built RTD dataset in AI2-THOR, the approach outperforms baselines, achieving an $F_1$-score of $80.3\%$ and a mean IoU (mIoU) of $48.6\%$, and it handles changes in door and drawer angles as well as objects with complex shapes. The work aims at more reliable RTD for domestic service robots and offers a path toward Scene Change Detection (SCD) tasks; future work includes real-world robot deployment and broader SCD applications.
Abstract
Rearranging objects (e.g., a vase or a door) back to their original positions is one of the most fundamental skills for domestic service robots (DSRs). In rearrangement tasks, it is crucial to detect which objects need to be rearranged, given the goal and current states. In this study, we focus on Rearrangement Target Detection (RTD), where the model generates a change mask for the objects that should be rearranged. Although many studies have been conducted in the field of Scene Change Detection (SCD), most SCD methods struggle to segment objects with complex shapes and to detect changes in the angle of objects that can be opened or closed. We propose a Co-Scale Cross-Attentional Transformer for RTD. We introduce the Serial Encoder, which consists of a sequence of serial blocks, and the Cross-Attentional Encoder, which models the relationship between the goal and current states. We built a new dataset consisting of RGB images of the goal and current states together with the corresponding change masks. We validated our method on this dataset, and the results demonstrate that it outperforms baseline methods on $F_1$-score and mean IoU.
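To make the two reported metrics concrete, the following is a minimal sketch of how $F_1$-score and mean IoU could be computed for binary change masks. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names `f1_and_iou` and `mean_iou` are hypothetical, masks are assumed to be nested lists of 0/1 values, IoU is assumed to be taken over the change class only and averaged per image, and two empty masks are assumed to count as a perfect match. The paper's exact averaging convention may differ.

```python
def f1_and_iou(pred, target):
    """Return (F1, IoU) of the change class for one pair of binary masks.

    pred and target are equally sized nested lists of 0/1 values
    (an assumed representation chosen for this sketch).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # changed pixel correctly detected
            elif p and not t:
                fp += 1      # unchanged pixel flagged as changed
            elif t and not p:
                fn += 1      # changed pixel missed
    f1_denominator = 2 * tp + fp + fn
    iou_denominator = tp + fp + fn
    # Assumption: if neither mask contains a changed pixel, score 1.0.
    f1 = 2 * tp / f1_denominator if f1_denominator else 1.0
    iou = tp / iou_denominator if iou_denominator else 1.0
    return f1, iou


def mean_iou(preds, targets):
    """Average the per-image change-class IoU over a list of mask pairs."""
    ious = [f1_and_iou(p, t)[1] for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    target = [[1, 0, 0],
              [0, 1, 1]]
    f1, iou = f1_and_iou(pred, target)
    print(f"F1 = {f1:.3f}, IoU = {iou:.3f}")
```

In this toy pair there are two true positives, one false positive, and one false negative, so $F_1 = 4/6 \approx 0.667$ and IoU $= 2/4 = 0.5$.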
