Tidiness Score-Guided Monte Carlo Tree Search for Visual Tabletop Rearrangement

Hogun Kee, Wooseok Oh, Minjae Kang, Hyemin Ahn, Songhwai Oh

TL;DR

The paper addresses tabletop tidying without explicit target configurations by introducing TTU, a large-scale dataset, and a Tidiness Score-Guided Monte Carlo Tree Search (TSMCTS) framework. TSMCTS combines a vision-based tidiness discriminator, a tidying policy learned via offline reinforcement learning, and an MCTS high-level planner that optimizes object rearrangements under a tidiness utility $\Psi(O, P)$, with $P$ representing 6-DoF placements. Key results show generalization to unseen objects and real-world scenes, with around 85% success on real robots and a tidiness score near 0.90 in simulation, using a human-aligned tidiness threshold of $\xi \approx 0.85$. The work demonstrates the practical feasibility of RGB-D guided tabletop tidying and provides a dataset and framework that can be extended with language or semantic guidance for broader applications.

Abstract

In this paper, we present the tidiness score-guided Monte Carlo tree search (TSMCTS), a novel framework designed to address the tabletop tidying up problem using only an RGB-D camera. We address two major challenges of the tabletop tidying up problem: (1) the lack of public datasets and benchmarks, and (2) the difficulty of specifying the goal configuration of unseen objects. We address the former by presenting the tabletop tidying up (TTU) dataset, a structured dataset collected in simulation. Using this dataset, we train a vision-based discriminator capable of predicting the tidiness score. This discriminator can consistently evaluate the degree of tidiness across unseen configurations, including real-world scenes. To address the second challenge, we employ Monte Carlo tree search (MCTS) to find tidying trajectories without specifying explicit goals. We demonstrate that our MCTS-based planner can find diverse tidied configurations using the tidiness score as guidance. Consequently, we propose TSMCTS, which integrates a tidiness discriminator with an MCTS-based tidying planner to find optimal tidied arrangements. TSMCTS has successfully demonstrated its capability across various environments, including coffee tables, dining tables, office desks, and bathrooms. The TTU dataset is available at: https://github.com/rllab-snu/TTU-Dataset.
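
To make the idea of searching without an explicit goal concrete, the following is a minimal sketch of a score-guided MCTS on a toy problem. It is not the authors' implementation: the 5x5 grid, the hand-written `toy_tidiness` function (a stand-in for the learned discriminator), the uniformly random rollouts (a stand-in for the learned tidying policy), and all function names are assumptions made for illustration only.

```python
import math
import random

GRID = 5  # hypothetical 5x5 discretization of the table (assumption)

def toy_tidiness(state):
    """Stand-in for the learned discriminator: fraction of objects sharing one row."""
    rows = [r for r, _ in state]
    return max(rows.count(r) for r in set(rows)) / len(state)

def actions(state):
    """All pick-and-place moves: (object index, free target cell)."""
    occupied = set(state)
    return [(i, (r, c)) for i in range(len(state))
            for r in range(GRID) for c in range(GRID) if (r, c) not in occupied]

def step(state, action):
    """Move one object to a new cell and return the new state tuple."""
    i, cell = action
    return state[:i] + (cell,) + state[i + 1:]

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}               # action -> Node
        self.untried = actions(state)    # actions not yet expanded
        self.visits, self.value = 0, 0.0

def uct_child(node, c=1.0):
    """Pick the child maximizing mean value plus a UCT exploration bonus."""
    return max(node.children.values(),
               key=lambda n: n.value / n.visits
               + c * math.sqrt(math.log(node.visits) / n.visits))

def rollout(state, rng, depth=3):
    """Random rollout (assumption); reward is the best score seen along the way."""
    best = toy_tidiness(state)
    for _ in range(depth):
        state = step(state, rng.choice(actions(state)))
        best = max(best, toy_tidiness(state))
    return best

def plan(root_state, iters=300, seed=0):
    """Return the most-visited root action after `iters` MCTS iterations."""
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        node = root
        while not node.untried and node.children:   # selection
            node = uct_child(node)
        if node.untried:                             # expansion
            a = node.untried.pop(rng.randrange(len(node.untried)))
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        reward = rollout(node.state, rng)            # simulation
        while node is not None:                      # backup
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Calling `plan(((0, 0), (0, 1), (3, 3)))` returns one `(object index, target cell)` move. No target arrangement is ever supplied: the only signal is the score, which is the property the abstract highlights. In a closer approximation of the paper's setup, the rollout and expansion steps would presumably be biased by the learned policy rather than sampled uniformly.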

Paper Structure

This paper contains 13 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The hierarchical policy of TSMCTS iteratively finds pick-and-place actions to tidy up objects on a table. The high-level policy finds which object to pick and place according to the current configuration. The low-level policy finds grasp points and trajectories of the end effector. Details of each policy are described in Section V.
  • Figure 2: (a) Different arrangements can be created with the same combination of objects. R represents 'right', L stands for 'left', B for 'behind', and F denotes 'front' among the spatial relations. Various templates are collected to capture as many tidied arrangements as possible for each object set. (b) The TTU dataset consists of state-action sequences for each environment, ranging from a messy scene (t=1) to a perfectly tidied scene (t=T).
  • Figure 3: (a) We train the tidiness discriminator and tidying policy using the TTU dataset. The tidiness discriminator is trained in a supervised manner to predict the tidiness score of the table, while the tidying policy is trained to estimate the action distribution for pick-and-place actions using the IQL framework. (b) During inference, MCTS utilizes the tidiness discriminator $\Psi_\theta$ and the tidying policy $\pi_\rho$ to find the best pick-and-place actions. (c) From the current table image $s_t$, the policy networks take the table image $\mathcal{I}_{-o_i}$ and the object's patch $\mathcal{P}(o_i)$ as inputs to generate an action probability distribution. The action is defined by the selected object, its placement position, and its rotation.
  • Figure 4: Given a high-level action specifying which object to pick and where to place it, the low-level planner uses Contact-GraspNet to find a stable grasping point for the object. To place the object in the desired orientation, the initial orientation is determined by applying ellipse fitting to the object mask obtained through SAM, followed by calculating the rotation transformation to determine the placement.
  • Figure 5: The upper figure illustrates the process of next state prediction by directly moving object patches. The lower figure depicts a sequence of TSMCTS evaluations in the real world. The top row presents the predicted states $\hat{s}_{t}$ by moving image patches from the previous states. The bottom row displays the observed states $s_t$. $\psi_t$ denotes the tidiness score of each state $s_t$.
  • ...and 3 more figures
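
The orientation step described in the Figure 4 caption can be illustrated with a small sketch. The caption does not say how the ellipse is fitted, so this example swaps in a common substitute: the major axis of the moment-equivalent ellipse of a binary mask, computed from second-order central moments. The function names `principal_axis_angle` and `placement_rotation`, and the nested-list mask format, are hypothetical.

```python
import math

def principal_axis_angle(mask):
    """Major-axis angle of a binary mask, in radians.

    Uses second-order central moments (the moment-equivalent ellipse), which is
    an assumed substitute for the paper's unspecified ellipse fitting.
    The angle is in image coordinates: x along columns, y along rows (downward).
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("empty mask")
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    mu20 = sum((x - cx) ** 2 for x, _ in pts) / n
    mu02 = sum((y - cy) ** 2 for _, y in pts) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pts) / n
    return 0.5 * math.atan2(2 * mu11, mu20 - mu02)

def placement_rotation(mask, target_angle):
    """Rotation that brings the mask's major axis to target_angle.

    An axis is symmetric under a 180-degree turn, so the result is wrapped
    into [-pi/2, pi/2) to return the smallest equivalent rotation.
    """
    delta = target_angle - principal_axis_angle(mask)
    return (delta + math.pi / 2) % math.pi - math.pi / 2
```

For a diagonal mask such as `[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`, the estimated axis lies at 45 degrees in image coordinates, so aligning it with the x-axis requires a rotation of the same magnitude in the opposite direction. A real pipeline would take the mask from a segmentation model and convert the image-plane rotation into an end-effector rotation, which this sketch leaves out.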