Semi-Supervised Neural Processes for Articulated Object Interactions

Emily Liu, Michael Noseworthy, Nicholas Roy

TL;DR

This work tackles the data efficiency challenge in adaptive robotic manipulation by proposing Semi-Supervised Neural Processes (SSNP), which jointly learn from abundant unlabeled visual context and a limited set of labeled interactions. By integrating a context-learner inspired by the Neural Statistician with a Neural Process, SSNP builds object-level and action-specific latent representations to predict interaction rewards while adapting to new objects in a few shots. The approach achieves lower prediction error and faster adaptation than fully supervised or pretrained baselines on a door-opening task, even when only a small fraction of objects carry labels, thereby reducing labeling and computational costs. Practically, SSNP enables robots to leverage passive observations to guide manipulation policies with minimal retraining, improving robustness and data efficiency in real-world settings.

Abstract

The scarcity of labeled action data poses a considerable challenge for developing machine learning algorithms for robotic object manipulation. It is expensive and often infeasible for a robot to interact with many objects. Conversely, visual data of objects, without interaction, is abundantly available and can be leveraged for pretraining and feature extraction. However, current methods that rely on image data for pretraining do not easily adapt to task-specific predictions, since the learned features are not guaranteed to be relevant. This paper introduces the Semi-Supervised Neural Process (SSNP): an adaptive reward-prediction model designed for scenarios in which only a small subset of objects has labeled interaction data. In addition to predicting reward labels, the latent space of the SSNP is jointly trained with an autoencoding objective using passive data from a much larger set of objects. Jointly training with both types of data allows the model to focus more effectively on generalizable features and minimizes the need for extensive retraining, thereby reducing computational demands. The efficacy of SSNP is demonstrated on a door-opening task, where it outperforms other semi-supervised methods while using only a fraction of the data required by other adaptive models.
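
The joint training described in the abstract can be illustrated with a minimal sketch. It is not the authors' objective: it assumes that each object contributes a reconstruction error on its passive data, that only labeled objects contribute a reward-prediction error, and it omits any KL or regularization terms a variational model would include. The function name `joint_loss` and the `reward_weight` parameter are hypothetical.

```python
import numpy as np

def joint_loss(recon_err, reward_err, has_labels, reward_weight=1.0):
    """Combine per-object losses for a semi-supervised batch (illustrative only).

    recon_err:     (n_objects,) reconstruction error on passive data, defined for every object
    reward_err:    (n_objects,) reward-prediction error, ignored where has_labels is False
    has_labels:    (n_objects,) boolean mask marking objects that have interaction data
    reward_weight: hypothetical trade-off weight, not taken from the paper
    """
    recon_err = np.asarray(recon_err, dtype=float)
    reward_err = np.asarray(reward_err, dtype=float)
    has_labels = np.asarray(has_labels, dtype=bool)

    # Every object, labeled or not, contributes to the autoencoding term.
    recon_term = recon_err.mean()

    # Only labeled objects contribute to the reward term; it is zero if none are labeled.
    if has_labels.any():
        reward_term = reward_err[has_labels].mean()
    else:
        reward_term = 0.0

    return float(recon_term + reward_weight * reward_term)
```

With this form, lowering the fraction of labeled objects shrinks the set behind the reward term, while the reconstruction term still receives a signal from every object.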

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 2 algorithms.

Figures (6)

  • Figure 1: The Semi-Supervised Neural Process architecture consists of two autoencoder components. The context learner learns embeddings for unlabeled data (such as images) and the neural process learns an embedding of labeled action-reward pairs.
  • Figure 2: Overview of the Semi-Supervised Neural Process in deployment. The robot is presented with a new object with unknown kinematic properties. (1) The robot can passively observe context data without interacting with the object. (2) When the robot starts interacting with the object, it can iteratively build an interaction dataset (exploration). The SSNP model can also be used to predict optimal actions based on the limited interaction data (task execution).
  • Figure 3: Bayesian graphical models for NP, NS, and SSNP. (a) In the Neural Process, the observed variables are the pre-existing action-reward pairs ($a_C$, $r_C$) and the actions with unseen rewards ($a_T$). We learn the associated reward $r_T$ with $a_T$ by means of a latent variable $c_a$. (b) In the Neural Statistician, we observe hierarchical data $x$ (e.g., images), from which we learn shared contextual latent variables $c$ corresponding to the object, and unique instance-level latent variables $z$ for each individual sample. (c) The SSNP integrates the NS and NP probabilistic generative models, where the unobserved $r_T$ is conditioned upon both action-level latent variables $c_a$ and object-level latent variables $c$.
  • Figure 4: Semi-Supervised Neural Process vs Neural Statistician and Neural Process baselines, for datasets with 10%, 25%, and 50% labeled action data. The NS baseline (Finetuned NS) improves with the fraction of labeled data but is unable to adapt to any specific object. The NP baseline's performance degrades with smaller labeled datasets. On the other hand, SSNP exhibits adaptive behaviour and achieves good performance, even with few labeled objects. Regret is evaluated over 100 random actions. The number of labeled actions is capped at 10.
  • Figure 5: Root mean squared error of the SSNP model's reward predictions for different numbers of actions, compared across levels of supervision and numbers of images in the context autoencoder. Having more images in the autoencoder benefits predictions even with low data supervision. The baseline (black line) is the standard deviation of the rewards in the test dataset.
  • ...and 1 more figures
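
The two evaluation quantities named in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code. Regret is assumed here to be the best true reward among the candidate actions minus the true reward of the action the model ranks highest; the paper may define or normalize it differently. All function names are hypothetical.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and true rewards."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def constant_baseline_rmse(actual):
    """RMSE of always predicting the mean reward.

    This equals the population standard deviation of the rewards, which is
    why a standard deviation can serve as a reference line for RMSE.
    """
    return float(np.std(np.asarray(actual, dtype=float)))

def regret(predicted, actual):
    """Assumed definition: best true reward minus true reward of the chosen action.

    The chosen action is the candidate with the highest predicted reward.
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    chosen = int(np.argmax(predicted))
    return float(actual.max() - actual[chosen])

# Toy example with four candidate actions (values are made up for illustration).
true_rewards = [0.0, 1.0, 2.0, 3.0]
model_rewards = [0.1, 2.5, 1.0, 0.0]
print(regret(model_rewards, true_rewards))   # model picks action 1, best is action 3
print(rmse(model_rewards, true_rewards))
print(constant_baseline_rmse(true_rewards))
```

A model whose RMSE sits above the constant baseline is doing no better than predicting the mean reward for every action, which is what makes the standard deviation a useful reference.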