Data-Driven Stochastic Motion Evaluation and Optimization with Image by Spatially-Aligned Temporal Encoding

Takeru Oba; Norimichi Ukita

TL;DR

This work targets long-horizon, probabilistic motion prediction from an initial image. It introduces an Energy-Based Model (EBM) that evaluates motion-image consistency via $p(P|I)=\frac{\exp(-E_{\theta}(I,P))}{Z(I)}$. A novel spatially-aligned temporal encoding fuses image and motion data by sampling image features along the motion trajectory projected onto the image, which enables cross-domain reasoning. To address the inefficiency and hyperparameter sensitivity of sampling-based optimization, a data-driven Deep Motion Optimizer (DMO) is trained to refine motions toward high-probability solutions. The EBM and DMO use the same feature-extractor architecture (the Figure 5 caption notes that their parameters are not shared). Experiments on RLBench tasks show that the proposed approach often outperforms state-of-the-art baselines, highlighting the benefits of cross-domain encoding and learned optimization for long-horizon, stochastic motion prediction. As future work, diffusion-based augmentation is suggested to broaden motion diversity.
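As a rough illustration of the energy-to-probability relation above, the sketch below normalizes $\exp(-E)$ over a finite set of candidate motions. This is an assumption made for illustration only: the true $Z(I)$ integrates over all motions, and the function name `candidate_probabilities` and the energy values are hypothetical, not taken from the paper.

```python
import numpy as np

def candidate_probabilities(energies):
    """Turn energies E into probabilities proportional to exp(-E).

    Assumption: the normalizer Z(I) is approximated by a sum over the
    given finite candidate set, which is only a stand-in for the true Z(I).
    """
    e = np.asarray(energies, dtype=float)
    logits = -e                      # lower energy -> larger logit
    logits = logits - logits.max()   # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical energies of three candidate motions for one image.
probs = candidate_probabilities([2.0, 0.5, 1.0])
best = int(np.argmax(probs))  # index of the most probable candidate
```

Under this convention, the candidate with the lowest energy receives the highest probability.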

Abstract

This paper proposes a probabilistic motion prediction method for long motions. The motion is predicted so that it accomplishes a task from the initial state observed in the given image. Our method evaluates task achievability with an Energy-Based Model (EBM), but previous EBMs are not designed to evaluate consistency between different domains (i.e., image and motion in our method). Our method integrates the image and motion data into the image feature domain by spatially-aligned temporal encoding, in which features are extracted along the motion trajectory projected onto the image. Furthermore, this paper proposes a data-driven motion optimization method, the Deep Motion Optimizer (DMO), that works with the EBM for motion prediction. Unlike previous gradient-based optimizers, our self-supervised DMO alleviates the difficulty of tuning hyper-parameters to avoid local minima. The effectiveness of the proposed method is demonstrated through a variety of experiments comparing it with similar SOTA methods.
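One way to picture the feature extraction along the projected trajectory is bilinear sampling of a feature map at the waypoints' pixel coordinates. The following is a minimal sketch under assumptions: the function name, the array shapes, the border clamping, and the use of bilinear interpolation are illustrative choices and are not confirmed by the paper.

```python
import numpy as np

def sample_along_trajectory(feature_map, uv):
    """Bilinearly sample a (C, H, W) feature map at K pixel coordinates.

    uv is a (K, 2) array of (u, v) = (column, row) positions, assumed to be
    the motion waypoints already projected onto the image plane.
    Returns a (K, C) array with one feature vector per waypoint.
    """
    C, H, W = feature_map.shape
    u = np.clip(uv[:, 0], 0.0, W - 1.0)   # clamp to the image border
    v = np.clip(uv[:, 1], 0.0, H - 1.0)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    # Interpolate along columns first, then along rows.
    top = feature_map[:, v0, u0] * (1 - du) + feature_map[:, v0, u1] * du
    bottom = feature_map[:, v1, u0] * (1 - du) + feature_map[:, v1, u1] * du
    return (top * (1 - dv) + bottom * dv).T

# Toy feature map whose two channels store the column and row index,
# so the sampled features should reproduce the query coordinates.
H, W = 3, 4
cols, rows = np.meshgrid(np.arange(W), np.arange(H))
F = np.stack([cols, rows]).astype(float)          # shape (2, H, W)
uv = np.array([[1.5, 0.5], [3.0, 2.0]])           # two projected waypoints
f = sample_along_trajectory(F, uv)                # shape (2, 2)

# Hypothetical pose features (zeros here), concatenated per waypoint.
g = np.zeros((2, 3))
tokens = np.concatenate([f, g], axis=1)           # shape (K, C + D)
```

The per-waypoint rows of `tokens` correspond to the idea of pairing each image feature with the pose feature of the same time step before any sequence model is applied.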

Paper Structure (17 sections, 8 equations, 8 figures, 5 tables)

Introduction
Related Work
EBM
Motion Prediction
Inference and Training Methods
Notations
Motion Prediction using EBM and DMO
Training of EBM
Training of DMO
Network Architectures of EBM and DMO
Experimental Results
Dataset
Motion Optimizers
Model Architectures of Feature Extractor
EBM Training Methods
...and 2 more sections

Figures (8)

Figure 1: Overview of our proposed method. CNN-Transformer EBM estimates the probability of a motion. The motion is efficiently optimized to improve its probability by Deep Motion Optimizer.
Figure 2: Motion prediction in the inference stage. The image and motions are fed into EBM and the Motion Optimizer. EBM estimates the energy of each motion. Motion Optimizer updates the motion to increase its energy. Finally, the motion having the highest energy is selected as the predicted motion.
Figure 3: Training procedure of our EBM. VAE reconstructs the motions to acquire various negative motions. The image and motions are fed into the EBM. EBM estimates the energy of each motion. This EBM is trained to improve the energy of the positive sample and decrease that of each negative sample.
Figure 4: Training procedure of our Deep Motion Optimizer (DMO). The VAE reconstructs the motion from the hidden vector $\bm{z}^{i}$, which includes small noise. DMO is optimized to refine this small difference to precisely predict the ground truth motion.
Figure 5: The architectures of EBM and DMO. EBM and DMO have the same feature extractor, but the parameters are not shared. The CNN has a UNet-shaped architecture to create the feature map $F$. The pose features are extracted by an MLP. To merge the image and pose features, the image features $\bm{f}_{k}$ are extracted from $F$ along the motion trajectory and concatenated with the pose features $\bm{g}_{k}$. These features are fed into the Transformer to provide the features $\bm{v}_{k}$ to EBM and DMO.
...and 3 more figures
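The inference procedure described in the Figure 2 caption (score each motion, let the optimizer update it, then select the best-scoring motion) can be sketched as follows. Everything here is hypothetical scaffolding: `score_fn` and `refine_fn` are toy stand-ins for the trained EBM and DMO, the additive update and the fixed number of steps are assumptions, and the sketch assumes that a larger score is better, as worded in that caption.

```python
import numpy as np

def predict_motion(image, candidates, score_fn, refine_fn, n_steps=3):
    """Refine every candidate motion, then return the best-scoring one.

    score_fn(image, motion) -> float, larger is better (assumed).
    refine_fn(image, motion) -> array added to the motion (assumed).
    """
    motions = [np.array(m, dtype=float) for m in candidates]
    for _ in range(n_steps):
        motions = [m + refine_fn(image, m) for m in motions]
    scores = np.array([score_fn(image, m) for m in motions])
    return motions[int(np.argmax(scores))], scores

# Toy stand-ins: the "task" is to reach a fixed 2-D target.
target = np.array([1.0, 2.0])

def score_fn(image, motion):
    # Negative squared distance to the target, so closer is better.
    return -float(np.sum((motion - target) ** 2))

def refine_fn(image, motion):
    # Move halfway toward the target at every step.
    return 0.5 * (target - motion)

candidates = [np.zeros(2), np.array([4.0, 4.0])]
best_motion, scores = predict_motion(None, candidates, score_fn, refine_fn)
```

In this toy setting each refinement step halves the remaining error, so the candidate that started closer to the target is the one selected.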
