InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion

Jihyun Lee, Shunsuke Saito, Giljoo Nam, Minhyuk Sung, Tae-Kyun Kim

TL;DR

InterHandGen introduces a diffusion model that learns the single-hand distribution, both unconditional and conditional on the other hand, via conditioning dropout. Used as a prior, it boosts two-hand reconstruction from monocular in-the-wild images and achieves new state-of-the-art accuracy.

Abstract

We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction, with or without an object. Our prior can be incorporated into any optimization or learning method to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we propose to decompose the modeling of the joint distribution into the modeling of factored unconditional and conditional single-instance distributions. In particular, we introduce a diffusion model that learns the single-hand distribution, both unconditional and conditional on the other hand, via conditioning dropout. For sampling, we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore, we establish a rigorous evaluation protocol for two-hand synthesis, under which our method significantly outperforms baseline generative models in terms of plausibility and diversity. We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images, achieving new state-of-the-art accuracy.
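
The factorization described in the abstract can be read as $p(\mathbf{x}_l, \mathbf{x}_r) = p(\mathbf{x}_l)\,p(\mathbf{x}_r \mid \mathbf{x}_l)$, with one network covering both factors because the condition is randomly dropped during training. The following is a minimal sketch of that idea, not the authors' implementation. Everything named here is an assumption made for illustration: `toy_denoiser` is a hand-written stand-in for the learned network, `NULL_COND` is a hypothetical "no condition" token, the guidance rule uses the common form `eps_uncond + w * (eps_cond - eps_uncond)`, and the reverse loop is a simplified deterministic update rather than a real diffusion noise schedule. The choice to sample the left hand first follows the Figure 1 caption, and anti-penetration guidance is omitted.

```python
import numpy as np

NULL_COND = None  # hypothetical token meaning "no other hand given"


def drop_condition(cond, p_drop, rng):
    """Conditioning dropout: with probability p_drop, hide the other hand."""
    return NULL_COND if rng.random() < p_drop else cond


def toy_denoiser(x_t, t, cond):
    """Stand-in for the learned noise predictor (NOT the paper's network).

    Its output points away from zero when unconditional and away from
    the condition when one is given, so subtracting it moves x_t toward
    that target.
    """
    target = np.zeros_like(x_t) if cond is NULL_COND else cond
    return x_t - target


def guided_eps(denoiser, x_t, t, cond, w):
    """Classifier-free guidance: blend unconditional and conditional outputs."""
    eps_uncond = denoiser(x_t, t, NULL_COND)
    if cond is NULL_COND:
        return eps_uncond
    eps_cond = denoiser(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)


def reverse_sample(denoiser, dim, cond, w, steps, rng, step_size=0.1):
    """Simplified reverse process: start from noise, repeatedly subtract
    a fraction of the guided prediction (assumed update, no noise schedule)."""
    x = rng.standard_normal(dim)
    for t in reversed(range(steps)):
        x = x - step_size * guided_eps(denoiser, x, t, cond, w)
    return x


def sample_two_hands(denoiser, dim, w, steps, rng):
    """Cascaded sampling: first x_l ~ p(x_l), then x_r ~ p(x_r | x_l)."""
    x_l = reverse_sample(denoiser, dim, NULL_COND, w, steps, rng)
    x_r = reverse_sample(denoiser, dim, x_l, w, steps, rng)
    return x_l, x_r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_l, x_r = sample_two_hands(toy_denoiser, dim=4, w=1.0, steps=200, rng=rng)
    print(x_l.shape, x_r.shape)
```

In a training loop, `drop_condition` would be applied to the other-hand input before each forward pass, so that the same weights learn both the unconditional and the conditional prediction that `guided_eps` later combines.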

Paper Structure (26 sections, 10 equations, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: Our network architecture. We use self-attention between the embeddings of the inputs (i.e., $\mathbf{x}_{t}$, $\mathbf{x}_{l}$, $t$, and optional $\mathcal{O}$) to estimate the denoised hand parameter $\mathbf{x}_{r}$.
  • Figure 2: Two-hand interactions synthesized by InterHandGen. The sampled interactions are plausible and diverse.
  • Figure 3: Object-conditional two-hand interaction synthesized by InterHandGen. Ours can model plausible and diverse bimanual interactions.
  • Figure S1: Hands sampled by our prior trained on a two-hand dataset [moon2020interhand2] and additional single-hand datasets [zimmermann2019freihand, zimmermann2017learning, gomez2019large, zhang2017hand].
  • Figure S2: Qualitative results of our monocular two-hand reconstruction experiment in Section 4.3. The top four rows show results from the HIC dataset [tzionas2016capturing], while the bottom four rows show results from the InterHand2.6M dataset [moon2020interhand2]. Brown boxes highlight areas where shape penetration occurs, and blue boxes denote regions with inaccurate hand interaction (e.g., contact is absent where it should occur). Utilizing our generative prior leads to more plausible reconstructions.
  • ...and 2 more figures