SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu

TL;DR

SyncMV4D targets realistic hand-object interaction (HOI) generation under occlusion by jointly modeling appearance, motion, and geometry across multiple views with diffusion. It introduces Multi-view Joint Diffusion (MJD), which generates synchronized color videos, motion pseudo-videos, and a metric depth scale $s$, and the Diffusion Points Aligner (DPA), which produces globally aligned 4D point tracks. In a closed-loop mutual enhancement cycle, the two modules refine each other's outputs during denoising, guided by the reprojected 4D points. Evaluations on the HOI-focused TACO dataset show state-of-the-art visual realism, motion plausibility, and cross-view consistency, using only a reference image and text prompts as input. The framework advances physics-aware video world modeling and enables robust HOI synthesis in occluded and real-world scenarios.

Abstract

Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
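The closed-loop cycle described above can be read as two iterative denoisers that alternate and feed each other. The following is only a schematic sketch of that control flow, not the authors' implementation: `mjd_step`, `dpa_step`, and `reproject` are hypothetical placeholders, plain floats stand in for video, motion, and point-track tensors, and starting the first step without guidance is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def closed_loop_denoise(
    video: float,
    motion: float,
    tracks: float,
    mjd_step: Callable[[float, float, Optional[float], int], Tuple[float, float]],
    dpa_step: Callable[[float, float, float, int], float],
    reproject: Callable[[float], float],
    num_steps: int,
) -> Tuple[float, float, float]:
    """Alternate two denoisers so that each step of one conditions the other."""
    guidance: Optional[float] = None  # assumption: no track guidance at the first step
    for t in reversed(range(num_steps)):
        # Joint generation of video and coarse motion, guided by reprojected tracks.
        video, motion = mjd_step(video, motion, guidance, t)
        # Track refinement conditioned on the current video and coarse motion.
        tracks = dpa_step(tracks, video, motion, t)
        # Reproject the aligned tracks to guide the next joint-generation step.
        guidance = reproject(tracks)
    return video, motion, tracks


# Toy stand-ins (hypothetical) that halve their inputs and record the call order.
log: List[tuple] = []


def toy_mjd(video, motion, guidance, t):
    log.append(("mjd", t, guidance is not None))
    return video * 0.5, motion * 0.5


def toy_dpa(tracks, video, motion, t):
    log.append(("dpa", t))
    return tracks * 0.5


def toy_reproject(tracks):
    return tracks


video, motion, tracks = closed_loop_denoise(
    1.0, 1.0, 1.0, toy_mjd, toy_dpa, toy_reproject, num_steps=3
)
```

The recorded `log` shows the interleaving: every joint-generation step after the first receives guidance derived from the preceding track-refinement step.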

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Our synchronized multi-view joint diffusion (SyncMV4D) simultaneously models multi-view geometry, visual appearance, and motion dynamics. It is capable of generating both multi-view hand-object interaction videos (left) and 4D motion sequences, comprising intermediate coarse pseudo videos (middle) and refined point tracks (right), with results achieving visual realism, dynamic plausibility, and geometric consistency.
  • Figure 2: Our SyncMV4D consists of two key components: First, the Multi-view Joint Diffusion (MJD) module generates synchronized multi-view color videos, intermediate motion pseudo videos, and metric depth scales (Sec. \ref{sec:mjd}). Second, the Diffusion Points Aligner (DPA) module takes the resulting coarse 4D motions as a conditioning signal to reconstruct globally aligned 4D point tracks (Sec. \ref{sec:dpa}). Furthermore, since both MJD and DPA are iterative denoisers, the refined 4D point tracks from DPA are fed back to guide MJD in subsequent denoising steps, forming a closed-loop mutual enhancement cycle (Sec. \ref{sec:cycle}).
  • Figure 3: Comparison between the motion representation of DaS (Gu et al., 2025) and our 4D point tracks. For each point, the first two dimensions are the pixel coordinates of the tracked point in the first frame. The difference lies in the third dimension: DaS uses the static depth from the first frame, whereas we use the actual per-frame depth to enhance 3D perceptual capability.
  • Figure 4: Visualization of the generated multi-view videos from different methods. Red circles indicate multi-view inconsistencies, yellow boxes highlight video distortions, and blue boxes denote blurring artifacts.
  • Figure 5: Visualization of multi-view points reprojected onto the same coordinate system.
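
The distinction drawn in the Figure 3 caption can be illustrated with a small sketch. It is one reading of the caption, not the paper's code: the function names and the depth values are made up for illustration, and how these triples are encoded into pseudo-video pixels is not specified here.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (u0, v0, depth): first-frame pixel coords plus a depth


def static_depth_track(u0: float, v0: float, depths: List[float]) -> List[Point]:
    """DaS-style value per the caption: depth is frozen at its first-frame value."""
    return [(u0, v0, depths[0]) for _ in depths]


def per_frame_depth_track(u0: float, v0: float, depths: List[float]) -> List[Point]:
    """Per-frame variant per the caption: the third dimension follows the actual depth."""
    return [(u0, v0, d) for d in depths]


# Hypothetical depths of one tracked point that moves toward the camera.
depths = [1.00, 0.90, 0.75]
static = static_depth_track(120.0, 64.0, depths)
dynamic = per_frame_depth_track(120.0, 64.0, depths)
```

In both variants the first two dimensions stay fixed at the first-frame pixel location; only the per-frame variant records that the point's depth changes over time.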