Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

Divyanshu Daiya, Aniket Bera

TL;DR

Sketch2Colab turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. On CORE4D and InterHuman, it achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.

Abstract

We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
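The idea of differentiable energies shaping a flow's transport field can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the base field is a constant shift `mu` (the straight-line transport from N(0, I) to N(mu, I)) standing in for a learned student, the energy is a hypothetical quadratic "keyframe" penalty on one latent coordinate, and `lam` is a made-up guidance weight. The paper's dual-space guidance, Jacobian surrogate, and decoder are not modeled.

```python
import numpy as np

def energy(z, c):
    # Hypothetical keyframe energy: coordinate 0 of each latent should equal c.
    return 0.5 * (z[:, 0] - c) ** 2

def energy_grad(z, c):
    # Analytic gradient of `energy` with respect to z.
    g = np.zeros_like(z)
    g[:, 0] = z[:, 0] - c
    return g

def sample(z0, mu, c, lam, n_steps=50):
    # Euler integration of dz/dt = v(z, t) - lam * grad E(z) on t in [0, 1].
    # The toy base field v is the constant shift `mu`; it stands in for a
    # learned rectified-flow student and is an assumption of this sketch.
    z = z0.copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        z = z + dt * (mu - lam * energy_grad(z, c))
    return z

rng = np.random.default_rng(0)
z0 = rng.normal(size=(256, 4))          # initial noise latents
mu = np.array([2.0, 0.0, 0.0, 0.0])     # toy transport direction

unguided = sample(z0, mu, c=0.0, lam=0.0)  # base flow only
guided = sample(z0, mu, c=0.0, lam=5.0)    # base flow plus energy guidance

# Mean constraint energy without and with guidance.
print(energy(unguided, 0.0).mean(), energy(guided, 0.0).mean())
```

In this toy setting the guidance term only touches the constrained coordinate, so the remaining coordinates follow the base flow unchanged; a few Euler steps suffice because the base trajectories are straight lines.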

Paper Structure (12 sections, 10 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Architecture of Sketch2Colab: storyboard$\to$3D human-object-human (HOH) motion (steps 1--6). (1) The user provides sketches (keyframes, per-joint 2D paths, object masks) and an optional text prompt; (2) paired 2D/3D encoders produce aligned embeddings and 3D proxies, which are fused into $\mathcal{C}$; (3) a PF-distilled rectified-flow student $v_\phi(\mathbf{z},t\mid\mathcal{C})$ evolves latents under a frozen teacher; (4) a CTMC over phases modulates sub-fields and contact weights via $\boldsymbol{\pi}_t$; (5) raw-space energies and latent anchors define dual-space guidance, implemented through a learned Jacobian surrogate and applied as $u_{\text{raw}}$ and $u_{\text{lat}}$ in the ODE update; (6) a frozen decoder $\mathcal{D}$ yields the final motion. The model is optimized with a combination of RF, distillation, Lyapunov, CTMC, latent, and energy losses (Secs. \ref{subsec:gen-field}--\ref{subsec:ctmc}).
  • Figure 2: Sketch$\rightarrow$interaction motion: comparison of Sketch2Colab and the COLLAGE teacher \cite{collage}. Given storyboard keyframes and joint trajectories (top), Sketch2Colab (right) generates HOH motions that closely follow the sketches, execute interaction phases at the intended times, and adhere tightly to trajectories and keyframes. In contrast, COLLAGE (teacher, middle) struggles to respect storyboard constraints such as the handover and the continued motion of a single human holding the object ($1^{st}$ and $2^{nd}$ storyboards), and fails to match fine-grained keyframe constraints (e.g., the third storyboard, where the character must lift higher while moving). Additional examples and baseline comparisons are provided in Supp. Sec. S.5.
  • Figure 3: Qualitative motion frames, trajectories, and limitations. (A, B) Sketch-only comparisons to the COLLAGE teacher on InterHuman \cite{liang2023intergen} (HH) and CORE4D \cite{zhang2024core4d} (3+ entity OOD). (C--F) Hard cases: heavy sketch noise ($\approx 60\%$, Tab. \ref{tab:main_one_table1} (b, c)), self-intersecting paths, and OOD multi-object / sparse-constraint inputs causing drift, floating, or collisions.
  • Figure 4: Impact of CTMC and energy guidance. (a) CTMC shifts the F1–FMD Pareto frontier outward while reducing collision rates. (b) Contact timing calibration improves significantly (lower ECE). (c) Flow maintains low curvature except at discrete mode transitions. (d) Energy gradient aligns constructively with base flow near transitions, eliminating gradient conflicts.
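Step 4 of Figure 1, in which a CTMC over phases modulates sub-fields via $\boldsymbol{\pi}_t$, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's planner: the three phases (approach, grasp, handoff), the left-to-right rate matrix `Q`, the transition rate, and the per-phase toy sub-fields are all hypothetical, and the phase marginals are obtained by plain Euler integration of the Kolmogorov forward equation $\dot{\boldsymbol{\pi}}_t = \boldsymbol{\pi}_t Q$.

```python
import numpy as np

def phase_marginals(Q, pi0, n_steps=100, horizon=1.0):
    # Euler integration of the Kolmogorov forward equation d(pi)/dt = pi @ Q.
    # Rows of Q sum to zero, so each step preserves total probability.
    dt = horizon / n_steps
    pis = [np.asarray(pi0, dtype=float)]
    for _ in range(n_steps):
        pis.append(pis[-1] + dt * pis[-1] @ Q)
    return np.stack(pis)

def blended_velocity(z, pi_t, fields):
    # Mix per-phase sub-fields using the current phase probabilities.
    return sum(p * f(z) for p, f in zip(pi_t, fields))

# Hypothetical left-to-right phases: approach -> grasp -> handoff (absorbing).
rate = 4.0
Q = np.array([[-rate, rate, 0.0],
              [0.0, -rate, rate],
              [0.0, 0.0, 0.0]])
pis = phase_marginals(Q, pi0=[1.0, 0.0, 0.0])

# Toy sub-fields, one per phase (assumptions for illustration only).
fields = [lambda z: -z, lambda z: np.zeros_like(z), lambda z: np.ones_like(z)]
z = np.array([1.0, -2.0])
v_start = blended_velocity(z, pis[0], fields)   # dominated by the first phase
v_end = blended_velocity(z, pis[-1], fields)    # dominated by the last phase
```

Using soft phase probabilities rather than a hard phase label keeps the blended field continuous in time, which is one plausible reason to couple a discrete event schedule to a continuous flow in this way.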