Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision

Zhenxiao Liang, Qixing Huang

Abstract

Editing animatable human avatars typically relies on sparse supervision, often a few edited keyframes, yet naively fitting a reconstructed avatar to these edits frequently causes identity leakage and pose-dependent temporal flicker. We argue that these failures are best understood as an ill-conditioned inversion: the available edited constraints do not sufficiently determine the latent directions responsible for the intended edit. We propose a conditioning-guided edited reconstruction framework that performs editing as a constrained inversion in a structured avatar latent space, restricting updates to a low-dimensional, part-specific edit subspace to prevent unintended identity changes. Crucially, we design the editing constraints during inversion by optimizing a conditioning objective derived from a local linearization of the full decoding-and-rendering pipeline, yielding an edit-subspace information matrix whose spectrum predicts stability and drives frame reweighting / keyframe activation. The resulting method operates on small subspace matrices and can be implemented efficiently (e.g., via Hessian-vector products), and improves stability under limited edited supervision.
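The abstract's central object, the edit-subspace information matrix and its spectrum, can be sketched concretely. Below is a minimal illustration assuming per-frame Jacobians $A_t$ of the linearized decode-and-render map restricted to a small edit subspace (the function names, dimensions, and prior are hypothetical, not the paper's implementation):

```python
import numpy as np

def information_matrix(jacobians, weights, prior_precision):
    """S(w) = Lambda_0 + sum_t w_t A_t^T A_t over the edit subspace.

    jacobians : list of (m, d) arrays, per-frame Jacobians of the rendered
        residual w.r.t. the d-dimensional edit-subspace coefficients
        (a stand-in for the paper's linearized decoding-and-rendering map).
    """
    S = prior_precision.copy()
    for w, A in zip(weights, jacobians):
        S += w * A.T @ A
    return S

def conditioning_score(S):
    """Spectrum-based stability predictors: smallest eigenvalue and
    condition number of the information matrix."""
    eigs = np.linalg.eigvalsh(S)          # ascending eigenvalues
    return eigs[0], eigs[-1] / eigs[0]

rng = np.random.default_rng(0)
d = 8                                     # edit-subspace dimension (small)
jacobians = [rng.standard_normal((32, d)) for _ in range(5)]
weights = np.ones(5) / 5
S = information_matrix(jacobians, weights, 1e-2 * np.eye(d))
lam_min, cond = conditioning_score(S)
```

Because the subspace dimension $d$ is small, the eigendecomposition is cheap even when each $A_t$ would, in practice, be formed implicitly via Jacobian- or Hessian-vector products rather than materialized.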

Paper Structure

This paper contains 39 sections, 2 theorems, 49 equations, 4 figures, and 2 tables.

Key Result

Theorem 4.1

Under the assumptions above, let $\hat{v}(w)$ denote the solution to the weighted-ridge problem (eq:weighted_ridge). Then: (i) Posterior precision. If $r_t\equiv 0$ and $v_\star\sim\mathcal{N}(0,\Lambda_0^{-1})$, the posterior is Gaussian, and the posterior covariance is exactly $S(w)^{-1}$. (ii) MSE bound with inconsistency and linearization error. Under the same Gaussian prior and the true model $b_t=A_t v_\star+r_t+\varepsilon_t$, …
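Part (i) can be checked numerically. Assuming the weighted-ridge objective has the standard form $\sum_t w_t\|A_t v - b_t\|^2 + v^\top \Lambda_0 v$ (the shapes and data below are illustrative, not from the paper), its minimizer is $\hat{v}(w) = S(w)^{-1}\sum_t w_t A_t^\top b_t$ with $S(w)=\Lambda_0+\sum_t w_t A_t^\top A_t$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 6, 4
Lambda0 = 0.5 * np.eye(d)                 # Gaussian prior precision
A = [rng.standard_normal((10, d)) for _ in range(T)]
b = [rng.standard_normal(10) for _ in range(T)]
w = np.array([0.4, 0.3, 0.2, 0.1])        # per-frame weights

# Closed form: S(w) = Lambda_0 + sum_t w_t A_t^T A_t,
# v_hat = S(w)^{-1} sum_t w_t A_t^T b_t
S = Lambda0 + sum(wt * At.T @ At for wt, At in zip(w, A))
v_hat = np.linalg.solve(S, sum(wt * At.T @ bt for wt, At, bt in zip(w, A, b)))

# Same minimizer from the stacked least-squares form of the objective:
# rows sqrt(w_t) A_t plus a prior block L^T with Lambda_0 = L L^T
L = np.linalg.cholesky(Lambda0)
Astack = np.vstack([np.sqrt(wt) * At for wt, At in zip(w, A)] + [L.T])
bstack = np.concatenate([np.sqrt(wt) * bt for wt, bt in zip(w, b)]
                        + [np.zeros(d)])
v_lstsq, *_ = np.linalg.lstsq(Astack, bstack, rcond=None)
assert np.allclose(v_hat, v_lstsq)
```

The stacked form makes the Bayesian reading explicit: $S(w)$ is exactly the Gaussian posterior precision claimed in part (i), so its inverse is the posterior covariance.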

Figures (4)

  • Figure 1: Task Overview. Given a source monocular video (1st row) and sparse edited keyframes (2nd row), our method produces temporally stable avatar edits that preserve identity, along with per-keyframe importance weights (3rd row). The edited avatar supports downstream applications such as novel view synthesis and animation (4th row).
  • Figure 2: Overview of our pipeline.
  • Figure 3: Edited keyframes and renderings at unseen time steps. The first row contains an incorrect edit; the second row exhibits inconsistent editing appearance, where Edit2 introduces an arm tattoo.
  • Figure 4: We set the supervision budget to $K{=}3$ over five candidate keyframes. As optimization progresses, the weight mass shifts away from an obviously incorrect edit (frame 5), and subsequently from a frame with inconsistent appearance (frame 4, hat mismatch). The final selection concentrates on frames 1--3, yielding a consistent keyframe set.
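The budgeted keyframe activation behavior described in Figure 4 can be sketched with a simple greedy rule. The paper's exact activation criterion is not stated on this page; below is a plausible D-optimal variant that, given per-frame information matrices $I_t = A_t^\top A_t$, repeatedly activates the frame that most increases $\log\det S$ (all names and data are illustrative assumptions):

```python
import numpy as np

def select_keyframes(infos, K, prior_precision):
    """Greedy log-det (D-optimal) keyframe activation under a budget K:
    at each step, add the frame whose information matrix most increases
    log det S. A stand-in for a spectrum-driven activation rule."""
    S = prior_precision.copy()
    chosen, remaining = [], set(range(len(infos)))
    for _ in range(K):
        best = max(remaining,
                   key=lambda t: np.linalg.slogdet(S + infos[t])[1])
        chosen.append(best)
        remaining.remove(best)
        S = S + infos[best]
    return chosen, S

rng = np.random.default_rng(2)
d = 4
# Frames 0-2 carry strong, complementary information; frames 3-4 are weak
# (mimicking the incorrect / inconsistent edits that get down-weighted).
strong = [10.0 * np.outer(e, e) + np.eye(d) for e in np.eye(d)[:3]]
weak = [0.01 * np.eye(d) for _ in range(2)]
infos = strong + weak
chosen, _ = select_keyframes(infos, K=3, prior_precision=1e-3 * np.eye(d))
```

With these synthetic matrices the greedy rule concentrates the budget on the three informative frames, mirroring the selection behavior the caption describes.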

Theorems & Definitions (2)

  • Theorem 4.1
  • Lemma 1.1: Trace–determinant inequality
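The lemma's full statement is not shown on this page; a standard trace–determinant inequality for positive semidefinite matrices, which follows from the AM–GM inequality applied to the eigenvalues, reads:

```latex
\text{For } M \succeq 0 \text{ with eigenvalues } \lambda_1,\dots,\lambda_n \ge 0:
\qquad
\det(M)^{1/n}
  = \Bigl(\prod_{i=1}^{n} \lambda_i\Bigr)^{1/n}
  \;\le\; \frac{1}{n}\sum_{i=1}^{n} \lambda_i
  = \frac{\operatorname{tr}(M)}{n},
\text{ with equality iff all } \lambda_i \text{ are equal, i.e. } M = \lambda I.
```

If this is the version used in the paper, it would relate the log-det (D-optimal) and trace (A-optimal-style) summaries of the edit-subspace information matrix $S(w)$.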