MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Nikolai Warner, Cameron Ethan Taylor, Irfan Essa, Apaar Sadhwani

Abstract

Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
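
The within-motion text-embedding variance mentioned above can be made concrete with a small sketch. The abstract does not give the exact estimator, so this assumes the variance is the mean squared distance of each caption embedding to its motion's centroid, averaged over motions. The function name `within_motion_variance` and the toy 2-D embeddings are hypothetical illustrations, not the authors' code.

```python
from statistics import fmean


def within_motion_variance(groups):
    """Average spread of caption embeddings around their per-motion centroid.

    groups: list with one entry per motion; each entry is a list of
    embedding vectors (lists of floats), one per caption of that motion.
    Assumed definition: mean squared distance to the centroid, averaged
    over motions. The paper's exact estimator may differ.
    """
    per_motion = []
    for embs in groups:
        dim = len(embs[0])
        centroid = [fmean(e[d] for e in embs) for d in range(dim)]
        sq_dists = [
            sum((e[d] - centroid[d]) ** 2 for d in range(dim)) for e in embs
        ]
        per_motion.append(fmean(sq_dists))
    return fmean(per_motion)


# Toy data: two motions, two captions each. The second coordinate of the
# first motion's captions plays the role of annotator-specific nuisance.
raw = [[[1.0, 0.0], [1.0, 2.0]], [[0.0, 1.0], [0.0, 1.0]]]
# After a (hypothetical) canonicalizer, both captions map to the same point.
canonical = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

print(within_motion_variance(raw))        # 0.5
print(within_motion_variance(canonical))  # 0.0
```

In this toy case canonicalization removes all within-motion spread; the 11-19% reduction reported above is the corresponding measurement on real caption embeddings.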

Paper Structure (61 sections, 5 equations, 3 figures, 17 tables)

Figures (3)

  • Figure 1: Each caption is a different sample from a distribution of valid descriptions. Three annotators describe the same motion (top) with different captions, each mixing motion-recoverable semantics $s$ (blue) with annotator-specific nuisance factors $a$ (red), i.e., stylistic variation. Standard contrastive training treats each as the single correct target; MoCHA projects each onto $s$, producing a single deterministic positive.
  • Figure 2: MoCHA overview. (a) Motivated by the $(s, a)$ decomposition (Section \ref{sec:theory}), $C(\cdot)$ projects each caption onto $s$ by stripping stylistic variation $a$ (red). $C$ is implemented via LLM and distilled into FlanT5 for LLM-free inference. (b) Blend training balances both views: the denoised $C(t_i)$ anchors embeddings around $s$ to reduce gradient variance, while the original $t_i$ regularizes for natural-language queries.
  • Figure 3: Canonicalization projects captions onto $s$, improving retrieval. Top row (colored): ground truth; bottom row (gray): baseline rank-1 error. (a) Verbose $a$ buries the action; MoCHA extracts $s$ while preserving the metaphor. (b) Annotator uncertainty ($a$); canonicalization extracts shared kinematic content. (c) Complex description decomposed into sequential $s$, disambiguating from similar motions. (d) Over-specified caption dilutes the contrastive signal; MoCHA strips $a$, retains discriminative $s$.
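
The blend training described in Figure 2(b) can be illustrated with a minimal sketch. The captions above do not state how the two views are combined, so this assumes a convex combination of two text-to-motion InfoNCE losses, one on embeddings of the canonicalized caption $C(t_i)$ and one on embeddings of the original $t_i$. The names `info_nce` and `blend_loss`, the weight `alpha`, and the temperature value are hypothetical choices, not taken from the paper.

```python
import math


def info_nce(text_embs, motion_embs, temperature=0.07):
    """Text-to-motion InfoNCE; row i of each list is a matched pair."""
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(text_embs[i], motion_embs[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all motions in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


def blend_loss(orig_embs, canon_embs, motion_embs, alpha=0.5, temperature=0.07):
    """Assumed blend: alpha weights the canonical view, 1 - alpha the original."""
    canon_term = info_nce(canon_embs, motion_embs, temperature)
    orig_term = info_nce(orig_embs, motion_embs, temperature)
    return alpha * canon_term + (1.0 - alpha) * orig_term


# Toy batch of two motions with unit-norm 2-D embeddings.
motion = [[1.0, 0.0], [0.0, 1.0]]
canon = [[1.0, 0.0], [0.0, 1.0]]   # canonical captions align with the motions
orig = [[0.8, 0.6], [0.6, 0.8]]    # original captions carry nuisance content

print(info_nce(canon, motion) < info_nce(orig, motion))  # True
print(blend_loss(orig, canon, motion, alpha=0.5))
```

Only the text-to-motion direction is shown; a symmetric variant would add the motion-to-text term in the same way. Setting `alpha` to 1 trains on canonical captions alone, and setting it to 0 recovers standard training on the original captions.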