MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli

TL;DR

MOCHA addresses the challenge of adapting a general detector to user-specific objects under few-shot constraints by distilling multimodal, region-level semantics from a frozen vision-language teacher into a compact detector. It combines a feature translation module with two objectives, local alignment and relational embedding, which preserve both the semantic content and the geometry of the teacher's embedding space. These are formalized as $\mathcal{L}_{\mathrm{dist}}$ and $\mathcal{L}_{\mathrm{emb}}$, giving the overall loss $\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{\mathrm{dist}} \mathcal{L}_{\mathrm{dist}} + \lambda_{\mathrm{emb}} \mathcal{L}_{\mathrm{emb}}$. The method uses PCA-based dimensionality reduction to compress teacher targets to $d_t$ dimensions (best around $512$) and employs a translation module $t_S$ to align student features with the teacher embeddings $u'_i$. Across four personalized detection benchmarks, MOCHA yields an average improvement of $+10.1$ over the YOLOv8n baseline while keeping latency competitive, and it generalizes to different student architectures, which makes it suitable for edge deployment. These results suggest that integrating multimodal priors during training can substantially enhance few-shot personalization without requiring test-time prompts or teacher inference.
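The PCA compression step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `fit_pca` and `project`, the random stand-in embeddings, and the small sizes (32 dimensions reduced to 8, rather than to the reported $d_t \approx 512$) are assumptions chosen only to keep the example short.

```python
import numpy as np

def fit_pca(teacher_embeddings, d_t):
    """Fit PCA on teacher embeddings; return the mean and the top-d_t axes."""
    mean = teacher_embeddings.mean(axis=0)
    centered = teacher_embeddings - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:d_t]

def project(embeddings, mean, components):
    """Project embeddings onto the retained principal axes."""
    return (embeddings - mean) @ components.T

# Hypothetical data: 64 region embeddings of dimension 32 (stand-ins for u_i).
rng = np.random.default_rng(0)
u = rng.normal(size=(64, 32))

mean, comps = fit_pca(u, d_t=8)
u_prime = project(u, mean, comps)  # compressed targets, stand-ins for u'_i
print(u_prime.shape)
```

In this sketch the projection is fit once on teacher outputs and then reused, so the student only ever has to match $d_t$-dimensional targets.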

Abstract

Personalized object detection aims to adapt a general-purpose detector to recognize user-specific instances from only a few examples. Lightweight models often struggle in this setting due to their weak semantic priors, while large vision-language models (VLMs) offer strong object-level understanding but are too computationally demanding for real-time or on-device applications. We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a distillation framework that transfers multimodal region-level knowledge from a frozen VLM teacher into a lightweight vision-only detector. MOCHA extracts the teacher's fused visual and textual embeddings and uses them to guide student training through a dual-objective loss that enforces accurate local alignment and global relational consistency across regions. This process enables efficient transfer of semantics without the need for teacher modifications or textual input at inference. MOCHA consistently outperforms prior baselines across four personalized detection benchmarks under strict few-shot regimes, yielding a +10.1 average improvement, with minimal inference cost.
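To make the dual-objective loss concrete, the sketch below combines a detection loss with a local alignment term and a relational term as a weighted sum. The specific forms used here are assumptions, since this page does not define them: the local term is taken as the mean Euclidean distance between matched student and teacher region embeddings, and the relational term as the mean absolute difference between the two pairwise-distance matrices. The function names and default weights are likewise hypothetical.

```python
import numpy as np

def local_alignment_loss(student, teacher):
    """Assumed form: mean Euclidean distance between matched region embeddings."""
    return float(np.mean(np.linalg.norm(student - teacher, axis=1)))

def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def relational_loss(student, teacher):
    """Assumed form: mismatch between student and teacher pairwise geometry."""
    gap = pairwise_distances(student) - pairwise_distances(teacher)
    return float(np.mean(np.abs(gap)))

def total_loss(l_det, student, teacher, lambda_dist=1.0, lambda_emb=1.0):
    """Weighted sum of detection, local alignment, and relational terms."""
    return (l_det
            + lambda_dist * local_alignment_loss(student, teacher)
            + lambda_emb * relational_loss(student, teacher))

# Hypothetical data: 5 regions with 4-dimensional embeddings.
rng = np.random.default_rng(0)
t = rng.normal(size=(5, 4))   # teacher targets
s = t + 1.0                   # student embeddings shifted by a constant offset

print(local_alignment_loss(s, t), relational_loss(s, t))
```

The shifted example shows why two terms can be useful: a constant offset leaves the pairwise geometry intact, so the relational term is (numerically) zero while the local term is not.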

Paper Structure

This paper contains 45 sections, 9 equations, 10 figures, 11 tables, 3 algorithms.

Figures (10)

  • Figure 1: MOCHA recipe. (1) Pretraining student model. (2) Knowledge distillation on rich joint visual and textual features from a frozen teacher. (3) Few-shot personalization with frozen student and prototypical learner.
  • Figure 2: Feature distillation stage of the student network from the frozen teacher.
  • Figure 3: Personalization stage.
  • Figure 5: Effect of $\mathcal{L}_{\mathrm{emb}}$ on a set of ten 2D points optimized with respect to 3D ones. (a) 3D reference points, proxy for the teacher embeddings $u_i'$. (b) Evolution of the 2D points (proxy for student embeddings $f'_{A,i}$) updated via $\mathcal{L}_{\mathrm{emb}}$ from the 3D reference points $u_i'$. $\star$ marks the original location; timesteps increase with color saturation. (c) Percent rate of 2D top-$k$ nearest neighbors (k-NN) that match those of the reference 3D distribution.
  • Figure 6: mAP$^{50-95}$ at different PCA dimensionality. Average score across all evaluation datasets varying feature dimension $d_t$.
  • ...and 5 more figures
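
The metric in panel (c) of Figure 5, the percentage of top-$k$ nearest neighbors shared between the 2D points and the 3D reference, can be computed along the following lines. This is a sketch under stated assumptions: the helper names `knn_indices` and `knn_match_rate` are hypothetical, the points are random stand-ins, and the 2D set is simply the 3D set with one coordinate dropped rather than the result of optimizing $\mathcal{L}_{\mathrm{emb}}$.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def knn_match_rate(student, teacher, k):
    """Percent of top-k neighbors that agree between the two point sets."""
    s_nn = knn_indices(student, k)
    t_nn = knn_indices(teacher, k)
    shared = sum(len(set(a) & set(b)) for a, b in zip(s_nn, t_nn))
    return 100.0 * shared / (k * len(student))

# Hypothetical data: ten 3D reference points and a naive 2D counterpart.
rng = np.random.default_rng(0)
ref3d = rng.normal(size=(10, 3))
pts2d = ref3d[:, :2]  # drop the third coordinate as a crude baseline

rate = knn_match_rate(pts2d, ref3d, k=3)
print(rate)
```

Because the rate depends only on neighbor rankings, it can compare point sets of different dimensionality, which is what the 2D versus 3D setting in the figure requires.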