MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli
TL;DR
MOCHA addresses the challenge of adapting a general detector to user-specific objects under few-shot constraints by distilling multimodal, region-level semantics from a frozen vision-language teacher into a compact detector. It combines a feature translation module with two objectives, local alignment and relational embedding, which preserve both the semantic content and the geometry of the teacher’s embedding space. These are formalized as $\mathcal{L}_{\mathrm{dist}}$ and $\mathcal{L}_{\mathrm{emb}}$, respectively, and enter the overall loss $\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{\mathrm{dist}} \mathcal{L}_{\mathrm{dist}} + \lambda_{\mathrm{emb}} \mathcal{L}_{\mathrm{emb}}$. The method uses PCA-based dimensionality reduction to compress teacher targets to $d_t$ dimensions (best around $512$) and employs a translation module $t_S$ to align student features with teacher embeddings $u'_i$. Across four personalized detection benchmarks, MOCHA yields an average improvement of $+10.1$ over the YOLOv8n baseline while keeping latency competitive, and it generalizes to different student architectures, which makes it suitable for edge deployment. These results suggest that integrating multimodal priors during training can substantially enhance few-shot personalization without requiring text prompts or teacher inference at test time.
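To make the PCA compression step concrete, the following is a minimal sketch of how teacher targets could be projected down to $d_t$ dimensions. It is not the authors' implementation: the function names `fit_pca` and `compress`, the SVD-based fitting, and the toy array sizes are all assumptions chosen for illustration.

```python
import numpy as np


def fit_pca(teacher_embeddings: np.ndarray, d_t: int):
    """Fit a PCA basis on teacher embeddings of shape (n_regions, dim).

    Hypothetical helper: returns the mean and the top d_t principal
    directions. Requires d_t <= min(n_regions, dim).
    """
    mean = teacher_embeddings.mean(axis=0)
    centered = teacher_embeddings - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:d_t]


def compress(embeddings: np.ndarray, mean: np.ndarray, components: np.ndarray):
    """Project embeddings onto the fitted basis, giving (n_regions, d_t)."""
    return (embeddings - mean) @ components.T


# Toy usage with random stand-ins for teacher region embeddings.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 32))  # 64 regions, 32-dim (illustrative sizes)
mean, components = fit_pca(teacher, d_t=8)
targets = compress(teacher, mean, components)  # compressed teacher targets
```

In this sketch the basis is fitted once on the frozen teacher's outputs and then reused, so the student only ever has to match the lower-dimensional targets.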
Abstract
Personalized object detection aims to adapt a general-purpose detector to recognize user-specific instances from only a few examples. Lightweight models often struggle in this setting due to their weak semantic priors, while large vision-language models (VLMs) offer strong object-level understanding but are too computationally demanding for real-time or on-device applications. We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a distillation framework that transfers multimodal region-level knowledge from a frozen VLM teacher into a lightweight vision-only detector. MOCHA extracts fused visual-textual embeddings from the teacher and uses them to guide student training through a dual-objective loss that enforces accurate local alignment and global relational consistency across regions. This process transfers semantics efficiently, without modifying the teacher or requiring textual input at inference. MOCHA consistently outperforms prior baselines across four personalized detection benchmarks under strict few-shot regimes, yielding a +10.1 average improvement, with minimal inference cost.
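As a rough illustration of the dual-objective loss, the sketch below pairs a per-region alignment term with a relational term that compares pairwise similarity structure, then adds both to a detection loss with weights. The exact forms of the two terms are not given here, so the squared-distance alignment, the cosine-similarity relational term, and all function names and default weights are assumptions rather than the paper's definitions.

```python
import numpy as np


def local_alignment_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Assumed form: mean squared distance between matched region embeddings."""
    return float(np.mean(np.sum((student - teacher) ** 2, axis=1)))


def pairwise_cosine(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between every pair of rows of x."""
    normed = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    return normed @ normed.T


def relational_embedding_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Assumed form: mismatch between student and teacher similarity matrices."""
    diff = pairwise_cosine(student) - pairwise_cosine(teacher)
    return float(np.mean(diff ** 2))


def total_loss(l_det: float, student: np.ndarray, teacher: np.ndarray,
               lambda_dist: float = 1.0, lambda_emb: float = 1.0) -> float:
    """Weighted sum: L_det + lambda_dist * L_dist + lambda_emb * L_emb."""
    return (l_det
            + lambda_dist * local_alignment_loss(student, teacher)
            + lambda_emb * relational_embedding_loss(student, teacher))


# Toy usage: 5 regions with 16-dim embeddings (illustrative sizes only).
rng = np.random.default_rng(0)
teacher_targets = rng.normal(size=(5, 16))
student_features = teacher_targets + 0.1 * rng.normal(size=(5, 16))
loss = total_loss(l_det=0.7, student=student_features, teacher=teacher_targets)
```

The two terms are complementary in this sketch: rescaling every student embedding leaves the relational term nearly unchanged but is penalized by the alignment term, while the relational term tracks how regions relate to one another rather than where each one sits.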
