Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation

Hongbo Jiang, Jie Li, Xinqi Cai, Tianyu Xie, Yunhang Shen, Pingyang Dai, Liujuan Cao

TL;DR

MLLMEmbed-ReID is a unified cloud-edge framework for deploying MLLM-level intelligence on resource-constrained devices. It introduces a novel distillation strategy motivated by the low-rank property of the teacher's feature space.

Abstract

Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) is hampered by the need to maintain a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We use instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation fine-tuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to transfer its knowledge to an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property of the teacher's feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.
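
The two distillation terms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's formulation, which is not given in this summary. It assumes the student features have already been projected to the teacher's dimension, and the function names, the `k` parameter, and the variance-based weighting are all hypothetical choices made for illustration.

```python
import numpy as np


def principal_component_loss(teacher, student, k=8):
    """Illustrative stand-in for a principal-component-weighted loss.

    Projects the teacher-student difference onto the teacher's top-k
    principal directions and weights each direction by its explained
    variance ratio (an assumption, not the paper's definition).
    """
    mean = teacher.mean(axis=0, keepdims=True)
    t = teacher - mean
    s = student - mean
    # Rows of vt are the principal directions of the teacher features.
    _, sing, vt = np.linalg.svd(t, full_matrices=False)
    k = min(k, sing.shape[0])
    weights = sing[:k] ** 2 / np.sum(sing ** 2)
    diff = (t - s) @ vt[:k].T  # shape: (batch, k)
    return float(np.mean((diff ** 2) * weights))


def feature_relation_loss(teacher, student):
    """Illustrative stand-in for a relation-preserving loss.

    Compares the pairwise cosine-similarity matrices of the two batches.
    """
    def cosine_matrix(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    gap = cosine_matrix(teacher) - cosine_matrix(student)
    return float(np.mean(gap ** 2))
```

Under these assumptions, the first term penalizes errors most heavily along the directions that carry most of the teacher's variance, while the second depends only on the angles between samples and so ignores any rescaling of the student features.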

Paper Structure

This paper contains 19 sections, 13 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: A Comparison of MLLM-based ReID paradigms. Images are from the QrCM-ReID dataset. (a) Existing methods indirectly apply MLLMs for VQA-based retrieval (limited by gallery size, prone to hallucination with long visual contexts) or textual distillation (restricting to text-only ReID). (b) Our MLLMEmbed-ReID directly uses a cloud-based MLLM as a unified teacher for diverse modalities. Its unified knowledge is distilled to a lightweight edge student for practical deployment.
  • Figure 2: Overview of the proposed MLLMEmbed-ReID framework. Images are from the QrCM-ReID dataset. The framework consists of two main components: cloud model fine-tuning and edge model distillation. The cloud model includes task instructions and modality prompts, an MLLM backbone (Qwen2-VL), pooling operations, Identity Identification (ID loss), triplet learning (Triplet loss), and Similarity Distribution Matching (SDM). The edge model includes a Vision Language Model (VLM) backbone (CLIP ViT-L/14), modality projection, a distillation matching loss (e.g., cosine loss), the Principal Component Mapping loss (PCM loss), and the Feature Relation loss (FR loss). Within the framework, both the cloud and edge models handle CM-ReID tasks end to end in a single unified model.
  • Figure 3: SVD analysis of the cloud-based model's ReID features. The left y-axis shows the explained variance ratio per principal component, while the right y-axis shows the cumulative explained variance ratio. The x-axis is plotted on a logarithmic scale to better visualize the rapid decay of singular values.
  • Figure 4: t-SNE visualizations of (a) the cloud-based model and (b) the edge-based model. Marker shapes denote modalities; colors denote pedestrian IDs.
  • Figure 5: Recognition results of (a) the cloud-based model and (b) the edge-based model. Images are from the QrCM-ReID dataset. (1) and (2) show the IR$\to$R and S$\to$R tasks, respectively, while (3) and (4) show the T$\to$R task. Caption A: This woman is a ponytail, wearing a black down jacket, black trousers, black boots, wearing a purple scarf and glasses. She walks with her hand in her pocket. Caption B: This woman is wearing a black coat, black trousers and black shoes. She was wearing glasses and a purple scarf. She walks while watching her cell phone.
  • ...and 2 more figures
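
The quantities plotted in Figure 3 (per-component and cumulative explained variance ratio) can be computed from a feature matrix as in the sketch below. The features here are a synthetic low-rank stand-in, not the paper's ReID features, and the function name `explained_variance` is an illustrative choice.

```python
import numpy as np


def explained_variance(features):
    """Return per-component and cumulative explained variance ratios."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Singular values come back sorted in descending order.
    sing = np.linalg.svd(centered, compute_uv=False)
    ratio = sing ** 2 / np.sum(sing ** 2)
    return ratio, np.cumsum(ratio)


rng = np.random.default_rng(0)
# Synthetic stand-in: a rank-4 signal embedded in 64 dimensions plus small noise.
low_rank = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 64))
features = low_rank + 0.01 * rng.normal(size=(200, 64))

ratio, cumulative = explained_variance(features)
```

For data like this, the cumulative ratio saturates after the first few components, which is the kind of rapid decay that motivates a low-rank-aware distillation loss.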