A Step Toward Federated Pretraining of Multimodal Large Language Models

Baochen Xiong; Yifan Xu; Xiaoshan Yang; Yaguang Song; Yaowei Wang; Changsheng Xu

A Step Toward Federated Pretraining of Multimodal Large Language Models

Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu

Abstract

The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.

A Step Toward Federated Pretraining of Multimodal Large Language Models

Abstract

A Step Toward Federated Pretraining of Multimodal Large Language Models

Abstract

Paper Structure

Table of Contents

Figures (9)