Representation Potentials of Foundation Models for Multimodal Alignment: A Survey

Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, Yun Fu

TL;DR

This survey asks whether foundation models learn representations that converge toward transferable, modality-agnostic abstractions. It reviews representative vision, language, speech, and multimodal models, along with similarity metrics such as centered kernel alignment (CKA), canonical correlation analysis (CCA), and mutual nearest neighbors (MNN), used to assess alignment within and across modalities. The surveyed studies report widespread cross-architecture and cross-modal representational similarity, with scale, training paradigm, and task diversity driving alignment, and with growing connections to neuroscience. The survey highlights both the potential benefits for interoperability and the challenges of robust evaluation, data bias, and modality-specific divergence, which can guide future theoretical and empirical work on representation potentials and cross-modal alignment.

Abstract

Foundation models learn highly transferable representations through large-scale pretraining on diverse data. An increasing body of research indicates that these representations exhibit a remarkable degree of similarity across architectures and modalities. In this survey, we investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information within a single modality while also providing a transferable basis for alignment and unification across modalities. We begin by reviewing representative foundation models and the key metrics that make alignment measurable. We then synthesize empirical evidence of representation potentials from studies in vision, language, speech, multimodality, and neuroscience. The evidence suggests that foundation models often exhibit structural regularities and semantic consistencies in their representation spaces, positioning them as strong candidates for cross-modal transfer and alignment. We further analyze the key factors that foster representation potentials, discuss open questions, and highlight potential challenges.

Paper Structure

This paper contains 20 sections and 7 equations.