Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Yassir Benhammou, Suman Kalyan, Sujay Kumar

TL;DR

The paper addresses the challenge of learning unified representations across visual, audio, and textual modalities to automate metadata generation in broadcast media. It introduces the Multimodal Autoencoder (MMAE), trained on the LUMA benchmark, to learn a modality-invariant latent space through joint reconstruction rather than contrastive objectives. Empirical results show that MMAE outperforms linear baselines on clustering metrics (Silhouette, ARI, NMI), peaking at Silhouette $0.63$, ARI $0.91$, and NMI $0.96$ for $k=42$, and qualitative analyses indicate strong cross-modal alignment. The approach offers data-efficient, interpretable embeddings suited for scalable metadata tagging, cross-modal retrieval, and asset management in broadcast workflows, with future work extending to temporal dynamics and transformer-based encoders.

Abstract

Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (such as video, audio, or text), limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.
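
To make the clustering metrics named above concrete, here is a minimal pure-Python sketch of ARI and NMI. It is not the authors' evaluation code, which is not shown here; the function names are illustrative, and the arithmetic-mean normalization for NMI is an assumption (the paper may use a different variant). In practice a library such as scikit-learn (`adjusted_rand_score`, `normalized_mutual_info_score`, `silhouette_score`) would normally be used instead.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    joint = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in joint.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate partitions (e.g. one cluster each)
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization (an assumption of this sketch)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information between the two labelings.
    mi = sum(
        (c / n) * log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    # Entropy of each labeling.
    h_true = -sum((c / n) * log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * log(c / n) for c in count_pred.values())
    denom = (h_true + h_pred) / 2
    return 1.0 if denom == 0 else mi / denom


# Toy labels: the prediction is the ground truth with cluster ids permuted.
truth = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 0, 0, 1, 1]
ari = adjusted_rand_index(truth, pred)
nmi = normalized_mutual_info(truth, pred)
```

Both scores are invariant to how cluster ids are named, which is why they suit comparing k-means assignments on latent embeddings against class labels.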

Paper Structure

This paper contains 19 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Aligned LUMA triplets (image, audio waveform, caption). Each row shows one example, with strict alignment between the three modalities. Left: visual depiction and class label; Middle: corresponding audio waveform; Right: natural-language caption. The dataset’s structure parallels real-world audiovisual metadata—providing a realistic foundation for training reconstruction-based multimodal AI models for content understanding and automation.
  • Figure 2: Architecture of the proposed Multimodal Autoencoder (MMAE). The model consists of three modality-specific encoders, one each for image, audio, and text inputs. Each encoder maps its modality to a shared latent representation $z$, which captures modality-invariant semantic features. From this latent vector, three corresponding decoders $\mathrm{Decoder}_I$, $\mathrm{Decoder}_A$, and $\mathrm{Decoder}_T$ reconstruct each modality, enforcing cross-modal consistency through joint reconstruction losses. This design encourages the shared latent space to align semantically equivalent content across modalities while preserving their unique characteristics.
  • Figure 3: t-SNE projection of the MMAE latent space (latent vector $z$ of dimension 128). Distinct, compact clusters reflect strong semantic alignment across modalities.
  • Figure 4: UMAP visualization confirming high cluster separability and smooth semantic transitions in the shared latent space.
  • Figure 5: t-SNE overlay of latent embeddings by modality (image, audio, text). Convergence within shared clusters indicates modality invariance and cross-modal semantic consistency.
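
The architecture described in the Figure 2 caption (three encoders, one shared latent $z$, three decoders, a summed reconstruction loss) can be sketched roughly as follows. This is an illustrative toy under several assumptions that are not taken from the paper: linear layers stand in for the real encoders and decoders, the per-modality codes are fused by simple averaging, the loss is an unweighted sum of mean squared errors, and the class name `TinyMMAE` and all dimensions are hypothetical. It only runs a forward pass and computes the loss; no training is performed.

```python
import random


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def init_matrix(rows, cols, rng):
    """Small random weight matrix of shape (rows, cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class TinyMMAE:
    """Linear stand-in for a multimodal autoencoder; all sizes are hypothetical."""

    def __init__(self, input_dims, latent_dim, seed=0):
        rng = random.Random(seed)
        # One encoder and one decoder per modality.
        self.encoders = {m: init_matrix(latent_dim, d, rng) for m, d in input_dims.items()}
        self.decoders = {m: init_matrix(d, latent_dim, rng) for m, d in input_dims.items()}

    def encode(self, inputs):
        # Assumption: fuse the per-modality codes by averaging into one shared z.
        codes = [linear(self.encoders[m], x) for m, x in inputs.items()]
        return [sum(vals) / len(codes) for vals in zip(*codes)]

    def joint_loss(self, inputs):
        # Every decoder reconstructs its modality from the same shared z;
        # the joint loss is the unweighted sum of per-modality MSEs.
        z = self.encode(inputs)
        return sum(mse(linear(self.decoders[m], z), x) for m, x in inputs.items())


dims = {"image": 8, "audio": 6, "text": 4}
model = TinyMMAE(dims, latent_dim=3)
sample_rng = random.Random(1)
sample = {m: [sample_rng.random() for _ in range(d)] for m, d in dims.items()}
z = model.encode(sample)
loss = model.joint_loss(sample)
```

Because every decoder reads the same `z`, lowering the summed loss pushes the latent code to carry information useful to all three modalities, which is the intuition behind the modality-invariant clusters shown in Figures 3 to 5.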