Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding
Yassir Benhammou, Suman Kalyan, Sujay Kumar
TL;DR
The paper addresses the challenge of learning unified representations across visual, audio, and textual modalities to automate metadata generation in broadcast media. It introduces the Multimodal Autoencoder (MMAE), trained on the LUMA benchmark, which learns a modality-invariant latent space through joint reconstruction rather than contrastive objectives. Empirically, MMAE outperforms linear baselines on clustering metrics, peaking at Silhouette $0.63$, ARI $0.91$, and NMI $0.96$ at $k=42$, and qualitative analyses indicate strong cross-modal alignment. The approach offers data-efficient, interpretable embeddings suited to scalable metadata tagging, cross-modal retrieval, and asset management in broadcast workflows. Future work extends the model to temporal dynamics and transformer-based encoders.
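To make the reported clustering scores concrete, the following is a minimal, dependency-free sketch of how ARI and NMI compare a predicted clustering against reference labels. It is not the authors' evaluation code: the function names are hypothetical, the NMI normalization (arithmetic mean of the two entropies) is an assumption, and Silhouette is left out because it requires the embeddings themselves rather than labels alone.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table (pair-counting form)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # fewer than two items: no pairs to disagree on
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs that share a cluster in both / one labeling.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    mutual_info = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in cell_counts.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * log(c / n) for c in pred_counts.values())
    denom = (h_true + h_pred) / 2
    return mutual_info / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Toy labels, not LUMA data: the prediction is the truth with cluster ids permuted.
    truth = [0, 0, 1, 1, 2, 2]
    pred = [1, 1, 0, 0, 2, 2]
    print(adjusted_rand_index(truth, pred), normalized_mutual_info(truth, pred))
```

Both scores are invariant to how cluster ids are named, so a prediction that recovers the reference partition under different ids scores 1 (up to floating-point rounding), while ARI falls to around zero or below for chance-level agreement.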
Abstract
Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (such as video, audio, or text), limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.
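As an illustration of what a joint reconstruction objective can look like, one plausible form is given below. The notation is assumed rather than taken from the paper: $E_m$ and $D_m$ are hypothetical per-modality encoders and decoders for text ($t$), audio ($a$), and visual ($v$) inputs $x_m$, $f$ is an unspecified fusion function producing the shared latent $z$, and $\lambda_m$ are assumed per-modality weights. The squared-error reconstruction term is likewise an assumption.

$$
z = f\big(E_t(x_t),\, E_a(x_a),\, E_v(x_v)\big), \qquad
\mathcal{L}_{\text{rec}} = \sum_{m \in \{t,\, a,\, v\}} \lambda_m \,\big\lVert x_m - D_m(z) \big\rVert_2^2
$$

Because every decoder reads the same $z$, lowering this loss requires the latent to carry information common to all three modalities, which is the intuition behind learning modality-invariant structure without a contrastive term.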
