Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Yu Sun, Yin Li, Ruixiao Sun, Chunhui Liu, Fangming Zhou, Ze Jin, Linjie Wang, Xiang Shen, Zhuolin Hao, Hongyu Xiong

TL;DR

This work tackles data efficiency and multimodal fusion in large-scale vision-language systems, starting from the shortcomings of traditional uncertainty-based active learning. It introduces kNN-based Latent Space Broadening (LSB) with a Lookalike Threshold (LT) to enrich the pool of informative, hard samples in the latent embedding space, and Vision-Language Modeling with Audio Enhancement (VLMAE), which enables audio-VL cross-modal interaction through a learnable attention mechanism. Across three production tasks, the authors report that LSB-LT improves annotation quality and model performance, while VLMAE provides audio-augmented VL fusion, leading to measurable online gains in revenue and content-safety metrics. Together, the two components improve data efficiency and cross-modal content understanding in industry-scale systems.

Abstract

Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.
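
To make the audio-VL interaction via learnable attention more concrete, here is a minimal sketch of one plausible form it could take: single-head cross-attention in which VL tokens act as queries over audio tokens, with the result added back to the VL tokens. This is an illustrative assumption, not the authors' architecture. The function name `audio_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the residual connection are all hypothetical, and a real system would use a deep learning framework rather than plain lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def audio_cross_attention(vl_tokens, audio_tokens, w_q, w_k, w_v):
    """VL tokens attend over audio tokens (hypothetical fusion layer).

    Shapes (assumed): vl_tokens (n_vl, d_vl), audio_tokens (n_audio, d_audio),
    w_q (d_vl, d), w_k (d_audio, d), w_v (d_audio, d_vl).
    """
    q = matmul(vl_tokens, w_q)      # queries from the VL stream
    k = matmul(audio_tokens, w_k)   # keys from the audio stream
    v = matmul(audio_tokens, w_v)   # values from the audio stream
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)         # (n_vl, n_audio) similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)   # audio summary per VL token
    # Residual connection (an assumption): keep the VL signal, add audio context.
    fused = [[a + b for a, b in zip(vl_row, att_row)]
             for vl_row, att_row in zip(vl_tokens, attended)]
    return fused, weights
```

In this sketch, the pre-trained VL and audio encoders would stay separate and only the three projection matrices would be learned, which is one way to read "mid-fusion"; the paper's actual layer may differ.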

Paper Structure

This paper contains 26 sections, 12 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Active Learning Pipeline with kNN-based LSB under LT. In this pipeline, candidate samples are randomly selected from impression logs, and a set of seed badcases is selected by comparing the human annotation with the vision-language transformer (VLT) model prediction. Using kNN, the seed badcases are expanded by choosing the samples from the candidate set whose hidden embeddings are closest. The broadened set is further filtered by a binary-class lookalike model at a given LT value. The resulting badcases are then mixed with statistical-AL-selected candidates, forming the dataset to be annotated for the next round of VLT model updates.
  • Figure 2: Vision-Language Modeling with Audio Enhancement (VLMAE), where a learnable attention layer is introduced to improve the fusion between audio and VL information. This enables the model to effectively fit and recognize audio modality features.
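
The expansion and filtering steps in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity stands in for "closest hidden embeddings" (the caption does not name a metric), `lookalike_scores` is a hypothetical list of precomputed outputs from the binary lookalike model, and a candidate is kept when its score is at or above `lt` (the direction of the threshold is also assumed). Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def broaden_badcases(seed_embs, cand_embs, lookalike_scores, k=5, lt=0.5):
    """Return sorted candidate indices kept after kNN expansion and LT filtering.

    seed_embs: embeddings of seed badcases (model prediction != human label).
    cand_embs: embeddings of unlabeled candidates from the impression logs.
    lookalike_scores: hypothetical per-candidate scores from a lookalike model.
    """
    expanded = set()
    for seed in seed_embs:
        # Rank all candidates by similarity to this seed, most similar first.
        ranked = sorted(
            range(len(cand_embs)),
            key=lambda i: cosine(seed, cand_embs[i]),
            reverse=True,
        )
        expanded.update(ranked[:k])  # k nearest neighbours of this seed
    # Lookalike Threshold: drop neighbours the lookalike model scores too low.
    return sorted(i for i in expanded if lookalike_scores[i] >= lt)


seeds = [[1.0, 0.0]]
candidates = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.0], [-1.0, 0.0]]
scores = [0.9, 0.8, 0.2, 0.95]
kept = broaden_badcases(seeds, candidates, scores, k=2, lt=0.5)
```

Here candidates 0 and 2 are the two nearest neighbours of the seed, and candidate 2 is then dropped because its lookalike score falls below the threshold. The surviving indices would be mixed with the statistical-AL selections before annotation. The brute-force ranking is only for readability; at production scale an approximate nearest-neighbour index would be the more likely choice.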