ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding

Santosh Rajagopalan, Jonathan Vronsky, Songbai Yan, S. Alireza Golestaneh, Shubhra Chandra, Min Zhou

TL;DR

ALF tackles advertiser understanding by unifying structured data with multi-modal ad content in a scalable transformer framework. It introduces a dual-attention encoder with inter-sample attention, pretrains on large-scale advertiser data with reconstruction and contrastive losses, and produces calibrated predictions via Spectral-normalized Neural Gaussian Process (SNGP) heads. The model shows strong results on offline and public benchmarks and delivers production gains in both precision and recall on real-world policy tasks, at the cost of higher latency and resource usage. This work shows how holistic multi-modal representations can improve reliability and effectiveness in high-stakes advertising systems. The approach has practical impact for fraud detection, policy enforcement, and advertiser trust assessment, with modeling of temporal dynamics and further scaling left as future directions.

Abstract

We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video, and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF demonstrates significant real-world impact by delivering simultaneous gains in both precision and recall, for instance boosting recall by over 40 percentage points on one critical policy and increasing precision to 99.8% on another. The architecture's effectiveness stems from its novel combination of multi-modal transformations, an inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
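
To make the idea of inter-sample attention more concrete, here is a minimal sketch in plain Python. It is not the ALF implementation: it assumes each advertiser in a batch is already encoded as a single embedding vector, and it omits the learned query/key/value projections and multiple heads that a real encoder would typically use. The names `softmax`, `inter_sample_attention`, and the toy `batch` are hypothetical, introduced only for illustration. The point it shows is that attention is computed across the samples in a batch, so each advertiser's representation becomes a similarity-weighted mix of the others.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_sample_attention(batch):
    """Attend across the samples of a batch (illustrative only).

    batch: list of n embedding vectors, each of length d.
    Returns (mixed, weights), where weights[i][j] is how strongly
    sample i attends to sample j, and mixed[i] is the resulting
    weighted average of all sample embeddings.
    Assumption: identity projections stand in for learned Q/K/V maps.
    """
    d = len(batch[0])
    scale = math.sqrt(d)
    mixed, weights = [], []
    for query in batch:
        # Scaled dot-product similarity between this sample and every sample.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in batch]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all sample embeddings, one dimension at a time.
        mixed.append([sum(w_j * row[i] for w_j, row in zip(w, batch)) for i in range(d)])
    return mixed, weights


# Toy batch: samples 0 and 1 are similar, sample 2 is different.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mixed, weights = inter_sample_attention(batch)
```

In this toy batch, sample 0 places more attention weight on the similar sample 1 than on the dissimilar sample 2, which is the behavior that lets a model compare an advertiser against others seen in the same batch.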

Paper Structure

This paper contains 37 sections, 12 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: ALF model architecture showing the multi-modal encoders, dual attention mechanisms, and output heads.
  • Figure 2: ALF input processing for each feature type.
  • Figure 3: UMAP visualization of advertiser embeddings, colored by advertiser intent.
  • Figure 4: UMAP visualization of advertiser embeddings without inter-sample attention, colored by advertiser intent.