AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning

Binxiao Xu, Junyu Feng, Xiaopeng Lin, Haodong Li, Zhiyuan Feng, Bohan Zeng, Shaolin Lu, Ming Lu, Qi She, Wentao Zhang

TL;DR

AD-MIR addresses the gap between perception and persuasion in long-form advertising videos by coupling Structure-Aware Memory Construction with a Structured Reasoning Agent. Modeling the task as a POMDP, it builds a structured multimodal memory via Hybrid Semantic-Lexical Indexing and a Context-Anchored Subject Registry, then applies a prompt-guided ReAct controller with a toolset to iteratively refine hypotheses and ground them in pixel-level evidence. Reliability mechanisms enforce visual grounding and self-correction, reducing hallucinations and ensuring evidence-based conclusions. On AdsQA, AD-MIR achieves state-of-the-art performance, outperforming both end-to-end LMMs and other reasoning-driven baselines, and ablations demonstrate the indispensability of each component. The approach promises practical impact for advertising understanding, moderation, and transparency by explicitly tying abstract marketing strategies to concrete visual signals, all within a gradient-free, modular architecture.
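
The iterate-then-verify control flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: `run_agent`, `Step`, the `policy` and `verify` callables, and the tool names are all hypothetical stand-ins, and `policy` takes the place of the frozen, prompt-guided model. The sketch only shows how a step budget, a tool-calling loop, and an evidence check that rejects unsupported answers might fit together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One entry of the interaction history (hypothetical structure)."""
    action: str
    argument: str
    observation: str


def run_agent(
    question: str,
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    verify: Callable[[str, List[Step]], bool],
    t_max: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run a ReAct-style loop for at most t_max steps.

    `policy` stands in for the frozen model: it maps the question and the
    history to an (action, argument) pair. An "answer" action is accepted
    only if `verify` finds supporting evidence in the history; otherwise the
    rejection is logged so the policy can revise on the next step.
    """
    history: List[Step] = []
    for _ in range(t_max):
        action, argument = policy(question, history)
        if action == "answer":
            if verify(argument, history):
                return argument, history
            history.append(Step("answer", argument, "rejected: no supporting evidence"))
            continue
        # Any other action is treated as a tool call; its output is the observation.
        history.append(Step(action, argument, tools[action](argument)))
    return None, history  # step budget exhausted without a verified answer
```

In this sketch a rejected answer consumes one step of the budget, which is one possible design choice; the source does not say how backtracking is counted.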

Abstract

Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.
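
To make the combination of semantic retrieval and exact keyword matching concrete, here is a minimal sketch of one way a hybrid score could be formed. The linear blend with a lexical weight `beta`, the token-overlap measure, and the names `hybrid_score` and `retrieve` are assumptions made for illustration; the abstract does not give the actual scoring formula, and a real system would use learned embeddings rather than hand-written vectors.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that appear verbatim in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def hybrid_score(query: str, query_vec: Sequence[float],
                 entry_text: str, entry_vec: Sequence[float],
                 beta: float = 0.3) -> float:
    """Assumed blend: (1 - beta) * semantic similarity + beta * keyword overlap."""
    semantic = cosine(query_vec, entry_vec)
    lexical = lexical_overlap(query, entry_text)
    return (1.0 - beta) * semantic + beta * lexical


def retrieve(query: str, query_vec: Sequence[float],
             memory: List[Dict], beta: float = 0.3, k: int = 1) -> List[Dict]:
    """Return the k memory entries with the highest hybrid score."""
    ranked = sorted(
        memory,
        key=lambda e: hybrid_score(query, query_vec, e["text"], e["vec"], beta),
        reverse=True,
    )
    return ranked[:k]
```

Under this assumed blend, `beta = 0` ranks entries by embedding similarity alone and `beta = 1` ranks them by keyword overlap alone, so an entry that names a brand verbatim can outrank a semantically closer clip once `beta` is large enough.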

Paper Structure

This paper contains 26 sections, 1 theorem, 7 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Proposition 2.2

The agent selects the next action $a_t \in \mathcal{A}$ according to a domain-adaptive policy $\pi(a_t | \mathcal{H}_{t-1}, o_t; \Theta)$, where $\Theta$ represents the frozen parameters of a Large Multimodal Model. The policy is implemented via prompt-based in-context learning rather than gradient-

Figures (3)

  • Figure 1: Illustrative walkthrough of AD-MIR’s reasoning pipeline on an AdsQA example. Unlike general agents that focus on retrieval, AD-MIR bridges the cognitive gap by first constructing a high-level causal narrative via a communication expert (Phase 1-2), then performing targeted frame inspection to verify precise visual details (Phase 3), and finally synthesizing a visually grounded explanation that links the expert narrative to pixel-level evidence (Phase 4).
  • Figure 2: The overall architecture of AD-MIR. The framework comprises five synergistic components: (A) Input & Context construction for multi-modal preprocessing; (B) a ReAct Controller for iterative reasoning; (C) a Hierarchical Toolset featuring global browsing, communication experts, and fine-grained inspection; (D) a Unified Multimodal Database serving as shared memory; and (E) an Answer Refinement stage to ensure concise, evidence-based output.
  • Figure 3: Ablation and sensitivity analysis of AD-MIR on AdsQA. Subfigures (a) and (b) report strict (top) and relaxed (bottom) accuracy on all five dimensions under different lexical weights $\beta$ and maximum reasoning steps $T_{\text{max}}$, respectively, showing that performance is stable across a broad range and peaks around the default setting. Subfigure (c) presents component-level ablations for Hybrid Indexing, Subject Registry, Communication Expert, and Visual Anchor, where "w/" and "w/o" denote whether the corresponding module is enabled; the results highlight the complementary gains of structured indexing, domain expert reasoning, and visual anchor self-correction.

Theorems & Definitions (2)

  • Definition 2.1
  • Proposition 2.2