Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization

Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang

TL;DR

This work tackles SlideASR by showing that end-to-end multimodal models often underutilize or misinterpret slide context, effectively acting as OCR systems. It introduces Visually-Anchored Policy Optimization (VAPO), a post-training approach that enforces a structured <think><answer> reasoning format to first extract slide text via OCR and then generate transcription anchored to the visual content, guided by four rewards and GRPO optimization. To support research, the authors present SlideASR-Bench, comprising a synthetic SlideASR-S and a real-world SlideASR-R dataset with dense domain-specific entities. Across SlideSpeech and SlideASR-Bench, VAPO-7B demonstrates strong improvements in both general transcription accuracy and entity-focused metrics, validating an effective end-to-end paradigm for SlideASR and highlighting the value of explicit reasoning in multimodal transcription tasks.

Abstract

Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model's reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured "Look before Transcription" procedure using a <think><answer> format. Specifically, the model first performs OCR on the slide content within the think step, then generates the transcription by referencing this recognized visual information in the answer step. This reasoning process is optimized via reinforcement learning with four distinct rewards targeting format compliance, OCR accuracy, ASR quality, and visual anchoring consistency. To support further research, we construct SlideASR-Bench, a new entity-rich benchmark consisting of a synthetic dataset for training and testing, and a challenging real-world set for evaluation. Extensive experiments demonstrate that VAPO significantly improves recognition of domain-specific terms, establishing an effective end-to-end paradigm for SlideASR.
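The structured "Look before Transcription" output can be illustrated with a minimal sketch. The abstract names the <think><answer> format and the four reward targets but does not give their definitions, so everything below is an assumption made for illustration: the closing tags, the function names `parse_output`, `format_reward`, and `anchoring_reward`, and the substring-based entity matching are hypothetical and are not the authors' implementation.

```python
import re

# Assumed tag layout: one <think>...</think> block (slide OCR) followed by
# one <answer>...</answer> block (transcription). The abstract only names
# the format, so the closing tags here are an assumption.
_PATTERN = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def parse_output(text):
    """Return (ocr_text, transcript), or None if the format is violated."""
    # Require each tag exactly once so repeated blocks are rejected.
    if any(text.count(tag) != 1 for tag in _TAGS):
        return None
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def format_reward(text):
    """Hypothetical binary format-compliance reward."""
    return 1.0 if parse_output(text) is not None else 0.0


def anchoring_reward(ocr_text, transcript, entities):
    """Hypothetical anchoring score: of the reference entities that appear
    in the OCR text, the fraction that also appear in the transcript."""
    ocr_lower = ocr_text.lower()
    transcript_lower = transcript.lower()
    visible = [e for e in entities if e.lower() in ocr_lower]
    if not visible:
        return 0.0
    hits = sum(1 for e in visible if e.lower() in transcript_lower)
    return hits / len(visible)
```

In this sketch, an output that skips the think step receives a format reward of 0.0, and the anchoring score only counts entities that the model has actually read off the slide, which mirrors the idea of tying the transcription to the recognized visual content.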

Paper Structure

This paper contains 31 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: We compare the outputs of Qwen2.5-Omni-7B, Qwen3-Omni-30B-A3B (with OCR text as context), and our VAPO-7B on a real Chinese medical report sample. Red text indicates incorrectly transcribed named entities.
  • Figure 2: Comparison of OLLM outputs with and without slide context.
  • Figure 3: Overview of the Visually-Anchored Policy Optimization framework. The OLLM takes audio, slide, and instruction as input, generates a structured output, and is optimized via reward functions that guide the policy update.
  • Figure 4: Attention visualization. Left: input image and transcribed audio text. Right: attention flows.
  • Figure 5: Qualitative examples from SlideASR-R.
  • ...and 2 more figures