A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video

Andrea Filiberto Lucas, Dylan Seychell

TL;DR

This work tackles the challenge of extracting personal names from on-screen broadcast graphics with a transparent, auditable approach. It introduces the News Graphics Dataset (NGD) and the Accurate Name Extraction Pipeline (ANEP), a deterministic, modular workflow that integrates object detection, text recognition, and linguistic processing. A systematic comparison with generative multimodal baselines (Gemini and LLaMA) reveals a trade-off: ANEP offers data lineage, reliability, and higher interpretability at a lower F1 (77.08%), while the generative models reach a higher aggregate F1 (up to 84.18%) at the cost of transparency and reproducibility. The work advances practical, accountable AI for media analysis, highlights the value of hybrid deterministic pipelines for journalism, accessibility, and governance, and outlines future directions in multilingual support and edge deployment.

Abstract

The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
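As a quick sanity check on the figures quoted in the abstract, the F1 score can be recomputed from the reported precision and recall as their harmonic mean. The sketch below is illustrative only: the function name `f1_score` is a hypothetical helper, and the inputs are the rounded percentages from the abstract, not the authors' raw counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was predicted or matched
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the abstract: precision 79.9%, recall 74.4%.
anep_f1 = f1_score(0.799, 0.744)
print(f"F1 from rounded precision/recall: {anep_f1:.4f}")
```

Under these rounded inputs the result is roughly 77.05%, close to the reported 77.08%. The small gap would be expected if the published precision and recall are rounded, or if the authors average the metric differently; both explanations are assumptions, since the abstract does not state how F1 was aggregated.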

Paper Structure (15 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Simplified workflow of ANEP, illustrating the sequential progression from video ingestion to object detection, text extraction, name recognition, and final result synthesis.
  • Figure 2: Training and validation curves for YOLOv12(m), showing box, classification, and object loss convergence alongside precision, recall, mAP@0.5, and mAP@0.5:0.95 metrics. The results demonstrate stable optimisation and consistent performance improvements across epochs, indicating robust generalisation and well-balanced precision-recall behaviour.
  • Figure 3: Grad-CAM visualisation for a representative frame illustrating the original input (left), activation map (centre), and overlay (right) derived from Layer 1 of YOLOv12(m), highlighting strong focus on text-rich regions.
  • Figure 4: Sample detection from a TVM news broadcast, demonstrating precise localisation and accurate bounding of all graphical overlays.
  • Figure 5: Detection example from a Sky News frame photographed from a television display. Despite glare and compression artefacts, the model successfully identified the principal graphical regions.
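
The sequential workflow summarised in Figure 1 (detection, text extraction, name recognition, result synthesis) could be sketched roughly as follows. This is a minimal illustration of a deterministic pipeline that records an audit trail at every stage, not the authors' implementation: `AuditRecord`, `FrameResult`, `run_pipeline`, and the three `stub_*` functions are hypothetical names, and the stubs stand in for the real detector, OCR engine, and name recogniser.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class AuditRecord:
    stage: str    # which stage produced this record
    source: str   # what the stage received
    output: str   # what the stage produced


@dataclass
class FrameResult:
    frame_id: int
    names: List[str] = field(default_factory=list)
    trail: List[AuditRecord] = field(default_factory=list)


def run_pipeline(
    frame_id: int,
    frame: Any,
    detect: Callable[[Any], List[Box]],
    read_text: Callable[[Any, Box], str],
    find_names: Callable[[str], List[str]],
) -> FrameResult:
    """Run detection -> text extraction -> name recognition, logging each step."""
    result = FrameResult(frame_id)
    boxes = detect(frame)
    result.trail.append(AuditRecord("detect", f"frame {frame_id}", f"{len(boxes)} boxes"))
    for box in boxes:
        text = read_text(frame, box)
        result.trail.append(AuditRecord("ocr", f"box {box}", repr(text)))
        names = find_names(text)
        result.trail.append(AuditRecord("ner", repr(text), repr(names)))
        for name in names:
            if name not in result.names:  # de-duplicate, preserving first-seen order
                result.names.append(name)
    return result


# Hypothetical stand-ins for the real components (assumptions for illustration).
def stub_detect(frame: Any) -> List[Box]:
    return [(0, 400, 640, 440)]  # pretend one lower-third graphic was found


def stub_read_text(frame: Any, box: Box) -> str:
    return "Jane Doe - Reporter"  # pretend OCR output for that region


def stub_find_names(text: str) -> List[str]:
    # Toy rule: text before " - " counts as a name if it has 2+ capitalised tokens.
    head = text.split(" - ")[0]
    tokens = head.split()
    if len(tokens) >= 2 and all(t[0].isupper() for t in tokens):
        return [head]
    return []


result = run_pipeline(0, None, stub_detect, stub_read_text, stub_find_names)
for record in result.trail:
    print(record.stage, "|", record.source, "->", record.output)
print(result.names)
```

Because every stage is a plain function with no sampling, the same frame always yields the same names and the same trail, which is the property the paper contrasts with generative baselines. Swapping a stub for a real component only requires matching the callable's signature.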