VKIE: The Application of Key Information Extraction on Video Text

Siyu An, Ye Liu, Haoyuan Peng, Di Yin

TL;DR

VKIE extracts hierarchical, structured information from video text by decomposing the task into four subtasks: text detection and recognition, box text classification (BTC), entity recognition (ER), and entity linking (EL). It presents two industrial solutions: PipVKIE, a sequential pipeline that runs BTC, ER, and EL per box, and UniVKIE, a unified multitask model with a shared multimodal backbone that processes all boxes in a frame in parallel. Experiments on a real-world annotated dataset show that UniVKIE generally outperforms PipVKIE in both accuracy and efficiency, which supports the benefits of multimodal fusion and joint optimization. The work offers practical, deployable methods for video indexing and retrieval via rich hierarchical tags, validated on a real-world industrial platform, with multilingual and LLM-integrated extensions planned as future work.

Abstract

Extracting structured information from videos is critical for numerous downstream applications in industry. In this paper, we define the task of extracting hierarchical key information from visual text in videos. To fulfill this task, we decouple it into four subtasks and introduce two implementation solutions called PipVKIE and UniVKIE. PipVKIE completes the four subtasks sequentially in consecutive stages, while UniVKIE improves on this by unifying all the subtasks into one backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions achieve remarkable performance and efficient inference speed.
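
The difference in control flow between the two solutions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Box` type, the label names ("title", "caption", "other"), and the rule-based `classify_box`, `recognize_entities`, and `link_entities` functions are hypothetical stand-ins for the learned models, not the authors' implementation. The sketch also does not model the shared multimodal backbone; it only shows that a PipVKIE-style loop runs the three stages box by box, while a UniVKIE-style pass runs each task over all boxes of a frame together.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Box:
    """One detected text box; assumed to come from the detection/recognition subtask."""
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical pixel coordinates


# The three functions below are hypothetical rule-based stubs so the sketch runs.
def classify_box(box: Box) -> str:
    """BTC stub: assign a box-level label."""
    if box.text.isupper():
        return "title"
    return "caption" if ":" in box.text else "other"


def recognize_entities(box: Box) -> List[Tuple[str, str]]:
    """ER stub: in a 'Name: Role' string, tag the two sides."""
    if ":" not in box.text:
        return []
    name, role = (part.strip() for part in box.text.split(":", 1))
    return [("name", name), ("role", role)]


def link_entities(entities: List[Tuple[str, str]]) -> Dict[str, str]:
    """EL stub: pair each recognized name with a role from the same box."""
    names = [value for kind, value in entities if kind == "name"]
    roles = [value for kind, value in entities if kind == "role"]
    return dict(zip(names, roles))


def pipeline_extract(boxes: List[Box]) -> List[dict]:
    """PipVKIE-style flow: BTC, then ER, then EL, one box at a time."""
    results = []
    for box in boxes:
        label = classify_box(box)
        entities = recognize_entities(box)
        links = link_entities(entities)
        results.append({"text": box.text, "label": label, "links": links})
    return results


def unified_extract(boxes: List[Box]) -> List[dict]:
    """UniVKIE-style flow: each task is applied to all boxes of the frame at once."""
    labels = [classify_box(b) for b in boxes]
    entities = [recognize_entities(b) for b in boxes]
    links = [link_entities(e) for e in entities]
    return [
        {"text": b.text, "label": lab, "links": lnk}
        for b, lab, lnk in zip(boxes, labels, links)
    ]


# A made-up frame with three text boxes.
frame = [
    Box("BREAKING NEWS", (10, 10, 300, 40)),
    Box("Jane Doe: Reporter", (10, 400, 260, 430)),
    Box("example.com", (500, 10, 620, 30)),
]

for item in pipeline_extract(frame):
    print(item)
```

With these stubs both functions return the same result, so the sketch says nothing about the accuracy or speed differences reported for the real models; it only makes the per-box versus per-frame structure concrete.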

Paper Structure (27 sections, 4 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: An example of hierarchical key information extracted by VKIE in a video frame [CGTN_2023].
  • Figure 2: The overall architecture of the two deployed solutions, PipVKIE and UniVKIE.
  • Figure 3: Real-world cases on our AI platform. The red boxes show the extraction results for the current frame.
  • Figure 4: A real-world case on the AI platform for clear viewing.
  • Figure 5: Examples of ER on the annotation platform. The red box indicates the candidate labels of ER.