VKIE: The Application of Key Information Extraction on Video Text
Siyu An, Ye Liu, Haoyuan Peng, Di Yin
TL;DR
VKIE extracts hierarchical, structured information from video text by decomposing the task into four subtasks: text detection and recognition, box text classification (BTC), entity recognition (ER), and entity linking (EL). It presents two industrial solutions: PipVKIE, a sequential pipeline that runs BTC, ER, and EL box by box, and UniVKIE, a unified multitask model with a shared multimodal backbone that processes all boxes in a frame in parallel. Experiments on a real-world annotated dataset show that UniVKIE generally outperforms PipVKIE in both accuracy and efficiency, which supports the benefits of multimodal fusion and joint optimization. The methods are deployable, have been validated on a real-world industrial platform, and support video indexing and retrieval through rich hierarchical tags. Planned future work includes multilingual and LLM-integrated extensions.
Abstract
Extracting structured information from videos is critical for numerous downstream applications in industry. In this paper, we define the task of extracting hierarchical key information from visual text in videos. To address this task, we decouple it into four subtasks and introduce two implementation solutions called PipVKIE and UniVKIE. PipVKIE completes the four subtasks sequentially in consecutive stages, while UniVKIE improves on it by unifying all the subtasks into one backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions achieve strong performance and efficient inference speed.
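
The difference in control flow between the two solutions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Box` type, the three rule-based functions, and the call counter are hypothetical stand-ins for trained models, the linking rule is an arbitrary placeholder, and text detection and recognition are assumed to have already produced the boxes. The sketch only shows that a pipeline issues separate stage calls per box, whereas a unified model handles all boxes of a frame in one call.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Box:
    text: str                        # recognized text for one detected box
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


# Toy rule-based stand-ins for trained models (assumptions, not the paper's models).
def classify_box(box: Box) -> str:
    """BTC stand-in: label boxes near the top of the frame as 'title'."""
    return "title" if box.bbox[1] < 100 else "other"


def recognize_entities(box: Box) -> List[str]:
    """ER stand-in: treat capitalized tokens as entities."""
    return [tok for tok in box.text.split() if tok[:1].isupper()]


def link_entities(entities: List[str]) -> List[Tuple[str, str]]:
    """EL stand-in: link each entity to the one that follows it."""
    return list(zip(entities, entities[1:]))


def analyze(box: Box) -> Dict:
    """Collect the three outputs for one box."""
    entities = recognize_entities(box)
    return {
        "label": classify_box(box),
        "entities": entities,
        "links": link_entities(entities),
    }


def pipeline_style(boxes: List[Box]) -> Tuple[List[Dict], int]:
    """Pipeline flow: three separate stage calls for every box."""
    results, calls = [], 0
    for box in boxes:
        results.append(analyze(box))
        calls += 3  # one call each for BTC, ER, and EL
    return results, calls


def unified_style(boxes: List[Box]) -> Tuple[List[Dict], int]:
    """Unified flow: all boxes of a frame handled in a single multitask call."""
    return [analyze(box) for box in boxes], 1
```

For a frame with two boxes, `pipeline_style` reports six stage calls and `unified_style` reports one, while both return the same toy outputs. In a real system the unified call would correspond to one forward pass through a shared backbone with three task heads.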
