V-SAT: Video Subtitle Annotation Tool
Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, Aritra Sen, Srushti Anil Patil, Vishwanathan Raman
TL;DR
This work addresses pervasive subtitle quality problems (timing, content accuracy, formatting, and on-screen placement) by proposing V-SAT, a unified multimodal pipeline that jointly detects and corrects language- and image-based subtitle issues using LLMs, VLMs, image processing, and ASR, with human-in-the-loop validation. The approach leverages audio-visual context to improve synchronization, readability, and contextual fidelity, and demonstrates substantial quality gains (e.g., reducing SUBER from $9.6$ to $3.54$ and achieving image-mode F1-scores around $0.80$). It provides an end-to-end demonstration interface, a modular toolbox of techniques (including saliency-based positioning and LLM-based contextual spelling correction), and a pathway toward real-time, multilingual, and streaming-platform integration. The work offers a practical, scalable solution with broad implications for accessibility, localization, and high-quality media consumption across platforms.
Abstract
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), image processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved measurably: the SUBER score fell from 9.6 to 3.54 after resolving all language-mode issues, and F1-scores of ~0.80 were achieved for image-mode issues. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
