ecVoice: Audio Text Extraction and Optimization of Video Based on Idioms Similarity Replacement

Jinwei Lin

TL;DR

This work tackles the resource-demanding task of extracting accurate text from video audio by introducing ecVoice, a post-processing pipeline that uses idiom similarity replacement to correct transcription at the grammar level. The method builds an idiom-aware pipeline with seven components, computes a composite similarity score from four sub-scores, and replaces sequences with correct idioms when beneficial. Experimental results show the idiom-correction capability reaches around $90\%$ on average and that ecVoice improves transcription quality on lightweight Whisper configurations, enabling efficient video translation and multimedia editing under limited hardware. The approach offers a practical, memory-efficient enhancement to audio-to-text pipelines with potential deployment in video editing, localization, and AI-assisted game design.

Abstract

Extracting text from the audio track of a video plays an important role in multimedia editing and processing. Whisper, a popular open-source toolkit, performs human voice recognition quickly. However, its recognition performance depends on the available computing resources, which makes it difficult to run Whisper with limited memory. This paper presents a practical solution for extracting the human voice from a video and generating high-quality text from that voice. The generated text can be used in video language translation and translated voice simulation. To improve the quality of voice extraction and transcription, we present ecVoice, a method that uses idiom similarity computation and analysis to improve the quality of audio text extraction. Experiments verify that ecVoice can raise the idiom grammar correction rate to 90\% on average. The method is simple and fast, so it adds little computational overhead while improving the voice recognition rate. Our method can significantly enhance Whisper recognition when computing memory is limited.
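To make the idea of idiom similarity replacement concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small hypothetical idiom list (`IDIOMS`), a hypothetical threshold of 0.8, and uses the character-level ratio from Python's `difflib.SequenceMatcher` as a stand-in for the composite similarity score, whose four sub-scores are not specified on this page. The function names `similarity` and `replace_idioms` are illustrative only.

```python
from difflib import SequenceMatcher

# Hypothetical idiom dictionary; the idiom list used by ecVoice is not given here.
IDIOMS = ["piece of cake", "break the ice", "under the weather"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: a single difflib ratio stands in for the paper's
    composite score built from four sub-scores.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def replace_idioms(text: str, idioms=IDIOMS, threshold: float = 0.8) -> str:
    """Replace word windows that closely match a known idiom."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        best = None  # (score, idiom, window length in words)
        for idiom in idioms:
            n = len(idiom.split())
            window_words = words[i:i + n]
            if len(window_words) < n:
                continue  # not enough words left to compare against this idiom
            score = similarity(" ".join(window_words), idiom)
            if score >= threshold and (best is None or score > best[0]):
                best = (score, idiom, n)
        if best is not None:
            out.append(best[1])  # substitute the canonical idiom
            i += best[2]         # skip the words that were replaced
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(replace_idioms("the exam was a peace of cake"))
```

With these assumptions, a misrecognized phrase such as "peace of cake" is close enough to the dictionary entry "piece of cake" to be replaced, while text with no near-idiom window is returned unchanged. The threshold trades missed corrections against false replacements and would need tuning on real transcripts.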

Paper Structure (18 sections, 6 equations, 7 figures, 9 tables)

Introduction
Literature Review
Extract Text from Voice
Voice in Multimedia Editing
Text Grammar Correction
Methodology Analysis
Running Architecture
Splitting Audio From Video
Audio Segmentation
Audio to Text
Redundancy Removal
Data Format Conversion
Similarity Computation
Experiment and Evaluation
Idiom Correction by Word Expression
...and 3 more sections

Figures (7)

Figure 1: Running and design architecture of the overall research.
Figure 2: Using MoviePy to split the audio data from the video.
Figure 3: Using pydub to segment the extracted audio.
Figure 4: Using Whisper to transcribe the speech in the audio to text.
Figure 5: Comparing two close sentences to remove redundancy.
...and 2 more figures
