VANPY: Voice Analysis Framework
Gregory Koushnir, Michael Fire, Galit Fuhrmann Alpert, Dima Kagan
TL;DR
VANPY addresses the lack of end-to-end tools for automated voice analysis by introducing an open-source framework that unifies preprocessing, feature extraction, and speaker-characterization model inference in a modular pipeline. It integrates more than fifteen components, including voice activity detection (VAD), music/speech separation, speaker embeddings, and four in-house models for gender, emotion, age, and height, and demonstrates robust cross-dataset performance on VoxCeleb2, Mozilla Common Voice, and DARPA-TIMIT. A Pulp Fiction use case showcases VANPY's ability to extract gender, age, height, and emotion traits: gender classification reaches 100% accuracy, age and height estimates are assessed by mean absolute error (MAE), and emotional states are analyzed along the arousal, dominance, and valence dimensions. The work emphasizes VANPY's extensibility, cross-dataset generalization, and practical applicability in domains such as speaker profiling and human-computer interaction, and outlines future directions that broaden coverage to accents and emotion intensity. Overall, VANPY provides a scalable platform for voice characterization that can be extended with additional components and datasets.
Abstract
Voice data is increasingly used in modern digital communications, yet comprehensive tools for automated voice analysis and characterization are still lacking. To address this gap, we developed VANPY (Voice Analysis in Python), an open-source, end-to-end framework for automated pre-processing, feature extraction, and classification of voice data, built for speaker characterization. The framework is designed with extensibility in mind, allowing easy integration of new components and adaptation to various voice analysis applications. It currently incorporates over fifteen voice analysis components, including music/speech separation, voice activity detection, speaker embedding, vocal feature extraction, and various classification models. Four of VANPY's components were developed in-house and integrated into the framework to extend its speaker characterization capabilities: gender classification, emotion classification, age regression, and height regression. The models demonstrate robust performance across various datasets, although they do not surpass state-of-the-art performance. As a proof of concept, we apply the framework to a use case: analyzing character voices from the movie "Pulp Fiction." The results illustrate the framework's capability to extract multiple speaker characteristics, including gender, age, height, emotion type, and emotion intensity measured across three dimensions: arousal, dominance, and valence.
