PUCP-Metrix: An Open-source and Comprehensive Toolkit for Linguistic Analysis of Spanish Texts
Javier Alonso Villegas Luis, Marco Antonio Sobrevilla Cabezudo
TL;DR
PUCP-Metrix is an open-source toolkit for linguistic analysis of Spanish texts that provides 182 metrics across lexical, syntactic, semantic, discourse, and psycholinguistic dimensions. Built on spaCy, it supports scalable, interpretable text analysis. It is validated on Automated Readability Assessment and Machine-Generated Text Detection, where it achieves competitive results against established tools and neural baselines. The work also outlines its limitations and avenues for future work, including expanding the discourse metrics and cross-variety adaptation.
Abstract
Linguistic features remain essential for interpretability and for tasks that involve style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source and comprehensive toolkit for linguistic analysis of Spanish texts. PUCP-Metrix includes 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability, enabling fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive and extensible resource for Spanish, supporting diverse NLP applications.
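
To make the notion of an interpretable linguistic metric concrete, the following is a minimal sketch of two simple surface measures: a type-token ratio (a basic lexical diversity measure) and mean sentence length in words. It is not PUCP-Metrix code and does not reflect its API. All function names here are hypothetical, and the sketch assumes plain regular expressions for tokenization and sentence splitting rather than the spaCy pipeline the toolkit is built on.

```python
import re


def tokenize(text):
    # Lowercased word tokens. \w is Unicode-aware in Python 3, so accented
    # letters and "ñ" are kept, while "¿", "¡" and other punctuation are dropped.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Naive split on ., ! and ? (an assumption made for brevity; a real
    # pipeline would use a trained sentence segmenter).
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def type_token_ratio(text):
    # Distinct word forms divided by total word tokens.
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    # Average number of word tokens per sentence.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


sample = "El gato duerme. El perro corre y el gato mira."
print(type_token_ratio(sample))      # 7 distinct forms over 10 tokens -> 0.7
print(mean_sentence_length(sample))  # sentences of 3 and 7 words -> 5.0
```

Each value maps directly to an observable property of the text, which is what makes features of this kind interpretable as inputs to tasks such as readability assessment.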
