Specific language impairment (SLI) detection pipeline from transcriptions of spontaneous narratives
Santiago Arena, Antonio Quintero-Rincón
TL;DR
This work addresses the detection of Specific Language Impairment (SLI) from transcripts of spontaneous narratives, using 1,163 observations combined from three datasets. It proposes a three-stage cascading pipeline: dimensionality reduction with Random Forest and Spearman correlation (from 43 to 11 variables), selection of predictive variables via logistic regression (6 final features), and final classification with a $14$-NN model. The method reached an accuracy of $97.13\%$ on the test sample, with high sensitivity and specificity, and showed that reducing the number of variables improves performance. The solution is simple, computationally inexpensive, and replicable, highlighting the potential of NLP and ML for early SLI detection from quantitative metrics of linguistic performance, with software publicly available for implementation.
Abstract
Specific Language Impairment (SLI) is a communication disorder that can affect both comprehension and expression. This study focuses on effectively detecting SLI in children using transcripts of spontaneous narratives from 1063 interviews. A three-stage cascading pipeline is proposed. In the first stage, feature extraction and dimensionality reduction of the data are performed using the Random Forest (RF) and Spearman correlation methods. In the second stage, the most predictive variables from the first stage are estimated using logistic regression. In the last stage, these variables are used by a nearest neighbor classifier to detect SLI in children from transcripts of spontaneous narratives. The results revealed an accuracy of 97.13% in identifying SLI, highlighting aspects such as the length of the responses, the quality of the children's utterances, and the complexity of the language. This new approach, framed in natural language processing, benefits the field of SLI detection by avoiding complex subjective variables and focusing on quantitative metrics directly related to the child's performance.
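The three-stage cascade described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it runs on synthetic data from scikit-learn's `make_classification` rather than the narrative transcripts, and the function names (`stage1_rf_spearman`, `stage2_logistic`, `run_pipeline`), the Spearman threshold `max_rho=0.9`, the use of standardized logistic-regression coefficients to rank variables, and the feature scaling before the nearest neighbor step are all assumptions made for the example. Only the stage order and the counts 43, 11, 6, and k = 14 come from the text.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def stage1_rf_spearman(X, y, n_keep=11, max_rho=0.9, seed=0):
    """Stage 1 (assumed form): rank features by Random Forest importance and
    skip any feature whose |Spearman rho| with an already kept one is >= max_rho."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # most important first
    rho, _ = spearmanr(X)  # pairwise rank-correlation matrix between columns
    kept = []
    for j in order:
        if all(abs(rho[j, k]) < max_rho for k in kept):
            kept.append(int(j))
        if len(kept) == n_keep:
            break
    return kept


def stage2_logistic(X, y, n_keep=6):
    """Stage 2 (assumed form): fit a logistic regression on standardized
    features and keep the n_keep with the largest absolute coefficients."""
    Xs = StandardScaler().fit_transform(X)
    lr = LogisticRegression(max_iter=1000).fit(Xs, y)
    ranked = np.argsort(np.abs(lr.coef_[0]))[::-1]
    return [int(i) for i in ranked[:n_keep]]


def run_pipeline(seed=0):
    # Synthetic stand-in for the 43 transcript-derived variables.
    X, y = make_classification(n_samples=600, n_features=43, n_informative=8,
                               n_redundant=10, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    s1 = stage1_rf_spearman(X_tr, y_tr, seed=seed)      # 43 -> 11 variables
    s2_local = stage2_logistic(X_tr[:, s1], y_tr)       # 11 -> 6 variables
    final = [s1[i] for i in s2_local]                   # original column indices

    # Stage 3: 14-nearest-neighbor classifier on the selected features.
    scaler = StandardScaler().fit(X_tr[:, final])
    knn = KNeighborsClassifier(n_neighbors=14)
    knn.fit(scaler.transform(X_tr[:, final]), y_tr)
    acc = knn.score(scaler.transform(X_te[:, final]), y_te)
    return s1, final, acc


s1, final, acc = run_pipeline()
print(f"stage 1 kept {len(s1)} features, stage 2 kept {len(final)}, "
      f"test accuracy on synthetic data: {acc:.3f}")
```

The selection stages are fitted on the training split only, so the held-out accuracy is not influenced by feature selection on test data. The accuracy printed here reflects the synthetic data and says nothing about the 97.13% reported for the real transcripts.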
