Curiosity Over Hype: Modeling Motivation Language to Understand Early Outcomes in a Selective Quantum Track
Daniella Alexandra Crysti Vargas Saldana, Freddy Herrera Cueva
TL;DR
This study investigates whether brief Spanish admission responses encode motivation signals that predict early engagement and performance in QuantumHub Peru's selective quantum pathway. It combines transparent topic modeling with a small multilingual language model to derive intrinsic versus instrumental themes, and links these latent signals to Module 1 and Module 2 outcomes. Descriptively, curiosity-oriented language is associated with better outcomes than technology/career-oriented language. The samples are small and underpowered and grading is heterogeneous, yet LDA topics and EmbeddingGemma-300M clusters both show coherent semantic structure, suggesting that motivation language could inform early mentoring in rigorous STEM pipelines. The work proposes a portable, resource-efficient pipeline suitable for under-resourced contexts and highlights the need for larger, preregistered studies to establish predictive validity and fairness across groups.
Abstract
We study whether latent motivation signals in short Spanish admission responses predict engagement and performance in an early quantum computing pathway run by QuantumHub Peru. We analyze open responses from N=241 applicants and link them to outcomes from two selective modules: Module 1 (M1; secondary; mathematics and computing foundations; n=23) and Module 2 (M2; secondary and early undergraduate; quantum fundamentals; n=36, including M1 continuers). To ensure baseline comparability, the M2 university entrance exam was matched in difficulty to the M1 final. Final grades followed the program's official cohort-specific weightings (attendance/assignments/exam), which we retain to preserve ecological validity. Methodologically, we model text with Latent Dirichlet Allocation (LDA, k=8) and, for robustness, with sentence embeddings from a small multilingual language model, EmbeddingGemma-300M, projected via UMAP and clustered with HDBSCAN. This combination pairs the transparency of bag-of-words topics with the semantic richness of small language model embeddings. Descriptively, applicants whose responses fall in curiosity/learning topics show higher grades and attendance than those in technology/career-oriented topics. Inferential tests are underpowered (e.g., linear R² ≈ 0.03; logistic pseudo-R² ≈ 0.04), so effect-size estimates should be viewed as preliminary rather than confirmatory. Embedding-based clustering yields seven clusters with 11.2% noise and modest agreement with LDA (ARI=0.068; NMI=0.163). Results suggest that brief motivation responses encode promising signals that could support early mentoring in rigorous STEM pipelines, while highlighting the need for larger, preregistered studies.
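To make the agreement statistics concrete, the following is a minimal sketch of how the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) between two label assignments can be computed. It is an illustration under stated assumptions, not the authors' code: the function names and the toy label lists are hypothetical, NMI is normalized by the arithmetic mean of the two entropies, and HDBSCAN noise points (label -1) are treated as one ordinary group, a choice the abstract does not specify.

```python
from collections import Counter
from math import comb, log


def _contingency(labels_a, labels_b):
    """Count co-occurrences of labels from two assignments of the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = Counter(zip(labels_a, labels_b))  # n_ij
    rows = Counter(labels_a)                  # a_i
    cols = Counter(labels_b)                  # b_j
    return pairs, rows, cols, len(labels_a)


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected pair-counting agreement; 1.0 means identical partitions."""
    pairs, rows, cols, n = _contingency(labels_a, labels_b)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # a single item: both partitions are trivially identical
    sum_pairs = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_pairs - expected) / (max_index - expected)


def normalized_mutual_info(labels_a, labels_b):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    pairs, rows, cols, n = _contingency(labels_a, labels_b)
    mi = sum(
        (n_ij / n) * log(n * n_ij / (rows[a] * cols[b]))
        for (a, b), n_ij in pairs.items()
    )
    h_a = -sum((c / n) * log(c / n) for c in rows.values())
    h_b = -sum((c / n) * log(c / n) for c in cols.values())
    mean_entropy = (h_a + h_b) / 2
    if mean_entropy == 0:
        return 1.0  # both partitions are a single cluster
    return mi / mean_entropy


if __name__ == "__main__":
    # Hypothetical toy labels: dominant LDA topic vs. HDBSCAN cluster per response.
    lda_topics = [0, 0, 1, 1, 2, 2, 3, 3]
    embedding_clusters = [0, 0, 0, 1, 1, -1, 2, 2]  # -1 marks HDBSCAN noise
    print("ARI:", adjusted_rand_index(lda_topics, embedding_clusters))
    print("NMI:", normalized_mutual_info(lda_topics, embedding_clusters))
```

Both scores equal 1 for identical partitions, and ARI is close to 0 when agreement is no better than chance. The reported ARI=0.068 and NMI=0.163 therefore indicate that the two methods group responses quite differently at the item level.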
