Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages
Tsan Tsai Chan, Xin Tong, Thi Thu Uyen Hoang, Barbare Tepnadze, Wojciech Stempniak
TL;DR
The paper tackles faithfulness gaps in multilingual LLMs for resource-constrained languages by proposing rescoring with lightweight auxiliary models, specifically monolingual BERT variants trained from scratch on small datasets. It shows that 4-layer BERTs can distinguish faithful from unfaithful summaries with high accuracy (mean 88.33% across Vietnamese, Polish and Georgian) and generalise to additional tasks, suggesting potential for multi-purpose rescoring. The study further analyses how morphological complexity interacts with regularisation, model depth and training objectives, finding that shallow architectures often generalise well and that regularisation can aid morphologically complex languages. These results support typologically aware, computationally efficient rescoring as a practical strategy for reducing unfaithfulness in low-resource settings and for guiding future pipeline design. Two limitations remain: rescoring was not applied to actual LLM outputs, and the set of languages and scripts is narrow. Both point to broader language coverage and longer-context evaluation as future work.
Abstract
Multilingual large language models (LLMs) are known to generate unfaithful output more frequently in resource-constrained languages (Guerreiro et al., 2023; arXiv:2303.16104), potentially because these typologically diverse languages are underrepresented in their training data. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. As proof of the feasibility of this approach, we show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data, without fine-tuning, can identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in morphological complexity: Vietnamese, Polish and Georgian. The same hyperparameter combination also generalises well to three other tasks, suggesting applications for rescoring beyond improving faithfulness. To inform typologically aware model selection, we also investigate how morphological complexity interacts with regularisation, model depth and training objectives. Our results demonstrate that morphologically complex languages are more likely to benefit from dropout, while across languages downstream performance is enhanced most by shallow architectures and by training with the standard BERT objectives.
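The rescoring idea proposed in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `rescore` and `toy_overlap_scorer`, the linear interpolation with `weight`, and the token-overlap scorer are all assumptions made for illustration. In practice the auxiliary scorer would be a small monolingual model such as the 4-layer BERTs described above; the overlap heuristic here is only a stand-in so that the example runs without any model.

```python
from typing import Callable, List, Tuple


def rescore(
    candidates: List[Tuple[str, float]],
    aux_score: Callable[[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (text, llm_score) pairs with an auxiliary scorer.

    The combined score is a linear interpolation (an assumption of this
    sketch): (1 - weight) * llm_score + weight * aux_score(text).
    Returns the candidates sorted from best to worst combined score.
    """
    rescored = [
        (text, (1 - weight) * llm_score + weight * aux_score(text))
        for text, llm_score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_overlap_scorer(source: str) -> Callable[[str], float]:
    """Hypothetical stand-in for a small auxiliary model.

    Scores a candidate by the fraction of its tokens that also occur
    in the source text. A real setup would query a lightweight model.
    """
    source_tokens = set(source.lower().split())

    def score(candidate: str) -> float:
        tokens = candidate.lower().split()
        if not tokens:
            return 0.0
        return sum(t in source_tokens for t in tokens) / len(tokens)

    return score


# Usage: the larger model slightly prefers the first summary, but the
# auxiliary scorer penalises its unsupported tokens ("city", "june").
source_text = "the river flooded the village in march"
llm_candidates = [
    ("the river flooded the city in june", 0.6),
    ("the river flooded the village", 0.5),
]
ranked = rescore(llm_candidates, toy_overlap_scorer(source_text))
best_summary = ranked[0][0]
```

Keeping the auxiliary scorer behind a plain callable means a different lightweight model can be swapped in per language without touching the ranking logic, which fits the typologically aware model selection the abstract argues for.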
