Inductive Linguistic Reasoning with Large Language Models

Raghav Ramji, Keshav Ramji

TL;DR

This work tackles the challenge of linguistic reasoning in extremely low-resource languages by framing it as inductive cross-lingual learning through linguistics puzzles. It introduces a two-stage analogical prompting framework where a strong model first identifies language families and generates cross-lingual exemplars, then a second model uses both seed and generated exemplars to solve test puzzles. The approach yields consistent improvements for frontier models (e.g., GPT-4o and Llama-3.1-405B-Instruct) and shows that weak-to-strong and inference-time exemplar distillation can further boost performance, with generalization to the LINGOLY dataset. Findings indicate that the ability to deduce and apply rules from diverse exemplars and knowledge of language families are key drivers of linguistic reasoning, offering a promising direction for multilingual reasoning research and robust test-time adaptation.

Abstract

Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with the provided target-language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of language grammar similarities, boosting the performance of GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches. These gains are attributable to the analogical demonstrations, both when self-generated and when produced by weaker multilingual models. Furthermore, we demonstrate that our method generalizes to other tasks present in Linguistics Olympiad competitions, achieving sizable improvements across all problem types and difficulty levels included in the LINGOLY dataset with GPT-4o. We also report several findings about phenomena that drive linguistic reasoning performance, suggesting that such puzzles are a valuable benchmark for new reasoning methods.
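The two-stage procedure can be pictured as two chained prompts. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Model` type (a plain callable from prompt string to completion string), the function names, and the prompt wording are all hypothetical, and the exemplar strings in the demo are placeholders rather than real language data.

```python
from typing import Callable, List, Tuple

# Hypothetical types: an exemplar is (source phrase, English translation),
# and a model is any callable mapping a prompt string to a completion string.
Exemplar = Tuple[str, str]
Model = Callable[[str], str]


def format_exemplars(exemplars: List[Exemplar]) -> str:
    """Render exemplars one per line as 'source -> translation'."""
    return "\n".join(f"{src} -> {tgt}" for src, tgt in exemplars)


def build_stage1_prompt(language: str, seed: List[Exemplar]) -> str:
    """Stage 1: ask the generator to infer the family and write analogical exemplars."""
    return (
        f"Here are translation exemplars for {language}:\n"
        f"{format_exemplars(seed)}\n\n"
        "Identify the language family, name related higher-resource languages "
        "you know, and write analogous translation exemplars in those languages."
    )


def build_stage2_prompt(
    language: str, seed: List[Exemplar], analogical: str, test_phrase: str
) -> str:
    """Stage 2: give the deducer both exemplar sets plus the test puzzle."""
    return (
        f"Exemplars from related languages:\n{analogical}\n\n"
        f"Exemplars for {language}:\n{format_exemplars(seed)}\n\n"
        f"Translate to English: {test_phrase}"
    )


def solve_puzzle(
    m1: Model, m2: Model, language: str, seed: List[Exemplar], test_phrase: str
) -> str:
    """Chain the two stages; passing the same callable as m1 and m2
    corresponds to the self-generated setting."""
    analogical = m1(build_stage1_prompt(language, seed))
    return m2(build_stage2_prompt(language, seed, analogical, test_phrase))


if __name__ == "__main__":
    # Stub models stand in for real LLM calls; all strings are placeholders.
    generator = lambda prompt: "Family: <family>\n<related phrase> -> <translation>"
    deducer = lambda prompt: "<predicted translation>"
    print(solve_puzzle(generator, deducer, "LangX", [("aaa", "the dog")], "aaa ccc"))
```

Keeping the generator and deducer as separate arguments makes it straightforward to swap in a weaker generator with a stronger deducer, or to reuse one model for both roles.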

Paper Structure

This paper contains 63 sections, 4 figures, and 16 tables.

Figures (4)

  • Figure 1: An illustration of our 2-stage analogical prompting approach, translating a phrase in Montenegrin to English. While prior works would solely provide exemplars translating between the source language and English and perform in-context learning, our method seeks diverse exemplars. Model $M_1$ first identifies the language family (Slavic) and higher-resource languages in the family which the model has knowledge of (Croatian), then produces exemplars in those languages. Finally, both the original and generated set of exemplars are passed with the test puzzle to model $M_2$ to perform the translation. $M_1 = M_2$ yields the self-generated analogical reasoning setting.
  • Figure 2: Panel (a) compares the best baseline (in Table 1) with the best 2-stage analogical reasoning result (in Table 2), for our two frontier models as the deducer. We find that analogical prompting improves GPT-4o by 8.1% and Llama-3.1-405B-Instruct by 5.9%. Panel (b) compares self-generated analogical reasoning methods, with prompt-determined language families ("inferred families") and human-annotated language family labels ("oracle families").
  • Figure 3: Two-Stage Analogical Prompting (Ours) Results with GPT-4o on LINGOLY. The size of each bubble corresponds to the number of subquestions of that type present in the dataset.
  • Figure 4: Baseline Results with GPT-4o on LINGOLY. The size of each bubble corresponds to the number of subquestions of that type present in the dataset.