IASC: Interactive Agentic System for ConLangs

Chihiro Taguchi, Richard Sproat

TL;DR

The paper presents IASC, a modular, agentic framework that uses LLMs to guide the construction of ConLangs across phonology, morphosyntax, lexicon, orthography, and handbook generation, enabling both fun language creation and systematic probing of LLM linguistic knowledge. It introduces a multi-stage pipeline with iterative prompting for phonotactics, story-based data for morphosyntax, and rigorous evaluation metrics (TER, SER, MFER, MSER, Lem F1) to assess morphosyntactic transformations across typologically diverse target features. Empirical results show LLMs vary in performance by language and feature, with stronger results for typologically common patterns (SVO/SVO-like orders) and weaker performance on morphosyntactic configurations that diverge from the training data, especially for analytic languages. The work highlights potential applications for aiding low-resource languages through guided morphosyntactic transformations while acknowledging limitations in morphology representation, evaluation complexity, and orthography generation, and it suggests directions for improving AI-assisted language design and translation for under-resourced languages.

Abstract

We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language: not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC
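
The feedback-driven refinement that the abstract describes for the phonology stage can be pictured as a small loop. The sketch below is illustrative only and is not taken from the IASC repository: `refine`, `toy_propose`, and `toy_critique` are hypothetical names, and the two toy functions are simple stand-ins for what would be LLM calls in a real system.

```python
from typing import Callable, Optional, Tuple


def refine(
    propose: Callable[[Optional[str], Optional[str]], str],
    critique: Callable[[str], Tuple[bool, str]],
    max_steps: int = 5,
) -> str:
    """Propose an output repeatedly, feeding back commentary on the previous attempt."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    attempt: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_steps):
        # Each new proposal sees the previous attempt and the commentary on it.
        attempt = propose(attempt, feedback)
        accepted, feedback = critique(attempt)
        if accepted:
            break
    return attempt


def toy_propose(previous: Optional[str], feedback: Optional[str]) -> str:
    # Hypothetical stand-in for an LLM call: start with open syllables,
    # then add a coda consonant when the commentary asks for one.
    if previous is None:
        return "CV"
    if feedback and "coda" in feedback:
        return previous + "C"
    return previous


def toy_critique(attempt: str) -> Tuple[bool, str]:
    # Hypothetical stand-in for the commentary step: accept only
    # syllable templates that end in a vowel followed by a consonant.
    if attempt.endswith("VC"):
        return True, "ok"
    return False, "allow a coda consonant"


result = refine(toy_propose, toy_critique)
print(result)
```

Passing the proposer and the critic in as plain callables keeps the loop independent of any particular model or prompt format, which is one way to read the modular design that the abstract describes.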

Paper Structure

This paper contains 54 sections, 5 equations, 1 figure, 16 tables.

Figures (1)

  • Figure 1: MSERs across LLMs and feature sets. We exclude GPT-4o-mini's results due to its failure to follow the instructions and its poor overall performance.