Adaptations of AI models for querying the LandMatrix database in natural language

Fatiha Ait Kbir, Jérémy Bourgoin, Rémy Decoupes, Marie Gradeler, Roberto Interdonato

TL;DR

This work addresses the difficulty of accessing Land Matrix data in natural language (NL) by adapting Text-to-SQL techniques to generate executable REST and GraphQL queries from NL questions. It systematically compares three open-weight LLMs (Llama3-8B, Mixtral-8x7B-instruct, Codestral-22B) under three optimizations: Prompt Engineering, Retrieval Augmented Generation (RAG), and a two-agent query generation pipeline. Codestral-22B with the agentic setup delivers the strongest overall performance for both REST and GraphQL, particularly in syntax validity and data accuracy, and GraphQL is generally easier for the models to handle than REST. The authors provide an open, reproducible benchmark and a demo to support future development of NL-to-API querying for land-data databases, with practical implications for policy and action in low- and middle-income countries (LMICs).

Abstract

The Land Matrix initiative (https://landmatrix.org) and its global observatory aim to provide reliable data on large-scale land acquisitions to inform debates and actions in sectors such as agriculture, extraction, or energy in low- and middle-income countries. Although these data are recognized in the academic world, they remain underutilized in public policy, mainly because accessing and exploiting them is complex and requires technical expertise and a good understanding of the database schema. The objective of this work is to simplify access to data from different database systems; the proposed methods are evaluated using data from the Land Matrix. This work compares several Large Language Models (LLMs), as well as combinations of LLM adaptations (Prompt Engineering, RAG, Agents), for querying different database systems (through GraphQL and REST queries). The experiments are reproducible, and a demonstration is available online: https://github.com/tetis-nlp/landmatrix-graphql-python.

Paper Structure

This paper contains 16 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Detailed prompt with three sections: the role given to the LLM (instruction), an example of a real natural-language question (question), and a context combining few-shot examples with the database schema (context).
  • Figure 2: Pipeline of the Agents optimization: a second language model extracts the filters requested by the user, and its outputs are added to the previous prompt.
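
The two captions above can be read together as a small pipeline: a prompt built from three sections, optionally enriched with filters extracted by a second model. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the section headers, the extraction prompt wording, and the toy schema are all hypothetical; the `LLM` type simply stands for any callable that maps a prompt string to generated text.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumption: a model is any function from prompt text to generated text.
LLM = Callable[[str], str]


def build_prompt(
    instruction: str,
    question: str,
    schema: str,
    examples: Sequence[Tuple[str, str]],
    filters: Optional[str] = None,
) -> str:
    """Assemble the three sections: instruction, question, context."""
    # Few-shot examples are (natural-language question, expected query) pairs.
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    context = f"Schema:\n{schema}\n\nExamples:\n{shots}"
    if filters:
        # Agents variant: append the filters found by the extractor model.
        context += f"\n\nFilters extracted from the question:\n{filters}"
    return (
        f"### Instruction\n{instruction}\n\n"
        f"### Question\n{question}\n\n"
        f"### Context\n{context}"
    )


def generate_query(
    question: str,
    extractor: LLM,
    generator: LLM,
    instruction: str,
    schema: str,
    examples: Sequence[Tuple[str, str]],
) -> str:
    """Two-step generation: extract filters first, then generate the query."""
    filters = extractor(f"List the filters requested in this question:\n{question}")
    prompt = build_prompt(instruction, question, schema, examples, filters)
    return generator(prompt)


# Stub models so the sketch runs without any LLM backend.
def fake_extractor(prompt: str) -> str:
    return "country = Senegal"


def fake_generator(prompt: str) -> str:
    return "QUERY_WITH_FILTERS" if "country = Senegal" in prompt else "QUERY"


# Hypothetical schema and example, for illustration only.
TOY_SCHEMA = "type Deal { id: Int, country: String }"
TOY_EXAMPLES = [("List all deals", "{ deals { id } }")]

result = generate_query(
    "Which deals are located in Senegal?",
    fake_extractor,
    fake_generator,
    "You translate questions into GraphQL queries.",
    TOY_SCHEMA,
    TOY_EXAMPLES,
)
```

Keeping the extractor and the generator as separate callables means the same prompt builder serves both the plain prompt-engineering setup (no filters) and the two-agent setup, which makes the two configurations easy to compare.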