Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic
Saleh Almohaimeed, May Alsofyani, Saad Almohaimeed, Mansour Al Ghanim, Liqiang Wang
TL;DR
This work addresses the lack of Arabic context-dependent text-to-SQL resources by introducing Ar-SParC, a cross-domain dataset of 3,450 question sequences (10,225 questions in total) across 160 databases and 116 domains. It runs 40 experiments with GPT-3.5-turbo and GPT-4.5-turbo using a range of prompt engineering techniques, covering question representations and in-context learning, and proposes the novel GAT corrector, which both detects and corrects SQL errors in one step. The GAT corrector yields consistent, modest gains in execution accuracy (EX) and interaction accuracy (IX): about 1.9% on each metric in zero-shot settings, and 1.72% EX and 0.92% IX with in-context learning (ICL). An ablation study shows that the GAT corrector outperforms the earlier GAT verifier for Arabic. Overall, the paper advances Arabic semantic parsing and highlights the importance of language-specific prompt engineering for improving cross-domain text-to-SQL systems in low-resource languages.
Abstract
In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. It enables users with no prior knowledge of SQL to converse with databases in natural language. However, most of the available datasets and research are in English, with some work in Chinese. To date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, for a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.
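To make the two metrics concrete, here is a minimal sketch. It assumes the usual SParC-style definitions, which the abstract does not spell out: EX is the share of individual questions whose predicted SQL returns the same result as the gold SQL, and IX is the share of question sequences in which every question is answered correctly. The toy `singer` table, the example queries, and the helper names `execution_match` and `score` are hypothetical and are not taken from Ar-SParC or the authors' evaluation code.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to run counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def score(conn, interactions):
    """Compute (EX, IX) for a list of sequences of (predicted_sql, gold_sql)."""
    question_hits = 0
    question_total = 0
    interaction_hits = 0
    for sequence in interactions:
        flags = [execution_match(conn, pred, gold) for pred, gold in sequence]
        question_hits += sum(flags)      # question-level credit (EX)
        question_total += len(flags)
        interaction_hits += all(flags)   # sequence-level credit (IX)
    return question_hits / question_total, interaction_hits / len(interactions)


def build_toy_db():
    """Hypothetical single-table database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Amal", 30), ("Omar", 45), ("Lina", 45)],
    )
    return conn


toy_interactions = [
    [  # sequence 1: both turns return the gold result
        ("SELECT name FROM singer", "SELECT name FROM singer"),
        ("SELECT name FROM singer WHERE age > 40",
         "SELECT name FROM singer WHERE age >= 45"),
    ],
    [  # sequence 2: the second turn returns the wrong rows
        ("SELECT count(*) FROM singer", "SELECT count(*) FROM singer"),
        ("SELECT name FROM singer WHERE age < 40",
         "SELECT name FROM singer WHERE age > 40"),
    ],
]

ex, ix = score(build_toy_db(), toy_interactions)
print(f"EX = {ex:.2f}, IX = {ix:.2f}")  # EX = 0.75, IX = 0.50
```

Under these assumed definitions, a single wrong turn costs the whole sequence its IX credit, which is one plausible reason a gain in EX need not carry over to an equal gain in IX.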
