From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning

Omer Nacar; Deema Alquffari; Saleh Alsharideh; Adeem AlOtaibi; Abdulaziz Alabdulkarim; Leen Alhazmi; Nada Alomar; Wareef Alzubaidi; Nada Alsultan; Ahmed Alrabghi; Demah Alhoshan; Rana Alsayyari; Hamed Alruwaili; Albaraa Jaafar; Khaled Alusmani; Abdulaziz Alsohimy; Munirah Alsubaie; Shahd Aldukhayil; Arwa Alali; Yazeed BinShihah; Razan Alsulaymi; Nourah Alhumaid; Razan Abdulsalam; Reem Alamoudi; Mohammed Alkhalifa

Abstract

Function-calling language models are essential for agentic AI systems that translate natural language into executable structured actions, yet existing models exhibit severe structural instability when applied to Arabic. We present AISA-AR-FunctionCall, a production-oriented Arabic function-calling framework built on a 270M-parameter FunctionGemma backbone and trained through systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter supervised fine-tuning. On a held-out test set, fine-tuning reduces parse failures from 87% to below 1%, improves function name accuracy by more than eightfold, and substantially enhances argument alignment across dialects and domains. Error analysis reveals a transition from structural collapse to semantic misalignment, suggesting that serialization stability and decision-level reasoning are separable challenges. We further explore a reasoning-augmented LoRA variant that introduces explicit intermediate reasoning prior to tool invocation. All datasets and models are publicly released under the AISA framework.
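
The two headline metrics in the abstract, parse failure rate and function name accuracy, can be illustrated with a minimal sketch. It assumes each model output has already been normalized to a JSON object with `name` and `arguments` keys; the released models emit the FunctionGemma control-token format, which would need its own parser. The helper names `parse_call` and `evaluate` are hypothetical and are not taken from the paper.

```python
import json


def parse_call(text):
    """Return (name, arguments) for a well-formed call, or None on failure.

    Assumption: a call is a JSON object {"name": str, "arguments": dict}.
    """
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    name = obj.get("name")
    args = obj.get("arguments")
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    return name, args


def evaluate(predictions, references):
    """Compute parse failure rate and function name accuracy.

    predictions: raw model output strings.
    references: (function_name, arguments) tuples, one per prediction.
    An unparseable output counts as a parse failure and as a name miss.
    """
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    parse_failures = 0
    name_hits = 0
    for raw, (ref_name, _ref_args) in zip(predictions, references):
        parsed = parse_call(raw)
        if parsed is None:
            parse_failures += 1
            continue
        if parsed[0] == ref_name:
            name_hits += 1
    total = len(references)
    return {
        "parse_failure_rate": parse_failures / total,
        "function_name_accuracy": name_hits / total,
    }
```

Scoring both metrics over the same denominator means that a model which rarely produces parseable output is also penalized on name accuracy, which fits the abstract's observation that errors shift from structural collapse to semantic misalignment once serialization is stable.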

Paper Structure (18 sections, 12 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Methodology
Model Foundation
Dataset Auditing and Structural Repair
Enum Compliance Correction
Enum Normalization and Tool Pruning
Prompt Length Reduction via Tool Sampling
Chat Template Construction
Dataset Splitting and Training Configuration
Full-Parameter Fine-Tuning Protocol
Reasoning-Augmented Fine-Tuning (Exploratory Variant)
Experiments and Results
Failure Mode Analysis
Qualitative Error Analysis
...and 3 more sections

Figures (5)

Figure 1: End-to-end transformation pipeline for AISA-AR-FunctionCall. The process includes structural auditing, schema repair, tool optimization, stochastic tool sampling, chat serialization using the FunctionGemma format, and stratified train/validation/test splitting.
Figure 2: Example of serialized training instance using the FunctionGemma control-token format.
Figure 3: Core structured performance comparison between the baseline and the fully fine-tuned model. Metrics include function selection accuracy, full tool-call match, argument alignment (F1 and exact match), and format validity.
Figure 4: Structural stability comparison between the baseline and the fully fine-tuned model. Metrics include parse failure rate and hallucination rate.
Figure 5: Function Name Accuracy by Domain for the fully fine-tuned AISA-AR-FunctionCall-FT model. Highly structured transactional domains (Utilities, Travel, Islamic Services, Weather) achieve strong performance, while procedurally complex domains such as Government Services remain more challenging.
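
The Figure 1 caption lists stochastic tool sampling as one pipeline stage, and the outline ties it to prompt length reduction. The exact procedure is not given on this page, so the following is only a sketch of one plausible reading: keep the tool that the reference call uses and fill the remaining slots with randomly drawn distractors. The function name `sample_tools`, the `max_tools` parameter, and the tool-dictionary layout are all assumptions made for illustration.

```python
import random


def sample_tools(tools, gold_name, max_tools, rng):
    """Return at most max_tools tool specs that always include the gold tool.

    tools: list of dicts, each with a unique "name" key (assumed layout).
    gold_name: name of the tool the reference call invokes.
    rng: a random.Random instance, passed in for reproducibility.
    """
    gold = [t for t in tools if t["name"] == gold_name]
    if not gold:
        raise ValueError(f"gold tool {gold_name!r} not in tool list")
    distractors = [t for t in tools if t["name"] != gold_name]
    # Reserve one slot for the gold tool, then fill the rest at random.
    k = min(max(max_tools - 1, 0), len(distractors))
    chosen = gold[:1] + rng.sample(distractors, k)
    # Shuffle so the gold tool's position carries no signal.
    rng.shuffle(chosen)
    return chosen
```

Shuffling after sampling is a design choice of this sketch: it keeps the model from learning that the correct tool always appears first in the prompt.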
