Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens

TL;DR

The paper investigates how to design a single system prompt that yields accurate and robust LLM behavior across multiple languages. It introduces a four-dimensional multilingual evaluation framework (Acc_mean, Acc_var, Consistency, Len_var) and shows, through a large-scale study, that certain prompt components (notably Chain-of-Thought (CoT), emotion, and scenario) are associated with better cross-lingual performance. It further shows that a Sprig-based prompt optimization pipeline can automatically discover prompts that improve all four metrics, and that better prompts foster more structured, consistent reasoning while reducing language-switching. By analyzing over 10 million reasoning units, the work links prompt design to measurable shifts in reasoning patterns and language use. Overall, the study offers a principled framework and empirical evidence that system prompt optimization is a scalable path to reliable LLM behavior across languages and tasks.

Abstract

System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
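
The four metrics named in the TL;DR (Acc_mean, Acc_var, Consistency, Len_var) are not defined in this excerpt. The Python sketch below shows one plausible way to compute them for a single system prompt. It assumes that Acc_mean and Acc_var are the mean and population variance of per-language accuracy, that Consistency is the fraction of questions on which every language yields the same answer, and that Len_var is the population variance of the per-language mean response length. These definitions, the function name `multilingual_metrics`, and the input layout are all illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pvariance


def multilingual_metrics(results):
    """Compute four illustrative cross-lingual metrics for one system prompt.

    `results` maps a language code to a list of (answer, is_correct, length)
    tuples, one per question, with questions in the same order for every
    language. All metric definitions here are assumptions for illustration.
    """
    languages = sorted(results)
    n_questions = len(results[languages[0]])

    # Per-language accuracy: share of correctly answered questions.
    acc = [
        mean(1.0 if ok else 0.0 for _, ok, _ in results[lang])
        for lang in languages
    ]

    # Consistency (assumed): share of questions on which all languages
    # produce the same answer, whether or not that answer is correct.
    agreeing = sum(
        len({results[lang][i][0] for lang in languages}) == 1
        for i in range(n_questions)
    )

    # Mean response length per language, used for the length-variance metric.
    mean_len = [
        mean(length for _, _, length in results[lang]) for lang in languages
    ]

    return {
        "acc_mean": mean(acc),          # higher is better
        "acc_var": pvariance(acc),      # lower is better
        "consistency": agreeing / n_questions,  # higher is better
        "len_var": pvariance(mean_len),  # lower is better
    }


# Toy usage with two languages and two questions (made-up data).
toy = {
    "en": [("A", True, 10), ("B", True, 12)],
    "zh": [("A", True, 14), ("C", False, 16)],
}
metrics = multilingual_metrics(toy)
```

Under these assumptions, a prompt that is accurate and robust across languages has high `acc_mean` and `consistency` and low `acc_var` and `len_var`.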

Paper Structure

This paper contains 47 sections, 21 figures, 10 tables.

Figures (21)

  • Figure 1: Comparison of English-prompt and Self-language prompt settings, aggregated over three models and three benchmarks. English system prompts show slightly worse performance in all four metrics compared to using prompts in the question's native language.
  • Figure 2: The relationship between mean accuracy ($\mathsf{Acc_{mean}}$) and accuracy variance ($\mathsf{Acc_{var}}$), with each point representing one prompt aggregated on all benchmarks. $\mathsf{Acc_{var}}$ exhibits clear model dependency and a slight negative correlation with $\mathsf{Acc_{mean}}$.
  • Figure 3: Relationship between mean accuracy ($\mathsf{Acc_{mean}}$) and $\mathsf{Consistency}$, with each point representing one prompt aggregated on all benchmarks. $\mathsf{Acc_{mean}}$ exhibits a strong positive correlation with $\mathsf{Consistency}$.
  • Figure 4: The regression heatmap illustrating the impact of different system prompt components on multilingual model behavior (* p<0.05, ** p<0.01, *** p<0.001). (1) positively associated components (e.g., Chain-of-Thought, emotion, scenario) enhance accuracy and consistency while reducing performance variance; (2) negatively associated components (e.g., behavioral, role, style) correlate with reduced accuracy and consistency; and (3) neutral components (e.g., good-property, safety) show no significant impact. Results are averaged across benchmarks.
  • Figure 5: System prompts can be effectively optimized to improve multiple metrics in a multilingual setting. Using the off-the-shelf Sprig framework, we observe rapid early gains in $\mathsf{Acc_{mean}}$ and $\mathsf{Consistency}$, while $\mathsf{Acc_{var}}$ and $\mathsf{Len_{var}}$ decrease more gradually.
  • ...and 16 more figures