CRAFT: Extracting and Tuning Cultural Instructions from the Wild
Bin Wang, Geyu Lin, Zhengyuan Liu, Chengwei Wei, Nancy F. Chen
TL;DR
The paper addresses gaps in the cultural reasoning of LLMs, particularly for underrepresented regions. It proposes CRAFT, a pipeline that mines culturally rich instruction data from a massive unlabeled English corpus via keyword filtering and self-instruction prompts, followed by LoRA-based instruction fine-tuning. Across Singapore, the Philippines, and the United States, CRAFT yields improvements of up to 6% on region-focused evaluations while preserving general knowledge as measured by MMLU, and context-dependent answers outperform context-free ones. By releasing both the model and the dataset, the work offers a scalable path to better regional cultural reasoning without expensive multilingual pre-training.
Abstract
Large language models (LLMs) have rapidly evolved into the foundation of various natural language processing (NLP) applications. Despite their wide range of use cases, their understanding of culturally related concepts and their cultural reasoning remain limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction generation. By integrating the extracted data with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, which in turn improves its reasoning. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving a performance improvement of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
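To make the keyword-filtering step named in the TL;DR concrete, the following is a minimal sketch of selecting region-relevant passages from an unstructured corpus. It is an illustration under stated assumptions, not the authors' implementation: the keyword lists, the `REGION_KEYWORDS` mapping, and the function names `build_pattern` and `filter_by_region` are all hypothetical, and the real pipeline may use different keywords and matching rules.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical keyword lists for illustration only; the keywords actually
# used by CRAFT are not given in this section.
REGION_KEYWORDS: Dict[str, List[str]] = {
    "singapore": ["singapore", "hawker", "hdb"],
    "philippines": ["philippines", "filipino", "jeepney"],
}


def build_pattern(keywords: Iterable[str]) -> re.Pattern:
    """Compile a case-insensitive, whole-word pattern matching any keyword."""
    escaped = (re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_by_region(texts: Iterable[str], region: str) -> List[str]:
    """Keep only the texts that mention at least one keyword for the region."""
    pattern = build_pattern(REGION_KEYWORDS[region])
    return [t for t in texts if pattern.search(t)]


docs = [
    "Hawker centres are a staple of daily life in Singapore.",
    "The jeepney is a common form of public transport in Manila.",
    "Gradient descent minimizes a loss function.",
]

singapore_docs = filter_by_region(docs, "singapore")
philippines_docs = filter_by_region(docs, "philippines")
```

In a pipeline of the kind the abstract describes, passages that survive such a filter would then be passed to a self-instruction prompt that generates culturally grounded question-answer pairs. Whole-word matching is a design choice made here to reduce false positives from substrings; it also means inflected forms are matched only if listed explicitly.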
