CRAFT: Extracting and Tuning Cultural Instructions from the Wild

Bin Wang, Geyu Lin, Zhengyuan Liu, Chengwei Wei, Nancy F. Chen

TL;DR

The paper addresses gaps in cultural reasoning in LLMs, particularly for underrepresented regions. It proposes CRAFT, a pipeline that mines culturally rich instruction data from a massive unlabeled English corpus via keyword filtering and self-instruction prompts, followed by LoRA-based instruction fine-tuning. Across Singapore, the Philippines, and the United States, CRAFT yields up to 6% improvements on region-focused evaluations while preserving general knowledge as measured by MMLU, with context-dependent answers outperforming context-free ones. By releasing both the model and the dataset, the work offers a scalable path to enhance regional cultural reasoning without requiring expensive multilingual pre-training.
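The keyword-filtering and self-instruction steps mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the keyword list, the `min_hits` threshold, and the prompt wording are all hypothetical, chosen only to show how a keyword filter might select candidate passages from an unlabeled corpus before an LLM is prompted to write instructions from them.

```python
import re
from typing import Iterable, List


def compile_keyword_pattern(keywords: Iterable[str]) -> re.Pattern:
    """Build one case-insensitive, whole-word regex from a keyword list."""
    escaped = [re.escape(k) for k in keywords]
    if not escaped:
        raise ValueError("at least one keyword is required")
    # Longer keywords first so multi-word terms win over their prefixes.
    escaped.sort(key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_documents(docs: Iterable[str], keywords: Iterable[str],
                     min_hits: int = 1) -> List[str]:
    """Keep documents containing at least `min_hits` keyword matches.

    `min_hits` is an assumed knob, not a parameter from the paper.
    """
    pattern = compile_keyword_pattern(keywords)
    return [d for d in docs if len(pattern.findall(d)) >= min_hits]


def build_self_instruct_prompt(passage: str, region: str) -> str:
    """Assemble a hypothetical self-instruction prompt for one passage."""
    return (
        f"The passage below relates to the culture of {region}.\n"
        "Write one question about a cultural concept it mentions, "
        "then answer the question using the passage.\n\n"
        f"Passage: {passage}"
    )


# Illustrative data only; the real corpus and keywords are not shown here.
docs = [
    "Hawker centres are common in Singapore.",
    "The weather is nice today.",
]
selected = filter_documents(docs, ["Singapore", "hawker"])
prompts = [build_self_instruct_prompt(d, "Singapore") for d in selected]
```

The selected passages and the instructions generated from them would then feed the LoRA-based fine-tuning stage, which is omitted from this sketch.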

Abstract

Large language models (LLMs) have rapidly evolved as the foundation of various natural language processing (NLP) applications. Despite their wide use cases, their understanding of culturally related concepts and reasoning remains limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvements of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.

Paper Structure (6 sections, 2 figures, 2 tables)

This paper contains 6 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The CRAFT method involves creating instruction datasets tailored for culturally rich instruction by processing extensive unstructured data with large language models (LLMs). These specialized cultural instructions are then employed to improve the ability of LLMs to reason within cultural contexts through instruction fine-tuning.
  • Figure 2: Performance on the SG-Eval and MMLU datasets for the CRAFT method with different ratios of Singapore cultural instruction data.