Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
Anni Zou, Zhuosheng Zhang, Hai Zhao, Xiangru Tang
TL;DR
This work introduces GeM-CoT, a generalizable chain-of-thought (CoT) prompting framework for mixed-task scenarios in which the type of each input question is unknown. A Type Matching module routes each question to demonstrations of the corresponding type when a match is found. Otherwise, GeM-CoT falls back to zero-shot CoT and adds the question to a data cache, which is clustered with a density-based method to construct new demonstrations. The approach bridges generalization and performance by combining routing, dynamic demonstration construction, and continual maintenance of the demonstration pool. It is evaluated on 10 reasoning datasets and 23 BBH tasks. Results show that GeM-CoT improves generalization to unseen task types while maintaining or boosting reasoning accuracy, notably in streaming batch settings where more diverse demonstrations can be learned over time. The work offers a practical, training-free solution for robust real-world reasoning with LLMs and highlights the value of diverse demonstrations and adaptive data augmentation.
Abstract
Large language models (LLMs) have shown remarkable reasoning capabilities through chain-of-thought (CoT) prompting, which generates intermediate reasoning chains that serve as the rationale for deriving the answer. However, current CoT methods either employ general prompts such as "Let's think step by step" or rely heavily on pre-defined task-specific demonstrations to achieve strong performance, which leaves a gap between performance and generalization. To bridge this gap, we propose GeM-CoT, a Generalizable CoT prompting mechanism in Mixed-task scenarios where the type of input questions is unknown. GeM-CoT first categorizes the question type and then automatically samples or constructs demonstrations from the corresponding data pool. With this design, GeM-CoT achieves both superior generalization and strong performance on 10 public reasoning tasks and 23 BBH tasks.
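The routing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence encoder, and the names `route`, `demo_pool`, `cache`, and the `threshold` value of 0.5 are all assumptions made for illustration. The density-based clustering that turns cached questions into new demonstrations is omitted here.

```python
import math
from collections import Counter

ZERO_SHOT_TRIGGER = "Let's think step by step."


def embed(text):
    """Toy bag-of-words vector; a hypothetical stand-in for a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(question, demo_pool, cache, threshold=0.5):
    """Build a prompt for `question` and return (prompt, matched_type).

    demo_pool maps a type name to a list of (question, rationale) pairs.
    The threshold is an assumed value, not one taken from the paper.
    """
    q_vec = embed(question)
    best_type, best_score = None, 0.0
    for type_name, demos in demo_pool.items():
        for demo_question, _ in demos:
            score = cosine(q_vec, embed(demo_question))
            if score > best_score:
                best_type, best_score = type_name, score

    if best_type is not None and best_score >= threshold:
        # Match found: few-shot prompt from the matched type's demonstrations.
        shots = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in demo_pool[best_type])
        return f"{shots}\n\nQ: {question}\nA:", best_type

    # No match: fall back to zero-shot CoT and keep the question in the cache
    # so that it can later be clustered into new demonstrations.
    cache.append(question)
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}", None
```

In this sketch, a question that resembles a pooled demonstration receives a few-shot prompt, while an unfamiliar question receives the zero-shot trigger and is stored in `cache` for later demonstration construction.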
