Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration
Ran Xu, Wenqi Shi, Yuchen Zhuang, Yue Yu, Joyce C. Ho, Haoyu Wang, Carl Yang
TL;DR
Collab-RAG tackles complex, multi-hop QA in retrieval-augmented generation by pairing a white-box small language model (SLM), which acts as a query decomposer, with a black-box large language model (LLM), which acts as a context reader. The approach trains the SLM through iterative preference optimization using feedback from an affordable LLM (GPT-4o-mini), avoiding costly distillation from frontier models. Empirical results across five datasets show consistent improvements over both black-box-only and SLM-based baselines, with notable efficiency: a 3B SLM can surpass a frozen 32B LLM in decomposition, while an 8B decomposer yields strong gains overall. The method offers a scalable, generalizable pathway to enhance complex QA in RAG, with potential extensions to online reinforcement learning.
Abstract
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus improving retrieval accuracy and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available at https://github.com/ritaranx/Collab-RAG/.
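The collaboration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (answer_with_decomposition, build_preference_pair), the callable interfaces for the decomposer, retriever, and reader, and the use of exact match against a gold answer as the feedback signal are all assumptions made for illustration. The first function shows the decompose, retrieve, and read flow; the second shows one plausible way to turn reader feedback into a (chosen, rejected) pair of decompositions, the shape that preference-optimization methods typically consume.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
Decomposer = Callable[[str], List[str]]       # question -> sub-questions (the SLM)
Retriever = Callable[[str], List[str]]        # query -> retrieved passages
Reader = Callable[[str, List[str]], str]      # (question, context) -> answer (the LLM)


def answer_with_decomposition(question: str, decompose: Decomposer,
                              retrieve: Retriever, read: Reader) -> str:
    """Answer a complex question by retrieving and reading per sub-question."""
    evidence: List[str] = []   # all passages retrieved so far
    notes: List[str] = []      # intermediate sub-question answers
    for sub_q in decompose(question):
        passages = retrieve(sub_q)
        # The reader sees this sub-question's passages plus earlier answers.
        sub_answer = read(sub_q, passages + notes)
        evidence.extend(passages)
        notes.append(f"{sub_q} -> {sub_answer}")
    # Final pass: answer the original question from everything gathered.
    return read(question, evidence + notes)


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a simple exact-match check."""
    return " ".join(text.lower().split())


def build_preference_pair(question: str, gold_answer: str,
                          candidates: List[List[str]],
                          retrieve: Retriever, read: Reader
                          ) -> Optional[Tuple[List[str], List[str]]]:
    """Score candidate decompositions by the reader's final answer.

    Returns (chosen, rejected), or None when all candidates score the same
    and therefore carry no preference signal.
    """
    scored = []
    for cand in candidates:
        # Treat the fixed candidate as the decomposer's output.
        pred = answer_with_decomposition(
            question, lambda _q, c=cand: c, retrieve, read)
        # Assumed feedback signal: exact match with the gold answer.
        score = float(_normalize(pred) == _normalize(gold_answer))
        scored.append((score, cand))
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None
    return best[1], worst[1]
```

In this sketch the candidates would be several decompositions sampled from the SLM for the same question, and the retriever and reader would wrap a search index and a black-box LLM call. Repeating the collect-pairs-then-train loop with the updated SLM is one way to read "iterative" here, again as an assumption rather than a description of the released code.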
