Distributed In-Context Learning under Non-IID Among Clients

Siqi Liang, Sumyeong Ahn, Jiayu Zhou

TL;DR

This work tackles in-context learning (ICL) when training data are distributed across non-IID clients. It introduces a budget allocator that, for each test query, assigns every client its own budget of in-context examples (ICEs) under a limited total ICE budget. The server trains the allocator on a proxy dataset, which enables query-dependent collaboration among clients and optional privacy-preserving paraphrasing. Empirical results across seven datasets show consistent improvements over strong baselines, including when access to all data is simulated, validating the method's effectiveness for distributed non-IID ICL. The approach offers a practical path to efficient, privacy-conscious ICL in real-world, data-partitioned environments with heterogeneous client distributions.

Abstract

Large language models (LLMs) have proven effective on many complex natural language reasoning tasks, yet adapting them efficiently to new or unfamiliar tasks remains a key challenge. In-context learning (ICL) offers a promising solution for few-shot adaptation: it retrieves a set of data points relevant to a query, called in-context examples (ICEs), from a training dataset and provides them as context during inference. Most existing studies assume a centralized training dataset, yet many real-world datasets may be distributed among multiple clients, and remote data retrieval can incur costs. When the client data are not independent and identically distributed (non-IID), retrieving a proper set of ICEs for a test query from the clients is especially challenging. In this paper, we first show that in this setting test queries have different preferences among clients because of non-IIDness, and that equal contribution from all clients often leads to suboptimal performance. We then introduce a novel approach to the distributed non-IID ICL problem under a data usage budget. The principle is that each client's contribution (budget) should be set according to the preference of each query for that client. Our approach allocates a budget to each client in a data-driven manner, tailored to each test query. Through extensive empirical studies on diverse datasets, our framework demonstrates superior performance relative to competing baselines.

Paper Structure (17 sections, 4 equations, 8 figures, 3 tables, 3 algorithms)

Figures (8)

  • Figure 1: Problem overview. When datasets are distributed among clients in a non-IID manner, building a good context becomes difficult (left). Assigning appropriate budgets to leverage per-client expertise produces a better context (right).
  • Figure 2: Overview of the pipeline. First, the budget allocator assigns a budget to each client based on the question. Each client then retrieves its relevant samples and sends them back to the server. The server infers the answer from a prompt composed of the concatenated context examples and the query.
  • Figure 3: Non-IID experimental results. Centralized performance is comparable to the IID case, whereas non-IID scenarios exhibit a significant decline in performance. This highlights the importance of addressing non-IIDness.
  • Figure 4: t-SNE analysis of each client across two datasets. The top and bottom rows depict the oracle budgets of client 1 and client 2, respectively. Each plot shows that the budgets form clustered subgroups and can therefore be separated by training a simple classifier.
  • Figure 5: Overview of the budget allocator: We train a budget allocator on top of the frozen feature extractor $\mathcal{E}$, which inherits from the retriever. During inference, when a test query $x_q$ is provided, this module determines the quantized budget levels for each client and allocates them accordingly.
  • ...and 3 more figures
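
The pipeline described in the Figure 2 and Figure 5 captions (allocate a per-client budget for the query, let each client retrieve that many examples, concatenate them with the query) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: `embed` is a toy bag-of-words stand-in for the frozen feature extractor, `client_profiles` are hypothetical text descriptions that replace the trained allocator's learned preferences, and `allocate_budgets` uses a simple proportional rule with quantization step `step` rather than a trained classifier.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy bag-of-words embedding (stand-in for the frozen feature extractor)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter-based embeddings."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def allocate_budgets(preferences, total_budget, step=1):
    """Hypothetical allocator: split total_budget in proportion to per-client
    preference scores, rounded down to multiples of `step` (quantized levels).
    Any leftover budget is handed out in order of decreasing preference."""
    total = sum(preferences)
    if total == 0:  # no preference signal: fall back to equal weights
        preferences = [1.0] * len(preferences)
        total = float(len(preferences))
    budgets = [int(total_budget * p / total) // step * step for p in preferences]
    order = sorted(range(len(preferences)), key=lambda i: -preferences[i])
    i = 0
    while sum(budgets) + step <= total_budget:
        budgets[order[i % len(order)]] += step
        i += 1
    return budgets


def client_retrieve(client_data, query, k):
    """Runs on a client: return its k (question, answer) pairs closest to the query."""
    q = embed(query)
    ranked = sorted(client_data, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Server side: concatenate the retrieved examples and append the query."""
    context = "".join(f"Q: {x}\nA: {y}\n" for x, y in examples)
    return f"{context}Q: {query}\nA:"


def build_context(query, clients, client_profiles, total_budget):
    """End-to-end sketch: allocate budgets, collect examples, build the prompt."""
    q = embed(query)
    prefs = [cosine(q, embed(p)) for p in client_profiles]
    budgets = allocate_budgets(prefs, total_budget)
    examples = []
    for data, k in zip(clients, budgets):
        examples.extend(client_retrieve(data, query, k))
    return budgets, build_prompt(query, examples)


# Two non-IID toy clients: one holds arithmetic examples, the other geography.
clients = [
    [("what is 2 plus 3", "5"), ("what is 4 plus 1", "5"), ("what is 7 minus 2", "5")],
    [("capital of france", "Paris"), ("capital of japan", "Tokyo")],
]
profiles = ["what is plus minus arithmetic", "capital of country geography"]
budgets, prompt = build_context("what is 6 plus 2", clients, profiles, total_budget=2)
print(budgets)
print(prompt)
```

In this toy setup the arithmetic query shares no tokens with the geography profile, so the whole budget goes to the first client. With equal budgets, half of the context would be spent on unrelated geography examples, which is the failure mode the Figure 1 caption describes.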