Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation

Lena Libon, Meghana Bhange, Rushabh Solanki, Elliot Creager, Ulrich Aïvodji

TL;DR

This work addresses data portability and user autonomy for LLMs that rely on chain-of-thought (CoT) reasoning by extending Conscious Data Contribution (CDC) into a multi-community knowledge distillation framework. It systematically studies how heterogeneity across communities, reasoning granularity, and coalition incentives affect distillation performance, using LLaMA-3 70B as the teacher and T5-base as the student across four diverse datasets. Key findings show that CoT-driven distillation benefits reasoning-heavy tasks under utilitarian coalitions, with gains modulated by task-format compatibility and community diversity; the effects of granularity are nuanced, and strategic coalition design is essential for fair and robust participation. The work highlights practical implications for participatory AI governance and outlines future work on countermeasures, real-world data, and incentive mechanisms to sustain CDC-based collaborations.

Abstract

The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that "reason" using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users' personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.

Paper Structure

This paper contains 22 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Results for RQ1 and RQ2.
  • Figure 2: Results for RQ3 and the greedy perspective.
  • Figure 3: Impact of varying CSQA and STQA proportions on utilitarian accuracy (left) and altruistic accuracy (right) under full CoT setting (level 6).
  • Figure 4: Prompt used for reasoning summarization with system and user instructions.