IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models

Sayem Mohammad Imtiaz; Astha Singh; Fraol Batole; Hridesh Rajan

TL;DR

This paper addresses data-driven errors in large language models by introducing IRepair, an intent-aware repair framework that uses dynamic model slicing to target the most error-prone regions of a model. A gradient-based sensitivity measure identifies the transformer blocks most relevant to the repair (the intent), and training optimizes a dual objective: a repair loss on refined data plus a KL constraint that preserves general performance. Because the slice is re-selected dynamically, the repair adapts as errors shift during training. In detoxification experiments on GPT-2 and GPT-Neo models, IRepair achieves significantly greater toxicity reduction (up to about 88.7% with KL) with substantially smaller perplexity (PPL) increases than strong baselines such as DPO and DAPT+KL. The results show that errors are highly concentrated in a small subset of layers and that dynamic, threshold-free selection is crucial for robust repair, offering a practical and efficient route to deploying safer LLMs in real-world settings.
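The loop described above (score blocks by gradient sensitivity, repair only the selected slice, and penalize drift with a KL term) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The L2 gradient norm as the sensitivity score, the rank-based top-`k` selection, the KL direction (original model against repaired model), the `kl_weight` parameter, and all function names are assumptions made for illustration.

```python
import math

def block_sensitivity(block_grads):
    """Assumed sensitivity measure: L2 norm of each block's gradient."""
    return [math.sqrt(sum(g * g for g in grads)) for grads in block_grads]

def select_slice(block_grads, k=1):
    """Indices of the k most sensitive blocks (rank-based, so no score threshold)."""
    scores = block_sensitivity(block_grads)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def dual_objective(repair_loss, ref_probs, new_probs, kl_weight=1.0):
    """Repair loss plus a weighted KL penalty against the original model's outputs."""
    return repair_loss + kl_weight * kl_divergence(ref_probs, new_probs)

def masked_sgd_step(params, block_grads, selected, lr=0.1):
    """Update only the selected blocks; every other block is left unchanged."""
    chosen = set(selected)
    return [
        [w - lr * g for w, g in zip(block, grads)] if i in chosen else list(block)
        for i, (block, grads) in enumerate(zip(params, block_grads))
    ]

# Toy example with made-up numbers: three "blocks" with two parameters each.
params = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.3]]
selected = select_slice(grads, k=1)  # re-run this every step for dynamic slicing
params = masked_sgd_step(params, grads, selected)
loss = dual_objective(0.5, [0.7, 0.3], [0.6, 0.4], kl_weight=0.1)
```

Recomputing `select_slice` at every step is what makes the slicing dynamic in this sketch: if the largest gradients move to a different block as training proceeds, the update follows them.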

Abstract

Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model's most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model's overall performance by altering a smaller portion of the model. We evaluated our technique on three models from the GPT-2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
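To read the "773% more error density" figure: "X% more" means the larger quantity is 1 + X/100 times the smaller, so the top 20% of layers carry about 8.73 times the error density of the rest. The short sketch below checks that arithmetic and shows one way such a comparison could be computed from per-layer scores. How the paper defines error density is not stated here, so using the mean per-layer score is an assumption, and the function names and toy scores are hypothetical.

```python
def ratio_from_percent_more(percent_more):
    """'X% more' means the larger quantity is (1 + X/100) times the smaller."""
    return 1.0 + percent_more / 100.0

def top_fraction_excess(scores, fraction=0.2):
    """Percent by which the mean score of the top `fraction` of layers
    exceeds the mean score of the remaining layers (assumed density proxy)."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top, rest = ranked[:n_top], ranked[n_top:]
    mean_top = sum(top) / len(top)
    mean_rest = sum(rest) / len(rest)
    return (mean_top / mean_rest - 1.0) * 100.0

ratio = ratio_from_percent_more(773)      # figure quoted in the abstract
toy_scores = [9.0, 1.0, 1.0, 1.0, 1.0]    # made-up per-layer scores
excess = top_fraction_excess(toy_scores)  # top layer vs. the other four
```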
