Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data

Yang Liu; Armstrong Foundjem; Xingfang Wu; Heng Li; Foutse Khomh

TL;DR

This study tackles the robustness of LLMs for code tasks under input perturbations by formalizing a black-box threat model and implementing robustness testing across character-, word-, and sentence-level perturbations. It investigates robustness-focused fine-tuning using SafeCoder instruction tuning, fine-tuning 33 variants of each base model and evaluating them on both perturbed and unperturbed test sets with Pass@1 and Relative Degradation (RD) as core metrics. The results show that perturbation-aware fine-tuning significantly improves robustness (RD reductions from around 78% to the single-digit percent range for many models) at the cost of a modest drop in clean Pass@1 (roughly 1–3 percentage points on average), with character- and mixed-perturbation strategies delivering the strongest gains. These findings offer practical design guidance for deploying robust LLM4Code systems, emphasizing a balanced mix of perturbations and moderate data scaling to maximize resilience without sacrificing too much performance.
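The two metrics named above can be illustrated with a minimal Python sketch. The summary does not give the formula for Relative Degradation, so the definition used here (the relative drop in Pass@1 between the clean and perturbed test sets) is an assumption, and the per-task outcomes are made-up values for illustration, not results from the paper.

```python
from typing import Sequence


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of tasks whose single generated solution passes all its tests."""
    if not passed:
        raise ValueError("need at least one task result")
    return sum(passed) / len(passed)


def relative_degradation(clean: float, perturbed: float) -> float:
    """ASSUMED definition: relative drop in Pass@1 caused by perturbation."""
    if clean <= 0:
        raise ValueError("clean Pass@1 must be positive")
    return (clean - perturbed) / clean


# Hypothetical per-task outcomes (True = generated code passed the unit tests).
clean_results = [True, True, True, False, True]
perturbed_results = [True, False, True, False, True]

clean_score = pass_at_1(clean_results)
perturbed_score = pass_at_1(perturbed_results)
rd = relative_degradation(clean_score, perturbed_score)
print(f"Pass@1={clean_score:.2f}, Pass@1_perturbed={perturbed_score:.2f}, RD={rd:.2%}")
```

Under this assumed definition, a lower RD means the model loses less of its clean performance when prompts are perturbed, which matches how the metric is read in the figure captions.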

Abstract

Context: In the fast-paced evolution of software development, Large Language Models (LLMs) have become indispensable tools for tasks such as code generation, completion, analysis, and bug fixing. Ensuring the robustness of these models against potential vulnerabilities from handling diverse inputs is critical, as variations in input can lead to incorrect or insecure code outputs. Objective: This work aims to improve the robustness of LLMs for coding-related tasks against potential adversarial inputs. Specifically, we investigate how fine-tuning LLMs with perturbed datasets impacts their robustness against input perturbations. Method: We systematically evaluated LLM robustness by fine-tuning models using datasets perturbed at character-level, word-level, and sentence-level, comparing results against base models and models fine-tuned on unperturbed datasets. Results: Fine-tuning LLMs with perturbed datasets significantly improves model robustness (RD usually drops around 4% - 6%), especially for models with relatively weak robustness. However, this fine-tuning process typically results in a slight performance decrease (pass@1 usually drops around 1% - 3%) compared to fine-tuning with unperturbed datasets, although occasional performance improvements are observed. Conclusion & Implications: Fine-tuning LLMs for coding tasks with perturbed data effectively enhances their robustness at the cost of a minor performance reduction, emphasizing the importance of balancing the robustness and performance of LLMs for coding applications.
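To make the perturbation levels in the Method more concrete, the following is a rough Python sketch of a character-level and a word-level perturbation applied to a task prompt. It is not the authors' perturbation tooling: the function names, the adjacent-letter swap, the swap rate, and the tiny synonym table are all assumptions chosen for illustration. Sentence-level perturbation is omitted because it would typically require a paraphrasing step that is not described here.

```python
import random

# Hypothetical synonym table; a real pipeline would use a lexical resource or a model.
SYNONYMS = {"return": "give back", "list": "sequence", "largest": "biggest"}


def perturb_chars(prompt: str, rate: float, rng: random.Random) -> str:
    """Character-level noise: swap adjacent letters with probability `rate`."""
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def perturb_words(prompt: str) -> str:
    """Word-level perturbation: replace known words with an assumed synonym."""
    return " ".join(SYNONYMS.get(word.lower(), word) for word in prompt.split(" "))


rng = random.Random(0)  # fixed seed so the perturbed prompt is reproducible
prompt = "Return the largest number in the list"
char_variant = perturb_chars(prompt, rate=0.1, rng=rng)
word_variant = perturb_words(prompt)
print(char_variant)
print(word_variant)
```

Both functions change only the natural-language phrasing of the prompt, so the underlying task stays the same, which is the property the robustness evaluation relies on.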

Paper Structure (46 sections, 7 figures, 16 tables)

This paper contains 46 sections, 7 figures, 16 tables.

Background
Threat Model
Robustness Testing
Fine-tuning
Related Works
Pre-trained Large Language Models for Code
Adversarial Attacks and Robustness of LLM4Code
Fine-tuning LLMs for Code
Comparative Positioning
Methodology
Framework Overview
Three categories of models.
Research Questions (RQs)
Evaluation Metrics
Fine-tuning with SafeCoder instruction tuning
...and 31 more sections

Figures (7)

Figure 1: Overall framework for robustness-oriented fine-tuning of LLM4Code models and for evaluating the robustness of the fine-tuned models. The framework begins with one unperturbed dataset (${T}_r$) and 32 perturbed datasets ($P_{T_r}$). Six baseline models ($M_{b_{1...6}}$) are fine-tuned separately on these datasets, producing 198 fine-tuned models in total. All models (six base and 198 fine-tuned) are then evaluated on the unperturbed test sets $T_s$ (HumanEval with 161 tasks and MBPP with 427 tasks) and on their perturbed variants $P_{T_s}$.
Figure 2: Test dataset construction before and after perturbation. Panel (a) shows an unperturbed task specification, while Panel (b) illustrates the same task with modified prompts (e.g., synonym substitutions, tense shifts, or character-level noise). Perturbations alter natural language phrasing but preserve the original problem semantics for robustness evaluation.
Figure 3: Comparison of base and fine-tuned models on HumanEval. Fine-tuning improves performance overall. Unperturbed fine-tuned models perform best on unperturbed test sets (a) but degrade under perturbation, whereas perturbation-aware fine-tuned models achieve higher robustness on perturbed test sets (b).
Figure 4: Relative Degradation (RD) comparison. Perturbation-aware fine-tuned models achieve the best robustness, unperturbed fine-tuned models are moderate, and base models are the least robust.
Figure 5: Performance and robustness of fine-tuned models with varying proportions of mix_all_level perturbed data, evaluated on the perturbed HumanEval test set. Higher Pass@1 indicates better performance, while lower RD indicates greater robustness. Pass@1 is measured on the unperturbed dataset and Pass@1_perturbed on the perturbed datasets.
...and 2 more figures
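The model counts in the Figure 1 caption follow from a simple grid: every base model is fine-tuned once on every training dataset. A short Python sketch of that arithmetic, with placeholder model and dataset names that are assumptions rather than the paper's identifiers:

```python
from itertools import product

# Placeholder names; the actual model and dataset identifiers are not given here.
base_models = [f"M_b{i}" for i in range(1, 7)]              # six base models
datasets = ["T_r"] + [f"P_T_r_{j}" for j in range(1, 33)]    # 1 unperturbed + 32 perturbed

# One fine-tuning run per (base model, training dataset) pair.
fine_tuning_runs = list(product(base_models, datasets))
evaluated_models = len(base_models) + len(fine_tuning_runs)

print(len(datasets), len(fine_tuning_runs), evaluated_models)  # prints: 33 198 204
```

This gives 33 fine-tuned variants per base model, 198 fine-tuned models in total, and 204 evaluated models once the six base models are included.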
