Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks

Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, Maliheh Izadi

TL;DR

The work tackles the problem of harmful code generation by off-the-shelf LLMs and presents an SE-focused safety framework built around Hammurabi's Code, a taxonomy-driven prompt dataset, and an automatic harmfulness evaluator. It systematically assesses 70 LLMs (both code-specific and general-purpose) and reveals substantial disparities in harmlessness across families and model sizes: larger models are generally, though not universally, less harmful. The study introduces a 26-subcategory taxonomy, a 509-prompt corpus, and a lightweight embedding-based classifier that enables scalable evaluation and replication, accompanied by an open-source framework for exploration. Key findings show that code-specific models do not consistently outperform general-purpose ones, that fine-tuning can degrade safety, and that category-specific prompts (e.g., malware vs. copyright) elicit markedly different safety responses. The results underscore the need for targeted alignment strategies in software engineering tasks and provide a foundation for real-time safety tooling and future research in code-generation safety and explainable safeguards.

Abstract

Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLMs) to assist them with their coding tasks. This makes it crucial to align these tools with human values to prevent malicious misuse. In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain. We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently create a dataset of prompts based on this taxonomy. To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs, covering both open-source and closed-source models as well as general-purpose and code-specific LLMs. Furthermore, we investigate the impact of model size, architecture family, and alignment strategies on their tendency to generate harmful content. The results show significant disparities in the alignment of various LLMs for harmlessness. We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts. Notably, some fine-tuned models perform significantly worse than their base models due to their design choices. On the other hand, we find that larger models tend to be more helpful and are less likely to respond with harmful information. These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area.

Paper Structure

This paper contains 38 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of our approach. We prompt a set of LLMs with handwritten prompts (1) and collect the resulting responses (2). We take a sample of these responses (3). These samples are shown to human annotators (4), who manually evaluate them (5). We embed all the responses (6) and train a classifier on the human annotations (8). Finally, we use the classifier to label the embeddings of the LLM-generated responses (9).
  • Figure 2: Performance comparison of different LLMs in response to potentially harmful prompts. Each row represents a model, sorted in ascending order by the proportion of harmful answers (A). The stacked bars show the distribution of response types: harmful answers (A), warnings (W), refusals (R), and harmless responses (H). The rank of each model is shown on the right.
  • Figure 3: Distribution of model responses across the 'Copyright' category (left) and its subcategories (right). The left panel shows the overall proportion of harmful (A), warning (W), refusal (R), and harmless (H) responses for the entire category, and the right panel shows the proportions for each of the subcategories.
  • Figure 4: Distribution of model responses across the 'Malware' category (left) and its subcategories (right).
  • Figure 5: Distribution of model responses across the 'Unfair/Dangerous' category (left) and its subcategories (right).
  • ...and 2 more figures
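
The pipeline in Figure 1 and the ranking in Figure 2 can be illustrated with a small sketch. It is not the authors' implementation: the embedding model and classifier type are not given here, so a nearest-centroid classifier over cosine similarity stands in for the trained classifier, and the 2-D vectors, the model names `model_x` and `model_y`, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Response types from the figure captions:
# A = harmful answer, W = warning, R = refusal, H = harmless.


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(annotated):
    """Build one centroid per label from (embedding, label) pairs.

    Stand-in for training a classifier on human annotations.
    """
    by_label = defaultdict(list)
    for emb, label in annotated:
        by_label[label].append(emb)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(centroids, emb):
    """Return the label whose centroid is most similar to emb."""
    return max(centroids, key=lambda label: cosine(centroids[label], emb))


def rank_models(centroids, responses):
    """Sort models in ascending order by their share of harmful (A) answers."""
    shares = {}
    for model, embs in responses.items():
        counts = Counter(classify(centroids, e) for e in embs)
        shares[model] = counts["A"] / len(embs)
    return sorted(shares.items(), key=lambda kv: kv[1])


# Hypothetical toy data: embeddings of human-annotated response samples.
ANNOTATED = [
    ([1.0, 0.0], "A"), ([0.9, 0.1], "A"),
    ([0.0, 1.0], "R"), ([0.1, 0.9], "R"),
    ([-1.0, 0.0], "H"),
    ([0.0, -1.0], "W"),
]

# Hypothetical embeddings of unlabeled responses, grouped by model.
RESPONSES = {
    "model_x": [[0.8, 0.2], [0.2, 0.8]],
    "model_y": [[0.1, 1.0], [-0.9, 0.1]],
}

CENTROIDS = train(ANNOTATED)
RANKING = rank_models(CENTROIDS, RESPONSES)
```

In a realistic setting the vectors would come from a sentence-embedding model and the centroid rule would be replaced by whichever classifier validates best against the human annotations. The ranking step stays the same.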