INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness

Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo

TL;DR

INDICT is a new framework that equips LLMs with Internal Dialogues of Critiques, pairing a safety-driven critic with a helpfulness-driven critic. The critics provide advanced critiques covering both safety and helpfulness, significantly improving the quality of the generated code.

Abstract

Large language models (LLMs) for code are typically trained to align with natural language instructions and to closely follow their intentions and requirements. However, in many practical scenarios, it becomes increasingly challenging for these models to navigate the intricate boundary between helpfulness and safety, especially against highly complex yet potentially malicious instructions. In this work, we introduce INDICT: a new framework that empowers LLMs with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. Each critic analyzes the given task and the corresponding generated response, equipped with external knowledge queried through relevant code snippets and tools such as web search and a code interpreter. We engage the dual critic system in both the code generation stage and the code execution stage, providing preemptive and post-hoc guidance, respectively, to LLMs. We evaluated INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks, using LLMs from 7B to 70B parameters. We observed that our approach can provide an advanced level of critiques covering both safety and helpfulness, significantly improving the quality of the output code ($+10\%$ absolute improvement in all models).
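
The dialogue described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function `indict_round`, the three toy callables, and the `DRAFT` snippet are hypothetical stand-ins, and simple string checks take the place of LLM calls and tool queries. It only illustrates the control flow of two critics taking turns over a draft before the generator revises it.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: each callable sees the task, the current code,
# and the dialogue so far. In the paper these roles are played by LLMs.
Critic = Callable[[str, str, List[str]], str]
Generator = Callable[[str, str, List[str]], str]


def indict_round(task: str, code: str, safety_critic: Critic,
                 helpfulness_critic: Critic, generator: Generator,
                 turns: int = 1) -> Tuple[str, List[str]]:
    """Run one critique round: the critics alternate, then the generator revises."""
    dialogue: List[str] = []
    for _ in range(turns):
        dialogue.append("safety: " + safety_critic(task, code, dialogue))
        dialogue.append("helpfulness: " + helpfulness_critic(task, code, dialogue))
    return generator(task, code, dialogue), dialogue


# Toy stand-ins (assumptions for illustration only).
def toy_safety_critic(task: str, code: str, dialogue: List[str]) -> str:
    if "os.system(" in code:
        return "avoid os.system on untrusted input; use subprocess.run with an argument list"
    return "no issue found"


def toy_helpfulness_critic(task: str, code: str, dialogue: List[str]) -> str:
    if '"""' not in code:
        return "add a docstring describing the function"
    return "no issue found"


def toy_generator(task: str, code: str, dialogue: List[str]) -> str:
    revised = code
    if any("subprocess.run" in turn for turn in dialogue):
        revised = revised.replace("import os", "import subprocess")
        revised = revised.replace('os.system("ping -c 1 " + host)',
                                  'subprocess.run(["ping", "-c", "1", host], check=False)')
    if any("docstring" in turn for turn in dialogue):
        revised = revised.replace("def ping(host):\n",
                                  'def ping(host):\n    """Ping a host once."""\n')
    return revised


DRAFT = 'import os\n\ndef ping(host):\n    os.system("ping -c 1 " + host)\n'

revised, dialogue = indict_round("Write a function that pings a host.", DRAFT,
                                 toy_safety_critic, toy_helpfulness_critic,
                                 toy_generator)
print(revised)
```

In this sketch the same `indict_round` function could be called once before execution (preemptive) and once after, with execution observations appended to the task or dialogue (post-hoc). How the paper wires those two stages together is not specified in this summary.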

Paper Structure

This paper contains 35 sections, 5 equations, 12 figures, and 16 tables.

Figures (12)

  • Figure 1: INDICT (Internal Dialogues of Critiques) enables two different critics to interact with each other autonomously and collaboratively, improving code generation in both security and helpfulness. In this example, INDICT iteratively resolves the security weakness CWE-78 (Improper Neutralization in an OS Command, https://cwe.mitre.org/data/definitions/78.html) and improves the code functionality with relevant supporting modules.
  • Figure 2: INDICT (Internal Dialogues of Critiques) is a framework for generating code that is both safe and helpful. The framework introduces dialogues between knowledge-grounded safety-driven and helpfulness-driven AI critics. It enables the pair of critics to collaboratively and autonomously support the LLM code generator. We apply the critic system for both preemptive and post-hoc types of critic feedback, providing a proactive and extra layer of protection for security-sensitive tasks.
  • Figure 2: We evaluated INDICT with HarmBench against 6 different types of red-teaming optimization methods. We reported the safety measure as the percentage of outputs classified as benign by the given AI evaluator from HarmBench.
  • Figure 3: We define two types of tool-enabled actions the critics can perform: (1) "code search" queries external tools with a generated text query and, optionally, a corresponding code snippet; (2) "code review" uses the execution result of the code snippet (obtained through a code interpreter) as additional input to complement the query. Both action types query tools such as web search, Wikipedia, and OpenAI as the knowledge base.
  • Figure 3: We conducted an ablation analysis of INDICT by removing the proposed dual critic system and/or the external tool enhancement. We conducted our experiments on CodeLlama (CL) models from 7B to 34B parameters and the CommandR model.
  • ...and 7 more figures
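
The two tool-enabled actions in the "code search" and "code review" caption above can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `KnowledgeBase`, `code_search`, `run_snippet`, `code_review`, and `echo_kb` are hypothetical names, the knowledge base is a stub that echoes its query instead of calling web search, Wikipedia, or OpenAI, and the "code interpreter" is a plain Python subprocess, which is not a security sandbox.

```python
import subprocess
import sys
from typing import Callable, Optional

# Hypothetical interface: a knowledge base maps a text query to retrieved text.
KnowledgeBase = Callable[[str], str]


def code_search(query: str, kb: KnowledgeBase, snippet: Optional[str] = None) -> str:
    """Action 1: query the tool with text and, optionally, a code snippet."""
    full_query = query if snippet is None else f"{query}\n\nCode:\n{snippet}"
    return kb(full_query)


def run_snippet(snippet: str, timeout: float = 5.0) -> str:
    """Execute a Python snippet in a child process and return stdout plus stderr.

    Assumption: a real system would isolate this step; a subprocess alone
    does not protect against malicious code.
    """
    proc = subprocess.run([sys.executable, "-c", snippet],
                          capture_output=True, text=True, timeout=timeout)
    return (proc.stdout + proc.stderr).strip()


def code_review(query: str, snippet: str, kb: KnowledgeBase) -> str:
    """Action 2: add the snippet's execution result to the query before retrieval."""
    result = run_snippet(snippet)
    return kb(f"{query}\n\nCode:\n{snippet}\n\nExecution result:\n{result}")


def echo_kb(query: str) -> str:
    """Stub knowledge base that simply echoes the query it received."""
    return f"[retrieved for] {query}"


print(code_search("Is shell=True safe with user input?", echo_kb))
print(code_review("Does this snippet behave as intended?", "print(2 + 2)", echo_kb))
```

The only difference between the two actions in this sketch is the extra execution step: `code_review` grounds the query in observed behavior, while `code_search` relies on the text and the optional snippet alone.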