Can't say cant? Measuring and Reasoning of Dark Jargons in Large Language Models

Xu Ji, Jianyi Zhang, Ziyin Zhou, Zhangchi Zhao, Qianqian Qiao, Kaiying Han, Md Imran Hossen, Xiali Hei

TL;DR

This paper addresses cant, or dark jargon, as a security challenge for large language models (LLMs). It introduces CantCounter, a four-stage evaluation framework (Fine-Tuning, Co-Tuning, Data-Diffusion, Data-Analysis), and two domain-specific datasets (Cant and Scene) to assess LLM recognition and reasoning across five domains: politics, drugs, racism, weapons, and LGBT. The study finds that cant prompts can bypass the content filters of contemporary LLMs, including ChatGPT; that recognition accuracy depends on question type, learning setup, and prompt clues; and that updated models accept cant queries at higher rates, with refusal behavior varying by domain. The datasets and code are public, offering a practical platform for safety testing, domain-specific cant understanding, and improvement of content-filtering mechanisms in open-world dialogue systems.

Abstract

Ensuring the resilience of Large Language Models (LLMs) against malicious exploitation is paramount, with recent focus on mitigating offensive responses. Yet, the understanding of cant or dark jargon remains unexplored. This paper introduces a domain-specific Cant dataset and CantCounter evaluation framework, employing Fine-Tuning, Co-Tuning, Data-Diffusion, and Data-Analysis stages. Experiments reveal LLMs, including ChatGPT, are susceptible to cant bypassing filters, with varying recognition accuracy influenced by question types, setups, and prompt clues. Updated models exhibit higher acceptance rates for cant queries. Moreover, LLM reactions differ across domains, e.g., reluctance to engage in racism versus LGBT topics. These findings underscore LLMs' understanding of cant and reflect training data characteristics and vendor approaches to sensitive topics. Additionally, we assess LLMs' ability to demonstrate reasoning capabilities. Access to our datasets and code is available at https://github.com/cistineup/CantCounter.

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, 2 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Construction of the Cant dataset: collecting and summarizing security-related data, linking related cants into an information network, and establishing the dataset through data classification and categorization. The dataset encompasses various domain-related entities and their corresponding cants.
  • Figure 2: The pipeline of CantCounter.
  • Figure 3: The overall structure and process of Co-Tuning.
  • Figure 4: Schematic diagram of Data-Diffusion.
  • Figure 5: The vertical axis shows the number of correct answers under each of the four tips, out of a total of 404. (A) and (E) stand out in Multiple-choice, being the correct answer and "I don't know" respectively. After carefully studying ChatGPT-3.5's interpretation of option (E), we find that when the context is ambiguous or the entities in the implicit context are rare, ChatGPT-3.5's accuracy drops significantly because it tends to prefer option (E).
  • ...and 1 more figure