Genshin: General Shield for Natural Language Processing with Large Language Models

Xiao Peng, Tao Liu, Ying Wang

TL;DR

Genshin addresses the opacity and fragility of language-model pipelines with a three-stage framework: an LLM serves as a one-time denoising plug-in, a median-sized language model analyzes the recovered text, and an interpretable model explains the prediction. The design aims for high predictive accuracy with improved transparency, countering adversarial textual attacks through a denoise-then-analyze pipeline and SHAP-based interpretations. Experiments on sentiment analysis and spam detection show strong recovery, with an average recovery of $81.6\%$ at a $15\%$ disturbance ratio. Ablation studies highlight the importance of prompt design and the potential trade-off between attack strength and recovery. Limitations include the variability of LLM attacker strategies and of domain-specific tasks, which motivates future work on controllability, retrieval-augmented defenses, and multi-modal extensions.

Abstract

Large language models (LLMs) such as ChatGPT, Gemini, and LLaMA have recently attracted wide attention, demonstrating considerable advances and strong generalizability across many domains. However, LLMs are an even larger black box than earlier models, and only a few approaches exist for interpreting them. This uncertainty and opacity restrict their use in high-stakes settings such as financial fraud and phishing detection. Current approaches rely mainly on traditional text classification combined with post-hoc interpretability algorithms. They are vulnerable to attackers who craft varied adversarial samples to break the system's defense, which forces users to trade efficiency against robustness. To address this issue, we propose a novel cascading framework called Genshin (General Shield for Natural Language Processing with Large Language Models), which uses LLMs as defensive one-time plug-ins. Unlike most applications of LLMs, which transform text into something new or structured, Genshin uses LLMs to recover text to its original state. Genshin aims to combine the generalizability of the LLM, the discriminative power of the median model, and the interpretability of the simple model. Our experiments on sentiment analysis and spam detection expose serious weaknesses in current median models and show strong results for the LLM's recovery ability, demonstrating that Genshin is both effective and efficient. Our ablation study yields several intriguing observations. Using the LLM defender, a tool from the 4th paradigm of NLP, we reproduce the 15% optimal mask rate that BERT established in the 3rd paradigm. In addition, when the LLM is used as an adversarial tool, attackers can execute effective attacks that are nearly semantically lossless.
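The cascade described in the abstract (recover the text, classify it, then explain the decision) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: `char_attack`, `toy_defender`, `toy_classifier`, `toy_explainer`, and `genshin_pipeline` are hypothetical names, the vocabulary lookup stands in for an LLM call, the keyword rule stands in for the median model, and the 0/1 token weights stand in for SHAP values.

```python
import random
from typing import Callable, List, Tuple

# Toy resources, assumed purely for illustration.
SPAM_WORDS = {"free", "prize"}
VOCAB = ["free", "prize", "now", "claim", "your"]


def char_attack(text: str, ratio: float, seed: int = 0) -> str:
    """Hypothetical character-level attacker: mask about `ratio` of the letters with '*'."""
    rng = random.Random(seed)
    chars = list(text)
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    n_disturb = round(len(letter_positions) * ratio)
    for i in rng.sample(letter_positions, n_disturb):
        chars[i] = "*"
    return "".join(chars)


def toy_defender(text: str) -> str:
    """Stand-in for the LLM defender: restore a masked word when exactly one vocabulary word fits."""
    def recover(word: str) -> str:
        if "*" not in word:
            return word
        matches = [
            v for v in VOCAB
            if len(v) == len(word) and all(w in ("*", c) for w, c in zip(word, v))
        ]
        return matches[0] if len(matches) == 1 else word
    return " ".join(recover(w) for w in text.split())


def toy_classifier(text: str) -> str:
    """Stand-in for the median model: a keyword rule that is easily fooled by masking."""
    return "spam" if any(t in SPAM_WORDS for t in text.split()) else "ham"


def toy_explainer(text: str) -> List[Tuple[str, float]]:
    """Stand-in for the interpretable model: 1.0 for tokens that drive the 'spam' label."""
    return [(t, 1.0 if t in SPAM_WORDS else 0.0) for t in text.split()]


def genshin_pipeline(
    text: str,
    defender: Callable[[str], str],
    classifier: Callable[[str], str],
    explainer: Callable[[str], List[Tuple[str, float]]],
) -> Tuple[str, str, List[Tuple[str, float]]]:
    """Denoise first, then classify and explain the recovered text."""
    recovered = defender(text)
    return recovered, classifier(recovered), explainer(recovered)


if __name__ == "__main__":
    attacked = "fr*e pr*ze n*w"
    print(toy_classifier(attacked))  # classifier alone, on disturbed text
    print(genshin_pipeline(attacked, toy_defender, toy_classifier, toy_explainer))
```

In this toy setting the classifier alone misses the disturbed spam message, while the same classifier behind the defender sees the recovered text. The defender is passed in as a plain callable so that an actual LLM call could replace the vocabulary lookup without changing the rest of the cascade.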

Paper Structure (29 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Example of Genshin recovering deliberately altered spam texts, then classifying and interpreting them.
  • Figure 2: The workflow of the Genshin framework.
  • Figure 3: Ablation study results across disturbance ratios, attackers, and datasets.
  • Figure 4: Interpretation results with the char attacker.
  • Figure 5: Interpretation results with the LLM attacker.