Can Large Language Models (or Humans) Disentangle Text?

Nicolas Audinet de Pieuchon; Adel Daoud; Connor Thomas Jerzak; Moa Johansson; Richard Johansson

TL;DR

The paper investigates whether LLMs can disentangle a forbidden textual variable, such as sentiment, from raw text while preserving other semantic information. It evaluates two models (Mistral 7B and GPT-4), two prompting strategies (few-shot and prompt chaining), and human baselines on a dataset of 2000 Amazon reviews. Removal is measured with downstream classifiers trained on representations of the processed text. Both LLMs and humans struggle to erase sentiment: residual signals remain detectable. GPT-4 with prompt chaining performs best (≈75.7% sentiment accuracy) while preserving topic information. Mean projection in representation space detaches sentiment more effectively, suggesting a limit to text-level disentanglement. These results raise questions about the robustness and interpretability of disentanglement methods and motivate further research into approaches that truly separate sensitive attributes from content while maintaining semantic integrity.
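As a rough illustration of what removing a variable in representation space can look like, the sketch below implements one common formulation of mean projection: it projects out the direction that separates the two class means. This page does not give the paper's exact procedure, so the formulation, the function names (`class_mean`, `mean_projection`, `mean_gap`), and the toy vectors are all assumptions made for illustration.

```python
from statistics import fmean

def class_mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]

def mean_gap(vectors, labels):
    """Euclidean distance between the positive and negative class means."""
    pos = class_mean([v for v, y in zip(vectors, labels) if y == 1])
    neg = class_mean([v for v, y in zip(vectors, labels) if y == 0])
    return sum((p - n) ** 2 for p, n in zip(pos, neg)) ** 0.5

def mean_projection(vectors, labels):
    """Project every vector onto the subspace orthogonal to the
    difference between the two class means (assumed formulation)."""
    pos = class_mean([v for v, y in zip(vectors, labels) if y == 1])
    neg = class_mean([v for v, y in zip(vectors, labels) if y == 0])
    diff = [p - n for p, n in zip(pos, neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        # The class means already coincide; nothing to remove.
        return [list(v) for v in vectors]
    unit = [d / norm for d in diff]
    projected = []
    for v in vectors:
        coeff = sum(a * u for a, u in zip(v, unit))
        projected.append([a - coeff * u for a, u in zip(v, unit)])
    return projected

# Hypothetical 3-dimensional "embeddings"; 1 = positive, 0 = negative sentiment.
vectors = [[2.0, 1.0, 0.5], [1.5, -1.0, 0.3], [-2.0, 1.2, 0.4], [-1.0, -0.8, 0.6]]
labels = [1, 1, 0, 0]

cleaned = mean_projection(vectors, labels)
print("gap before:", mean_gap(vectors, labels))
print("gap after: ", mean_gap(cleaned, labels))
```

After the projection the two class means coincide up to floating-point error, so the mean difference no longer separates the classes. This operates on vectors rather than on readable text, which is the contrast the summary draws with text-level rewriting.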

Abstract

We investigate the potential of large language models (LLMs) to disentangle text variables, that is, to remove the textual traces of an undesired forbidden variable, in a task sometimes known as text distillation and closely related to the fairness in AI and causal inference literatures. We employ a range of LLM approaches in an attempt to disentangle text by identifying and removing information about a target variable while preserving other relevant signals. We show that in the strong test of removing sentiment, the statistical association between the processed text and sentiment is still detectable to machine learning classifiers post-LLM-disentanglement. Furthermore, we find that human annotators also struggle to disentangle sentiment while preserving other semantic content. This suggests there may be limited separability between concept variables in some text contexts, highlighting limitations of methods relying on text-level transformations and also raising questions about the robustness of disentanglement methods that achieve statistical independence in representation space.
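The detectability test described above can be pictured with a toy probe: train a classifier to predict the original sentiment label, then check its accuracy on original versus rewritten text. The sketch below is a deliberately simple stand-in, not the classifiers or representations used in the paper. The token-count probe, the function names, and every example review are invented for illustration.

```python
from collections import Counter

def train_probe(examples):
    """Give each token a weight: +1 per positive text it occurs in, -1 per negative."""
    weights = Counter()
    for text, label in examples:
        for token in text.lower().split():
            weights[token] += 1 if label == 1 else -1
    return weights

def predict(weights, text):
    """Predict 1 (positive) if the summed token weights exceed zero, else 0."""
    score = sum(weights[token] for token in text.lower().split())
    return 1 if score > 0 else 0

def accuracy(weights, examples):
    """Share of examples whose original sentiment label the probe recovers."""
    hits = sum(predict(weights, text) == label for text, label in examples)
    return hits / len(examples)

# Invented training reviews with sentiment labels (1 = positive, 0 = negative).
train = [
    ("great sturdy blender", 1),
    ("love this quiet blender", 1),
    ("terrible noisy blender", 0),
    ("broke after a week", 0),
]
# Invented held-out reviews, and hypothetical rewrites of them that keep the
# original sentiment label but drop the opinion words.
original_test = [("great quiet blender", 1), ("terrible blender broke", 0)]
rewritten_test = [("blender with two speeds", 1), ("blender with one lid", 0)]

probe = train_probe(train)
print("accuracy on original text: ", accuracy(probe, original_test))
print("accuracy on rewritten text:", accuracy(probe, rewritten_test))
```

With balanced labels, accuracy near 0.5 on the rewritten text would mean the probe can no longer recover sentiment, while accuracy well above 0.5 indicates a residual signal. The abstract reports the latter outcome for LLM-processed text.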

Paper Structure (12 sections, 9 figures, 1 table)

Introduction
Related work
Defining Disentanglement
Method
Dataset
Prompting
Evaluation Design
Results
Implications
Appendix
Disentanglement Example
Prompts

Figures (9)

Figure 1: The experimental setup for measuring the effectiveness of LLMs at removing a target variable from the raw text representation.
Figure 2: Excerpt from the few-shot prompt template. In our tests, [Review here] is replaced with the original text of each review.
Figure 3: Example of an original review from the Amazon dataset.
Figure 4: GPT-4 response for the first stage of prompt chaining with the review from Figure 3.
Figure 5: GPT-4 response for the second stage of prompt chaining with the review from Figure 3 and the first stage response from Figure 4.
...and 4 more figures
