CAVE: Controllable Authorship Verification Explanations

Sahana Ramnath, Kartik Pandey, Elizabeth Boschee, Xiang Ren

TL;DR

This work tackles the need for offline, explainable authorship verification (AV) by introducing CAVE, an offline model that produces uniform, linguistically grounded free-text explanations. CAVE leverages Prompt-CAVE to generate silver-standard training data from a large oracle and couples it with a novel consistency metric, Cons-R-L, to filter the data before fine-tuning a small model (Llama-3-8B) with LoRA. Across IMDb62, Blog-Auth, and FanFiction, CAVE delivers high-quality rationales validated by automatic metrics and human evaluation, with competitive AV accuracy against strong baselines, including GPT-4-Turbo variants. The results demonstrate the practicality of on-premises, explainable AV systems, while also noting limitations such as rationale hallucinations and dataset biases and outlining future directions like reward-based learning and feature-weighting strategies for robust explanations.

Abstract

Authorship Verification (AV) (do two documents have the same author?) is essential in many real-life applications. AV is often used in privacy-sensitive domains that require an offline, proprietary model deployed on premises, making publicly served online models (APIs) a suboptimal choice. Current offline AV models, however, have lower downstream utility due to limited accuracy (e.g., traditional stylometry-based AV systems) and a lack of accessible post-hoc explanations. In this work, we address the above challenges by developing a trained, offline model, CAVE (Controllable Authorship Verification Explanations). CAVE generates free-text AV explanations that are controlled to be (1) accessible (uniform structure that can be decomposed into sub-explanations grounded in relevant linguistic features), and (2) easily verified for explanation-label consistency. We generate silver-standard training data grounded in the desirable linguistic features via a prompt-based method, Prompt-CAVE. We then filter the data based on rationale-label consistency using a novel metric, Cons-R-L. Finally, we fine-tune a small, offline model (Llama-3-8B) with this data to create our model CAVE. Results on three difficult AV datasets show that CAVE generates high-quality explanations (as measured by automatic and human evaluation) as well as competitive task accuracy.
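The filtering step described above can be illustrated with a small sketch. The paper's actual Cons-R-L definition is not given on this page, so everything below is an assumption for illustration: we suppose each rationale decomposes into per-feature verdicts ("same" or "different" author) and keep only silver examples whose verdicts sufficiently agree with the final label.

```python
# Hypothetical sketch of rationale-label consistency filtering.
# NOTE: the real Cons-R-L metric is not specified here; we assume each
# rationale yields per-linguistic-feature verdicts ("same"/"different")
# and measure their agreement with the example's final label.

def cons_r_l(verdicts, label):
    """Fraction of per-feature verdicts that agree with the final label."""
    if not verdicts:
        return 0.0
    return sum(v == label for v in verdicts) / len(verdicts)

def filter_silver_data(examples, threshold=0.75):
    """Keep silver-standard examples whose rationale is consistent with its label."""
    return [ex for ex in examples
            if cons_r_l(ex["verdicts"], ex["label"]) >= threshold]

# Toy silver data: one consistent example, one inconsistent one.
silver = [
    {"verdicts": ["same", "same", "different", "same"], "label": "same"},
    {"verdicts": ["different", "same"], "label": "same"},
]
kept = filter_silver_data(silver)
print(len(kept))  # → 1 (only the first example passes the threshold)
```

The surviving examples would then serve as supervised fine-tuning data for the small offline model.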

Paper Structure

This paper contains 36 sections, 3 equations, 2 figures, 19 tables.

Figures (2)

  • Figure 1: CAVE generates uniformly structured free-text explanations, grounded in relevant linguistic features, that can be automatically verified for consistency.
  • Figure 2: Pipeline to train CAVE: we obtain silver training data from GPT-4-Turbo using Prompt-CAVE and filter it according to Cons-R-L and our output format. We then supervised-finetune Llama-3-8B on the filtered data.