Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance

Rachith Aiyappa, Shruthi Senthilmani, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

TL;DR

Zero-shot stance detection with FlanT5-XXL can match or outperform state-of-the-art benchmarks, including fine-tuned models.

Abstract

We investigate the performance of LLM-based zero-shot stance detection on tweets. Using FlanT5-XXL, an instruction-tuned open-source LLM, with the SemEval 2016 Tasks 6A, 6B, and P-Stance datasets, we study the performance and its variations under different prompts and decoding strategies, as well as the potential biases of the model. We show that the zero-shot approach can match or outperform state-of-the-art benchmarks, including fine-tuned models. We provide various insights into its performance, including its sensitivity to instructions and prompts, to decoding strategies, to the perplexity of the prompts, and to negations and oppositions present in prompts. Finally, we ensure that the LLM has not been trained on test datasets, and identify a positivity bias which may partially explain the performance differences across decoding strategies.

Paper Structure

This paper contains 46 sections, 8 equations, 11 figures, and 28 tables.

Figures (11)

  • Figure 1: FlanT5-XXL is capable of outperforming state-of-the-art baselines of stance detection in a zero-shot setting, matching the performance of fine-tuned LLMs in SemEval 2016 Task 6A and beating them in Task 6B. The $F_{avg}$ scores of FlanT5-XXL on (a) Task 6A and (b) Task 6B of SemEval 2016 are shown in comparison against some of the best-performing models. Each label on the $x$-axis corresponds to a prompt (see Tab. \ref{tab:prompts}) and each point on a given prompt ID corresponds to an instruction (see Appendix \ref{app:instrutions}). The results of three decoding strategies---greedy, PMI, and AfT---are also shown.
  • Figure 2: FlanT5-XXL is capable of outperforming state-of-the-art baselines in stance detection in a zero-shot setting in the P-Stance dataset. The $F_{avg}$ scores of FlanT5-XXL on different targets in P-Stance are shown in comparison against some of the best-performing models. Each label on the $x$-axis corresponds to a prompt (see Tab. \ref{tab:prompts}) and each point on a given prompt ID corresponds to an instruction (see Appendix \ref{app:instrutions}). The results of two decoding strategies---greedy and PMI---are also shown.
  • Figure 3: Correlation between prompt perplexity, per prompt ID, per instruction, and $F_{avg}$ scores (from greedy) across targets of SemEval 2016 Task 6. (a,c,e) Prompts with the $\langle tweet \rangle$ object. (b,d,f) Prompts without the $\langle tweet \rangle$ object---context-free prompt. (a,b) Task A, (c,d) Task B, (e,f) Task A+B. Correlation coefficients are indicated by $r$ and p-values by $p$.
  • Figure 4: Correlation between prompt perplexity, per prompt ID, per instruction, and $F_{avg}$ scores (from greedy) for each target in the P-Stance dataset. (a,d) Donald Trump, (b,e) Joe Biden, (c,f) Bernie Sanders. (a,b,c) Prompts with the $\langle tweet \rangle$ object. (d,e,f) Prompts without the $\langle tweet \rangle$ object---context-free prompt. Correlation coefficients are indicated by $r$ and p-values by $p$.
  • Figure 5: Probability of FlanT5-XXL outputting a label with a positive, negative, or neutral connotation in SemEval 2016 Task 6 in a context-free setting, i.e., a setting where the $\langle tweet \rangle$ item in Tab. \ref{tab:prompts} is not fed into the model during inference. We see that the model is biased towards labels with positive connotations regardless of prompt and instruction in both (a) Task 6A and (b) Task 6B. Each point represents a (prompt, instruction, target) tuple (see Tab. \ref{tab:prompts}).
  • ...and 6 more figures
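
The captions above report $F_{avg}$ without defining it. The sketch below assumes the usual SemEval 2016 Task 6 convention, in which $F_{avg}$ is the mean of the F1 scores of the "favor" and "against" classes; the neutral class has no F1 term of its own and only enters through misclassifications. The function names and the label strings `FAVOR`, `AGAINST`, and `NONE` are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score of a single class, computed from aligned gold and predicted labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_avg(
    gold: Sequence[str],
    pred: Sequence[str],
    favor: str = "FAVOR",      # assumed label string
    against: str = "AGAINST",  # assumed label string
) -> float:
    """Mean of the favor and against F1 scores (assumed SemEval 2016 Task 6 convention)."""
    return (f1_for_label(gold, pred, favor) + f1_for_label(gold, pred, against)) / 2


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    gold = ["FAVOR", "AGAINST", "NONE", "FAVOR"]
    pred = ["FAVOR", "AGAINST", "FAVOR", "AGAINST"]
    print(round(f_avg(gold, pred), 4))
```

Under this convention, a model that over-predicts a positive label such as "favor" loses precision on that class, which is one way a positivity bias could show up in the reported scores.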