Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction

Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff

TL;DR

This work investigates whether large language models can replace or reduce human annotation for aspect sentiment quad prediction (ASQP) by evaluating zero- to fifty-shot prompts across five diverse datasets, including the newly introduced FlightABSA. Using Gemma-3-27B and Gemma-3-4B, the study shows that few-shot prompting yields substantial gains: Rest16 ASQP scores rise from the 0-shot to the 50-shot setting under self-consistency, and Rest16 TASD scores come close to fine-tuned baselines. For example, in the 20-shot setting on Rest16 ASQP, LLMs reach an F1 score of 51.54, while the best-performing fine-tuned method (MVP) scores 60.39; on TASD, the 30-shot result (68.93) is near fine-tuned performance (72.76). Self-consistency prompting provides large F1 boosts across tasks, but results depend on model size, data domain, and output validation constraints. Overall, LLM prompting reduces annotation needs and can outperform some fine-tuning setups in low-resource scenarios, while human annotators remain valuable for achieving optimal results and ensuring data quality in ASQP tasks.

Abstract

Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores almost up to par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.

Paper Structure

This paper contains 29 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Annotated example for ASQP from Rest16 (Zhang et al., 2021). One or multiple opinion-quadruple annotations are assigned to each sentence.
  • Figure 2: The prompt includes both a task description and a specification of the output format. The LLM is run with five different seeds and, in the case of self-consistency prompting, the tuple that appears most often across the five predictions is incorporated into the final label.
  • Figure 3: Example of a prompt employed for the ASQP task. The prompt comprises an explanation of the considered sentiment elements, the output format, and annotated examples in the case of few-shot learning.
  • Figure 4: Influence of the number of few-shot examples on the performance of Gemma-3-4B and Gemma-3-27B. The visualization includes a comparison with the performance scores of the SOTA supervised methods MVP, Paraphrase and DLO.