Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
TL;DR
This work investigates whether large language models (LLMs) can replace or reduce human annotation for aspect sentiment quad prediction (ASQP) by evaluating zero- to fifty-shot prompts across five diverse datasets, including the newly introduced FlightABSA. Using Gemma-3-27B and Gemma-3-4B, the study shows that few-shot prompting yields substantial gains over zero-shot prompting; for example, Rest16 ASQP scores rise from the 0-shot to the 50-shot setting under self-consistency. In the 20-shot setting on Rest16 ASQP, LLMs reach an F1 score of 51.54, compared to 60.39 for the best-performing fine-tuned method, MVP. On Rest16 target aspect sentiment detection (TASD), the 30-shot setting reaches 68.93, close to the 72.76 achieved by MVP. Self-consistency prompting provides large F1 boosts across tasks, but results depend on model size, data domain, and output validation constraints. Overall, LLM prompting reduces annotation needs and can outperform some fine-tuning setups in low-resource scenarios, while human annotators remain valuable for achieving optimal results and ensuring data quality in ASQP tasks.
Abstract
Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores almost up to par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.
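To make the task and the reported metric concrete, the following is a minimal Python sketch of how a quad could be represented and how an F1 score over predicted quads could be computed. It assumes exact-match, micro-averaged F1 over all sentences, which is a common convention for ASQP but is not specified in the abstract; the `Quad` type, the `micro_f1` function, and the category labels in the usage note are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import List, NamedTuple


class Quad(NamedTuple):
    """One opinion: the four elements named in the abstract (hypothetical type)."""
    aspect_term: str
    aspect_category: str
    sentiment: str
    opinion_term: str


def micro_f1(gold: List[List[Quad]], pred: List[List[Quad]]) -> float:
    """Micro-averaged F1 over sentences, assuming exact match on all four fields.

    gold and pred hold one list of quads per sentence, in the same order.
    Duplicate quads within a sentence are counted once.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")

    true_pos = n_gold = n_pred = 0
    for gold_quads, pred_quads in zip(gold, pred):
        gold_set, pred_set = set(gold_quads), set(pred_quads)
        true_pos += len(gold_set & pred_set)  # quads matching on every field
        n_gold += len(gold_set)
        n_pred += len(pred_set)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For a sentence with two gold quads where the prediction gets one quad fully right and assigns the wrong polarity to the other, precision and recall are both 0.5, so the sketch returns an F1 of 0.5. Under the same assumptions, a TASD variant would drop `opinion_term` and compare the remaining three fields.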
