LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction

Nils Constantin Hellwig; Jakob Fehle; Udo Kruschwitz; Christian Wolff


TL;DR

This paper introduces LA-ABSA, an approach that uses Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex Aspect-Based Sentiment Analysis (ABSA) tasks. It outperforms previously reported augmentation strategies and performs competitively with LLM prompting in low-resource scenarios, while offering substantial energy-efficiency benefits.

Abstract

Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and performed competitively with LLM prompting in low-resource scenarios, while providing substantial energy-efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly fewer computational resources.
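The two ingredients named in the abstract, few-shot prompting for annotation and F1 scoring of predicted tuples, can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the prompt wording, the helper names `build_annotation_prompt` and `micro_f1`, the element order of the quadruple, and the choice of exact-match micro-averaged F1. No LLM is called; the sketch only shows how a handful of annotated examples could be placed in front of an unlabeled sentence and how predictions could be scored.

```python
from typing import List, Sequence, Tuple

# Hypothetical quadruple layout for ASQP:
# (aspect term, aspect category, sentiment polarity, opinion term)
Quad = Tuple[str, str, str, str]


def build_annotation_prompt(
    few_shot: Sequence[Tuple[str, List[Quad]]], sentence: str
) -> str:
    """Assemble an in-context-learning prompt from annotated examples.

    The instruction text is illustrative, not the prompt used in the paper.
    """
    lines = [
        "Extract (aspect term, aspect category, sentiment, opinion term) tuples."
    ]
    for text, quads in few_shot:
        lines.append(f"Sentence: {text}")
        lines.append(f"Tuples: {quads!r}")
    # The unlabeled sentence comes last; the LLM is expected to complete it.
    lines.append(f"Sentence: {sentence}")
    lines.append("Tuples:")
    return "\n".join(lines)


def micro_f1(
    predictions: Sequence[Sequence[Quad]], golds: Sequence[Sequence[Quad]]
) -> float:
    """Micro-averaged F1 where a tuple counts only if every element matches."""
    tp = n_pred = n_gold = 0
    for pred, gold in zip(predictions, golds):
        pred_set, gold_set = set(pred), set(gold)
        tp += len(pred_set & gold_set)
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Made-up toy data, not taken from any of the evaluated datasets.
shots = [
    ("The pasta was great.", [("pasta", "food quality", "positive", "great")]),
]
prompt = build_annotation_prompt(shots, "Service was slow.")

gold = [[("service", "service general", "negative", "slow"),
         ("pasta", "food quality", "positive", "great")]]
pred = [[("service", "service general", "negative", "slow"),
         ("pasta", "food quality", "neutral", "great")]]
score = micro_f1(pred, gold)
```

In this toy case one of the two predicted tuples matches a gold tuple exactly, so precision and recall are both 0.5. In the LA-ABSA setting, the sentences annotated through such a prompt would then serve as training data for the lightweight model.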


Paper Structure (27 sections, 1 equation, 4 figures, 7 tables, 2 algorithms)


Introduction
Related Work
Fine-tuning Language Models for Tuple Prediction
Approaches for Minimizing Annotation Effort
Methodology
LLM-as-an-Annotator (LA-ABSA)
Annotator Module
Trainer Module
Baselines
Fine-tuning on Human-Annotated Training Data
LLMs for Zero-shot and Few-shot Prompting
Low-resource Enhancement Methods
Evaluation and Datasets
Results
Overall Results
...and 12 more sections

Figures (4)

Figure 1: Illustration of LLM-as-an-Annotator (LA-ABSA). An LLM (Gemma-3-27B) is prompted to annotate training examples, which are subsequently used to fine-tune lightweight state-of-the-art models for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP).
Figure 2: Energy consumption analysis. Comparative evaluation of energy usage (kilowatt-hours, kWh) for ASQP and TASD tasks across different settings (0, 10, or 50 annotated examples given). Results are shown for LLM-prompting (Gemma-3-27B) and LA-ABSA methods, using either DLO [hu2022improving] or Paraphrase [zhang2021aspect] for fine-tuning. Each line represents the average energy usage to predict up to 100,000 examples per method and task across the five datasets. LA-ABSA approaches generally require much lower energy due to their smaller underlying base model T5-base.
Figure 3: Distribution of aspect categories and sentiments across datasets for TASD and ASQP tasks. Each subplot shows the top 10 aspect categories (sorted by total frequency), with stacked bars representing positive (green), neutral (yellow), and negative (red) sentiments. The 'Others' category aggregates the remaining aspects. Results are aggregated across five datasets: Rest15 [zhang2021aspect, pontiki2015semeval], Rest16 [zhang2021aspect, pontiki2016semeval], FlightABSA [hellwig2025we], Coursera [chebolu2024oats], and Hotels [chebolu2024oats]. This visualization highlights the imbalances in aspect-level sentiment annotations, showing varying distributions of polarities and aspect categories across datasets.
Figure 4: Prediction time analysis. Prediction time in minutes for ASQP and TASD tasks across different settings (0, 10, or 50 annotated examples given). Results are shown for LLM-prompting (Gemma-3-27B) and LA-ABSA using DLO [hu2022improving] and Paraphrase [zhang2021aspect] for fine-tuning. Each line represents the average time required to predict up to 100,000 examples per method and task across the five datasets. LA-ABSA approaches generally require substantially less time due to their smaller underlying base model T5-base.
