Synthetic Data Generation with Large Language Models for Personalized Community Question Answering

Marco Braga; Pranav Kasela; Alessandro Raganato; Gabriella Pasi

TL;DR

This paper investigates whether Large Language Models (LLMs) can generate synthetic documents to train an IR system for a Personalized Community Question Answering task, and introduces a new dataset, named Sy-SE-PQA. The findings suggest that LLMs have high potential in generating data tailored to users' needs.

Abstract

Personalization in Information Retrieval (IR) has long been studied by the research community. However, there is still a lack of datasets for large-scale evaluation of personalized IR, mainly because collecting and curating high-quality user-related information requires significant cost and time. Furthermore, the creation of datasets for Personalized IR (PIR) tasks is affected by both privacy concerns and the need for accurate user-related data, which are often not publicly available. Recently, researchers have started to explore the use of Large Language Models (LLMs) to generate synthetic datasets, a possible way to obtain data for low-resource tasks. In this paper, we investigate the potential of LLMs for generating synthetic documents to train an IR system for a Personalized Community Question Answering task. To study the effectiveness of IR models fine-tuned on LLM-generated data, we introduce a new dataset, named Sy-SE-PQA. We build Sy-SE-PQA on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Starting from questions in SE-PQA, we generate synthetic answers using different prompt techniques and LLMs. Our findings suggest that LLMs have high potential in generating data tailored to users' needs. The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
