Do LLMs Find Human Answers To Fact-Driven Questions Perplexing? A Case Study on Reddit

Parker Seegmiller; Joseph Gatto; Omar Sharif; Madhusudan Basak; Sarah Masud Preum

TL;DR

The paper investigates whether large language models (LLMs) can model the breadth of human, fact-driven answers posted on social media, focusing on Reddit's r/Ask{Topic} communities. It builds a dataset of 409 fact-driven questions and 7,534 answers from 15 subreddits, and evaluates a 1.3B-parameter version of Sheared LLaMA in two regimes, out-of-the-box (SL) and fine-tuned (SLFT), comparing the model's perplexity on each answer with the answer's human rating. The study finds that LLM perplexity tracks human preference: highly-rated answers receive lower perplexity, and fine-tuning further improves alignment across topics, though notable outliers reveal limitations and blind spots. These results provide a data-rich framework for probing socio-technical aspects of LLM behavior in online discourse and offer a dataset to spur future social science and NLP research. The work also highlights directions for targeted fine-tuning and cross-domain extension to better mirror the diversity of human responses on social platforms.
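As a rough illustration of the metric, perplexity can be computed from per-token log-probabilities. The sketch below uses the standard definition (the exponential of the negative mean log-probability per token); the paper's exact formulation is not reproduced on this page, and the function name and the probability values are hypothetical, chosen only to show that a better-modeled answer yields a lower perplexity.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities a model might assign to two answers.
# These numbers are made up for illustration, not taken from the paper.
high_rated_logprobs = [math.log(p) for p in (0.5, 0.25, 0.5)]
low_rated_logprobs = [math.log(p) for p in (0.1, 0.05, 0.2)]

ppl_high = perplexity(high_rated_logprobs)
ppl_low = perplexity(low_rated_logprobs)

# The answer the model finds more predictable gets the lower perplexity.
print(f"high-rated answer perplexity: {ppl_high:.2f}")
print(f"low-rated answer perplexity:  {ppl_low:.2f}")
```

In practice the log-probabilities would come from a language model scoring the answer tokens; whether the question is included as context when scoring is an assumption this sketch does not settle.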

Abstract

Large language models (LLMs) have been shown to be proficient in correctly answering questions in the context of online discourse. However, the study of using LLMs to model human-like answers to fact-driven social media questions is still under-explored. In this work, we investigate how LLMs model the wide variety of human answers to fact-driven questions posed on several topic-specific Reddit communities, or subreddits. We collect and release a dataset of 409 fact-driven questions and 7,534 diverse, human-rated answers from 15 r/Ask{Topic} communities across 3 categories: profession, social identity, and geographic location. We find that LLMs are considerably better at modeling highly-rated human answers to such questions, as opposed to poorly-rated human answers. We present several directions for future research based on our initial findings.

Paper Structure (10 sections, 1 equation, 2 figures, 2 tables)

Introduction
Related Works
Social Media Question Answering
Factual Question Answering
Background and Methodology
Definitions
Relevant Data Collection and Processing
Models
Results and Implications
Implications for Future Work

Figures (2)

Figure 1: LLM (fine-tuned Sheared LLaMA 1.3B) modeling error (perplexity) and human ratings of 7,534 human answers to fact-driven questions posed on Reddit's r/Ask{Topic} communities in log scale. In general, LLM modeling and human perception are well-aligned. As an example, see the two divergent answers to the question asked on r/UK. The LLM assigns low perplexity to the highly-rated human answer, and higher perplexity to the low-rated human answer.
Figure 2: LLM perplexity of answers to fact-driven questions posed on 15 of Reddit's r/Ask{Topic} communities, compared with the peer-assigned score of answers. The perplexities in the top row are calculated by the vanilla SL LLM, and those in the bottom row by the fine-tuned SLFT LLM. For each graph, the X and Y axes indicate peer-assigned scores and the LLM's perplexity in log scale, respectively.
