Evaluation of an LLM in Identifying Logical Fallacies: A Call for Rigor When Adopting LLMs in HCI Research

Gionnieve Lim; Simon T. Perrault

TL;DR

This paper argues that adopting LLMs in HCI research requires rigorous, transparent evaluation rather than assumed capability. It empirically assesses GPT-4's ability to identify seven common logical fallacies using the LOGIC dataset, reporting an accuracy of $A=0.79$ on the full data and $A=0.90$ on the subset that remains after removing None predictions, with Emotion fallacies showing particular weakness. The study details the evaluation design, dataset curation, and prompt-engineering process, and discusses how the results should inform design decisions in misinformation interventions. The authors emphasize reporting performance to users and acknowledging limitations, and they highlight best practices for evaluating LLMs in research contexts, including the need for ongoing, thoughtful prompting strategies.

Abstract

There is increasing interest in the adoption of LLMs in HCI research. However, LLMs may often be regarded as a panacea because of their powerful capabilities, with little scrutiny of whether they are suitable for their intended tasks. We contend that LLMs should be adopted in a critical manner following rigorous evaluation. Accordingly, we present the evaluation of an LLM in identifying logical fallacies that will form part of a digital misinformation intervention. By comparing its predictions against a labeled dataset, we found that GPT-4 achieves an accuracy of 0.79 and, for our intended use case, which excludes invalid or unidentified instances, an accuracy of 0.90. This gives us the confidence to proceed with the application of the LLM while keeping in mind the areas where it still falls short. The paper describes our evaluation approach, results, and reflections on the use of the LLM for our intended task.
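The two accuracy figures in the abstract differ only in which instances are counted. The following is a minimal sketch of that distinction, not the authors' evaluation code: it assumes predictions are plain strings, that "None" marks an instance where the model identified no fallacy, and it uses made-up toy labels ("Fallacy A" and "Fallacy B" are placeholders, not categories confirmed by the paper).

```python
from typing import Sequence

NONE_LABEL = "None"  # assumed marker for "no fallacy identified"


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of instances whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def accuracy_excluding_none(
    gold: Sequence[str], pred: Sequence[str], none_label: str = NONE_LABEL
) -> float:
    """Accuracy over the subset where the model did not predict `none_label`."""
    kept = [(g, p) for g, p in zip(gold, pred) if p != none_label]
    if not kept:
        raise ValueError("every prediction was the none label")
    return sum(g == p for g, p in kept) / len(kept)


# Toy data for illustration only; these are not the paper's results.
gold = ["Fallacy A", "Emotion", "Emotion", "Fallacy B", "Fallacy A"]
pred = ["Fallacy A", "None", "Emotion", "Fallacy B", "Emotion"]

full = accuracy(gold, pred)                    # 3 of 5 correct
subset = accuracy_excluding_none(gold, pred)   # 3 of 4 correct after dropping "None"
print(f"full={full:.2f} subset={subset:.2f}")
```

Dropping the "None" predictions can only shrink the denominator and remove errors (assuming no gold label is "None"), so the subset figure is never lower than the full-data figure. That is why the two numbers should be reported together.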

Paper Structure (12 sections, 1 figure, 3 tables)

Background
The Logical Fallacies
Technical Evaluation
Results
Discussion
Evaluate to Inform Design Decisions
Considerations in Evaluating LLMs
Evaluation Strategy
Dataset
Prompt Engineering
Conclusion
Prompt for Identifying Logical Fallacies

Figures (1)

Figure 1: Normalized confusion matrix of the LLM in identifying logical fallacies for the subset data ($N=685$).
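A normalized confusion matrix such as the one in Figure 1 can be computed along the following lines. This is a sketch under stated assumptions: the caption does not say how the matrix was normalized, so row normalization (each gold-label row sums to 1) is assumed here, and the labels and data are toy placeholders rather than the paper's.

```python
from collections import Counter
from typing import List, Sequence


def normalized_confusion(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]
) -> List[List[float]]:
    """Row-normalized confusion matrix: rows are gold labels, columns are predictions."""
    counts = Counter(zip(gold, pred))
    matrix = []
    for g in labels:
        row_total = sum(counts[(g, p)] for p in labels)
        # A gold label with no instances yields a row of zeros.
        matrix.append(
            [counts[(g, p)] / row_total if row_total else 0.0 for p in labels]
        )
    return matrix


# Toy data for illustration only; "Fallacy A" is a placeholder label.
labels = ["Emotion", "Fallacy A"]
gold = ["Emotion", "Emotion", "Emotion", "Emotion", "Fallacy A", "Fallacy A"]
pred = ["Emotion", "Emotion", "Emotion", "Fallacy A", "Fallacy A", "Fallacy A"]

matrix = normalized_confusion(gold, pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

With row normalization, the diagonal entry of each row is the per-class recall, which makes a weak category stand out even when it has few instances.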
