Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

Subrata Karmaker

TL;DR

Problem: sarcasm detection on Reddit is challenging because the intended meaning of a comment often contradicts its literal wording. Approach: a context-free baseline that combines word-level and character-level TF-IDF features with simple stylistic indicators, evaluated with four classical classifiers. Findings: multinomial Naive Bayes and logistic regression reach an F1-score of about 0.57 on sarcastic comments; Naive Bayes shows an AUC of about 0.59; the absence of conversational context limits performance. Significance: a reproducible, interpretable baseline that future work can compare against when adding context or neural representations.

Abstract

Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform best, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.
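
To make the feature-engineering idea concrete, the following is a minimal sketch of what "simple stylistic indicators" and character-level units could look like. The abstract does not list the exact indicators used, so every feature name here (`n_exclaim`, `upper_ratio`, `n_elongated`, and so on) and both helper functions are hypothetical illustrations, not the paper's actual feature set.

```python
import re

# Hypothetical pattern: a letter repeated three or more times, as in "sooo".
ELONGATED = re.compile(r"([A-Za-z])\1{2,}")


def stylistic_indicators(comment: str) -> dict:
    """Return a few surface-level style features for one comment.

    Illustrative only: the indicators used in the paper are not
    specified in the abstract.
    """
    words = comment.split()
    letters = [ch for ch in comment if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    return {
        "n_words": len(words),
        "n_exclaim": comment.count("!"),
        "n_question": comment.count("?"),
        # Share of alphabetic characters that are uppercase (0.0 if none).
        "upper_ratio": upper / len(letters) if letters else 0.0,
        # Number of whitespace-separated tokens containing an elongation.
        "n_elongated": sum(1 for w in words if ELONGATED.search(w)),
    }


def char_ngrams(text: str, n: int = 3) -> list:
    """Overlapping character n-grams: the raw units that a
    character-level TF-IDF representation would count."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

In a full pipeline, numeric indicators like these would typically be concatenated with the sparse word-level and character-level TF-IDF vectors before being passed to a classifier; the choice of trigrams (`n=3`) is an assumption made for this sketch.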

Paper Structure

This paper contains 8 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Confusion matrix for the multinomial Naive Bayes classifier.
  • Figure 2: ROC curve for the multinomial Naive Bayes classifier (solid line) compared with a random baseline (dashed line).
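
For readers unfamiliar with the two summary metrics behind these figures, the sketch below shows how an F1-score follows from confusion-matrix counts and how an AUC can be computed as a pairwise ranking probability. The function names are hypothetical, and any counts or scores passed to them would be the reader's own; nothing here reproduces the paper's numbers.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 for the positive (sarcastic) class,
    computed from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(labels: list, scores: list) -> float:
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Ties count as half a win, so a constant scorer gets 0.5,
    which corresponds to the random baseline in an ROC plot."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise formulation is quadratic in the number of examples, so it is meant for illustration; on a corpus the size of the one described here, a sort-based implementation would be the practical choice.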