Syntactic Evolution in Language Usage

Surbhit Kumar

TL;DR

This paper investigates how English syntax changes across the lifespan by analyzing blogger.com text from 2002–2004 across three age groups. It combines extensive syntactic feature extraction with PCA-based dimensionality reduction and a two-layer stacked ensemble to forecast age group, and it benchmarks real blog data against GPT-4-generated text. Real blog text shows increasing syntactic complexity with age, whereas GPT-4 outputs show weaker, less consistent age-related patterns. Forecasting performance is moderate (around 40% on balanced data, about 30% on new GPT-4 text). The work highlights how difficult it is for AI to replicate style across domains and underscores the need for diverse data and robust modeling to capture demographic-driven language variation in digital communication.

Abstract

This research investigates how linguistic style changes across the stages of life, from the post-teenage years to old age. Using linguistic analysis tools and methodologies, the study examines how individuals adapt and modify their language use over time. The research uses a dataset of blogs from blogger.com from 2004 and focuses on English for syntactic analysis. The findings may have implications for linguistics, psychology, and communication studies, shedding light on the relationship between age and language.

Paper Structure (8 sections, 7 figures)

Introduction
Research Design
Results
Comparisons with blog text and GPT-4 generated data
Forecasting accuracy on new text generated by GPT-4
Issues Encountered
Notes for Future Work
Conclusion

Figures (7)

Figure 1: Information Flow Diagram
Figure 2: Word cloud of the BlogText dataset by age group
Figure 3: Word cloud of GPT-4 prompts
Figure 4: Syntactic Feature Comparison: GPT-4 vs BlogText on the balanced dataset (about 51k rows), shown as a row-level heatmap
Figure 5: Syntactic Feature Comparison: GPT-4 vs BlogText on the full dataset (about 450k rows)
...and 2 more figures