Syntactic Evolution in Language Usage

Surbhit Kumar

TL;DR

This paper investigates how English syntax evolves across the lifespan by analyzing blogger.com text from 2002–2004 across three age groups. It combines extensive syntactic feature extraction with PCA-based dimensionality reduction and a two-layer stacked ensemble to predict age groups, and it benchmarks real blog data against GPT-4 generated text. The findings indicate that real blog text exhibits increasing syntactic complexity with age, whereas GPT-4 outputs show weaker, less consistent age-related patterns, and the classifier achieves only moderate performance (around 40% on balanced data, ~30% on new GPT-4 text). The work highlights the difficulty of cross-domain style replication by AI and underscores the need for diverse data and robust modeling to capture demographic-driven language variation in digital communication.
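To make the idea of syntactic feature extraction concrete, the following is a minimal sketch using only the Python standard library. It is not the paper's feature set or pipeline: the function name `surface_syntax_features`, the `SUBORDINATORS` word list, and the three surface proxies (mean sentence length, commas per sentence, subordinator rate) are illustrative assumptions standing in for the richer parser-based features a study like this would likely use.

```python
import re
from statistics import mean

# Hypothetical, illustrative word list; not taken from the paper.
SUBORDINATORS = {
    "because", "although", "though", "while", "since",
    "unless", "whereas", "if", "that", "which",
}


def surface_syntax_features(text: str) -> dict:
    """Compute rough surface proxies for syntactic complexity of one post."""
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept) for each sentence.
    tokens_per_sentence = [re.findall(r"[A-Za-z']+", s.lower()) for s in sentences]
    tokens = [t for sent in tokens_per_sentence for t in sent]

    if not tokens:
        # Empty or non-alphabetic input: return zeros rather than dividing by zero.
        return {
            "n_sentences": 0,
            "mean_sentence_length": 0.0,
            "commas_per_sentence": 0.0,
            "subordinator_rate": 0.0,
        }

    return {
        "n_sentences": len(sentences),
        # Longer sentences are a crude signal of more complex syntax.
        "mean_sentence_length": mean(len(sent) for sent in tokens_per_sentence),
        # Commas loosely track clause boundaries.
        "commas_per_sentence": text.count(",") / len(sentences),
        # Share of tokens that can introduce a subordinate clause.
        "subordinator_rate": sum(t in SUBORDINATORS for t in tokens) / len(tokens),
    }


features = surface_syntax_features(
    "I went home. I stayed in because it rained, although I had plans."
)
print(features)
```

Each post would yield one such feature vector; a set of these vectors, grouped by the author's age group, is the kind of input that dimensionality reduction and a classifier could then operate on.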

Abstract

This research investigates how linguistic style changes across stages of life, from the post-teenage years to old age. Using linguistic analysis tools and methodologies, the study examines how individuals adapt and modify their language use over time. The research uses a dataset of blogs from blogger.com from 2004 and focuses on English for syntactic analysis. The findings may have implications for linguistics, psychology, and communication studies, shedding light on the relationship between age and language.

Paper Structure (8 sections, 7 figures)

This paper contains 8 sections, 7 figures.

Figures (7)

  • Figure 1: Information Flow Diagram
  • Figure 2: Word cloud of the BlogText dataset by age group
  • Figure 3: Word cloud of GPT-4 prompts
  • Figure 4: Syntactic Feature Comparison: GPT-4 vs BlogText on the balanced dataset (about 51k rows), shown as a row-level heatmap
  • Figure 5: Syntactic Feature Comparison: GPT-4 vs BlogText on the full dataset (about 450k rows)
  • ...and 2 more figures