One Size Fits None: Modeling NYC Taxi Trips

Tomas Eglinskas

TL;DR

We show that building a single universal tip-prediction model is a mistake: because of Simpson's paradox, a combined model looks accurate on average yet fails to predict tips within individual taxi categories, each of which requires its own specialized model.

Abstract

The rise of app-based ride-sharing has fundamentally changed tipping culture in New York City. We analyzed 280 million trips from 2024 to test whether tips can be predicted for traditional taxis versus high-volume for-hire services. Testing methods ranging from linear regression to deep neural networks, we found two very different outcomes. Tips in traditional taxis are highly predictable ($R^2 \approx 0.72$), which we attribute to the in-car payment screen. In contrast, app-based tipping is close to random and hard to model ($R^2 \approx 0.17$). We conclude that building one universal model is a mistake: because of Simpson's paradox, a combined model looks accurate on average yet fails to predict tips within individual taxi categories, each of which requires its own specialized model.
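The gap between pooled and per-category accuracy can be illustrated with a small synthetic sketch. Everything below is an assumption made for illustration, not the paper's data or pipeline: the two groups, the 20% tip rule, the noise levels, and the helper names `r2_score` and `fit_predict` are all hypothetical. A single least-squares model with a shared fare slope is scored on all trips, then scored again within each group, and finally compared with separately fitted per-group models.

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(X, y):
    """Ordinary least squares fit; returns in-sample predictions."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

# Synthetic data (assumed, not from the paper): two equally sized groups.
rng = np.random.default_rng(0)
n = 5000
fare = rng.uniform(5, 60, size=2 * n)
is_taxi = np.repeat([1.0, 0.0], n)          # first n rows: taxi, last n: app
taxi_mask = is_taxi == 1.0
app_mask = ~taxi_mask

# Taxi tips track the fare closely; app tips are unrelated to the fare.
tip = np.where(
    taxi_mask,
    0.2 * fare + rng.normal(0.0, 1.0, size=2 * n),
    rng.exponential(1.0, size=2 * n),
)

# One "universal" model: intercept, shared fare slope, group indicator.
X_pooled = np.column_stack([np.ones(2 * n), fare, is_taxi])
pred_pooled = fit_predict(X_pooled, tip)

r2_pooled_all = r2_score(tip, pred_pooled)
r2_pooled_taxi = r2_score(tip[taxi_mask], pred_pooled[taxi_mask])
r2_pooled_app = r2_score(tip[app_mask], pred_pooled[app_mask])

# Specialized models: a separate intercept and slope per group.
def specialized_r2(mask):
    X = np.column_stack([np.ones(mask.sum()), fare[mask]])
    return r2_score(tip[mask], fit_predict(X, tip[mask]))

r2_spec_taxi = specialized_r2(taxi_mask)
r2_spec_app = specialized_r2(app_mask)

print(f"pooled model, all trips:   {r2_pooled_all:.2f}")
print(f"pooled model, taxi only:   {r2_pooled_taxi:.2f}")
print(f"pooled model, app only:    {r2_pooled_app:.2f}")
print(f"specialized model, taxi:   {r2_spec_taxi:.2f}")
print(f"specialized model, app:    {r2_spec_app:.2f}")
```

In this toy setting much of the pooled score comes from the difference in average tip between the two groups, so the aggregate figure says little about how well tips are predicted inside either group. The shared slope is a compromise between a steep taxi relationship and a flat app one, which is why the pooled model can score worse than a constant within the app group.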


Paper Structure (20 sections, 1 equation, 10 figures, 6 tables)

Introduction
Research Background
Data
Data Cleaning
Schema Merging
Outliers
Feature Analysis
Distance Distribution
Temporal Distribution
Baseline Correlation Analysis
Synthetic Features and Data Enrichment
Methods
Algorithms
Linear Regression
CatBoost Regressor
...and 5 more sections

Figures (10)

Figure 1: Total trips and with tips by category in 2024
Figure 2: Tip and Distance Distribution across taxi types
Figure 3: Heatmap of Tip Counts (Top) and Median Tip Amounts (Bottom)
Figure 4: Correlation Matrix without Synthetic Features
Figure 5: Correlation Matrix with Synthetic Features
...and 5 more figures
