SRNI-CAR: A comprehensive dataset for analyzing the Chinese automotive market
Ruixin Ding, Bowei Chen, James M. Wilson, Zhi Yan, Yufei Huang
TL;DR
This paper addresses the lack of a comprehensive public dataset for the Chinese automotive market by introducing SRNI-CAR, a dataset spanning 2016–2022 that merges car-series sales, online reviews, and industry news. It describes how the data were collected from major Chinese platforms, augmented with variables such as brand country of origin and model and brand launch and entry dates, and synchronized so that sales align with consumer feedback, including both official and actual prices. The dataset comprises sales records for 1,236 car series (39,496 observations across 155 brands), 217,292 online reviews (across 358 cities and 13,039 models), and 83,590 industry news items, stored in three CSV files of 3.6 MB, 480 MB, and 224.1 MB, respectively. Two analytics examples illustrate SRNI-CAR’s value: automobile sales forecasting and consumer behavior analytics, using XGBoost, SHAP, and SnowNLP sentiment scores. They surface insights such as the prominence of model and brand entry dates (first-mover effects) and the strong predictive power of review-text sentiment, with implications for automakers, policymakers, and researchers. The work positions SRNI-CAR as a practical, scalable resource for improving forecasting accuracy, optimizing marketing, and analyzing policy, with plans for ongoing updates and broader accessibility.
Abstract
The automotive industry plays a critical role in the global economy, and the expanding Chinese automobile market is particularly important because of its immense scale and influence. However, existing automotive datasets are limited in coverage and do not meet the growing demand for more, and more diverse, variables. This paper bridges this data gap by introducing a comprehensive dataset spanning 2016 to 2022 that encompasses sales data, online reviews, and a wealth of information related to the Chinese automotive industry. The dataset is a valuable resource that significantly expands the available data. Its impact extends to several dimensions: improving forecasting accuracy, expanding the scope of business applications, informing policy development and regulation, and advancing academic research within the automotive sector. To illustrate the dataset's potential in both business and academic contexts, we present two application examples. The dataset enhances understanding of the Chinese automotive market and offers a valuable tool for researchers, policymakers, and industry stakeholders worldwide.
