Machine Learning Classification of Peaceful Countries: A Comparative Analysis and Dataset Optimization
K. Lian, L. S. Liebovitch, M. Wild, H. West, P. T. Coleman, F. Chen, E. Kimani, K. Sieck
TL;DR
This work classifies countries as peaceful or non-peaceful from linguistic patterns in global media. It proposes a supervised pipeline that converts articles into $1536$-dimensional embeddings with the OpenAI text-embedding-3-small model, stores them in a ChromaDB vector database, and uses cosine similarity to infer a country’s peace level, evaluated under a leave-one-country-out framework. The study reports an overall accuracy of $94\%$ and a strong correspondence between the computed peace percentages and the Human Development Index, with a linear fit giving $R^2 = 0.835$. Its two main contributions are the embedding-based supervised classifier and an analysis of how dataset size affects performance, carried out by systematically shrinking the dataset and re-evaluating the metrics. The findings point to the potential of large-scale semantic text analysis for peace studies and policy, while noting limitations such as data quality, arbitrary thresholds, and unequal media coverage. Suggested future work includes real-time dashboards, multi-source data integration, and bias-mitigation strategies.
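The classification idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the OpenAI embeddings and the ChromaDB store with small synthetic vectors held in memory, and the nearest-neighbor vote, the value `k=5`, and the `50%` decision threshold are assumptions chosen only for illustration. All function and variable names are hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def peace_percentage(query_vecs, reference, k=5):
    """Percentage of query articles whose k most similar reference
    articles are mostly labeled peaceful (1 = peaceful, 0 = non-peaceful).
    The majority vote over k neighbors is an assumed rule."""
    peaceful_articles = 0
    for q in query_vecs:
        ranked = sorted(reference,
                        key=lambda item: cosine_similarity(q, item[0]),
                        reverse=True)
        votes = [label for _, label in ranked[:k]]
        if sum(votes) * 2 > len(votes):
            peaceful_articles += 1
    return 100.0 * peaceful_articles / len(query_vecs)


def leave_one_country_out(articles, k=5, threshold=50.0):
    """articles: list of (country, vector, label) tuples.
    Each country is scored using only articles from the other countries."""
    by_country = defaultdict(list)
    for country, vec, _ in articles:
        by_country[country].append(vec)
    results = {}
    for country, vecs in by_country.items():
        reference = [(v, l) for c, v, l in articles if c != country]
        pct = peace_percentage(vecs, reference, k)
        results[country] = (pct, pct >= threshold)  # assumed threshold
    return results


# Synthetic stand-in for article embeddings (16 dimensions, not 1536).
rng = random.Random(0)
DIM = 16


def synthetic_article(is_peaceful):
    center = 8.0 if is_peaceful else -8.0
    return [center + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(DIM - 1)]


true_labels = {"A": 1, "B": 1, "C": 0, "D": 0}  # hypothetical countries
articles = [(c, synthetic_article(bool(l)), l)
            for c, l in true_labels.items() for _ in range(20)]

results = leave_one_country_out(articles)
accuracy = sum(pred == bool(true_labels[c])
               for c, (_, pred) in results.items()) / len(results)
print(results)
print("country-level accuracy:", accuracy)
```

Holding out all of a country's articles at once prevents its own coverage from appearing among the reference neighbors, which is the point of a leave-one-country-out evaluation. The synthetic classes here are deliberately well separated, so the sketch shows the mechanics rather than realistic performance.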
Abstract
This paper presents a machine learning approach to classifying countries as peaceful or non-peaceful using linguistic patterns extracted from global media articles. We use vector embeddings and cosine similarity to build a supervised classification model that identifies peaceful countries. We also examine how shrinking the dataset affects classification accuracy. Our results highlight the challenges and opportunities of using large-scale text data for peace studies.
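The dataset-size question raised in the abstract can be pictured as a loop that subsamples the labeled data and re-measures accuracy at each size. The sketch below is only an illustration under stated assumptions: it uses a nearest-centroid cosine classifier as a simple stand-in rather than the paper's method, synthetic vectors instead of real embeddings, and an arbitrary set of fractions. All names are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def accuracy_at_fraction(train, test, fraction, rng):
    """Keep `fraction` of the training vectors of each class (at least one),
    then classify each test vector by its most similar class centroid.
    train/test: lists of (vector, label), 1 = peaceful, 0 = non-peaceful."""
    centroids = {}
    for label in (0, 1):
        pool = [v for v, l in train if l == label]
        keep = max(1, round(fraction * len(pool)))
        centroids[label] = centroid(rng.sample(pool, keep))
    correct = 0
    for vec, label in test:
        predicted = max(centroids,
                        key=lambda c: cosine_similarity(vec, centroids[c]))
        correct += predicted == label
    return correct / len(test)


# Synthetic, well-separated data (an assumption made for the demo).
rng = random.Random(0)
DIM = 16


def synthetic_vector(label):
    center = 8.0 if label == 1 else -8.0
    return [center + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(DIM - 1)]


train = [(synthetic_vector(l), l) for l in (0, 1) for _ in range(50)]
test = [(synthetic_vector(l), l) for l in (0, 1) for _ in range(20)]

fractions = [1.0, 0.5, 0.25, 0.1]  # illustrative dataset sizes
curve = {f: accuracy_at_fraction(train, test, f, rng) for f in fractions}
for f in fractions:
    print(f"fraction {f:.2f}: accuracy {curve[f]:.2f}")
```

Subsampling within each class keeps both labels represented at every size, so a change in accuracy reflects the amount of data rather than a missing class. On real, noisier embeddings the curve would be expected to vary with the fraction; on this synthetic data it mainly demonstrates the loop.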
