Machine Learning Classification of Peaceful Countries: A Comparative Analysis and Dataset Optimization

K. Lian; L. S. Liebovitch; M. Wild; H. West; P. T. Coleman; F. Chen; E. Kimani; K. Sieck

TL;DR

This work classifies countries as peaceful or non-peaceful from linguistic patterns in global media. It proposes a supervised classification pipeline that converts articles into $1536$-dimensional embeddings with the OpenAI text-embedding-3-small model, stores them in a ChromaDB vector database, and uses cosine similarity to infer a country’s peace level; the pipeline is evaluated under a leave-one-country-out framework. The study reports an overall accuracy of $94\%$ and a strong correspondence between the computed peace percentages and the Human Development Index (HDI), with a linear fit giving $R^2 = 0.835$. The two main contributions are the embedding-based supervised classifier and an analysis of how dataset size affects performance, carried out by systematically reducing the dataset and evaluating the classification metrics. The findings highlight the potential of semantic, large-scale text analysis for peace studies and policy insights. Noted limitations include data quality, threshold arbitrariness, and unequal media coverage; suggested future work includes real-time dashboards, multi-source data integration, and bias-mitigation strategies.

Abstract

This paper presents a machine learning approach to classify countries as peaceful or non-peaceful using linguistic patterns extracted from global media articles. We employ vector embeddings and cosine similarity to develop a supervised classification model that effectively identifies peaceful countries. Additionally, we explore the impact of dataset size on model performance, investigating how shrinking the dataset influences classification accuracy. Our results highlight the challenges and opportunities associated with using large-scale text data for peace studies.
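As a rough illustration of the approach summarized above, the following sketch classifies articles by cosine similarity to labeled reference embeddings and aggregates the results per country under a leave-one-country-out scheme. It is not the authors' implementation: the nearest-neighbor majority vote, the value of `k`, the 50% country-level threshold, and the 3-dimensional toy vectors (standing in for the 1536-dimensional embeddings) are all assumptions made for the example, and every function name is hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_article(query, reference, k=3):
    """Label one article by majority vote over its k most similar
    reference articles (an assumed rule, not taken from the paper).

    reference: list of (embedding, label) pairs.
    """
    ranked = sorted(
        reference,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    top = ranked[:k]
    votes = sum(1 for _, label in top if label == "peaceful")
    return "peaceful" if votes * 2 > len(top) else "non-peaceful"


def leave_one_country_out(articles, k=3, threshold=50.0):
    """Hold out each country in turn and classify its articles against
    the articles of all other countries.

    articles: list of (country, embedding, label) triples.
    Returns {country: (peace_percentage, predicted_label)}.
    The 50% threshold is an assumption for illustration.
    """
    by_country = defaultdict(list)
    for country, emb, _ in articles:
        by_country[country].append(emb)

    results = {}
    for held_out, held_embs in by_country.items():
        reference = [(e, lab) for c, e, lab in articles if c != held_out]
        n_peaceful = sum(
            1 for e in held_embs
            if classify_article(e, reference, k) == "peaceful"
        )
        pct = 100.0 * n_peaceful / len(held_embs)
        label = "peaceful" if pct >= threshold else "non-peaceful"
        results[held_out] = (pct, label)
    return results


# Toy data: hypothetical countries A-D with made-up 3-dimensional vectors.
toy_articles = [
    ("A", [1.0, 0.1, 0.0], "peaceful"),
    ("A", [0.9, 0.2, 0.0], "peaceful"),
    ("B", [0.8, 0.3, 0.1], "peaceful"),
    ("B", [1.0, 0.0, 0.1], "peaceful"),
    ("C", [0.0, 0.1, 1.0], "non-peaceful"),
    ("C", [0.1, 0.2, 0.9], "non-peaceful"),
    ("D", [0.0, 0.3, 1.0], "non-peaceful"),
    ("D", [0.2, 0.0, 0.8], "non-peaceful"),
]

results = leave_one_country_out(toy_articles, k=3)
for country, (pct, label) in sorted(results.items()):
    print(f"{country}: {pct:.0f}% peaceful articles -> {label}")
```

Holding out an entire country, rather than individual articles, keeps articles from the same country out of the reference set, so the predicted label cannot simply echo that country's own coverage.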

Paper Structure (25 sections, 2 figures, 2 tables)

Introduction
Background
Objectives and Contributions
Methods
Embedding the Articles
Data Preprocessing
Embedding Generation
Storage
Machine Learning Classification
Query Embedding
Cosine Similarity Calculation
Article Classification
Country Classification
Impact of Dataset Size on Classification Performance
Data Sampling
...and 10 more sections

Figures (2)

Figure 1: HDI vs. Peace Percentage
Figure 2: Metrics vs. the number of rows
