Towards More Accurate US Presidential Election via Multi-step Reasoning with Large Language Models

Chenxiao Yu, Zhaotian Weng, Yuangang Li, Zheng Li, Xiyang Hu, Yue Zhao

TL;DR

The paper tackles predicting US presidential outcomes with large language models (LLMs) by addressing data scarcity and evolving political contexts through a multi-step reasoning framework. It fuses real-world ANES time-series data with SynC-generated synthetic populations and compares three prompting pipelines that progressively add temporal context and Chain-of-Thought reasoning, with state-level aggregation to reflect election dynamics. The multi-step V3 pipeline—placing voters on a Conservative-Liberal spectrum and then simulating votes with time-aware prompts—achieves the best alignment with ground-truth results, attaining high discriminative performance (e.g., AUC around $0.90$ on state-level predictions) and strong performance in swing states. This work demonstrates a scalable, privacy-preserving approach to political forecasting using LLMs and points to future enhancements including multi-LLM ensembles and refined temporal models to further improve reliability and reduce biases.
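The two-step structure of the V3 pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Persona` fields, the prompt wording, the function names, and the `llm` callable (any function mapping a prompt string to a response string) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Persona:
    # Hypothetical demographic fields; the real ANES/SynC schema may differ.
    age: int
    gender: str
    race: str
    state: str


def ideology_prompt(p: Persona) -> str:
    # Step 1: ask the model to place the voter on a Conservative-Liberal spectrum.
    return (
        f"Voter profile: {p.age}-year-old {p.race} {p.gender} living in {p.state}.\n"
        "Step 1: Reason step by step, then place this voter on a "
        "Conservative-Liberal spectrum. Answer with one word: "
        "conservative, moderate, or liberal."
    )


def vote_prompt(p: Persona, ideology: str, year: int, context: str) -> str:
    # Step 2: combine the inferred ideology with time-dependent election context.
    return (
        f"Voter profile: {p.age}-year-old {p.race} {p.gender} living in {p.state}.\n"
        f"Ideology: {ideology}.\n"
        f"Election year: {year}. Candidate and policy context: {context}\n"
        "Step 2: Which party does this voter choose? Answer R or D."
    )


def simulate_vote(p: Persona, year: int, context: str,
                  llm: Callable[[str], str]) -> str:
    # Chain the two prompts: the output of step 1 feeds the prompt of step 2.
    ideology = llm(ideology_prompt(p)).strip().lower()
    answer = llm(vote_prompt(p, ideology, year, context)).strip().upper()
    if answer[:1] not in {"R", "D"}:
        raise ValueError(f"unparseable vote: {answer!r}")
    return answer[:1]


def stub_llm(prompt: str) -> str:
    # Deterministic stand-in for a real model, used only to exercise the flow.
    if "Step 1" in prompt:
        return "conservative"
    return "R" if "Ideology: conservative" in prompt else "D"


vote = simulate_vote(Persona(45, "man", "white", "OH"), 2020,
                     "placeholder candidate summary", stub_llm)
```

Splitting the call in two keeps each prompt short, which is the motivation the summary gives for V3 over the single, overloaded V2 prompt.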

Abstract

Can Large Language Models (LLMs) accurately predict election outcomes? While LLMs have demonstrated impressive performance in various domains, including healthcare, legal analysis, and creative tasks, their ability to forecast elections remains unknown. Election prediction poses unique challenges, such as limited voter-level data, rapidly changing political landscapes, and the need to model complex human behavior. To address these challenges, we introduce a multi-step reasoning framework designed for political analysis. Our approach is validated on real-world data from the American National Election Studies (ANES) 2016 and 2020, as well as synthetic personas generated by the SynC framework, offering scalable datasets for voter behavior modeling. To capture temporal dynamics, we incorporate candidates' policy positions and biographical details, so that the model adapts to evolving political contexts. Drawing on Chain of Thought prompting, our multi-step reasoning pipeline systematically integrates demographic, ideological, and time-dependent factors, enhancing the model's predictive power.

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: Demonstration of three prompt designs in § \ref{subsec:overview}. V1 is the direct prompt on voter demographic information, while V2 introduces time-dependent information to capture candidates' agenda and V3 also uses multi-step reasoning. In this example for 2020 Ohio result prediction, only V3 can accurately predict the results, demonstrating the importance of leveraging both time-dependent information and multi-step reasoning for election result prediction.
  • Figure 2: Progressive design of LLM pipelines for election predictions. V1: Direct Prompt on Demographic (§ \ref{subsec:v1}) uses static demographic personas but lacks temporal context. V2: Time-dependent Prompts (§ \ref{subsec:v2}) incorporates election-year policy shifts and candidate information, but struggles with overloaded prompts that limit prediction accuracy. V3: Multi-step Reasoning (§ \ref{subsec:v3}) structures the decision-making process into sequential steps, allowing for more nuanced reasoning and yielding unbiased results that align closely with real-world outcomes. Each version aggregates individual results through state-level simulations to reflect broader election trends.
  • Figure 3: Comparison of the three pipelines on ANES 2016 and 2020 benchmarks. The y-axis shows the predicted Republican vote ratio (R / (R + D)), with 0.5 indicating a balanced outcome. V1 (Vanilla Pipeline) and V2 (Single-step Time-based Prompting) overestimate Republican support, particularly in 2016. V3 (Multi-step Reasoning) achieves the most accurate results, closely matching the ground truth ratios: 48.34% vs. 47.7% (2016) and 46.78% vs. 41.2% (2020). These results highlight the improved accuracy of V3.
  • Figure 4: LLM’s predictions for four states in the 2020 election compared with Ground Truth results. The figure presents results for one red state (Ohio, OH), one blue state (Illinois, IL), one swing state (Wisconsin, WI), and one tipping-point state (Florida, FL). V1 and V2 pipelines tend to underestimate Republican support, while V3 (Multi-step Reasoning) provides the closest alignment with actual outcomes, especially in swing and tipping-point states.
  • Figure 5: Aggregated results of the three pipelines (V1, V2, V3) on state-level simulations. Each confusion matrix presents the number of states where predictions align with or deviate from actual outcomes. V1 (AUC = 0.69) and V2 (AUC = 0.62) show lower accuracy, while V3 (AUC = 0.90) performs best, effectively capturing Republican victories without compromising Democratic predictions. Note that we have so far tested the pipelines in only 21 states. If the scope is expanded to include all states, the AUC of V3 is expected to improve further, while the AUCs of V1 and V2 are expected to decline.
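
The quantities named in the captions for Figures 3 and 5 can be illustrated with a short sketch: the Republican vote ratio R / (R + D), a per-state winner call, and a state-level confusion matrix. The function names, the tie-breaking rule, and the toy votes below are hypothetical and are not taken from the paper. The AUC helper assumes hard binary predictions, for which the ROC AUC reduces to (TPR + TNR) / 2; whether the authors computed their AUC values this way is not stated in the captions.

```python
from collections import Counter


def republican_ratio(votes):
    # Predicted Republican vote ratio R / (R + D) over simulated voters.
    counts = Counter(votes)
    total = counts["R"] + counts["D"]
    if total == 0:
        raise ValueError("no R or D votes to aggregate")
    return counts["R"] / total


def call_state(votes):
    # A ratio above 0.5 is called Republican; an exact tie is called "D" here,
    # which is an arbitrary choice made for this sketch.
    return "R" if republican_ratio(votes) > 0.5 else "D"


def confusion_counts(predicted, actual):
    # Treat "R" as the positive class and count states in each cell.
    cm = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for state, truth in actual.items():
        pred = predicted[state]
        if truth == "R":
            cm["TP" if pred == "R" else "FN"] += 1
        else:
            cm["FP" if pred == "R" else "TN"] += 1
    return cm


def auc_from_hard_labels(cm):
    # With a single operating point, ROC AUC equals (TPR + TNR) / 2.
    tpr = cm["TP"] / (cm["TP"] + cm["FN"])
    tnr = cm["TN"] / (cm["TN"] + cm["FP"])
    return (tpr + tnr) / 2


# Toy simulated votes for four made-up states (not data from the paper).
simulated = {
    "S1": ["R", "R", "D"],
    "S2": ["D", "D", "R"],
    "S3": ["R", "D", "D"],
    "S4": ["D", "D", "D"],
}
actual = {"S1": "R", "S2": "D", "S3": "R", "S4": "D"}

predicted = {state: call_state(v) for state, v in simulated.items()}
cm = confusion_counts(predicted, actual)
auc = auc_from_hard_labels(cm)
```

In this toy case one Republican state is missed, so the true-positive rate is 0.5 while the true-negative rate is 1.0.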