BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

Jiangxi Chen; Qian Liu

TL;DR

BaziQA-Benchmark is a standardized, reproducible evaluation suite that tests symbolic and temporally compositional reasoning in large language models using 200 competition-grade BaZi problems from 2021--2025. The method fixes a pre-computed natal chart context, employs a multi-turn protocol, and adds a Structured Reasoning Protocol (SRP) to analyze inference-order effects without adding domain knowledge. Empirical results show that models perform above chance but far from saturation, with substantial variation across domains, years, and prompting protocols; SRP yields mixed, domain-dependent effects, and model error modes are diverse enough that ensembling may be worthwhile. The work highlights temporal composition as a core difficulty and domain heterogeneity as a design consideration, and it promotes protocol-based analysis as a diagnostic tool for understanding reasoning in non-standard symbolic systems, offering a framework for future benchmark development and evaluation methodology.

Abstract

We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021--2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.

Paper Structure (23 sections, 3 figures, 4 tables)

This paper contains 23 sections, 3 figures, 4 tables.

Introduction
Benchmark Definition and Evaluation Protocol
Problem Source and Benchmark Scope
Task Structure and Domain Coverage
Input Construction and Chart Representation
Multi-turn Evaluation Protocol
Structured Reasoning Protocol as an Evaluation Scaffold
Evaluation Settings and Aggregation
Benchmark Evaluation and Analysis
Overall Benchmark Performance
Year-wise Performance and Temporal Difficulty
Domain-wise Performance Profiles
Effect of Structured Reasoning Protocol
Domain-specific effects on the 2025 subset
Model Agreement and Diversity
...and 8 more sections

Figures (3)

Figure 1: Year-wise accuracy across models (mean $\pm$ standard deviation).
Figure 2: Effect of Structured Reasoning Protocol (SRP).
Figure 3: Pairwise model agreement on the 2025 subset (Multi-turn).
