BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
Jiangxi Chen, Qian Liu
TL;DR
BaziQA-Benchmark is a standardized, reproducible evaluation suite that tests symbolic and temporally compositional reasoning in large language models using 200 competition-grade BaZi problems from 2021–2025. Each problem is posed against a fixed, pre-computed natal chart under a multi-turn protocol, and a Structured Reasoning Protocol (SRP) constrains inference order without adding domain knowledge so that order effects can be analyzed. Models perform above random but far from saturation, with substantial variation across domains, years, and prompting protocols; SRP yields mixed, domain-dependent effects and exposes diverse error modes that may make ensembling worthwhile. The work identifies temporal composition as a core difficulty and domain heterogeneity as a design consideration, and it promotes protocol-based analysis as a diagnostic tool for studying reasoning in non-standard symbolic systems, offering a framework for future benchmark development and evaluation methodology.
Abstract
We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021–2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.
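As a rough illustration of what objective scoring and controlled comparison across years and domains could look like for multiple-choice items, the sketch below groups accuracy by an arbitrary key and compares it with a chance baseline. It is not the benchmark's actual harness: the `Item` record, its field names, the domain labels, the toy answers, and the assumption of four options per question are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """One scored multiple-choice question (hypothetical record layout)."""
    year: int
    domain: str
    gold: str          # correct option label
    predicted: str     # option label chosen by the model
    n_options: int = 4  # assumption: the real option count is not stated here


def accuracy_by(items, key):
    """Return accuracy per group, where key(item) names the group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for it in items:
        bucket = totals[key(it)]
        bucket[0] += int(it.predicted == it.gold)
        bucket[1] += 1
    return {group: correct / total for group, (correct, total) in totals.items()}


def chance_level(items):
    """Expected accuracy of uniform random guessing over the same items."""
    return sum(1 / it.n_options for it in items) / len(items)


# Toy data for illustration only; these are not benchmark results.
items = [
    Item(2021, "career", "A", "A"),
    Item(2021, "health", "B", "C"),
    Item(2025, "career", "D", "D"),
    Item(2025, "health", "C", "C"),
]

by_year = accuracy_by(items, lambda it: it.year)
by_domain = accuracy_by(items, lambda it: it.domain)
baseline = chance_level(items)
```

Grouping through a key function keeps the same scoring code usable for years, domains, model families, or prompting protocols, which is the kind of controlled comparison the abstract describes.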
