Teaching Transformers Causal Reasoning through Axiomatic Training
Aniket Vashishtha, Abhinav Kumar, Atharva Pandey, Abbavaram Gowtham Reddy, Kabir Ahuja, Vineeth N Balasubramanian, Amit Sharma
TL;DR
This work introduces axiomatic training, a method for teaching transformers causal reasoning from symbolic demonstrations of axioms such as transitivity and d-separation. A 67M-parameter model trained from scratch on synthetic axiom triplets generalizes to longer chains, branched graphs, and edge-order perturbations, and rotary positional encodings further improve length generalization. Extending the approach to finetune Llama-3-8B-Instruct yields substantial gains on causal benchmarks such as CLEAR and Corr2Cause, sometimes surpassing GPT-4. The results suggest that axiomatic training is a scalable way to improve causal reasoning in both small and large language models, with implications for knowledge-driven reasoning tasks and verification in real-world applications.
Abstract
For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since active interventions are costly, we study to what extent a system can learn causal reasoning from symbolic demonstrations of causal axioms. Specifically, we present an axiomatic training method where the system learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the system would learn to generalize from the axiom demonstrations to more complex scenarios. Our results, based on applying axiomatic training to learn the transitivity axiom and d-separation rule, indicate that such generalization is possible. To avoid data contamination issues, we start with a 67 million parameter transformer model and train it from scratch. On both tasks, we find that a model trained on linear causal chains (along with some noisy variations) can generalize well to complex graphs, including longer causal chains, causal chains with reversed order, and graphs with branching. To handle diverse text inputs, the same method is extended to finetune language models. Finetuning the Llama-3-8B-Instruct model on our axiomatic data leads to significant gains on causal benchmarks such as Corr2Cause and CLEAR, in some cases providing state-of-the-art performance surpassing GPT-4.
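To make the idea of a symbolic axiom demonstration concrete, the following is a minimal sketch of how transitivity examples on a linear causal chain might be generated. It is not the authors' data pipeline: the sentence template ("A causes B."), the question wording, the Yes/No labels, and the function names `make_chain`, `chain_premise`, and `make_example` are all assumptions made for illustration.

```python
import random


def make_chain(n_nodes, rng):
    """Return node names in causal order for a linear chain (hypothetical naming scheme)."""
    names = [f"X{i}" for i in range(1, n_nodes + 1)]
    rng.shuffle(names)  # randomize which name sits at which position in the chain
    return names


def chain_premise(chain):
    """Render the chain as text, one sentence per directed edge (assumed template)."""
    return " ".join(f"{a} causes {b}." for a, b in zip(chain, chain[1:]))


def make_example(chain, rng):
    """Build one (premise, hypothesis, label) triplet for the transitivity axiom.

    In a linear chain with all edges pointing forward, chain[i] causes chain[j]
    exactly when i < j, so the label follows from the two positions.
    """
    i, j = rng.sample(range(len(chain)), 2)  # two distinct positions
    premise = chain_premise(chain)
    hypothesis = f"Does {chain[i]} cause {chain[j]}?"
    label = "Yes" if i < j else "No"
    return premise, hypothesis, label


if __name__ == "__main__":
    rng = random.Random(0)
    chain = make_chain(4, rng)
    premise, hypothesis, label = make_example(chain, rng)
    print(premise, hypothesis, label)
```

Under these assumptions, training data would consist of many such triplets on short chains, and evaluation would use longer chains, reversed edge order, or branched graphs that the sketch above does not generate.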
