Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models
Aggrey Muhebwa, Khalid K. Osman
TL;DR
The paper addresses the lack of explicit causal reasoning in explanations generated by compact language models. It introduces a causal distillation pipeline that transfers structured cause–effect reasoning from a large teacher (GPT-4) to small open-source student models. It formalizes the supervised fine-tuning objective $\theta^* = \arg\min_{\theta} \mathbb{E}_{X}\left[ D\left( M_{teacher}(X), M_{student}(X, \theta) \right) \right]$, where $D$ measures the discrepancy between teacher and student outputs. It also introduces the Causal Explanation Coherence (CEC) metric, defined as a bidirectional sentence-level semantic alignment score $CEC_{sym}$, to evaluate the causal fidelity of explanations. Experiments on Climate-FEVER show that the distilled students achieve high CEC scores (≥0.86, with Phi-2 reaching 0.910), whereas traditional lexical metrics underestimate explanation quality, underscoring the importance of causal coherence over surface similarity. The results suggest that causal distillation can equip compact language models with interpretable, evidence-based explanations and may apply beyond climate misinformation, though challenges remain in factuality checking, out-of-distribution generalization, and avoiding the inheritance of teacher biases.
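To make the objective concrete, one common instantiation is sequence-level distillation. This is an assumption for illustration, not a choice stated in the summary. The expectation is approximated by an average over $N$ training inputs, and $D$ is taken to be the negative log-likelihood that the student assigns to the teacher's explanation $y = M_{teacher}(X)$, with $p_{\theta}$ denoting the student's next-token distribution:

```latex
\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N}
  D\big( M_{teacher}(X_i),\, M_{student}(X_i, \theta) \big),
\qquad
D\big( y,\, M_{student}(X, \theta) \big)
  = -\sum_{t=1}^{|y|} \log p_{\theta}\big( y_t \mid y_{<t}, X \big).
```

Under this choice, minimizing the objective reduces to standard supervised fine-tuning of the student on teacher-generated explanations.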
Abstract
Large proprietary language models exhibit strong causal reasoning abilities that smaller open-source models struggle to replicate. We introduce a framework for causal explanation distillation that transfers these reasoning skills from a powerful teacher model to a compact open-source student model. The key idea is to train the student to generate structured cause-and-effect explanations consistent with those of the teacher, so that it acquires the underlying causal reasoning. To evaluate the student's explanations, we introduce a new metric, Causal Explanation Coherence (CEC), which assesses the structural and logical consistency of causal reasoning. CEC uses sentence-level semantic alignment to measure how well each part of a generated explanation corresponds to the teacher's reference, capturing both the faithfulness and the coverage of the underlying causal chain. Together, the framework and the CEC metric provide a principled foundation for training smaller models to perform robust causal reasoning and for systematically assessing the coherence of explanations in language model outputs.
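A minimal sketch of a bidirectional sentence-level alignment score in the spirit of $CEC_{sym}$ follows. The exact formula is not given here, so several choices are assumptions. Sentences are split on terminal punctuation. Each student sentence is scored by its best-matching teacher sentence (faithfulness), and each teacher sentence is scored by its best-matching student sentence (coverage). The two directions are then combined by a harmonic mean. The helper names (`split_sentences`, `bow_cosine`, `cec_sym`) are hypothetical. `bow_cosine` is only a bag-of-words placeholder; a faithful implementation would pass a sentence-embedding similarity through the `sim` argument, since lexical overlap is precisely what the paper reports as underestimating quality.

```python
import math
import re
from collections import Counter
from typing import Callable, List

# A similarity function maps two sentences to a score in [0, 1].
Similarity = Callable[[str, str], float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., ! or ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def bow_cosine(a: str, b: str) -> float:
    """Placeholder similarity: cosine over lowercase word counts.

    Hypothetical stand-in for an embedding-based cosine similarity.
    """
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_sq_a = sum(v * v for v in ca.values())
    norm_sq_b = sum(v * v for v in cb.values())
    if norm_sq_a == 0 or norm_sq_b == 0:
        return 0.0
    return dot / math.sqrt(norm_sq_a * norm_sq_b)


def directional_alignment(src: List[str], tgt: List[str], sim: Similarity) -> float:
    """Mean over src sentences of the best similarity to any tgt sentence."""
    if not src or not tgt:
        return 0.0
    return sum(max(sim(s, t) for t in tgt) for s in src) / len(src)


def cec_sym(student: str, teacher: str, sim: Similarity = bow_cosine) -> float:
    """Hypothetical symmetric coherence score between two explanations."""
    s_sents = split_sentences(student)
    t_sents = split_sentences(teacher)
    # Student -> teacher: is each student claim supported by the reference?
    faithfulness = directional_alignment(s_sents, t_sents, sim)
    # Teacher -> student: is each step of the reference chain reproduced?
    coverage = directional_alignment(t_sents, s_sents, sim)
    if faithfulness + coverage == 0:
        return 0.0
    # Harmonic mean (assumption): low in either direction pulls the score down.
    return 2 * faithfulness * coverage / (faithfulness + coverage)
```

For example, `cec_sym("Emissions rise.", "Emissions rise. Temperatures increase.")` yields 2/3 under these assumptions: the single student sentence is fully supported (faithfulness 1.0), but only half of the teacher's causal chain is covered (coverage 0.5). The harmonic mean is one way to penalize an explanation that is faithful but incomplete, or complete but padded with unsupported claims; because it is symmetric, swapping the student and teacher arguments leaves the score unchanged.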
