Generative causal testing to bridge data-driven models and scientific theories in language neuroscience

Richard Antonello, Chandan Singh, Shailee Jain, Aliyah Hsu, Sihang Guo, Jianfeng Gao, Bin Yu, Alexander Huth

TL;DR

The paper presents Generative Causal Testing (GCT), a framework that turns opaque, data-driven language encoding models into concise verbal explanations and then uses LLM-generated stimuli to causally test those explanations in fMRI. By applying GCT to voxelwise and ROI-level language selectivity, the authors demonstrate accurate, stable explanations, identify new micro-ROIs in the prefrontal cortex, and probe fine-grained differences across regions with similar selectivity. The approach provides a systematic way to bridge predictive models and scientific theories, enabling closed-loop hypothesis formation, testing, and refinement using naturalistic language stimuli. This work has broad implications for causal neuroscience and demonstrates how LLMs can operationalize abstract theories into testable, interpretable predictions across cortical domains.

Abstract

Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain from predictive models and then testing those explanations in follow-up experiments using LLM-generated stimuli. This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs), including newly identified microROIs in prefrontal cortex. We show that explanatory accuracy is closely related to the predictive power and stability of the underlying predictive models. Finally, we show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.

Paper Structure (22 sections, 19 figures, 2 tables)

Figures (19)

  • Figure 1: Driving single-voxel response with generative causal testing. (a) Voxelwise BOLD responses were recorded using fMRI as human subjects listened to 20 hours of narrative stories. An encoding model $f$ was fit to predict these responses from the story text. $f$ consists of a linear model fit on representations extracted from an LLM, which are not readily interpretable. Encoding models were tested by predicting responses on held-out fMRI data, and only well-performing models were selected for further analysis. (b) We used an automated procedure to find a verbal description of the function that $f$ computes for each voxel. First, we tested $f$ on a large catalog of $n$-grams ($n=1,2,3$) and found those that maximally drove predicted responses. These $n$-grams were then summarized into stable explanation candidates using a powerful instruction-tuned LLM. Finally, we evaluated each explanation candidate by generating corresponding synthetic sentences and testing that these sentences yielded large predictions from $f$. (c) To test whether the generated explanations were causally related to activation in the brain, we used an LLM to produce synthetic narrative stories where each paragraph is designed to drive responses based on the generated explanation for one voxel. For each subject we constructed stories to drive 17 well-modeled voxels with diverse selectivity. These stories were then presented to the subjects in a second fMRI experiment. (d) Average BOLD response of each voxel during its driving paragraph, relative to baseline (the average response to all non-driving paragraphs). On average, driven responses were significantly higher than baseline for each subject ($p=0.020$ (S01), $p<10^{-5}$ (S02), $p=0.009$ (S03); permutation test, FDR-corrected). For well-driven voxels, this means that the generated explanation is causally related to activation of that voxel, and thus that we have successfully translated the LLM-based encoding model into a verbal explanation. (e) Average BOLD response for each selected voxel to each of the driving paragraphs in one subject (S02). Responses to the driving paragraph generated using the explanation for that voxel appear along the main diagonal. Explanations that were used to construct the driving paragraphs are shown below. BOLD responses were generally high for the driving paragraphs for each voxel as well as semantically related paragraphs (e.g. directions and locations, emotional expression and laughter).
  • Figure 2: Driving ROI response with generative explanation-mediated validation for subject S02. (a) Explanations were generated and used to drive 8 well-defined regions of interest. Responses in all ROIs were significantly driven above baseline ($p<0.05$; permutation test, FDR-corrected). (b) To understand driving performance with more granularity, we color each voxel in each ROI by how well it was driven by its corresponding driving paragraph. The resulting composite flatmap occasionally shows subregions within ROIs that are more selectively driven for a particular explanation. (c) GCT can also be used to build more nuanced theories of cortical semantic selectivity. We focused on three ROIs that are known to have similar selectivity for place concepts: retrosplenial cortex (RSC), the parahippocampal place area (PPA), and the occipital place area (OPA). When explanations were generated for each ROI independently, we found that each ROI was driven by all three driving paragraphs (left side). To distinguish these ROIs, we used GCT to find new explanations and construct stories that would selectively drive each area while suppressing the other two. Testing these stories in an fMRI experiment showed that we succeeded in finding selective explanations for two ROIs: RSC is selectively driven by location names and PPA by unappetizing foods. However, the explanation for OPA, spatial positioning & directions, drove responses in all three ROIs (right side). (d) Visualization of the place area driving experiment with voxel-level granularity. A 3-channel flatmap shows the outcome of each location-selective driving experiment; a voxel is more red, green, or blue if it was driven by the corresponding ROI explanation.
  • Figure 3: Evaluating hypothesized micro-ROIs in prefrontal cortex using GCT. (a) To measure GCT's ability to aid in the discovery of new brain regions, spatially-contiguous candidate ROIs were defined in a grid pattern throughout prefrontal cortex. Candidate ROIs with high stability scores (see \ref{sec:factors}) were retained as stable ROIs. GCT was used to automatically generate explanations and driving stimuli for these ROIs. The ROI responses to the driving stimuli were then measured in an fMRI experiment, and their average driving scores are shown in a histogram for subjects S02 and S03. In both subjects, high-stability candidate ROIs are driven using their corresponding explanation at statistically significant rates; 47 out of 73 are significantly driven for S02 and 21 out of 128 are significantly driven for S03 ($p<0.05$, 1-tailed $t$-test, FDR-corrected). (b) Significantly driven candidate ROIs across the two subjects are visualized, colored by their driving scores. Significant candidate ROIs for various explanations (Recognition, Time, Dialogue, and Measurements) are in consistent locations across the subjects, suggesting population-level trends. (c) We next examined whether GCT can validate functional similarity claims between 5 regions in the language network, localized using an established localizer \cite{fedorenko2012language}. We concatenated the driving scores of each of these ROIs into a vector and then computed the cosine similarity between the vectors for different ROIs. We find that the driving vectors of language network ROIs are on average more similar to each other than over 94% of randomly selected pairs of similarly-sized ROIs in UTS02 and over 99% of randomly selected pairs in UTS03 ($p < 10^{-5}$; 1-tailed $t$-test), supporting claims of functional similarity across the language network.
  • Figure 4: Analyzing factors that impact explanation-mediated validation. To evaluate whether GCT succeeds in generating effective stimulus stories, we assessed the driving paragraphs of the stimulus. (a) To confirm that generated paragraphs match the explanation used to construct them, a matching score was computed for each explanation and paragraph by using an LLM to evaluate the fraction of trigrams in the paragraph that are relevant to the paragraph’s generating explanation and then z-scoring the result. Each driving paragraph showed a strong match with its generating explanation. Plot shows one subject (S02); similar plots for other subjects are shown in \ref{subsec:stratifying_driving_patterns}. (b) To confirm that each driving paragraph effectively drives its corresponding encoding model, we computed the predicted response in each selected voxel to each generated paragraph. This revealed strong matches for most voxels, along with some matches between driving paragraphs and voxels with semantically similar explanations, e.g. directions and locations. Plot shows one subject (S02); similar plots for other subjects are shown in \ref{subsec:stratifying_driving_patterns}. (c) After running the fMRI driving experiment, we found that a key factor determining whether a voxel is driven well by the GCT framework was the stability score for the voxel, i.e. the correlation between the $n$-gram rankings provided by the LLaMA-based encoding model and the OPT-based encoding model. (d) Another important factor for eliciting increased driving responses is the presence of key $n$-grams in the driving paragraphs. These $n$-grams induce a standard hemodynamic response curve that peaks at around 6 seconds, yielding a significant increase ($p = 0.009$; one-sided $t$-test). (e) Finally, to test whether the driving results were sensitive to the particular voxels that were selected, we evaluated whether the GCT stories drove alternative voxels in each subject that were assigned the same explanation as the target voxels being driven. Both the target voxels and the alternative voxels showed significantly increased driving responses ($p < 0.05$; permutation test, FDR-corrected).
  • Figure 5: Results for driving polysemantic voxels. (a) Voxel responses for driving paragraphs (blue) show a small increase relative to the baseline responses of the remaining paragraphs (gray). Each voxel appears as two points connected by a vertical line; each point shows the result when driving the voxel using a different explanation. Most voxels are only driven successfully for a single explanation.
  • ...and 14 more figures
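
Several captions above rely on two quantities: a driving score (response to the driving paragraph relative to the baseline of all non-driving paragraphs, Figure 1d) and a stability score (correlation between the n-gram rankings of two encoding models, Figure 4c). The sketch below shows one plausible way to compute both with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names, the label-shuffling permutation scheme, the tie-free rank correlation, and the toy data are all hypothetical, and the FDR correction mentioned in the captions is omitted.

```python
import numpy as np


def rank_correlation(scores_a, scores_b):
    """Spearman-style correlation of two score vectors (assumes no ties)."""
    ranks_a = np.argsort(np.argsort(scores_a))
    ranks_b = np.argsort(np.argsort(scores_b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


def driving_score(responses, is_driving):
    """Mean response in driving paragraphs minus mean response in the rest."""
    responses = np.asarray(responses, dtype=float)
    is_driving = np.asarray(is_driving, dtype=bool)
    return float(responses[is_driving].mean() - responses[~is_driving].mean())


def permutation_pvalue(responses, is_driving, n_perm=10_000, seed=0):
    """One-sided p-value from shuffling which paragraphs count as driving."""
    rng = np.random.default_rng(seed)
    is_driving = np.asarray(is_driving, dtype=bool)
    observed = driving_score(responses, is_driving)
    null = np.array([
        driving_score(responses, rng.permutation(is_driving))
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((1 + np.sum(null >= observed)) / (n_perm + 1))


# Toy data (made up): 17 paragraph responses for one voxel; paragraph 0 drives it.
rng = np.random.default_rng(1)
responses = rng.normal(0.0, 1.0, size=17)
responses[0] += 3.0
is_driving = np.zeros(17, dtype=bool)
is_driving[0] = True

print("driving score:", driving_score(responses, is_driving))
print("p-value:", permutation_pvalue(responses, is_driving, n_perm=2000))

# Toy stability score: two hypothetical models scoring the same five n-grams.
model_a = [0.9, 0.1, 0.5, 0.3, 0.7]
model_b = [0.8, 0.2, 0.6, 0.1, 0.9]
print("stability score:", rank_correlation(model_a, model_b))
```

With a single driving paragraph among 17, the shuffled labels can take only 17 distinct positions, so the smallest attainable p-value in this toy setup is roughly 1/17. Pooling responses across voxels or paragraphs, as the per-subject tests in the captions appear to do, would be needed to reach smaller values.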