Evaluating Large Language Models for Causal Modeling

Houssam Razouk, Leonie Benischke, Georg Niess, Roman Kern

TL;DR

Contemporary LLMs are helpful tools for conducting causal modeling tasks in collaboration with human experts: they can provide a wider perspective, and they perform better at distilling causal domain knowledge into causal variables than sparse expert models.

Abstract

In this paper, we consider the process of transforming causal domain knowledge into a representation that aligns more closely with guidelines from causal data science. To this end, we introduce two novel tasks: distilling causal domain knowledge into causal variables, and detecting interaction entities using LLMs. We find that contemporary LLMs are helpful tools for conducting causal modeling tasks in collaboration with human experts, as they can provide a wider perspective. Specifically, LLMs such as GPT-4-turbo and Llama3-70b perform better at distilling causal domain knowledge into causal variables than sparse expert models such as Mixtral-8x22b. In contrast, sparse expert models such as Mixtral-8x22b stand out as the most effective at identifying interaction entities. Finally, we highlight the dependency between the domain in which the entities are generated and the performance of the chosen LLM for causal modeling.

Paper Structure

This paper contains 11 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Causal modeling of domain knowledge by human experts. This is a tedious and error-prone process. We study the usefulness of LLMs in supporting this process, which we first divide into two distinct tasks.
  • Figure 2: Experimental workflow for benchmarking LLMs' causal modeling abilities. Each LLM is instructed to generate data sets according to the task and the domain. The generated data sets are sampled to construct positive and negative examples. Each LLM is then instructed to evaluate, in a zero-shot setting, positive and negative examples sampled from its own and other LLMs' generated data.
  • Figure 3: The agreement between the generated data for Task 1 and the classification results based on the cosine similarity threshold. Agreement at a lower cosine similarity threshold indicates that, in the generated data, two entities represent different values of the same causal variable and are not necessarily semantically similar. The data generated by Mixtral-8×22b tends to have a higher cosine similarity to other LLMs.
  • Figure 4: The agreement between the LLMs' predictions and the classification results based on the cosine similarity threshold. GPT-3.5-turbo, Mixtral-8×7b and Mistral-7b reach their peak agreement at a higher threshold, which is less aligned with the generated data. GPT-4-turbo, Llama3-70b and Llama3-8b reach their peak agreement at lower similarity thresholds, which aligns more closely with the generated data.
  • Figure 5: The agreement between the generated data for Task 2 and the classification results based on the cosine similarity threshold. Agreement at a lower cosine similarity threshold indicates that a text represents values of different causal variables and is not necessarily semantically similar to these variables. The data generated by Mixtral-8×22b tends to exhibit higher agreement at higher similarity thresholds compared to other LLMs.
  • ...and 1 more figure
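
The captions for Figures 3 to 5 compare labels against a classification obtained by thresholding cosine similarity. The paper's exact embedding model and agreement measure are not given in this summary, so the following is only a minimal sketch under stated assumptions: the toy 2-D vectors stand in for entity embeddings, agreement is taken to be plain accuracy, and the function names (`cosine_similarity`, `agreement_at_threshold`, `agreement_curve`) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def agreement_at_threshold(pairs, labels, threshold):
    """Fraction of pairs whose thresholded similarity matches the label.

    Assumption: a pair is classified as positive when its cosine
    similarity is at least `threshold`, and agreement is plain accuracy.
    """
    predictions = [cosine_similarity(u, v) >= threshold for u, v in pairs]
    matches = sum(p == label for p, label in zip(predictions, labels))
    return matches / len(labels)


def agreement_curve(pairs, labels, thresholds):
    """Agreement for each candidate threshold, as a dict."""
    return {t: agreement_at_threshold(pairs, labels, t) for t in thresholds}


# Toy embeddings (made up for illustration, not data from the paper).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # similarity 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # similarity 0.0
    ([1.0, 0.0], [1.0, 1.0]),  # similarity about 0.71
    ([1.0, 0.0], [1.0, 2.0]),  # similarity about 0.45
]
# Hypothetical labels: True means "same causal variable".
labels = [True, False, True, False]

curve = agreement_curve(pairs, labels, [0.3, 0.6, 0.9])
best_threshold = max(curve, key=curve.get)
```

Sweeping the threshold and reading off where agreement peaks is one way to interpret the captions' statements about "peak agreement at a lower threshold": a peak at a low threshold means the labels treat pairs as related even when their embeddings are not very similar.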