Causal Discovery and Classification Using Lempel-Ziv Complexity

Dhruthi, Nithin Nagaraj, Harikrishnan N B

TL;DR

The causality-based decision tree significantly outperforms both the distance-based decision tree and the Gini-based decision tree on datasets generated from causal models, indicating that the proposed approach can capture insights beyond those of classical decision trees, especially in causally structured data.

Abstract

Inferring causal relationships in the decision-making processes of machine learning algorithms is a crucial step toward achieving explainable Artificial Intelligence (AI). In this research, we introduce a novel causality measure and a distance metric derived from Lempel-Ziv (LZ) complexity. We explore how the proposed causality measure can be used in decision trees by enabling splits based on the features that most strongly cause the outcome. We further evaluate the effectiveness of the causality-based decision tree and the distance-based decision tree in comparison to a traditional decision tree using Gini impurity. While the proposed methods demonstrate comparable classification performance overall, the causality-based decision tree significantly outperforms both the distance-based decision tree and the Gini-based decision tree on datasets generated from causal models. This result indicates that the proposed approach can capture insights beyond those of classical decision trees, especially in causally structured data. Based on the features used in the LZ causal-measure-based decision tree, we introduce a causal strength for each feature in the dataset to infer the predominant causal variables for the occurrence of the outcome.
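The causality measure and distance metric mentioned in the abstract are both built on LZ complexity, but this summary does not define them. The sketch below therefore illustrates only the underlying quantity: the number of phrases in an exhaustive (1976-style) Lempel-Ziv parsing of a symbol string. The function name `lz76_complexity` is hypothetical, and the choice of this particular LZ variant is an assumption rather than something confirmed by the paper.

```python
def lz76_complexity(s: str) -> int:
    """Count phrases in an exhaustive Lempel-Ziv (1976-style) parsing of s.

    Assumption: s is a string of discrete symbols (e.g. a binarized
    time series). This is a simple quadratic-time sketch, not the
    authors' implementation.
    """
    n = len(s)
    i = 0       # start index of the current phrase
    count = 0   # number of phrases found so far
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from the text
        # seen so far (the prefix excluding the phrase's last symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1      # phrase s[i:i+length] is new (or input ended)
        i += length
    return count


if __name__ == "__main__":
    # Parses as 0 | 001 | 10 | 100 | 1000 | 101
    print(lz76_complexity("0001101001000101"))
```

A highly repetitive string yields few phrases, while an irregular string yields many, which is why LZ complexity is commonly used as a compressibility-based proxy for the information content of a sequence.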

Paper Structure

This paper contains 23 sections, 24 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: (a) Average LZ Penalty vs. coupling coefficient for the AR-1 process. (b) Average LZ Penalty vs. coupling coefficient for the AR-5 process. (c) Average LZ Penalty vs. coupling coefficient for the AR-20 process. (d) Average LZ Penalty vs. coupling coefficient for the AR-100 process. The coupling coefficient is varied from 0 to 1 with a step size of 0.1. For each coupling coefficient, the LZ Penalty was averaged across 1000 independent random trials.
  • Figure 2: (a) LZ Penalty vs. coupling coefficient for the coupled chaotic logistic map, averaged across 1000 independent trials for each coupling coefficient. The coupling coefficient is varied from 0 to 0.9 with a step size of 0.1. (b) An instance of the time series X(t) and Y(t) for coupling coefficient $\eta = 0.4$ over the first 50 iterations, indicating synchronization between X(t) and Y(t).
  • Figure 3: Accuracy vs. decision rate for the Tuebingen dataset using the proposed LZ Penalty measure and the LZ-P measure defined in pranay2021causal.
  • Figure 4: Bar graph of macro F1 scores of predictions made by LZ-distance-metric-based decision trees, LZ-causal-metric-based decision trees, and Gini-impurity-based decision trees for various datasets. The datasets marked with a '*' are highly imbalanced.
  • Figure 5: Causal Decision Tree for Heart Disease dataset.
  • ...and 2 more figures