Detecting and Identifying Selection Structure in Sequential Data

Yujia Zheng; Zeyu Tang; Yiwen Qiu; Bernhard Schölkopf; Kun Zhang

TL;DR

This work shifts the view of data selection from a nuisance to a core causal mechanism in sequential data. It proves nonparametric identifiability of selection structure, with and without latent confounders, and provides a provably correct, three-stage constraint-based algorithm that recovers selection, direct, and confounded relations using only observational data. The method scales as $O(N^2)$ and is validated on synthetic linear-Gaussian models and real music data, where discovered selection patterns correspond to meaningful musical structures. The findings offer a more faithful inductive bias for modeling sequential phenomena and open avenues for robust, selection-aware causal modeling and foundational model development.

Abstract

We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences. Since this selection process often distorts statistical analysis, previous work primarily views it as a bias to be corrected and proposes various methods to mitigate its effect. However, while controlling this bias is crucial, selection also offers an opportunity to provide a deeper insight into the hidden generation process, as it is a fundamental mechanism underlying what we observe. In particular, overlooking selection in sequential data can lead to an incomplete or overcomplicated inductive bias in modeling, such as assuming a universal autoregressive structure for all dependencies. Therefore, rather than merely viewing it as a bias, we explore the causal structure of selection in sequential data to delve deeper into the complete causal process. Specifically, we show that selection structure is identifiable without any parametric assumptions or interventional experiments. Moreover, even in cases where selection variables coexist with latent confounders, we still establish the nonparametric identifiability under appropriate structural conditions. Meanwhile, we also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies. The framework has been validated empirically on both synthetic data and real-world music.
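The claim that selection distorts statistical analysis can be made concrete with a small simulation. The sketch below is not the paper's method or data: it assumes two independent standard-normal variables and a hypothetical noisy selection score (`x1 + x2 + noise`) with an arbitrary threshold. It shows that keeping only the selected samples induces a dependence that does not exist in the full population.

```python
import numpy as np


def simulate_selection(n=20000, threshold=1.0, seed=0):
    """Return (corr in full sample, corr in selected sample).

    x1 and x2 are generated independently; the selection rule is a
    hypothetical one chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)

    # Hypothetical selection score: a noisy common effect of x1 and x2.
    score = x1 + x2 + 0.5 * rng.normal(size=n)
    keep = score > threshold  # only these samples are "observed"

    corr_full = np.corrcoef(x1, x2)[0, 1]
    corr_selected = np.corrcoef(x1[keep], x2[keep])[0, 1]
    return corr_full, corr_selected


if __name__ == "__main__":
    full, selected = simulate_selection()
    print(f"full-sample correlation:     {full:+.3f}")
    print(f"selected-sample correlation: {selected:+.3f}")
```

In the full sample the correlation is close to zero, while in the selected subsample it is clearly negative. A model that ignores selection would have to explain this dependence with a direct or confounded relation that is not actually present.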

Paper Structure (18 sections, 4 theorems, 12 figures, 2 algorithms)

Introduction
Preliminaries
Identifiability of Selection Structure
Identifiability Without Latent Confounders
Identifiability With Latent Confounders
Alternative Representations of Selection Structure
Identification Algorithm
Experiments
Synthetic Data
Real-World Data
Conclusion
Complete Algorithm
Illustration of the Algorithm
Additional Discussion on the Algorithm
Proofs
...and 3 more sections

Key Result

Theorem 3.1

Let the observed data be a sufficiently large sample generated by a model as defined in the Preliminaries section. In addition to the faithfulness assumption and the Markov condition, suppose the additional assumptions listed in the paper hold. Then all selection pairs and direct relations in the causal graph $\mathcal{G}$ are identifiable.
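Under the Markov condition and faithfulness, conditional (in)dependence patterns in the data reflect graph structure, which is what a constraint-based procedure exploits. The following is a generic sketch of a collider check, not the paper's algorithm: it assumes linear-Gaussian data, uses partial correlation as a stand-in for a conditional independence test, and the helper names `partial_corr` and `simulate` as well as the coefficient 0.8 are hypothetical.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z (with intercept)."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]


def simulate(kind, n=20000, seed=0):
    """Draw (a, b, c) from a toy linear-Gaussian chain or collider."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=(3, n))
    if kind == "chain":        # a -> b -> c
        a = e[0]
        b = 0.8 * a + e[1]
        c = 0.8 * b + e[2]
    elif kind == "collider":   # a -> b <- c
        a = e[0]
        c = e[2]
        b = 0.8 * a + 0.8 * c + e[1]
    else:
        raise ValueError(f"unknown kind: {kind}")
    return a, b, c


if __name__ == "__main__":
    for kind in ("chain", "collider"):
        a, b, c = simulate(kind)
        marginal = np.corrcoef(a, c)[0, 1]
        conditional = partial_corr(a, c, b)
        print(f"{kind:9s} corr(a, c) = {marginal:+.3f}   corr(a, c | b) = {conditional:+.3f}")
```

In the chain, `a` and `c` are dependent marginally and become nearly independent given `b`; in the collider the pattern is reversed. This asymmetry is the kind of signal that makes it possible to tell whether a middle variable is a collider on a path.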

Figures (12)

Figure 1: Example of the selection structure in the presence of latent confounders and direct relations. The figure depicts a sequence in a user's shopping history with a healthy lifestyle. Initially, the user buys an expensive painting from a gallery. Then the user visits a mall and first enjoys a steak meal at a restaurant. After that, the user spots a nearby watch store and then buys a pricey watch with its accompanying toolkit at the mall. Then, to adhere to a balanced diet, the user purchases salads from another nearby store within the mall. This data sequence was selected for a study investigating the daily habits of individuals identified as leading a healthy lifestyle. Therefore, the healthy lifestyle acts as a selection variable, of which one of the contributing factors is a balanced diet, exemplified by choices such as steak and salad. Concurrently, the user's financial status is a latent confounder (a common cause) for the user to buy both the painting and the watch. While there is no direct causal relation between having the steak and the painting, buying salad is directly caused by having the steak.
Figure 2: A running example for our identifiability theory.
Figure 3: Illustration of the intuition for the proof of Theorem 3.1. In general, we distinguish the selection pair (a) from the direct relation (b) based on whether $X_j$ is a collider on the path.
Figure 4: Examples of the degenerate structures in Theorem 3.1.
Figure 5: Intuition for identifying confounded pairs in Theorem 3.2. In general, we distinguish the confounded pair (a) from the spurious direct relations (b) by identifying the collider $X_i$.
...and 7 more figures

Theorems & Definitions (10)

Definition 2.0
Definition 2.0
Definition 2.0
Theorem 3.1
Theorem 3.2
Remark 3.2
Theorem 3.1
proof
Theorem 3.1
proof
