Infinite hierarchical contrastive clustering for personal digital envirotyping
Ya-Yun Huang, Joseph McClernon, Jason A. Oliver, Matthew M. Engelhard
TL;DR
This work tackles the challenge of envirotyping by automatically clustering daily-environment images into an unbounded set of personal environments and their higher-level types. It introduces infinite hierarchical contrastive clustering (IH-CC), combining a stick-breaking prior on cluster probabilities with a participant-specific head to induce meaningful intra- and inter-participant structure, trained end-to-end with a composite loss. On two cohorts, IH-CC identifies coherent environment clusters, reveals environment-type groupings shared across participants, and links environment clusters to smoking-related health outcomes, illustrating the method's potential to advance envirotyping. The approach offers a scalable, data-driven pathway to quantify how daily environments influence health and behavior, enabling environment-aware interventions.
Abstract
Daily environments have a profound influence on our health and behavior. Recent work has shown that digital envirotyping, where computer vision is applied to images of daily environments taken during ecological momentary assessment (EMA), can be used to identify meaningful relationships between environmental features and health outcomes of interest. To systematically study such effects at the individual level, it is helpful to group images into the distinct environments encountered in an individual's daily life; these may then be analyzed, further grouped into related environments with similar features, and linked to health outcomes. Here we introduce infinite hierarchical contrastive clustering to address this challenge. Building on the established contrastive clustering framework, our method a) allows an arbitrary number of clusters without requiring the full Dirichlet Process machinery by placing a stick-breaking prior on predicted cluster probabilities; and b) encourages distinct environments to form well-defined sub-clusters within each cluster of related environments by incorporating a participant-specific prediction loss. Our experiments show that our model effectively identifies distinct personal environments and groups these environments into meaningful environment types. We then illustrate how the resulting clusters can be linked to various health outcomes, highlighting the potential of our approach to advance the envirotyping paradigm.
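For readers unfamiliar with stick-breaking, the following minimal sketch shows the generic construction that such a prior is based on: break fractions v_k are converted into weights pi_k = v_k * prod_{j<k} (1 - v_j), so mass decays over cluster indices and the number of occupied clusters need not be fixed in advance. This is not the authors' implementation. The abstract does not specify how the prior enters the training loss, and the function names, the Beta(1, alpha) choice, the value of alpha, and the truncation at 10 components are all assumptions made for illustration.

```python
import random


def stick_breaking_weights(fractions):
    """Map break fractions v_k in [0, 1] to weights pi_k = v_k * prod_{j<k} (1 - v_j).

    Returns the list of weights and the leftover stick length (mass not yet
    assigned to any component under this finite truncation).
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for v in fractions:
        weights.append(v * remaining)  # break off a fraction v of what is left
        remaining *= 1.0 - v           # shrink the remaining stick
    return weights, remaining


def sample_fractions(alpha, num_components, rng):
    """Draw v_k ~ Beta(1, alpha); alpha is a hypothetical concentration parameter."""
    return [rng.betavariate(1.0, alpha) for _ in range(num_components)]


# Illustrative draw, truncated at 10 components (an assumption for this sketch).
rng = random.Random(0)
fractions = sample_fractions(alpha=2.0, num_components=10, rng=rng)
weights, leftover = stick_breaking_weights(fractions)
```

The weights plus the leftover mass always sum to one, so truncating at more components only redistributes the small remaining tail. Smaller values of alpha concentrate mass on the first few components, while larger values spread it over more of them.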
