PROTECT: Protein circadian time prediction using unsupervised learning

Aram Ansary Ogholbake, Qiang Cheng

TL;DR

PROTECT predicts circadian sample phases from proteomic data without time labels or prior rhythmic markers, targeting small, noisy, high‑dimensional datasets and ultradian rhythms. It introduces an unsupervised deep learning pipeline that combines greedy layer‑wise pre‑training with cosine‑based fine‑tuning. The pipeline is anchored by initial phases $\phi^{0}_i = \arctan\left(\frac{s_i}{c_i}\right)$ and a prediction model $\hat{x}_{ip} = L_p + A_p \cos(\omega_p \hat{\phi}_i + \phi_p)$, and it is optimized with a loss $\mathfrak{L}$ and a regularization term with $\lambda=0$. The method is validated on time‑labeled datasets from mouse, plant, and human samples, achieving high accuracy (e.g., $\text{nAUC} \approx 0.94$ for O. tauri and $>0.80$ on the other datasets) and showing robustness to limited samples and outliers. Applied to unlabeled human brain regions and urine, PROTECT reveals AD‑associated circadian disruptions and identifies rhythmic proteins, enriched pathways, and hub drug targets, highlighting its potential to uncover disease mechanisms and therapeutic leads. Overall, PROTECT provides a general, seed‑free framework for circadian proteomics with broad applicability to other omics and disease contexts.
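The two formulas in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. The function names and the toy parameter values are hypothetical, and `atan2` is swapped in for the plain arctangent of $s_i / c_i$ so that the quadrant is resolved; that choice is an assumption, not something the summary states.

```python
import math

def initial_phase(s, c):
    """Angle of the point (c, s), mapped to [0, 2*pi).

    Assumption: atan2 replaces arctan(s / c) to keep the quadrant
    and to avoid division by zero when c == 0.
    """
    return math.atan2(s, c) % (2 * math.pi)

def cosine_model(phase, level, amplitude, omega, shift):
    """x_hat = L_p + A_p * cos(omega_p * phase + phi_p) for one protein."""
    return level + amplitude * math.cos(omega * phase + shift)

# Toy check: phases encoded as (sin, cos) pairs are recovered exactly.
true_phases = [0.5, 2.0, 4.0, 6.0]
recovered = [initial_phase(math.sin(p), math.cos(p)) for p in true_phases]

# Predicted expression of one hypothetical protein at the recovered phases.
predicted = [
    cosine_model(p, level=1.0, amplitude=0.5, omega=1.0, shift=0.0)
    for p in recovered
]
```

In this sketch `omega` is a per-protein frequency multiplier, so a value other than 1 would describe a rhythm with a different period than the sample phase cycle.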

Abstract

Circadian rhythms regulate the physiology and behavior of humans and animals. Despite advancements in understanding these rhythms and predicting circadian phases at the transcriptional level, predicting circadian phases from proteomic data remains elusive. This challenge is largely due to the scarcity of time labels in proteomic datasets, which are often characterized by small sample sizes, high dimensionality, and significant noise. Furthermore, existing methods for predicting circadian phases from transcriptomic data typically rely on prior knowledge of known rhythmic genes, making them unsuitable for proteomic datasets. To address this gap, we developed a novel computational method using unsupervised deep learning techniques to predict circadian sample phases from proteomic data without requiring time labels or prior knowledge of proteins or genes. Our model involves a two-stage training process optimized for robust circadian phase prediction: an initial greedy one-layer-at-a-time pre-training which generates informative initial parameters followed by fine-tuning. During fine-tuning, a specialized loss function guides the model to align protein expression levels with circadian patterns, enabling it to accurately capture the underlying rhythmic structure within the data. We tested our method on both time-labeled and unlabeled proteomic data. For labeled data, we compared our predictions to the known time labels, achieving high accuracy, while for unlabeled human datasets, including postmortem brain regions and urine samples, we explored circadian disruptions. Notably, our analysis identified disruptions in rhythmic proteins between Alzheimer's disease and control subjects across these samples.

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Illustration of the overall diagram of PROTECT.
  • Figure 2: Accuracy of PROTECT on (a) Ostreococcus tauri, (b) Mouse hip articular cartilage, (c) Mouse liver, and (d) Human plasma. The top row shows ROC curves where the y-axis shows the fraction of correctly predicted samples, and the x-axis shows the size of errors. The bottom row shows the scatter plots of predictions vs ground truth.
  • Figure 3: Comparison of CYCLOPS and PROTECT on the mouse liver dataset. (a) CYCLOPS ROC curve and predicted sample phases vs ground truth without using proteins corresponding to seed genes. (b) CYCLOPS ROC curve and predicted sample phases vs ground truth when using proteins corresponding to seed genes. (c) PROTECT ROC curve and predicted sample phases vs ground truth.
  • Figure 4: Plots of four core clock proteins in mouse liver using predicted phases by PROTECT. The y-axis represents protein expression levels, and the x-axis represents the predicted phases (in degrees) as determined by PROTECT.
  • Figure 5: Plots of four randomly chosen proteins known to be strongly regulated by the circadian cycle in human plasma, using phases predicted by PROTECT. The y-axis represents protein expression levels, and the x-axis represents the predicted phases (in degrees) as determined by PROTECT.
  • ...and 3 more figures
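
The ROC-style curves described in the Figure 2 caption (fraction of correctly predicted samples against error size) can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's evaluation code. It assumes times are in hours on a 24-hour cycle, that errors are wrapped circular distances, and that the normalized AUC is the area under the curve divided by the largest possible error of 12 hours. All function names are hypothetical.

```python
def circular_error_hours(pred, true, period=24.0):
    """Shortest distance between two times on a cycle (assumed 24 h)."""
    d = abs(pred - true) % period
    return min(d, period - d)

def fraction_within(errors, threshold):
    """Fraction of samples whose error is at most `threshold`."""
    return sum(e <= threshold for e in errors) / len(errors)

def normalized_auc(errors, max_error=12.0, steps=1200):
    """Trapezoidal area under the fraction-vs-error curve, scaled to [0, 1].

    Assumption: normalization divides by `max_error`; the source does
    not spell out how its nAUC is normalized.
    """
    xs = [max_error * k / steps for k in range(steps + 1)]
    ys = [fraction_within(errors, x) for x in xs]
    area = sum(
        (ys[k] + ys[k + 1]) / 2 * (xs[k + 1] - xs[k]) for k in range(steps)
    )
    return area / max_error

# Toy data: predicted vs. true sampling times in hours.
predicted_times = [23.0, 6.5, 12.0, 18.0]
true_times = [1.0, 6.0, 12.0, 20.0]
errors = [
    circular_error_hours(p, t) for p, t in zip(predicted_times, true_times)
]
score = normalized_auc(errors)
```

Under these assumptions, a perfect prediction gives a score of 1, and larger errors push the curve to the right and lower the score.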