SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework

Rong Fu, Zijian Zhang, Wenxin Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong

TL;DR

SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery, while preserving or improving recall@k, cluster purity, and subgroup equity on large viral and tumor repertoires.

Abstract

Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.
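As a rough illustration of the MinHash prefiltering idea described in the abstract, the following Python sketch builds MinHash signatures over k-mer shingles and uses LSH banding to propose candidate pairs. It is not SubQuad's implementation: the k-mer length, the number of hash functions, the band count, the example sequences, and all function names are assumptions chosen for illustration.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # modulus for the universal hash family (assumed choice)


def shingles(seq, k=3):
    """Return the set of overlapping k-mers of a sequence."""
    if len(seq) <= k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def make_hashes(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def signature(seq, hashes, k=3):
    """MinHash signature: the minimum hash value over all shingles, per hash function."""
    ids = [zlib.crc32(s.encode()) for s in shingles(seq, k)]
    return tuple(min((a * x + b) % PRIME for x in ids) for a, b in hashes)


def candidate_pairs(seqs, num_hashes=32, bands=16, k=3, seed=0):
    """Propose index pairs whose signatures agree on at least one full band."""
    hashes = make_hashes(num_hashes, seed)
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, seq in enumerate(seqs):
        sig = signature(seq, hashes, k)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


# Hypothetical CDR3-like strings; the first two are identical.
seqs = ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CAWSVGGNTEAFF"]
pairs = candidate_pairs(seqs)
```

In this example the identical sequences 0 and 1 land in the same bucket for every band, so the pair (0, 1) is proposed, while sequence 2 shares no 3-mers with them and is not paired. Only the proposed pairs would then be passed on to a more expensive affinity computation, which is where the reduction in pairwise comparisons comes from.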

Paper Structure (76 sections, 3 theorems, 45 equations, 11 figures, 8 tables, 3 algorithms)

This paper contains 76 sections, 3 theorems, 45 equations, 11 figures, 8 tables, 3 algorithms.

Key Result

Theorem D.1

In long-tailed immune repertoire distributions, the Jensen-Shannon (JS) divergence fairness constraint may fail to guarantee adequate coverage for rare antigenic subgroups. Specifically, for a subgroup $g$ with cardinality $|g|$ satisfying $|g|/n \leq \epsilon$ where $\epsilon > 0$ is a small constant …
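The intuition behind this kind of result can be checked numerically. The Python sketch below is an illustration under assumed values, not the paper's proof or its exact bound: it compares a two-group population in which the rare subgroup has mass $\epsilon = 0.01$ with a clustering output that drops the rare subgroup entirely, and shows that the JS divergence between the two stays small. The value of `eps` and the two-group setup are assumptions.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL divergence to the mixture of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


eps = 0.01                      # assumed mass of the rare subgroup
population = [1 - eps, eps]     # majority group, rare group
clustered = [1.0, 0.0]          # rare group receives no coverage at all
js = js_divergence(population, clustered)
```

Here `js` is about 0.0035 nats, below $\epsilon \ln 2 \approx 0.0069$, even though the rare subgroup has been removed completely. A constraint that only requires the JS divergence to stay under a moderate threshold can therefore be satisfied while a rare subgroup receives zero coverage.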

Figures (11)

  • Figure 1: Overview of the SubQuad framework for near-quadratic-free, equity-aware repertoire inference. Scalable Preprocessing: Raw sequences $\mathcal{S}$ are processed via MinHash-based Indexing to generate a sparse candidate list $\mathcal{CAND}$ and optimized using hardware-aware batching $\mathcal{B}$. Representation Learning: A Dual-Phase Meta-Encoder utilizes ImmunoBERT-style pretraining followed by MetaNet fine-tuning. The Meta-Controller dynamically adjusts gating weights $\alpha_m$ for multi-paradigm fusion. Graph Construction: Multi-channel affinities are integrated via Dynamic Affinity Fusion to produce $\widetilde{a}_{ij}$. This similarity matrix is refined through RMT-based Thresholding (eigenvalue spectrum analysis) to produce a sparse weighted graph $G=(V,E,W)$. Fairness-Constrained Clustering: The graph is partitioned into clusters $\mathcal{C}$ by optimizing a joint objective of spatial cohesion and Jensen-Shannon Equity. An Automated Fairness Tuner dynamically calibrates the trade-off weight $\lambda$ to meet target disparity $\delta_{\max}$. Outputs: The pipeline yields antigen-aware clusters, topological maps (UMAP), and equity heatmaps for clinical interpretation.
  • Figure 2: Community structure in immune receptor networks. Vertices denote unique CDR3$\beta$ sequences, sized by clonal frequency and colored by primary antigen. Edges connect receptors with fused similarity above 0.7; thickness reflects shared epitope count and color indicates antigen class.
  • Figure 3: Latency scaling of HNSW retrieval under $10^7$ sequences. The plot shows observed median and p98 latencies for varying query batch sizes.
  • Figure 4: UMAP projection of ImmunoBERT embeddings showing conserved antigen clusters.
  • Figure 5: F1 Score Heatmap for MinHash Parameter Selection
  • ...and 6 more figures
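
The Figure 1 caption describes gating weights $\alpha_m$ that combine several affinity channels into a fused score $\widetilde{a}_{ij}$, but this listing does not give the functional form. The Python sketch below assumes one common choice, a softmax-normalized convex combination of per-channel scores; the function names, the two-channel example, and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax, so the gating weights are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_affinity(channel_scores, gate_logits):
    """Fused affinity for one pair (i, j): sum over channels m of alpha_m * a_ij^(m)."""
    alphas = softmax(gate_logits)
    return sum(a * s for a, s in zip(alphas, channel_scores))


# Assumed example: an alignment channel scoring 0.9 and an embedding channel scoring 0.5.
balanced = fuse_affinity([0.9, 0.5], [0.0, 0.0])    # equal gates average the two channels
align_led = fuse_affinity([0.9, 0.5], [4.0, 0.0])   # gate strongly favors the alignment channel
```

With equal logits the fused score is the plain average of the channels; raising one logit moves the fused score toward that channel's value. In a learned setting the logits would be produced per pair by a controller network rather than fixed by hand.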

Theorems & Definitions (7)

  • Theorem D.1: Coverage Lower Bound under JS Divergence
  • proof
  • Definition D.1: Weighted Coverage Divergence (WCD)
  • Theorem D.2: Coverage Lower Bound under WCD
  • proof
  • Theorem D.3: Convergence Rate of Fairness Calibrator
  • proof