A Geometric Approach to Problems in Optimization and Data Science

Naren Sarayu Manoj

TL;DR

This thesis develops a geometric framework for core optimization and data-science problems by embedding them in high-dimensional convex geometry. It introduces streaming ellipsoidal rounding and coreset techniques to approximate convex polytopes and hulls with near-optimal distortion, while using block Lewis weights to sparsify block-norm objectives and accelerate MSN-type regression. It also analyzes robustness to adversarial data through backdoor models and monotone adversaries in both optimization (dueling) and clustering (spectral) tasks, deriving both algorithmic guarantees and fundamental limits. Collectively, these results yield memory-efficient, scalable algorithms with provable guarantees for core ML tasks under streaming, distributed, and adversarial settings, and provide a principled link between geometric approximations and statistical robustness. The work has practical impact in fast, robust optimization and data-analysis pipelines, including multidistributional regression, sparsification, and robust spectral methods.

Abstract

We give new results for problems in computational and statistical machine learning using tools from high-dimensional geometry and probability. We break up our treatment into two parts. In Part I, we focus on computational considerations in optimization. Specifically, we give new algorithms for approximating convex polytopes in a stream, sparsification and robust least squares regression, and dueling optimization. In Part II, we give new statistical guarantees for data science problems. In particular, we formulate a new model in which we analyze statistical properties of backdoor data poisoning attacks, and we study the robustness of graph clustering algorithms to "helpful" misspecification.

Paper Structure

This paper contains 240 sections, 214 theorems, 985 equations, 12 figures, 16 tables, and 19 algorithms.

Key Result

Theorem 1.1.1

Let $\bm{c} + \mathcal{E}(K)$ be the ellipsoid of maximal volume contained within $K$. Then, we have $\bm{c} + \mathcal{E}(K) \subseteq K \subseteq \bm{c} + \triangle \cdot \mathcal{E}(K)$, where $\triangle \le d$. Further, if $K$ is origin-symmetric, then this improves to $\triangle \le \sqrt{d}$. Finally, there exist convex bodies $K$ for which no ellipsoid can approximate $K$ to distortion better than $d$, and there exist origin-symmetric convex bodies $K$ for which no ellipsoid c...
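As a small sanity check of the origin-symmetric bound $\triangle \le \sqrt{d}$, the sketch below looks at the cube $[-1, 1]^d$. It assumes the standard fact (not verified by the code) that the maximal-volume ellipsoid inside this cube is the unit Euclidean ball, and it only computes the smallest scaling of that ball that covers the cube. The function names `cube_vertices` and `cube_distortion` are illustrative and do not come from the thesis.

```python
import itertools
import math


def cube_vertices(d):
    """Yield all 2^d vertices of the cube [-1, 1]^d."""
    return itertools.product((-1.0, 1.0), repeat=d)


def cube_distortion(d):
    """Smallest factor by which the unit ball must be scaled to contain the cube.

    The cube is the convex hull of its vertices and the Euclidean norm is
    convex, so the largest norm over the cube is attained at a vertex.
    """
    return max(math.sqrt(sum(x * x for x in v)) for v in cube_vertices(d))


if __name__ == "__main__":
    for d in (1, 2, 3, 8):
        # Each printed pair should agree: the cube attains distortion sqrt(d).
        print(d, cube_distortion(d), math.sqrt(d))
```

Under the stated assumption, the cube is an origin-symmetric body for which the maximal-volume inscribed ellipsoid attains distortion exactly $\sqrt{d}$, so the symmetric bound in the theorem cannot be improved for that ellipsoid.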

Figures (12)

  • Figure 1: A monotone update step. For brevity, we write $\mathcal{E}$ and $\alpha \cdot \mathcal{E}$ for the previous ellipsoids $\mathcal{E}_{t-1}, \alpha \mathcal{E}_{t-1}$, and $\mathcal{E}'$ and $\alpha' \cdot \mathcal{E}'$ for the next ellipsoids $\mathcal{E}_{t}, \alpha_t \cdot \mathcal{E}_t$. $\mathcal{E}$ and $\alpha \mathcal{E}$ are, respectively, the larger and smaller black circles. $c + \mathcal{E}'$ and $c + \alpha' \mathcal{E}'$ are the larger and smaller blue ellipses. The dotted lines show $\partial(\mathsf{conv}\left(\alpha \mathcal{E} \cup \{\bm{z}\}\right)) \setminus \partial(\alpha \mathcal{E})$, i.e. the boundary of $\mathsf{conv}\left(\alpha \cdot \mathcal{E} \cup \{\bm{z}\}\right)$ minus the boundary of $\alpha \mathcal{E}$.
  • Figure 2: Irregular update step. $\mathcal{E}_{t-1}$ and $\alpha \cdot \mathcal{E}_{t-1}$ are, respectively, the light blue strip on the $x$-axis and the dark blue strip on the $x$-axis. $\bm{z}_t = (0, \sqrt{1+2 \alpha})$ is the newly received point.
  • Figure 3: Outer ellipses of the update step. As before, $\mathcal{E}$ is the black circle and $c + \mathcal{E}'$ is the blue ellipse. $c_r + \mathcal{E}'$ is the magenta ellipse, with its center at $c_r$ and the dotted magenta line showing the position of $c_r$ along the $x$-axis. $c_r$ is defined so $c_r + \mathcal{E}'$ and $\mathcal{E}$ are tangent at two points. $Q$ is one of these two tangent points.
  • Figure 4: Inner ellipses of the update step. As before, $\alpha \mathcal{E}$ is the black circle and $c + \alpha' \mathcal{E}'$ is the blue ellipse. $P_0$ is the shared leftmost point of $\alpha \mathcal{E}$ and $c + \alpha' \mathcal{E}'$. There are two lines through $\bm{z}$ that are tangent to $\alpha \mathcal{E}$, one of which we call $L$ and is pictured in orange. We call the tangent points $P_1$ and $P_2$. The line segments $\overline{P_1 \bm{z}}, \overline{P_2 \bm{z}}$ are the dotted black lines. $P_1'$ and $P_2'$ are the two points of intersection between $c + \alpha' \mathcal{E}'$ and the line segment $\overline{P_1 P_2}$. $P_1''$ and $P_2''$ are the two points of intersection between $\partial (c + \alpha' \mathcal{E}')$ and $\partial \alpha \mathcal{E}$ to the right of the $y$-axis. Note that $P_2, P_2', P_2''$ are the reflections of $P_1, P_1', P_1''$ across the $x$-axis.
  • Figure 5: Inner ellipses of the update step. As before, $\alpha \mathcal{E}$ is the black circle, $c + \alpha' \mathcal{E}'$ is the blue ellipse, $L$ is the orange line through $\bm{z}$ and tangent to $\alpha \mathcal{E}$, $P_1$ and $P_2$ are the tangent points on the lines through $\bm{z}$ tangent to $\alpha \mathcal{E}$, and $\overline{P_1 \bm{z}}, \overline{P_2 \bm{z}}$ are the dotted black lines. $c_{+} + \alpha' \mathcal{E}'$ is the magenta ellipse, with its center at $c_{+}$ and a magenta dotted line showing its position on the $x$-axis. $c_{+}$ is defined so that $c_{+} + \alpha' \mathcal{E}'$ is tangent to $\overline{P_1 \bm{z}}$ and $\overline{P_2 \bm{z}}$, with $Q$ as the tangent point of $c_{+} + \alpha' \mathcal{E}'$ and $\overline{P_1 \bm{z}}$.
  • ...and 7 more figures

Theorems & Definitions (440)

  • Theorem 1.1.1: John's Theorem (John, 1948)
  • Definition 1: Inradius
  • Definition 2: Circumradius
  • Definition 3: Aspect Ratio
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Definition 2.1.1: Approximation to Minimum Volume Outer Ellipsoid
  • ...and 430 more