Active Learning with Simple Questions

Vasilis Kontonis; Mingchen Ma; Christos Tzamos

TL;DR

This work analyzes active learning with region queries: instead of querying the label of a single example, the learner picks a region $T$ and a target label $z$ and asks whether every example in $T$ has label $z$. It measures the complexity of the query language by the VC dimension $\mathrm{VCdim}(Q)$ of the query family $Q$ and shows a near-optimal trade-off: for any hypothesis class $\mathcal{H}$ with VC dimension $d$, there exists a region-query family $Q$ with $\mathrm{VCdim}(Q) = O(d)$ that lets a learner perfectly label any set of $n$ examples using $O(d\log n)$ queries. A matching lower bound exhibits a hypothesis class of VC dimension $d$ and a set of $n$ examples for which any query class with VC dimension less than $O(d)$ forces $\mathrm{poly}(n)$ queries. The paper then provides computationally efficient algorithms for natural classes (unions of intervals, axis-aligned boxes, and high-dimensional halfspaces) where the query language remains simple ($\mathrm{VCdim}(Q)=2$, $O(\log d)$, and $\tilde{O}(d^3)$, respectively) while using $O(d\log n)$ queries, or query counts with a near-constant dependence on $n$, even when the labeler answers on an unknown superset $L \supseteq S$ of the sample $S$. A key technical thread uses Forster's transform to place data in approximately radially isotropic position and develops region-query-based learning (including a perceptron-like update) that yields sublinear or $\mathrm{poly}(d,\log n)$ query complexities for halfspaces. Overall, the work formalizes a sharp, VC-dimension-guided trade-off between query complexity and query-language complexity, and gives efficient learning algorithms for several fundamental geometric hypothesis classes.
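As a compact restatement of the query model summarized above (a sketch in this summary's notation; the symbol $\mathrm{ans}$ is introduced here only for illustration and is not taken from the paper), a region query is a pair $(T, z)$ that the labeler answers as

$$
\mathrm{ans}(T, z) \;=\; \mathbb{1}\bigl[\, h^*(x) = z \ \text{ for every } x \in T \cap S \,\bigr],
$$

where $h^* \in \mathcal{H}$ is the target concept and $S$ is the learner's pool of examples. A classical label query on a pool example $x$ corresponds to the special case $T = \{x\}$. In the generalized setting, $T \cap S$ is replaced by $T \cap L$ for an unknown superset $L \supseteq S$.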

Abstract

We consider an active learning setting where a learner is presented with a pool $S$ of $n$ unlabeled examples belonging to a domain $\mathcal{X}$ and asks queries to find the underlying labeling that agrees with a target concept $h^* \in \mathcal{H}$. In contrast to traditional active learning that queries a single example for its label, we study more general region queries that allow the learner to pick a subset of the domain $T \subset \mathcal{X}$ and a target label $y$ and ask a labeler whether $h^*(x) = y$ for every example in the set $T \cap S$. Such more powerful queries allow us to bypass the limitations of traditional active learning and use significantly fewer rounds of interactions to learn but can potentially lead to a significantly more complex query language. Our main contribution is quantifying the trade-off between the number of queries and the complexity of the query language used by the learner. We measure the complexity of the region queries via the VC dimension of the family of regions. We show that given any hypothesis class $\mathcal{H}$ with VC dimension $d$, one can design a region query family $Q$ with VC dimension $O(d)$ such that for every set of $n$ examples $S \subset \mathcal{X}$ and every $h^* \in \mathcal{H}$, a learner can submit $O(d \log n)$ queries from $Q$ to a labeler and perfectly label $S$. We show a matching lower bound by designing a hypothesis class $\mathcal{H}$ with VC dimension $d$ and a dataset $S \subset \mathcal{X}$ of size $n$ such that any learning algorithm using any query class with VC dimension less than $O(d)$ must make $\mathrm{poly}(n)$ queries to label $S$ perfectly. Finally, we focus on well-studied hypothesis classes including unions of intervals, high-dimensional boxes, and $d$-dimensional halfspaces, and obtain stronger results. In particular, we design learning algorithms that (i) are computationally efficient and (ii) work even when the queries are not answered based on the learner's pool of examples $S$ but on some unknown superset $L$ of $S$.
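To make the query model concrete, here is a minimal Python sketch for the simplest case mentioned above, a union of intervals on the real line, with closed intervals as the query regions. It is an illustration under stated assumptions, not the paper's algorithm: the helper names `make_labeler`, `label_pool`, and `region_query` are hypothetical, the labeler is simulated from a known target function, and queries are answered on the pool $S$ itself. Each maximal run of equally labeled points costs one single-point query plus a binary search, so a target with $k$ intervals needs roughly $(2k+1)(1 + \log_2 n)$ queries.

```python
from typing import Callable, Dict, List, Tuple


def make_labeler(
    pool: List[float], target: Callable[[float], int]
) -> Tuple[Callable[[float, float, int], bool], Dict[str, int]]:
    """Simulated labeler (hypothetical helper, for illustration only).

    region_query(lo, hi, y) answers whether every pool point in the closed
    interval [lo, hi] has label y. The counter tracks how many queries were asked.
    """
    counter = {"queries": 0}

    def region_query(lo: float, hi: float, y: int) -> bool:
        counter["queries"] += 1
        return all(target(x) == y for x in pool if lo <= x <= hi)

    return region_query, counter


def label_pool(
    pool: List[float], region_query: Callable[[float, float, int], bool]
) -> Dict[float, int]:
    """Label every pool point using only interval-shaped region queries."""
    xs = sorted(pool)
    n = len(xs)
    labels: Dict[float, int] = {}
    i = 0
    while i < n:
        # A single-point region recovers the label of xs[i] (binary labels).
        y = 1 if region_query(xs[i], xs[i], 1) else 0
        # Binary search for the largest j such that xs[i..j] all carry label y.
        # The answer is monotone in j, and j = i always satisfies the query.
        lo, hi = i, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if region_query(xs[i], xs[mid], y):
                lo = mid
            else:
                hi = mid - 1
        for j in range(i, lo + 1):
            labels[xs[j]] = y
        i = lo + 1  # the next run starts right after the boundary
    return labels


if __name__ == "__main__":
    # Assumed toy target: a union of two intervals over an integer pool.
    pool = [float(v) for v in range(1000)]
    target = lambda x: 1 if (100 <= x < 200 or 700 <= x < 705) else 0
    region_query, counter = make_labeler(pool, target)
    labels = label_pool(pool, region_query)
    print(all(labels[x] == target(x) for x in pool), counter["queries"])
```

The design choice worth noting is that the learner never asks for more than one individual label per run. The negative stretches between intervals are certified in bulk by a logarithmic number of region queries, which is where the savings over single-example label queries come from.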

Paper Structure (33 sections, 29 theorems, 21 equations, 1 figure, 1 table, 8 algorithms)

Introduction
Active Learning with Queries
Example: 2-d Halfspaces
2-d Halfspaces (cont.)
Our results
Characterizing the Complexity of Learning with Region Queries
Efficient Learning Algorithms for Natural Hypothesis Classes
Connection with Other Learning Models and Related Work
Active Learning with Enriched Queries
Mistake-Based Query and Self-Directed Learning
Learning Halfspace with the Power of Adaptivity
Organization of the Paper
Active Learning for General Hypothesis Class Using Simple Region Query
Construction of Simple Query Classes for General Hypothesis Classes
Lower Bound on the VC Dimension of the Query Class
...and 18 more sections

Key Result

Theorem 1.2

Let $\mathcal{X}$ be a space of examples and $\mathcal{H}$ be a hypothesis class over $\mathcal{X}$ with VC dimension $d$. There is a region query family $Q$ over $\mathcal{X}$ with VC dimension at most $6d$ and a learning algorithm $\mathcal{A}$ such that for any set of $n$ examples $S \subseteq \mathcal{X}$ …

Figures (1)

Figure 1: Learning 2-dimensional Halfspaces with Region Queries

Theorems & Definitions (55)

Definition 1.1: Active Learning with Region Queries
Theorem 1.2
Theorem 1.3
Theorem 1.4
Corollary 2.1
Theorem 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
Definition A.1: VC Dimension of A Query Class
...and 45 more
