A Simple Approximation Algorithm for Optimal Decision Tree

Zhengjia Zhuo, Viswanath Nagarajan

TL;DR

The paper studies the Optimal Decision Tree (ODT) problem, where the goal is to identify the true hypothesis among $m$ candidates using sequential queries with arbitrary costs and responses; ODT is NP-hard to approximate within a factor better than $\ln m$. The paper proposes a simple greedy policy that, at each state, selects the query maximizing the expected number of newly eliminated hypotheses per unit cost, and proves an approximation ratio of $8\cdot(1+\ln m)$. The analysis adapts adaptive-submodular cover techniques, introducing a Stem$(w)$ construction and a key lower bound relating the greedy policy's progress to the optimal policy's progress via the quantities $a_t$ and $o_{t/L}$. The result is an easily implementable algorithm with a small explicit constant for the general ODT setting, with applications to active learning, entity identification, and medical diagnosis.
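To make the greedy rule concrete, the following is a minimal Python sketch of a policy that picks the query with the largest expected number of eliminated hypotheses per unit cost, read literally from the description above. It is not the paper's exact score (Definition 2.3 is not reproduced on this page), and the names `expected_eliminated`, `greedy_query`, `run_greedy`, and the dictionary-based instance encoding are illustrative assumptions.

```python
from collections import defaultdict


def expected_eliminated(alive, prob, resp_e):
    """Expected number of hypotheses in `alive` ruled out by one query.

    alive  : set of hypotheses still consistent with past responses
    prob   : dict hypothesis -> prior probability
    resp_e : dict hypothesis -> response this query gives under that hypothesis
    """
    total_p = sum(prob[h] for h in alive)
    groups = defaultdict(list)  # response -> hypotheses giving that response
    for h in alive:
        groups[resp_e[h]].append(h)
    exp = 0.0
    for members in groups.values():
        # Probability (conditioned on `alive`) that the true hypothesis is in this group.
        p_group = sum(prob[h] for h in members) / total_p
        # If it is, every hypothesis outside the group is eliminated.
        exp += p_group * (len(alive) - len(members))
    return exp


def greedy_query(alive, prob, cost, responses):
    """Query maximizing expected eliminations per unit cost (assumed scoring rule)."""
    return max(
        responses,
        key=lambda e: expected_eliminated(alive, prob, responses[e]) / cost[e],
    )


def run_greedy(true_h, prob, cost, responses):
    """Simulate the greedy policy when `true_h` is the true hypothesis."""
    alive = set(prob)
    spent = 0.0
    asked = []
    while len(alive) > 1:
        e = greedy_query(alive, prob, cost, responses)
        if expected_eliminated(alive, prob, responses[e]) == 0:
            raise ValueError("remaining hypotheses cannot be distinguished")
        spent += cost[e]
        asked.append(e)
        # Keep only hypotheses that agree with the observed response.
        alive = {h for h in alive if responses[e][h] == responses[e][true_h]}
    return alive, spent, asked
```

Each step partitions the surviving hypotheses by response, so one scoring pass costs time linear in the number of surviving hypotheses per query. Ties are broken by dictionary order here, which is an arbitrary choice.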

Abstract

Optimal decision tree (${\sf ODT}$) is a fundamental problem arising in applications such as active learning, entity identification, and medical diagnosis. An instance of ${\sf ODT}$ is given by $m$ hypotheses, out of which an unknown "true" hypothesis is drawn according to some probability distribution. An algorithm needs to identify the true hypothesis by making queries: each query incurs a cost and has a known response for each hypothesis. The goal is to minimize the expected query cost to identify the true hypothesis. We consider the most general setting with arbitrary costs, probabilities and responses. ${\sf ODT}$ is NP-hard to approximate better than $\ln m$ and there are $O(\ln m)$ approximation algorithms known for it. However, these algorithms and/or their analyses are quite complex. Moreover, the leading constant factors are large. We provide a simple algorithm and analysis for ${\sf ODT}$, proving an approximation ratio of $8 \ln m$.

Paper Structure

This paper contains 7 sections, 7 theorems, 28 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 1.1

Our greedy policy for general ${\sf ODT}$ has approximation ratio at most $8\cdot (1+\ln m)$.

Figures (1)

  • Figure 1: An example policy (decision tree). The initial state is $\emptyset$ and the rest are labeled $a$-$g$. The costs $c(e_1)=c(e_2)=3$ and $c(e_3)=c(e_4) =5$. We have ${\sf start}(b) = c(e_1)=3$ and ${\sf end}(b)={\sf start}(b)+c(e_3)=8$. Similarly, ${\sf start}(d) = c(e_1)+ c(e_2)=6$ and ${\sf end}(d) = {\sf start}(d)+ c(e_4)=11$. The active states at time $t=9$ are $\{d,g\}$.
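The arithmetic in the Figure 1 caption can be checked with a short sketch. It assumes, consistent with the caption's formulas, that state $b$ is reached after query $e_1$ and issues $e_3$, and that state $d$ is reached after $e_1, e_2$ and issues $e_4$. The helper names and the interval convention ${\sf start} < t \le {\sf end}$ for "active" are assumptions; Definition 2.2 is not reproduced on this page.

```python
# Query costs as given in the Figure 1 caption.
cost = {"e1": 3, "e2": 3, "e3": 5, "e4": 5}


def start(path):
    """Total cost of the queries made before reaching a state."""
    return sum(cost[e] for e in path)


def end(path, query):
    """Time at which the query issued at the state completes."""
    return start(path) + cost[query]


def is_active(path, query, t):
    """Assumed convention: a state is active at time t if start < t <= end."""
    return start(path) < t <= end(path, query)


# State b: reached via e1, issues e3 (assumed from start(b) and end(b)).
b_path, b_query = ("e1",), "e3"
# State d: reached via e1 then e2, issues e4 (assumed from start(d) and end(d)).
d_path, d_query = ("e1", "e2"), "e4"

print(start(b_path), end(b_path, b_query))
print(start(d_path), end(d_path, d_query))
print(is_active(d_path, d_query, 9), is_active(b_path, b_query, 9))
```

Under these assumptions, $d$ is active at $t=9$ and $b$ is not, which agrees with the caption listing $d$ among the active states.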

Theorems & Definitions (18)

  • Theorem 1.1
  • Definition 2.1
  • Definition 2.2: Active states
  • Lemma 2.1
  • Definition 2.3: Score
  • Lemma 2.2
  • Definition 2.4: Heavy part
  • Lemma 2.3
  • proof
  • Definition 2.5
  • ...and 8 more