ImitAL: Learned Active Learning Strategy on Synthetic Data

Julius Gonsior, Maik Thiele, Wolfgang Lehner

TL;DR

No single active-learning strategy consistently dominates across diverse domains, and many rely on domain-specific data or costly runtimes. ImitAL reframes query selection as a learning-to-rank problem and trains a policy via imitation learning using large-scale synthetic AL simulations, blending informativeness and representativeness in a domain-independent way. A synthetic-data pipeline, an MDP formulation, listwise input/output encoding, and a pre-selection mechanism enable scalable learning and deployment. Across 13 real datasets and 7 baselines, ImitAL achieves superior or competitive F1-AUC with notably faster runtimes, suggesting practical, universal applicability for reducing labeling effort.

Abstract

Active Learning (AL) is a well-known standard method for efficiently obtaining annotated data: a query strategy selects the samples that contain the most information, and these are labeled first. A large variety of such query strategies has been proposed in the past, with each generation of new strategies increasing the runtime and adding more complexity. However, to the best of our knowledge, none of these strategies excels consistently over a large number of datasets from different application domains. Most of the existing AL strategies combine two simple heuristics, informativeness and representativeness, and differ mainly in how they combine these often conflicting heuristics. In this paper, we propose ImitAL, a novel domain-independent query strategy that encodes AL as a learning-to-rank problem and learns an optimal combination of both heuristics. We train ImitAL on large-scale simulated AL runs on purely synthetic datasets. To show that ImitAL was successfully trained, we perform an extensive evaluation comparing our strategy with 7 other query strategies on 13 different datasets from a wide range of domains.
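To make the two heuristics concrete, the following is a minimal sketch of a hand-weighted query strategy of the kind the abstract describes. It is not ImitAL's learned policy: the function names, the inverse-distance representativeness proxy, and the fixed `weight` parameter are all assumptions chosen for illustration. ImitAL replaces such a fixed weighting with a ranking learned from simulated AL runs.

```python
import math


def uncertainty(probs):
    """Informativeness proxy (assumed): 1 minus the top class probability."""
    return 1.0 - max(probs)


def representativeness(i, points):
    """Representativeness proxy (assumed): mean inverse distance
    from point i to all other unlabeled points."""
    others = [p for j, p in enumerate(points) if j != i]
    if not others:
        return 0.0
    return sum(1.0 / (1.0 + math.dist(points[i], p)) for p in others) / len(others)


def rank_batch(points, class_probs, weight=0.5, batch_size=2):
    """Rank unlabeled points by a fixed convex combination of both
    heuristics and return the indices of the top `batch_size` points."""
    scores = [
        weight * uncertainty(class_probs[i])
        + (1.0 - weight) * representativeness(i, points)
        for i in range(len(points))
    ]
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]


# Hypothetical toy pool: two points close together, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
class_probs = [[0.5, 0.5], [0.9, 0.1], [0.55, 0.45]]
selected = rank_batch(points, class_probs, weight=0.5, batch_size=2)
```

Setting `weight=1.0` reduces the sketch to pure uncertainty sampling, and `weight=0.0` to a purely density-driven choice. The conflict between the two extremes is what a learned combination is meant to resolve.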

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: General overview of the training procedure of ImitAL
  • Figure 2: Pre-selection process and action meaning for ImitAL, example for $j=4$, $k=6$, and $b=3$, and encoding of a state-action-triple
  • Figure 3: Converting learning curves into single numbers
  • Figure 4: Average runtime duration in seconds per complete experiment, with timeout duration for experiments lasting longer than seven days
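
The idea behind Figure 3, reducing a learning curve to a single number such as the F1-AUC mentioned in the TL;DR, can be sketched as follows. This is a minimal illustration under assumptions: it uses the trapezoidal rule over per-cycle F1 scores and normalizes by the number of intervals so the result stays in [0, 1]. The exact normalization used in the paper is not stated here, and the function name `f1_auc` is hypothetical.

```python
def f1_auc(f1_scores):
    """Area under a learning curve of per-cycle F1 scores,
    computed with the trapezoidal rule and normalized by the
    number of intervals (assumed normalization)."""
    if not f1_scores:
        raise ValueError("need at least one F1 score")
    if len(f1_scores) == 1:
        return float(f1_scores[0])
    # Each pair of consecutive cycles contributes one trapezoid of width 1.
    area = sum((a + b) / 2.0 for a, b in zip(f1_scores, f1_scores[1:]))
    return area / (len(f1_scores) - 1)


# Hypothetical learning curves of two strategies over three AL cycles.
fast_learner = f1_auc([0.5, 0.8, 0.9])
slow_learner = f1_auc([0.5, 0.6, 0.9])
```

Both hypothetical curves end at the same final F1 score, but the strategy that improves earlier receives the larger area. That difference is what a single-number summary of the whole curve is meant to capture.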