Machine Learning in Automated Text Categorization

Fabrizio Sebastiani

TL;DR

This paper surveys automated text categorization (TC), the task of assigning documents to predefined categories, and contrasts manual knowledge engineering with inductive, data-driven classifiers built from preclassified corpora. It frames TC as approximating an unknown target function $\breve{\Phi}: D \times C \to \{T,F\}$ by learning from endogenous knowledge and category labels, and it discusses different problem settings (single-label vs multi-label, category-pivoted vs document-pivoted, hard vs ranking outputs). The main content reviews text representation and dimensionality reduction (e.g., tf-idf weighting, DIA, TSR, LSI), a wide range of inductive classifiers (Naive Bayes, decision trees, Rocchio, regression methods, neural nets, k-NN, SVMs), and ensemble approaches. Evaluation is discussed in terms of precision and recall, micro- and macro-averaging, and established benchmarks such as Reuters, and the paper highlights the practical finding that boosting, SVMs, example-based methods, and regression methods often perform strongly. Overall, the work presents ML-driven TC as scalable to large document collections, enabling automatic indexing, organization, filtering, and retrieval across diverse domains, including noisy text and transcripts.
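
The difference between micro- and macro-averaging mentioned above can be illustrated with a minimal Python sketch. The function names and the per-category counts are hypothetical and chosen only for illustration; they do not come from the paper. Micro-averaging pools the contingency counts of all categories before computing precision and recall, whereas macro-averaging computes the measures per category and then averages them, so rare categories weigh as much as frequent ones.

```python
from typing import Dict, Tuple

# Per-category contingency counts (hypothetical layout):
# (true positives, false positives, false negatives).
Counts = Dict[str, Tuple[int, int, int]]


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def micro_average(counts: Counts) -> Tuple[float, float]:
    """Pool the counts over all categories, then compute precision and recall."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return _ratio(tp, tp + fp), _ratio(tp, tp + fn)


def macro_average(counts: Counts) -> Tuple[float, float]:
    """Compute precision and recall per category, then average over categories."""
    if not counts:
        return 0.0, 0.0
    precisions = [_ratio(tp, tp + fp) for tp, fp, _ in counts.values()]
    recalls = [_ratio(tp, tp + fn) for tp, _, fn in counts.values()]
    return sum(precisions) / len(counts), sum(recalls) / len(counts)


# Illustrative counts: one frequent category and one rare category.
example_counts: Counts = {"frequent": (90, 10, 10), "rare": (1, 3, 1)}
micro_p, micro_r = micro_average(example_counts)
macro_p, macro_r = macro_average(example_counts)
```

With these illustrative counts the micro-averaged precision is dominated by the frequent category, while the macro-averaged precision is pulled down by the poorly classified rare one, which is why the two averages can rank classifiers differently.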

Abstract

The automated categorization (or classification) of texts into predefined categories has attracted booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (which consists of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.

Paper Structure

This paper contains 50 sections, 17 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Rule-based classifier for the Wheat category; keywords are indicated in italic, categories are indicated in Small Caps (from Apte94).
  • Figure 2: A decision tree equivalent to the DNF rule of Figure 1. Edges are labelled by terms and leaves are labelled by categories (underlining denotes negation).
  • Figure 3: A comparison between the TC behaviour of (a) the Rocchio classifier, and (b) the $k$-NN classifier. Small crosses and circles denote positive and negative training instances, respectively. The big circles denote the "influence area" of the classifier. Note that, for ease of illustration, document similarities are here viewed in terms of Euclidean distance rather than, as more common, in terms of dot product or cosine.
  • Figure 4: Learning support vector classifiers. The small crosses and circles represent positive and negative training examples, respectively, whereas lines represent decision surfaces. Decision surface $\sigma_i$ (indicated by the thicker line) is, among those shown, the best possible one, as it is the middle element of the widest set of parallel decision surfaces (i.e. its minimum distance to any training example is maximum). Small boxes indicate the support vectors.