An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization

Sean Lester C. Benavides, Cid Antonio F. Masapol, Jonathan C. Morano, Dan Michael A. Cortez

TL;DR

The paper tackles the limitations of Jiang et al.'s compression-based text classification, which relies on gzip and Normalized Compression Distance (NCD) but is hampered by a fixed sliding window. It proposes unigram extraction and a union-based concatenation strategy, compressing unigram sets to compute $NCD = \frac{C_{xy}-\min(C_x,C_y)}{\max(C_x,C_y)}$, thereby focusing on word-level similarity. Across six datasets with varying lengths and label counts, the approach achieves an average accuracy gain of 5.73%, with peak improvements up to 11% on longer documents, demonstrating robustness and scalability. The method remains lightweight and hardware-efficient, making it suitable for resource-constrained environments and practical for news article categorization.
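As a rough illustration of the NCD formula above, the following Python sketch computes it with gzip over whole strings, i.e. the baseline setting rather than the unigram variant. The function names `compressed_len` and `ncd`, the UTF-8 encoding, and joining the two texts with a single space are assumptions made for this example, not details taken from the paper.

```python
import gzip


def compressed_len(text: str) -> int:
    """Return the size in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings.

    Implements NCD = (C_xy - min(C_x, C_y)) / max(C_x, C_y), where C_xy is
    the compressed size of the two texts joined by a space (an assumption).
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A lower value means the two texts share more compressible structure, so a text compared with itself should score lower than a text compared with an unrelated one.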

Abstract

This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents. The proposed improvements focus on unigram extraction and optimized concatenation, eliminating reliance on whole-document compression. By compressing extracted unigrams, the algorithm mitigates sliding window limitations inherent to gzip, improving compression efficiency and similarity detection. The optimized concatenation strategy replaces direct concatenation with the union of unigrams, reducing redundancy and enhancing the accuracy of Normalized Compression Distance (NCD) calculations. Experimental results across datasets of varying sizes and complexities demonstrate an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents. Notably, these improvements are more pronounced in datasets with high label diversity and complex text structures. The methodology achieves these results while maintaining computational efficiency, making it suitable for resource-constrained environments. This study provides a robust, scalable solution for text classification, emphasizing lightweight preprocessing techniques to achieve efficient compression, which in turn enables more accurate classification.
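The unigram extraction and union-based concatenation described in the abstract could look roughly like the Python sketch below. It is an interpretation under stated assumptions, not the authors' implementation: the tokenization rule, the sorted ordering of unigrams before compression, the helper names (`unigrams`, `compressed_len`, `unigram_ncd`, `nearest_label`), and the 1-nearest-neighbour decision rule are all hypothetical choices made for illustration.

```python
import gzip
import re


def unigrams(text: str) -> set[str]:
    """Unique lowercase word tokens (this tokenization rule is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compressed_len(tokens: set[str]) -> int:
    """gzip-compressed size of the tokens, sorted so the result is deterministic."""
    joined = " ".join(sorted(tokens))
    return len(gzip.compress(joined.encode("utf-8")))


def unigram_ncd(x: str, y: str) -> float:
    """NCD computed over unigram sets instead of whole documents."""
    ux, uy = unigrams(x), unigrams(y)
    cx, cy = compressed_len(ux), compressed_len(uy)
    # Union of unigrams replaces direct concatenation, so shared words
    # are counted once rather than repeated.
    cxy = compressed_len(ux | uy)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_label(query: str, labelled: list[tuple[str, str]]) -> str:
    """Label of the training text closest to the query (assumed 1-NN rule)."""
    return min(labelled, key=lambda pair: unigram_ncd(query, pair[0]))[1]
```

Because only the set of words is compressed, two texts containing the same words in a different order get a distance of exactly 0 in this sketch, which reflects the word-level focus of the approach.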

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Unigram Extraction
  • Figure 2: Unigram Compression
  • Figure 3: Unigram Concatenation
  • Figure 4: Enhanced Jiang, Z., et al.'s Compression-Based Classification Algorithm
  • Figure 5: News Article Categorization System Architecture