
A Classification System Approach in Predicting Chinese Censorship

Matt Prodani, Tianchu Ze, Yushen Hu

TL;DR

The paper predicts whether a Weibo post will be censored on the Chinese internet. It builds a classification system from four probability-based logistic regression models and two Transformer-based models, trained on Fu2021-derived data with Jieba tokenization and a 60/20/20 data split, and evaluated with macro-F1 and ROC-AUC. The Cosine Similarity-based approaches and the more sophisticated probability models perform reasonably well, but Fine-Tuned BERT (bert-base-chinese) achieves the highest ROC-AUC at 0.941, with the Scratch Transformer also performing strongly (0.893). The study demonstrates both the feasibility and the challenges of reverse-engineering censorship signals from social-media text. It highlights the trade-off between performance and computation, and outlines directions for robustness, real-time labeling, and dataset updates that capture evolving censorship dynamics.

Abstract

This paper uses classifiers to predict whether a Weibo post would be censored on the Chinese internet. Through randomized sampling from the Fu2021 data and Chinese tokenization strategies, we constructed a cleaned Chinese phrase dataset with binary censorship labels. Applying various probability-based information retrieval methods to the data, we derived four logistic regression models for classification. We also experimented with pre-trained transformers on the same classification task. After evaluating both macro-F1 and ROC-AUC, we concluded that the Fine-Tuned BERT model outperforms the other strategies.
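
The evaluation setup named above (a 60/20/20 split, macro-F1, and ROC-AUC on binary censorship labels) can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names are hypothetical, the three partitions are assumed to be train/validation/test, and the labels are assumed to be 1 for censored and 0 for not censored.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle and split into three parts at 60/20/20.

    Assumption: the parts serve as train/validation/test sets.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is
    fine for illustration but slow on large datasets.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Macro-F1 weights the censored and uncensored classes equally regardless of their frequency, while ROC-AUC scores the ranking of predicted probabilities without fixing a decision threshold, so the two metrics can disagree about which model is better.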

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Team Workflow Diagram
  • Figure 2:
  • Figure 3:
  • Figure 4:
  • Figure 5: