AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian

Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell'Orletta, Andrea Esuli

TL;DR

The paper shows that a relatively old LLM (Llama v1), fine-tuned on a modest Italian news corpus, can generate news-like Italian text that native readers struggle to identify as synthetic. It compares human detection with automatic detectors based on token likelihood and supervised classification: the automatic methods outperform humans but remain impractical in real-world settings, because they require either access to token likelihood information or a large dataset of texts generated by the content farm model. A proxy-model approach shows promise only when the generator and detector share the same base LLM, while ensemble methods offer limited gains. The work advocates model-identity tracking and watermarking as potential safeguards and underscores an urgent need for model-agnostic detection techniques. Overall, it provides a concrete Italian-language case study and public data to spur further research on trustworthy detection of synthetic text across languages.

Abstract

Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real "content farm". We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
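
The likelihood-based detectors named in the abstract rest on a general observation: text sampled from a model tends to receive a higher likelihood under that same model than human-written text does. The sketch below illustrates only that idea and is not the paper's implementation. A smoothed unigram model stands in for the LLM (an assumption made so the example runs without model weights), and the function names, the threshold, and the example sentences are all hypothetical. It also shows why such detectors need access to token likelihoods: the score cannot be computed without them.

```python
import math
from collections import Counter


def train_unigram(corpus, alpha=1.0):
    """Toy stand-in for an LLM (assumption): add-alpha smoothed unigram model.

    Returns a function mapping a token to its log-probability.
    """
    counts = Counter(tok for text in corpus for tok in text.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def token_logprob(token):
        return math.log((counts.get(token, 0) + alpha) / (total + alpha * vocab_size))

    return token_logprob


def mean_log_likelihood(text, token_logprob):
    """Average per-token log-likelihood of `text` under the scoring model."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    return sum(token_logprob(tok) for tok in tokens) / len(tokens)


def flag_synthetic(text, token_logprob, threshold):
    """Flag text as synthetic when the model finds it unusually likely.

    `threshold` is a hypothetical value that would have to be tuned on
    labeled human and synthetic texts.
    """
    return mean_log_likelihood(text, token_logprob) > threshold


def auc(synthetic_scores, human_scores):
    """Area under the ROC curve: the probability that a random synthetic text
    scores higher than a random human text (ties count as one half)."""
    pairs = [(s, h) for s in synthetic_scores for h in human_scores]
    wins = sum(1.0 if s > h else 0.5 if s == h else 0.0 for s, h in pairs)
    return wins / len(pairs)


if __name__ == "__main__":
    # Hypothetical miniature "training" corpus for the stand-in model.
    scorer = train_unigram([
        "il governo ha approvato la legge",
        "il governo ha annunciato la riforma",
    ])
    for sentence in ("il governo ha approvato la riforma", "ieri pioveva molto forte"):
        print(sentence, round(mean_log_likelihood(sentence, scorer), 3))
```

With a real LLM, `token_logprob` would be replaced by the model's own per-token log-probabilities, which is the access requirement the abstract describes as impractical in the wild.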

Paper Structure

This paper contains 43 sections, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Detecting synthetic Italian news text generated by fine-tuned Llama-65B: error rates for DetectGPT, native speakers of Italian and random guess.
  • Figure 2: Example: without fine-tuning on Italian, Llama is prone to switching to English.
  • Figure 3: ROC curves for DetectGPT and log-likelihood, in three panels: Llama 65B measured over 100 sentences from the CHANGE-it dataset (Italian); the same measure for Llama 65B after 20,000 fine-tuning steps on the CHANGE-it training set; and after 60,000 fine-tuning steps.
  • Figure 4: Accuracy of a classifier based on xlm-RoBERTa-large on the human/synthetic text classification task, for synthetic texts generated by three LLMs fine-tuned on CHANGE-it. The classifier was trained on 50% synthetic texts and either 50% CHANGE-it texts (in domain), or 25% texts from CHANGE-it and 25% from DICE (mixed source). Classification is only successful with at least 4K labeled samples, and the mixed-source scenario is consistently more challenging.
  • Figure 5: Accuracy of a classifier based on RoBERTa-large on the human/synthetic text classification task, for synthetic texts generated by three LLMs fine-tuned on CHANGE-it. The classifier was trained on 50% synthetic texts and either 50% CHANGE-it texts (in domain), or 25% texts from CHANGE-it and 25% from DICE (mixed source). Classification is only successful with at least 4K labeled samples, and the mixed-source scenario is consistently more challenging.
  • ...and 3 more figures