RoBERTa-Augmented Synthesis for Detecting Malicious API Requests

Udi Aharon, Revital Marbel, Ran Dubin, Amit Dvir, Chen Hajaj

TL;DR

The paper addresses the scarcity of labeled API traffic data for robust anomaly detection. It proposes a GAN-inspired, domain-aware augmentation framework that uses a RoBERTa masked language model to synthesize realistic API requests and a discriminator to validate them. Together these form a Datastore Generator, whose output feeds a downstream anomaly detector trained on the augmented corpus. The approach employs a token masking strategy with reserved-token filtering and an outlier token mechanism, and it improves semantic quality and detection performance on CSIC 2010 and ATRDF 2023, with F1 gains of up to 21.10%. It also shows efficiency advantages with a compact, task-specific RoBERTa model and yields consistent improvements across state-of-the-art HTTP anomaly detectors, highlighting the method’s practical potential for securing API traffic against emerging threats.

Abstract

Web applications and APIs face constant threats from malicious actors seeking to exploit vulnerabilities for illicit gains. To defend against these threats, it is essential to have anomaly detection systems that can identify a variety of malicious behaviors. However, a significant challenge in this area is the limited availability of training data. Existing datasets often do not provide sufficient coverage of the diverse API structures, parameter formats, and usage patterns encountered in real-world scenarios. As a result, models trained on these datasets often struggle to generalize and may fail to detect less common or emerging attack vectors. Improving detection accuracy and robustness requires larger and more representative datasets that capture the true variability of API traffic. To address this, we introduce a GAN-inspired learning framework that extends limited API traffic datasets through targeted, domain-aware synthesis. Drawing on techniques from Natural Language Processing (NLP), our approach leverages Transformer-based architectures, particularly RoBERTa, to enhance the contextual representation of API requests and generate realistic synthetic samples aligned with security-specific semantics. We evaluate our framework on two benchmark datasets, CSIC 2010 and ATRDF 2023, and compare it with a previous data augmentation technique to assess the importance of domain-specific synthesis. In addition, we apply our augmented data to various anomaly detection models to evaluate its impact on classification performance. Our method achieves up to a 4.94% increase in F1 score on CSIC 2010 and up to 21.10% on ATRDF 2023. The source code for this work is available at https://github.com/ArielCyber/GAN-API.

Paper Structure

This paper contains 22 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Architecture of the proposed system. The system consists of two main modules: Module 1 – Datastore Generator, and Module 2 – Anomaly Detection. In Module 1, the original API request dataset is processed through a masking pipeline that identifies reserved tokens (marked as "<IGN>") based on frequency thresholds and excludes them from substitution. The remaining tokens are evaluated for semantic fit using cosine similarity between word and sentence-level embeddings, and the outlier token is replaced with a "<MASK>" token. In the example shown, "insertar" is identified as the outlier and replaced, resulting in the masked input "get /pagar.jsp modo=<MASK>". A generator network predicts suitable replacements, and a discriminator network assigns a PL confidence score to the generated variant. This results in a synthetic request, such as "get /pagar.jsp modo=macreyno", which is included in the augmented dataset if the discriminator assigns it a high-confidence score. Module 2 takes both the original and generated datasets, tokenizes them using a byte-level BPE tokenizer (2.1), processes them through a RoBERTa-based masked language model (2.2), and finally applies a classifier (2.3) to detect anomalies in API requests.
  • Figure 2: Example of a structured HTTP API request extracted from the ATRDF 2023 dataset, showing the method indicating the operation performed, the path identifying the requested resource, and the headers providing metadata such as content type, client identity, and session details.
  • Figure 3: Impact of varying the confidence level used to compute the reserved token frequency threshold on classification performance. The plots show the F1 score improvement (after dataset extension) across different thresholds for the CSIC 2010 and ATRDF 2023 datasets. Higher confidence levels exclude more frequent structural tokens from replacement during generation.
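
The masking pipeline described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer, the document-frequency rule `min_doc_freq` (the paper derives its threshold from a confidence level, which is not reproduced here), the character-trigram `trigram_embed` function standing in for RoBERTa embeddings, and the `generate` and `score` callables standing in for the generator and discriminator networks are all hypothetical.

```python
import math
import re
from collections import Counter

MASK = "<MASK>"
DELIMS = re.compile(r"([\s=&?])")  # assumed delimiter set, kept as tokens


def tokenize(request):
    """Split a request so that ''.join(tokens) reproduces it (assumed scheme)."""
    return [t for t in DELIMS.split(request) if t]


def is_delim(token):
    return DELIMS.fullmatch(token) is not None


def reserved_tokens(corpus, min_doc_freq=0.5):
    """Assumption: a token is reserved if it occurs in at least
    min_doc_freq of all requests (a stand-in for the paper's threshold)."""
    doc_freq = Counter()
    for request in corpus:
        doc_freq.update({t for t in tokenize(request) if not is_delim(t)})
    return {t for t, n in doc_freq.items() if n / len(corpus) >= min_doc_freq}


def trigram_embed(text):
    """Toy stand-in for a learned embedding: sparse character-trigram counts."""
    padded = f"^{text}$"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mask_outlier(request, reserved, embed=trigram_embed):
    """Mask the non-reserved token least similar to the whole request.
    Returns (masked_request, outlier), or (None, None) if nothing is maskable."""
    tokens = tokenize(request)
    candidates = [i for i, t in enumerate(tokens)
                  if not is_delim(t) and t not in reserved]
    if not candidates:
        return None, None
    sentence = embed(request)
    idx = min(candidates, key=lambda i: cosine(embed(tokens[i]), sentence))
    return "".join(tokens[:idx] + [MASK] + tokens[idx + 1:]), tokens[idx]


def augment(corpus, reserved, generate, score, min_score=0.9):
    """Keep generated variants whose (hypothetical) discriminator score
    is at least min_score."""
    accepted = []
    for request in corpus:
        masked, _ = mask_outlier(request, reserved)
        if masked is None:
            continue
        for candidate in generate(masked):
            variant = masked.replace(MASK, candidate, 1)
            if score(variant) >= min_score:
                accepted.append(variant)
    return accepted


if __name__ == "__main__":
    corpus = [
        "get /pagar.jsp modo=insertar",
        "get /pagar.jsp modo=registro",
        "get /login.jsp user=ana",
    ]
    reserved = reserved_tokens(corpus, min_doc_freq=0.6)
    print(mask_outlier(corpus[0], reserved))
    # ('get /pagar.jsp modo=<MASK>', 'insertar')
```

In this toy corpus, "get", "/pagar.jsp", and "modo" each appear in at least 60% of requests and are therefore reserved, so "insertar" is the only maskable token in the first request, matching the example in the caption. In practice, `embed` would presumably be replaced by word and sentence embeddings from the language model, and `generate` and `score` by the trained generator and discriminator.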