Towards Better Understanding of Cybercrime: The Role of Fine-Tuned LLMs in Translation
Veronica Valeros, Anna Širokova, Carlos Catania, Sebastian Garcia
TL;DR
This paper tackles the challenge of translating Russian-language cybercrime chatter into English for timely cybersecurity insights, arguing that human translation is costly and machine translation often misses jargon and context. It proposes a fine-tuning workflow for a cloud LLM (GPT-3.5-turbo-0125) on a small, ground-truth dataset derived from the NoName057(16) hacktivist channel, including a structured fine-tuning prompt and vocabulary augmentation. The study combines human evaluation and automatic metrics (BLEU, METEOR, TER) to compare the fine-tuned model against baselines, finding that the fine-tuned model is generally preferred by human translators and yields improvements in several metrics, while also achieving substantial cost reductions relative to human translation. These results suggest that targeted fine-tuning can enable faster, cheaper, and more accurate cybercrime translations, facilitating real-time intelligence workflows, though challenges remain with platform restrictions and biases; future work emphasizes open-model fine-tuning and sharing to advance community collaboration.
Abstract
Understanding cybercrime communications is paramount for cybersecurity defence. This often involves translating communications into English for processing, interpreting, and generating timely intelligence. The problem is that translation is hard. Human translation is slow, expensive, and scarce. Machine translation is inaccurate and biased. We propose using fine-tuned Large Language Models (LLM) to generate translations that can accurately capture the nuances of cybercrime language. We apply our technique to public chats from the NoName057(16) Russian-speaking hacktivist group. Our results show that our fine-tuned LLM model is better, faster, more accurate, and able to capture nuances of the language. Our method shows it is possible to achieve high-fidelity translations and significantly reduce costs by a factor ranging from 430 to 23,000 compared to a human translator.
