RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
Andrei Vlad Man, Răzvan-Alexandru Smădu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
TL;DR
RoD-TAL introduces a Romanian driving-law benchmark pairing a law-grounded RoD-Law corpus with the RoD-QA multimodal dataset to evaluate information retrieval, question answering, and vision-language tasks in a low-resource language. The paper demonstrates that domain-specific retrievers and reasoning-enhanced prompting substantially improve retrieval and QA performance, with notable gains from retrieval-augmented generation and chain-of-thought prompting. Across the IR, QA, VIR, and VQA tasks, results reveal both the potential and the limitations of current LLMs/VLMs in legally grounded education, including hallucination risks and prompt sensitivity. The work provides a comprehensive analysis, public code, and prompts to support Romanian legal-education tooling, and it motivates further research on robust multilingual, legally grounded reasoning.
Abstract
The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, both text-based and image-based, along with annotated legal references and explanations written by human experts. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across four tasks: Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance, while chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum passing grades required for driving exams. We highlight the potential and limitations of applying LLMs and VLMs to legal education. We release the code and resources through the GitHub repository.
