Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system
Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, Eiji Aramaki
TL;DR
This work assesses the applicability of HealthBench to the Japanese medical context by translating its 5,000 scenarios and benchmarking GPT-4.1 and a Japanese-native model, LLM-jp-3.1, while using an LLM-as-a-Judge to identify contextual gaps and localization needs. The results show a modest performance drop for GPT-4.1 after translation and a pronounced failure of the Japanese-native model due to incompleteness and safety gaps, with rubric misalignment emerging as a key issue. The study finds that while most conversation scenarios are transferable, a large majority of rubric criteria require localization to reflect Japan’s clinical guidelines and cultural norms, so direct translation alone is inadequate. The authors propose a Japan-specific adaptation, J-HealthBench, to guide the development of contextualized benchmarks that evaluate medical LLMs in Japan reliably and safely.
Abstract
This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. Although robust evaluation frameworks are essential for the safe development of medical LLMs, resources in Japanese are scarce and often consist of translated multiple-choice questions. Our research addresses this issue in two ways. First, we establish a performance baseline by applying a machine-translated version of HealthBench's 5,000 scenarios to evaluate two models: a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we use an LLM-as-a-Judge approach to systematically classify the benchmark's scenarios and rubric criteria. This allows us to identify "contextual gaps" where the content is misaligned with Japan's clinical guidelines, healthcare system, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches, as well as a significant failure of the Japanese-native model, whose responses lacked the required clinical completeness. Furthermore, our classification shows that, although most scenarios are applicable, a significant proportion of the rubric criteria require localisation. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localised adaptation, a "J-HealthBench", to ensure the reliable and safe evaluation of medical LLMs in Japan.
