POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Tingyue Yang, Junchi Yao, Yuhui Guo, Chang Liu

TL;DR

POLIS-Bench addresses the need for reliable evaluation of LLMs in bilingual governmental policy tasks by introducing an up-to-date bilingual corpus, scenario-grounded tasks (Clause Retrieval & Interpretation, Solution Generation, Compliance Judgment), and a dual-metric framework combining semantic similarity with task correctness. The authors demonstrate a scalable evaluation across 10+ models, reveal a performance hierarchy favoring reasoning models, and show that lightweight open-source models can match or exceed strong proprietary baselines through LoRA fine-tuning. This work delivers a cost-efficient, compliant path for real-world government deployment and provides a dynamic, auditable benchmark with continuous data updates. Future work focuses on expanding jurisdictional and multilingual coverage and strengthening governance around corpus updates and evaluation rubrics.

Abstract

We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.
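
As a rough illustration of the dual-metric idea described in the abstract, the sketch below reports a mean semantic-similarity score alongside an accuracy rate for a batch of model answers. It is not the authors' implementation: the bag-of-words cosine similarity is a stand-in assumption for whatever embedding-based similarity the paper uses, the per-item correctness flags are assumed to come from an external judge, and the names `cosine_similarity` and `dual_metric` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: a simple lexical stand-in for an embedding-based
    semantic similarity; the paper's actual model is not specified here.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dual_metric(
    predictions: Sequence[str],
    references: Sequence[str],
    is_correct: Sequence[bool],
) -> Tuple[float, float]:
    """Return (mean semantic similarity, accuracy rate) for one task.

    `is_correct` holds one judged flag per item (hypothetical input,
    e.g. produced by a human or model-based grader).
    """
    if not predictions or not (len(predictions) == len(references) == len(is_correct)):
        raise ValueError("inputs must be non-empty and of equal length")
    sims = [cosine_similarity(p, r) for p, r in zip(predictions, references)]
    mean_similarity = sum(sims) / len(sims)
    accuracy_rate = sum(is_correct) / len(is_correct)
    return mean_similarity, accuracy_rate
```

Reporting the two numbers separately keeps content alignment (how close the answer is to the reference text) distinct from task adherence (whether the judged outcome is right), so a fluent but non-compliant answer cannot hide behind a high similarity score.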

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of accuracy rate across 15 state-of-the-art models and POLIS series models.
  • Figure 2: Overview of our POLIS-Bench. (i) The diagram above illustrates the construction pipeline of POLIS-Bench. (ii) The diagram below illustrates the three key characteristics of POLIS-Bench.
  • Figure 3: Accuracy Rate Distribution. This chart illustrates the accuracy rate distribution for the average performance of Reasoning Models (Reasoning Average), Chat Models (Chat Average), and the overall mean (Overall Average) across the three bilingual policy tasks on POLIS-Bench.
  • Figure 4: Task-oriented performance of base and POLIS-tuned models. Bar plots compare four models (DeepSeek-R1-Distill-Llama-8B, Qwen3-8B, POLIS-DeepSeek-R1-Distill-Llama-8B, POLIS-Qwen3-8B) on three tasks: Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment. The top row reports Semantic Similarity, and the bottom row reports Accuracy Rate; higher values indicate better performance. Numbers atop bars are the corresponding scores, obtained under the unified evaluation pipeline.
  • Figure 5: Case Study on Compliance Judgment Task
  • ...and 1 more figure