Generating and Evaluating Sustainable Procurement Criteria for the Swiss Public Sector using In-Context Prompting with Large Language Models

Yingqiang Gao; Veton Matoshi; Luca Rolshoven; Tilia Ellendorff; Judith Binder; Jeremy Austin Jann; Gerold Schneider; Matthias Stürmer

Abstract

Public procurement refers to the process by which public sector institutions, such as governments, municipalities, and publicly funded bodies, acquire goods and services. Swiss law requires the integration of ecological, social, and economic sustainability requirements into tender evaluations in the form of criteria that bidders must fulfill. However, translating high-level sustainability regulations into concrete, verifiable, and sector-specific procurement criteria (such as selection criteria, award criteria, and technical specifications) remains a labor-intensive and error-prone manual task that requires substantial domain expertise across several groups of goods and services. This paper presents a configurable, LLM-assisted software pipeline that supports the systematic generation and evaluation of sustainability-oriented procurement criteria catalogs for Switzerland. The system integrates in-context prompting, interchangeable LLM backends, and automated output validation to enable auditable criteria generation across different procurement sectors. As a proof of concept, we instantiate the pipeline using official sustainability guidelines published by the Swiss government and the European Commission, which are ingested as structured reference documents. We evaluate the system through a combination of automated quality checks, including an LLM-based evaluation component, and expert comparison against a manually curated gold standard. Our results demonstrate that the proposed pipeline can substantially reduce manual drafting effort while producing criteria catalogs that are consistent with official guidelines. We further discuss system limitations, failure modes, and design trade-offs observed during deployment, highlighting key considerations for integrating generative AI into public sector software workflows.

Paper Structure (23 sections, 1 equation, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Methodology
Data Collection
In-Context Learning
System Design
Context Retrieval Module
In-Context Prompter
XLSX Generator
Quality Assessment
Qualitative Evaluation
Evaluation Setup
Inter-Annotator Agreement
Findings
Automatic Evaluation
...and 8 more sections

Figures (5)

Figure 1: End-to-end pipeline of SwissSPC for sustainable procurement criteria generation and evaluation. SwissSPC integrates API configuration and model selection, prompt templating with optional document references, and XLSX-based criteria generation, followed by LLM-as-a-Judge evaluation to assess generation quality along multiple dimensions.
Figure 2: A snippet of the structured prompt used in SwissSPC (translated from German to English) for generating sustainable procurement criteria. The prompt is decomposed into semantically distinct components (Background, Problem, Task, Requirement, Inputs, etc.) that explicitly encode sector context, motivation, generation objectives, and output constraints. The full prompt contains 875 words.
Figure 3: SwissSPC user interface for AI-assisted generation of Sustainable Procurement Criteria. The workflow guides users through (1) generation settings such as model and sector selection, (2) in-context prompting via configurable prompt templates, and (3) optional upload of credited reference PDFs, producing structured, sector-specific catalogs exported as XLSX files.
Figure 4: LLM-as-a-Judge evaluation scores. We report the better-performing score obtained with either EU-GPP or Toolbox as the reference when generating the outputs. Abbreviations: PR (Precision & Relevance), CC (Completeness & Coverage), DS (Differentiation & Scalability), LF (Language & Formality).
Figure 5: Prompt used in SwissSPC for automated quality assurance of generated sustainable procurement criteria. The prompt instructs the model to evaluate individual criteria along four structured dimensions.
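
The prompt decomposition described in Figure 2 and the four evaluation dimensions listed in Figure 4 can be illustrated with a minimal Python sketch. This is not the SwissSPC implementation: the function names `build_prompt` and `label_scores` are hypothetical, only the component names and dimension abbreviations are taken from the captions, and the additional prompt components covered by "etc." as well as the score scale are left unspecified here.

```python
# Component names as listed in the Figure 2 caption. The caption ends with
# "etc.", so the real prompt has further components that are not modeled here.
PROMPT_COMPONENTS = ("Background", "Problem", "Task", "Requirement", "Inputs")

# Evaluation dimensions and abbreviations as listed in the Figure 4 caption.
JUDGE_DIMENSIONS = {
    "PR": "Precision & Relevance",
    "CC": "Completeness & Coverage",
    "DS": "Differentiation & Scalability",
    "LF": "Language & Formality",
}


def build_prompt(components: dict[str, str]) -> str:
    """Join the named components into one prompt, in a fixed order.

    Hypothetical helper: raises ValueError if a listed component is missing.
    """
    missing = [name for name in PROMPT_COMPONENTS if name not in components]
    if missing:
        raise ValueError(f"missing prompt components: {missing}")
    sections = [f"{name}:\n{components[name].strip()}" for name in PROMPT_COMPONENTS]
    return "\n\n".join(sections)


def label_scores(scores: dict[str, float]) -> dict[str, float]:
    """Check that a judge result covers exactly the four dimensions.

    Hypothetical helper: returns the scores keyed by full dimension name.
    No score range is enforced because the captions do not state one.
    """
    if set(scores) != set(JUDGE_DIMENSIONS):
        raise ValueError(f"expected scores for {sorted(JUDGE_DIMENSIONS)}")
    return {JUDGE_DIMENSIONS[abbr]: scores[abbr] for abbr in JUDGE_DIMENSIONS}
```

Keeping the component order fixed makes generated prompts reproducible across sectors, and rejecting incomplete judge outputs early is one simple form of the automated output validation mentioned in the abstract.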
