
Exploring Semantic Labeling Strategies for Third-Party Cybersecurity Risk Assessment Questionnaires

Ali Nour Eldin, Mohamed Sellami, Walid Gaaloul

TL;DR

This paper tackles the challenge of scaling TPRA questionnaire customization by introducing a semi-supervised semantic labeling framework (SSSL) that assigns interpretable labels to compliance questions, capturing both control domains and assessment scope. The approach clusters questions using embedding-based similarity, labels clusters with a targeted LLM call, and propagates labels to individual questions via kNN, enabling fast, label-space retrieval without repeated LLM inference. Experiments on CAIQ and synthetic datasets show that semantic labels improve retrieval alignment over pure semantic similarity, while SSSL substantially reduces labeling cost and runtime, with near-zero-cost label assignment during prediction. Cross-standard transfer highlights some limitations but also demonstrates practical value for scalable TPRA operations, suggesting further refinements like label-level grouping to enhance cross-domain robustness.

Abstract

Third-Party Risk Assessment (TPRA) is a core cybersecurity practice for evaluating suppliers against standards such as ISO/IEC 27001 and NIST. TPRA questionnaires are typically drawn from large repositories of security and compliance questions, yet tailoring assessments to organizational needs remains a largely manual process. Existing retrieval approaches rely on keyword or surface-level similarity, which often fails to capture implicit assessment scope and control semantics. This paper explores strategies for organizing and retrieving TPRA cybersecurity questions using semantic labels that describe both control domains and assessment scope. We compare direct question-level labeling with a Large Language Model (LLM) against a hybrid semi-supervised semantic labeling (SSSL) pipeline that clusters questions in embedding space, labels a small representative subset using an LLM, and propagates labels to remaining questions using k-Nearest Neighbors; we also compare downstream retrieval based on direct question similarity versus retrieval in the label space. We find that semantic labels can improve retrieval alignment when labels are discriminative and consistent, and that SSSL can generalize labels from a small labeled subset to large repositories while substantially reducing LLM usage and cost.
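
The cluster, label, and propagate flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-dimensional embeddings, the precomputed cluster assignments, the `stub_llm_label` function (a keyword rule standing in for an LLM call), and the label names are all hypothetical, chosen only to show how labeling one representative per cluster and propagating with k-Nearest Neighbors keeps the number of LLM calls small.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pick_representatives(embeddings, clusters):
    """For each cluster, return the index of the member closest to the centroid."""
    reps = {}
    for cid, members in clusters.items():
        c = centroid([embeddings[i] for i in members])
        reps[cid] = max(members, key=lambda i: cosine(embeddings[i], c))
    return reps

def stub_llm_label(question):
    """Hypothetical stand-in for an LLM call; a real pipeline would prompt a model."""
    return "cryptography" if "encrypt" in question.lower() else "access control"

def knn_propagate(embeddings, labeled, k=1):
    """Give each unlabeled item the majority label of its k most similar labeled items."""
    out = dict(labeled)
    for i, vec in enumerate(embeddings):
        if i in labeled:
            continue
        neighbours = sorted(
            labeled, key=lambda j: cosine(vec, embeddings[j]), reverse=True
        )[:k]
        votes = Counter(labeled[j] for j in neighbours)
        out[i] = votes.most_common(1)[0][0]
    return out

# Toy repository: questions, assumed embeddings, and assumed cluster assignments.
questions = [
    "Is customer data encrypted at rest?",
    "Are encryption keys rotated regularly?",
    "Is multi-factor authentication enforced for administrators?",
    "Are user access reviews performed quarterly?",
]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
clusters = {0: [0, 1], 1: [2, 3]}

# One stubbed "LLM" call per cluster representative, then kNN for everything else.
reps = pick_representatives(embeddings, clusters)
labeled = {idx: stub_llm_label(questions[idx]) for idx in reps.values()}
labels = knn_propagate(embeddings, labeled, k=1)

for i, q in enumerate(questions):
    print(f"{labels[i]:>15}  {q}")
```

In this toy run only two of the four questions are passed to the labeling function; the other two receive labels purely from embedding similarity, which is the cost-saving idea the abstract attributes to SSSL.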

Paper Structure

This paper contains 22 sections, 4 equations, 2 figures, 3 tables, 3 algorithms.

Figures (2)

  • Figure 1: Proposed SSSL pipeline.
  • Figure 2: Structured prompt template used for cluster-level label extraction.