Vector Symbolic Open Source Information Discovery

Cai Davies, Sam Meek, Philip Hawkins, Benomy Tutcher, Graham Bent, Alun Preece

TL;DR

The paper addresses rapid OSINF data discovery in DDIL CJIIM settings by marrying transformer-based semantic embeddings with Vector Symbolic Architectures to produce compact, schema-agnostic representations. It demonstrates an end-to-end OSINF proof-of-concept portal that maps tweet content and metadata into 1k-bit VSA vectors, enabling fast, bandwidth-efficient matching with FAISS indexing. Results show strong semantic matching, especially with multi-vector representations, and reliable location/language recall, while highlighting trade-offs between vector size and accuracy for single-vector encodings. The work lowers human, computational, and communication burdens for cross-domain data discovery and has potential applications across healthcare, education, and business beyond defence.

Abstract

Combined, joint, intra-governmental, inter-agency and multinational (CJIIM) operations require rapid data sharing without the bottlenecks of metadata curation and alignment. Curation and alignment are particularly infeasible for external open source information (OSINF), e.g., social media, which has become increasingly valuable in understanding unfolding situations. Large language models (transformers) facilitate semantic data and metadata alignment but are inefficient in CJIIM settings characterised as denied, degraded, intermittent and low bandwidth (DDIL). Vector symbolic architectures (VSA) support semantic information processing using highly compact binary vectors, typically 1-10k bits, which makes them suitable for DDIL settings. We demonstrate a novel integration of transformer models with VSA, combining the power of the former for semantic matching with the compactness and representational structure of the latter. The approach is illustrated via a proof-of-concept OSINF data discovery portal that allows partners in a CJIIM operation to share data sources with minimal metadata curation and low communications bandwidth. This work was carried out as a bridge between previous low technology readiness level (TRL) research and future higher-TRL technology demonstration and deployment.
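
To make the idea of compact binary VSA vectors concrete, the following is a minimal sketch of one common binary VSA scheme (XOR binding and majority-vote bundling, as in Binary Spatter Codes). It is not the authors' implementation: the paper derives its vectors from transformer embeddings and searches them with FAISS, whereas this sketch uses random vectors and a direct Hamming comparison. The 1024-bit width, the role names and the filler names are all assumptions chosen for illustration.

```python
import random

DIM = 1024  # assumed vector width in bits; the abstract cites 1-10k bits as typical


def random_vector(rng):
    """Draw a random DIM-bit vector, stored as a Python int."""
    return rng.getrandbits(DIM)


def bind(a, b):
    """Bind two vectors with bitwise XOR (self-inverse: bind(bind(a, b), b) == a)."""
    return a ^ b


def bundle(vectors):
    """Superpose vectors by per-bit majority vote (ties resolve to 0)."""
    result = 0
    threshold = len(vectors) / 2
    for bit in range(DIM):
        ones = sum((v >> bit) & 1 for v in vectors)
        if ones > threshold:
            result |= 1 << bit
    return result


def similarity(a, b):
    """Return 1 minus the normalised Hamming distance (1.0 identical, about 0.5 unrelated)."""
    return 1.0 - bin(a ^ b).count("1") / DIM


rng = random.Random(0)

# Hypothetical role vectors (metadata fields) and filler vectors (field values).
ROLE_LANGUAGE, ROLE_LOCATION, ROLE_TOPIC = (random_vector(rng) for _ in range(3))
english, cardiff, flooding, paris = (random_vector(rng) for _ in range(4))

# One record: three role-filler pairs superposed into a single DIM-bit vector.
record = bundle([
    bind(ROLE_LANGUAGE, english),
    bind(ROLE_LOCATION, cardiff),
    bind(ROLE_TOPIC, flooding),
])

# Unbinding with a role yields a noisy copy of the filler stored under that role.
probe = bind(record, ROLE_LOCATION)
sim_cardiff = similarity(probe, cardiff)  # noticeably above chance
sim_paris = similarity(probe, paris)      # close to chance (about 0.5)
```

The whole record occupies 128 bytes regardless of how many fields are bundled into it, which is the property that makes this kind of representation attractive on low-bandwidth links. The cost is noise: each additional bundled pair lowers the similarity recovered on unbinding, which is in line with the size-versus-accuracy trade-off the TL;DR notes for single-vector encodings.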

Paper Structure (16 sections, 5 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Illustration of DAIS 'distributed brain' CJIIM concept (adapted from https://dais-legacy.org/1a11).
  • Figure 2: Overview of proof-of-concept (Twitter) approach.
  • Figure 3: Minimal metadata model (UML schema).
  • Figure 4: Method for representing tweets as vectors.
  • Figure 5: Proof-of-concept app: query-by-example.
  • ...and 1 more figure