Logic Mill -- A Knowledge Navigation System
Sebastian Erhardt, Mainak Ghosh, Erik Buunk, Michael E. Rose, Dietmar Harhoff
TL;DR
Logic Mill is a scalable, open-access knowledge navigation system that encodes large science and patent corpora into dense embeddings to enable fast cross-domain retrieval and similarity analysis. It centers on the SPECTER encoder (built on SciBERT), which produces 768-dimensional vectors from title and abstract inputs within a 512-token limit. These embeddings are stored in ElasticSearch, which uses HNSW-based approximate nearest-neighbor (ANN) search to answer queries within milliseconds. The architecture is implemented as microservices with a Go backend and a GraphQL API. It supports continuous ingestion from Semantic Scholar, EPO, USPTO, and WIPO, as well as user-supplied documents for cross-domain linking and exploration. The system is designed for literature exploration, prior-art searches, and cross-domain knowledge tracing. Future plans include broadening the corpora (e.g., Wikipedia) and the set of encoders to better support research workflows and knowledge transfer across domains.
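To make the retrieval idea concrete, the following is a minimal sketch of nearest-neighbor search over 768-dimensional document embeddings. It is not the system's implementation: it swaps the HNSW-based approximate search for an exact brute-force scan, assumes cosine similarity as the ranking score, and uses hypothetical document IDs and hand-built toy vectors in place of real SPECTER embeddings.

```python
import math
from typing import Dict, List, Tuple

EMBEDDING_DIM = 768  # embedding size stated in the summary above


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: List[float], corpus: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Exact scan: score every document, return the k highest-scoring ones."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_vector(hot_indices: List[int]) -> List[float]:
    """Hypothetical stand-in for an encoder output: 1.0 at the given indices."""
    vec = [0.0] * EMBEDDING_DIM
    for i in hot_indices:
        vec[i] = 1.0
    return vec


# Hypothetical mixed corpus of publications and patents.
corpus = {
    "paper-A": toy_vector([0, 1, 2]),
    "patent-B": toy_vector([0, 1, 3]),
    "paper-C": toy_vector([10, 11, 12]),
}

query = toy_vector([0, 1, 2])
results = top_k_similar(query, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

An exact scan like this grows linearly with corpus size, which is why an approximate index such as HNSW becomes attractive at the scale of hundreds of millions of documents.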
Abstract
Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced Natural Language Processing (NLP) techniques to generate numerical representations of documents; currently, these representations are produced by a large pre-trained language model. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface. Moreover, it is continuously updated and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other domains.
