TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets
David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder
TL;DR
TakeLab Retriever addresses the challenge of biased and incomplete access to Croatian news by providing a Croatian-focused semantic search engine that integrates an end-to-end NLP pipeline with a microservice architecture. It combines a custom scraper (scheduler, downloader, extractor) with an NLP stack (Core, NER, NEL, low-quality detector, IPTC topic classifier) and a web application to enable complex queries and rich visualizations across a growing archive of more than ten million articles from 33 outlets as of November 2024. Key contributions include exact SimHash-based deduplication, Wikidata-driven entity linking, IPTC-based multi-label topic classification via Omikuji, and a GPU-accelerated NLP pipeline that supports reindexing. The system is designed for scalable, unbiased, and precise retrieval in social science research: it offers capabilities beyond those of general-purpose search engines and supports the study of trends, patterns, and correlations in Croatian online news content. Ongoing work aims to enhance user experience, extend NLP capabilities (e.g., sentiment analysis, keyphrase extraction), and broaden outlet coverage.
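To make the deduplication step more concrete, the following is a minimal sketch of SimHash fingerprinting with exact fingerprint matching. It is not the authors' implementation: the tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint size, and the function names (tokenize, simhash, hamming_distance, deduplicate) are all assumptions made for illustration. "Exact" is read here as treating two articles as duplicates only when their fingerprints are identical.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed fingerprint size, not taken from the report


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on non-word characters.
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=FINGERPRINT_BITS):
    # Each token votes +1 or -1 on every bit position of the fingerprint.
    weights = [0] * bits
    for token in tokenize(text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when its positive votes win.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")


def deduplicate(articles):
    # Keep the first article seen for each distinct fingerprint.
    seen = set()
    unique = []
    for article in articles:
        fingerprint = simhash(article)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(article)
    return unique
```

Under these assumptions, two copies of an article that differ only in casing, spacing, or punctuation receive the same fingerprint and collapse into one entry. A near-duplicate variant would instead compare fingerprints with hamming_distance against a small threshold.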
Abstract
TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab Retriever uses cutting-edge natural language processing (NLP) methods, enabling users of the web application to sift through articles by named entities, phrases, and topics. This technical report is divided into two parts: the first explains how TakeLab Retriever is used, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.
