Intent Classification on Low-Resource Languages with Query Similarity Search
Arjun Bhalla, Qi Huang
TL;DR
The paper tackles intent classification in information retrieval under data scarcity, with a focus on low-resource languages. It reframes the task as query similarity search: queries with known intents are embedded and indexed, and a new query is labeled by a vote among its nearest neighbors in the embedding space, retrieved with approximate nearest neighbor search using FAISS. Across English, multilingual, and low-resource experiments, the approach proves feasible and performs competitively without domain-specific fine-tuning or translation services, though fully supervised baselines can still outperform it in some settings. The work highlights practical advantages for rapid deployment and scalability to new languages, and outlines future work on data balance and potential encoder fine-tuning.
Abstract
Intent classification is an important component of a functional Information Retrieval ecosystem. Many current approaches to intent classification, typically framed as a classification problem, can be problematic because intents are often hard to define, which makes data difficult and expensive to annotate. The problem is exacerbated when we need to extend the intent classification system to support multiple languages, in particular low-resource ones. To address this, we propose casting intent classification as a query similarity search problem: we use previous example queries to define an intent, and a query similarity method to classify an incoming query based on the labels of its most similar queries in latent space. With the proposed approach, we are able to achieve reasonable intent classification performance for queries in low-resource languages in a zero-shot setting.
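The idea of classifying a query by the labels of its most similar indexed queries can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: `embed` is a toy character-trigram encoder standing in for a multilingual sentence encoder, the search is exhaustive rather than the approximate FAISS search the paper describes, and the function names, the example queries, and the choice of `k` with a simple majority vote are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption, not the paper's model):
    an L2-normalised sparse vector of character-trigram counts."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: c / norm for gram, c in counts.items()}


def similarity(a, b):
    """Dot product of two normalised sparse vectors, i.e. cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(gram, 0.0) for gram, weight in a.items())


def build_index(labelled_queries):
    """Embed each (query, intent) example once so it can be searched later."""
    return [(embed(query), intent) for query, intent in labelled_queries]


def classify(query, index, k=3):
    """Label a query by majority vote over its k most similar indexed queries.
    Exhaustive search is used here; an ANN library would replace this step."""
    q = embed(query)
    neighbours = sorted(
        index, key=lambda entry: similarity(q, entry[0]), reverse=True
    )[:k]
    votes = Counter(intent for _, intent in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical example queries; an intent is defined only by its examples.
index = build_index([
    ("what is the weather in paris", "weather"),
    ("weather forecast for tomorrow", "weather"),
    ("will it rain this weekend", "weather"),
    ("play some jazz music", "music"),
    ("play the latest album by adele", "music"),
    ("put on a relaxing playlist", "music"),
])

print(classify("weather in london tomorrow", index))
print(classify("play relaxing jazz", index))
```

In this sketch, adding a new intent or a new language only requires indexing more labelled example queries, with no classifier retraining. Whether that carries over to a real system depends on the encoder placing similar queries from different languages close together in the embedding space.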
