ConFit: Improving Resume-Job Matching using Data Augmentation and Contrastive Learning
Xiao Yu, Jinzhong Zhang, Zhou Yu
TL;DR
ConFit tackles the sparsity of resume-job interaction data by combining data augmentation with contrastive learning to produce dense resume and job embeddings ($E_\theta$) that support fast inner-product matching, $s_\theta(R,J)=E_\theta(R)^\top E_\theta(J)$. Data augmentation via EDA and ChatGPT paraphrasing increases the number of labeled pairs, while contrastive learning with in-batch and hard negatives scales supervision from $B$ pairs to $O(B^2)$ pairs per batch, improving embedding quality. On two real-world datasets, ConFit outperforms strong baselines, including BM25 and OpenAI text-ada-002, in most ranking tasks, with notable gains in MAP and nDCG@10 for both ranking resumes and ranking jobs. Because matching reduces to an inner product between embeddings, the approach is retrieval-friendly and scales to large candidate pools, and it can be extended with debiasing and preference-aware components for real-world deployment.
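To make the inner-product scoring concrete, the following is a minimal sketch in plain Python. The embeddings are made-up toy vectors standing in for the encoder outputs, and the names `score` and `rank_jobs` are hypothetical helpers for illustration, not part of the ConFit code.

```python
from typing import Dict, List, Sequence, Tuple


def score(resume_emb: Sequence[float], job_emb: Sequence[float]) -> float:
    """Inner product of a resume embedding and a job embedding."""
    if len(resume_emb) != len(job_emb):
        raise ValueError("embeddings must have the same dimension")
    return sum(r * j for r, j in zip(resume_emb, job_emb))


def rank_jobs(
    resume_emb: Sequence[float], job_embs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (job_id, score) pairs sorted from best to worst match."""
    scored = [(job_id, score(resume_emb, emb)) for job_id, emb in job_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings (assumed values, for illustration only).
resume = [1.0, 0.0, 2.0]
jobs = {
    "job_a": [0.5, 1.0, 0.0],
    "job_b": [1.0, 0.0, 1.0],
    "job_c": [0.0, 2.0, 0.5],
}
ranking = rank_jobs(resume, jobs)
```

Since the score is a plain inner product, job embeddings can be computed once and reused for every query resume. Ranking resumes for a given job works the same way with the roles swapped.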
Abstract
A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes, and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction records in resume-job datasets are sparse. Unlike much prior work that uses complex modeling techniques, we tackle this sparsity problem using data augmentation and a simple contrastive learning approach. ConFit first creates an augmented resume-job dataset by paraphrasing specific sections in a resume or a job post. Then, ConFit uses contrastive learning to further increase training samples from $B$ pairs per batch to $O(B^2)$ per batch. We evaluate ConFit on two real-world datasets and find that it outperforms prior methods (including BM25 and OpenAI text-ada-002) by up to 19% and 31% absolute in nDCG@10 for ranking jobs and ranking resumes, respectively.
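The step from $B$ pairs to $O(B^2)$ pairs per batch can be illustrated with a generic in-batch contrastive objective. The sketch below is an assumption-laden illustration, not the paper's exact loss: it treats resume $i$ and job $i$ as the positive pair, uses every other job in the batch as a negative, scores only the resume-to-job direction, and omits hard negatives. The function name `in_batch_contrastive_loss` is hypothetical.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(
    resume_embs: Sequence[Sequence[float]],
    job_embs: Sequence[Sequence[float]],
) -> float:
    """Mean softmax cross-entropy over the rows of the B x B score matrix.

    Row i holds the scores of resume i against all B jobs in the batch;
    job i is the positive and the other B - 1 jobs act as negatives.
    """
    batch_size = len(resume_embs)
    if batch_size == 0 or batch_size != len(job_embs):
        raise ValueError("need the same non-zero number of resumes and jobs")
    total = 0.0
    for i in range(batch_size):
        scores = [dot(resume_embs[i], job_embs[j]) for j in range(batch_size)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[i]
    return total / batch_size


# Toy batch with B = 2 (assumed values, for illustration only).
resumes = [[4.0, 0.0], [0.0, 4.0]]
jobs_aligned = [[1.0, 0.0], [0.0, 1.0]]   # job i matches resume i
jobs_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # positives point at the wrong job

loss_aligned = in_batch_contrastive_loss(resumes, jobs_aligned)
loss_shuffled = in_batch_contrastive_loss(resumes, jobs_shuffled)
```

Each batch of $B$ labeled pairs yields $B^2$ scored resume-job combinations, which is where the extra supervision comes from. In this toy batch the loss is small when each resume scores highest against its own job and large when the pairing is shuffled.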
