A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining
Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda
TL;DR
The paper demonstrates that crowdsourcing can effectively harvest Japanese-Chinese parallel data from the web, yielding 4.6M sentence pairs from more than 10,000 URL pairs, along with a high-quality 1.2M-pair subset used to train a parallel-corpus filter. Although the crowdsourced corpus is only about one-third the size of CCMatrix, a model trained on it achieves translation accuracy comparable to a CCMatrix-trained model across multiple test sets, and in some cases higher accuracy for Japanese-to-Chinese. The authors analyze the web-mining pipeline, show that crowdsourced sites yield higher parallel-sentence extraction efficiency than Common Crawl, and validate the approach with extensive MT experiments using a Transformer model. They propose future enhancements, including integrating CC data for diversity, content filtering, and continued refinement of dictionary-based alignment, to further improve parallel sentence quality and MT performance.
Abstract
Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of a model trained on these 4.6M sentence pairs with that of a model trained on the Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus obtained by global web mining. Although our corpus is only about one-third the size of CCMatrix, the accuracy of the two models was comparable, which confirms that crowdsourcing is a feasible approach to web mining of parallel data.
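
To make the idea of dictionary-based sentence alignment more concrete, the following is a minimal sketch of one way a bilingual dictionary could be used to score a candidate sentence pair. It is not the authors' alignment method: the function name `dictionary_overlap`, the toy dictionary, and the pre-tokenized example sentences are all assumptions made for illustration, and a real system would also need word segmentation and a far larger dictionary.

```python
from typing import Dict, List, Set


def dictionary_overlap(
    src_tokens: List[str],
    tgt_tokens: List[str],
    bilingual_dict: Dict[str, Set[str]],
) -> float:
    """Return the fraction of source tokens that have at least one
    dictionary translation present in the target sentence.

    Hypothetical scoring function for illustration only.
    """
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(
        1 for tok in src_tokens if bilingual_dict.get(tok, set()) & tgt_set
    )
    return hits / len(src_tokens)


# Toy Japanese-to-Chinese dictionary (assumed entries, not from the paper).
toy_dict: Dict[str, Set[str]] = {
    "私": {"我"},
    "猫": {"猫"},
    "好き": {"喜欢"},
}

ja = ["私", "は", "猫", "が", "好き"]        # pre-tokenized Japanese
zh_good = ["我", "喜欢", "猫"]               # plausible translation
zh_bad = ["今天", "天气", "很", "好"]        # unrelated sentence

score_good = dictionary_overlap(ja, zh_good, toy_dict)
score_bad = dictionary_overlap(ja, zh_bad, toy_dict)
print(score_good, score_bad)
```

In this sketch, a candidate pair with a higher overlap score is more likely to be a translation pair, so the score could be used to rank or threshold candidates. Function words with no dictionary entry lower the score, which is one reason a practical scorer would likely normalize differently or combine several signals.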
