Dravidian language family through Universal Dependencies lens

Taraka Rama, Sowmya Vajjala

TL;DR

This work analyzes Dravidian languages through the UD framework, highlighting their under-representation in UD and outlining concrete annotation strategies across tokenization, morphology, POS tagging, and syntax. It emphasizes language-specific features such as agglutinative morphology, extensive clitics and reduplication, dative subjects, and non-standard word order, proposing morpheme-level or function-based annotations and new dependency relations where needed. The paper contributes a taxonomy of Dravidian linguistic phenomena mapped to UD, practical guidelines for tokenization, POS tagging, and syntactic annotation, and a roadmap for expanding UD treebanks for Tamil, Telugu, Gadaba, Brahui, and other Dravidian languages. These contributions aim to enable richer cross-language NLP, more robust typology research, and greater linguistic coverage of under-resourced Dravidian languages in UD.

Abstract

The Universal Dependencies (UD) project aims to create a cross-linguistically consistent dependency annotation for multiple languages, to facilitate multilingual NLP. It currently supports 114 languages. Dravidian languages are spoken by over 200 million people across the world, and yet there are only two languages from this family in UD. This paper examines some of the morphological and syntactic features of Dravidian languages and explores how they can be annotated in the UD framework.

Paper Structure (28 sections, 14 figures)

Figures (14)

  • Figure 1: Dravidian Languages' subgrouping and geographical distribution.
  • Figure 2: Token and Morpheme level annotation
  • Figure 3: Interrogative Clitics in Tamil
  • Figure 4: Clitics in Brahui
  • Figure 5: Morphological analysis for Reduplication
  • ...and 9 more figures