LILI: A Simple Language Independent Approach for Language Identification

COLING 2016 · Mohamed Al-Badrashiny, Mona Diab ·

We introduce a generic Language Independent Framework for Linguistic Code Switch Point Detection. The system uses characters level 5-grams and word level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We test our proposed framework and compare it to the state-of-the-art published systems on standard data sets from several language pairs: English-Spanish, Nepali-English, English-Hindi, Arabizi (Refers to Arabic written using the Latin/Roman script)-English, Arabic-Engari (Refers to English written using Arabic script), Modern Standard Arabic(MSA)-Egyptian, Levantine-MSA, Gulf-MSA, one more English-Spanish, and one more MSA-EGY. The overall weighted average F-score of each language pair are 96.4{\%}, 97.3{\%}, 98.0{\%}, 97.0{\%}, 98.9{\%}, 86.3{\%}, 88.2{\%}, 90.6{\%}, 95.2{\%}, and 85.0{\%} respectively. The results show that our approach despite its simplicity, either outperforms or performs at comparable levels to state-of-the-art published systems.

PDF Abstract COLING 2016 PDF COLING 2016 Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Language Identification

Datasets

Add Datasets introduced or used in this paper

Results from the Paper

Add Remove

Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods

Add Remove

No methods listed for this paper. Add relevant methods here

Edit Social Preview

LILI: A Simple Language Independent Approach for Language Identification

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit Add Remove

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Add Remove

Methods

Add Remove