Open · Research-backed · Community-driven

NLP for 24 Turkic languages.

A practical, pip-installable Python toolkit — built so you can go from raw text to annotated output across the Turkic language family.

24 languages · Morphology & POS · Translation & Embeddings · Apache 2.0

Explore the toolkit → · Read the paper

From text to understanding

One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.

$ pip install turkicnlp

Works with Turkish, Kazakh, Uzbek, Turkmen, Azerbaijani, Kyrgyz, Uyghur, Tatar, Bashkir and 15 more languages.

🔧

The Toolkit

A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.

Supports 24 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.

Get started

Toolkit capabilities

Everything you need to build Turkic NLP systems

Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
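The processor idea can be sketched in plain Python. This is only an illustration of the chaining pattern, not the toolkit's actual API: `Document`, `WhitespaceTokenizer`, `LowercaseLemmatizer`, `Pipeline`, and `to_json` are hypothetical names invented for this example, and the "lemmatizer" is a trivial stand-in for a real analyzer.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """Raw text plus the annotations that processors attach to it."""
    text: str
    tokens: list = field(default_factory=list)


class WhitespaceTokenizer:
    """Hypothetical processor: splits the text on whitespace."""

    def process(self, doc):
        doc.tokens = [{"id": i, "form": form}
                      for i, form in enumerate(doc.text.split(), start=1)]
        return doc


class LowercaseLemmatizer:
    """Hypothetical stand-in for an analyzer: lemma = lowercased form."""

    def process(self, doc):
        for tok in doc.tokens:
            tok["lemma"] = tok["form"].lower()
        return doc


class Pipeline:
    """Runs each processor in order, passing the same document along."""

    def __init__(self, processors):
        self.processors = processors

    def __call__(self, text):
        doc = Document(text)
        for proc in self.processors:
            doc = proc.process(doc)
        return doc


def to_json(doc):
    """Serialize token annotations, keeping non-ASCII letters readable."""
    return json.dumps(doc.tokens, ensure_ascii=False)


pipeline = Pipeline([WhitespaceTokenizer(), LowercaseLemmatizer()])
doc = pipeline("Kitap okudum")
print(to_json(doc))
```

Each processor reads and extends the same document object, so adding a step means appending one more object to the list.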

🔤

Tokenization & Scripts

Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
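For intuition, script detection can be approximated with the standard library alone. The sketch below is an assumption about one simple approach (majority vote over Unicode character names), not a description of how the toolkit does it; `detect_script` is a hypothetical helper, and Perso-Arabic text is reported as `ARABIC` because that is the prefix Unicode uses.

```python
import unicodedata
from collections import Counter

SCRIPTS = ("LATIN", "CYRILLIC", "ARABIC")


def detect_script(text):
    """Return the most frequent script among the letters in `text`."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


for sample in ("Merhaba dünya", "Сәлем әлем", "ياخشىمۇسىز"):
    print(sample, detect_script(sample))
```

A majority vote tolerates a few foreign-script characters (for example a Latin brand name inside Cyrillic text), but it cannot tell languages apart; that is a separate identification step.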

🧬

Morphological Analysis

Apertium HFST finite-state transducers for 20 languages, loaded natively via Python. No system Apertium install required.

🏷️

POS & Dependency Parsing

Neural models for 15 languages. Stanza models trained on UD treebanks cover Turkish, Kazakh, Kyrgyz, and Uyghur. Custom-trained Stanza models or multilingual models cover Uzbek, Turkmen, Azerbaijani, Tatar, Bashkir, Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, and Ottoman Turkish.

🌐

Translation & Embeddings

NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.

📄

CoNLL-U I/O

Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
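To show what CoNLL-U round-tripping involves, here is a minimal reader and writer for the ten-column format. It is a sketch under simplifying assumptions (all values stay strings, no validation of IDs or multiword ranges); `parse_conllu` and `write_conllu` are hypothetical names, and the sample annotation is hand-written for illustration rather than taken from a treebank.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of {'comments', 'tokens'} dicts."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:  # input did not end with a blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


def write_conllu(sentences):
    """Serialize sentences back to CoNLL-U, one blank line after each."""
    out = []
    for sent in sentences:
        out.extend(sent["comments"])
        out.extend("\t".join(tok[f] for f in FIELDS) for tok in sent["tokens"])
        out.append("")
    return "\n".join(out) + "\n" if out else ""


sample = "\n".join([
    "# text = Kitap okudum.",
    "\t".join(["1", "Kitap", "kitap", "NOUN", "_",
               "Case=Nom|Number=Sing", "2", "obj", "_", "_"]),
    "\t".join(["2", "okudum", "oku", "VERB", "_",
               "Number=Sing|Person=1|Tense=Past", "0", "root", "_",
               "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
    "",
])

sentences = parse_conllu(sample)
print(len(sentences), [t["form"] for t in sentences[0]["tokens"]])
```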

🔬

Research-ready

MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
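Tag mapping from Apertium to UD can be pictured as a table lookup over the angle-bracket tags in an analysis. The sketch below assumes the first tag is the part of speech and the rest are features; `POS_MAP`, `FEAT_MAP`, and `apertium_to_ud` are hypothetical, and the tables are a small illustrative subset, not the toolkit's mapping. A real mapping needs per-language rules.

```python
import re

# Illustrative subset only.
POS_MAP = {"n": "NOUN", "v": "VERB", "adj": "ADJ", "adv": "ADV", "np": "PROPN"}
FEAT_MAP = {
    "sg": ("Number", "Sing"),
    "pl": ("Number", "Plur"),
    "nom": ("Case", "Nom"),
    "acc": ("Case", "Acc"),
    "dat": ("Case", "Dat"),
    "gen": ("Case", "Gen"),
    "loc": ("Case", "Loc"),
    "abl": ("Case", "Abl"),
}


def apertium_to_ud(analysis):
    """Convert a reading such as 'ev<n><pl><loc>' to (lemma, UPOS, FEATS)."""
    lemma = analysis.split("<", 1)[0]
    tags = re.findall(r"<([^<>]+)>", analysis)
    # Unknown or missing POS tags fall back to UD's catch-all 'X'.
    upos = POS_MAP.get(tags[0], "X") if tags else "X"
    feats = {}
    for tag in tags[1:]:
        if tag in FEAT_MAP:  # tags outside the table are silently skipped
            key, value = FEAT_MAP[tag]
            feats[key] = value
    # UD lists features alphabetically by name; '_' marks an empty set.
    feat_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return lemma, upos, feat_str


print(apertium_to_ud("ev<n><pl><loc>"))
```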

🔍

Language Identification

GlotLID model identifies 1000+ languages including all Turkic varieties. Auto-detect scripts and disambiguate similar languages.

🔤↔️

Script Transliteration

Bidirectional conversion between Cyrillic, Latin, and Perso-Arabic. Common Turkic Alphabet support for cross-language interoperability.
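At its simplest, bidirectional transliteration is a character table applied in either direction. The sketch below assumes a one-to-one table and uses only a handful of letters whose Cyrillic/Latin correspondence is uncontroversial; `CYR_TO_LAT`, `LAT_TO_CYR`, and `transliterate` are hypothetical names, and this is not any language's official orthography.

```python
# Illustrative one-to-one subset; real tables are language-specific.
CYR_TO_LAT = {
    "а": "a", "б": "b", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "п": "p", "р": "r", "с": "s", "т": "t",
}
# Inverting the table is only safe because every mapping is one-to-one.
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def transliterate(text, table):
    """Map each character through `table`, preserving case."""
    out = []
    for ch in text:
        mapped = table.get(ch.lower())
        if mapped is None:
            out.append(ch)  # pass through anything not in the table
        elif ch.isupper():
            out.append(mapped.upper())
        else:
            out.append(mapped)
    return "".join(out)


latin = transliterate("Салам", CYR_TO_LAT)
print(latin, transliterate(latin, LAT_TO_CYR))
```

Real orthographies also involve digraphs and context-dependent letters, so a per-character lookup like this one would need longest-match handling and per-language tables before it could be lossless.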

Language coverage

24 Turkic languages and growing

From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.

Language · Code · Coverage
Turkish · tur · Full pipeline
Kazakh · kaz · Full pipeline
Kyrgyz · kir · Full pipeline
Uyghur · uig · Full pipeline
Uzbek · uzb · Full pipeline
Azerbaijani · aze · Full pipeline
Tatar · tat · Full pipeline
Bashkir · bak · Full pipeline
Turkmen · tuk · Full pipeline
Crimean Tatar · crh · Morph + MT
South Azerbaijani · azb · Embeddings + MT
Sakha · sah · Morph + POS/Dep
Karakalpak · kaa · Morph + POS/Dep
Kumyk · kum · Morph + POS/Dep
Ottoman Turkish · ota · POS/Dep
Chuvash · chv · Morphology
Gagauz · gag · Morphology
Nogai · nog · Morph + POS/Dep
Karachay-Balkar · krc · Morph + POS/Dep
Altai · alt · Morphology
Tuvan · tyv · Morphology
Khakas · kjh · Morphology
Khalaj · klj · Morphology
Old Turkish · otk · Transliteration

Join the community

Get help, share your work, and collaborate

The Turkic NLP community spans researchers, practitioners, and language enthusiasts across the globe. Join us to ask questions, share your results, and help shape the future of the toolkit.

💬 Join Discord · 🔧 GitHub Discussions

Support the project

Help keep Turkic NLP open

The toolkit is free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.

💙 Donate via PayPal · 🎁 Support on Patreon

Funds go directly toward model training, infrastructure, and other tasks.