Open · Research-backed · Community-driven

NLP for 20+ Turkic languages.

A practical, pip-installable Python toolkit — built so you can go from raw text to annotated output across the Turkic language family.

24 languages · Morphology & POS · Translation & Embeddings · Apache 2.0

From text to understanding

One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.

$ pip install turkicnlp

Works with Turkish, Kazakh, Uzbek, Turkmen, Azerbaijani, Kyrgyz, Uyghur, Tatar, Bashkir and 15 more languages.

🔧

The Toolkit

A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.

Supports 24 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.

Toolkit capabilities

Everything you need to build Turkic NLP systems

Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
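
To make the "pick what you need, chain it together" idea concrete, here is a minimal sketch of a processor chain that ends in JSON output. Everything in it (`REGISTRY`, `run_pipeline`, and the two toy processors) is hypothetical and written for illustration only; it is not the toolkit's API, and the "lemmatizer" is a placeholder that merely lowercases.

```python
import json

# Hypothetical document shape: {"text": str, "tokens": [ {...}, ... ]}

def whitespace_tokenizer(doc):
    """Toy tokenizer: split on whitespace and number tokens from 1."""
    doc["tokens"] = [
        {"id": i, "form": form}
        for i, form in enumerate(doc["text"].split(), start=1)
    ]
    return doc

def lowercase_lemmatizer(doc):
    """Placeholder 'lemmatizer': lowercases the form (not real lemmatization)."""
    for tok in doc["tokens"]:
        tok["lemma"] = tok["form"].lower()
    return doc

# Hypothetical registry mapping processor names to callables.
REGISTRY = {
    "tokenize": whitespace_tokenizer,
    "lemma": lowercase_lemmatizer,
}

def run_pipeline(text, processors):
    """Run the named processors in order over a fresh document."""
    doc = {"text": text, "tokens": []}
    for name in processors:
        doc = REGISTRY[name](doc)
    return doc

if __name__ == "__main__":
    doc = run_pipeline("Men kitap okudum", ["tokenize", "lemma"])
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Each processor only reads and extends the shared document, so dropping `"lemma"` from the list simply leaves that field out of the output.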

🔤

Tokenization & Scripts

Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
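
As a rough illustration of what automatic script detection can look like, the sketch below counts letters by the script prefix of their Unicode character name and returns the majority script. It uses only the standard library; the function name `detect_script` is an assumption made for this example and says nothing about how the toolkit implements detection.

```python
import unicodedata
from collections import Counter

# Unicode character-name prefixes mapped to the script labels used on this page.
PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Perso-Arabic",
}

def detect_script(text: str) -> str:
    """Return the majority script among the letters of `text` (hypothetical helper)."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        for prefix, label in PREFIXES.items():
            if name.startswith(prefix):
                counts[label] += 1
                break
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]

if __name__ == "__main__":
    for sample in ["Merhaba dünya", "Сәлем әлем", "ياخشىمۇسىز"]:
        print(sample, "->", detect_script(sample))
```

A majority vote like this copes with the occasional foreign word or Latin-script brand name inside Cyrillic or Perso-Arabic text.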

🧬

Morphological Analysis

Apertium HFST finite-state transducers for 20 languages, loaded natively via Python. No system Apertium install required.

🏷️

POS & Dependency Parsing

Neural models for 15 languages. Stanza models trained on UD treebanks cover Turkish, Kazakh, Kyrgyz, and Uyghur. Custom-trained Stanza models or multilingual models cover Uzbek, Turkmen, Azerbaijani, Tatar, Bashkir, Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, and Ottoman Turkish.

🌐

Translation & Embeddings

NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.

📄

CoNLL-U I/O

Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
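
For readers new to the format, here is a minimal standard-library sketch of CoNLL-U reading and writing: ten tab-separated fields per token, `#` comment lines, and a blank line after each sentence. The function names `parse_conllu` and `write_conllu` are assumptions for this example, not the toolkit's API, and the sample annotation is hand-written for illustration. Multiword-token ranges and empty nodes are kept as plain rows without special handling.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def parse_conllu(text):
    """Parse CoNLL-U text into a list of (comments, tokens) pairs."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():            # blank line ends a sentence
            if tokens:
                sentences.append((comments, tokens))
            comments, tokens = [], []
        elif line.startswith("#"):      # sentence-level comment
            comments.append(line)
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:                          # file without a trailing blank line
        sentences.append((comments, tokens))
    return sentences

def write_conllu(sentences):
    """Serialize (comments, tokens) pairs back to CoNLL-U text."""
    out = []
    for comments, tokens in sentences:
        rows = list(comments)
        rows += ["\t".join(tok[f] for f in FIELDS) for tok in tokens]
        out.append("\n".join(rows) + "\n\n")
    return "".join(out)

# Hand-written illustrative sentence, not toolkit output.
SAMPLE = (
    "# text = Kitabı okudum.\n"
    "1\tKitabı\tkitap\tNOUN\t_\tCase=Acc|Number=Sing\t2\tobj\t_\t_\n"
    "2\tokudum\toku\tVERB\t_\tNumber=Sing|Person=1|Tense=Past\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    sents = parse_conllu(SAMPLE)
    print([tok["lemma"] for tok in sents[0][1]])
    print(write_conllu(sents) == SAMPLE)
```

Keeping every field as a string makes the round trip lossless, which matters when annotated files are passed on to other UD-compatible tools.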

🔬

Research-ready

MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
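
The Apertium → UD tag mapping can be pictured with a small sketch like the one below. It takes an Apertium-style analysis such as `kitap<n><pl><acc>` and produces a lemma, a UD part-of-speech tag, and a UD feature string. The tag subset and the helper name `apertium_to_ud` are illustrative assumptions; the toolkit's own mapping table is not reproduced here and is certainly larger.

```python
import re

# Illustrative subset of Apertium POS tags mapped to UD UPOS.
UPOS = {
    "n": "NOUN", "np": "PROPN", "v": "VERB", "adj": "ADJ", "adv": "ADV",
    "prn": "PRON", "post": "ADP", "num": "NUM", "det": "DET", "cnjcoo": "CCONJ",
}

# Illustrative subset of Apertium feature tags mapped to UD (feature, value).
FEATS = {
    "nom": ("Case", "Nom"), "acc": ("Case", "Acc"), "dat": ("Case", "Dat"),
    "gen": ("Case", "Gen"), "loc": ("Case", "Loc"), "abl": ("Case", "Abl"),
    "sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
    "p1": ("Person", "1"), "p2": ("Person", "2"), "p3": ("Person", "3"),
}

def apertium_to_ud(analysis: str):
    """Convert 'lemma<tag><tag>...' to (lemma, UPOS, FEATS); hypothetical helper."""
    lemma = analysis.split("<", 1)[0]
    tags = re.findall(r"<([^<>]+)>", analysis)
    upos = "X"                     # UD tag for unknown or unmapped categories
    feats = {}
    for tag in tags:
        if upos == "X" and tag in UPOS:
            upos = UPOS[tag]       # first recognized POS tag wins
        elif tag in FEATS:
            name, value = FEATS[tag]
            feats[name] = value
    # UD lists features alphabetically by name, or "_" when there are none.
    feat_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return lemma, upos, feat_str

if __name__ == "__main__":
    print(apertium_to_ud("kitap<n><pl><acc>"))
```

Unmapped tags are silently ignored in this sketch; a research pipeline would more likely log them so that gaps in the mapping are visible.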

🔍

Language Identification

The GlotLID model identifies 1000+ languages, including all Turkic varieties. It auto-detects scripts and disambiguates closely related languages.

🔤↔️

Script Transliteration

Bidirectional conversion between Cyrillic, Latin, and Perso-Arabic. Common Turkic Alphabet support for cross-language interoperability.
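
As one worked case of bidirectional conversion, the sketch below maps between the Azerbaijani Cyrillic alphabet (in use until 1991) and the modern Azerbaijani Latin alphabet, which correspond letter for letter. The names `to_latin` and `to_cyrillic` are assumptions for this example rather than toolkit functions, and other Turkic languages need context-sensitive rules that a plain lookup table cannot express.

```python
import unicodedata

# Azerbaijani Cyrillic letters and their Latin counterparts, in matching order.
CYRILLIC = "абвгғдеәжзиыјкҝлмноөпрстуүфхһчҹш"
LATIN = "abvqğdeəjziıykglmnoöprstuüfxhçcş"

def _upper_latin(ch: str) -> str:
    """Uppercase with the dotted/dotless i distinction preserved."""
    return {"i": "İ", "ı": "I"}.get(ch, ch.upper())

_TO_LATIN, _TO_CYRILLIC = {}, {}
for cyr, lat in zip(CYRILLIC, LATIN):
    _TO_LATIN[cyr] = lat
    _TO_LATIN[cyr.upper()] = _upper_latin(lat)
    _TO_CYRILLIC[lat] = cyr
    _TO_CYRILLIC[_upper_latin(lat)] = cyr.upper()

def _convert(text: str, table: dict) -> str:
    # NFC first so that letters typed with combining marks match the table.
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in text)

def to_latin(text: str) -> str:
    """Azerbaijani Cyrillic -> Latin; unknown characters pass through."""
    return _convert(text, _TO_LATIN)

def to_cyrillic(text: str) -> str:
    """Azerbaijani Latin -> Cyrillic; unknown characters pass through."""
    return _convert(text, _TO_CYRILLIC)

if __name__ == "__main__":
    word = "Азәрбајҹан"
    print(to_latin(word))
    print(to_cyrillic(to_latin(word)) == word)
```

Here `to_latin("Азәрбајҹан")` gives `Azərbaycan`, and converting back restores the original. The explicit `İ`/`I` handling is there because default Unicode case mapping turns `i` into `I`, which is the wrong letter in Turkic Latin alphabets.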

Language coverage

24 Turkic languages and growing

From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.

Language · Code · Coverage
Turkish · tur · Full pipeline
Kazakh · kaz · Full pipeline
Kyrgyz · kir · Full pipeline
Uyghur · uig · Full pipeline
Uzbek · uzb · Full pipeline
Azerbaijani · aze · Full pipeline
Tatar · tat · Full pipeline
Bashkir · bak · Full pipeline
Turkmen · tuk · Full pipeline
Crimean Tatar · crh · Morph + MT
S. Azerbaijani · azb · Embeddings + MT
Sakha · sah · Morph + POS/Dep
Karakalpak · kaa · Morph + POS/Dep
Kumyk · kum · Morph + POS/Dep
Ottoman Turkish · ota · POS/Dep
Chuvash · chv · Morphology
Gagauz · gag · Morphology
Nogai · nog · Morph + POS/Dep
Karachay-Balkar · krc · Morph + POS/Dep
Altai · alt · Morphology
Tuvan · tyv · Morphology
Khakas · kjh · Morphology
Khalaj · klj · Morphology
Old Turkish · otk · Transliteration

Join the community

Get help, share your work, and collaborate

The Turkic NLP community spans researchers, practitioners, and language enthusiasts across the globe. Join us to ask questions, share your results, and help shape the future of the toolkit.

Support the project

Help keep Turkic NLP open

The toolkit is free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.

Funds go directly toward model training, infrastructure, and ongoing development.