Open · Research-backed · Community-driven

NLP for 20+ Turkic languages.

A practical, pip-installable Python toolkit and an open textbook — built together so you can go from raw text to annotated output across the Turkic language family.

21 languages · Morphology & POS · Translation & Embeddings · Apache 2.0

From text to understanding

One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.

$ pip install turkicnlp

Works with Turkish, Kazakh, Uzbek, Kyrgyz, Uyghur, Tatar and 15 more languages.

🔧

The Toolkit

A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.

Supports 21 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.

📖

The Book

NLP for Turkic Languages is a practical, open-access textbook that runs from foundations through modern LLMs, with exercises, code examples, and linguistic depth in every chapter.

Freely available online. Because knowledge about under-resourced languages should be open.

Toolkit capabilities

Everything you need to build Turkic NLP systems

Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
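To make the processor idea concrete, here is a minimal standard-library sketch of chaining processors over a shared document and exporting JSON. It is not the toolkit's API: `Token`, `Document`, `whitespace_tokenizer`, `punct_tagger`, and `run_pipeline` are hypothetical names chosen for illustration, and the two processors are deliberately toy-sized.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable, List


@dataclass
class Token:
    text: str
    upos: str = "_"  # "_" marks an unset field, as in CoNLL-U


@dataclass
class Document:
    text: str
    tokens: List[Token] = field(default_factory=list)


# A processor is any callable that takes a Document and returns it annotated.
Processor = Callable[[Document], Document]


def whitespace_tokenizer(doc: Document) -> Document:
    """Toy tokenizer (hypothetical): split on whitespace."""
    doc.tokens = [Token(t) for t in doc.text.split()]
    return doc


def punct_tagger(doc: Document) -> Document:
    """Toy tagger (hypothetical): tag tokens with no alphanumerics as PUNCT."""
    for tok in doc.tokens:
        if not any(ch.isalnum() for ch in tok.text):
            tok.upos = "PUNCT"
    return doc


def run_pipeline(text: str, processors: List[Processor]) -> Document:
    """Apply each processor in order to the same document."""
    doc = Document(text)
    for processor in processors:
        doc = processor(doc)
    return doc


doc = run_pipeline("Kitabı okudum .", [whitespace_tokenizer, punct_tagger])
print(json.dumps(asdict(doc), ensure_ascii=False, indent=2))
```

Because each processor only reads and writes the shared document, one stage can be dropped or swapped without touching the others, which is the property the "pick what you need" description relies on.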

🔤

Tokenization & Scripts

Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
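As a rough illustration of what automatic script detection can look like, the sketch below counts letters by their Unicode character names and returns the majority script. This is an assumed heuristic, not the toolkit's detector, and `detect_script` is a hypothetical name.

```python
import unicodedata
from collections import Counter

# Unicode character names begin with the script name, e.g.
# "LATIN SMALL LETTER DOTLESS I" or "CYRILLIC SMALL LETTER SCHWA".
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Perso-Arabic",
}


def detect_script(text: str) -> str:
    """Return the script that most letters in `text` belong to."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] += 1
                break
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


print(detect_script("Merhaba dünya"))   # Turkish
print(detect_script("Сәлем әлем"))      # Kazakh
print(detect_script("ياخشىمۇسىز"))      # Uyghur
```

A majority vote like this copes with the occasional foreign-script word in a sentence, but it says nothing about which language is written in that script; language identification is a separate step.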

🧬

Morphological Analysis

Apertium HFST finite-state transducers for 21 languages, loaded natively via Python. No system Apertium install required.

🏷️

POS & Dependency Parsing

Stanza neural models trained on Universal Dependencies treebanks for Turkish, Kazakh, Kyrgyz, Uyghur, and Ottoman Turkish.

🌐

Translation & Embeddings

NLLB-200 (600M parameters) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.

📄

CoNLL-U I/O

Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
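For readers new to the format, the following sketch shows the core of CoNLL-U reading and writing: ten tab-separated columns per token, `#` comment lines, and a blank line between sentences. It is a simplified stand-in, not the toolkit's parser; `parse_conllu` and `write_conllu` are hypothetical names, comment lines are dropped rather than preserved, and the sample annotation is illustrative.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

Sentence = List[Dict[str, str]]


def parse_conllu(text: str) -> List[Sentence]:
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):    # comments are skipped in this sketch
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def write_conllu(sentences: List[Sentence]) -> str:
    """Serialize sentences back to CoNLL-U (without comment lines)."""
    blocks = [
        "\n".join("\t".join(tok[f] for f in FIELDS) for tok in sent)
        for sent in sentences
    ]
    return "\n\n".join(blocks) + "\n\n"


# Illustrative annotation of the Turkish sentence "Kitabı okudum."
ROWS = [
    ["1", "Kitabı", "kitap", "NOUN", "_", "Case=Acc|Number=Sing", "2", "obj", "_", "_"],
    ["2", "okudum", "oku", "VERB", "_", "Number=Sing|Person=1|Tense=Past", "0", "root", "_", "SpaceAfter=No"],
    ["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Kitabı okudum.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

sentences = parse_conllu(SAMPLE)
print(write_conllu(sentences))
```

A production reader would also need to keep comments such as `# text`, and handle multiword-token ranges (IDs like `1-2`) and empty nodes, all of which this sketch passes through as ordinary rows or ignores.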

🔬

Research-ready

MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
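The Apertium → UD tag mapping can be pictured as a lookup from Apertium analysis symbols to UD part-of-speech tags and features. The sketch below covers only a small hand-picked subset of tags and assumes the first tag in an analysis is the part of speech; the toolkit's actual mapping tables are not shown here, and `apertium_to_ud` is a hypothetical name.

```python
import re
from typing import Tuple

# Small illustrative subset of Apertium part-of-speech symbols.
UPOS_MAP = {
    "n": "NOUN", "np": "PROPN", "v": "VERB", "adj": "ADJ", "adv": "ADV",
    "prn": "PRON", "post": "ADP", "num": "NUM", "cnjcoo": "CCONJ",
}

# Small illustrative subset of Apertium inflection symbols.
FEAT_MAP = {
    "sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
    "nom": ("Case", "Nom"), "acc": ("Case", "Acc"), "dat": ("Case", "Dat"),
    "loc": ("Case", "Loc"), "abl": ("Case", "Abl"), "gen": ("Case", "Gen"),
}


def apertium_to_ud(analysis: str) -> Tuple[str, str, str]:
    """Convert an analysis like 'ev<n><pl><loc>' to (lemma, UPOS, FEATS)."""
    lemma = analysis.split("<", 1)[0]
    tags = re.findall(r"<([^<>]+)>", analysis)
    # Assumption: the first tag is the part of speech; unknown tags map to X.
    upos = UPOS_MAP.get(tags[0], "X") if tags else "X"
    feats = dict(FEAT_MAP[t] for t in tags[1:] if t in FEAT_MAP)
    # UD lists features alphabetically by name, joined with "|".
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return lemma, upos, feats_str


# Turkish "evlerde" ("in the houses") analysed as ev + plural + locative.
print(apertium_to_ud("ev<n><pl><loc>"))
```

Real mappings are rarely one-to-one: some Apertium tags depend on context or on neighbouring tags, so a full converter usually needs rules on tag sequences rather than a flat dictionary.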

Language coverage

21 Turkic languages and growing

From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.

Language         Code  Support
Turkish          tur   Full pipeline
Kazakh           kaz   Full pipeline
Kyrgyz           kir   Full pipeline
Uyghur           uig   Full pipeline
Uzbek            uzb   Morph + MT
Azerbaijani      aze   Morph + MT
Tatar            tat   Morph + MT
Bashkir          bak   Morph + MT
Turkmen          tuk   Morph + MT
Crimean Tatar    crh   Morph + MT
S. Azerbaijani   azb   Morph + MT
Chuvash          chv   Morphology
Gagauz           gag   Morphology
Sakha            sah   Morphology
Karakalpak       kaa   Morphology
Nogai            nog   Morphology
Kumyk            kum   Morphology
Karachay-Balkar  krc   Morphology
Altai            alt   Morphology
Tuvan            tyv   Morphology
Khakas           kjh   Morphology

Support the project

Help keep Turkic NLP open

Both the toolkit and the book are free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.

Funds go directly toward model training, infrastructure, and time spent writing the open book.