A practical, pip-installable Python toolkit and an open textbook — built together so you can go from raw text to annotated output across the Turkic language family.
One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.
Works with Turkish, Kazakh, Uzbek, Kyrgyz, Uyghur, Tatar, and 15 more Turkic languages.
A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.
Supports 21 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.
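A first run could look like the sketch below. The package and class names (`turkic_nlp`, `Pipeline`) are placeholders inferred from the Stanza-inspired design, not a confirmed API:

```python
# Hypothetical quickstart: `turkic_nlp` and the Pipeline API are
# placeholders modeled on the Stanza-style design described above.
from turkic_nlp import Pipeline

nlp = Pipeline("kk")  # Kazakh, with the default processor chain
doc = nlp("Мен кітап оқып отырмын.")  # "I am reading a book."
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.deprel)
```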
NLP for Turkic Languages is a practical, open-access textbook covering foundations through modern LLMs — with exercises, code examples, and linguistic depth for every chapter.
Freely available online. Because knowledge about under-resourced languages should be open.
Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
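As a sketch of what processor selection could look like (the comma-separated string follows Stanza's convention, but the processor names and serializers here are assumptions):

```python
from turkic_nlp import Pipeline  # hypothetical package name

# Chain only the processors you need; names are illustrative.
nlp = Pipeline("tr", processors="tokenize,pos,depparse")
doc = nlp("Ankara Türkiye'nin başkentidir.")

print(doc.to_conllu())  # assumed CoNLL-U serializer
print(doc.to_json())    # assumed JSON serializer
```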
Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
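Script handling might be exposed along these lines; the helper names below are illustrative guesses, not the toolkit's confirmed interface:

```python
from turkic_nlp.scripts import detect_script, transliterate  # hypothetical

detect_script("Mən kitab oxuyuram.")  # -> "Latn" (assumed return value)
detect_script("Мен кітап оқимын.")    # -> "Cyrl"

# Convert Uzbek Cyrillic to the current Latin orthography.
transliterate("Мен китоб ўқияпман.", src="Cyrl", dst="Latn", lang="uz")
```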
Apertium HFST finite-state transducers for 21 languages, loaded natively via Python. No system Apertium install required.
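A morphology lookup could read like this; the `Analyzer` class is an assumed wrapper around the bundled transducers, while the tag string in the comment reflects the real Apertium output format:

```python
from turkic_nlp.morph import Analyzer  # hypothetical module path

analyzer = Analyzer("kk")  # loads the bundled Apertium HFST transducer
for analysis in analyzer.analyze("кітаптар"):  # "books"
    print(analysis)  # e.g. Apertium-style "кітап<n><pl><nom>"
```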
Stanza neural models trained on Universal Dependencies treebanks for Turkish, Kazakh, Kyrgyz, Uyghur, and Ottoman Turkish.
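These are standard Stanza models, so loading one directly with the stanza library works as usual; the toolkit presumably wraps pipelines like this behind its own interface:

```python
import stanza

stanza.download("tr")        # fetch the Turkish UD models
nlp = stanza.Pipeline("tr")  # default chain: tokenize through depparse

doc = nlp("Türkçe, Türk dillerinin en çok konuşulanıdır.")
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.head, word.deprel)
```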
NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.
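The translation model is the published facebook/nllb-200-distilled-600M checkpoint, so the underlying call is plain Hugging Face; how the toolkit wraps it is not shown here:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="tur_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Turkish -> Kazakh: force the target language token at decode time.
inputs = tokenizer("Bugün hava çok güzel.", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kaz_Cyrl"),
    max_length=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```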
Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
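A round trip might look like the following; the function names are assumptions, though the .conllu format itself is the UD standard:

```python
from turkic_nlp.conll import read_conllu, write_conllu  # hypothetical

docs = read_conllu("kk_ktb-ud-test.conllu")  # import a UD treebank
print(docs[0].sentences[0].words[0].text)    # inspect the first token
write_conllu(docs, "annotated.conllu")       # export for UD tooling
```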
MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
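The exact knobs aren't documented on this page, so the keyword arguments below are guesses at how GPU use, tag mapping, and batching might be switched on:

```python
from turkic_nlp import Pipeline  # hypothetical package, as above

# Argument names are illustrative, not confirmed options.
nlp = Pipeline("ug", use_gpu=True, map_tags="ud")

texts = [line.strip() for line in open("corpus.txt", encoding="utf-8")]
docs = nlp.bulk_process(texts)  # assumed batch entry point
```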
From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.
Both the toolkit and the book are free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.
Funds go directly toward model training, infrastructure, and time spent writing the open book.