Toolkit

The TurkicNLP Toolkit

A modular, pip-installable Python library for NLP across the Turkic language family. Morphology, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation — in one unified pipeline.

24 languages  ·  Python 3.9+  ·  Apache 2.0  ·  CoNLL-U native

Installation

Core library — morphology + tokenization:

$ pip install turkicnlp

With all models:

$ pip install "turkicnlp[all]"

Models are downloaded separately at runtime — they are never bundled in the pip package.

Quick start

Up and running in three steps

1

Install the library

Install from PyPI with optional extras depending on which processors you need.

$ pip install "turkicnlp[stanza,translation]"

2

Download models for your language

Models are fetched on demand and cached in ~/.turkicnlp/models/.

import turkicnlp

# Download Stanza + Apertium models for Turkish
turkicnlp.download("tur", processors=["tokenize", "morph", "pos", "lemma", "depparse"])

3

Build a pipeline and annotate text

Processors chain automatically in the correct order.

nlp = turkicnlp.Pipeline("tur", processors=["tokenize", "pos", "lemma", "depparse"])
doc = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:<16} {word.upos:<8} {word.lemma}")

# Example output (abridged: the rows for "ve" and "." are omitted)
# Bugün            ADV      bugün
# hava             NOUN     hava
# çok              ADV      çok
# güzel            ADJ      güzel
# parkta           NOUN     park
# yürüyüş          NOUN     yürüyüş
# yaptım           VERB     yap
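
The three columns printed above map directly onto the FORM, LEMMA, and UPOS fields of a CoNLL-U row. The following is a minimal sketch of that mapping in plain Python. The helper `to_conllu` is hypothetical and is not part of the turkicnlp API; it only illustrates the 10-column layout, filling unknown fields with `_`.

```python
# Hypothetical helper for illustration; not a turkicnlp function.
# CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
CONLLU_FIELDS = 10


def to_conllu(rows):
    """Format (form, lemma, upos) triples as CoNLL-U lines for one sentence."""
    lines = []
    for idx, (form, lemma, upos) in enumerate(rows, start=1):
        # Fields we do not know are written as "_" per the CoNLL-U convention.
        fields = [str(idx), form, lemma, upos] + ["_"] * (CONLLU_FIELDS - 4)
        lines.append("\t".join(fields))
    return "\n".join(lines)


rows = [("hava", "hava", "NOUN"), ("güzel", "güzel", "ADJ")]
print(to_conllu(rows))
```

In practice the triples would come from `word.text`, `word.lemma`, and `word.upos` in the loop above.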

Supported languages for POS, Lemma & Parsing

These 15 languages have neural models for tokenization, POS tagging, lemmatization, and dependency parsing:

Turkish (tur)
Kazakh (kaz)
Kyrgyz (kir)
Uyghur (uig)
Uzbek (uzb)
Azerbaijani (aze)
Tatar (tat)
Bashkir (bak)
Turkmen (tuk)
Sakha (sah)
Karakalpak (kaa)
Kumyk (kum)
Karachay-Balkar (krc)
Nogai (nog)
Ottoman Turkish (ota)

Machine translation

NLLB-200 translation & embeddings

Powered by the facebook/nllb-200-distilled-600M model from Meta AI, which covers 11 Turkic languages. Models and datasets are hosted on Hugging Face.

# Translate Kazakh → English
nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["translate"],
    translate_tgt_lang="eng"
)
doc = nlp("Бүгін ауа райы өте жақсы.")
print(doc.translation)
# The weather is very good today.

# Sentence embeddings for semantic similarity
embed = turkicnlp.Pipeline("tur", processors=["embeddings"])
doc = embed("Parkta yürüyüş çok güzeldi.")
print(len(doc.embedding))   # 1024-dim vector
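
Semantic similarity between two sentences is usually computed as the cosine of the angle between their embedding vectors. The sketch below assumes only that `doc.embedding` behaves like a flat sequence of floats, which is what the `len()` call above suggests; the function `cosine_similarity` is written here for illustration and is not a turkicnlp API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    return dot / (norm_a * norm_b)


# Toy 3-dim vectors standing in for two 1024-dim sentence embeddings.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
print(round(cosine_similarity(u, v), 3))  # 0.5
```

With real pipeline output, the call would presumably look like `cosine_similarity(doc_a.embedding, doc_b.embedding)`.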

Supported languages for MT & Embeddings

These 11 languages have full NLLB-200 support for both translation and semantic embeddings:

Turkish (tur)
Kazakh (kaz)
Kyrgyz (kir)
Uzbek (uzb)
Azerbaijani (aze)
Tatar (tat)
Bashkir (bak)
Turkmen (tuk)
Uyghur (uig)
Crimean Tatar (crh)
S. Azerbaijani (azb)

Language identification

Auto-detect languages with confidence

Powered by GlotLID, supporting 1000+ languages including all Turkic varieties.

# Identify a text
import turkicnlp

lid = turkicnlp.LanguageDetection()
labels, probs = lid.predict("salam, hemmelere!", k=3)
print(labels)  # ['__label__uzb_Latn', '__label__tur_Latn', '__label__aze_Latn']
print(probs)   # [0.94, 0.03, 0.02]

# Limit to Turkic languages
lid_turkic = turkicnlp.LanguageDetection(
    languages=["__label__tur_Latn", "__label__kaz_Cyrl", "__label__uzb_Latn"]
)
label, prob = lid_turkic.predict("Привет!", k=1)
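
The labels shown above follow the pattern `__label__<language code>_<script code>`. A small helper can split them into their parts for downstream use. This is a sketch based only on the label strings printed in the example; `parse_label` is a hypothetical name, not a function provided by turkicnlp or GlotLID.

```python
LABEL_PREFIX = "__label__"


def parse_label(label):
    """Split a label such as '__label__uzb_Latn' into ('uzb', 'Latn')."""
    if not label.startswith(LABEL_PREFIX):
        raise ValueError(f"unexpected label format: {label!r}")
    lang, _, script = label[len(LABEL_PREFIX):].partition("_")
    return lang, script


# Values copied from the example output above.
labels = ["__label__uzb_Latn", "__label__tur_Latn", "__label__aze_Latn"]
probs = [0.94, 0.03, 0.02]

for label, prob in zip(labels, probs):
    lang, script = parse_label(label)
    print(f"{lang} ({script}): {prob:.2f}")
```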

All 24 Turkic languages supported

GlotLID can identify text in all 24 supported Turkic languages, including their script variants (Cyrillic, Latin, Perso-Arabic). The list below shows 16 of them; the language support matrix covers the full set:

Turkish (tur)
Kazakh (kaz)
Uzbek (uzb)
Kyrgyz (kir)
Uyghur (uig)
Azerbaijani (aze)
Tatar (tat)
Turkmen (tuk)
Crimean Tatar (crh)
Bashkir (bak)
Sakha (sah)
Karakalpak (kaa)
Kumyk (kum)
Ottoman Turkish (ota)
Karachay-Balkar (krc)
Gagauz (gag)

Morpheme segmentation

Label morpheme boundaries with linguistic precision

A hybrid neural + FST morpheme tokenizer that combines Glot500 neural models with Apertium FST transducers and phonological rules.

# Turkish example
from turkicnlp.processors.morpheme_tokenizer import MorphemeTokenizer

tok = MorphemeTokenizer(lang="tur")
tok.load()

result = tok.segment("evlerinden")
print(result.labeled)
# [('ev', 'STEM'), ('ler', 'PLUR'),
#  ('in', 'POSS.2SG'), ('den', 'ABL')]

# Kazakh example
tok_kaz = MorphemeTokenizer(lang="kaz")
tok_kaz.load()  # load the Kazakh models before segmenting, as in the Turkish example

result = tok_kaz.segment("бармадым")
print(result.labeled)
# [('бар', 'STEM'), ('ма', 'NEG'),
#  ('ды', 'PST'), ('м', '1SG')]
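
The `(morpheme, label)` pairs printed above are easy to turn into an interlinear gloss, a common way to present segmentations in linguistic work. The sketch below operates on a plain list of tuples copied from the Turkish output; `interlinear` is a hypothetical helper, not part of the library.

```python
def interlinear(labeled):
    """Return (segmented surface form, gloss line) from (morpheme, label) pairs."""
    surface = "-".join(morpheme for morpheme, _ in labeled)
    gloss = "-".join(label for _, label in labeled)
    return surface, gloss


# Pairs copied from the Turkish example output above.
labeled = [("ev", "STEM"), ("ler", "PLUR"), ("in", "POSS.2SG"), ("den", "ABL")]

surface, gloss = interlinear(labeled)
print(surface)  # ev-ler-in-den
print(gloss)    # STEM-PLUR-POSS.2SG-ABL
```

Joining the morphemes without a separator should reproduce the input word, which makes a convenient sanity check on any segmentation.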

Supported for 16 languages

Comprehensive morpheme inventories and phonological rules for these Turkic languages:

Turkish, Azerbaijani, Kazakh, Uzbek, Kyrgyz, Tatar, Bashkir, Turkmen, Crimean Tatar, Sakha, Khakas, Tuvan, Altai, Chuvash, Gagauz, Kumyk

Labels cover stems, parts of speech, and grammatical categories (case, number, person, tense, aspect, mood, and more).

Script conversion

Seamless transliteration across writing systems

Bidirectional script conversion: Cyrillic ↔ Latin ↔ Perso-Arabic, plus the Common Turkic Alphabet for cross-language interoperability.

# Kazakh Cyrillic → Latin
from turkicnlp.scripts import Script
from turkicnlp.scripts.transliterator import Transliterator

t = Transliterator("kaz", Script.CYRILLIC, Script.LATIN)
print(t.transliterate("Қазақстан"))
# → Qazaqstan (2021 official alphabet)

# Uyghur Perso-Arabic → Latin
t_ug = Transliterator("uig", Script.PERSO_ARABIC, Script.LATIN)
print(t_ug.transliterate("مەكتەپ"))
# → mektep

# Any Turkic language → Common Turkic Alphabet
t_cts = Transliterator("aze", Script.LATIN, Script.COMMON_TURKIC)
print(t_cts.transliterate("Azərbaycan dili"))
# → Azärbaycan dili
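
At its simplest, script conversion is a character-to-character mapping. The toy sketch below shows the idea with Python's built-in `str.translate`, using only the seven letters needed for the Kazakh example above. It is an illustration, not the library's implementation: the table name and function are hypothetical, the table is deliberately incomplete, and real transliteration also has to handle context-dependent rules.

```python
# Toy, incomplete Kazakh Cyrillic -> Latin table (illustration only).
KAZ_CYRL_TO_LATN = str.maketrans({
    "Қ": "Q",
    "қ": "q",
    "а": "a",
    "з": "z",
    "с": "s",
    "т": "t",
    "н": "n",
})


def toy_transliterate(text):
    """Map each known Cyrillic letter to Latin; unknown characters pass through."""
    return text.translate(KAZ_CYRL_TO_LATN)


print(toy_transliterate("Қазақстан"))  # Qazaqstan
```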

8 languages with full bidirectional support

These Turkic languages have complete script conversion:

Cyrillic ↔ Latin: Kazakh, Uzbek, Azerbaijani, Tatar, Turkmen, Karakalpak, Crimean Tatar
Perso-Arabic ↔ Latin: Uyghur
To Common Turkic Alphabet: All 21 supported languages

Also supports Old Turkic Runic Script (Orkhon-Yenisei) → Latin for historical texts.

Pipeline processors

Modular pipeline architecture

Request only the processors you need. Dependencies are resolved automatically.

tokenize

Tokenization

Neural (Stanza) or rule-based. Handles multi-word tokens and sentence splitting.

morph

Morphological Analysis

Apertium HFST FSTs for 20 languages. Lemma, POS, and full morphological features.

pos

POS Tagging

Stanza neural tagger and multilingual neural models for 15 languages. Outputs UPOS, XPOS, and features.

lemma

Lemmatization

Stanza neural lemmatizer and multilingual neural models for 15 languages. Apertium fallback for others.

depparse

Dependency Parsing

Biaffine attention parser via Stanza. UD-compatible dependency relations.

ner

Named Entities

Entity extraction via Stanza. BIO tagging format. Available for Turkish (Starlang) and Kazakh (KazNERD).

embeddings

Sentence Embeddings

NLLB-200 encoder states. 1024-dim multilingual vectors for semantic search and similarity. Models are hosted on Hugging Face.

translate

Machine Translation

NLLB-200 sequence-to-sequence generation. Any Turkic language → any FLORES-200 language. Models are hosted on Hugging Face.

lid

Language Identification

GlotLID model for multilingual language detection. Predict 1000+ language labels with confidence scores.

morph_neural

Neural Morphology

Glot500 multilingual morph analyzer. UPOS, UD morphological features, and lemmatization for 23 languages.

transliterate

Script Transliteration

Bidirectional conversion between Cyrillic, Latin, and Perso-Arabic. Common Turkic Alphabet support.

Language support matrix

What works for each language

✓ available  ·  ~ in development  ·  – not yet available

Language            Code  Script             Tokenize Morphology POS DepParse Embeddings Translation LID
Turkish             tur   Latin
Kazakh              kaz   Cyrillic
Kyrgyz              kir   Cyrillic
Uyghur              uig   Perso-Arabic
Uzbek               uzb   Latin
Azerbaijani         aze   Latin / Cyrillic
Tatar               tat   Cyrillic
Bashkir             bak   Cyrillic
Turkmen             tuk   Latin
Crimean Tatar       crh   Latin
S. Azerbaijani      azb   Perso-Arabic
Sakha (Yakut)       sah   Cyrillic
Karakalpak          kaa   Latin / Cyrillic
Kumyk               kum   Cyrillic
Karachay-Balkar     krc   Cyrillic
Nogai               nog   Cyrillic
Ottoman Turkish     ota   Perso-Arabic
Chuvash             chv   Cyrillic
Gagauz              gag   Latin
Altai               alt   Cyrillic
Tuvan               tyv   Cyrillic
Khakas              kjh   Cyrillic
Khalaj              klj   Latin
Old Turkish         otk   Runic              ~

POS Tagging: Turkish, Kazakh, Kyrgyz, and Uyghur use official Stanza/UD models; Uzbek, Turkmen, Tatar, Bashkir, and Azerbaijani use custom-trained Stanza models; Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, and Ottoman Turkish use Glot500-based multilingual models.

DepParse: Same backends as POS. Note that some languages have POS tagging but no dependency parsing.

Language ID: GlotLID supports all 24 Turkic languages via the multilingual model.

Morphology: Available for 20+ languages via Apertium FST transducers and Glot500 neural models.
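
The POS backend assignments listed in the notes above can be collected into a small lookup table, which is handy when you need to know which kind of model a given language code will use. The sketch below is a plain-Python restatement of those notes; the names `POS_BACKENDS` and `pos_backend` are hypothetical and do not refer to anything in the turkicnlp API.

```python
# Backend groups as described in the POS Tagging note (illustrative names).
POS_BACKENDS = {
    "official Stanza/UD": ["tur", "kaz", "kir", "uig"],
    "custom-trained Stanza": ["uzb", "tuk", "tat", "bak", "aze"],
    "Glot500-based multilingual": ["sah", "kaa", "kum", "krc", "nog", "ota"],
}

# Invert the grouping into a per-language lookup.
BACKEND_BY_LANG = {
    code: backend
    for backend, codes in POS_BACKENDS.items()
    for code in codes
}


def pos_backend(code):
    """Return the POS backend group for a language code, or None if unlisted."""
    return BACKEND_BY_LANG.get(code)


print(pos_backend("kaz"))  # official Stanza/UD
print(pos_backend("sah"))  # Glot500-based multilingual
print(len(BACKEND_BY_LANG))  # 15
```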

Design

Built to be extended

TurkicNLP follows a Stanza-inspired architecture. Every processor declares what it PROVIDES and REQUIRES — the pipeline resolves dependencies automatically.
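
Resolving processors from PROVIDES/REQUIRES declarations amounts to a topological sort over a dependency graph. The following is a minimal sketch of that idea using the standard-library `graphlib` module (available from Python 3.9). The feature names and the `PROCESSORS` table are assumptions made up for this illustration; the actual declarations and resolver in TurkicNLP may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations for illustration; real values may differ.
PROCESSORS = {
    "tokenize": {"provides": {"tokens"}, "requires": set()},
    "pos": {"provides": {"upos"}, "requires": {"tokens"}},
    "lemma": {"provides": {"lemma"}, "requires": {"tokens", "upos"}},
    "depparse": {"provides": {"deps"}, "requires": {"tokens", "upos", "lemma"}},
}


def resolve_order(requested):
    """Return requested processors plus their dependencies in run order."""
    # Map each feature to the processor that provides it.
    provider = {
        feature: name
        for name, spec in PROCESSORS.items()
        for feature in spec["provides"]
    }
    graph = {}
    pending = list(requested)
    while pending:
        name = pending.pop()
        if name in graph:
            continue
        deps = {provider[feature] for feature in PROCESSORS[name]["requires"]}
        graph[name] = deps  # node -> set of processors that must run first
        pending.extend(deps)
    return list(TopologicalSorter(graph).static_order())


print(resolve_order(["depparse"]))
# ['tokenize', 'pos', 'lemma', 'depparse']
```

Requesting only `depparse` pulls in everything it transitively needs, which mirrors the behaviour described above where dependencies are resolved automatically.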

Script-aware from the ground up: models are keyed by lang/script/processor/backend. The pipeline auto-detects scripts, inserts transliteration steps where needed, and bridges models across writing systems.

Apertium FST data is GPL-3.0-or-later and always downloaded separately; it is never bundled in the Apache 2.0 pip package. All models and datasets are hosted on Hugging Face.

# Pipeline execution order
script_detect → transliterate
  → tokenize → mwt → morph
  → pos → lemma → depparse
  → ner → embeddings
  → sentiment → translate

# Model storage layout
~/.turkicnlp/models/
  tur/Latn/morph/apertium/
  kaz/Cyrl/tokenize/stanza/
  huggingface/
    facebook--nllb-200-distilled-600M/
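
The storage layout above follows the lang/script/processor/backend key described earlier. A cache path for a given model can therefore be built by joining those four parts under the models root. The sketch below uses `pathlib`; the function `model_dir` and its `root` parameter are hypothetical and only mirror the layout shown, not the library's internal code.

```python
from pathlib import Path


def model_dir(lang, script, processor, backend, root=None):
    """Build a cache directory path keyed by lang/script/processor/backend."""
    if root is None:
        # Default location taken from the layout shown above.
        root = Path.home() / ".turkicnlp" / "models"
    return Path(root) / lang / script / processor / backend


path = model_dir("tur", "Latn", "morph", "apertium")
print(path)
print(path.exists())  # False until the model has been downloaded
```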
Citation

If you use TurkicNLP in your research

Please cite the accompanying paper published on arXiv:

Sherzod Hakimov. TurkicNLP: An NLP Toolkit for Turkic Languages. arXiv:2602.19174, 2026.

@misc{hakimov2026turkicnlp,
  title  = {TurkicNLP: An NLP Toolkit
            for Turkic Languages},
  author = {Sherzod Hakimov},
  year   = {2026},
  eprint = {2602.19174},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url    = {https://arxiv.org/abs/2602.19174}
}

Contributing

Help us cover more languages

TurkicNLP is a community effort. There are many ways to contribute — no matter your background.

🐛

Report bugs

Found an issue? Open a GitHub issue with your language, input text, and the error. Reproducible reports are hugely helpful.

🌍

Add a language

Have an Apertium FST or UD treebank for a Turkic language not yet covered? Add a catalog entry and open a PR.

🧪

Improve tag mappings

Apertium → UD tag mappings exist for Turkish, Kazakh, and Tatar. Help us add mappings for Uzbek, Azerbaijani, Kyrgyz, and others.

📊

Share datasets

Do you have annotated data, parallel corpora, or evaluation benchmarks for Turkic languages? Let us know or link them in the catalog.

🧠

Train neural models

Help train POS taggers, dependency parsers, or NER models for under-resourced languages. Training scripts are in the repo.

How to get started

1

Fork & clone the repository

$ git clone https://github.com/turkic-nlp/turkicnlp

2

Install in development mode

$ pip install -e ".[dev,stanza,nllb]"

3

Run the tests

$ pytest turkicnlp/tests/

4

Open a pull request

All PRs are reviewed. Small, focused changes merge faster. When in doubt, open an issue first to discuss the approach.

💬

Join the TurkicNLP Community

Get help with the toolkit, share your research, discuss challenges, and help shape the future of Turkic NLP — all in one place.

Licensing

Open source, with clear boundaries

The TurkicNLP library is Apache 2.0 — use it freely in research and commercial projects.

Apertium FST data (morphological analyzers) is GPL-3.0-or-later. It is downloaded separately at runtime and is never bundled in the pip package. Each language's data directory includes its license file.

Stanza models and NLLB-200 are used under their respective licenses (Apache 2.0 and CC-BY-NC-4.0).

Quick reference

turkicnlp (pip): Apache 2.0
Apertium FSTs: GPL-3.0-or-later
Stanza models: Apache 2.0
NLLB-200: CC-BY-NC-4.0

Support the project

Keep Turkic NLP open & growing

If TurkicNLP saves you time or powers your research, consider supporting its development. Every contribution helps fund model training and new language coverage.