A modular, pip-installable Python library for NLP across the Turkic language family. Morphology, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation — in one unified pipeline.
Install from PyPI with optional extras depending on which processors you need.
Core library (morphology + tokenization):
With all models:
Models are downloaded separately at runtime; they are never bundled in the pip package.
Models are fetched on demand and cached in ~/.turkicnlp/models/.
import turkicnlp
# Download Stanza + Apertium models for Turkish
turkicnlp.download("tur", processors=["tokenize", "morph", "pos", "lemma", "depparse"])
Processors chain automatically in the correct order.
nlp = turkicnlp.Pipeline("tur", processors=["tokenize", "pos", "lemma", "depparse"])
doc = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:<16} {word.upos:<8} {word.lemma}")
# Bugün ADV bugün
# hava NOUN hava
# çok ADV çok
# güzel ADJ güzel
# parkta NOUN park
# yürüyüş NOUN yürüyüş
# yaptım VERB yap
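These 15 languages have neural models for tokenization, POS tagging, lemmatization, and dependency parsing: Turkish, Kazakh, Kyrgyz, Uyghur, Uzbek, Turkmen, Tatar, Bashkir, Azerbaijani, Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, and Ottoman Turkish.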
Powered by Facebook's NLLB-200-distilled-600M model, supporting 11 Turkic languages. Browse models and datasets on HuggingFace.
# Translate Kazakh → English
nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["translate"],
    translate_tgt_lang="eng",
)
doc = nlp("Бүгін ауа райы өте жақсы.")
print(doc.translation)
# The weather is very good today.
# Sentence embeddings for semantic similarity
embed = turkicnlp.Pipeline("tur", processors=["embeddings"])
doc = embed("Parkta yürüyüş çok güzeldi.")
print(len(doc.embedding)) # 1024-dim vector
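Sentence embeddings are typically compared with cosine similarity. The following is a minimal sketch, assuming `doc.embedding` is a flat sequence of floats; `cosine_similarity` is a hypothetical helper (not part of turkicnlp), and the short vectors are made-up stand-ins for real 1024-dim embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 4-dim vectors standing in for two doc.embedding values
vec_a = [0.2, 0.1, 0.7, 0.4]
vec_b = [0.1, 0.0, 0.8, 0.5]
score = cosine_similarity(vec_a, vec_b)
print(f"{score:.3f}")
```

In practice you would pass the `embedding` attribute of two processed documents instead of the toy vectors; scores close to 1.0 indicate similar meaning.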
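These 11 languages have full NLLB-200 support for both translation and semantic embeddings: Turkish, Kazakh, Kyrgyz, Uyghur, Uzbek, Azerbaijani, Tatar, Bashkir, Turkmen, Crimean Tatar, and South Azerbaijani.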
Powered by GlotLID, supporting 1000+ languages including all Turkic varieties.
# Identify a text
import turkicnlp
lid = turkicnlp.LanguageDetection()
labels, probs = lid.predict("salam, hemmelere!", k=3)
print(labels) # ['__label__uzb_Latn', '__label__tur_Latn', '__label__aze_Latn']
print(probs) # [0.94, 0.03, 0.02]
# Limit to Turkic languages
lid_turkic = turkicnlp.LanguageDetection(
    languages=["__label__tur_Latn", "__label__kaz_Cyrl", "__label__uzb_Latn"]
)
label, prob = lid_turkic.predict("Привет!", k=1)
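The returned labels combine a language code and a script code. A small sketch of how they could be split for downstream use, assuming the `__label__<lang>_<Script>` shape shown in the example output; `parse_label` and `LangScript` are hypothetical names, not part of turkicnlp.

```python
from typing import NamedTuple


class LangScript(NamedTuple):
    lang: str    # language code, e.g. "uzb"
    script: str  # script code, e.g. "Latn"


def parse_label(label: str) -> LangScript:
    """Split a label such as '__label__uzb_Latn' into language and script."""
    prefix = "__label__"
    if not label.startswith(prefix):
        raise ValueError(f"unexpected label format: {label!r}")
    lang, _, script = label[len(prefix):].partition("_")
    if not lang or not script:
        raise ValueError(f"unexpected label format: {label!r}")
    return LangScript(lang, script)


labels = ["__label__uzb_Latn", "__label__tur_Latn", "__label__aze_Latn"]
parsed = [parse_label(label) for label in labels]
print(parsed[0].lang, parsed[0].script)  # uzb Latn
```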
GlotLID can identify text in any of the Turkic languages, with script variants (Cyrillic, Latin, Perso-Arabic):
Hybrid neural + FST morpheme tokenizer. Combines Glot500 neural models with Apertium finite-state transducers (FSTs) and phonological rules.
# Turkish example
from turkicnlp.processors.morpheme_tokenizer import MorphemeTokenizer
tok = MorphemeTokenizer(lang="tur")
tok.load()
result = tok.segment("evlerinden")
print(result.labeled)
# [('ev', 'STEM'), ('ler', 'PLUR'),
# ('in', 'POSS.2SG'), ('den', 'ABL')]
# Kazakh example
tok_kaz = MorphemeTokenizer(lang="kaz")
tok_kaz.load()
result = tok_kaz.segment("бармадым")
print(result.labeled)
# [('бар', 'STEM'), ('ма', 'NEG'),
# ('ды', 'PST'), ('м', '1SG')]
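The `(morpheme, label)` pairs are easy to turn into an interlinear-style gloss. A minimal sketch, assuming `result.labeled` is a list of 2-tuples as printed above; `to_gloss` is a hypothetical helper, not part of turkicnlp.

```python
from typing import List, Tuple

Segments = List[Tuple[str, str]]


def to_gloss(segments: Segments, sep: str = "-") -> Tuple[str, str]:
    """Join (morpheme, label) pairs into a surface line and a gloss line."""
    surface = sep.join(morph for morph, _ in segments)
    gloss = sep.join(label for _, label in segments)
    return surface, gloss


# Segments copied from the Turkish example output above
labeled = [("ev", "STEM"), ("ler", "PLUR"), ("in", "POSS.2SG"), ("den", "ABL")]
surface, gloss = to_gloss(labeled)
print(surface)  # ev-ler-in-den
print(gloss)    # STEM-PLUR-POSS.2SG-ABL
```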
Comprehensive morpheme inventories and phonological rules for these Turkic languages:
The inventories cover morpheme stems, parts of speech, and grammatical categories (case, number, person, tense, aspect, mood, and more).
Bidirectional script conversion: Cyrillic ↔ Latin ↔ Perso-Arabic, plus the Common Turkic Alphabet for cross-language interoperability.
# Kazakh Cyrillic → Latin
from turkicnlp.scripts import Script
from turkicnlp.scripts.transliterator import Transliterator
t = Transliterator("kaz", Script.CYRILLIC, Script.LATIN)
print(t.transliterate("Қазақстан"))
# → Qazaqstan (2021 official alphabet)
# Uyghur Perso-Arabic → Latin
t_ug = Transliterator("uig", Script.PERSO_ARABIC, Script.LATIN)
print(t_ug.transliterate("مەكتەپ"))
# → mektep
# Any Turkic language → Common Turkic Alphabet
t_cts = Transliterator("aze", Script.LATIN, Script.COMMON_TURKIC)
print(t_cts.transliterate("Azərbaycan dili"))
# → Azärbaycan dili
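Conceptually, the simplest form of script conversion is a character-level lookup. The sketch below is an illustration only: the table is deliberately partial (just the letters needed for the "Қазақстан" example), the names are hypothetical, and the library's actual converter presumably handles the full alphabets and context-dependent rules.

```python
# Partial, illustrative Cyrillic → Latin table covering only the letters
# in "Қазақстан". This is NOT the library's mapping table.
KAZ_CYRL_TO_LATN = {
    "қ": "q",
    "а": "a",
    "з": "z",
    "с": "s",
    "т": "t",
    "н": "n",
}


def transliterate(text: str, table: dict) -> str:
    """Map each character through `table`, preserving case."""
    out = []
    for ch in text:
        mapped = table.get(ch.lower())
        if mapped is None:
            out.append(ch)  # unknown character: pass through unchanged
        elif ch.isupper():
            out.append(mapped.capitalize())
        else:
            out.append(mapped)
    return "".join(out)


print(transliterate("Қазақстан", KAZ_CYRL_TO_LATN))  # Qazaqstan
```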
These Turkic languages have complete script conversion:
Also supports Old Turkic Runic Script (Orkhon-Yenisei) → Latin for historical texts.
Request only the processors you need. Dependencies are resolved automatically.
Tokenization: neural (Stanza) or rule-based. Handles multi-word tokens and sentence splitting.
Morphology: Apertium HFST finite-state transducers for 20 languages. Outputs lemma, POS, and full morphological features.
POS tagging: Stanza neural tagger and multilingual neural models for 15 languages. Outputs UPOS, XPOS, and features.
Lemmatization: Stanza neural lemmatizer and multilingual neural models for 15 languages, with Apertium fallback for the others.
Dependency parsing: biaffine attention parser via Stanza. UD-compatible dependency relations.
Named entity recognition: entity extraction via Stanza in BIO tagging format. Available for Turkish (Starlang) and Kazakh (KazNERD).
Embeddings: NLLB-200 encoder states. 1024-dim multilingual vectors for semantic search and similarity. Models are on HuggingFace.
Translation: NLLB-200 sequence-to-sequence generation. Any supported Turkic language → any FLORES-200 language. Models are on HuggingFace.
Language identification: GlotLID model for multilingual language detection. Predicts 1000+ language labels with confidence scores.
Neural morphology: Glot500 multilingual morphological analyzer. UPOS, UD morphological features, and lemmatization for 23 languages.
Transliteration: bidirectional conversion between Cyrillic, Latin, and Perso-Arabic. Common Turkic Alphabet support.
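The BIO tagging format used for named entities above can be collapsed into entity spans with a short loop. This is a generic sketch, not the library's API: `bio_to_spans` is a hypothetical helper, the tags are hand-written for illustration, and the `PER`/`LOC` type names are assumptions about the tagset.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a new entity starts: close the open one
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes the open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Hand-written tags for illustration; not actual model output
tokens = ["Ayşe", "Demir", "Ankara", "şehrinde", "yaşıyor", "."]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O", "O"]
print(bio_to_spans(tokens, tags))
# [('Ayşe Demir', 'PER'), ('Ankara', 'LOC')]
```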
✓ available · ~ in development · – not yet available
| Language | Code | Script | Tokenize | Morphology | POS | DepParse | Embeddings | Translation | LID |
|---|---|---|---|---|---|---|---|---|---|
| Turkish | tur | Latin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kazakh | kaz | Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kyrgyz | kir | Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Uyghur | uig | Perso-Arabic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Uzbek | uzb | Latin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Azerbaijani | aze | Latin / Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Tatar | tat | Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Bashkir | bak | Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Turkmen | tuk | Latin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Crimean Tatar | crh | Latin | ✓ | ✓ | ✓ | – | ✓ | ✓ | ✓ |
| S. Azerbaijani | azb | Perso-Arabic | ✓ | – | – | – | ✓ | ✓ | ✓ |
| Sakha (Yakut) | sah | Cyrillic | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Karakalpak | kaa | Latin / Cyrillic | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Kumyk | kum | Cyrillic | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Karachay-Balkar | krc | Cyrillic | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Nogai | nog | Cyrillic | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Ottoman Turkish | ota | Perso-Arabic | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Chuvash | chv | Cyrillic | ✓ | ✓ | ✓ | – | – | – | ✓ |
| Gagauz | gag | Latin | ✓ | ✓ | ✓ | – | – | – | ✓ |
| Altai | alt | Cyrillic | ✓ | ✓ | ✓ | – | – | – | ✓ |
| Tuvan | tyv | Cyrillic | ✓ | ✓ | ✓ | – | – | – | ✓ |
| Khakas | kjh | Cyrillic | ✓ | ✓ | ✓ | – | – | – | ✓ |
| Khalaj | klj | Latin | ✓ | ✓ | ✓ | – | – | – | ✓ |
| Old Turkish | otk | Runic | ~ | – | – | – | – | – | ✓ |
POS Tagging: Turkish, Kazakh, Kyrgyz, and Uyghur use official Stanza/UD models; Uzbek, Turkmen, Tatar, Bashkir, and Azerbaijani use custom-trained Stanza models; Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, and Ottoman Turkish use Glot500-based multilingual models.
DepParse: Same backends as POS. Some languages have POS tagging but no dependency parsing.
Language ID: GlotLID covers all 24 Turkic languages listed above via its multilingual model.
Morphology: Available for 20+ languages via Apertium finite-state transducers and Glot500 neural models.
TurkicNLP follows a Stanza-inspired architecture. Every processor declares what it PROVIDES and REQUIRES — the pipeline resolves dependencies automatically.
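Dependency resolution of this kind is commonly implemented as a topological sort. The sketch below shows the idea with the standard-library `graphlib`; the `REQUIRES` table and the `resolve` function are hypothetical, each processor is assumed to provide only its own name, and the library's real declarations may differ.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical REQUIRES declarations, for illustration only
REQUIRES = {
    "tokenize": set(),
    "pos": {"tokenize"},
    "lemma": {"tokenize", "pos"},
    "depparse": {"tokenize", "pos", "lemma"},
}


def resolve(requested):
    """Return requested processors plus their prerequisites, in run order."""
    needed = set()
    stack = list(requested)
    while stack:  # collect transitive prerequisites
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(REQUIRES[name])
    # Map each needed processor to the processors that must run before it
    graph = {name: REQUIRES[name] for name in needed}
    return list(TopologicalSorter(graph).static_order())


print(resolve(["depparse"]))
# ['tokenize', 'pos', 'lemma', 'depparse']
```

With this approach, asking only for `depparse` pulls in everything it depends on, and a cyclic declaration would surface as a `graphlib.CycleError` instead of a silent misordering.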
Script-aware from the ground up: models are keyed by lang/script/processor/backend. The pipeline auto-detects scripts, inserts transliteration steps where needed, and bridges models across writing systems.
Apertium FST data is GPL-3.0-or-later and always downloaded separately; it is never bundled in the Apache 2.0 pip package. All models and datasets are hosted on HuggingFace.
# Pipeline execution order
script_detect → transliterate
→ tokenize → mwt → morph
→ pos → lemma → depparse
→ ner → embeddings
→ sentiment → translate
# Model storage layout
~/.turkicnlp/models/
  tur/Latn/morph/apertium/
  kaz/Cyrl/tokenize/stanza/
  huggingface/
    facebook--nllb-200-distilled-600M/
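The layout above maps directly onto the lang/script/processor/backend key. A minimal sketch of building such a path with `pathlib`; `model_dir` and `MODEL_ROOT` are hypothetical names, and the library may construct its paths differently.

```python
from pathlib import Path

# Cache root as documented above
MODEL_ROOT = Path.home() / ".turkicnlp" / "models"


def model_dir(lang: str, script: str, processor: str, backend: str,
              root: Path = MODEL_ROOT) -> Path:
    """Build the cache directory for one lang/script/processor/backend key."""
    return root / lang / script / processor / backend


path = model_dir("tur", "Latn", "morph", "apertium")
print(path.relative_to(MODEL_ROOT).as_posix())  # tur/Latn/morph/apertium
print(path.is_dir())  # True only if that model is already in the cache
```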
Please cite the accompanying paper published on arXiv:
Sherzod Hakimov. TurkicNLP: An NLP Toolkit for Turkic Languages. arXiv:2602.19174, 2026.
@misc{hakimov2026turkicnlp,
  title         = {TurkicNLP: An NLP Toolkit for Turkic Languages},
  author        = {Sherzod Hakimov},
  year          = {2026},
  eprint        = {2602.19174},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2602.19174}
}
TurkicNLP is a community effort. There are many ways to contribute — no matter your background.
Found an issue? Open a GitHub issue with your language, input text, and the error. Reproducible reports are hugely helpful.
Have an Apertium FST or UD treebank for a Turkic language not yet covered? Add a catalog entry and open a PR.
Apertium → UD tag mappings exist for Turkish, Kazakh, and Tatar. Help us add mappings for Uzbek, Azerbaijani, Kyrgyz, and others.
Do you have annotated data, parallel corpora, or evaluation benchmarks for Turkic languages? Let us know or link them in the catalog.
Help train POS taggers, dependency parsers, or NER models for under-resourced languages. Training scripts are in the repo.
All PRs are reviewed. Small, focused changes merge faster. When in doubt, open an issue first to discuss the approach.
Get help with the toolkit, share your research, discuss challenges, and help shape the future of Turkic NLP — all in one place.
The TurkicNLP library is Apache 2.0 — use it freely in research and commercial projects.
Apertium FST data (morphological analyzers) is GPL-3.0-or-later. It is downloaded separately at runtime and is never bundled in the pip package. Each language's data directory includes its license file.
Stanza models and NLLB-200 are used under their respective licenses (Apache 2.0 and CC-BY-NC-4.0).
| Component | License |
|---|---|
| turkicnlp (pip) | Apache 2.0 |
| Apertium FSTs | GPL-3.0-or-later |
| Stanza models | Apache 2.0 |
| NLLB-200 | CC-BY-NC-4.0 |
If TurkicNLP saves you time or powers your research, consider supporting its development. Every contribution helps fund model training and new language coverage.