A modular, pip-installable Python library for NLP across the Turkic language family. Morphology, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation — in one unified pipeline.
Install from PyPI with optional extras depending on which processors you need:

- Core library — morphology + tokenization
- With neural models — Stanza + Transformers

Models are fetched on demand and cached in `~/.turkicnlp/models/`. Apertium FST data (GPL-3.0) is downloaded separately at runtime — it is never bundled in the pip package.
```python
import turkicnlp

# Download Stanza + Apertium models for Turkish
turkicnlp.download("tur", processors=["tokenize", "morph", "pos", "lemma", "depparse"])
```
Processors chain automatically in the correct order.
```python
nlp = turkicnlp.Pipeline("tur", processors=["tokenize", "pos", "lemma", "depparse"])
doc = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:<16} {word.upos:<8} {word.lemma}")

# Bugün            ADV      bugün
# hava             NOUN     hava
# çok              ADV      çok
# güzel            ADJ      güzel
# ve               CCONJ    ve
# parkta           NOUN     park
# yürüyüş          NOUN     yürüyüş
# yaptım           VERB     yap
# .                PUNCT    .
```
Powered by Meta's NLLB-200-distilled-600M model, supporting 11 Turkic languages.
```python
# Translate Kazakh → English
nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["translate"],
    translate_tgt_lang="eng",
)
doc = nlp("Бүгін ауа райы өте жақсы.")
print(doc.translation)
# The weather is very good today.
```
```python
# Sentence embeddings for semantic similarity
embed = turkicnlp.Pipeline("tur", processors=["embeddings"])
doc = embed("Parkta yürüyüş çok güzeldi.")
print(len(doc.embedding))  # 1024-dim vector
```
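Downstream, semantic similarity between two sentences is typically the cosine of their embedding vectors. A self-contained sketch in plain Python (the toy 4-dimensional vectors below stand in for real 1024-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for doc.embedding of two similar sentences
v1 = [0.2, 0.1, 0.9, 0.3]
v2 = [0.25, 0.05, 0.85, 0.4]
print(cosine_similarity(v1, v2))  # close to 1.0 for semantically similar sentences
```

Identical vectors score 1.0, orthogonal vectors 0.0; in practice, a threshold tuned on your data decides what counts as "similar".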
These 11 languages have full NLLB-200 support for both translation and semantic embeddings:
Request only the processors you need. Dependencies are resolved automatically.
| Processor | Description |
|---|---|
| `tokenize` | Neural (Stanza) or rule-based. Handles multi-word tokens, sentence splitting. |
| `morph` | Apertium HFST FSTs for 21 languages. Lemma, POS, and full morphological features. |
| `pos` | Stanza neural tagger trained on UD treebanks. Outputs UPOS, XPOS, and features. |
| `lemma` | Stanza neural lemmatizer for tur, kaz, kir, and uig; Apertium fallback for others. |
| `depparse` | Biaffine attention parser via Stanza. UD-compatible dependency relations. |
| `ner` | NER processor for entity extraction. BIO tagging format. (In development) |
| `embeddings` | NLLB-200 encoder states. 1024-dim multilingual vectors for semantic search and similarity. |
| `translate` | NLLB-200 sequence-to-sequence generation. Any Turkic language → any FLORES-200 language. |
✓ available · ~ in development · – not yet available
| Language | Code | Script | Tokenize | Morphology | POS / Dep | Embeddings | Translation |
|---|---|---|---|---|---|---|---|
| Turkish | tur | Latin | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kazakh | kaz | Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kyrgyz | kir | Cyrillic | ✓ | ✓ | ✓ | ✓ | ✓ |
| Uyghur | uig | Perso-Arabic | ✓ | ✓ | ✓ | ✓ | ✓ |
| Uzbek | uzb | Latin | ~ | ✓ | ~ | ✓ | ✓ |
| Azerbaijani | aze | Latin | ~ | ✓ | ~ | ✓ | ✓ |
| Tatar | tat | Cyrillic | ~ | ✓ | – | ✓ | ✓ |
| Bashkir | bak | Cyrillic | ~ | ✓ | – | ✓ | ✓ |
| Turkmen | tuk | Latin | ~ | ✓ | – | ✓ | ✓ |
| Crimean Tatar | crh | Latin | ~ | ✓ | – | ✓ | ✓ |
| S. Azerbaijani | azb | Perso-Arabic | ✓ | ✓ | – | ✓ | ✓ |
| Chuvash | chv | Cyrillic | ~ | ✓ | – | – | – |
| Sakha (Yakut) | sah | Cyrillic | ~ | ✓ | – | – | – |
| Gagauz | gag | Latin | ~ | ✓ | – | – | – |
| Karakalpak | kaa | Latin | ~ | ✓ | – | – | – |
| + 6 more | … | Cyrillic | ~ | ✓ | – | – | – |
~ = rule-based tokenizer available. Neural models require treebank data. Contributions welcome!
TurkicNLP follows a Stanza-inspired architecture. Every processor declares what it PROVIDES and REQUIRES — the pipeline resolves dependencies automatically.
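As a sketch of that idea, dependency resolution amounts to a depth-first walk over each processor's REQUIRES list (the table and `resolve` function below are illustrative, not turkicnlp's actual internals):

```python
# Hypothetical REQUIRES table for a few processors (illustrative only)
REQUIRES = {
    "tokenize": [],
    "pos": ["tokenize"],
    "lemma": ["tokenize", "pos"],
    "depparse": ["tokenize", "pos", "lemma"],
}

def resolve(requested):
    """Return processors in dependency order, pulling in any missing prerequisites."""
    order = []

    def visit(name):
        if name in order:
            return
        for dep in REQUIRES[name]:
            visit(dep)  # prerequisites first
        order.append(name)

    for name in requested:
        visit(name)
    return order

print(resolve(["depparse"]))
# ['tokenize', 'pos', 'lemma', 'depparse']
```

This is why requesting only `depparse` still gives you tokenization, tagging, and lemmas: the pipeline inserts them automatically.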
Script-aware from the ground up: models are keyed by lang/script/processor/backend. The pipeline auto-detects scripts, inserts transliteration steps where needed, and bridges models across writing systems.
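Script auto-detection can be as simple as a majority vote over the Unicode script of each letter; a minimal self-contained sketch (an illustration of the approach, not the library's detector):

```python
import unicodedata

def detect_script(text):
    """Crude script detector: majority vote over the Unicode names of letters.
    Illustrative only — real detectors handle mixed text and more scripts."""
    counts = {"Cyrl": 0, "Latn": 0, "Arab": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if "CYRILLIC" in name:
            counts["Cyrl"] += 1
        elif "LATIN" in name:
            counts["Latn"] += 1
        elif "ARABIC" in name:
            counts["Arab"] += 1
    return max(counts, key=counts.get)

print(detect_script("Бүгін ауа райы өте жақсы."))  # Cyrl
print(detect_script("Bugün hava çok güzel."))      # Latn
```

The detected script then selects the model key (e.g. `kaz/Cyrl/...`) and decides whether a transliteration step is needed.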
Apertium FST data is GPL-3.0 and always downloaded separately — never bundled in the Apache 2.0 pip package.
```text
# Pipeline execution order
script_detect → transliterate
             → tokenize → mwt → morph
             → pos → lemma → depparse
             → ner → embeddings
             → sentiment → translate
```

```text
# Model storage layout
~/.turkicnlp/models/
  tur/Latn/morph/apertium/
  kaz/Cyrl/tokenize/stanza/
  huggingface/
    facebook--nllb-200-distilled-600M/
```
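Given that layout, resolving a cached model directory is just joining the four keys onto the cache root; a small sketch (the function name `model_dir` is hypothetical, not part of the public API):

```python
from pathlib import Path

def model_dir(lang, script, processor, backend, root="~/.turkicnlp/models"):
    """Build the cache path for a model keyed by lang/script/processor/backend."""
    return Path(root).expanduser() / lang / script / processor / backend

print(model_dir("tur", "Latn", "morph", "apertium"))
```

Keying on all four fields lets the same language hold parallel models for different scripts and backends without collisions.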
TurkicNLP is a community effort. There are many ways to contribute — no matter your background.
Found an issue? Open a GitHub issue with your language, input text, and the error. Reproducible reports are hugely helpful.
Have an Apertium FST or UD treebank for a Turkic language not yet covered? Add a catalog entry and open a PR.
Apertium → UD tag mappings exist for Turkish, Kazakh, and Tatar. Help us add mappings for Uzbek, Azerbaijani, Kyrgyz, and others.
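For orientation, such a mapping is essentially a lookup table from Apertium tags to UD UPOS values; a tiny illustrative fragment (the entries below are examples, not the project's actual mapping files):

```python
# A few common Apertium tags and their UD UPOS equivalents (illustrative)
APERTIUM_TO_UPOS = {
    "n": "NOUN",        # noun
    "v": "VERB",        # verb
    "adj": "ADJ",       # adjective
    "adv": "ADV",       # adverb
    "prn": "PRON",      # pronoun
    "cnjcoo": "CCONJ",  # coordinating conjunction
}

def map_tag(apertium_tag):
    """Map an Apertium tag to UD UPOS, defaulting to X for unknown tags."""
    return APERTIUM_TO_UPOS.get(apertium_tag, "X")

print(map_tag("n"))        # NOUN
print(map_tag("mystery"))  # X
```

The real work is in the long tail: language-specific tags, multi-tag analyses, and features, which is where contributions help most.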
Do you have annotated data, parallel corpora, or evaluation benchmarks for Turkic languages? Let us know or link them in the catalog.
Help train POS taggers, dependency parsers, or NER models for under-resourced languages. Training scripts are in the repo.
The open book needs reviewers, code examples, and native speaker insights. See the book repository for how to get involved.
All PRs are reviewed. Small, focused changes merge faster. When in doubt, open an issue first to discuss the approach.
The TurkicNLP library is Apache 2.0 — use it freely in research and commercial projects.
Apertium FST data (morphological analyzers) is GPL-3.0-or-later. It is downloaded separately at runtime and is never bundled in the pip package. Each language's data directory includes its license file.
Stanza models and NLLB-200 are used under their respective licenses (Apache 2.0 and CC-BY-NC-4.0).
| Component | License |
|---|---|
| turkicnlp (pip) | Apache 2.0 |
| Apertium FSTs | GPL-3.0-or-later |
| Stanza models | Apache 2.0 |
| NLLB-200 | CC-BY-NC-4.0 |
If TurkicNLP saves you time or powers your research, consider supporting its development. Every contribution helps fund model training, new language coverage, and the open book.
The book will always be free. Donations help make that sustainable.