Toolkit · v0.1.1

The TurkicNLP Toolkit

A modular, pip-installable Python library for NLP across the Turkic language family. Morphology, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation — in one unified pipeline.

21 languages · Python 3.9+ · Apache 2.0 · CoNLL-U native

Installation

Core library — morphology + tokenization:

$ pip install turkicnlp

With neural models (Stanza + Transformers):

$ pip install "turkicnlp[stanza,nllb]"

Apertium FST data (GPL-3.0) is downloaded separately at runtime — it is never bundled in the pip package.

Quick start

Up and running in three steps

1. Install the library

Install from PyPI with optional extras depending on which processors you need.

$ pip install "turkicnlp[stanza,nllb]"

2. Download models for your language

Models are fetched on demand and cached in ~/.turkicnlp/models/.

import turkicnlp

# Download Stanza + Apertium models for Turkish
turkicnlp.download("tur", processors=["tokenize", "morph", "pos", "lemma", "depparse"])

3. Build a pipeline and annotate text

Processors chain automatically in the correct order.

nlp = turkicnlp.Pipeline("tur", processors=["tokenize", "pos", "lemma", "depparse"])
doc = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:<16} {word.upos:<8} {word.lemma}")

# Bugün            ADV      bugün
# hava             NOUN     hava
# çok              ADV      çok
# güzel            ADJ      güzel
# ve               CCONJ    ve
# parkta           NOUN     park
# yürüyüş          NOUN     yürüyüş
# yaptım           VERB     yap
# .                PUNCT    .
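
A minimal follow-up sketch: depparse was also requested above, so each word should carry a syntactic head and relation as well. This assumes the word objects expose Stanza-style head, deprel, and feats attributes, which are not shown in the snippet above and are an assumption here.

# Walk the dependency tree of the same doc (head indices are 1-based; 0 = root)
for sentence in doc.sentences:
    for word in sentence.words:
        governor = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text:<16} --{word.deprel}--> {governor}   {word.feats}")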
Machine translation

NLLB-200 translation & embeddings

Powered by Facebook's NLLB-200-distilled-600M model, supporting 11 Turkic languages.

# Translate Kazakh → English
nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["translate"],
    translate_tgt_lang="eng"
)
doc = nlp("Бүгін ауа райы өте жақсы.")
print(doc.translation)
# The weather is very good today.

# Sentence embeddings for semantic similarity
embed = turkicnlp.Pipeline("tur", processors=["embeddings"])
doc = embed("Parkta yürüyüş çok güzeldi.")
print(len(doc.embedding))   # 1024-dim vector
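
Since doc.embedding behaves as a flat 1024-dimensional vector in the snippet above, semantic similarity reduces to plain cosine similarity. A small sketch; NumPy is used here for convenience and is an assumption, not a documented dependency of the library.

import numpy as np
import turkicnlp

embed = turkicnlp.Pipeline("tur", processors=["embeddings"])

v1 = np.array(embed("Parkta yürüyüş çok güzeldi.").embedding)
v2 = np.array(embed("Bugün parkta uzun bir yürüyüş yaptım.").embedding)

# Cosine similarity: values closer to 1.0 mean the sentences are more similar
similarity = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"cosine similarity: {similarity:.3f}")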

Supported languages for MT & Embeddings

These 11 languages have full NLLB-200 support for both translation and semantic embeddings:

Turkish (tur)
Kazakh (kaz)
Kyrgyz (kir)
Uzbek (uzb)
Azerbaijani (aze)
Tatar (tat)
Bashkir (bak)
Turkmen (tuk)
Uyghur (uig)
Crimean Tatar (crh)
S. Azerbaijani (azb)
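
Every language in this list can act as either the source or the target, so translation between two Turkic languages uses the same parameters shown above. A short sketch for Kazakh → Turkish:

# Kazakh → Turkish with the same translate processor
nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["translate"],
    translate_tgt_lang="tur",
)
doc = nlp("Бүгін ауа райы өте жақсы.")
print(doc.translation)   # the Turkish rendering of the Kazakh sentence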
Pipeline processors

Modular pipeline architecture

Request only the processors you need. Dependencies are resolved automatically.

tokenize · Tokenization
Neural (Stanza) or rule-based. Handles multi-word tokens and sentence splitting.

morph · Morphological Analysis
Apertium HFST transducers for 21 languages. Lemma, POS, and full morphological features.

pos · POS Tagging
Stanza neural tagger trained on UD treebanks. Outputs UPOS, XPOS, and morphological features.

lemma · Lemmatization
Stanza neural lemmatizer for tur, kaz, kir, uig; Apertium fallback for other languages.

depparse · Dependency Parsing
Biaffine attention parser via Stanza. UD-compatible dependency relations.

ner · Named Entities
NER processor for entity extraction in BIO tagging format. (In development)

embeddings · Sentence Embeddings
NLLB-200 encoder states: 1024-dimensional multilingual vectors for semantic search and similarity.

translate · Machine Translation
NLLB-200 sequence-to-sequence generation. Any supported Turkic language → any FLORES-200 language.
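
As a concrete illustration of the automatic dependency resolution described above, requesting only depparse should be enough: the pipeline is expected to pull in its prerequisites (such as tokenize, pos, and lemma) on its own. The exact set of auto-inserted processors is an inference from the processor descriptions, not documented output.

import turkicnlp

# Only the parser is requested; its prerequisites are resolved automatically
nlp = turkicnlp.Pipeline("tur", processors=["depparse"])
doc = nlp("Kitabı dün akşam bitirdim.")

for word in doc.sentences[0].words:
    print(word.text, word.upos, word.deprel)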

Language support matrix

What works for each language

Language          Code   Script
Turkish           tur    Latin
Kazakh            kaz    Cyrillic
Kyrgyz            kir    Cyrillic
Uyghur            uig    Perso-Arabic
Uzbek             uzb    Latin
Azerbaijani       aze    Latin
Tatar             tat    Cyrillic
Bashkir           bak    Cyrillic
Turkmen           tuk    Latin
Crimean Tatar     crh    Latin
S. Azerbaijani    azb    Perso-Arabic
Chuvash           chv    Cyrillic
Sakha (Yakut)     sah    Cyrillic
Gagauz            gag    Latin
Karakalpak        kaa    Latin
+ 6 more                 Cyrillic

Apertium morphological analysis covers all 21 languages. Neural models (POS tagging, lemmatization, dependency parsing) require UD treebank data; languages without a treebank currently rely on rule-based tokenization and Apertium morphology. NLLB-200 embeddings and translation cover the 11 languages listed in the previous section. Contributions welcome!

Design

Built to be extended

TurkicNLP follows a Stanza-inspired architecture. Every processor declares what it PROVIDES and REQUIRES — the pipeline resolves dependencies automatically.

Script-aware from the ground up: models are keyed by lang/script/processor/backend. The pipeline auto-detects scripts, inserts transliteration steps where needed, and bridges models across writing systems.

Apertium FST data is GPL-3.0 and always downloaded separately — never bundled in the Apache 2.0 pip package.

# Pipeline execution order
script_detect → transliterate
  → tokenize → mwt → morph
  → pos → lemma → depparse
  → ner → embeddings
  → sentiment → translate

# Model storage layout
~/.turkicnlp/models/
  tur/Latn/morph/apertium/
  kaz/Cyrl/tokenize/stanza/
  huggingface/
    facebook--nllb-200-distilled-600M/
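
To make the PROVIDES/REQUIRES contract concrete, here is a hypothetical processor sketch. The class shape, attribute names, and the sentiment example are illustrative assumptions, not the library's actual internal API.

# Hypothetical sketch of a processor declaring its contract
class SentimentProcessor:
    name = "sentiment"
    provides = {"sentiment"}         # annotations this processor adds to the document
    requires = {"tokenize", "pos"}   # annotations that must already be present

    def process(self, doc):
        for sentence in doc.sentences:
            # ... score polarity here; a placeholder label keeps the sketch runnable ...
            sentence.sentiment = "neutral"
        return doc

Given declarations like these, the pipeline can order the requested processors and insert any missing prerequisites, which is what produces the execution order shown above.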
Contributing

Help us cover more languages

TurkicNLP is a community effort. There are many ways to contribute — no matter your background.

🐛 Report bugs

Found an issue? Open a GitHub issue with your language, input text, and the error. Reproducible reports are hugely helpful.

🌍 Add a language

Have an Apertium FST or UD treebank for a Turkic language not yet covered? Add a catalog entry and open a PR.

🧪 Improve tag mappings

Apertium → UD tag mappings exist for Turkish, Kazakh, and Tatar. Help us add mappings for Uzbek, Azerbaijani, Kyrgyz, and others; a small illustrative sketch follows this list.

📊 Share datasets

Do you have annotated data, parallel corpora, or evaluation benchmarks for Turkic languages? Let us know or link them in the catalog.

🧠 Train neural models

Help train POS taggers, dependency parsers, or NER models for under-resourced languages. Training scripts are in the repo.

📝 Contribute to the book

The open book needs reviewers, code examples, and native speaker insights. See the book repository for how to get involved.
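
As noted in the tag-mapping item above, a mapping is essentially a lookup from Apertium tags to UD categories. A minimal illustrative excerpt; the project's actual mapping format and coverage may differ.

# Illustrative Apertium → UD UPOS mapping excerpt (not the project's full table)
APERTIUM_TO_UPOS = {
    "n": "NOUN",        # common noun
    "np": "PROPN",      # proper noun
    "v": "VERB",
    "adj": "ADJ",
    "adv": "ADV",
    "prn": "PRON",
    "num": "NUM",
    "cnjcoo": "CCONJ",  # coordinating conjunction
    "post": "ADP",      # postposition
}

def map_tag(apertium_tag: str) -> str:
    """Map one Apertium tag to a UD UPOS value, defaulting to X for unknown tags."""
    return APERTIUM_TO_UPOS.get(apertium_tag, "X")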

How to get started

1. Fork & clone the repository

$ git clone https://github.com/turkic-nlp/turkicnlp

2. Install in development mode

$ pip install -e ".[dev,stanza,nllb]"

3. Run the tests

$ pytest turkicnlp/tests/

4. Open a pull request

All PRs are reviewed. Small, focused changes merge faster. When in doubt, open an issue first to discuss the approach.

Licensing

Open source, with clear boundaries

The TurkicNLP library is Apache 2.0 — use it freely in research and commercial projects.

Apertium FST data (morphological analyzers) is GPL-3.0-or-later. It is downloaded separately at runtime and is never bundled in the pip package. Each language's data directory includes its license file.

Stanza models and NLLB-200 are used under their respective licenses (Apache 2.0 and CC-BY-NC-4.0).

Quick reference

turkicnlp (pip)    Apache 2.0
Apertium FSTs      GPL-3.0-or-later
Stanza models      Apache 2.0
NLLB-200           CC-BY-NC-4.0
Support the project

Keep Turkic NLP open & growing

If TurkicNLP saves you time or powers your research, consider supporting its development. Every contribution helps fund model training, new language coverage, and the open book.

The book will always be free. Donations help make that sustainable.