Toolkit · v0.1.1

The TurkicNLP Toolkit

A modular, pip-installable Python library for NLP across the Turkic language family. Morphology, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation — in one unified pipeline.

21 languages · Python 3.9+ · Apache 2.0 · CoNLL-U native


Installation

Core library — morphology + tokenization:

$ pip install turkicnlp

With neural models (Stanza + Transformers):

$ pip install "turkicnlp[stanza,nllb]"

Apertium FST data (GPL-3.0) is downloaded separately at runtime — it is never bundled in the pip package.

Quick start

Up and running in three steps

Install the library

Install from PyPI with optional extras depending on which processors you need.

$ pip install "turkicnlp[stanza,nllb]"

Download models for your language

Models are fetched on demand and cached in ~/.turkicnlp/models/.

import turkicnlp

# Download Stanza + Apertium models for Turkish
turkicnlp.download("tur", processors=["tokenize", "morph", "pos", "lemma", "depparse"])

Build a pipeline and annotate text

Processors chain automatically in the correct order.

nlp = turkicnlp.Pipeline("tur", processors=["tokenize", "pos", "lemma", "depparse"])
doc = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:<16} {word.upos:<8} {word.lemma}")

# Sample output (abridged; the rows for "ve" and "." are omitted):
# Bugün            ADV      bugün
# hava             NOUN     hava
# çok              ADV      çok
# güzel            ADJ      güzel
# parkta           NOUN     park
# yürüyüş          NOUN     yürüyüş
# yaptım           VERB     yap
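
Since the toolkit is described as CoNLL-U native, it may help to see how the three printed columns fit into a CoNLL-U token line. The sketch below is self-contained and does not call turkicnlp; `to_conllu` is a hypothetical helper, not the library's export API, and every column not shown in the output above is left as the `_` placeholder.

```python
# Hypothetical helper for illustration; not part of the turkicnlp API.
def to_conllu(rows):
    """Render (form, upos, lemma) triples as CoNLL-U token lines.

    Columns that are unknown here (XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)
    are filled with "_", the CoNLL-U placeholder for unspecified values.
    """
    lines = []
    for idx, (form, upos, lemma) in enumerate(rows, start=1):
        # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(idx), form, lemma, upos, "_", "_", "_", "_", "_", "_"]
        lines.append("\t".join(cols))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


# The first three rows of the sample output above.
rows = [("Bugün", "ADV", "bugün"), ("hava", "NOUN", "hava"), ("çok", "ADV", "çok")]
conllu = to_conllu(rows)
print(conllu, end="")
```

Note that CoNLL-U places LEMMA before UPOS, the reverse of the print order used in the quick start loop.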

Machine translation

NLLB-200 translation & embeddings

# Translate Kazakh → English
nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["translate"],
    translate_tgt_lang="eng"
)
doc = nlp("Бүгін ауа райы өте жақсы.")
print(doc.translation)
# The weather is very good today.

# Sentence embeddings for semantic similarity
embed = turkicnlp.Pipeline("tur", processors=["embeddings"])
doc = embed("Parkta yürüyüş çok güzeldi.")
print(len(doc.embedding))   # 1024-dim vector
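
For semantic similarity, two sentence vectors are usually compared with cosine similarity. The sketch below assumes that `doc.embedding` can be treated as a plain sequence of floats (the source does not state its type); toy 3-dimensional vectors stand in for the 1024-dimensional ones, and `cosine_similarity` is an illustrative helper rather than a turkicnlp function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors; real sentence embeddings would have 1024 entries.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

print(round(cosine_similarity(v1, v2), 6))  # 1.0
print(round(cosine_similarity(v1, v3), 6))  # 0.0
```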

Supported languages for MT & Embeddings

These 11 languages have full NLLB-200 support for both translation and semantic embeddings:

Turkish (tur)
Kazakh (kaz)
Kyrgyz (kir)
Uzbek (uzb)
Azerbaijani (aze)
Tatar (tat)
Bashkir (bak)
Turkmen (tuk)
Uyghur (uig)
Crimean Tatar (crh)
S. Azerbaijani (azb)
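
NLLB-200 itself identifies languages by FLORES-200 codes (language plus script), not by the three-letter codes listed above. The table below is a sketch of how the two could be related; the FLORES-200 codes are the ones published with NLLB-200, but the mapping as a whole and the `to_flores` helper are assumptions for illustration, and the library's internal mapping may differ.

```python
# Assumed mapping from the three-letter codes above to FLORES-200 codes.
FLORES_CODES = {
    "tur": "tur_Latn",
    "kaz": "kaz_Cyrl",
    "kir": "kir_Cyrl",
    "uzb": "uzn_Latn",  # FLORES-200 lists Northern Uzbek
    "aze": "azj_Latn",  # FLORES-200 lists North Azerbaijani
    "tat": "tat_Cyrl",
    "bak": "bak_Cyrl",
    "tuk": "tuk_Latn",
    "uig": "uig_Arab",
    "crh": "crh_Latn",
    "azb": "azb_Arab",
}


def to_flores(code):
    """Return the FLORES-200 code for a three-letter language code."""
    try:
        return FLORES_CODES[code]
    except KeyError:
        raise ValueError(f"no NLLB-200 code known for {code!r}") from None


print(to_flores("kaz"))
```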

Pipeline processors

Modular pipeline architecture

Request only the processors you need. Dependencies are resolved automatically.

tokenize

Tokenization

Neural (Stanza) or rule-based. Handles multi-word tokens, sentence splitting.

morph

Morphological Analysis

Apertium HFST FSTs for 21 languages. Lemma, POS, and full morphological features.

pos

POS Tagging

Stanza neural tagger trained on UD treebanks. Outputs UPOS, XPOS, and features.

lemma

Lemmatization

Stanza neural lemmatizer for tur, kaz, kir, uig. Apertium fallback for others.

depparse

Dependency Parsing

Biaffine attention parser via Stanza. UD-compatible dependency relations.

ner

Named Entities

NER processor for entity extraction. BIO tagging format. (In development)

embeddings

Sentence Embeddings

NLLB-200 encoder states. 1024-dim multilingual vectors for semantic search and similarity.

translate

Machine Translation

NLLB-200 sequence-to-sequence generation. Any Turkic language → any FLORES-200 language.
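
The ner processor above is described as using the BIO tagging format. As a reminder of what that format encodes, here is a minimal, library-independent sketch that groups BIO tags into entity spans. The label names (`PER`, `LOC`), the sample sentence, and the `bio_to_spans` helper are assumptions for illustration; the processor's actual label set is not stated.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Mustafa", "Kemal", "Ankara", "şehrine", "gitti"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))
# [('Mustafa Kemal', 'PER'), ('Ankara', 'LOC')]
```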

Language support matrix

What works for each language

✓ available · ~ in development · – not yet available

Language	Code	Script	Tokenize	Morphology	POS / Dep	Embeddings	Translation
Turkish	tur	Latin	✓	✓	✓	✓	✓
Kazakh	kaz	Cyrillic	✓	✓	✓	✓	✓
Kyrgyz	kir	Cyrillic	✓	✓	✓	✓	✓
Uyghur	uig	Perso-Arabic	✓	✓	✓	✓	✓
Uzbek	uzb	Latin	~	✓	~	✓	✓
Azerbaijani	aze	Latin	~	✓	~	✓	✓
Tatar	tat	Cyrillic	~	✓	–	✓	✓
Bashkir	bak	Cyrillic	~	✓	–	✓	✓
Turkmen	tuk	Latin	~	✓	–	✓	✓
Crimean Tatar	crh	Latin	~	✓	–	✓	✓
S. Azerbaijani	azb	Perso-Arabic	✓	✓	–	✓	✓
Chuvash	chv	Cyrillic	~	✓	–	–	–
Sakha (Yakut)	sah	Cyrillic	~	✓	–	–	–
Gagauz	gag	Latin	~	✓	–	–	–
Karakalpak	kaa	Latin	~	✓	–	–	–
+ 6 more	…	Cyrillic	~	✓	–	–	–

In the Tokenize column, ~ means a rule-based tokenizer is available. Neural models require treebank data. Contributions welcome!

Design

Built to be extended

TurkicNLP follows a Stanza-inspired architecture. Every processor declares what it PROVIDES and REQUIRES — the pipeline resolves dependencies automatically.
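
One common way to resolve PROVIDES/REQUIRES declarations is a topological sort over the processors. The sketch below uses the standard-library `graphlib` module (available from Python 3.9); the `PROCESSORS` table, the feature names, and `resolve_order` are hypothetical and are not taken from the turkicnlp source.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical declarations: what each processor provides and requires.
PROCESSORS = {
    "tokenize": {"provides": {"tokens"}, "requires": set()},
    "pos": {"provides": {"upos"}, "requires": {"tokens"}},
    "lemma": {"provides": {"lemma"}, "requires": {"tokens", "upos"}},
    "depparse": {"provides": {"deps"}, "requires": {"tokens", "upos", "lemma"}},
}


def resolve_order(requested):
    """Return the requested processors plus their dependencies, in run order."""
    # Map each feature to the processor that provides it.
    provider = {
        feature: name
        for name, spec in PROCESSORS.items()
        for feature in spec["provides"]
    }
    graph = {}
    pending = list(requested)
    while pending:
        name = pending.pop()
        if name in graph:
            continue
        deps = {provider[feature] for feature in PROCESSORS[name]["requires"]}
        graph[name] = deps       # node -> processors that must run first
        pending.extend(deps)     # pull in transitive dependencies
    return list(TopologicalSorter(graph).static_order())


print(resolve_order(["depparse"]))
# ['tokenize', 'pos', 'lemma', 'depparse']
```

In this toy table, requesting only `depparse` pulls in the three processors it depends on, which is the behaviour the paragraph above describes.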

Script-aware from the ground up: models are keyed by lang/script/processor/backend. The pipeline auto-detects scripts, inserts transliteration steps where needed, and bridges models across writing systems.
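
Script detection can be approximated with the Unicode character database alone. The following sketch counts letters by the first word of their Unicode name and returns the majority script as an ISO 15924 code; `detect_script` and the three-script table are assumptions for illustration, and the library's own detector may work differently.

```python
import unicodedata
from collections import Counter

# ISO 15924 script codes keyed by the first word of the Unicode character name.
NAME_PREFIX_TO_SCRIPT = {"LATIN": "Latn", "CYRILLIC": "Cyrl", "ARABIC": "Arab"}


def detect_script(text):
    """Return the majority script of the letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(detect_script("Bugün hava çok güzel"))       # Latn
print(detect_script("Бүгін ауа райы өте жақсы."))  # Cyrl
```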

Apertium FST data is GPL-3.0 and always downloaded separately — never bundled in the Apache 2.0 pip package.

# Pipeline execution order
script_detect → transliterate
  → tokenize → mwt → morph
  → pos → lemma → depparse
  → ner → embeddings
  → sentiment → translate

# Model storage layout
~/.turkicnlp/models/
  tur/Latn/morph/apertium/
  kaz/Cyrl/tokenize/stanza/
  huggingface/
    facebook--nllb-200-distilled-600M/
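
The layout above can be expressed as a small path builder. This is a sketch with `pathlib`; `model_dir` and `huggingface_dir` are hypothetical helper names, and the `/` to `--` substitution is inferred from the single directory name shown above.

```python
from pathlib import Path

# Cache root as documented above.
MODEL_ROOT = Path.home() / ".turkicnlp" / "models"


def model_dir(lang, script, processor, backend, root=MODEL_ROOT):
    """Directory for one model, keyed by lang/script/processor/backend."""
    return root / lang / script / processor / backend


def huggingface_dir(repo_id, root=MODEL_ROOT):
    """Directory for a Hugging Face model; '/' in the repo id becomes '--'."""
    return root / "huggingface" / repo_id.replace("/", "--")


print(model_dir("tur", "Latn", "morph", "apertium"))
print(huggingface_dir("facebook/nllb-200-distilled-600M"))
```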

Contributing

Help us cover more languages

TurkicNLP is a community effort. There are many ways to contribute — no matter your background.

🐛

Report bugs

Found an issue? Open a GitHub issue with your language, input text, and the error. Reproducible reports are hugely helpful.

🌍

Add a language

Have an Apertium FST or UD treebank for a Turkic language not yet covered? Add a catalog entry and open a PR.

🧪

Improve tag mappings

Apertium → UD tag mappings exist for Turkish, Kazakh, and Tatar. Help us add mappings for Uzbek, Azerbaijani, Kyrgyz, and others.

📊

Share datasets

Do you have annotated data, parallel corpora, or evaluation benchmarks for Turkic languages? Let us know or link them in the catalog.

🧠

Train neural models

Help train POS taggers, dependency parsers, or NER models for under-resourced languages. Training scripts are in the repo.

📝

Contribute to the book

The open book needs reviewers, code examples, and native speaker insights. See the book repository for how to get involved.
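
For contributors interested in the tag mappings item above, the core of an Apertium to UD mapping is a lookup from the first Apertium tag of an analysis to a UPOS value. The sketch below covers a small illustrative subset of Apertium tags; the table, the function names, and the fallback to `X` are assumptions and do not reproduce the mappings shipped in the repository.

```python
# Illustrative subset of Apertium part-of-speech tags and their UPOS values.
APERTIUM_TO_UPOS = {
    "n": "NOUN",
    "np": "PROPN",
    "v": "VERB",
    "adj": "ADJ",
    "adv": "ADV",
    "prn": "PRON",
    "num": "NUM",
    "post": "ADP",
    "cnjcoo": "CCONJ",
}


def parse_analysis(analysis):
    """Split an Apertium-style analysis such as 'park<n><loc>' into lemma and tags."""
    lemma, _, rest = analysis.partition("<")
    tags = rest.rstrip(">").split("><") if rest else []
    return lemma, tags


def to_upos(analysis):
    """Return (lemma, UPOS); unknown or missing tags map to 'X'."""
    lemma, tags = parse_analysis(analysis)
    upos = APERTIUM_TO_UPOS.get(tags[0], "X") if tags else "X"
    return lemma, upos


print(to_upos("park<n><loc>"))               # ('park', 'NOUN')
print(to_upos("yap<v><tv><past><p1><sg>"))   # ('yap', 'VERB')
```

A full mapping also has to translate the remaining tags (case, tense, person, number) into UD morphological features, which is where most of the language-specific work lies.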

How to get started

Fork & clone the repository

$ git clone https://github.com/turkic-nlp/turkicnlp

Install in development mode

$ pip install -e ".[dev,stanza,nllb]"

Run the tests

$ pytest turkicnlp/tests/

Open a pull request

All PRs are reviewed. Small, focused changes merge faster. When in doubt, open an issue first to discuss the approach.

Licensing

Open source, with clear boundaries

The TurkicNLP library is Apache 2.0 — use it freely in research and commercial projects.

Apertium FST data (morphological analyzers) is GPL-3.0-or-later. It is downloaded separately at runtime and is never bundled in the pip package. Each language's data directory includes its license file.

Stanza models and NLLB-200 are used under their respective licenses (Apache 2.0 and CC-BY-NC-4.0).

Quick reference

turkicnlp (pip)	Apache 2.0
Apertium FSTs	GPL-3.0-or-later
Stanza models	Apache 2.0
NLLB-200	CC-BY-NC-4.0

Support the project

Keep Turkic NLP open & growing

If TurkicNLP saves you time or powers your research, consider supporting its development. Every contribution helps fund model training, new language coverage, and the open book.

💙 Donate via PayPal · 🎁 Support on Patreon

The book will always be free. Donations help make that sustainable.
