A practical, pip-installable Python toolkit and an open textbook — built together so you can go from raw text to annotated output across the Turkic language family.
One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.
Works with Turkish, Kazakh, Uzbek, Kyrgyz, Uyghur, Tatar, and 15 more Turkic languages.
A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.
Supports 21 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.
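A first run could look like the sketch below. The package and class names (`turkic_nlp`, `Pipeline`) are placeholders inferred from the Stanza-inspired design, not a confirmed API:

```python
# Hypothetical quickstart: `turkic_nlp` and the Pipeline API are
# placeholders modeled on the Stanza-style design described above.
from turkic_nlp import Pipeline

nlp = Pipeline("kk")  # Kazakh, with the default processor chain
doc = nlp("Мен кітап оқып отырмын.")  # "I am reading a book."
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.deprel)
```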
NLP for Turkic Languages is a practical, open-access textbook covering foundations through modern LLMs — with exercises, code examples, and linguistic depth for every chapter.
Freely available online. Because knowledge about under-resourced languages should be open.
Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
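As a sketch of what processor selection could look like (the comma-separated string follows Stanza's convention, but the processor names and serializers here are assumptions):

```python
from turkic_nlp import Pipeline  # hypothetical package name

# Chain only the processors you need; names are illustrative.
nlp = Pipeline("tr", processors="tokenize,pos,depparse")
doc = nlp("Ankara Türkiye'nin başkentidir.")

print(doc.to_conllu())  # assumed CoNLL-U serializer
print(doc.to_json())    # assumed JSON serializer
```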
Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
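Script handling might be exposed along these lines; the helper names below are illustrative guesses, not the toolkit's confirmed interface:

```python
from turkic_nlp.scripts import detect_script, transliterate  # hypothetical

detect_script("Mən kitab oxuyuram.")  # -> "Latn" (assumed return value)
detect_script("Мен кітап оқимын.")    # -> "Cyrl"

# Convert Uzbek Cyrillic to the current Latin orthography.
transliterate("Мен китоб ўқияпман.", src="Cyrl", dst="Latn", lang="uz")
```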
Apertium HFST finite-state transducers for 21 languages, loaded natively via Python. No system Apertium install required.
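A morphology lookup could read like this; the `Analyzer` class is an assumed wrapper around the bundled transducers, while the tag string in the comment reflects the real Apertium output format:

```python
from turkic_nlp.morph import Analyzer  # hypothetical module path

analyzer = Analyzer("kk")  # loads the bundled Apertium HFST transducer
for analysis in analyzer.analyze("кітаптар"):  # "books"
    print(analysis)  # e.g. Apertium-style "кітап<n><pl><nom>"
```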
Stanza neural models trained on Universal Dependencies treebanks for Turkish, Kazakh, Kyrgyz, Uyghur, and Ottoman Turkish.
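These are standard Stanza models, so loading one directly with the stanza library works as usual; the toolkit presumably wraps pipelines like this behind its own interface:

```python
import stanza

stanza.download("tr")        # fetch the Turkish UD models
nlp = stanza.Pipeline("tr")  # default chain: tokenize through depparse

doc = nlp("Türkçe, Türk dillerinin en çok konuşulanıdır.")
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.head, word.deprel)
```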
NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.
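The translation model is the published facebook/nllb-200-distilled-600M checkpoint, so the underlying call is plain Hugging Face; how the toolkit wraps it is not shown here:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="tur_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Turkish -> Kazakh: force the target language token at decode time.
inputs = tokenizer("Bugün hava çok güzel.", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kaz_Cyrl"),
    max_length=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```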
Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
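A round trip might look like the following; the function names are assumptions, though the .conllu format itself is the UD standard:

```python
from turkic_nlp.conll import read_conllu, write_conllu  # hypothetical

docs = read_conllu("kk_ktb-ud-test.conllu")  # import a UD treebank
print(docs[0].sentences[0].words[0].text)    # inspect the first token
write_conllu(docs, "annotated.conllu")       # export for UD tooling
```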
MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
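The exact knobs aren't documented on this page, so the keyword arguments below are guesses at how GPU use, tag mapping, and batching might be switched on:

```python
from turkic_nlp import Pipeline  # hypothetical package, as above

# Argument names are illustrative, not confirmed options.
nlp = Pipeline("ug", use_gpu=True, map_tags="ud")

texts = [line.strip() for line in open("corpus.txt", encoding="utf-8")]
docs = nlp.bulk_process(texts)  # assumed batch entry point
```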
From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.
Both the toolkit and the book are free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.
Funds go directly toward model training, infrastructure, and time spent writing the open book.