Open · Free · Coming Soon

NLP for Turkic Languages

A practical, open-access textbook covering foundations through modern LLMs — with exercises, code examples, and linguistic depth for 20+ Turkic languages.

9 Chapters · Free & Open · Code Examples · Exercises
📖

Coming Soon

The book is actively being written. Chapter 1 is in first draft. Sign up or watch the GitHub repository to be notified at launch.

Watch on GitHub

Free forever. Because knowledge about under-resourced languages should be open.

About the book

A practical guide — from raw text to modern LLMs

Written for NLP researchers, practitioners, and graduate students who want to work with Turkic languages. No prior knowledge of Turkic linguistics required.

🧠

Foundations First

Each chapter builds on the last — starting from scripts and encodings, through morphology and syntax, all the way to large language models and generative AI.

💻

Hands-on Code

Every chapter includes working code examples using the TurkicNLP toolkit. A companion repository hosts full runnable notebooks for each tutorial.

🌍

20+ Languages

Examples span Turkish, Kazakh, Uzbek, Azerbaijani, Kyrgyz, Uyghur, Tatar, and more — not just Turkish, the family's best-resourced language.

🔬

Research-Backed

Grounded in published research with proper citations. Covers state-of-the-art models, benchmarks, datasets, and open problems at the frontier of Turkic NLP.

📝

Exercises

Each chapter ends with exercises ranging from conceptual questions to implementation challenges — suitable for self-study and university courses.

♾️

Free Forever

The book is published online under an open license. PDF and print editions may follow, but the web version will always be free and open.

Table of contents

9 chapters, ~350 pages

From the basics of the Turkic family to cutting-edge generative AI — structured for sequential reading or chapter-by-chapter reference.

01
First draft complete

Introduction — The Linguistic and Computational Landscape

What is NLP and why Turkic languages matter. The Turkic language family: 200M+ speakers, 24+ languages, geographic spread from Turkey to Siberia. Core computational challenges: agglutinative morphology, vowel harmony, script diversity, pro-drop.
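Two of these challenges can be made concrete in a few lines. The sketch below is plain Python and independent of the book's toolkit: Turkish two-way vowel harmony selects between the plural suffixes -ler and -lar based on the word's last vowel, while agglutination stacks such suffixes into a single word (e.g. ev-ler-imiz-den, "from our houses").

```python
# Illustrative sketch (plain Python, not the book's TurkicNLP toolkit):
# Turkish two-way vowel harmony picks the plural suffix -ler or -lar
# from the last vowel of the stem. Agglutination then stacks further
# suffixes, e.g. ev-ler-imiz-den = house-PLURAL-our-FROM, "from our houses".
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")

def pluralize(word: str) -> str:
    """Append -ler after a front vowel, -lar after a back vowel."""
    for ch in reversed(word.lower()):
        if ch in FRONT_VOWELS:
            return word + "ler"
        if ch in BACK_VOWELS:
            return word + "lar"
    return word + "ler"  # fallback for rare vowel-less loanwords

print(pluralize("ev"))     # evler    ("houses")
print(pluralize("at"))     # atlar    ("horses")
print(pluralize("göz"))    # gözler   ("eyes")
print(pluralize("kitap"))  # kitaplar ("books")
```

This toy rule deliberately ignores four-way harmony for high-vowel suffixes and consonant alternations; the morphology chapter covers the full system.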

02
Planned

Script, Orthography, and the Encoding Frontier

03
Planned

Phonology and Speech Technologies

04
Planned

Computational Morphology — The Core of Turkic NLP

05
Planned

Syntax and Universal Dependencies

06
Planned

Semantics and Lexical Resources

07
Planned

Machine Translation for Turkic Languages

08
Planned

Large Language Models and Generative AI

09
Planned

Regional Focus, Ethics, and Future Directions

Audience

Who is this book for?

🎓

Researchers & Graduate Students

A comprehensive reference covering published work, benchmarks, open problems, and research directions. Properly cited throughout — suitable as a course textbook or self-study guide.

Prerequisites: Basic Python, familiarity with machine learning concepts. No prior knowledge of Turkic languages required.

⚙️

Practitioners & Developers

Hands-on tutorials in every chapter using the TurkicNLP toolkit. Real code that runs. Focused on what works in production — not just theory.

Prerequisites: Python and basic NLP concepts (tokenization, embeddings). Each chapter is self-contained enough to jump in anywhere.

Stay updated

Be the first to know

The book is actively being written. Watch the GitHub repository to be notified when chapters are published — or follow on social media.

Star & Watch on GitHub

Star the repository and set notifications to "Watching" to be alerted when new chapters land.

Go to GitHub →
🔧

Use the Toolkit Now

The companion Python toolkit is already available. Install it and start working with Turkic languages while the book is being written.

Get the Toolkit →
Support the project

Help keep the book free

The book and the toolkit are both free and open-source. Writing an open textbook of this scope takes hundreds of hours. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.

Funds go directly toward time spent researching and writing, compute for model experiments referenced in the book, and infrastructure for hosting.