shipped · 2024-06 → 2025-08

Yandex Translate Tuvan digitalization

Automated parallel-sentence matching with YandexGPT to expand training data for Tuvan in Yandex Translate. A personally meaningful project on a low-resource language.

Developer

Tuvan is severely underrepresented in machine translation. Hand-curating parallel sentences for training is the bottleneck, and the existing corpus is small.

The pipeline I built uses YandexGPT (Yandex’s ChatGPT-equivalent) to generate aligned parallel sentence pairs from monolingual text, with quality filters to keep noise out of the training set. The output fed Yandex Translate, the translation product of the largest Russian-language IT company.
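As an illustration only, a quality-filter stage of this kind could look like the minimal Python sketch below. The specific checks (minimum length, untranslated copies, length ratio, exact duplicates), the thresholds, and the names `Pair`, `keep_pair`, and `filter_pairs` are all assumptions made for the example, not the filters the production pipeline used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One candidate sentence pair (hypothetical structure)."""
    source: str  # sentence in the low-resource language
    target: str  # candidate translation produced by the model


def keep_pair(pair: Pair, max_len_ratio: float = 2.5, min_chars: int = 3) -> bool:
    """Cheap heuristics for rejecting obviously noisy pairs.

    Both thresholds are illustrative defaults, not tuned values.
    """
    src, tgt = pair.source.strip(), pair.target.strip()
    # Reject empty or near-empty sides.
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    # Reject "translations" that merely copy the input.
    if src.casefold() == tgt.casefold():
        return False
    # Reject pairs whose character lengths differ too much to be aligned.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_len_ratio


def filter_pairs(pairs):
    """Keep the first occurrence of each pair that passes keep_pair."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair.source.strip().casefold(), pair.target.strip().casefold())
        if key in seen or not keep_pair(pair):
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```

Heuristics like these are cheap enough to run before any model-based scoring, so more expensive checks only see pairs that already look plausible.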

Tuvan is my heritage language. Building this was 14 months of work for a population of ~280,000 native speakers, almost none of whom expect to ever see their language in a major translation product. That’s the part that mattered.