PyData & PyCon Yerevan 2026

Talk

Cleaning Large Text Corpora, or Why CulturaX Is Uncultured

Track: Data Science | Duration: 50 minutes

LLMs, NLP, GPU Computing, Python, Statistics

Session outline:

1) Do we really need as much data as possible for LLM training? A brief look at recent research indicating that data quality, rather than sheer volume, is the primary bottleneck in LLM performance (a small curated dataset may lead to better performance than larger, "rawer" ones).

2) CulturaX: Why even datasets marketed as 'cleaned' contain significant noise, and why relying on them blindly can degrade model performance.

3) The English-Centric Bias: An overview of existing tools like RedPajama and C4, and why they often fail when applied to other languages.

4) Defining the Target: What does 'clean' data actually look like for a production-grade LLM?

5) Technical Deep-Dive: The 4-Stage Cleaning Pipeline (what the stages are and why each one is needed)

6) [Phase 1] Artifact Removal: Identifying and purging parsing debris using transformer-based models trained on synthetic 'noisy' data.

7) [Phase 2] Heuristic & Statistical Filtering: Adapting RedPajama-style filters to the linguistic nuances of non-English text.

8) [Phase 3] Fuzzy Deduplication with LSH + MinHash: Implementing a pipeline that remains surprisingly effective even on supposedly 'pre-deduplicated' public datasets.

9) [Phase 4] Semantic Deduplication & Clustering (domain-balancing): Using e5 and BIRCH to cluster billions of samples and remove redundant information based on meaning, not just syntax.

10) Advanced Filtering: A difficult yet highly effective extra stage: training custom text-quality classifiers for factuality and integrity. We will discuss data sourcing and how to optimize neural network inference when processing massive corpora to avoid "million-year" processing times.

11) The results: How corpus cleaning impacted our model quality and training speed, while significantly reducing GPU compute hours.
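
As a rough illustration of the Phase 3 idea (item 8), fuzzy deduplication with MinHash and LSH can be sketched in pure Python. This is a minimal sketch under stated assumptions, not the pipeline presented in the talk: the shingle size, the signature length (NUM_PERM), the band count (BANDS), and the 0.7 similarity threshold are assumed values, and all function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64             # signature length (assumed value)
BANDS = 16                # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS  # signature rows per band


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as one 'permutation'."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum shingle hash for every seed."""
    sh = shingles(text)
    if not sh:
        return None  # nothing to hash (empty or punctuation-only text)
    return tuple(min(seeded_hash(s, seed) for s in sh) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """LSH banding: documents sharing any band bucket become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.7):
    """Return sorted index pairs whose estimated similarity meets the threshold."""
    signatures = {}
    for i, doc in enumerate(docs):
        sig = minhash(doc)
        if sig is not None:
            signatures[i] = sig
    return sorted(
        (a, b)
        for a, b in candidate_pairs(signatures)
        if estimated_jaccard(signatures[a], signatures[b]) >= threshold
    )
```

Calling near_duplicates on a list of documents returns index pairs that are likely near-copies; one document from each pair can then be dropped. The banding step is what keeps this tractable: only documents that collide in at least one band are compared, instead of every pair. At corpus scale, a dedicated library and distributed execution would presumably replace this single-process version.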

Target Audience: ML Engineers and Data Scientists, especially those who train LLMs for low-resource languages (including Armenian).

Prerequisites: Familiarity with Python and basic NLP concepts. No prior experience with terabyte-scale data processing is required.

About the Speaker

Senior Data Scientist with 5+ years of experience building scalable NLP systems. Currently, I am part of the LLM development team at Tochka Bank, where I lead initiatives for training data quality and LLM guardrails. As an active speaker and program committee member for several Python and ML conferences, I am deeply passionate about data and the tech community.
Over the last two years, my team and I built an in-house LLM (based on Qwen235b) that performs as well as GPT-4.1 on our task set, resulting in significant cost savings for the company. Through this engineering journey, I have gained deep expertise in LLM language adaptation, data curation, and LLM evaluation/benchmarking. I am always open to networking and sharing knowledge, so feel free to reach out!

Recording

Video will be available after the conference.

Elizaveta Afanaseva
