Talk
Session outline:
1) Do we really need as much data as possible for LLM training? A brief look at recent research showing that data quality, not volume, is the primary bottleneck in LLM performance (a small curated dataset can outperform larger, "rawer" ones).
2) CulturaX: Why even datasets marketed as 'cleaned' contain significant noise, and why relying on them blindly can degrade model performance.
3) The English-Centric Bias: An overview of existing datasets and pipelines such as RedPajama and C4, and why their cleaning rules often fail when applied to other languages.
4) Defining the Target: What does 'clean' data actually look like for a production-grade LLM?
5) Technical Deep-Dive: The 4-Stage Cleaning Pipeline (what the stages are and why each one is needed).
6) [Phase 1] Artifact Removal: Identifying and purging parsing debris using transformer-based models trained on synthetic 'noisy' data.
7) [Phase 2] Heuristic & Statistical Filtering: Adapting RedPajama-style filters to the linguistic nuances of non-English text.
8) [Phase 3] Fuzzy Deduplication with LSH + MinHash: Implementing a pipeline that remains surprisingly effective even on supposedly 'pre-deduplicated' public datasets.
9) [Phase 4] Semantic Deduplication & Clustering (domain-balancing): Using e5 and BIRCH to cluster billions of samples and remove redundant information based on meaning, not just syntax.
10) Advanced Filtering: Training custom text-quality classifiers for factuality and integrity, a difficult yet highly effective extra stage. We will discuss data sourcing and how to optimize neural network inference on massive corpora to avoid "million-year" processing times.
11) The Results: How corpus cleaning affected our model quality and training speed while significantly reducing GPU compute hours.
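
To give a flavour of the heuristic stage in item 7, here is a minimal Python sketch of script-agnostic document statistics. The function names and every threshold are illustrative assumptions, not the filters or values used in the talk; real RedPajama-style rule sets are larger and need tuning per language.

```python
import unicodedata


def heuristic_stats(text: str) -> dict:
    """Compute a few simple, script-agnostic quality signals for one document."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Unicode category "L*" covers letters in any script (Latin, Armenian, ...),
    # which avoids the ASCII-only assumption baked into many English filters.
    letters = sum(1 for ch in text if unicodedata.category(ch).startswith("L"))
    return {
        "n_words": len(words),
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "letter_ratio": letters / len(text) if text else 0.0,
        "dup_line_ratio": 1 - len(set(lines)) / len(lines) if lines else 0.0,
    }


def keep(
    text: str,
    min_words: int = 5,                # hypothetical threshold
    max_mean_word_len: float = 15.0,   # hypothetical; agglutinative languages may need more
    min_letter_ratio: float = 0.6,     # hypothetical threshold
    max_dup_line_ratio: float = 0.3,   # hypothetical threshold
) -> bool:
    """Return True if the document passes all heuristic checks."""
    s = heuristic_stats(text)
    return (
        s["n_words"] >= min_words
        and s["mean_word_len"] <= max_mean_word_len
        and s["letter_ratio"] >= min_letter_ratio
        and s["dup_line_ratio"] <= max_dup_line_ratio
    )
```

Counting letters by Unicode category rather than with an `[a-zA-Z]` pattern is one small example of the adaptation the outline refers to: an ASCII-based letter ratio would reject nearly every Armenian document.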
Target Audience: ML engineers and data scientists, especially those training LLMs for low-resource languages (including Armenian).
Prerequisites: Familiarity with Python and basic NLP concepts. No prior experience with terabyte-scale data processing is required.
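
For readers new to the fuzzy deduplication in item 8, the core idea of MinHash with LSH banding fits in a few lines of standard-library Python. This is a toy sketch under assumed parameters (character 5-grams, 64 hash functions, 16 bands), not the pipeline presented in the talk; at corpus scale one would typically use a dedicated library such as datasketch instead.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, k: int = 5) -> set:
    """Character k-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text: str, num_perm: int = 64) -> list:
    """MinHash signature: for each salted hash function, keep the minimum hash."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt simulates a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures: dict, bands: int = 16) -> set:
    """LSH banding: documents sharing any identical band become candidate duplicates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "c": "Completely unrelated text about cooking pasta with tomato sauce.",
}
sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
pairs = candidate_pairs(sigs)
```

Banding trades precision for recall: more bands with fewer rows each catch looser near-duplicates at the cost of more candidate pairs to verify, which is why candidates are usually re-checked with `estimated_jaccard` (or exact Jaccard) before anything is dropped.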
About the Speaker
Senior Data Scientist with 5+ years of experience building scalable NLP systems. Currently, I am part
of the LLM development team at Tochka Bank, where I lead initiatives for training data quality and
LLM guardrails. As an active speaker and program committee member for several Python and ML
conferences, I am deeply passionate about data and the tech community.
Over the last two years, my team and I built an in-house LLM (based on Qwen235b) that performs as
well as GPT-4.1 on our task set, resulting in significant cost savings for the company. Through this
engineering journey, I have gained deep expertise in LLM language adaptation, data curation, and LLM
evaluation/benchmarking. I am always open to networking and sharing knowledge, so feel free to reach
out!