PyData & PyCon Yerevan 2026

Talk

Reproducible Machine Learning Using DVC

Track: Data Science
Duration: 25 minutes

Machine Learning Open Source Python Data Science Audience Modeling

Reproducibility in machine learning is not just a research concern - it's a production necessity. This talk presents a practical approach to building reproducible ML pipelines using DVC, an open-source tool that brings Git-style versioning to data and pipelines.
DVC versions large datasets alongside code by tracking small metadata files in Git while the data itself lives in a separate cache or remote. It also defines pipelines as explicit stages, so teams can reproduce any historical run from the matching Git commit.
We will cover:
Why reproducibility fails in typical ML projects (notebook chaos, implicit state, untracked data)
Versioning datasets and linking them to Git commits with DVC
Structuring ML pipelines as DVC stages with explicit dependencies and outputs
Configuring cloud storage backends (S3, GCS, Azure) as DVC remotes
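
To make the idea of stages with explicit dependencies and outputs concrete, here is a minimal sketch in plain Python of the mechanism a pipeline tool can use to decide whether a stage needs to run again. It is an illustration only and not DVC's implementation: the names `file_sha256` and `run_stage`, the JSON lock file layout, and the choice of SHA-256 are all assumptions made for this example. The underlying idea is the one DVC applies: record content hashes of a stage's dependencies, and skip the stage when nothing has changed.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents (hypothetical helper)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_stage(name, deps, outs, action, lock_path):
    """Run `action` only if a dependency changed or an output is missing.

    Hypothetical sketch: `lock_path` is a JSON file mapping each stage name
    to the dependency hashes recorded on its last run.
    Returns True when the stage ran and False when it was skipped.
    """
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(p): file_sha256(p) for p in deps}
    up_to_date = lock.get(name) == current and all(p.exists() for p in outs)
    if up_to_date:
        return False
    action()
    lock[name] = current
    lock_path.write_text(json.dumps(lock, indent=2, sort_keys=True))
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw, clean = root / "raw.csv", root / "clean.csv"
        lock = root / "stages.lock.json"
        raw.write_text("user,item\n1,a\n1,a\n2,b\n")

        def dedupe():
            # Toy "prepare" stage: drop duplicate lines, keep original order.
            lines = raw.read_text().splitlines()
            clean.write_text("\n".join(dict.fromkeys(lines)) + "\n")

        print(run_stage("prepare", [raw], [clean], dedupe, lock))  # first run
        print(run_stage("prepare", [raw], [clean], dedupe, lock))  # inputs unchanged
        raw.write_text("user,item\n3,c\n")
        print(run_stage("prepare", [raw], [clean], dedupe, lock))  # input changed
```

Committing a lock file like this to Git alongside the code is what ties a given data state to a given commit. In a real project the stage definitions and hashes would be managed by DVC itself rather than by hand-written helpers.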
The session includes a concrete example project - a recommendation pipeline - demonstrating the full workflow from raw data ingestion to model evaluation, with every step tracked and reproducible.
By the end of this talk, attendees will understand how to structure their own ML projects so that any experiment can be reproduced exactly, shared with teammates, and audited over time.
Target audience: ML practitioners, data scientists, and MLOps engineers with basic familiarity with Python and Git. No prior DVC experience required.

About the Speaker

Software Engineer | Team Lead | NLP Researcher
Kiyarash Fazeli is a Tehran-based software engineer whose path into technology runs through both systems engineering and machine learning research. His academic work centered on word2vec and doc2vec - the embedding models that laid the conceptual groundwork for how language is represented in modern AI systems. That foundation gives him an unusually grounded perspective on large language models: where most engineers encounter them as APIs, Kiyarash understands them from the representation layer up.
Kiyarash has spoken at community meetups and university events on topics that span his two main threads - systems engineering and natural language processing.
On the ML side, he has talked about word embeddings and NLP fundamentals - tracing the line from word2vec and doc2vec through to modern transformer architectures and LLMs. His goal is always to make the underlying mechanics feel concrete rather than magical, especially for audiences encountering these ideas for the first time.
On the engineering side, he has covered backend architecture with Django and PostgreSQL - query optimization, schema design at scale, and the practical lessons you only learn when things break in production. He has also spoken on infrastructure and DevOps: containerization, Kubernetes, and what it actually takes to keep a high-traffic system stable.

Recording

Video will be available after the conference.