Talk
Reproducibility in machine learning is not just a research concern - it's a production necessity.
This talk presents a practical approach to building reproducible ML pipelines using DVC, an
open-source tool that brings Git-style versioning to data and pipelines.
DVC versions large datasets alongside code in Git, defines pipeline stages with explicit inputs and
outputs, and lets teams reproduce any historical run from the exact data and code versions behind
it.
We will cover:
- Why reproducibility fails in typical ML projects (notebook chaos, implicit state, untracked
  data)
- Versioning datasets and linking them to Git commits with DVC
- Structuring ML pipelines as DVC stages with explicit dependencies and outputs
- Configuring cloud storage backends (S3, GCS, Azure) as DVC remotes
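As a rough illustration of the stage idea in the list above, the Python sketch below mimics how a
tool can decide whether a pipeline stage needs to re-run: it compares content hashes of the stage's
dependencies with hashes recorded after the last run. This is a simplified stand-in, not DVC's
implementation or API. The function names, the in-memory `lock` dictionary, and the `ratings.csv`
file are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def record_hashes(deps: list[Path]) -> dict[str, str]:
    """Snapshot the current content hash of every dependency (a toy 'lock file')."""
    return {str(dep): file_md5(dep) for dep in deps}


def stage_is_stale(deps: list[Path], recorded: dict[str, str]) -> bool:
    """A stage must re-run if any dependency's hash differs from the recorded one."""
    return any(recorded.get(str(dep)) != file_md5(dep) for dep in deps)


def demo() -> tuple[bool, bool]:
    """Return (stale before the data changes, stale after the data changes)."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "ratings.csv"  # hypothetical raw dataset
        raw.write_text("user,item,rating\n1,10,5\n")

        lock = record_hashes([raw])            # hashes recorded after a run
        before = stage_is_stale([raw], lock)   # data untouched

        raw.write_text("user,item,rating\n1,10,4\n")  # one rating edited
        after = stage_is_stale([raw], lock)    # hash no longer matches
        return before, after


if __name__ == "__main__":
    print(demo())
```

Because the comparison is keyed on file contents rather than timestamps, copying or re-downloading
an identical dataset would not mark the stage as stale in this sketch, while a single-byte change
would.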
The session includes a concrete example project - a recommendation pipeline - demonstrating the full
workflow from raw data ingestion to model evaluation, with every step tracked and
reproducible.
By the end of this talk, attendees will understand how to structure their own ML projects so that
any experiment can be reproduced exactly, shared with teammates, and audited over time.
Target audience: ML practitioners, data scientists, and MLOps engineers with basic familiarity with
Python and Git. No prior DVC experience required.
About the Speaker
Software Engineer | Team Lead | NLP Researcher
Kiyarash Fazeli is a Tehran-based software engineer whose path into technology runs through both
systems engineering and machine learning research. His academic work centered on word2vec and
doc2vec - the embedding models that laid the conceptual groundwork for how language is represented
in modern AI systems. That foundation gives him an unusually grounded perspective on large language
models: where most engineers encounter them as APIs, Kiyarash understands them from the
representation layer up.
Kiyarash has spoken at community meetups and university events on topics that span his two main
threads - systems engineering and natural language processing.
On the ML side, he has talked about word embeddings and NLP fundamentals - tracing the line from
word2vec and doc2vec through to modern transformer architectures and LLMs. His goal is always to make
the underlying mechanics feel concrete rather than magical, especially for audiences encountering
these ideas for the first time.
On the engineering side, he has covered backend architecture with Django and PostgreSQL - query
optimization, schema design at scale, and the practical lessons you only learn when things break in
production. He has also spoken on infrastructure and DevOps: containerization, Kubernetes, and what
it actually takes to keep a high-traffic system stable.