Talk
Distributed training has a tooling problem. Mainstream distributed training frameworks tend to be monolithic platforms optimized for predefined regimes: powerful, but difficult to adapt to experimental workflows. Conversely, building a pipeline from scratch demands deep expertise, because it means working directly with low-level primitives and hand-writing complex logic.
This talk introduces d9d, an open-source PyTorch training framework designed to occupy the sweet spot between monoliths and homebrew scripts. We will look under the hood of d9d to uncover the modern engineering paradigms and PyTorch 2.0 features that make robust, hackable scale-out LLM training possible.
Key takeaways
- Modern Distributed Training 101. A quick introduction to how modern LLMs are trained in a distributed manner, including an overview of 6D parallelism.
- The "Why?". What tooling for distributed training is available today, and why we decided to build a completely new framework.
- Philosophy of d9d. Key decisions we made during the development process. How we integrate with existing PyTorch 2.0 infrastructure. Why we abandon massive wrappers like DistributedDataParallel. The "white-box" modelling approach. How to seamlessly integrate non-native, high-performance Triton/CUDA extensions and still keep the framework's code readable.
- Highlighted APIs. Some APIs in d9d we find especially useful and novel: a streaming distributed checkpointing engine supporting graph-based model state transformations, and a refactored pipeline parallelism engine based on torch.distributed.pipelining.
- Engineering for ML Frameworks. How strict linting, static type-checking, and a comprehensive approach to local and distributed testing keep a highly complex framework reliable.
Attendees will leave with a solid understanding of modern distributed training mechanics, a ready-to-use framework to scale out their LLM research, and actionable software engineering patterns to apply to their own complex Python systems.
About the Speaker
I am a Lead Research Engineer and Engineering Manager at tochka.com, leading a 15-person team focused on building high-performance foundation models (including LLMs). Over the last year, my work has revolved around orchestrating large-scale distributed training systems, including scaling Mixture of Experts models up to 235 billion parameters using modern multidimensional parallelism approaches.
To solve the bottlenecks we faced at this scale, I created d9d, an open-source PyTorch training framework designed to bridge the gap between raw scalability and research flexibility.
I am incredibly passionate about open-source software, treating machine learning pipelines with rigorous software engineering discipline, and building better AI tooling. I also love to casually write low-level Triton GPU kernels. Concurrently with my professional work, I am completing my BS in Mathematics and Computer Science.