PyData & PyCon Yerevan 2026

Talk

Tiny, Framework-Aware Code Diffusion: Training a Mixture-of-Experts on a Consumer GPU

Track: Data Science
Duration: 25 minutes

GPU Computing, Data Science

Context. A substantial fraction of day-to-day programming happens inside a narrow framework ecosystem. “Write a tRPC endpoint,” “add a React hook,” “fix this Prisma schema”: the surface is small, the idioms are strict, and a general-purpose 70B model is a heavy hammer. What if a compact model, trained specifically for framework-shaped code, could do most of the work on a laptop?

What we built. MoEDCoder is an approximately 10M-parameter (about 6M active) masked-discrete-diffusion transformer with a 4-expert Mixture-of-Experts: one expert per framework family (TypeScript core, NestJS, Next.js, React / React Native). It is trained on a multi-million-chunk TypeScript corpus with a custom SuperBPE tokenizer, end-to-end on a single RTX 4070 laptop (8 GB).
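As a rough reading of the "10M total, 6M active" figures, the sketch below splits the parameter budget into a shared part and a per-expert part. It assumes top-1 routing (one expert active per token), four equally sized experts, and that everything outside the experts is shared; the function name `split_params` is hypothetical and the resulting numbers are back-of-the-envelope estimates, not figures from the talk.

```python
def split_params(total, active, n_experts, top_k=1):
    """Solve the two equations
        total  = shared + n_experts * per_expert
        active = shared + top_k     * per_expert
    for (shared, per_expert). Here per_expert is summed over all MoE layers.
    """
    per_expert = (total - active) / (n_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert


# Assumed inputs: the approximate counts quoted in the abstract.
shared, per_expert = split_params(10e6, 6e6, n_experts=4)
print(f"shared ~ {shared / 1e6:.2f}M, per expert ~ {per_expert / 1e6:.2f}M")
```

Under these assumptions, roughly 4.67M parameters are shared and each expert holds about 1.33M, so a token touches the shared weights plus a single expert.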

What you’ll learn. First, masked discrete diffusion for code: the forward and reverse process, the reweighted denoising ELBO, and why dropping the left-to-right constraint matters for filling code holes. Second, framework-routed MoE: top-1 softmax routing, Switch-Transformer load-balance loss, router z-loss, and why the obvious recipe underspecializes on small multi-domain data.
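The masked-diffusion part can be illustrated with a minimal, dependency-free sketch. It assumes a linear masking schedule (each token is masked independently with probability t), for which the reweighted bound reduces to a cross-entropy over masked positions scaled by 1/t. The names `MASK_ID`, `forward_mask`, `reweighted_loss`, and `unmask_most_confident` are hypothetical, and the confidence-based reverse step is one common sampling choice, not necessarily the one used in MoEDCoder.

```python
import math
import random

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def reweighted_loss(log_probs, tokens, noisy, t):
    """Single-sample estimate of the reweighted denoising bound.

    log_probs[i][v] is the model's log-probability of token v at position i,
    predicted from the noisy sequence. Only masked positions contribute, and
    the sum is scaled by 1/t (linear schedule assumption).
    """
    total = 0.0
    for i, (tok, noisy_tok) in enumerate(zip(tokens, noisy)):
        if noisy_tok == MASK_ID:
            total -= log_probs[i][tok]
    return total / t


def unmask_most_confident(noisy, log_probs):
    """One reverse step: fill the masked position the model is most sure about.

    Any position may be filled first, which is what allows infilling a hole
    in the middle of a file instead of decoding strictly left to right.
    """
    best = None  # (log_prob, position, token)
    for i, tok in enumerate(noisy):
        if tok != MASK_ID:
            continue
        v = max(range(len(log_probs[i])), key=lambda k: log_probs[i][k])
        if best is None or log_probs[i][v] > best[0]:
            best = (log_probs[i][v], i, v)
    out = list(noisy)
    if best is not None:
        out[best[1]] = best[2]
    return out
```

In training, t would be sampled per example and `log_probs` would come from the transformer; real implementations often also normalize the loss by sequence length.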

Third, a research debugging story with real numbers: the first run hit entropy near log F with specialization scores below 0.1, which is effectively uniform routing. We show how we measured it, what the signal looked like on WandB, and the fix: a framework-supervised cross-entropy routing loss on mean router logits, combined with a two-stage training schedule (general pretraining, then framework fine-tuning). The novelty is not a secret formula; it is the empirical finding that standard MoE losses fail on small multi-domain code data, and that masked diffusion plus supervised routing plus a curriculum makes it work. Specialization jumped from 0.08 to 0.85 (TypeScript), 0.48 (NestJS), 0.45 (React), and 0.21 (Next.js). Perplexity dropped from 161 to 65 (full ablations in an accompanying paper).
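The routing losses and the entropy diagnostic named in this abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the talk's implementation: router logits are given as one list per token, expert index e is assumed to correspond to framework e, the mean is taken over the tokens of one sequence, and the entropy is computed over top-1 expert usage. All function names are hypothetical, and the abstract's specialization score is not reproduced because its definition is not given here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def expert_load(router_logits):
    """Fraction of tokens whose top-1 expert is e, for each expert e."""
    n = len(router_logits[0])
    counts = [0] * n
    for logits in router_logits:
        counts[max(range(n), key=lambda e: logits[e])] += 1
    return [c / len(router_logits) for c in counts]


def normalized_entropy(load):
    """Entropy of expert usage divided by log(n_experts); 1.0 means uniform."""
    h = -sum(p * math.log(p) for p in load if p > 0)
    return h / math.log(len(load))


def load_balance_loss(router_logits):
    """Switch-Transformer auxiliary loss: n * sum_e f_e * P_e, where f_e is the
    top-1 token fraction and P_e the mean router probability for expert e."""
    n = len(router_logits[0])
    f = expert_load(router_logits)
    probs = [softmax(l) for l in router_logits]
    p_mean = [sum(p[e] for p in probs) / len(probs) for e in range(n)]
    return n * sum(fe * pe for fe, pe in zip(f, p_mean))


def router_z_loss(router_logits):
    """Router z-loss: mean squared log-sum-exp of the router logits."""
    total = 0.0
    for l in router_logits:
        m = max(l)
        lse = m + math.log(sum(math.exp(x - m) for x in l))
        total += lse ** 2
    return total / len(router_logits)


def supervised_routing_loss(router_logits, framework_id):
    """Cross-entropy between softmax(mean router logits) and the framework label.
    Assumption: expert index == framework id."""
    n = len(router_logits[0])
    mean_logits = [sum(l[e] for l in router_logits) / len(router_logits)
                   for e in range(n)]
    return -math.log(softmax(mean_logits)[framework_id])
```

Note that the load-balance term is minimized by uniform expert usage, so on its own it gives the router no reason to align experts with frameworks; the supervised term adds that signal directly.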

About the Speaker

Ryan (Sobhan) Bahrami is a Software Engineer and AI Systems Architect with over a decade of hands-on international experience building scalable software systems, AI-powered platforms, and production infrastructure across ML/AI, fintech, gaming, and cloud-native environments. His work spans distributed systems, computer vision, LLM/SLM training, architecture and orchestration, and agentic AI architectures. His current focus areas include multi-agent AI systems, novel diffusion-based language model architectures, real-time inference pipelines, and applied machine learning for multimodal AI product engineering. He has shipped more than 100 agentic solutions globally and is actively conducting research on specialized-expert tiny multimodal SLMs capable of running in edge environments.

Recording

Video will be available after the conference.
