PyData & PyCon Yerevan 2026

Talk

Taming the Chaos: Building a Reproducible and Flexible Bioinformatics Architecture

Track: Data Science
Duration: 50 minutes

Python, Computer Vision, Data Science

Reproducibility is the bedrock of bioinformatics, yet achieving it within large, cross-functional commercial teams often feels like navigating a chaotic labyrinth. Data analysis in medicine and complex natural sciences requires diving into "rabbit holes" to extract real-world patient value. As this serendipitous exploration is where groundbreaking ideas originate, an overly standardized approach risks suffocating them. Conversely, operating without a sound architecture leads to scattered notebooks, fragmented repositories, and unscalable workflows. The ultimate technical challenge is finding the balance: taming the chaos without murdering the serendipity.
Drawing on experiences from both academia and industry – which often sit at opposite ends of this rabbit hole – this talk outlines a comprehensive architectural philosophy for building data systems that balance structural integrity with scientific freedom. I will introduce five core tenets for a successful bioinformatics ecosystem:
* The Kernel Concept: The system must act as a strict core founded on fundamental principles of biology, math, and programming, rather than a disjointed box of technical utensils;
* Language-Like Flexibility: It must serve as a common language, allowing new ideas and extensions to sprout organically while the underlying plumbing falls into place by design;
* Multi-Level Accessibility: It must be seamless for the bioinformatician under the hood, provide comprehensible visualizations for the biologist, and easily distill findings into abstract narratives for upper management;
* Omnipresent Tracking: It must eliminate the catastrophe of hunting through dozens of untracked, scattered notebooks across various clouds and machines;
* Cost & Time Efficiency: Above all, it must systematically save time, thereby saving money.

To ground these tenets in reality, the second half of the talk will provide a technical walkthrough of the architecture we are implementing at BostonGene. We will demonstrate how the modern Python ecosystem can be orchestrated to build this ideal system. Specifically, we will dive into our use of the uv package manager for fast, reproducible environments, Pydantic for strict type validation, and the AnnData and MuData classes for handling complex biological data. Finally, we will show how combining S3 storage, stringent version control, and marimo reactive notebooks and dashboards brings this highly automated, reproducible, and serendipity-friendly architecture to life.
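
As a rough illustration of the tracking and reproducibility ideas above, here is a minimal sketch that uses only the Python standard library. It is not the BostonGene implementation and does not use the libraries named in the abstract: the `RunRecord` class, the `make_record` helper, and every field name are hypothetical. The idea is to derive a deterministic identifier from the code version, the parameters, and content hashes of the inputs, so that the same analysis on the same data always maps to the same identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical provenance record for a single analysis run."""

    analysis: str        # name of the analysis step
    code_version: str    # e.g. a version-control commit hash
    params: dict         # analysis parameters (must be JSON-serializable)
    input_digests: dict  # input name -> SHA-256 hex digest of its bytes

    def run_id(self) -> str:
        # Sorted keys make the serialization independent of dict ordering,
        # so identical content always produces an identical identifier.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


def make_record(analysis: str, code_version: str, params: dict, inputs: dict) -> RunRecord:
    """Build a record from raw input bytes; `inputs` maps names to bytes."""
    if not analysis or not code_version:
        raise ValueError("analysis and code_version must be non-empty")
    digests = {name: hashlib.sha256(blob).hexdigest() for name, blob in inputs.items()}
    return RunRecord(analysis, code_version, dict(params), digests)


# Illustrative values only: same content, parameters given in a different order.
first = make_record("clustering", "abc123", {"seed": 0, "n_neighbors": 15}, {"counts": b"\x00\x01"})
second = make_record("clustering", "abc123", {"n_neighbors": 15, "seed": 0}, {"counts": b"\x00\x01"})
changed = make_record("clustering", "abc123", {"seed": 1, "n_neighbors": 15}, {"counts": b"\x00\x01"})

print(first.run_id() == second.run_id())   # same inputs and parameters
print(first.run_id() == changed.run_id())  # one parameter differs
```

An identifier like this could, for example, serve as a storage key prefix or a notebook label, so results found on any machine can be traced back to the exact inputs and code that produced them. That usage is an assumption for illustration, not something the abstract specifies.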

About the Speaker

I have worked across several fields as a scientist, which has given me a fairly broad perspective on biological data. I did my PhD in Systems Immunometabolism, followed by two separate postdocs in molecular biology and neuroscience. Eventually, I transitioned into industry, joining BostonGene as a Senior Bioinformatician, where I currently lead a small bioinformatics team. In our daily work, we focus heavily on AI-guided biomedical research in oncoimmunology. Our primary goal is highly practical: helping pharma companies design more effective clinical trials and giving oncologists the insights they need to better treat patients.

Recording

Video will be available after the conference.

Töma Nikitin
