Talk
Reproducibility is the bedrock of bioinformatics, yet achieving it within large, cross-functional
commercial teams often feels like navigating a chaotic labyrinth. Data analysis in medicine and
complex natural sciences requires diving into "rabbit holes" to extract real-world patient value. As
this serendipitous exploration is where groundbreaking ideas originate, an overly standardized
approach risks suffocating them. Conversely, operating without a sound architecture leads to
scattered notebooks, fragmented repositories, and unscalable workflows. The ultimate technical
challenge is finding the balance: taming the chaos without murdering the serendipity.
Drawing on experiences from both academia and industry – which often sit at opposite ends of this
rabbit hole – this talk outlines a comprehensive architectural philosophy for building data systems
that balance structural integrity with scientific freedom. I will introduce five core tenets for a
successful bioinformatics ecosystem:
* The Kernel Concept: The system must act as a strict core founded on fundamental principles of
biology, math, and programming, rather than a disjointed box of technical utensils;
* Language-Like Flexibility: It must serve as a common language, allowing new ideas and extensions
to sprout organically while the underlying plumbing falls into place by design;
* Multi-Level Accessibility: It must be seamless for the bioinformatician under the hood, provide
comprehensible visualizations for the biologist, and easily distill findings into abstract
narratives for upper management;
* Omnipresent Tracking: It must eliminate the catastrophe of hunting through dozens of untracked,
scattered notebooks across various clouds and machines;
* Cost & Time Efficiency: Above all, it must systematically save time, thereby saving money.
To ground these tenets in reality, the second half of the talk will provide a technical walkthrough
of the architecture we are implementing at BostonGene. We will demonstrate how modern Python
ecosystems can be orchestrated to build this ideal system. Specifically, we will dive into our use
of the uv package manager for fast, reproducible environments, Pydantic for
strict type validation, and the AnnData and MuData classes for handling
complex biological data. Finally, we will show how combining S3 storage, stringent version control,
and marimo reactive notebooks and dashboards brings this highly automated,
reproducible, and serendipity-friendly architecture to life.
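As a flavor of what strict type validation with Pydantic can look like, here is a minimal sketch. It assumes Pydantic v2, and the `SampleRecord` model, its fields, and the `validate_record` helper are hypothetical illustrations rather than part of the BostonGene architecture.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class SampleRecord(BaseModel):
    """Hypothetical metadata record for a single sample (illustrative only)."""

    # strict=True turns off type coercion: the string "0.8" is rejected
    # for a float field instead of being silently converted.
    model_config = ConfigDict(strict=True)

    sample_id: str = Field(min_length=1)
    tissue: str
    tumor_purity: float = Field(ge=0.0, le=1.0)  # assumed to be a fraction in [0, 1]


def validate_record(raw: dict) -> tuple[bool, str]:
    """Return (True, "ok") for a valid record, else (False, "<field>: <message>")."""
    try:
        SampleRecord(**raw)
    except ValidationError as exc:
        first = exc.errors()[0]  # report only the first problem found
        return False, f"{first['loc'][0]}: {first['msg']}"
    return True, "ok"


good = {"sample_id": "S-001", "tissue": "lung", "tumor_purity": 0.8}
wrong_type = {"sample_id": "S-002", "tissue": "lung", "tumor_purity": "0.8"}
out_of_range = {"sample_id": "S-003", "tissue": "lung", "tumor_purity": 1.5}

for record in (good, wrong_type, out_of_range):
    print(record["sample_id"], validate_record(record))
```

Rejecting malformed metadata at the boundary, rather than deep inside an analysis, is one way a strict core can coexist with free-form exploration in notebooks.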
About the Speaker
I have worked across a few different fields as a scientist, which has given me a fairly broad perspective on biological data. I did my PhD in Systems Immunometabolism, followed by two separate postdocs in molecular biology and neuroscience. Eventually, I transitioned into industry, joining BostonGene as a Senior Bioinformatician, where I currently lead a small bioinformatics team. In our daily work, we focus heavily on AI-guided biomedical research within the field of oncoimmunology. Our primary goal is highly practical: helping pharma companies design more effective clinical trials and giving oncologists the insights they need to better treat patients.