Talk
CLIP and similar cross-modal contrastive models are widely used, but they operate under a hidden
assumption: that all useful information is redundant, that is, shared between modalities. For many
real tasks this is wrong. Movie genre prediction is a good example: "Horror" might be obvious from
the poster alone (unique to image), "Documentary" from the plot alone (unique to text), and
"Thriller" only becomes clear when both are seen together (synergy). By design, CLIP struggles with
the last two kinds of information, unique and synergistic, because neither is shared across
modalities.
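The three kinds of information can be made concrete with a toy calculation. The sketch below is not taken from the paper: it uses two hypothetical binary "modalities" x1 and x2 and measures how many bits each one carries about a label y, alone and together. All names (mutual_information, information_profile, the three toy datasets) are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information I(A; B) in bits, treating each (a, b) pair as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


def information_profile(samples):
    """Bits about y carried by x1 alone, x2 alone, and both; samples are (x1, x2, y)."""
    return {
        "x1 alone": mutual_information([(x1, y) for x1, _, y in samples]),
        "x2 alone": mutual_information([(x2, y) for _, x2, y in samples]),
        "x1 and x2": mutual_information([((x1, x2), y) for x1, x2, y in samples]),
    }


# Toy datasets (assumptions for illustration only).
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
unique_to_x1 = [(x1, x2, x1) for x1, x2 in inputs]   # label readable from x1 only
synergy = [(x1, x2, x1 ^ x2) for x1, x2 in inputs]   # label is XOR: needs both inputs
redundant = [(x, x, x) for x in (0, 1)]              # both inputs carry the same bit

for name, data in [("unique", unique_to_x1), ("synergy", synergy), ("redundant", redundant)]:
    print(name, information_profile(data))
```

In the XOR case each input alone carries 0 bits about the label while the pair carries 1 bit, which is a loose analogue of the "Thriller" example: nothing is visible until both modalities are read together.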
CoMM (Contrastive MultiModal learning) addresses this with a clean theoretical framework grounded in
Partial Information Decomposition (PID). Instead of aligning two separate unimodal spaces, CoMM
fuses all modalities into a single shared transformer space and aligns augmented multimodal views.
This shift pushes the model to capture unique, redundant, and synergistic information.
The paper was published at ICLR 2025 and achieved state-of-the-art results on seven multimodal benchmarks.
This talk is the story of reimplementing that paper end-to-end in Python.
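As a rough illustration of "fuse first, then align augmented multimodal views", here is a minimal sketch in plain Python. It is not the paper's implementation: fuse is a stand-in that concatenates and normalises features where the talk describes a transformer, augment adds Gaussian noise as a placeholder augmentation, and info_nce is a standard one-directional InfoNCE loss. Every function name and parameter here is an assumption made for the example.

```python
import math
import random


def fuse(image_feat, text_feat):
    """Stand-in fusion: concatenate both modalities and L2-normalise.
    (Assumption: a real model would use a learned transformer here.)"""
    z = list(image_feat) + list(text_feat)
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    return [v / norm for v in z]


def augment(feat, rng, noise=0.1):
    """Hypothetical augmentation: additive Gaussian noise on the features."""
    return [v + rng.gauss(0.0, noise) for v in feat]


def multimodal_view(batch, rng):
    """Build one augmented, fused view per (image, text) pair in the batch."""
    return [fuse(augment(img, rng), augment(txt, rng)) for img, txt in batch]


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE: view_b[i] is the positive for view_a[i]; all other rows are negatives."""
    total = 0.0
    for i, za in enumerate(view_a):
        logits = [sum(p * q for p, q in zip(za, zb)) / temperature for zb in view_b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


rng = random.Random(0)
# Hypothetical batch of 8 (image_features, text_features) pairs, 4 dimensions each.
batch = [
    ([rng.gauss(0, 1) for _ in range(4)], [rng.gauss(0, 1) for _ in range(4)])
    for _ in range(8)
]
loss = info_nce(multimodal_view(batch, rng), multimodal_view(batch, rng))
print(f"contrastive loss between two fused views: {loss:.4f}")
```

The point of the sketch is where the contrast happens: both sides of the loss are fused multimodal representations of the same sample, rather than an image embedding on one side and a text embedding on the other as in CLIP-style training.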
About the Speaker
I'm a machine learning engineer with a bit over seven years of experience working on applied AI
problems, mostly around large language models, search systems, and building things that are actually
useful in practice.
I've built chatbots, retrieval-based systems, and data pipelines, with a focus on making them
reliable rather than just impressive. I've worked with cloud infrastructure on AWS and have tackled
some interesting systems challenges, including applying reinforcement learning to load balancing,
which deepened my appreciation for how elegant RL can be when it meets real-world constraints.
Outside of my day-to-day work, I try to stay close to research. I read papers and implement them
when I can, because I'm genuinely curious about reinforcement learning, memory augmentation, and
multimodal learning: how models remember, reason, and perceive the world across different forms of
information. I've also had the chance to support and learn alongside small teams, and I believe
sharing knowledge is part of the learning itself.
I think the best way to understand something is to build it, and the best way to know you understand
it is to explain it.
At the core, I'm just someone fascinated by how mathematics, code, and compute quietly come
together, step by step, into something that starts to resemble intelligence. I find that beautiful,
and it's what keeps me going.