Vazgen Tadevosyan

Senior Engineer

WorldQuant

Yerevan, Armenia

Talk

Beyond CLIP: Implementing CoMM - A Multimodal Contrastive Learning Paper from ICLR 2025
Track: Data Science | Duration: 25 minutes
Tags: Multimodal AI, Python, Image Processing, Computer Vision, Data Science

CLIP and similar cross-modal contrastive models are widely used, but they rest on a hidden assumption: that all useful information is redundant, that is, shared between modalities. For many real tasks this is wrong. Movie genre prediction is a good example: "Horror" might be obvious from the poster alone (unique to the image), "Documentary" from the plot alone (unique to the text), and "Thriller" only becomes clear when both are seen together (synergy). By design, CLIP captures the redundant part well but struggles with unique and synergistic information.
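To make "synergy" concrete, here is a toy illustration that is not taken from the talk or the paper: two binary "modalities" whose XOR is the label. The function name `mutual_information` and the variable names are hypothetical, and the dataset is an assumption chosen only because it is the textbook case where neither input alone says anything about the label, while both together determine it.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """Mutual information I(A; B) in bits, treating each (a, b) sample as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (count / n) * math.log2((count / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), count in joint.items()
    )


# Toy dataset (assumption): x1 stands in for the poster, x2 for the plot,
# and the label is their XOR, so it depends on both inputs jointly.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

mi_image = mutual_information([(x1, y) for x1, _, y in samples])
mi_text = mutual_information([(x2, y) for _, x2, y in samples])
mi_joint = mutual_information([((x1, x2), y) for x1, x2, y in samples])

print(mi_image, mi_text, mi_joint)  # prints: 0.0 0.0 1.0
```

Each input alone carries zero bits about the label, yet the pair carries one full bit. A model that only keeps what the two inputs share would have nothing to work with here.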
CoMM (Contrastive MultiModal learning) addresses this with a clean theoretical framework grounded in Partial Information Decomposition (PID). Instead of aligning two separate unimodal spaces, CoMM fuses all modalities into a single shared transformer space and aligns augmented multimodal views - a shift that naturally forces the model to capture unique, redundant, and synergistic information. The paper was published at ICLR 2025 and achieved state-of-the-art results on seven multimodal benchmarks. This talk is the story of reimplementing that paper end to end in Python.
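The idea of aligning augmented multimodal views can be sketched with a generic InfoNCE loss. This is a minimal, dependency-free sketch and not the paper's architecture or full objective: `fuse` is a hypothetical stand-in that just concatenates and normalizes features (CoMM uses a transformer for fusion), `augment` adds Gaussian noise as a placeholder for real augmentations, and the feature sizes, temperature, and batch size are arbitrary assumptions. A real implementation would normally use a tensor library.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def fuse(image_feat, text_feat):
    """Hypothetical fusion stand-in: concatenate both modalities, then normalize."""
    return l2_normalize(list(image_feat) + list(text_feat))


def augment(feat, rng, noise=0.05):
    """Placeholder augmentation (assumption): add small Gaussian noise."""
    return [x + rng.gauss(0.0, noise) for x in feat]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss where view_a[i] and view_b[i] form the positive pair."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [
            sum(p * q for p, q in zip(anchor, other)) / temperature
            for other in view_b
        ]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


rng = random.Random(0)
# Random stand-ins for per-sample (image, text) features.
batch = [
    ([rng.gauss(0, 1) for _ in range(4)], [rng.gauss(0, 1) for _ in range(4)])
    for _ in range(8)
]

# Two independently augmented multimodal views of the same batch.
view_a = [fuse(augment(img, rng), augment(txt, rng)) for img, txt in batch]
view_b = [fuse(augment(img, rng), augment(txt, rng)) for img, txt in batch]

aligned_loss = info_nce(view_a, view_b)
# Rotating view_b by one position pairs every sample with the wrong partner.
misaligned_loss = info_nce(view_a, view_b[1:] + view_b[:1])
print(aligned_loss < misaligned_loss)
```

The point of the sketch is the shape of the computation: both views being compared are already fused multimodal representations, in contrast to a CLIP-style loss that compares an image embedding against a text embedding.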

About the Speaker

I'm a machine learning engineer with a bit over seven years of experience working on applied AI problems, mostly around large language models, search systems, and building things that are actually useful in practice.
I've built chatbots, retrieval-based systems, and data pipelines, with a focus on making them reliable rather than just impressive. I've worked with cloud infrastructure on AWS and have tackled some interesting systems challenges, including applying reinforcement learning to load balancing, which deepened my appreciation for how elegant RL can be when it meets real-world constraints.
Outside of my day-to-day work, I try to stay close to research. I read papers and implement them when I can, because I'm genuinely curious about reinforcement learning, memory augmentation, and multimodal learning: how models remember, reason, and perceive the world across different forms of information. I've also had the chance to support and learn alongside small teams, and I believe sharing knowledge is part of the learning itself.
I think the best way to understand something is to build it, and the best way to know you understand it is to explain it.
At the core, I'm just someone fascinated by how mathematics, code, and compute quietly come together step by step toward something that starts to resemble intelligence. I find that beautiful, and it's what keeps me going.

Recording

Video will be available after the conference.