PyData & PyCon Yerevan 2026

Talk

Who Tests the AI Testers? Evaluating AI QA Agents embedded in TDD-based software generation

Track: Data Science
Duration: 50 minutes

AI Agents, LLMs, Open Source, Python, Data Science

Overview
The rush to automate Quality Assurance with LLM-based agents often overlooks a fundamental engineering principle: an unverified test is a liability. This talk introduces systematic, tool-agnostic principles for evaluating QA agents embedded in TDD-based software generation, through two lenses: Test Analysis (logical validity) and White-Box Evaluation (technical depth). The speaker combines extensive experience in software testing with a current role in ML evaluation, and so sees the problem from both sides.

All the examples are based on open-source products and depersonalized data.
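
As an illustration of why an unverified test is a liability, the sketch below checks whether a generated test can tell a correct implementation from a deliberately broken one (a mutation-style check). This is not taken from the talk: the functions `add`, `add_mutant`, `weak_test`, `strong_test`, and `kills_mutant` are hypothetical names chosen for this example, and a mutation check is only one possible way to probe a test's logical validity.

```python
from typing import Callable

IntOp = Callable[[int, int], int]
TestFn = Callable[[IntOp], None]


def add(a: int, b: int) -> int:
    """Reference implementation (hypothetical code under test)."""
    return a + b


def add_mutant(a: int, b: int) -> int:
    """Deliberately broken variant: subtracts instead of adding."""
    return a - b


def weak_test(fn: IntOp) -> None:
    # Looks like a test, but 0 + 0 == 0 - 0, so it cannot detect the mutant.
    assert fn(0, 0) == 0


def strong_test(fn: IntOp) -> None:
    # Distinguishes addition from subtraction.
    assert fn(2, 3) == 5


def passes(test: TestFn, fn: IntOp) -> bool:
    """Run a test against an implementation and report pass/fail."""
    try:
        test(fn)
    except AssertionError:
        return False
    return True


def kills_mutant(test: TestFn, reference: IntOp, mutant: IntOp) -> bool:
    """True only if the test passes on the reference and fails on the mutant."""
    return passes(test, reference) and not passes(test, mutant)
```

Under these assumptions, `kills_mutant(strong_test, add, add_mutant)` holds while `kills_mutant(weak_test, add, add_mutant)` does not: the weak test passes on both implementations, so a green result from it says nothing about correctness.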

Target Audience & Background
This talk is for ML Engineers, Software Development Engineers in Test (SDET), and Technical Leads. Attendees should have a working knowledge of Python and familiarity with testing frameworks (e.g., pytest). The talk focuses on methodology and architectural patterns; no proprietary code or data will be disclosed.

About the Speaker

With over 25 years of experience in the IT industry, I have worked across the full spectrum of quality assurance, from manual testing and automation to leading global QA departments. My technical foundation is built on years of developing robust automated testing frameworks in both Java and Python. This extensive background in "classical" software engineering now serves as the bedrock for my current role at the intersection of AI and quality.

Currently, I serve as a Senior Staff ML Evaluation Engineer at Grid Dynamics in Yerevan. My work is focused on the most critical frontier of modern AI: building rigorous evaluation pipelines for autonomous agents and LLM-driven applications. I specialize in designing complex evaluation architectures using the Hydra framework and integrating sophisticated "LLM-as-a-judge" methodologies. My mission is to translate high-level quality requirements into precise, data-driven metrics. By bridging the gap between traditional testing discipline and generative AI, I help ensure that the next generation of AI agents is not just innovative, but reliable and production-ready.

Recording

Video will be available after the conference.

Lilia Urmazova
