PyData & PyCon Yerevan 2026

Talk

How the Guardian measured rhetoric toward immigration in Parliament

Track: Data Science
Duration: 50 minutes

LLMs, Machine Learning, Statistics, Data Science, Testing

The Guardian’s Data Science and Data Projects teams, in collaboration with University College London, developed an in-house machine learning model to measure linguistic sentiment towards immigration in debates in the UK House of Commons. The model is designed to distinguish sentiment directed specifically at immigration from more general emotionally charged political language.
The project began with a set of core editorial questions: how can we assess whether parliamentary discourse on immigration has shifted over time? Can we meaningfully compare sentiment in recent debates with speeches from the 1950s and 1960s, when immigration was a highly contested political issue? And can such changes be measured in a statistically robust and transparent way, suitable for journalistic scrutiny?
Over two years of collaboration between journalists, data scientists, and academic experts, we developed and validated a bespoke sentiment model capable of answering these questions. Alongside this, we experimented with using Large Language Models (LLMs) to support the labelling of training data, critically evaluating whether they could improve the efficiency and consistency of annotation without undermining analytical reliability.
Attendees will take away:
- A practical framework for measuring targeted sentiment in political texts, rather than relying on generic tone analysis.
- Hard-learnt lessons from translating open-ended editorial questions into robust machine learning approaches.
- Insights into the strengths and limitations of using LLMs for data annotation in sensitive, high-stakes domains.
- Guidance on applying these methods to long-running political or civic datasets to support accountability reporting.
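
To make the annotation question concrete: one common way to check whether LLM-assigned labels are consistent with human annotation is a chance-corrected agreement score such as Cohen's kappa. The sketch below is illustrative only and is not taken from the Guardian's project; the label names, the example data, and the `cohen_kappa` helper are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels for six speech excerpts (not real project data).
human = ["negative", "neutral", "positive", "negative", "neutral", "negative"]
llm = ["negative", "neutral", "neutral", "negative", "neutral", "positive"]

kappa = cohen_kappa(human, llm)
print(f"Cohen's kappa (human vs LLM): {kappa:.3f}")
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 means the LLM labels agree with the human labels no more often than random labelling with the same label frequencies would.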

About the Speaker

I’m the Head of Data Science at The Guardian, leading a team developing machine learning models and data products across editorial and business functions. I’m a physicist by trade with deep first-hand domain knowledge of the publishing industry. My work focuses on NLP, GenAI, recommendation systems, and applied machine learning on large-scale and unstructured datasets.

Recent projects include building models to extract structured information from text, developing tagging and discovery systems for content, forecasting web traffic, and analysing audience behaviour to inform product and editorial strategy.

I’m especially interested in the challenges of deploying and maintaining ML systems in production, and in bridging the gap between experimentation and real-world impact.

Recording

Video will be available after the conference.

Anna Vissens
