PyData & PyCon Yerevan 2026

Talk

Benchmarking Historical Reasoning in LLMs: Tasks, Metrics, and Failure Modes

Track: Data Science Duration: 50 minutes

LLMs Graph Analytics Uncertainty Quantification Data Science Testing

Large Language Models (LLMs) are trained on heterogeneous corpora that include not only factual information but also narratives shaped by nationalism, ideology, and popular culture. Historical domains are particularly affected, as 19th-century nation-building and later political regimes curated the past into simplified, often heroic narratives. These include proto-national reinterpretations of medieval figures, inflated military victories, and moralized accounts of complex geopolitical events.

This talk introduces a benchmark designed to evaluate LLM robustness to historiographical bias, using Romanian history as a case study. The dataset targets topics where popular narratives diverge from mainstream scholarship, covering periods from medieval to modern history. Rather than focusing solely on factual recall, the benchmark evaluates multiple dimensions: event outcome accuracy (tactical vs. strategic), chronological and contextual reasoning, detection of anachronisms, and stability under different prompt framings (neutral vs. nationalistic). It also assesses the model’s ability to acknowledge uncertainty, reflect historiographical debates, and correct common misconceptions.

Evaluation is based on accuracy against curated reference answers, complemented by qualitative analysis of reasoning patterns and failure modes. Results across multiple LLMs highlight systematic tendencies to reproduce dominant but biased narratives, particularly under leading prompts.
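As a rough illustration of the two quantitative ideas above (accuracy against reference answers and stability across prompt framings), here is a minimal sketch. It is not the benchmark's actual implementation: the record layout, the multiple-choice answer format, and the function names `accuracy_by_framing` and `framing_stability` are all assumptions made for this example, and the sample records are invented placeholders rather than benchmark data.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only:
# (question_id, framing, model_answer, reference_answer)
records = [
    ("q1", "neutral", "B", "B"),
    ("q1", "nationalistic", "A", "B"),
    ("q2", "neutral", "C", "C"),
    ("q2", "nationalistic", "C", "C"),
]


def accuracy_by_framing(rows):
    """Share of answers matching the reference, computed per prompt framing."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for _, framing, answer, reference in rows:
        totals[framing] += 1
        hits[framing] += int(answer == reference)
    return {framing: hits[framing] / totals[framing] for framing in totals}


def framing_stability(rows):
    """Share of questions whose answer is identical under every framing."""
    answers = defaultdict(set)
    for question_id, _, answer, _ in rows:
        answers[question_id].add(answer)
    stable = sum(len(seen) == 1 for seen in answers.values())
    return stable / len(answers)


acc = accuracy_by_framing(records)
stability = framing_stability(records)
print(acc, stability)
```

A gap between the per-framing accuracies, or a stability score well below 1, would be one simple signal that a model shifts its answers under leading prompts. The qualitative analysis of reasoning patterns described above is not captured by a sketch like this.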

The benchmark is implemented and publicly available on Kaggle (https://www.kaggle.com/benchmarks/gpreda/romanian-history), supporting reproducible evaluation and community contributions. This work contributes to LLM evaluation by emphasizing epistemic robustness in politically and historically sensitive contexts.

About the Speaker

Survived a PhD in Computational Electromagnetics while working as a researcher. Some 25 years ago, applied what would now be called Machine Learning to solve ill-posed inverse problems in NDT (non-destructive testing). Worked for a long time in software development, in positions ranging from developer to senior manager and programme manager. Started diving into Data Science a few years ago and currently works as a Principal Data Scientist at Endava, delivering projects in multiple industries, including finance, insurance, logistics, and telecom. Author of a Packt book on Kaggle Notebooks (data analytics, machine learning, generative AI); Kaggle & AI Google Developer Expert. Writes about Kaggle, Data Science, Machine Learning, AI, and Agentic AI.

Recording

Video will be available after the conference.

Gabriel Preda
