Chatting with the Past? How AI is Changing Our Notion of History

This text was written in preparation for an interview – it is here published “as is”, i. e. without correction of my non-native and very French English. The questions were written by Dr. Felix Fuhg, for the Körber Stiftung.


Intro Questions

Do you remember what your first contact with AI was as a professor of European history?

That’s a very hard question to answer. Basically, almost everybody today has had contact with AI without knowing it, including historians in their professional capacity. Do you take pictures of documents in archive centers? There’s probably AI embedded in your smartphone, which makes you an AI user as a historian.

Let’s say that, as a researcher, I have deliberately used AI-based software for ten years. For instance, even if it’s not obvious in my publications, I have used MALLET, a command-line tool for topic modelling – looking for topics within a large corpus of texts – that is based on machine learning.

As a teacher, I have a course – European collective memory in the digital era – in which, for the second year now, there is a session where we use GPT-based software to see how collective memory is, in a sense, embedded in those large language models. GPT is the large language model behind ChatGPT. It’s a master’s course, and it’s working well.

How many people are using LLM tools such as ChatGPT, Bard or Bing API? Has their use by students had any impact on your marking of papers at university?

I am not able to say how many people are using such tools. But as ChatGPT can be used for free (with some constraints), virtually everybody can use them, including through this new Microsoft Bing feature that is based on ChatGPT.

I have not yet been directly confronted – or I did not notice it – with students using ChatGPT for their essays. But next semester I have a bachelor course that is based on essays written at home, and, with a colleague who teaches the same course in German, we are currently thinking about changing how we organize the evaluation.

Just an anecdote here:

This year, a few weeks ago, in the master’s course I was talking about, we assessed the quality of some text produced by ChatGPT about Jean Monnet. One student seemed unhappy. I asked her why: in fact, while working with ChatGPT, she had realized that ChatGPT has a sort of writing style, something very neutral. And she thought that the introduction to a collective work for another course, written by one of her fellow students, had the same style. So she was – rightly – convinced that this introduction had been written with ChatGPT, and she asked her fellow student, who admitted it, to rewrite the introduction.

Could you briefly say what – technically speaking – a large language model is? How does it work?

A large language model is an AI-powered system that is trained (the ‘P’ in ChatGPT stands for ‘pre-trained’) on massive datasets – in the case of ChatGPT, these include CommonCrawl, a sort of filtered snapshot of the living web. Those systems are mostly based on neural networks that loosely imitate how our brain functions. They are able to ‘learn’ from this training. Learning, here, means statistics and probabilities: for each word in the training dataset, the AI-based system works out the probability of this word being associated with other words. Today, we speak about Transformers – a type of neural network introduced by Google researchers in 2017 – that are better at, let’s say, taking into consideration the general context of a word within the training dataset. So basically, the texts generated by, for instance, ChatGPT are made of words that are statistically pertinent. The statistics are complex and hard to understand, but it’s just statistics.
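
To make the "it's just statistics" point a bit more concrete, here is a minimal sketch in Python. It is a toy, not how GPT works: it only counts which word follows which (a so-called bigram model), whereas a Transformer uses a neural network and a much wider context. The tiny corpus and the function names `train_bigram_model` and `generate` are assumptions made up for this illustration.

```python
from collections import Counter, defaultdict
import random


def train_bigram_model(text):
    """For each word, estimate the probability of every word that follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Count every (word, next word) pair in the training text.
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Turn raw counts into probabilities that sum to 1 for each word.
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {w: c / total for w, c in followers.items()}
    return model


def generate(model, start, length=8, seed=0):
    """Produce text by repeatedly drawing a next word according to the probabilities."""
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:  # the word was never followed by anything in training
            break
        candidates = list(followers)
        weights = [followers[w] for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        output.append(word)
    return " ".join(output)


# Hypothetical miniature "training dataset".
corpus = "the war began in 1914 . the war ended in 1918 . the treaty was signed in 1919 ."
model = train_bigram_model(corpus)

# In this corpus, "war" follows "the" two times out of three, "treaty" once.
print(model["the"])
print(generate(model, "the"))
```

The generated sentence can look plausible while mixing up the dates, because the model only knows which words tend to follow each other, not which statements are true.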

For the user, LLM interfaces are usually simple: you enter a prompt, and the LLM-based website, for instance, answers with a text or an image based on this prompt.

Artificial Intelligence has become a commonly used term, but what are actually the dimensions of the definition? And to what extent is the term, and the often-used descriptions, really linked to a new era of data analytics and data modelling?

There are several families of AI. Usually, historians of computing – which I’m not – distinguish between symbolic and connectionist AI. Symbolic AI imitates the way we reason, the way we think, insisting on logic; it was popular in the 1980s. Connectionist AI imitates – loosely, see this more as a metaphor – how our brain works. Today, the second is the more popular, in particular deep learning, which uses neural networks (again, the brain metaphor). Deep learning is the basis of LLMs. But, from what I have understood, even if LLMs are based on connectionist AI, they also use symbolic AI for some elements. Both families of AI have roots in the 1960s, if not the 1950s, and in this sense they are not new. In fact, AI in general is a project that has been part of computing since the beginning of computer science – just remember the Turing test, described in a famous 1950 article by Alan Turing. Some authors even evoke humanity’s old project of creating automatons that look like us, from Antiquity to today, at least in the Western world. What’s new is that, for a long time, computers lacked the power and the data to make deep-learning-based applications really efficient; since the 2010s, we have had the computing power and the huge datasets necessary to make connectionist AI work pretty well. Furthermore, with ChatGPT, for instance, data analytics becomes far easier, for two reasons: first, because GPT, if you give it additional training, can easily detect persons, dates, etc.; second, because it is today quite easy to write code, including data analytics code, with ChatGPT.

In your opinion, Frédéric, do AI, and particularly those tools that work with so-called large language models, have the potential to have a considerable impact on remembrance culture, history education and the way we think about history (and how we link history to the present and future)?

We need to look at three things here:

LLMs, as I said, are trained on training datasets. Those datasets embed views and visions of the historical past, descriptions of facts, etc. So all LLMs come with embedded notions of history, which can be quite diverse. The problem is that the companies behind commercial LLMs refuse to detail their training datasets. We know that CommonCrawl has been used for GPT, the engine of ChatGPT. We also know that additional training has been done for ChatGPT, based on Reddit datasets, for the conversational aspect. We also know that Wikipedia has been used, etc. But the details are not well known. Furthermore, the datasets are too massive: it’s hard to really assess what history there is in those training datasets;
Then, as we said, it’s a question of statistics: the generated text is statistically pertinent. In a way, you can look at this text as an average text. Hence the typical style of ChatGPT, very balanced;
Last but not least, and less documented: to avoid answers from ChatGPT that could be inappropriate (racist, for instance), there is supplementary training through digital labor, i.e. humans correcting ChatGPT’s answers, corrections from which ChatGPT, again, will learn. (For ChatGPT, this was done by badly paid Kenyan digital workers – lots of them finally refused to work, as it is traumatic work.)

So, in the end, we have generated texts that are, as I said, quite ‘average’. Sometimes it works, probably better in English (90% of the training dataset); sometimes the ‘average’ text is really bad, often in the details: it looks convincing, but most of the details are wrong – that’s typically the case with Jean Monnet. And of course, there’s what’s not in the generated text, and sometimes there are huge absences. So it produces, in other words, “good enough” texts. But “good enough” is not science, right? It’s not what we aim at. Furthermore, in the case of ChatGPT, the data it was trained on goes no further than December 2021. So, it’s a conservative ‘good enough’, based on data in which the recent past is much more represented than the past before the web, or even before 2010. It’s a sort of ultimate presentism.

Does historicization help us to understand the current debate regarding the impacts of AI on society and the way AI is changing the notion of history through its implications?

Obviously, though it is not sufficient in itself and should not play too important a role – I do not like the tendency of historians to say that everything has already existed. But understanding how it works, and understanding that it is a longue durée project, is important for two reasons:

To prevent any apocalyptic visions of the future, because such visions prevent us from thinking about AI, including its negative aspects;
To remember that this project is part of long-term cultural trends, at least in the Western world.

Main Part on ChatGPT and Hello History

It is still an ongoing debate whether European nations “sleepwalked” into the war. Let us see how ChatGPT answers the difficult question “Who is responsible for the outbreak of the First World War?”

ChatGPT response:

The outbreak of the First World War was the result of a complex set of factors and events. While it is not accurate to attribute the entire responsibility to a single party or individual, there were several key factors and countries that played significant roles in the lead-up to the war. Here are some important factors and parties involved:

Alliance Systems: The complex web of alliances between major European powers contributed to the escalation of the conflict. The two main alliances were the Triple Entente (comprised of France, Russia, and later Britain) and the Central Powers (consisting of Germany, Austria-Hungary, and Italy).

Imperialism and Rivalries: Intense rivalries and competition for colonial territories and resources among European powers heightened tensions. Germany’s rapid industrialization and naval expansion, as well as its desire for a greater global influence, challenged the existing balance of power.

Nationalism: Nationalistic sentiments and aspirations were prevalent in various countries, including Serbia, Austria-Hungary, and the Balkan region as a whole. These nationalist movements and the desire for self-determination fueled tensions and conflicts.

Assassination of Archduke Franz Ferdinand: The assassination of Archduke Franz Ferdinand of Austria-Hungary and his wife in Sarajevo, Bosnia, by a Serbian nationalist in June 1914 acted as a catalyst for the outbreak of war. Austria-Hungary saw this as an opportunity to confront Serbia and curb its influence in the region.

Failure of Diplomacy: Diplomatic efforts to resolve the crisis and prevent the outbreak of war were inadequate. The July Crisis of 1914 saw a series of diplomatic failures, ultimatums, and miscalculations that escalated the tensions instead of defusing them.

While Austria-Hungary’s ultimatum to Serbia and Germany’s unconditional support for Austria-Hungary are often seen as immediate triggers, it is important to note that the responsibility for the outbreak of the war cannot be solely attributed to one party. The complex interplay of geopolitical, economic, and social factors, combined with a lack of effective diplomacy, led to the escalation of the conflict and the eventual outbreak of the First World War.

What do you think about the given answer? Does the answer match the complexity of the events? And what are the challenges in your eyes?

It’s not too bad a text, but only because it is neutral and ‘average’. This is a synthesis, not a problematized piece of research – there’s no explicit relationship to current researchers, for instance. It’s “good enough”. The problem for me here is that you have no clue about the sources of this text. If you read an article on Wikipedia, you’ll see references, you’ll see the whole edit history, the debates, etc. Here, nothing. You can ask ChatGPT for references. ChatGPT will give some, but they will be made up.

In our previous question, we used ChatGPT to reconstruct and summarize a research debate. ChatGPT offers many more opportunities: we can, for example, ask ChatGPT to let Otto von Bismarck comment on the Russian invasion of Ukraine, an event that started in February 2022. Bismarck died in 1898. If Bismarck had to react to the invasion with a political statement, what would he say? This is a snippet from his speech, generated and delivered by ChatGPT:

Ladies and gentlemen,

Today, I stand before you to address the pressing matter of the ongoing war in Ukraine. As an observer of history, it is with deep concern that I witness the conflicts and divisions that plague the Ukrainian land. The struggles faced by the Ukrainian people resonate with the challenges and tribulations my own nation once endured. Thus, it is my duty to offer insights and reflections in the hopes of fostering understanding and, ultimately, a path towards peace.

First and foremost, it is imperative to recognize the complexity of the Ukrainian conflict. The roots of this strife lie in a long history of cultural, ethnic, and political tensions. These deep-seated divisions have been exacerbated by external influences, power struggles, and divergent geopolitical interests. However, we must not forget that peace cannot be achieved by perpetuating a cycle of violence. Instead, it requires empathy, dialogue, and compromise.

This text will be brought to life with voice-over AI technology

In preparation for our podcast, I came across reports from schoolteachers who mention that they let ChatGPT write such speeches and then analyze and problematize them in their classes. What do you think about this approach?

This Bismarck text is just nonsense, for many reasons: the vocabulary is not that of the second half of the 19th century, and it does not seem – from my point of view as a French historian who is not a specialist of Germany before 1919 (I have worked on the interwar period and then on the memory of the First World War) – to reflect Bismarck’s thinking on politics and international relations. There’s another point too: there’s a double time inconsistency. Asking a dead historical character to answer this question is the first one, as you mentioned, but equally problematic is the fact that ChatGPT has been trained on data that predates the Russian aggression. That’s a more subtle time inconsistency, but equally important. Nevertheless, generating that kind of text, asking students to analyse it, showing them how bad that kind of text is, and then reflecting on how we write history can be a good idea. It’s basically what I do, not with a speech, but by asking for biographies of historical figures less prominent than Bismarck. It’s good training in historical criticism. Finding subjects where the answers are rather good and subjects where the answers are bad is also a way to think about biases in those systems, but also in the way we write history and do research. Some regions of the world, including large parts of Africa, lack data about their history – I mean online, digitized data. And that lack of data will be reflected in the way the chatbot answers or “hallucinates” (i.e. totally makes up the answer).

Chatbot’s response: I did not promote the protocols of the elders of Zion. In fact, I publicly denounced them and declared they were an anti-Semitic work that had nothing to do with Russia’s Jewish population.

Could you please comment on the answer and explain to us what is wrong with the system that makes the chatbot give us such responses?

The answer is historically false, it’s as simple as that. Not a single word is true. It’s not even plausible fiction. I have had a look at Hello History. It’s based on GPT, the engine behind ChatGPT. They do not say whether they have given GPT additional training, whether they have optimized it for history, and if they did, with which data. It’s a black box.

To be honest, I’m quite puzzled when I look at Hello History: they mix fictional and real characters, and they provide images of those historical persons that are generated and, in their details, false (look at Gilgamesh’s or De Gaulle’s portraits: De Gaulle’s uniform never existed and does not even look French, and Gilgamesh is portrayed as a bodybuilding world champion).

Nevertheless, asking students to work on Hello History can be a good pedagogical exercise: why those characters and not others? How did they choose them? Why this focus on ‘great people of the past’? Assessing the quality of the answers, trying to get information about the system, how it works, etc. All that can be done in teaching.

Furthermore, it could be interesting to test a version of an LLM – I would prefer BLOOM, which is open source and public – or of an image generation system (Stable Diffusion, because it’s open source) that would be optimized for history. One of the reasons it could be interesting is to get data: to get the famous prompts, which are the seeds of the generated texts and images. They would be primary sources about how our fellow citizens see the past.

How to deal with the use of LLM tools in history education?

I agree with the HCI statement, for several reasons, including the fact that lots of calls for regulation are hypocritical, not really honest. I think we should focus on critical thinking, which is at the core of our work, in the end. As historians, we’re good at stimulating critical thinking, at thinking about what a document is, what a text is, and the complex relations between text, image and veracity. Concretely, we can organize exercises that allow students to understand the system they’re using and to critique the generated texts or images, and from there we can also teach content and methods.

The statement also asks for stronger cooperation between historians and the tech companies developing AI tools, which are often used by the public and by students, but also by everyone with a genuine interest in getting information on historical events and figures as well as explanations of historical developments. Why is this necessary? And what issues with current AI tools such as ChatGPT can this cooperation counterbalance?

It’s necessary because tech companies have no idea what a primary source is. They train their systems on datasets, they have no clue about the content of those datasets, and I wonder if, in the end, they care about that. I think some of those tech companies just don’t care. Then, once the system is trained, they start correcting biases. Where we could help is in showing those companies how to assess training datasets, how to balance them, and how to perform a critique of them. We should not be too self-centered: anthropologists, literature researchers, sociologists, political scientists and lawyers have a lot to say too. All the humanities and social sciences should be involved in the design of artificial intelligence systems.

In some domains of the tech industry, this is already the case: Ubisoft, in its Montréal branch, has worked with historians on Assassin’s Creed. What’s possible for video games should be possible for the AI industry.
