This document should be read in parallel with the corresponding slides.
First, a huge thank you to the BeSocial team for the invitation.
Who am I
I like to introduce myself briefly before starting a talk. I’m a
historian, initially trained in political science and international
history.
What I am going to talk about today is the result of chance. I
started collecting tweets in 2009 for a conference I was organising,
just to know what the conference participants were discussing on
Twitter. I then learned, first for subjects of personal interest, to
collect tweets on a large scale, thanks to the streaming API.
When the Centenary of the First World War started, I was ready to
collect tweets massively and easily, and I did so, at first just to see
what would happen, after some discussions with historians of the Great
War. It worked so well that it became my main research project until
2019.
In March 2020, when the first lockdowns were introduced in Europe, I
had my server ready, as the #ww1 project had ended a few months
earlier. So I started collecting tweets. The collection is still under
way, with more than 62 million tweets, mostly in French, stored
somewhere in a database, on a university server.
Those are the two projects I will base my keynote on.
So, « api or archives? tormented ways to transform tweets into
historical sources ».
Defining an Application Programming Interface (API) is
easier than defining archives. An API is a socio-technical
device - a piece of software - that allows two applications to exchange
features, for instance. Do you see a Facebook “like” button on a
webpage? That’s one possible use of the Facebook API.
Concerning what we are discussing today, I am - like many developers
and researchers - using the Twitter API to get information. Concretely,
a piece of software on a server connects to the Twitter API and sends
it some information (what we want to collect, i.e. all tweets containing
a word or hashtag from a list of words or hashtags), and the API
returns, if the request meets some conditions that are defined by
Twitter, information: in this case tweets and metadata about those
tweets and their authors.
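To make this exchange concrete, here is a minimal Python sketch of the logic only, with no network calls. The function name `matches_any_term`, the sample tweets and the field names (`text`, `user`) are assumptions made for illustration; the real Twitter API applies its own matching rules and returns far richer metadata.

```python
import re

def matches_any_term(text, terms):
    """Return True if the text contains at least one tracked word or hashtag.

    Matching is case-insensitive and done on whole tokens, so "#ww1"
    does not match "#ww1centenary". This is a deliberate simplification.
    """
    tokens = set(re.findall(r"#?\w+", text.lower()))
    return any(term.lower() in tokens for term in terms)

# Hypothetical list of words and hashtags that the client sends to the API.
tracked_terms = ["#ww1", "centenary"]

# Hypothetical tweets standing in for what the platform holds.
sample_tweets = [
    {"text": "Remembering the Somme #WW1", "user": "alice"},
    {"text": "Lunch was great today", "user": "bob"},
    {"text": "The Centenary events start tomorrow", "user": "carol"},
]

# What comes back: only the tweets that meet the condition, with metadata.
collected = [t for t in sample_tweets if matches_any_term(t["text"], tracked_terms)]
```

In this toy version the filtering happens locally; with a real API it happens on the platform's side, which is precisely why the conditions it sets matter so much.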
archives?
Archives are much more complex to define.
I should make clear here that I am a historian and speaking as such:
my considerations on archives might be contested by archivists, who are
far more rigorous than us!
“Archives” is a word and concept that is much more polysemic than API,
all the more so because the first digital humanities projects were often
called ‘archives’ and, in a way, blurred the definition. Furthermore,
even if most European languages use more or less the same word
(the French “archive(s)”, the English “Archives”, the German “Archiv” or
“Archivalien” to give some examples), these words don’t cover the same
concepts, and even when they do, the traditions of concretely collecting
and sorting archives are not the same from one country to the other. I
should specify that, when I say ‘archives’ in English, I am using it as
a French speaker would use the French word and the concept it covers.
So as it is complex to define, I asked the system that simplifies
everything to define it: Google (define:archive).
Here is the definition of “archives” from Google, in English. Some
elements are here - though they could all be defined better:
‘collection’, ‘historical documents’, ‘records’, ‘store’. But this
definition lacks something.
archives? (in French)
And this something is present in the definition in French. ‘classés’
means ‘sorted’. It’s an important point because behind this word lies
most of the work of archivists. Archivists have to decide what will be
collected, what will be kept, how it will be sorted, and where it will
be stored. They then index documents that are stored in archive
centers.
archiving as the setting of historians’ work
This is still too simple a definition of ‘archives’, but let’s say
that the work of archivists (and librarians) can be summed up in three
words: preserving, sorting, indexing. I’ll leave storage aside for
today, and I’ll speak about accessibility a bit later.
Those three words are basically the setting of historians’ work. With
no preservation, there’s no history. If documents or records are not
sorted, they’re not findable, and there’s no academic history. If they
are not indexed, searching for the right document might still be too
hard, and history is biased.
Just one example taken from my earlier life as a researcher. When I
first went to the Bundesarchiv in Berlin, I had to look for documents,
some of which used to be stored in the former West Germany
(Koblenz) and some in the former East Germany (Potsdam). In Koblenz,
they had worked on indexing. I could do very efficient searches,
sometimes at the level of the document rather than at the level of the
box. Archives that used to be stored in East Germany were preserved and
sorted, but not indexed. I had to work very differently, less
efficiently, and much more slowly.
So archiving is the basis of almost all historical work. Archives are
‘mediated’ thanks to the archivists’ work.
primary sources and archives
But there’s a difference between archives and primary sources. All
archives are primary sources, at least potentially. But not all primary
sources are archives.
Indeed, in some cases, historians have to get their primary sources
without archivists. For instance, oral history often implies a method
where the historian is creating, together with interviewees, their own
primary sources.
Some ‘private archives’ – family archives for instance – can
sometimes be included in primary sources but not in archives: when they
are preserved, but not sorted and even less indexed. In that case,
historians will have to sort and index archives themselves.
Tweets and other data from social media (I am saying data and not
archives, as you all noticed, though I am not going to speak about that
today) can also be primary sources, but most of the time they are not
archives - in the sense that they are not mediated by archivists.
Preservation, sorting and indexing are handled differently. Either
these data are archived through processes such as web archiving, or they
are not archived. They will be preserved, or not, by the social media
firms themselves. And you will access them either through possibly
illegal means (scraping), or through an API. If there is one.
So, to access tweets, there is a mediation through the API – not
through the work of archivists.
api as the setting of the historians’ work
More generally, when you work with social media (or many web-based
or internet-accessible online services) and their APIs, there are many
elements you should pay attention to:
Is there an API? As I said, many of us work on Twitter
because it’s feasible. Because there’s an API. It’s a strong source of
biases. Of course there are other ways (scraping) and in some cases,
there will be enough data or primary sources in web archives.
What are the conditions to use the API? Before 2021, to use
Twitter’s API with no budget, you had to choose between 1) streaming (1%
of the firehose, meaning theoretically 5 to 6 million tweets per day),
which implied anticipating hashtags (for instance), or 2) search, but
with strong limitations: you could only go back up to 7 days in the
past.
What is the sustainability of the API? Usually it’s not
sustainable: it will change, and it might be abruptly closed, which will
destroy your research project at the same time. It happened to the
French project Algopol, based on Facebook, in 2016. Fortunately, they
had enough data to work on.
All those elements will orient your research and influence the way
you work. The disadvantage is that the work archivists usually do
with lots of historical documents will be yours to do: it’s
time-consuming, and it also requires learning technical (not only
computing) skills. The advantage is that you can tailor your corpus as
you wish and you do this very explicitly.
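The pre-2021 conditions mentioned above lend themselves to a short worked example. The Python sketch below derives the firehose size implied by the figures quoted (1% of the firehose being 5 to 6 million tweets per day) and checks whether a tweet is still reachable by a search limited to 7 days. The function names and the example dates are hypothetical, and the numbers are only those given in this talk, not measurements.

```python
from datetime import datetime, timedelta

STREAM_SHARE_PERCENT = 1   # streaming gave access to 1% of the firehose
SEARCH_WINDOW_DAYS = 7     # search could only go back up to 7 days

def implied_firehose(stream_tweets_per_day, share_percent=STREAM_SHARE_PERCENT):
    """Total tweets per day implied by the size of the streamed share."""
    return stream_tweets_per_day * 100 // share_percent

def reachable_by_search(tweet_time, now, window_days=SEARCH_WINDOW_DAYS):
    """True if the tweet was posted within the search window before `now`."""
    age = now - tweet_time
    return timedelta(0) <= age <= timedelta(days=window_days)

# Firehose size implied by "5 to 6 million tweets per day" at 1%.
low = implied_firehose(5_000_000)
high = implied_firehose(6_000_000)

# Hypothetical dates: a collection started on 20 March 2020.
now = datetime(2020, 3, 20)
recent_ok = reachable_by_search(datetime(2020, 3, 15), now)   # 5 days old
too_old = reachable_by_search(datetime(2020, 3, 1), now)      # 19 days old
```

The second check is the practical point: with search alone, anything older than a week is out of reach, which is why streaming required anticipating the hashtags before an event started.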