Using AI to speed up the investigation into the communication between Shell and the Dutch government.
In light of the recently launched Special Interest Group of Applied Data Science at Utrecht University, the ASReview-team is organizing a hackathon to help Follow the Money (FTM).
In this blog post, you will read all about the background of the hackathon, the goals, and the tracks available for you to join!
As this is the first ASReview hackathon ever, let’s start with some information on what is meant by an ASReview hackathon. An ASReview hackathon is a collaborative event during which the organizing team and the participants work together towards a certain objective.
This objective is often quite broad, so to make things more manageable the hackathon is split up into different tracks. Each track has a smaller, more practical objective with some very broad guidelines. The rest is completely up to you: design awesome concepts and/or build code that could perhaps even be used in practice!
One of the fundamental beliefs at ASReview is that research/science should be open and transparent. Therefore, any work created during an ASReview hackathon will be published under an open licence (the MIT licence). This way, if your work is chosen as the winner within your track, it could be implemented in ASReview or be published as a solution to attain the larger objective! Of course, this also means that you will receive all the credit for your original work 🙂
The very first ASReview hackathon is focused on helping investigative journalism platform Follow the Money (FTM). FTM strives for justice and transparency in a world of scarce and unequally distributed riches, where everything is about money. By examining the flow of money, they scrutinize power and embedded societal problems while uncovering complex connections and offering solutions.
In the investigation of the Shell Papers, FTM is calling on the general public to help search through the papers for possible ties between multinational oil and gas company Shell and the Dutch government.
The Shell Papers are gathered through several so-called Wob requests (Wet openbaarheid van bestuur, which translates to the Government Information (Public Access) Act), through which the Dutch government is asked to share information. So far, FTM has collected 2,500 documents, consisting of 17,000 pages. That’s a very long read, and it is only the first batch of documents: in total, around 150,000 documents are expected!
The Shell Papers (in Dutch) consist of emails, documents, WhatsApp messages, governmental decisions, and many more types of files, which are retrieved from several different counties in the Netherlands. In a nutshell, they contain a very large proportion of all communication between Shell and Dutch governmental institutions.
Before we can do anything with the data, it has to be pre-processed. The raw data is quite challenging: it contains a lot of noise and little structure. Most of the data is raw email HTML, combined with a header. Documents are labeled as correspondence, reports, permits, governmental decisions, or particular documents.
To get usable information from this data, you could, for example, get started with beautifulsoup and quite a bit of regex! Also, Wob requests allow for the censoring of some personal data, so some email addresses and personal names have been redacted.
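As a rough idea of what such a clean-up step could look like, here is a minimal sketch. It is only an illustration under assumptions: the sample message, the header field names, and the blank line separating header and HTML are hypothetical, since the real layout of the Shell Papers files may differ. To stay dependency-free it uses Python's built-in `html.parser` together with regex; beautifulsoup could take over the text-extraction part with less code.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(raw_html):
    """Strip the markup and collapse the whitespace it leaves behind."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Assumed header fields; adjust to whatever the real files contain.
HEADER_RE = re.compile(r"^(From|To|Date|Subject):\s*(.*)$", re.MULTILINE)


def parse_message(raw):
    """Split a raw entry into a dict of header fields and plain body text."""
    # Assumption: the header block ends at the first blank line.
    header_part, _, body_part = raw.partition("\n\n")
    headers = {key.lower(): value.strip()
               for key, value in HEADER_RE.findall(header_part)}
    return {"headers": headers, "text": html_to_text(body_part)}


# Hypothetical sample entry, not taken from the Shell Papers.
SAMPLE = (
    "From: [redacted]\n"
    "Date: 12-03-2019 14:05\n"
    "Subject: Meeting notes\n"
    "\n"
    "<html><head><style>p {color: red;}</style></head>"
    "<body><p>Dear colleague,</p>"
    "<p>See the&nbsp;attached   report.</p></body></html>"
)

message = parse_message(SAMPLE)
print(message["headers"])
print(message["text"])
```

Keeping the header fields and the body text separate makes it easier to filter or sort messages later, for example by date or subject.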
Available are at least:
- Message types
- Message content
- Multiple date formats (which aren’t always the same for a single entry)
A real treasure trove of information might be hidden within the mess!
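The mixed date formats mentioned above could be tackled along these lines. This is a sketch under assumptions: the list of formats and the sample values are hypothetical guesses at what Dutch documents might contain, not formats confirmed to occur in the Shell Papers. Each candidate format is tried in turn, and a small helper shows whether the date fields of a single entry point to the same day.

```python
import re
from datetime import datetime

# Dutch month names, for written-out dates such as "12 maart 2019".
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Assumed numeric formats; extend this tuple as new variants show up.
NUMERIC_FORMATS = ("%d-%m-%Y %H:%M", "%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y")

WRITTEN_DATE_RE = re.compile(r"^(\d{1,2})\s+([a-z]+)\s+(\d{4})$")


def parse_date(raw):
    """Return a datetime for a recognised format, or None otherwise."""
    text = raw.strip().lower()
    for fmt in NUMERIC_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    match = WRITTEN_DATE_RE.match(text)
    if match and match.group(2) in DUTCH_MONTHS:
        day, month_name, year = match.groups()
        try:
            return datetime(int(year), DUTCH_MONTHS[month_name], int(day))
        except ValueError:
            # For example "31 februari 2019".
            return None
    return None


def distinct_days(raw_dates):
    """Return the set of calendar days found among one entry's date fields."""
    parsed = (parse_date(value) for value in raw_dates)
    return {moment.date() for moment in parsed if moment is not None}


# Hypothetical entry whose two date fields use different formats.
entry_dates = ["12-03-2019 14:05", "12 maart 2019"]
print(distinct_days(entry_dates))
```

If `distinct_days` returns more than one day for a single entry, its date fields disagree and the entry probably deserves a manual look.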
Besides reading all the documents, visualization is also a valuable tool for identifying the relations between the documents. From semantic clusters to word clouds (for example, as implemented in ASReview), a visualization can say more than a thousand words. As this is mainly a creative track, it is completely up to you how you visualize the data at hand.
Will your artwork be used on the FTM website?
Follow the Money currently uses crowdsourcing to analyze all the different documents of the Shell Papers. This means that people can read any and all files in any order. The goal of the third track is to provide FTM with a pipeline that allows for faster screening of the documents. One potential solution: instead of reading one document after another in random order, it would be far more helpful to read documents ranked by, for example, relevance!
A possible strategy could be screening prioritization using Active Learning: a constant interaction between a human (who labels documents as relevant or irrelevant) and a machine (which provides you with the next most likely relevant document to read, based on previous decisions). The Active Learning model constantly learns from your decisions and continuously updates itself to find the next potentially relevant document. Once the data has been cleaned, the dataset(s) could, for example, be made available as a plug-in in ASReview LAB, as was done for the covid-19 dataset. ASReview LAB is open and free software that allows you to screen documents using Active Learning. For a more in-depth introduction to ASReview, read this Nature article or blog post.
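To make the human-machine loop a bit more concrete, here is a toy sketch in plain Python. It is not how ASReview LAB is implemented: the tiny naive Bayes scorer, the function names, and the five example documents are all assumptions chosen for illustration. The loop retrains on every decision made so far, shows the reviewer the unlabeled document that currently scores highest on relevance, and records the new label.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labeled):
    """Count words and documents per class from (text, is_relevant) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = {True: 0, False: 0}
    for text, label in labeled:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def relevance_score(text, model):
    """Naive Bayes log-odds of 'relevant', with add-one smoothing."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    score = math.log((doc_counts[True] + 1) / (doc_counts[False] + 1))
    for label, sign in ((True, 1), (False, -1)):
        total = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += sign * math.log((word_counts[label][word] + 1) / total)
    return score


def screen(documents, seed_labels, ask_reviewer, budget):
    """Show the most likely relevant document first, learn, and repeat."""
    labeled = dict(seed_labels)
    order = []
    for _ in range(budget):
        pool = [doc_id for doc_id in documents if doc_id not in labeled]
        if not pool:
            break
        model = train([(documents[d], lab) for d, lab in labeled.items()])
        next_id = max(pool, key=lambda d: relevance_score(documents[d], model))
        labeled[next_id] = ask_reviewer(next_id)  # the human decision
        order.append(next_id)
    return order, labeled


# Hypothetical toy documents; the real papers are in Dutch and far messier.
documents = {
    "a": "meeting between ministry and shell about gas permits",
    "b": "lunch menu for the office canteen",
    "c": "shell requests ministry decision on gas permits",
    "d": "parking schedule for the office garage",
    "e": "ministry reply to shell on permits",
}
truly_relevant = {"a", "c", "e"}  # stands in for the human reviewer

order, labels = screen(
    documents,
    seed_labels={"a": True, "b": False},
    ask_reviewer=lambda doc_id: doc_id in truly_relevant,
    budget=2,
)
print(order)
```

In this toy run, the two documents presented to the reviewer are the two remaining relevant ones, so the irrelevant parking schedule never has to be read. That is the point of screening prioritization: relevant documents surface early.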
Let’s hack the reading-process and screen smarter.
During the hackathon, there will also be special guests present. Learn more about the data from Bas van Beek and Dimitri Tokmetzis, two journalists at Follow the Money. During the hackathon you will also have the opportunity to get feedback on your work from track experts.
And what’s more? On Hackathon Saturday, you can also take a break from all the coding and thinking and attend mini-lectures by research experts and renowned professors. Subjects range from concentration to open science, so sit back, relax, and enjoy the lectures. See the preliminary program for an overview of the mini-lectures.
You can find all the practical details (like the registration form and the preliminary program) on the announcement page.