What a great weekend, and what amazing contributions we saw during the hackathon! From dashboards to 3D visualizations, it was all there. Thanks to the work of the participants, it was verified that the ASReview pipeline can also be used for documents other than scientific abstracts. This means that any organization or person who has to read a very large number of documents, such as emails, can use ASReview.
Data journalists Bas van Beek and Dimitri Tokmetzis from Follow the Money were very impressed by all the work that was done in such a short period. As a bonus, Dimitri also wrote a nice article mentioning the hackathon and the possible use of ASReview for journalists. From the article: “While we do use e-discovery software to easily search through that [hundreds of thousands of documents], we think ASReview can help us automatically make selections of documents on the same topic. It’s like being able to reduce a haystack to a pile of bales, where each bale is a subject.” Read the whole newsletter here.
But most importantly, these were the winners of each track.
- Raymon van Dinter (Sioux Technologies & PhD at Wageningen University)
- Lukas Schubotz (student Innovation Sciences at Utrecht University)
- Stef van Buuren (Professor Missing Data imputation at the department of Methods and Statistics UU)
Their project visualized threads of mail exchanges between actors over time. When several threads are shown together, the display gives rapid insight into the structure and timing of the exchanges between actors. See their results here.
- Bianca Kramer (Open Science Expert at the University Library Utrecht)
Bianca Kramer created a network analysis of email sender and recipient domains using VOSviewer. This offers an alternative way to identify clusters within email correspondence, which in turn can complement a keyword-based approach. Take a look at the project.
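To give an idea of what the input to such a network analysis could look like, here is a minimal Python sketch that counts how often one domain writes to another. It is not Bianca's actual workflow: the `(sender, recipients)` tuple layout, the function names, and the example addresses are all assumptions made for illustration.

```python
from collections import Counter
from email.utils import parseaddr


def domain(address):
    """Return the lower-cased domain of an email address, or None if there is none."""
    _, addr = parseaddr(address)
    if "@" not in addr:
        return None  # scrubbed or malformed address
    return addr.rsplit("@", 1)[1].lower()


def domain_edges(messages):
    """Count (sender domain, recipient domain) pairs.

    `messages` is an iterable of (sender, [recipients]) tuples; this layout
    is a hypothetical one chosen for the example.
    """
    edges = Counter()
    for sender, recipients in messages:
        src = domain(sender)
        for recipient in recipients:
            dst = domain(recipient)
            if src and dst:
                edges[(src, dst)] += 1
    return edges


# Made-up example data
messages = [
    ("Alice <alice@shell.example>", ["bob@ministry.example"]),
    ("bob@ministry.example", ["alice@shell.example", "carol@province.example"]),
    ("dave@shell.example", ["bob@ministry.example"]),
]
edges = domain_edges(messages)
for (src, dst), count in edges.most_common():
    print(f"{src} -> {dst}: {count}")
```

The resulting edge counts are the kind of weighted pairs a network tool can cluster; messages with a scrubbed sender or recipient are simply skipped here.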
- Evi Hendrikx (PhD candidate at the department of Experimental psychology)
- Matthew van der Meer (student of UU master Artificial Intelligence)
This team created a search-engine-like function that lets the user select specific files from this (or any other tabular) dataset. In other words, with this contribution, one could filter the dataset using search strategies that combine “AND”, “OR” and “NOT”. Read more here.
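The idea of filtering with “AND”, “OR” and “NOT” can be sketched in a few lines of Python. This is not the team's code, just a minimal illustration; the `matches` function, its parameter names, and the example rows are assumptions.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Case-insensitive keyword filter.

    all_of  -> every term must occur (AND)
    any_of  -> at least one term must occur, if any are given (OR)
    none_of -> no term may occur (NOT)
    """
    lowered = text.lower()

    def has(term):
        return term.lower() in lowered

    return (
        all(has(term) for term in all_of)
        and (not any_of or any(has(term) for term in any_of))
        and not any(has(term) for term in none_of)
    )


# Made-up rows standing in for a tabular dataset
rows = [
    {"id": 1, "text": "Permit request for gas extraction in Groningen"},
    {"id": 2, "text": "Lunch invitation"},
    {"id": 3, "text": "Gas extraction report, draft version"},
]

# gas AND (permit OR report) NOT draft
hits = [
    row["id"]
    for row in rows
    if matches(row["text"], all_of=["gas"], any_of=["permit", "report"], none_of=["draft"])
]
print(hits)
```

Substring matching keeps the sketch short; a real search function would probably match on whole words or use a proper query parser.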
Their prize? All winners received golden rubber duckies to assist them in future programming tasks! Read more about rubber duckies and programmers here.
Using AI to speed up the investigation into the communication between Shell and the Dutch government.
In light of the recently launched Special Interest Group of Applied Data Science at Utrecht University, the ASReview team is organizing a hackathon to help Follow the Money (FTM).
In this blog post, you will read all about the background of the hackathon, the goals, and the tracks available for you to join!
As this is the first ASReview hackathon ever, let’s start with some information on what is meant by an ASReview hackathon. An ASReview hackathon is a collaborative event during which the organizing team and the participants work together towards a certain objective.
This objective is often quite broad, so to make things more manageable the hackathon is split up into different tracks. Each track thus has a smaller, more practical objective with some very broad guidelines. The rest is completely up to you: Design awesome concepts and/or build code which could perhaps even be used in practice!
One of the fundamental beliefs at ASReview is that research and science should be open and transparent. Therefore, any work created during an ASReview hackathon will be published under an open license (the MIT license). This way, if your work is chosen as the winner within your track, it could be implemented in ASReview or be published as a solution for the larger objective! Of course, this also means that you will receive all the credit for your original work 🙂
The very first ASReview hackathon is focused on helping investigative journalism platform Follow the Money (FTM). FTM strives for justice and transparency in a world of scarce and unequally distributed riches, where everything is about money. By examining the flow of money, they scrutinize power and embedded societal problems while uncovering complex connections and offering solutions.
For its Shell Papers investigation, FTM has called on the general public to help search through the papers for possible ties between the multinational oil and gas company Shell and the Dutch government.
The Shell Papers are gathered through so-called Wob requests (Wet openbaarheid van bestuur, the Dutch Government Information (Public Access) Act), in which the Dutch government is requested to share information. So far, FTM has collected 2,500 documents, consisting of 17,000 pages. That’s a very long read, and it is only the first batch of documents: in total, around 150,000 documents are expected!
The Shell Papers (in Dutch) consist of emails, documents, WhatsApp messages, governmental decisions, and many more types of files, which are retrieved from several different counties in the Netherlands. In a nutshell, they contain a very large proportion of all communication between Shell and Dutch governmental institutions.
Before we can do anything with the data, there has to be some pre-processing. The raw data is quite challenging: it contains a lot of noise and little structure. Most of the data is raw email HTML, combined with a header. Documents are labeled as correspondence, reports, permits, governmental decisions, or particular documents.
To get usable information from this data, you could, for example, get started with Beautiful Soup and quite a bit of regex! Also, Wob requests allow for the redaction of some personal data, so some email addresses and personal names are scrubbed.
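As a rough starting point, the sketch below separates a header from an HTML body and strips the markup. To stay runnable without extra packages, it uses the standard library's `html.parser` instead of Beautiful Soup. The input layout (header lines of the form `Key: value`, a blank line, then HTML) is an assumption about the raw files, and all names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Assumed header layout: "Key: value" on each line
HEADER_LINE = re.compile(r"^([A-Za-z-]+):\s*(.*)$")


def split_document(raw):
    """Return (headers, text) for one raw document in the assumed layout."""
    header_block, _, body = raw.partition("\n\n")
    headers = {}
    for line in header_block.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            headers[match.group(1).lower()] = match.group(2).strip()
    return headers, html_to_text(body)


# Made-up example document
raw = (
    "From: [redacted]\n"
    "Subject: Meeting notes\n"
    "\n"
    "<html><body><p>Dear colleague,</p><p>See attached.</p></body></html>"
)
headers, text = split_document(raw)
print(headers)
print(text)
```

Note that this naive extractor also keeps the contents of `<script>` and `<style>` elements, so real emails will likely need extra filtering.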
At least the following are available:
- Message types
- Message content
- Multiple date formats (which aren’t always the same for a single entry)
A real treasure trove of information might be hidden within the mess!
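The mixed date formats are a good example of what the clean-up could involve. The sketch below tries a few candidate formats in turn and checks whether the dates within a single entry agree. The formats listed are guesses, not an inventory of what actually occurs in the Shell Papers, and the function names are hypothetical.

```python
from datetime import datetime

# Candidate formats are assumptions; extend the tuple to match the real data.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")


def parse_date(value):
    """Return a date for the first format that fits, or None if none does."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def dates_agree(values):
    """True if all parseable date strings of one entry point to the same day."""
    parsed = {parse_date(value) for value in values}
    parsed.discard(None)
    return len(parsed) <= 1


print(parse_date("05-03-2019"))
print(dates_agree(["2019-03-05", "5/3/2019"]))
print(dates_agree(["2019-03-05", "06-03-2019"]))
```

Dates written out with Dutch month names would need an additional mapping step, since `strptime` month names depend on the locale.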
Besides reading all the documents, visualization is also a valuable tool to identify the relations between the documents. From semantic clusters to word clouds (for example, as implemented in ASReview), a visualization can say more than 1000 words. As this is mainly a creative track, it is completely up to you how you visualize the data at hand.
Will your artwork be used on the FTM website?
Follow the Money currently uses crowdsourcing to analyze all the different documents of the Shell Papers. This means that people can read any and all files in any order. The goal of the third track is to provide FTM with a pipeline that allows for faster screening of the documents. Think about the following potential solution for example: Instead of randomly reading one document after another, it would be far more helpful to read documents ranked by, for example, relevance!
A possible strategy could be screening prioritization using Active Learning: a constant interaction between a human (who labels documents as relevant or irrelevant) and a machine (which provides you with the next most likely relevant document to read, based on previous decisions). The Active Learning model constantly learns from your decisions and continuously updates its predictions to find the next potentially relevant document. Once the data has been cleaned, the dataset(s) could, for example, be made available as a plug-in in ASReview LAB, as was done for the COVID-19 dataset. ASReview LAB is open and free software that allows you to screen documents using Active Learning. For a more in-depth introduction to ASReview, read this Nature article or blog post.
Let’s hack the reading process and screen smarter.
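The human-machine loop described above can be sketched with a toy example. This is not how ASReview LAB is implemented: the "model" is a crude token-weight score standing in for a real classifier, the `oracle` function stands in for the human reader, and all names and documents are made up for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lower-case a document and split it into a set of tokens."""
    return set(text.lower().split())


def train(docs, labels):
    """Toy model: +1 per token seen in a relevant doc, -1 per token in an irrelevant one."""
    weights = Counter()
    for idx, label in labels.items():
        for token in tokenize(docs[idx]):
            weights[token] += 1 if label else -1
    return weights


def score(weights, text):
    """Higher score means 'looks more like the relevant documents so far'."""
    return sum(weights[token] for token in tokenize(text))


def screen(docs, labels, oracle, budget):
    """Repeatedly retrain, present the top-ranked unlabeled document, and record the label."""
    labels = dict(labels)  # do not modify the caller's prior knowledge
    order = []
    for _ in range(budget):
        unlabeled = [i for i in range(len(docs)) if i not in labels]
        if not unlabeled:
            break
        weights = train(docs, labels)
        best = max(unlabeled, key=lambda i: score(weights, docs[i]))
        labels[best] = oracle(best)  # the human decision
        order.append(best)
    return order, labels


docs = [
    "gas extraction permit groningen",
    "office lunch menu friday",
    "gas extraction permit renewal",
    "friday drinks invitation",
    "earthquake damage gas extraction",
    "parking garage closed friday",
]
truth = {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0}  # pretend human judgments
prior = {0: 1, 1: 0}  # one relevant and one irrelevant example to start from

order, final_labels = screen(docs, prior, oracle=lambda i: truth[i], budget=2)
print(order)
```

With a budget of two decisions, the loop presents the two remaining relevant documents before any of the irrelevant ones, which is exactly the point of prioritized screening.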
During the hackathon, there will also be special guests present. Learn more about the data from Bas van Beek and Dimitri Tokmetzis, two journalists at Follow the Money. During the hackathon you will also have the opportunity to get feedback on your work from track experts.
And what’s more? On Hackathon Saturday, you can also take a break from all the coding and thinking and attend mini-lectures by research experts and renowned professors. Subjects range from concentration to open science, so sit back, relax, and enjoy the lectures. See the preliminary program for an overview of the mini-lectures.
You can find all the practical details (like the registration form and the preliminary program) on the announcement page.