
Working reproducibly with personal data

Data Horror Stories

Where does this story come from?
University Library, Research data support

Tell us your horror story, what happened? 
I was working with survey data and had downloaded the raw data from the questionnaire platform that I had used. I was using an open-source tool called OpenRefine to clean the data, and one of the actions I performed was to remove several columns. Some of them held information such as how the user had accessed the survey (not relevant for my project), but one of them was the column with contact details for those users who wanted to hear about the results of my survey. I needed those contact details, but I wanted to remove them from the dataset that I would work with, so I wouldn’t accidentally leak them.

I then performed some data cleaning to standardise the input and use a comma separator, so that the data would be easier to analyse and there would be fewer bugs when writing the analysis scripts. Those were really important actions, for which I used OpenRefine’s General Refine Expression Language (GREL), and to its credit, it worked like a charm! After exporting a cleaned version of the dataset, I went on and started the analysis.

A few days later I realised that OpenRefine tracks the previous versions of the dataset and that the column containing the contact details was still in the project, just in a previous version. The raw data file that contains the contact details was in an encrypted folder, but the OpenRefine project was not, so if my laptop were breached, the contact details would be easy to find. I wanted to get rid of this version and removed it from my workspace in OpenRefine, thinking that was a very clever thing to do.

It wasn’t.
To ensure the reproducibility of the project, I should first have exported the script of the cleaning actions that I had performed, which is an option in OpenRefine (and which makes it a good tool to work with). But I was in such a hurry to get rid of this version that I completely forgot about it. Fortunately, in my case, I wasn’t planning to publish about this very small dataset in any meaningful way, so this isn’t an enormous problem. But as a research supporter who encourages researchers to save their scripts, it was definitely a teachable moment.

Did you find a solution? 
There was no solution: my data cleaning script was lost for posterity.

Was there a lesson learned? 
One thing I could have done to avoid the horror is to design a data cleaning protocol that takes into account that OpenRefine also saves previous versions of datasets, and that these versions should be removed only AFTER exporting the cleaning script. I did learn from it, in the sense that when I did something else in OpenRefine later on, I took this approach.

Such a protocol can then be shared as part of the dataset documentation so that others can learn from it.
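
As an illustration only, the "export first, delete afterwards" step of such a protocol could be backed by a small check that is run before the OpenRefine project is removed. The sketch below is not part of the original story. It assumes that the exported cleaning script is a JSON file containing a list of operations that each have an "op" key, which is how OpenRefine's extracted operation history typically looks, and that the cleaned dataset has been exported to a separate file. The function names and file paths are hypothetical.

```python
import json
from pathlib import Path


def history_is_exported(history_path: Path) -> bool:
    """Return True if history_path holds a non-empty JSON list of operations.

    Assumption: each operation is a JSON object with an "op" key, e.g.
    {"op": "core/column-removal", "columnName": "contact"}.
    """
    if not history_path.is_file():
        return False
    try:
        operations = json.loads(history_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return (
        isinstance(operations, list)
        and len(operations) > 0
        and all(isinstance(op, dict) and "op" in op for op in operations)
    )


def safe_to_delete_project(history_path: Path, cleaned_data_path: Path) -> bool:
    """Protocol gate: the cleaning script and a non-empty cleaned export must both exist."""
    return (
        history_is_exported(history_path)
        and cleaned_data_path.is_file()
        and cleaned_data_path.stat().st_size > 0
    )
```

A possible use, with hypothetical file names: call `safe_to_delete_project(Path("docs/cleaning_history.json"), Path("data/survey_cleaned.csv"))` and only remove the project from the OpenRefine workspace if it returns `True`. The check does not confirm that the saved history is complete; it only guards against forgetting the export altogether.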




Copyright © 2025 - Vrije Universiteit Amsterdam