Building an Ark in the Flood of Data: Notes on TVOF’s Data Management Approach
As our colleague Maria Teresa Rachetta is preparing the publication of our interpretive edition of the Genesis section of the Histoire Ancienne, the newest member of the TVOF team, Natasha Romanova, reflects on how we should plan for the future of our research data.
Li arche aloit flotant sa et la et voltant sor les ondes. Les aigues crurent et engrandirent durement sor terre, et habunderent tant qu'eles totes les plus hautes montaignes qui sous le ciel estoient covrirent (HA1 §24.2).
The ark journeyed, sailing hither and thither, rocking over the waves. The water grew and vastly accumulated on earth, and gathered to such an extent that it covered all the very highest mountains under the sky.
Issues of data management, storage and discoverability are central to any large project, such as The Values of French Language and Literature in the European Middle Ages (TVOF), in which an interdisciplinary team of researchers and research software engineers provides a valuable scholarly and educational resource for an international community of users.
Sometimes it may feel that finding one’s way in the rapidly expanding ocean of research data is akin to navigating to safety in the times of the Flood: an enterprise requiring a concerted team effort, careful forward planning, and an awareness of a changing research climate and technological landscape. In short, what's needed is a spark of divine inspiration!
I became interested in humanities research data management when working at King’s Digital Lab and contributing to KDL’s Archiving and Sustainability effort. Joining TVOF in the final stages of the project, I was excited to see the extent to which the team have thought through the day-to-day management of the workflows and data, as well as the impact digitisation and the digital have on the preservation of and interaction with medieval artefacts and texts (e.g. blogposts by Simone Ventura and Alice Hazard). Working closely with KDL and KCL Library, the TVOF team have also planned beyond the project end, so that the data remain accessible to scholars and the wider community and their work can be preserved for years to come.
The threat of information loss tends to be associated with material media. In 2018, the fire at the National Museum of Brazil that destroyed the Museum library was a tragic reminder of just how fragile and ephemeral such material objects can be. Besides the physical threat of destruction, artefacts can become inaccessible due to lack of curation or inadequate storage: over the centuries, manuscripts or books can be lost, damaged or destroyed, but they can also be made unreachable through inappropriate cataloguing or lack of information. Thus they “disappear into the ever-expanding heap of cultural remains” (Burdick et al., p. 48).
The advent of digitisation and digitally-enabled scholarship, combined with the arrival of the Internet at the end of the twentieth century, appeared to offer a perfect solution: seemingly unlimited storage supplemented by immediate discoverability. Until recently, however, the peculiar kind of “materiality” of the digital tended to be overlooked, as were the ever-multiplying threats it poses to the preservation and dissemination of data: growing online security risks, the eventual obsolescence of software and, as a consequence, the labour and costs involved in bringing resources up to date, not to mention maintenance and storage costs. Moreover, as in the case of “analogue” data, there is a danger that content or research will be lost or overlooked in the limitless ocean of information.
In the last decade, the sustainability of Digital Humanities projects has become an important topic of discussion even outside the DH community; funding applications now have to include a data management plan and provision for future maintenance of datasets. When it was founded in 2015, King’s Digital Lab inherited just short of 100 research projects and websites, most of which were no longer funded. In consultation with project partners, KDL had to find a sustainable archiving solution for each project, and now future planning and Service Level Agreements feature prominently in funding applications. (Read Minmin Yu’s recent blogpost on the “Legacy Data” project.)
A website with a range of functionalities is a typical deliverable of a DH project, but at the heart of any project is a dataset, collected and organised by the team with a research question in mind. The website’s functionalities determine how a user engages with this data (“user-journeys”), but data itself has value reaching beyond the goals of a particular project: it can be explored by researchers in a different discipline, studied using new methods or it can find its way into the classroom. For this to happen, data should be adequately described in order for users to be able to find and cite it in their work, thus acknowledging the team’s contribution.
Since 2016, research data specialists and librarians have been implementing a set of guiding principles for improved practice that advocate the Findability, Accessibility, Interoperability and Reusability of research data and metadata (“FAIR principles”). The spirit of FAIR underlies TVOF's approach to data: the project, for instance, has followed the Text Encoding Initiative (TEI) guidelines, a standard for text encoding that uses XML markup to support interoperability and interchange for digital editions, and so potentially to enable future reuse of the datasets.
From TVOF’s inception, the team has had to plan for the best ways to collaborate, produce and preserve transcriptions, encoding, documentation and software without fear of loss or duplication of work. As a consequence, the team adopted an approach involving a shared Dropbox folder and a staging website that is updated every two hours. Furthermore, following discussions between TVOF’s Hannah Morcos, Geoffroy Noël, Senior Research Software Engineer at KDL, and Dan Crane, Research Data Manager at King’s College Library, a plan for the future was put in place. The TVOF project website will remain available until 2030, hosted and maintained by King’s Digital Lab.
In addition, the TVOF team are exposing project data on the Figshare platform, recommended as an ideal repository by the Research Data Management team at KCL. Figshare, an open access repository for researchers in all disciplines to deposit and share their research outputs, including “non-traditional” ones such as research datasets, was first launched in 2011 by Mark Hahnel (watch a short video about Figshare here).
Figshare’s motto “Store, Share, Discover, Research” echoes the FAIR principles discussed above. TVOF data is joining a growing number of Humanities datasets, and is divided across four types of records: Edition, Alignment, Lemmatisation, and Software.
The datasets have Digital Object Identifiers (DOIs) that are used to identify the project in the Research Excellence Framework (REF) and for other citation purposes (on the importance of REF in the British higher education context and for a checklist of Digital Outputs Assessment, read this blogpost by Arianna Ciula from KDL).
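For readers who want to cite or link to a dataset programmatically, a DOI can be turned into a resolvable link by prefixing it with the public resolver address https://doi.org/. The minimal Python sketch below illustrates this with the DOI of the TVOF collection given in the references; the function name `doi_to_url` and the list of accepted prefixes are assumptions made for illustration, not part of any TVOF or Figshare tool.

```python
from urllib.parse import quote

# Public DOI resolver; prefixing a DOI with it gives a persistent link.
DOI_RESOLVER = "https://doi.org/"

# Forms in which a DOI is commonly written (illustrative, not exhaustive).
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a bare DOI, a 'doi:' string or a DOI URL."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI begins with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the '/' separator.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # DOI of the TVOF collection on Figshare, as listed in the references.
    print(doi_to_url("doi:10.6084/m9.figshare.c.4873335.v1"))
```

Because the link goes through the resolver rather than pointing at a particular web page, a citation built this way should keep working even if the repository reorganises its site.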
The software section of TVOF’s Figshare page contains links to the GitHub repository of the open source software produced for the project by the team at KDL. This software converts the TEI input produced by the research team into various derivative forms, such as tokenised XML, HTML output and concordances. For the foreseeable future, it will be possible to download the most recent files at the source of the website in their original XML format from the three remaining sections (Edition, Alignment, Lemmatisation) for reuse and study.
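To give a sense of what such a conversion involves, here is a deliberately simplified Python sketch. It is not the KDL software: the sample paragraph, the function names `tokenise` and `concordance`, and the naive word-splitting rule are all assumptions made for illustration. It wraps each word of a TEI paragraph in a `<w>` element (the TEI element for a word) and then derives a small keyword-in-context concordance from the result.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

# A tiny stand-in for a TEI transcription (opening words of HA1 §24.2).
SAMPLE = f'<p xmlns="{TEI_NS}">Li arche aloit flotant sa et la</p>'


def tokenise(paragraph: ET.Element) -> ET.Element:
    """Return a new <p> in which every word is wrapped in a numbered <w>."""
    out = ET.Element(f"{{{TEI_NS}}}p")
    # Naive rule: a token is any run of word characters.
    words = re.findall(r"\w+", paragraph.text or "")
    for n, word in enumerate(words, start=1):
        w = ET.SubElement(out, f"{{{TEI_NS}}}w", {"n": str(n)})
        w.text = word
    return out


def concordance(tokens: ET.Element, keyword: str, width: int = 2):
    """List (left context, keyword, right context) for each match."""
    words = [w.text for w in tokens]
    hits = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits


tokens = tokenise(ET.fromstring(SAMPLE))

if __name__ == "__main__":
    print(ET.tostring(tokens, encoding="unicode"))
    for left, word, right in concordance(tokens, "arche"):
        print(f"{left} [{word}] {right}")
```

A real edition needs far more than this (abbreviations, elisions such as qu'eles, editorial markup nested inside the paragraph), which is why keeping the original XML available alongside the derived forms matters for reuse.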
Et sachez que adonc esgarda Noé fors de l'arche et vit que les hautes terres estoient descovertes et que l'erbe aparoit et la verdure. Liés en fu mout en son corage, de ce ne doutés mie (HA1 §26.1).
And know that at this point Noah looked out of the Ark and saw that the mountains were uncovered and that grass and greenery were coming out. He felt joy in his heart, have no doubt about it.
As we put the finishing touches to the edition, the TVOF team are starting to look forward to hearing about the new uses researchers will find for the project's datasets and the exciting research and teaching to which they can contribute.
King's College London
Anne Burdick, Johanna Drucker, Peter Lunenfeld et al., Digital_Humanities, Cambridge, Massachusetts and London: The MIT Press, 2012
Arianna Ciula, ‘What Makes Good Honey? KDL Checklist for Digital Outputs Assessment in the REF’ (blogpost) [Available at: https://www.kdl.kcl.ac.uk/blog/checklist-digitaloutputs-ref. Last accessed 17 June 2020]
Alice Hazard, ‘Medievalist Objects: Parchment and the Computer Screen’ (blogpost) [Available at: https://tvof.ac.uk/blog/medievalist-objects-parchment-and-computer-screen. Last accessed 30 June 2020]
King’s Digital Lab, ‘Archiving and Sustainability: KDL’s pragmatic approach to managing 100 Digital Humanities projects, and more…’ (webpage) [Available at: https://www.kdl.kcl.ac.uk/our-work/archiving-sustainability. Last accessed 17 June 2020]
King’s Digital Lab, ‘Frequently Asked Questions: What project partners might want to know about KDL’ [Available at: https://www.kdl.kcl.ac.uk/how-we-work/faq-partners. Last accessed 3 July 2020]
Simone Ventura, ‘Born digital editions and much more at a mid-summer conference in Verona’ (blogpost) [Available at: https://tvof.ac.uk/blog/born-digital-editions-and-much-more-mid-summer-conference-verona. Last accessed 30 June 2020]
Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg et al., ‘The FAIR guiding principles for scientific data management and stewardship’ Scientific Data 3 (2016) [Available at: https://www.nature.com/articles/sdata201618. Last accessed 17 June 2020]
Minmin Yu, ‘Reflections on an Internship at KDL: Legacy Data Project’ (blogpost) [Available at: https://www.kdl.kcl.ac.uk/blog/reflections-internship-kdl/. Last accessed 17 June 2020]
‘The Values of French Language and Literature in the European Middle Ages’ on Figshare (Available at: https://doi.org/10.6084/m9.figshare.c.4873335.v1. Last accessed 30 June 2020).
Contains project data of ‘The Values of French’ project
DARIAH-CAMPUS, ‘Winter School: Shaping New Approaches to Data Management in Art and Humanities, 10-13 December 2019, Lisbon, Portugal’ (Available at: https://campus.dariah.eu/resource/ws2019. Last accessed 17 June 2020).
Contains videos of presentations on a range of topics in Data Management in the Humanities
DARIAH-CAMPUS, ‘MaDiH: Research Software Engineering Training’ (Available at: https://campus.dariah.eu/resource/rse2019. Last accessed 17 June 2020).
Contains slides on a diverse range of topics in Research Software Engineering in the Humanities (including Data Management and Archiving and Sustainability)