A workshop seeking interdisciplinary expert perspectives on ethically and visually representing the historical place of misrepresented peoples and locales.
This workshop, conducted through consensus decision-making, is the first stage of Visualizing Empire, a digital humanities project that ethically visualizes the French cultural imagination. It unites leading experts from relevant fields to address the conceptual and logistical challenges of visualizing French colonial historical texts without reproducing their inherent ethnocentrism. To this end, the workshop will address two key issues:
This workshop lays the foundations for:
This workshop was made possible by the generosity of:
In addition to the participants and supporting institutions, this workshop was successful thanks to the contributions of many people, including:
Since the beginning of this project, we have intended to share the dataset harvested from the Journal des Voyages in its present state, before further work is completed to make it a ready-to-use corpus. However, several ethical concerns about doing so were raised at the workshop, due to the problematic nature of much of the content (including racist and sexist slurs, dehumanizing descriptions, etc.). We aim to balance our dual desires to do minimal harm and to be as transparent as reasonably possible. We therefore welcome enquiries about using the dataset. However, we request that, prior to your request, you read and sign the Memorandum of Understanding and attach a scanned copy to an email to us at .
From a technical standpoint, we have cleaned the OCR output for one of these corpora, reaching a high degree of accuracy, well above the standard found in most digital collections, and we have marked up the text. The dataset is large enough to enable the training of machine learning algorithms, such as named-entity recognition (NER), essential to the analysis of large textual corpora, yet remains of a manageable size with high accuracy. As a continuous time series, the corpora also allow for the exploration of how preoccupations changed over the eighteenth and nineteenth centuries. By performing “distant reading” on these “medium data” (too large for traditional reading, but small enough to be managed in a targeted and effective manner), we can unearth the epistemological narratives targeted at the reading public. With the goal of using machine learning to broaden critical engagement with the past, this project will provide guidelines and two key training datasets, the ARTFL Encyclopédie and the Journal des Voyages, for modeling and analyzing other eighteenth- and nineteenth-century French sources.
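To make the time-series point concrete, here is a minimal sketch of one such distant reading: tracking a single term's relative frequency by year. The file layout (one plain-text file per issue, named by year) and the query term are assumptions for illustration only, not the actual structure of our dataset.

```python
import re
from collections import Counter, defaultdict
from pathlib import Path

# Hypothetical layout: one plain-text file per issue, named like "1885_012.txt".
CORPUS_DIR = Path("journal_des_voyages/txt")
TERM = "colonie"  # example query term

counts = defaultdict(Counter)  # year -> token totals and term hits
for path in CORPUS_DIR.glob("*.txt"):
    year = path.name[:4]
    tokens = re.findall(r"\w+", path.read_text(encoding="utf-8").lower())
    counts[year]["_total"] += len(tokens)
    counts[year][TERM] += tokens.count(TERM)

# Relative frequency per year, suitable for plotting as a time series.
for year in sorted(counts):
    total = counts[year]["_total"]
    if total:
        print(f"{year}\t{counts[year][TERM] / total:.6f}")
```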
If you are interested in using the Journal des Voyages dataset, please see the Memorandum of Understanding here.
The following table outlines the cleaning processes that have been performed to arrive at the latest version of our Journal des Voyages dataset.
Version | Date | Note |
---|---|---|
Version 0.3 | 2017 | Cleaned data from FlatWorld Solutions was post-processed using a workflow built around the OpeNER toolkit (https://www.opener-project.eu/), generating KWIC files (a generic sketch of KWIC output appears below this table) as well as geolocated names. The process and code for the data processing are outlined at https://github.com/cmchurch/empire-viz-alpha. Initial visualizations were generated at this stage. |
Version 0.2 | 2016 | Data cleaning performed by FlatWorld Solutions. Data was entered from the original pages through keyed entry and cross-checked against the OCR’d data gleaned from the BnF. The data went through six rounds of corrections. FlatWorld Solutions promised 95% accuracy, but it became clear much later that this target was not achieved. |
Version 0.1 | 2013 | Data scraped from BnF holdings using pjscrape and PhantomJS. Data initially cleaned using regular expressions (https://github.com/cmchurch/OCR-Postprocessing) and predictive language correction (https://github.com/cmchurch/ngram-spellchecker-FRE) based on a French-language dictionary; a sketch of this rule-based approach appears below the table. Later versions replaced predictive language correction with keyed entry. |
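For readers curious about the version 0.1 cleaning pass, the sketch below shows the general shape of rule-based OCR post-processing with a dictionary check. The substitution rules and lexicon here are hypothetical illustrations (typical long-s ‘ſ’ misreadings in period French typography); the project's actual rules live in the repositories linked above.

```python
import re

# Hypothetical rules for OCR errors common in period French typography,
# where the long s ('ſ') is misread as 'f'. The project's real rules are
# in https://github.com/cmchurch/OCR-Postprocessing.
OCR_RULES = [
    (re.compile(r"\beft\b"), "est"),
    (re.compile(r"\bfouvent\b"), "souvent"),
    (re.compile(r"\bprefque\b"), "presque"),
]

def postprocess(text, lexicon):
    """Apply substitution rules, then report tokens missing from a lexicon."""
    for pattern, replacement in OCR_RULES:
        text = pattern.sub(replacement, text)
    unknown = [t for t in re.findall(r"\w+", text.lower()) if t not in lexicon]
    # In the actual workflow an n-gram spellchecker proposed corrections for
    # unknown tokens; here we only flag them for manual review.
    print(f"{len(unknown)} tokens not found in lexicon")
    return text
```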
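The KWIC (keyword-in-context) files in version 0.3 were produced with the OpeNER workflow. As a generic illustration of the format, not of our actual code, a concordance generator simply prints each match of a keyword with a fixed window of surrounding text:

```python
import re

def kwic(text, keyword, window=40):
    """Return one keyword-in-context line per match of `keyword` in `text`."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        lines.append(f"{left:>{window}} [{m.group()}] {right}")
    return lines

# Example with an invented sentence:
for line in kwic("Les voyageurs quittèrent Alger pour traverser le désert.", "Alger"):
    print(line)
```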
All files are available by downloading the GitHub repository, including audio files, briefing materials, presentation files, graphic recordings, and session notes. To browse, listen to, or view individual files, see the session pages.
This workshop, and the intellectual activity leading up to it, have resulted in several related projects. The materials for each of these activities are made publicly available via GitHub repositories under open source licenses and, in some cases, as GitHub Pages websites. The main long-term impact of these activities is preparation for the next stage of the project: creating a visualization to explore the Journal des Voyages corpus.
We have also done extensive work on a prototype ethical visualization workflow, tested with present-day materials from the humanities and the sciences. The workshop will allow adaptation of this workflow for the ethical visualization of historical textual sources and non-English text mining. For the latest development of this work, see: https://kathep.github.io/ethics/
We teach the method described above, as well as how to navigate the concerns explored in this project, in the course ‘Ethical Data Visualization: Taming Treacherous Data’. This course was offered in 2018 and 2019 at the Digital Humanities Summer Institute (DHSI) at the University of Victoria, Vancouver Island, Canada, and in 2019 at DHDownunder at the University of Newcastle, New South Wales, Australia. See the DHSI course content and DHDownunder course content in these repositories.
We wrote a paper on work preparatory to this workshop; it is available here: Katherine Hepworth and Christopher Church. 2018. “Racism in the Machine: Visualization Ethics in Digital Humanities Projects.” Digital Humanities Quarterly 12:4.
This paper incorporates some considerations explored at the workshop and applies them to an expanded view of ethical visualization beyond the digital humanities: Hepworth, K. 2020 (forthcoming). “Make Me Care: Ethical Visualization for Impact in the Sciences and Data Sciences.” HCII Conference 2020 Proceedings.
The additional long-term impacts of these activities are that, to date:
All materials from the planning workshop have been collected into this online repository, accessible via GitHub and deposited at Zenodo with a digital object identifier (DOI). These materials include recordings of all discussions and copies of all slide decks from the planning workshop itself, as well as a summary of the key findings from the grant period.
They also include the dataset of the Journal des Voyages corpus, produced through a combination of OCR scanning and keyed re-entry. Alongside the Encyclopédie dataset hosted by Stanford University, these data will ultimately be used to create a training model for natural language processing and named-entity recognition in non-modern French.
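As a baseline illustration of what named-entity recognition over such a corpus involves, a modern pretrained French pipeline can be run as below. The choice of spaCy and its fr_core_news_sm model is an assumption for the example, not the project's tooling; modern models degrade on non-modern orthography, which is precisely why period-specific training data is needed.

```python
import spacy

# Assumes spaCy and its pretrained modern-French model are installed:
#   pip install spacy && python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

# Invented sentence in the style of the Journal des Voyages.
doc = nlp("M. Dupont quitta Marseille pour le Sénégal en 1885.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Marseille LOC", "Sénégal LOC"
```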
See the full white paper here.
The award also contributed to the drafting of a peer-reviewed paper, to be submitted to Digital Scholarship in the Humanities, outlining a method for computationally processing non-modern, non-English languages for humanities research in a critically engaged way.
This dataset will ultimately be refined and cleaned further, using ethical visualization principles, to create a complete corpus. This final corpus will be made publicly available.
A digital humanities project that seeks to ethically visualize the French historical imagination using cartographic and network visualizations of events contained within the Journal des Voyages corpus.
You can follow the development of this work by subscribing to our newsletter.
We welcome communication, contributions, and thoughts on this work, particularly from people in the French diaspora. If you’d like to contribute to the development of this work, please join the vizempire Google Group. We welcome referrals to other projects, literature, and methods that may be relevant, as well as suggestions for improvement or other modes of implementation. Contact us at the shared project email address.