Ethical Visualization in the Age of Big Data

A Planning Workshop Summary

A workshop to seek interdisciplinary expert perspectives on ethically and visually representing the historical place of misrepresented peoples and locales.

Contents

Session 4: Domain adaptation (for early modern French)

Graphic recording of Session 4

Scope and purpose

Documentation

Discussion summary

Building on the previous session, this session addressed the applicability and accuracy of natural language processing across language domains. Most methods for natural language processing, including named entity recognition, are trained on the 1998 Wall Street Journal corpus and consequently yield highly accurate results for modern, journalistic American English (rates as high as 100% for tokenization and 98% for part-of-speech tagging). However, once those trained models are applied to other language domains (historical English, foreign languages), accuracy declines precipitously to 40% to 60%, well below the threshold for drawing reliable analytic conclusions. It is therefore essential to create trained models within the language domain being studied, ideally using texts proximate to the materials that will be algorithmically analyzed. Building such datasets requires a great deal of discipline-specific work and many human hours. We also discussed different methods for annotating such a dataset, as well as the various NLP methods that could be used.
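The accuracy comparison above can be made concrete with a token-level evaluation. The sketch below is purely illustrative: the tag sequences are invented, and the function simply measures how often a model's predicted tags match hand-annotated gold tags, which is how the drop from ~98% in-domain to 40–60% out-of-domain would be observed in practice.

```python
# Token-level accuracy for a sequence-tagging model, used to compare
# in-domain vs. out-of-domain performance. All data here is hypothetical.

def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("sequences must be aligned token-for-token")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Gold annotations for a short historical phrase (hypothetical BIO tags).
gold = ["O", "O", "B-PLACE", "I-PLACE", "O", "B-PERSON"]
# Output of a model trained on modern journalistic English (hypothetical).
predicted = ["O", "O", "O", "O", "O", "B-PERSON"]

print(tagging_accuracy(gold, predicted))  # 4 of 6 tokens correct
```

A model trained in-domain would be expected to recover the place-name span that the out-of-domain model misses here.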

Decisions

In our natural language processing workflow, we resolved to:

  1. restrict our algorithm to a maximum of five categories: people, physical places, metaphoric places, organizations, and events. Tracking more than locations alone helps avoid replicating the militaristic ulterior motives inherent in the corpora themselves.
  2. use the BiLSTM neural network architecture as our machine learning method.
  3. use BERT contextual embeddings, a pretrained language representation model released by Google.
  4. employ the BRAT rapid annotation tool for creating the annotated data necessary for NLP training.
  5. use three datasets in the project:
    • annotated development data,
    • test data closely resembling the corpus, used to evaluate the accuracy of the machine learning algorithm, and
    • the corpus itself, as the data that will ultimately be processed.
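The BRAT tool named in decision 4 stores each entity annotation as a tab-separated line in a standoff (.ann) file alongside the source text. The sketch below is a minimal, assumption-laden parser for the simplest case (contiguous, text-bound annotations only; BRAT also supports discontinuous spans, relations, and events, which are skipped here). The example entities and category names are invented to mirror the five categories chosen above.

```python
# Minimal parser for BRAT standoff entity annotations (.ann files).
# Each text-bound annotation occupies one line of the form:
#   ID <TAB> TYPE START END <TAB> SURFACE-TEXT
# Only contiguous spans are handled; discontinuous spans, relations,
# and events are ignored for simplicity.

def parse_ann(ann_text):
    """Return a list of (id, type, start, end, surface) entity tuples."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):  # skip non-entity annotation lines
            continue
        ann_id, span, surface = line.split("\t")
        etype, start, end = span.split(" ")
        entities.append((ann_id, etype, int(start), int(end), surface))
    return entities

# Hypothetical annotations over a two-entity snippet.
sample = "T1\tPERSON 0 8\tLa Salle\nT2\tPLACE 25 33\tMontréal\n"
for entity in parse_ann(sample):
    print(entity)
```

Parsed tuples like these are what the annotated development data would be converted into before training the BiLSTM tagger.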


