A workshop to seek interdisciplinary expert perspectives on ethically and visually representing the historical place of misrepresented peoples and locales.
Guiding question
Which natural language processing techniques are most appropriate for these corpora?
Considerations
Language models, named entity recognition, RNNs, dialects, semantic analysis, part-of-speech tagging, OCR requirements, analytic depth
Goal
Roadmap for training and employing an NLP model
Discussants
Claire Gardent (lead), Ludovic Moncla & David Bamman
During this session, we discussed the technical and conceptual requirements for performing natural language processing on the corpora and the implications of differing approaches. We addressed three approaches to the visualization of text: 1) document-based visualization, which provides a network-based bird's-eye view of multiple documents to identify key topics and distinguish between relevant and irrelevant assets; 2) location-based visualization, which allows for the geographic representation of events and generates a map or cartographic visualization from the documents in the corpora; and 3) event-based visualization, which yields a network view of the various entities within a collection of texts.
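The document-based approach above can be sketched in miniature: treat each document as a bag of words and link documents whose similarity exceeds a threshold, yielding the network-based bird's-eye view described. This is an illustrative sketch only; the document names, sample texts, and the 0.3 threshold are hypothetical stand-ins, not values from the project corpora.

```python
# Minimal sketch of document-based visualization: build a similarity
# network over a toy corpus so related documents cluster together.
# The documents and threshold below are hypothetical examples.
import math
from collections import Counter
from itertools import combinations

docs = {
    "doc_a": "the expedition mapped the river valley settlements",
    "doc_b": "river settlements along the valley were mapped",
    "doc_c": "trade records list grain prices by season",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = {name: Counter(text.split()) for name, text in docs.items()}

# Edges above the similarity threshold form the document network;
# a tool such as a graph library could then lay it out visually.
edges = [
    (x, y, round(cosine(vectors[x], vectors[y]), 2))
    for x, y in combinations(vectors, 2)
    if cosine(vectors[x], vectors[y]) > 0.3
]
print(edges)
```

In a real pipeline, TF-IDF weighting and lemmatization would replace the raw token counts, but the network structure produced is the same.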
We discussed the importance of fidelity in annotating texts, regardless of the method employed, as errors can propagate from stage to stage as the project progresses. With respect to the naming of entities, disambiguation is frequently the largest hurdle, compounded by the need for historical gazetteers covering place names whose modern forms differ from their historical antecedents. Further, machine learning models are based on probability and frequency, so the most egregious errors tend to occur with textual elements that diverge from their most frequent use. Oftentimes in humanistic study, these divergences are the most telling and important.
We decided that we would employ two of the three methods discussed during the session, focusing our efforts on: