A workshop to seek interdisciplinary expert perspectives on ethically and visually representing the historical place of misrepresented peoples and locales.
Guiding question
Which natural language processing techniques are most appropriate for these corpora?
Considerations
Language models, named entity recognition, RNNs, dialects, semantic analysis, part-of-speech tagging, OCR requirements, analytic depth
Goal
Roadmap for training and employing an NLP model
Discussants
Claire Gardent (lead), Ludovic Moncla & David Bamman
During this session, we discussed the technical and conceptual requirements for performing natural language processing on the corpora and the implications of differing approaches. We addressed three approaches to the visualization of text: 1) document-based visualization, which provides a network-based bird's-eye view of multiple documents to identify key topics and distinguish between relevant and irrelevant assets; 2) location-based visualization, which allows for the geographic representation of events and generates a map or cartographic visualization from the documents in the corpora; and 3) event-based visualization, which yields a network view of the various entities within a collection of texts.
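The document-based approach above can be sketched in miniature: treat each document as a bag of words and link documents whose similarity exceeds a threshold, yielding the network-based bird's-eye view described. This is an illustrative sketch only; the document names, sample texts, and the 0.3 threshold are hypothetical stand-ins, not values from the project corpora.

```python
# Minimal sketch of document-based visualization: build a similarity
# network over a toy corpus so related documents cluster together.
# The documents and threshold below are hypothetical examples.
import math
from collections import Counter
from itertools import combinations

docs = {
    "doc_a": "the expedition mapped the river valley settlements",
    "doc_b": "river settlements along the valley were mapped",
    "doc_c": "trade records list grain prices by season",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = {name: Counter(text.split()) for name, text in docs.items()}

# Edges above the similarity threshold form the document network;
# a tool such as a graph library could then lay it out visually.
edges = [
    (x, y, round(cosine(vectors[x], vectors[y]), 2))
    for x, y in combinations(vectors, 2)
    if cosine(vectors[x], vectors[y]) > 0.3
]
print(edges)
```

In a real pipeline, TF-IDF weighting and lemmatization would replace the raw token counts, but the network structure produced is the same.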
We discussed the importance of fidelity in annotating texts, regardless of the method employed, as errors can propagate from stage to stage as the project progresses. With respect to the naming of entities, disambiguation is frequently the largest hurdle, compounded by the need for historical gazetteers covering place names whose modern forms differ from their historical antecedents. Further, machine learning models are based on probability and frequency, so the most egregious errors tend to occur with textual elements that diverge from their most frequent use. Oftentimes in humanistic study, these divergences are the most telling and important.
We decided that we would employ two of the three methods discussed during the session, focusing our efforts on: