
Methodology
Put simply, the digital humanities are humanities disciplines that make use of information technologies. However, computational methods and tools are not only applied to existing questions and discourses; new paths are also taken that only become possible through this orientation toward information technology.
M. Thaller, "Digital Humanities als Wissenschaft", in Digital Humanities: Eine Einführung, F. Jannidis, H. Kohle, and M. Rehbein, Eds. Stuttgart: J.B. Metzler, 2017, pp. 13–18. doi: 10.1007/978-3-476-05446-3_2
The methods relevant to our project are listed and explained in more detail below:
We used OCR (Optical Character Recognition) technology to recognize and convert text from electronic versions of the books. We did not always have to use it, as some sagas were already available in full text. Because of the age of the books, typefaces that were hard to recognize, and other issues, we mainly relied on two tools: the open-source software OCR4all and the one-time-purchase software PDFReaderPro. Some problems arose during the preparation and use of the software.
When correcting recognized texts, five types of typical recognition errors were identified.
- Errors occurred most frequently with individual letters: for example, the long s "ſ" was recognized as "s", "z" as "ʒ", "k" as "t", "K" as "S", and "I" as "J".
- Furthermore, recognition errors occurred with individual punctuation marks. German opening quotation marks („) were often recognized as a double comma, and the hyphen "-" was sometimes recognized as "=".
- A third type of error concerned the recognition of individual words: some long and unusual place names were incorrectly split into two or more words.
- Recognition errors on blank pages and small decorative patterns also caused problems. Blank sheets and small ornaments in front of the page number were variously identified as "ttttttt", "000", "088", or similar strings.
- The last type of error affected the text structure: sections were mistakenly split into two or more parts.
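Several of these systematic substitutions lend themselves to automated post-correction. The following is a minimal sketch of such a pass; the replacement table and the function name are illustrative, not the project's actual code:

```python
import re

# Illustrative substitution rules for the systematic OCR errors
# described above (real rules would be tuned per book and typeface).
SUBSTITUTIONS = [
    (",,", "„"),   # double comma back to German opening quotes
    ("=", "-"),    # misrecognized hyphen
]

# Lines consisting only of OCR noise such as "ttttttt", "000" or "088".
NOISE_LINE = re.compile(r"^\s*(t+|\d{3})\s*$")

def postcorrect(text: str) -> str:
    """Apply substitution rules and drop noise-only lines."""
    lines = []
    for line in text.splitlines():
        if NOISE_LINE.match(line):
            continue  # discard residue of blank pages / ornaments
        for wrong, right in SUBSTITUTIONS:
            line = line.replace(wrong, right)
        lines.append(line)
    return "\n".join(lines)
```

Errors of the remaining types (split place names, broken section structure) are harder to catch mechanically and were handled during manual correction.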
We used Duden-Mentor to detect and correct misspelled words in the text. The software draws on a Duden corpus, an extensive collection of electronic texts covering the vocabulary of a wide range of subject areas.
However, the software sometimes flags errors that are not actually errors, for example concerning the capitalization of words or the recognition of certain letters. Therefore, the words flagged by Duden-Mentor were loaded into the PyCharm IDE, and a Python script was written to check their spelling. Whenever a word remained undecided after both the Duden check and the Python script, it was corrected manually.
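The core of such a second-stage check can be sketched as follows; this is an assumption about the script's shape (names and the reference word list are illustrative), not the project's exact code:

```python
# Minimal sketch of the second-stage spelling check: split the words
# flagged by Duden-Mentor into confirmed errors and likely false
# positives, using a reference word list.
def check_words(flagged, known_words):
    known = {w.lower() for w in known_words}
    errors, false_positives = [], []
    for word in flagged:
        if word.lower() in known:
            false_positives.append(word)  # needs no correction
        else:
            errors.append(word)           # forward to manual review
    return errors, false_positives
```

Words that end up in neither list with certainty correspond to the manual-correction case described above.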
The term “parser” is commonly used in the field of computer science and programming. A parser is a program or tool that is used to analyze inputs and convert them into data structures. Parsers play an important role in computer science because they help programs to understand and process various input data.
In this project, a self-implemented parser structured the saga texts after they had been prepared for output. The parser included functionalities such as removing blank lines and extracting specific information for compiling the data sets. It consisted of two parts: the first used the saga titles to detect where a new saga begins; the second used regular expressions to remove empty and unimportant lines on the one hand, and to detect and store page breaks on the other. The goal was that the resulting files could be used directly for building the XML structure and the database in the next step.
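The two-part structure described above can be sketched as follows. The title and page-break patterns are assumptions for illustration; the books' actual conventions may differ:

```python
import re

# Part 1: detect saga titles (assumed here: "12. Die weisse Frau").
TITLE = re.compile(r"^\d+\.\s+.+$")
# Part 2: detect page breaks (assumed marker: "[Seite 5]").
PAGE_BREAK = re.compile(r"^\[Seite\s+(\d+)\]$")

def parse_sagas(raw: str):
    """Split raw text into sagas, dropping blank lines and
    recording page breaks per saga."""
    sagas = []
    current = None
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue                         # blank-line cleaning
        if TITLE.match(line):                # a new saga begins
            current = {"title": line, "text": [], "pages": []}
            sagas.append(current)
        elif (m := PAGE_BREAK.match(line)) and current:
            current["pages"].append(int(m.group(1)))
        elif current:
            current["text"].append(line)
    return sagas
```

Each saga then carries its title, cleaned text, and page numbers, ready for the XML and database steps.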
After parsing, places had to be assigned to the sagas. In some cases, we could use the gazetteers included in the books. For the remaining sagas, we wrote a Python script that uses the Named Entity Recognition (NER) library flair to extract all location mentions within the sagas. From the places identified per saga, the most frequent mention was selected and set as the saga's location. Subsequently, coordinates were assigned to the places automatically with the help of German and French gazetteers.
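A condensed sketch of this step, assuming flair's pretrained German NER model; the function names are illustrative, and the project's actual script may have differed:

```python
from collections import Counter

def most_frequent_place(mentions):
    """Pick the most frequently mentioned place as the saga's location."""
    return Counter(mentions).most_common(1)[0][0] if mentions else None

def extract_locations(text):
    """Return all location mentions found by flair's German NER model.
    (Requires flair; the model is downloaded on first use.)"""
    from flair.data import Sentence
    from flair.models import SequenceTagger
    tagger = SequenceTagger.load("de-ner")   # pretrained German NER
    sentence = Sentence(text)
    tagger.predict(sentence)
    return [span.text for span in sentence.get_spans("ner")
            if span.tag == "LOC"]
```

The selected place name is then looked up in the gazetteers to obtain coordinates.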
The Text Encoding Initiative (TEI) is a not-for-profit membership organization of academic institutions, research projects, and scholars from around the world that develops and maintains a set of guidelines for the digital encoding of text. The guidelines define a machine-readable scheme for encoding texts in the humanities, social sciences, and linguistics.
In this project, TEI encoding was used to produce structured, interoperable, digitized versions of the individual sagas. The standard provides a set of elements and attributes for describing the structure and linguistic characteristics of a text. As part of the project, the basic TEI documents were generated automatically with a Python script using the ElementTree library, after which the encoder manually encoded the individual sections.
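The automatic generation step can be sketched with Python's standard xml.etree.ElementTree module. The choice of elements below is a minimal illustration; the project's actual encoding followed the TEI guidelines in more detail:

```python
import xml.etree.ElementTree as ET

# Minimal sketch of generating a TEI skeleton for one saga.
def tei_skeleton(title: str, paragraphs: list) -> str:
    tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    text = ET.SubElement(tei, "text")
    body = ET.SubElement(text, "body")
    div = ET.SubElement(body, "div", type="saga")
    for para in paragraphs:
        ET.SubElement(div, "p").text = para
    return ET.tostring(tei, encoding="unicode")
```

The generated skeletons were then refined manually by the encoder.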
The website had to integrate and extend several systems. To facilitate data exchange and sharing, and to make the works easier for the public to read on the website, the project integrated a database system into the website, providing a variety of data interfaces and formats. MySQL is an open-source relational database management system that is widely used for the data storage and management of websites and applications.
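As an illustration of the kind of schema involved, the following sketch uses Python's built-in sqlite3 module as a stand-in for MySQL; the SQL is essentially the same, but the table and column names are assumptions, not the project's actual schema:

```python
import sqlite3

# Hypothetical saga table (illustrative names; the project used MySQL).
SCHEMA = """
CREATE TABLE saga (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    place     TEXT,
    latitude  REAL,
    longitude REAL,
    body      TEXT
);
"""

def create_db(path=":memory:"):
    """Create the database and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The website then reads from such tables to render saga texts and map coordinates.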
This project was created and managed with the content management system WordPress.
In the initial phase of the project, it was suggested to use Typo3 (also a CMS) to display information about the sagas. This idea originated from a module attended in the previous semester and from the promise that Typo3 is widely used and simplifies collaborative work. The latter was important to us because half of the project team lacked web development experience, which was also one of the reasons for choosing a dynamic website. Toward the end of the project period, we were forced to look for an alternative, because must-have requirements, such as visualizing the origin of the sagas on a map, could not be implemented satisfactorily. Researching alternatives showed that WordPress suited our needs. The functionalities already implemented in Typo3 at that point, as well as the still pending must-have and nice-to-have requirements, could be handled and met without issue, thanks to regular communication within the team.