Module 0: Introduction to Text Encoding and the TEI

3. Markup Languages in the Humanities

3.1. Procedural and Descriptive Markup

When human beings read texts, they perceive both the information stored in the linguistic code of the text and the meta-information inferred from its appearance and interpretation. By convention, italics are used, for instance, as a code signalling the title of a book, play, or movie; a foreign word or phrase; or emphatic use of language. Through their cognitive abilities, readers usually have no problem selecting the most appropriate interpretation of an italicised string of text. Computers, however, need to be told about these distinctions explicitly before they can process them. This can be done by way of a markup language that provides rules to formally separate information (the text in a document) from meta-information (information about the text in a document). Whereas the markup languages in use in the typesetting community were mainly of a procedural nature, indicating procedures that a particular application should follow (e.g., printing a string of text in italics), the humanities were chiefly concerned with descriptive markup, which identifies the entity type of tokens (e.g., identifying that a string of text is the title of a book or a foreign word). Unlike procedural or presentational markup, descriptive markup establishes a one-to-one mapping between logical elements in the text and their markup.
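
The difference is easiest to see side by side. In the sketch below, the first fragment uses RTF's procedural \i control word, which merely requests italics, while the second uses invented descriptive element names (they are chosen purely for illustration and belong to no particular scheme) to record what the italicised strings actually are:

    He borrowed the term {\i Sturm und Drang} from {\i Faust}.

    He borrowed the term <foreign lang="de">Sturm und Drang</foreign>
    from <title>Faust</title>.

A formatter can do no more with the first version than render italics; the second can still be rendered in italics, but can also be indexed as a title or filtered by language, because the meta-information is explicit.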

3.2. Early Attempts

A degree of standardisation of markup for the encoding and analysis of literary texts was reached by the COCOA encoding scheme, originally developed for the COCOA program in the 1960s and 1970s (Russel 1967) but used as an input standard by the Oxford Concordance Program (OCP) in the 1980s (Hockey 1980) and by the Textual Analysis Computing Tools (TACT) in the 1990s (Lancashire et al. 1996). For the transcription and encoding of classical Greek texts, the Beta Code transcription/encoding system reached some level of standardised use (Berkowitz, Squitier, and Johnson 1986).
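
Both schemes can be illustrated with short reconstructed fragments (illustrative sketches, not quotations from any actual corpus file). COCOA markup consists of category-value pairs in angle brackets whose category letters are user-defined; each tag sets a reference value (here author, title, and speaker) for all the text that follows:

    <A SHAKESPEARE>
    <T HAMLET>
    <S BARNARDO>
    Who's there?

Beta Code transliterates polytonic Greek into plain ASCII, with, for example, * marking a capital letter, ) a smooth breathing, / an acute accent, and = a circumflex, so that the opening words of the Iliad become:

    *MH=NIN A)/EIDE QEA/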

3.3. The Standard Generalized Markup Language (SGML)

The call for a markup language that could guarantee reusability, interchange, system- and software-independence, portability, and collaboration in the humanities was answered by the publication of the Standard Generalized Markup Language (SGML) as an ISO standard in 1986 (ISO 8879:1986) (Goldfarb 1990). Based on IBM's Document Composition Facility Generalized Markup Language (GML), SGML was developed mainly by Charles Goldfarb as a metalanguage for the description of markup schemes that satisfied at least seven requirements for an encoding standard (Barnard, Fraser, and Logan 1988, 28–29):

  1. The requirement of comprehensiveness;
  2. The requirement of simplicity;
  3. The requirement that documents be processable by software of moderate complexity;
  4. The requirement that the standard not be dependent on any particular character set or text-entry device;
  5. The requirement that the standard not be geared to any particular analytic program or printing system;
  6. The requirement that the standard should describe text in editable form;
  7. The requirement that the standard allow the interchange of encoded texts across communication networks.

In order to achieve universal exchangeability and software and platform independence, SGML made exclusive use of ASCII codes. As mentioned above, SGML is not a markup language itself, but a metalanguage by which one can create separate markup languages for separate purposes. This means that SGML defines the rules and procedures for specifying the vocabulary and the syntax of a markup language in a formal Document Type Definition (DTD). Such a DTD is a formal description of, for instance, names for all elements, names and default values for their attributes, rules about how elements can nest and how often they can occur, and names for re-usable pieces of data (entities). The DTD enables full control, parsing, and validation of SGML-encoded documents. By far the most popular SGML DTD is the Hypertext Markup Language (HTML), developed for the exchange of hypertext documents over the internet.
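
A minimal sketch of what such a DTD looks like, here using an invented anthology vocabulary and the simplified XML declaration syntax (a full SGML element declaration would additionally carry tag-omission indicators, e.g. <!ELEMENT poem - - (title?, stanza+)>):

    <!ELEMENT anthology (poem+)>                    <!-- one or more poems -->
    <!ELEMENT poem      (title?, stanza+)>          <!-- optional title, then stanzas -->
    <!ELEMENT stanza    (line+)>
    <!ELEMENT title     (#PCDATA)>                  <!-- character data only -->
    <!ELEMENT line      (#PCDATA)>
    <!ATTLIST poem status (draft | final) "draft">  <!-- attribute with a default value -->
    <!ENTITY  mla "Modern Language Association">    <!-- re-usable piece of data -->

A parser armed with these declarations can reject, for instance, a stanza appearing outside a poem or a poem containing two titles.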

A markup scheme with all these qualities was exactly what the humanities were looking for in their quest for a descriptive encoding standard for the preparation and interchange of electronic texts for scholarly research. There was a strong consensus among computing humanists that SGML offered a better foundation for research-oriented text encoding than other such schemes (Barnard, Fraser, and Logan 1988, 26–31; Barnard et al. 1988). From the beginning, however, SGML was also criticised on at least two counts: its hierarchical perspective on text, i.e., the representation of text as a hierarchical tree structure, and its verbose markup system (Barnard et al. 1988). These two issues have since been central to the theoretical and educational debates on markup languages in the humanities.
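
The first criticism is easiest to grasp through the classic case of overlapping structures. In the deliberately ill-formed sketch below (with invented element names, l for verse line and s for sentence), a sentence begins in the middle of one line and ends in the middle of the next, so neither element can contain the other and no single tree accommodates both:

    <l>Scorn not the Sonnet; <s>Critic, you have frowned,</l>
    <l>Mindless of its just honours;</s> with this key</l>

An SGML or XML parser must reject such a fragment because the tags interleave rather than nest; encoders have to privilege one hierarchy and recover the other through workarounds.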

3.4. The eXtensible Markup Language (XML)

The publication of the eXtensible Markup Language (XML) 1.0 as a W3C Recommendation in 1998 (Bray, Paoli, and Sperberg-McQueen 1998) brought together the best features of SGML and HTML and soon achieved huge popularity. Among the features XML borrowed from SGML are the explicitness of descriptive markup, the expressive power of hierarchical models, the extensibility of markup languages, and the possibility of validating a document against a DTD. From HTML it borrowed simplicity and the possibility of working without a DTD. Technically speaking, XML is a subset of SGML, and the recommendation was developed by a group of people with long-standing experience of SGML, many of whom were TEI members.
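
The possibility of working without a DTD is worth a brief sketch, reusing the invented poem vocabulary from the DTD example above. The following document is well-formed (every start-tag matched by an end-tag, elements properly nested, a single root element), so any XML parser can process it even though no DTD is declared; adding a document type declaration such as <!DOCTYPE poem SYSTEM "poem.dtd"> would additionally make it validatable:

    <?xml version="1.0" encoding="UTF-8"?>
    <poem status="final">
      <title>Correspondances</title>
      <stanza>
        <line>La Nature est un temple où de vivants piliers</line>
        <line>Laissent parfois sortir de confuses paroles;</line>
      </stanza>
    </poem>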

Because of its advantages and widespread popularity, XML became the metalanguage of choice for expressing the rules for descriptive text encoding in TEI.

Bibliography

  • Barnard, David T., Cheryl A. Fraser, and George M. Logan. 1988. “Generalized Markup for Literary Texts.” Literary and Linguistic Computing 3 (1): 26–31. 10.1093/llc/3.1.26.
  • Barnard, David T., Ron Hayter, Maria Karababa, George M. Logan, and John McFadden. 1988. “SGML-Based Markup for Literary Texts: Two Problems and Some Solutions.” Computers and the Humanities 22 (4): 265–276.
  • Berkowitz, Luci, Karl A. Squitier, and William H. A. Johnson. 1986. Thesaurus Linguae Graecae, Canon of Greek Authors and Works. New York/Oxford: Oxford University Press.
  • Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation, 10 February 1998. https://www.w3.org/TR/1998/REC-xml-19980210 (accessed September 2008).
  • Burnard, Lou. 1988. “Report of Workshop on Text Encoding Guidelines.” Literary and Linguistic Computing 3 (2): 131–133. 10.1093/llc/3.2.131.
  • Burnard, Lou, and C. M. Sperberg-McQueen. 2006. “TEI Lite: Encoding for Interchange: An Introduction to the TEI.” Revised for TEI P5 release, February 2006. https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html.
  • DeRose, Steven J. 1999. “XML and the TEI.” Computers and the Humanities 33 (1–2): 11–30.
  • Goldfarb, Charles F. 1990. The SGML Handbook. Oxford: Clarendon Press.
  • Hockey, Susan. 1980. Oxford Concordance Program Users’ Manual. Oxford: Oxford University Computing Service.
  • Ide, Nancy M., and C. M. Sperberg-McQueen. 1988. “Development of a Standard for Encoding Literary and Linguistic Materials.” In Cologne Computer Conference 1988. Uses of the Computer in the Humanities and Social Sciences. Volume of Abstracts. Cologne, Germany, Sept 7–10 1988, p. E.6-3-4.
  • ———. 1995. “The TEI: History, Goals, and Future.” Computers and the Humanities 29 (1): 5–15.
  • Kay, Martin. 1967. “Standards for Encoding Data in a Natural Language.” Computers and the Humanities 1 (5): 170–177.
  • Lancashire, Ian, John Bradley, Willard McCarty, Michael Stairs, and Terence Russon Wooldridge. 1996. Using TACT with Electronic Texts. New York: Modern Language Association of America.
  • Russel, D. B. 1967. COCOA: A Word Count and Concordance Generator for Atlas. Chilton: Atlas Computer Laboratory.
  • Sperberg-McQueen, C. M. 1991. “Text in the Electronic Age: Textual Study and Text Encoding, with Examples from Medieval Texts.” Literary and Linguistic Computing 6 (1): 34–46. 10.1093/llc/6.1.34.
  • Sperberg-McQueen, C. M., and Lou Burnard (eds.). 1990. TEI P1: Guidelines for the Encoding and Interchange of Machine Readable Texts. Chicago/Oxford: ACH-ALLC-ACL Text Encoding Initiative. https://tei-c.org/Vault/Vault-GL.html (accessed October 2008).
  • ———. 1993. TEI P2: Guidelines for the Encoding and Interchange of Machine Readable Texts. Draft P2, published serially 1992–1993; Draft Version 2 of April 1993: 19 chapters. https://tei-c.org/Vault/Vault-GL.html (accessed October 2008).
  • ———. 1994. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
  • ———. 1999. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Revised reprint. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
  • ———. 2002. TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML-compatible edition. XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium. https://tei-c.org/Vault/P4/doc/html/ (accessed October 2008).
  • TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Oxford, Providence, Charlottesville, Nancy: TEI Consortium. https://tei-c.org/Vault/P5/1.0.0/doc/tei-p5-doc/en/html/ (accessed October 2008).