Module 0: Introduction to Text Encoding and the TEI
5. TEI: Ground Rules
5.1. Guidelines
The conclusions and the work of the TEI community are formulated as guidelines, rules, and recommendations rather than standards, because it is acknowledged that each scholar must have the freedom to express their own theory of text by encoding the features they consider important in the text. The TEI Guidelines demonstrate a wide array of possible solutions to encoding problems, and should therefore be considered a reference manual rather than a tutorial. Mastering the complete TEI encoding scheme implies a steep learning curve, but few projects require complete knowledge of the TEI. Therefore, a manageable subset of the full TEI encoding scheme was published as TEI Lite, currently describing 140 elements (Burnard and Sperberg-McQueen 2006). Originally intended as an introduction and a didactic stepping stone to the full recommendations, TEI Lite has, since its publication in 1995, become one of the most popular TEI customisations and is often said to meet the needs of 90% of the TEI community, 90% of the time.
5.2. TEI Modules
A significant part of the rules in the TEI Guidelines applies to the expression of descriptive and structural meta-information about the text. Yet the TEI defines concepts to represent a much wider array of textual phenomena, amounting to a total of 580 elements and 265 attributes. These are organised into 21 modules, each grouping related elements and attributes:
- The TEI Infrastructure (tei)
  - Definition of common datatypes and modular class structures used to define the elements and attributes in the other modules.
- The TEI Header (header)
  - Definition of the elements that make up the header section of TEI documents. Its major parts provide elements to encode detailed metadata about bibliographic aspects of electronic texts, their relationship with the source materials from which they may have been derived, non-bibliographic details, and a complete revision history.
- Elements Available in All TEI Documents (core)
  - Definition of elements and attributes that may occur in any TEI text, of whatever genre. These elements cover textual phenomena like paragraphs, highlighting and quotation, editorial changes (marking of errors, regularisations, additions), data-like structures (names, addresses, dates, numbers, abbreviations), cross-reference mechanisms, lists, notes, graphical elements, bibliographic references, and passages of verse or drama.
- Default Text Structure (textstructure)
  - Definition of elements and attributes that describe the structure of TEI texts, like front matter and title pages, text body, and back matter. These may contain further divisions, possibly introduced by headings, salutations, opening formulae, and/or concluded by closing formulae, closing salutations, trailing material and postscripts.
- Characters, Glyphs, and Writing Modes (gaiji)
  - Definition of specific provisions for representing characters for which no standardised representation (such as defined by the Unicode Consortium) exists.
- Verse (verse)
  - Definition of specific elements and attributes for dedicated analysis of verse materials, such as caesurae, metrical systems, rhyme schemes, and enjambments.
- Performance Texts (drama)
  - Definition of specific elements and attributes for dedicated analysis of drama materials. These include provisions for encoding specific phenomena in front and back matter, like details about performances, prologues, epilogues, the dramatic setting, and cast lists. Other drama-specific structures include speeches and stage directions. For multimedia performances, elements for the description of screen contents, camera angles, captions, and sound are provided.
- Transcriptions of Speech (spoken)
  - Definition of elements and attributes for (general purpose) transcription of different kinds of spoken material. These cover phenomena like utterances, pauses, non-lexical sounds, gestures, and shifts in vocal quality. Besides this, specific header elements for describing the vocal source of the transcription are provided.
- Dictionaries (dictionaries)
  - Definition of elements and attributes for representing dictionaries, with provisions for unstructured and structured dictionary entries (possibly grouped). Dictionary entries may be structured with a number of specific elements indicating homonyms, sense, word form, grammatical information, definitions, citations, usage, and etymology.
- Manuscript Description (msdescription)
  - Definition of specific header and structural elements and attributes for the encoding of manuscript sources. Header elements include provisions for detailed documentation of a manuscript’s or manuscript part’s identification, heading information, contents, physical description, history, and additional information. Dedicated text elements cover phenomena like catchwords, dimensions, heraldry, watermarks, and so on.
- Representation of Primary Sources (transcr)
  - Definition of elements and attributes for detailed transcription of primary sources. Phenomena covered are facsimiles, more complex additions, deletions, substitutions and restorations, document hands, damage to the source material and illegibility of the text.
- Critical Apparatus (textcrit)
  - Definition of elements and attributes for the representation of (different versions of texts as) scholarly editions, listing all variation between the versions in a variant apparatus.
- Names, Dates, People, and Places (namesdates)
  - Definition of elements and attributes for more detailed analysis of names of persons, organisations, and places, their referents (persons, organisations, and places) and aspects of temporal analyses.
- Tables, Formulæ, Graphics and Notated Music (figures)
  - Definition of specific elements and attributes for detailed representation of graphical elements in texts, like tables, formulae, images, and notated music.
- Language Corpora (corpus)
  - Definition of elements and attributes for the encoding of corpora of texts that have been collected according to specific criteria. Most of these elements apply to the documentation of these sampling criteria, and contextual information about the texts, participants, and their communicative setting.
- Linking, Segmentation, and Alignment (linking)
  - Definition of elements and attributes for representing complex systems of cross-references between identified anchor places in TEI texts. Recommendations are given for either in-line or stand-off reference mechanisms.
- Simple Analytic Mechanisms (analysis)
  - Definition of elements and attributes that allow the association of simple analyses and interpretations with text elements. Mechanisms for the representation of both generic and particularly linguistic analyses are discussed.
- Feature Structures (iso-fs)
  - Definition of elements and attributes for constructing complex analytical frameworks that can be used to represent specific analyses in TEI texts.
- Graphs, Networks, and Trees (nets)
  - Definition of elements and attributes for the analytical representation of schematic relationships between nodes in graphs and charts.
- Certainty, Precision, and Responsibility (certainty)
  - Definition of elements for detailed attribution of certainty for the encoding in a TEI text, as well as the identification of the responsibility for these encodings.
- Documentation Elements (tagdocs)
  - Definition of elements and attributes for the documentation of the encoding scheme used in TEI texts. This module provides means to define elements, attributes, element and attribute classes, either by changing existing definitions or by creating new ones.
Each of these modules and the use of the elements they define are discussed extensively in a dedicated chapter of the TEI Guidelines.
5.3. Using TEI
Besides more technical benefits, Steven DeRose pointed out substantial advantages of XML for the TEI community: by allowing more flexible automatic parsing strategies and easy delivery of electronic documents with cheap, ubiquitous tools such as web browsers, XML could spread the notion of descriptive markup to a wide audience, which would thus become acquainted with the concepts articulated in the TEI Guidelines (DeRose 1999, 19).
In order to use TEI for the encoding of texts, users must make sure that the elements in their texts belong to the TEI namespace (http://www.tei-c.org/ns/1.0) and adhere to the requirements of the text model proposed by the TEI. To facilitate this conformance, it is possible (and strongly suggested) to associate TEI texts with formal representations of this text model. Such a formal “structural grammar” of a TEI-compatible text model is commonly referred to as a TEI schema, and can be expressed in a variety of formal languages such as Document Type Definition (DTD), W3C XML Schema, or the RELAX NG schema language.
It is important to note that no such thing as “the TEI schema” exists. Rather, users are expected to select their desired TEI elements and attributes from the TEI modules, possibly with alterations or extensions where required. In this way, TEI offers a stable base with unambiguous means for the representation of basic textual phenomena, while providing standardised mechanisms for user customisation to cover features it does not address. It is a particular feature of TEI that these abstract text models can themselves be expressed as TEI texts, using the documentation elements defined in the dedicated tagdocs module. Such a TEI text is called a TEI customisation file.
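As an illustration, a minimal TEI customisation file could look like the sketch below. The title, the publication statement, and the schema identifier myTEI are placeholders invented for this example; the four module keys are the identifiers listed in section 5.2.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <!-- Placeholder title for this illustration -->
        <title>A minimal TEI customisation</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished illustration</p>
      </publicationStmt>
      <sourceDesc>
        <p>No source, born digital</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- The schema specification: ident is a hypothetical name,
           start names the root element of conforming documents -->
      <schemaSpec ident="myTEI" start="TEI">
        <moduleRef key="tei"/>
        <moduleRef key="core"/>
        <moduleRef key="header"/>
        <moduleRef key="textstructure"/>
      </schemaSpec>
    </body>
  </text>
</TEI>
```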
Besides the common minimal TEI structure (<teiHeader> and <text>), a TEI customisation file has one specific element which defines the TEI schema (<schemaSpec>). A TEI schema must minimally include the modules which define the minimal TEI text structure: the tei module, the core module with all common TEI elements, the header module defining all teiHeader elements, and the textstructure module defining the elements representing the minimal structure of TEI texts.
In the vein of “Literate Programming,” a TEI customisation file not only contains the formal declaration of TEI elements inside <schemaSpec>, but may also contain prose documentation of the TEI encoding scheme it defines. Consequently, TEI customisation files are commonly called “ODD files” (One Document Does it all), because they serve as a single source for the derivation of:
- a formal TEI schema
- human-friendly documentation of the TEI encoding scheme
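The combination of prose and formal declarations might be sketched as follows. Only the <body> of a hypothetical ODD file is shown; the heading, the paragraph of documentation, and the identifier myTEI are assumptions made for the sake of the example.

```xml
<body xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Prose documentation, written with ordinary TEI elements -->
  <head>Encoding guidelines for a sample project</head>
  <p>Names of organisations are encoded with the generic
    name element, carrying a type attribute with the value
    organisation.</p>
  <!-- Formal declaration, from which a schema can be derived -->
  <schemaSpec ident="myTEI" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="core"/>
    <moduleRef key="header"/>
    <moduleRef key="textstructure"/>
  </schemaSpec>
</body>
```

An ODD processor can read the <schemaSpec> part to produce the schema, and the surrounding prose to produce the documentation.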
To support the process of creating customised TEI schemas and prose documentation, the TEI has developed a dedicated piece of software called “Roma,” available at https://roma.tei-c.org/. This ODD processor offers an intuitive web-based interface for the creation and basic editing of ODD files, and for the generation of the corresponding TEI schemas and prose documentation in a number of presentation formats.
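One common way of associating a TEI document with a generated schema, not prescribed by the text above, is the xml-model processing instruction, which many validating XML editors recognise. The sketch below assumes a RELAX NG schema saved under the hypothetical file name myTEI.rng next to the document; the content of the document itself is elided.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Points validating tools to a RELAX NG schema; the file name is an assumption -->
<?xml-model href="myTEI.rng" type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- teiHeader and text go here, as in any TEI document -->
</TEI>
```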
A TEI schema, declaring all structural conditions and constraints for the elements and attributes in TEI texts, can then be used to automatically validate actual TEI documents with a validating XML parser. Consider, for example, the following fragments:
[A]
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>A sample TEI document</title>
          </titleStmt>
          <publicationStmt>
            <publisher>KANTL</publisher>
            <pubPlace>Ghent</pubPlace>
            <date when="2009"/>
          </publicationStmt>
          <sourceDesc>
            <p>No source, born digital</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <p>This is a sample paragraph, illustrating a <name type="organisation">TEI</name> document.</p>
        </body>
      </text>
    </TEI>
Example 12. A sample (valid) TEI document.
[B]
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <text>
        <body>
          <p>This is a sample paragraph, illustrating a <orgName>TEI</orgName> document.</p>
        </body>
      </text>
    </TEI>
Example 13. A sample (invalid) TEI document.
When validated against a TEI schema derived from a minimal ODD file that includes only the tei, core, header, and textstructure modules, file [A] will be recognised as a valid TEI document, while file [B] will not, for two reasons:
- The TEI prescribes that the <teiHeader> element must be present in a TEI document, and that it precedes the <text> part.
- The minimal set of TEI modules does not include the specialised <orgName> element. Although it is a TEI element, using it requires selection of the appropriate TEI module in the ODD file (in this case, the namesdates module, defining specialised elements for the identification of names, dates, people, and places in a text).
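To allow <orgName>, the schema specification in the ODD file would need one more module reference. The fragment below is a sketch of such an extended <schemaSpec>, meant to sit inside the body of an ODD file; the identifier myTEI is a placeholder. Note that file [B] would still be rejected by the resulting schema until a <teiHeader> is added before its <text>.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="myTEI" start="TEI">
  <!-- The four modules of the minimal TEI text structure -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- Added: makes orgName and the other namesdates elements available -->
  <moduleRef key="namesdates"/>
</schemaSpec>
```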
Bibliography
- Barnard, David T., Cheryl A. Fraser, and George M. Logan. 1988. “Generalized Markup for Literary Texts.” Literary and Linguistic Computing 3 (1): 26–31. 10.1093/llc/3.1.26.
- Barnard, David T., Ron Hayter, Maria Karababa, George M. Logan, and John McFadden. 1988. “SGML-Based Markup for Literary Texts: Two Problems and Some Solutions.” Computers and the Humanities 22 (4): 265–276.
- Berkowitz, Luci, Karl A. Squitier, and William H. A. Johnson. 1986. Thesaurus Linguae Graecae, Canon of Greek Authors and Works. New York/Oxford: Oxford University Press.
- Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998. https://www.w3.org/TR/1998/REC-xml-19980210 (accessed September 2008).
- Burnard, Lou. 1988. “Report of Workshop on Text Encoding Guidelines.” Literary and Linguistic Computing 3 (2): 131–133. 10.1093/llc/3.2.131.
- Burnard, Lou, and C. M. Sperberg-McQueen. 2006. “TEI Lite: Encoding for Interchange: an introduction to the TEI Revised for TEI P5 release.” February 2006. https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html.
- DeRose, Steven J. 1999. “XML and the TEI.” Computers and the Humanities 33 (1–2): 11–30.
- Goldfarb, Charles F. 1990. The SGML Handbook. Oxford: Clarendon Press.
- Hockey, Susan. 1980. Oxford Concordance Program Users’ Manual. Oxford: Oxford University Computing Service.
- Ide, Nancy M., and C. M. Sperberg-McQueen. 1988. “Development of a Standard for Encoding Literary and Linguistic Materials.” In Cologne Computer Conference 1988. Uses of the Computer in the Humanities and Social Sciences. Volume of Abstracts. Cologne, Germany, Sept 7–10 1988, p. E.6-3-4.
- ———. 1995. “The TEI: History, Goals, and Future.” Computers and the Humanities 29 (1): 5–15.
- Kay, Martin. 1967. “Standards for Encoding Data in a Natural Language.” Computers and the Humanities 1 (5): 170–177.
- Lancashire, Ian, John Bradley, Willard McCarty, Michael Stairs, and Terence Russon Woolridge. 1996. Using TACT with Electronic Texts. New York: Modern Language Association of America.
- Russel, D. B. 1967. COCOA: A Word Count and Concordance Generator for Atlas. Chilton: Atlas Computer Laboratory.
- Sperberg-McQueen, C. M. 1991. “Text in the Electronic Age: Textual Study and Text Encoding with examples from Medieval Texts.” Literary and Linguistic Computing 6 (1): 34–46. 10.1093/llc/6.1.34.
- Sperberg-McQueen, C. M., and Lou Burnard (eds.). 1990. TEI P1: Guidelines for the Encoding and Interchange of Machine Readable Texts. Chicago/Oxford: ACH-ALLC-ACL Text Encoding Initiative. https://tei-c.org/Vault/Vault-GL.html (accessed October 2008).
- ———. 1993. TEI P2 Guidelines for the Encoding and Interchange of Machine Readable Texts Draft P2 (published serially 1992–1993); Draft Version 2 of April 1993: 19 chapters. https://tei-c.org/Vault/Vault-GL.html (accessed October 2008).
- ———. 1994. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
- ———. 1999. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Revised reprint. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
- ———. 2002. TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML-compatible edition. XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium. https://tei-c.org/Vault/P4/doc/html/ (accessed October 2008).
- TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Oxford, Providence, Charlottesville, Nancy: TEI Consortium. https://tei-c.org/Vault/P5/1.0.0/doc/tei-p5-doc/en/html/ (accessed October 2008).