TEI by Example. Module 0: Introduction to text encoding and the TEI
Edward Vanhoutte, Ron Van den Branden, Melissa Terras
Association for Literary and Linguistic Computing (ALLC); Centre for Digital Humanities (CDH), University College London, UK; Centre for Computing in the Humanities (CCH), King's College London, UK; Centre for Scholarly Editing and Document Studies (CTB), Royal Academy of Dutch Language and Literature, Belgium
Centre for Scholarly Editing and Document Studies (CTB) Royal Academy of Dutch Language and Literature Koningstraat 18 9000 Gent Belgium
ctb@kantl.be

9 July 2010
TEI By Example. Edited by Edward Vanhoutte, Ron Van den Branden, and Melissa Terras.

Digitally born

TEI By Example offers a series of freely available online tutorials walking individuals through the different stages in marking up a document in TEI (Text Encoding Initiative). Besides a general introduction to text encoding, step-by-step tutorial modules provide example-based introductions to eight different aspects of electronic text markup for the humanities. Each tutorial module is accompanied by a dedicated examples section, illustrating actual TEI encoding practice with real-life examples. The theory of the tutorial modules can be tested in interactive tests and exercises.

Examples for Module 0: Introduction
Introduction

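Unlike the other examples sections, this examples section of the introductory TBE tutorial illustrates different types of encoding for one sample text. The markup samples range from procedural to descriptive markup languages, in a variety of formats (text, SGML, XML). Before starting, have a look at the sample document: a short text headed "Review", consisting of the single paragraph "Die Leiden des jungen Werther is an exceptionally good example of a book full of Weltschmerz.", with a footnote reading "by Goethe" attached to the book title. The book title, "exceptionally", and "Weltschmerz" are printed in italics.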

In this short piece of prose, the following text structures can be distinguished: a heading, a paragraph, and a footnote.

Apart from these structures, some textual phenomena can be distinguished: a title (Die Leiden des jungen Werther), emphasised text (exceptionally), a term (Weltschmerz), and a name (Goethe).

Let's have a look at how different encoding flavours treat these phenomena.

LaTeX

This example illustrates how the text above could be encoded in LaTeX, an open source typesetting language that can be interpreted by TeX typesetting programs to produce fixed-layout representations such as PDF. LaTeX is not an XML format and makes use of procedural markup: its meta-information (commands starting with the \ character) consists of instructions to the rendering software on the layout of the text content. As you can see, the actual text contents are preceded by a declaration of several style aspects determining how the text has to be rendered on a page. The text is captured as a {document}, in which all italicised words are marked as italic (\textit{}), without distinguishing between the reasons for this typographic emphasis. The footnote is distinguished (\footnote{}), but there is no way of telling the computer that Goethe is a proper name.
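The consequence of procedural markup can be made concrete with a small Python sketch. The LaTeX fragment in it is a hypothetical reconstruction of the sample body (it is not the tutorial's own listing), and the helper names `italic_spans` and `footnotes` are invented for illustration. A program reading this source can list what is italic, but not why.

```python
import re

# Hypothetical LaTeX body for the sample text (an assumption, not the
# tutorial's own listing): every italicised phrase uses the same command.
LATEX_BODY = r"""
\section*{Review}
\textit{Die Leiden des jungen Werther}\footnote{by Goethe} is an
\textit{exceptionally} good example of a book full of \textit{Weltschmerz}.
"""


def italic_spans(source):
    """Return the argument of every \\textit{...} command (nested braces are not handled)."""
    return re.findall(r"\\textit\{([^{}]*)\}", source)


def footnotes(source):
    """Return the argument of every \\footnote{...} command (nested braces are not handled)."""
    return re.findall(r"\\footnote\{([^{}]*)\}", source)


# Title, emphasis and term come back as one undifferentiated list.
print(italic_spans(LATEX_BODY))
# The footnote is found, but nothing marks "Goethe" inside it as a name.
print(footnotes(LATEX_BODY))
```

The first list mixes a book title, an emphasised word and a term, because the markup records only the rendering instruction they share.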

OpenDocument Format

The same document can be encoded in the OpenDocument Format, an XML encoding scheme for representing electronic documents such as spreadsheets, charts, presentations and word processing documents, that can be interpreted by (desktop) publishing systems such as the Open Office software suite. Note that, despite ODF being expressed in XML, there are many similarities to the LaTeX approach in the previous example. ODF is a procedural encoding scheme as well, providing an XML vocabulary to describe different formatting styles. The text itself is encoded in an office:text element, in which several structural elements are distinguished: headings, paragraphs, and footnotes, each with their own associated rendering instructions in the form of styles. All italicised text is represented in the encoding, with references to different style definitions that are responsible for rendering the text italic in the output. Here, too, there is no way of indicating that the visually unmarked Goethe is a proper name.

Text content of the ODF example (markup not preserved): Review. Die Leiden des jungen Werther [note 1: by Goethe] is an exceptionally good example of a book full of Weltschmerz.
COCOA

In the next example, the sample text is encoded in COCOA. Like the LaTeX example above, this encoding scheme is not XML, but it differs in that COCOA is a descriptive markup scheme. It provides a simple means to distinguish user-defined categories in a text, by labelling them unambiguously with one-letter tag names. There are two possibilities: either the text is encoded in the tag (e.g. <H Review> identifies the text Review as belonging to the category H, for "heading"), or a tag is numbered (e.g. <P 1> indicates that the text following it is part of the first paragraph). This enables the encoder not only to distinguish all text structures (heading (H), paragraph (P), footnote (F)), but also to distinguish between the different textual phenomena that occur as italicised text (book title (B), emphasis (E), term (T)). Moreover, the typographically unmarked proper name Goethe can be tagged as such as well (N).

<H Review>
<P 1><B Die Leiden des jungen Werther><F 1>by <N Goethe> is an <E exceptionally> good example of a book full of <T Weltschmerz>
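Because each category has its own label, a program can retrieve the categories separately. The following Python sketch assumes the simplified one-letter tag syntax of the sample only; it is not a full COCOA parser, and the names `CATEGORIES`, `TAG` and `tagged_values` are invented for illustration.

```python
import re
from collections import defaultdict

# COCOA-style encoding of the sample text, following the sample in this section.
COCOA_SAMPLE = (
    "<H Review>\n"
    "<P 1><B Die Leiden des jungen Werther><F 1>by <N Goethe> is an "
    "<E exceptionally> good example of a book full of <T Weltschmerz>"
)

# One-letter categories and their meaning, as described in the prose.
CATEGORIES = {
    "H": "heading", "P": "paragraph", "F": "footnote",
    "B": "book title", "E": "emphasis", "T": "term", "N": "name",
}

# A tag is "<", one capital letter, whitespace, a value, optional space, ">".
TAG = re.compile(r"<([A-Z])\s+([^<>]*?)\s*>")


def tagged_values(source):
    """Group the value of every tag under the name of its category."""
    found = defaultdict(list)
    for letter, value in TAG.findall(source):
        found[CATEGORIES.get(letter, letter)].append(value)
    return dict(found)


for category, values in tagged_values(COCOA_SAMPLE).items():
    print(category, values)
```

Unlike the italic spans of a procedural encoding, the book title, the emphasised word, the term and the name each end up under their own category.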
TEI P3 (SGML)

The sample text could be encoded in TEI P3 as well. Being TEI, this is a descriptive encoding scheme that allows the encoder to explicate the structure and semantics of the textual features s/he wants to analyse. In our sample, we see the typical features of TEI documents (although some of the names have evolved since version P3): a document is encoded in a TEI.2 element, containing both a teiHeader section for the meta-information, and a text part for the actual text contents. The header must contain a minimal amount of meta-information, while the text content itself is encoded in body. Inside the text, the structural elements (heading -- head, paragraph -- p, footnote -- note @place=foot), as well as semantic features (title -- title, emphasis -- emph, term -- term) can be fully expressed with comprehensible tag names.

Note, however, that this is SGML, not XML: some elements can occur without end tags (title, body, p, head), and attribute values can occur without surrounding quotes (place=foot).

<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Review: an electronic transcription
      </titleStmt>
      <publicationStmt>
        <p>Published as an example for the Introduction module of TBE.
      </publicationStmt>
      <sourceDesc>
        <p>No source: born digital.
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <head>Review
      <p><title>Die Leiden des jungen Werther</title><note place=foot>by <name>Goethe</name></note> is an <emph>exceptionally</emph> good example of a book full of <term>Weltschmerz</term>.
  </text>
</TEI.2>
TEI P5 (XML)

Finally, this example illustrates how a TEI P5 (XML) encoding of the sample text could look. The latest version of the TEI Guidelines specifies a descriptive encoding scheme in XML format. As you'll see, there are many similarities with the TEI P3 encoding of the previous example: all structural and semantic text features can be indicated and labelled with fairly intuitive element names. Still, some differences stand out: in TEI P5, all elements must have end tags; all attribute values must be surrounded by quotes; some basic element names have changed (e.g. the first element of any TEI P5 text is now called TEI); and many details of the text ontology have been changed, with some elements revised, improved, deleted, or added.
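The first two differences are general XML well-formedness rules, so any XML parser enforces them. As a minimal sketch, the Python standard library parser can be used to check small fragments; the fragments and the helper name `is_well_formed` are illustrative assumptions, not part of the tutorial's listings.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# SGML-style habits that XML does not allow (illustrative fragments).
UNQUOTED_ATTRIBUTE = "<note place=foot>by <name>Goethe</name></note>"
MISSING_END_TAG = "<body><head>Review<p>Some text.</p></body>"

# The same structures written the XML way.
XML_STYLE = (
    "<body><head>Review</head>"
    '<p><note place="foot">by <name>Goethe</name></note></p></body>'
)

for label, fragment in [
    ("unquoted attribute", UNQUOTED_ATTRIBUTE),
    ("missing end tag", MISSING_END_TAG),
    ("XML style", XML_STYLE),
]:
    print(label, is_well_formed(fragment))
```

Well-formedness is only the first hurdle: a well-formed document can still be invalid TEI, which this sketch does not check.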

The TBE tutorials will guide you through the most important sections of the TEI Guidelines that should enable you to encode the most common features of different text genres, and derive TEI encoding schemes according to your needs.

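<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Review: an electronic transcription</title>
      </titleStmt>
      <publicationStmt>
        <p>Published as an example for the Introduction module of TBE.</p>
      </publicationStmt>
      <sourceDesc>
        <p>No source: born digital.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <head>Review</head>
      <p><title>Die Leiden des jungen Werther</title><note place="foot">by <name>Goethe</name></note> is an <emph>exceptionally</emph> good example of a book full of <term>Weltschmerz</term>.</p>
    </body>
  </text>
</TEI>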
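Once the text is encoded descriptively in XML, standard tools can address each feature by name. The Python sketch below assumes a P5 encoding of the sample body that follows the differences listed in this section; the teiHeader is left out for brevity, so the string is not a complete TEI document, and the helper name `contents` is invented for illustration.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this namespace.
TEI = "http://www.tei-c.org/ns/1.0"

# Assumed P5 encoding of the sample body (header omitted for brevity).
SAMPLE = f"""<TEI xmlns="{TEI}">
  <text>
    <body>
      <head>Review</head>
      <p><title>Die Leiden des jungen Werther</title><note place="foot">by <name>Goethe</name></note> is an <emph>exceptionally</emph> good example of a book full of <term>Weltschmerz</term>.</p>
    </body>
  </text>
</TEI>"""


def contents(root, element_name):
    """Collect the text of every element with the given TEI element name."""
    qualified = f"{{{TEI}}}{element_name}"
    return ["".join(el.itertext()).strip() for el in root.iter(qualified)]


root = ET.fromstring(SAMPLE)
for element_name in ("head", "title", "emph", "term", "name", "note"):
    print(element_name, contents(root, element_name))
```

Each query returns exactly one category of text, including the proper name that carries no typographic marking at all.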