Module 0: Introduction to Text Encoding and the TEI

4. XML: Ground Rules

4.1. Recommendation

XML is a metalanguage with which one can create separate markup languages for separate purposes. It is platform-, software-, and system-independent and no-one “owns” XML, although specific XML markup languages can be owned by their creators. Generally speaking, XML empowers the content provider and facilitates data integration, exchange, maintenance, and extraction. XML is currently the de facto standard on the World Wide Web, partly because HTML (Hypertext Markup Language) was rephrased as an XML encoding language (XHTML). The XML specification is maintained by the World Wide Web Consortium (W3C), which published it as a recommendation in 1998 (Bray, Paoli, and Sperberg-McQueen 1998).

The big selling point of XML is that it is text-based. This means that each XML encoding language is entirely built up in ASCII (American Standard Code for Information Interchange), or plain text, and can be created and edited using a simple text editor like Notepad or its equivalents on other platforms. However, when you start working with XML, you will soon find that it is better to edit XML documents using a professional XML editor. While plain text editors don’t know that you’re writing TEI, XML editors will help you write error-free XML documents, validate your XML against a DTD or a schema, keep you to a valid XML structure, and enable you to perform transformations.

Since ASCII only provides for characters commonly found in the English language, different character encoding systems have been designed, such as ISO Latin-1 (ISO-8859-1) for Western European languages, and Unicode (with its encodings UTF-8 and UTF-16). By using these character encoding systems, non-ASCII characters such as French é, à, ç, Norwegian æ ø å, or Hebrew ק can be used in XML documents. Such non-ASCII characters can also be expressed in ASCII notation. The French à, for instance, is represented by the name agrave in the ISO Latin-1 entity set and by the hexadecimal number 00E0 in Unicode.
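As a rough illustration of the point above, the following Python sketch looks at the character à from three angles: its Unicode number, the bytes it becomes under two different encodings, and its mnemonic name. Python and its standard library are used here only as a convenient way to inspect characters; they are not part of the original tutorial, and the variable names are illustrative.

```python
# Illustrative sketch: one non-ASCII character seen from several angles.
from html.entities import name2codepoint

char = "\u00e0"  # the French character à

# Unicode number of the character, as a four-digit hexadecimal string.
code_point = format(ord(char), "04X")

# The same character stored under two different character encodings.
latin1_bytes = char.encode("iso-8859-1")  # one byte
utf8_bytes = char.encode("utf-8")         # two bytes

# The mnemonic name "agrave" is mapped to the same Unicode number.
agrave_code_point = name2codepoint["agrave"]

print(code_point, latin1_bytes, utf8_bytes, agrave_code_point)
```

The character is the same in every case; only its representation differs.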

Reference

See XML Resources, section 1 in the TBE Toolkit.

4.2. Components

Any XML encoding language consists of five components:

  • Processing Instructions
  • Elements
  • Attributes
  • Entity References
  • (P)CDATA

For example, a simple two-paragraph document could be encoded as follows in XML:

<?xml version="1.0" encoding="UTF-8"?>
<document>
<!-- paragraphs go here -->
<paragraph number="1">Paragraph one of
<title>an XML example</title>
.</paragraph>
<paragraph number="2">Paragraph two of this example.</paragraph>
</document>
Example 1. A sample XML document.

4.2.1. XML Declaration

An XML document is introduced by the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>
Example 2. An XML declaration.

The question marks in the delimiters <? and ?> signal that the XML declaration has the form of a processing instruction. The rest of the declaration states that what follows is XML which complies with version 1.0 of the recommendation and that the character encoding used is UTF-8, one of the encodings of Unicode. An XML declaration is optional, but if it is present it must be the very first content of the XML file: not even whitespace may precede it.
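The placement rule can be tried out with any XML parser. The sketch below uses the non-validating parser in Python's standard library (xml.etree.ElementTree), which is an assumption of this example rather than a tool prescribed by the tutorial; the helper name is_well_formed is hypothetical.

```python
# Minimal sketch: an XML declaration is optional, but if present it must come first.
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


declaration_first = '<?xml version="1.0"?>\n<document/>'
declaration_after_blank_line = '\n<?xml version="1.0"?>\n<document/>'
no_declaration = "<document/>"

print(is_well_formed(declaration_first))
print(is_well_formed(declaration_after_blank_line))
print(is_well_formed(no_declaration))
```

Only the second document is rejected, because a line break precedes its declaration.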

4.2.2. Elements and Attributes

The two-paragraph document above is an example of an XML document, representing both information and meta-information. Information (plain text) is contained in XML elements, delimited by start tags (e.g., <document>) and end tags (e.g., </document>). Additional information about these XML elements can be given in attributes, consisting of a name (e.g., @number) and a value (e.g., "1"). Attributes can only occur within the start tag of an element. XML comments are delimited by start markers (<!--) and end markers (-->). Everything inside comments is ignored by XML processing software: it is said to be “commented out.”
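To see how software perceives these components, the sample document can be handed to a parser. The following sketch assumes Python's standard xml.etree.ElementTree module, which is just one possible tool; with its default settings it discards comments while building the element tree.

```python
# Sketch: reading elements and attributes from the sample document.
import xml.etree.ElementTree as ET

sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
<!-- paragraphs go here -->
<paragraph number="1">Paragraph one of
<title>an XML example</title>
.</paragraph>
<paragraph number="2">Paragraph two of this example.</paragraph>
</document>"""

root = ET.fromstring(sample)

# The name of the outermost (root) element.
root_name = root.tag

# The direct children of the root; the comment does not show up here.
child_names = [child.tag for child in root]

# The value of the number attribute on each paragraph element.
numbers = [p.get("number") for p in root.findall("paragraph")]

print(root_name, child_names, numbers)
```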

4.2.3. Entity References

Entity references are predefined strings of data that a parser must resolve before parsing the XML document.

Note

A parser is a piece of software that recognises a programming or an encoding language with the possible intent to process or interpret it. An XML parser can for instance be used to validate an XML document, transform it to another format, or process information from the document.

Entity references may be useful in a number of cases:

  • representing character data which cannot easily be keyboarded or which is illegal in XML because some characters are reserved
  • escaping reserved characters in XML
  • providing “boilerplate text,” that is text which is or can be reused in new contexts or applications

An entity reference starts with an ampersand & and closes with a semicolon ;. The entity name is the string between these two symbols. For instance, the entity reference for the “less than” sign < is &lt;. The entity reference for the ampersand is &amp;.

Not all computers necessarily support the Unicode encoding scheme XML works with. Portability of individual characters from the Unicode system, however, is supported by character references that give their decimal or hexadecimal Unicode value. For example, the character ø is represented within an XML document as the Unicode character with hexadecimal value 00F8 and decimal value 0248. For exporting an XML document containing this character, it may be represented with the corresponding character reference &#x00F8; or &#0248; respectively, with the x indicating that what follows is a hexadecimal value. References of this type do not need to be predefined, since the underlying character encoding for XML is always the same.
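The equivalence of the two notations can be checked with a parser. This sketch again assumes Python's standard xml.etree.ElementTree module; it also shows the export direction, where serialising to plain ASCII makes the library write a decimal character reference in place of the character itself.

```python
# Sketch: hexadecimal and decimal character references for the same character.
import xml.etree.ElementTree as ET

# Both references are resolved by the parser to the single character ø.
from_hex = ET.fromstring("<p>&#x00F8;</p>").text
from_decimal = ET.fromstring("<p>&#0248;</p>").text

# Exporting to an ASCII-only encoding turns the character back into a reference.
exported = ET.tostring(ET.fromstring("<p>&#x00F8;</p>"), encoding="us-ascii")

print(from_hex, from_decimal, exported)
```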

For legibility purposes, however, it is also possible to refer to this character by use of a mnemonic name, such as oslash, provided that each such name is mapped to the required Unicode value by means of an ENTITY declaration.

<!ENTITY oslash "&#x00F8;">
Example 3. An entity declaration.

The ENTITY declaration uses a non-XML syntax inherited from SGML and starts with an opening delimiter < followed by an exclamation mark ! signalling that this is a declaration. The keyword ENTITY states that an entity is being declared here. What follows next is the entity name (here the mnemonic name oslash, which is referenced in the document as &oslash;) and then, inside quotation marks, the replacement text for that entity. In this example, the replacement text identifies the character by its hexadecimal value.

The same character can also be declared in the following ways:

<!ENTITY oslash "ø">
<!ENTITY oslash "&#0248;">
Example 4. Alternative (but equivalent) entity declarations.
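A small experiment can confirm that such declarations are interchangeable. The sketch below assumes Python's standard xml.etree.ElementTree parser, which expands entities declared inside the document itself, and it assumes that the replacement text is written either as the literal character or as a complete character reference (with the leading &# and the closing semicolon). The helper expand_oslash and the sample word are hypothetical.

```python
# Sketch: three ways of declaring the entity oslash, all yielding the same character.
import xml.etree.ElementTree as ET


def expand_oslash(replacement_text):
    """Declare oslash with the given replacement text and expand one reference to it."""
    document = (
        '<!DOCTYPE p [<!ENTITY oslash "' + replacement_text + '">]>'
        "<p>S&oslash;ren</p>"
    )
    return ET.fromstring(document).text


# Hexadecimal reference, literal character, decimal reference.
results = [expand_oslash(value) for value in ("&#x00F8;", "\u00f8", "&#0248;")]

print(results)
```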

Character entities must also be used in XML to escape the “less than” sign < and the ampersand & which are illegal because the XML parser could mistake them for markup.

<p>Gimme pepper &amp; salt!</p>
<p>A &lt; B</p>
Example 5. Escaping reserved XML characters with character entities.
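When XML is generated by a program, this escaping is usually left to a library function. As a sketch, assuming Python's standard library, xml.sax.saxutils.escape replaces the reserved characters with their entity references, and a parser turns them back into the original characters; the helper parses is hypothetical.

```python
# Sketch: escaping reserved characters and reading them back.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(xml_text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_sentences = ["Gimme pepper & salt!", "A < B"]

# Replace & and < with &amp; and &lt; before wrapping the text in an element.
escaped = [escape(sentence) for sentence in raw_sentences]

# Parsing the escaped text gives the original sentences back.
round_trip = [ET.fromstring("<p>" + text + "</p>").text for text in escaped]

# Without escaping, the parser rejects the document.
unescaped_ok = [parses("<p>" + sentence + "</p>") for sentence in raw_sentences]

print(escaped, round_trip, unescaped_ok)
```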

Entities can refer not only to single characters but also to strings of text of any length. This way, repetitive keying of repeated information can be avoided (also known as string substitution), and standard expressions or formulae can be kept up to date. The first is useful, for instance, for the expansion of &TBE; to “TEI by Example” before the text is validated. The corresponding ENTITY declaration is as follows:

<!ENTITY TBE "TEI by Example">
Example 6. An entity declaration.

The second is used in contracts, books of laws, etc. in which updating would otherwise mean the complete re-keying of the same (extensive) string of text. For example, the expression: This contract is concluded between &party1; and &party2; for the duration of 10 years starting from &date; in legal texts can be updated simply by changing the value of the ENTITY declarations:

<!ENTITY party1 "Rev Knyff">
<!ENTITY party2 "Lt Rosen">
<!ENTITY date "2010-01-01">
Example 7. More entity declarations.

The substitution of the entities by their values in the given example results in the following expression: This contract is concluded between Rev Knyff and Lt Rosen for the duration of 10 years starting from 2010-01-01.

Note

The term boilerplate text dates back to the early 1900s, referring to the thick, tough steel sheets used to build steam boilers. From the 1890s onwards, printing plates of text for widespread reproduction such as advertisements or syndicated columns were cast or stamped in steel (instead of the much softer and less durable lead alloys used otherwise) ready for the printing press and distributed to newspapers around the United States. They came to be known as “boilerplates.”

ENTITY declarations are placed inside a DOCTYPE declaration which follows the XML declaration at the beginning of the XML document:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE rootelement [
<!ENTITY party1 "Rev Knyff">
<!ENTITY party2 "Lt Rosen">
<!ENTITY date "2010-01-01">
]>
Example 8. A DOCTYPE declaration.

The DOCTYPE declaration starts with the opening delimiter <! which is followed by the keyword DOCTYPE. Next comes the name of the root element of the document. In the case of a TEI document, this will be TEI. The entity declarations which must be interpreted by the XML processor are put inside square brackets. An XML parser encountering this DOCTYPE declaration will expand the entity references in the document with the values given in the ENTITY declarations before the document itself is validated.
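The expansion step can be observed with a small script. The sketch below assumes Python's standard xml.etree.ElementTree parser, which is non-validating but does expand entities declared inside the DOCTYPE declaration; the root element name contract is a hypothetical choice made for this example.

```python
# Sketch: a parser expanding boilerplate entities declared in the DOCTYPE.
import xml.etree.ElementTree as ET

contract = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE contract [
<!ENTITY party1 "Rev Knyff">
<!ENTITY party2 "Lt Rosen">
<!ENTITY date "2010-01-01">
]>
<contract>This contract is concluded between &party1; and &party2; for the duration of 10 years starting from &date;.</contract>"""

# By the time the text is available, the references have been replaced.
expanded_text = ET.fromstring(contract).text

print(expanded_text)
```

Changing a value in one of the ENTITY declarations and parsing again is enough to update every place where the entity is referenced.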

4.2.4. (P)CDATA

All text in an XML document will normally be parsed by a parser. When an XML element is parsed, the text between the XML tags is also parsed. The parser does that because XML elements can nest as in the following example:

<paragraph number="1">Paragraph one of
<title>an XML example</title>
.</paragraph>
Example 9. Nesting XML elements.

The XML parser will break this string up into an element <paragraph> with a subelement <title>. Text data that will be parsed by an XML parser is called parsed character data or PCDATA.
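What “breaking up” means in practice can be shown with a parser. In this sketch, which assumes Python's standard xml.etree.ElementTree module, the text before the nested element, the text inside it, and the text after it are stored separately; the attribute names text and tail belong to that library, not to XML itself.

```python
# Sketch: how a parser splits mixed content into elements and text.
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    '<paragraph number="1">Paragraph one of\n'
    "<title>an XML example</title>\n"
    ".</paragraph>"
)
title = paragraph.find("title")

text_before_title = paragraph.text  # text up to the start tag of title
title_text = title.text             # text inside the subelement
text_after_title = title.tail       # text between </title> and </paragraph>

# All character data of the paragraph, in document order.
all_text = "".join(paragraph.itertext())

print(repr(text_before_title), repr(title_text), repr(text_after_title))
```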

An XML document often contains data which need not be parsed by an XML parser. For instance, characters like < and & are illegal in the text of XML elements because the parser will interpret them as the beginning of a new element or of an entity reference, which will result in an error message. Therefore, these characters can be escaped by the use of the entity references &lt; and &amp;. When programming or scripting code (such as JavaScript, which may contain many occurrences of < and &) is included in an XML document, escaping every occurrence is cumbersome, and the code should not be parsed by the XML parser at all. We can achieve this by treating it as unparsed character data or CDATA in the document:

<script><![CDATA[
function matchwo(a, b) {
  if (a < b && a < 0) {
    return 1;
  }
  else {
    return 0;
  }
}
]]></script>
Example 10. Escaping a block of text as a CDATA section.

A CDATA section starts with <![CDATA[ and ends with ]]>. Everything inside a CDATA section is treated as plain text by the parser.
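The effect of a CDATA section can be compared with entity escaping in a short experiment. The sketch below assumes Python's standard xml.etree.ElementTree parser and uses a shortened, hypothetical script fragment; both forms deliver the same plain text to the application, while the unescaped form is rejected.

```python
# Sketch: CDATA sections and entity references yield the same text.
import xml.etree.ElementTree as ET

with_cdata = "<script><![CDATA[if (a < b && a < 0) { return 1; }]]></script>"
with_entities = "<script>if (a &lt; b &amp;&amp; a &lt; 0) { return 1; }</script>"
without_escaping = "<script>if (a < b && a < 0) { return 1; }</script>"


def parses(xml_text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


cdata_text = ET.fromstring(with_cdata).text
entity_text = ET.fromstring(with_entities).text

print(cdata_text == entity_text, parses(without_escaping))
```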

4.3. Using XML

Depending on the nature of your XML documents and what you want to use them for, you will need different tools, ranging from freely available open source tools to expensive industrial software. In principle, the simplest plain text editor suffices to author or edit XML. In order to validate or transform XML, additional tools will be needed, which often come included in dedicated XML editors: a validating parser, an XSLT processor, a tree-structure viewer, etc. For publishing purposes, XML documents may be transformed to other XML formats, HTML, or PDF (to name just a few of the possibilities) using XSLT and XSL-FO scripts which are processed by an off-the-shelf or custom-made XSL processor. These published documents can be viewed in generic web browsers or PDF viewers. XML documents can further be indexed, excerpted, queried, and analysed with tools specifically designed for the job.

Bibliography

  • Barnard, David T., Cheryl A. Fraser, and George M. Logan. 1988. “Generalized Markup for Literary Texts.” Literary and Linguistic Computing 3 (1): 26–31. 10.1093/llc/3.1.26.
  • Barnard, David T., Ron Hayter, Maria Karababa, George M. Logan, and John McFadden. 1988. “SGML-Based Markup for Literary Texts: Two Problems and Some Solutions.” Computers and the Humanities 22 (4): 265–276.
  • Berkowitz, Luci, Karl A. Squitier, and William H. A. Johnson. 1986. Thesaurus Linguae Graecae, Canon of Greek Authors and Works. New York/Oxford: Oxford University Press.
  • Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998. https://www.w3.org/TR/1998/REC-xml-19980210 (accessed September 2008).
  • Burnard, Lou. 1988. “Report of Workshop on Text Encoding Guidelines.” Literary and Linguistic Computing 3 (2): 131–133. 10.1093/llc/3.2.131.
  • Burnard, Lou, and C. M. Sperberg-McQueen. 2006. “TEI Lite: Encoding for Interchange: an introduction to the TEI Revised for TEI P5 release.” February 2006 https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html.
  • DeRose, Steven J. 1999. “XML and the TEI.” Computers and the Humanities 33 (1–2): 11–30.
  • Goldfarb, Charles F. 1990. The SGML Handbook. Oxford: Clarendon Press.
  • Hockey, Susan. 1980. Oxford Concordance Program Users’ Manual. Oxford: Oxford University Computing Service.
  • Ide, Nancy M., and C. M. Sperberg-McQueen. 1988. “Development of a Standard for Encoding Literary and Linguistic Materials.” In Cologne Computer Conference 1988. Uses of the Computer in the Humanities and Social Sciences. Volume of Abstracts. Cologne, Germany, Sept 7–10 1988, p. E.6-3-4.
  • ———. 1995. “The TEI: History, Goals, and Future.” Computers and the Humanities 29 (1): 5–15.
  • Kay, Martin. 1967. “Standards for Encoding Data in a Natural Language.” Computers and the Humanities 1 (5): 170–177.
  • Lancashire, Ian, John Bradley, Willard McCarty, Michael Stairs, and Terence Russon Woolridge. 1996. Using TACT with Electronic Texts. New York: Modern Language Association of America.
  • Russel, D. B. 1967. COCOA: A Word Count and Concordance Generator for Atlas. Chilton: Atlas Computer Laboratory.
  • Sperberg-McQueen, C. M. 1991. “Text in the Electronic Age: Textual Study and Text Encoding with examples from Medieval Texts.” Literary and Linguistic Computing 6 (1): 34–46. 10.1093/llc/6.1.34.
  • Sperberg-McQueen, C. M., and Lou Burnard (eds.). 1990. TEI P1: Guidelines for the Encoding and Interchange of Machine Readable Texts. Chicago/Oxford: ACH-ALLC-ACL Text Encoding Initiative. https://tei-c.org/Vault/Vault-GL.html (accessed October 2008).
  • ———. 1993. TEI P2 Guidelines for the Encoding and Interchange of Machine Readable Texts Draft P2 (published serially 1992–1993); Draft Version 2 of April 1993: 19 chapters. https://tei-c.org/Vault/Vault-GL.html (accessed October 2008).
  • ———. 1994. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
  • ———. 1999. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Revised reprint. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
  • ———. 2002. TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML-compatible edition. XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium. https://tei-c.org/Vault/P4/doc/html/ (accessed October 2008).
  • TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Oxford, Providence, Charlottesville, Nancy: TEI Consortium. https://tei-c.org/Vault/P5/1.0.0/doc/tei-p5-doc/en/html/ (accessed October 2008).