Licensed under a Creative Commons Attribution ShareAlike 3.0 License
TEI By Example offers a series of freely available online tutorials walking individuals through the different stages in marking up a document in TEI (Text Encoding Initiative). Besides a general introduction to text encoding, step-by-step tutorial modules provide example-based introductions to eight different aspects of electronic text markup for the humanities. Each tutorial module is accompanied with a dedicated examples section, illustrating actual TEI encoding practise with real-life examples. The theory of the tutorial modules can be tested in interactive tests and exercises.
XML is a metalanguage by which one can create separate markup languages for separate purposes. It is platform-, software-, and system-independent and no one 'owns' XML, although specific XML markup languages can be owned by its creators. Generally speaking, XML empowers the content provider and facilitates data integration, exchange, maintenance, and extraction. XML is currently the de facto standard on the World Wide Web partly because HTML (Hypertext Markup Language) was rephrased as an XML encoding language. XML is edited and managed by the W3C which also published the specification as a recommendation in 1998.
The big selling point of XML is that it is text based. This means that each XML encoding language is entirely built up in ASCII (American Standard Code for Information Interchange), or plain text, and can be created and edited using a simple text-editor like Notepad or its equivalents on other platforms. However, when you start working with XML, you will soon find that it is better to edit XML documents using a professional XML editor. While plain text-editors don't know that you're writing TEI, XML editors will help you to write error-free XML documents, validate your XML against a DTD or a schema, force you to stick to a valid XML structure, and enable you to perform transformations.
Since ASCII only provides for characters commonly found in the English language, different character encoding systems have been designed such as Isolat-1 (ISO-8859-1) for Western languages and Unicode (UTF-8 and UTF-16). By using these character encoding systems, non-ASCII characters such as French é, à, ç, Norwegian æ ø å, or Hebrew ק can be used in XML documents. These systems rely on ASCII notation for the expression of these non-ASCII characters. The French à, for instance, is represented by the string 'agrave' in Isolat-1 and by the number '00E0' in Unicode.
Any XML encoding language consists of five components.
For example, a simple two-paragraph document could be encoded as follows in XML:
An XML document is introduced by the XML declaration.
The question mark
? in the XML declaration signals that this is a processing instruction. The following bits state that what follows is XML which complies with version 1.0 of the recommendation and the used character encoding is UTF-8 or Unicode.
The two-paragraph document above is an example of an XML document, representing both information and
meta-information. Information (plain text) is contained in XML elements,
delimited by start tags (e.g.
<!--) and end markers (
Entity references are predefined strings of data that a parser must resolve before parsing the XML document.
An entity reference starts with an ampersand
& and closes with a semicolon
;. The entity name is the string between these two symbols. For instance, the entity reference for the less than sign
< the entity reference for the ampersand is
Not all computers support the Unicode encoding scheme XML works with. Portability of individual characters from the Unicode system, however, is supported by entity references that refer to their numeric or hexadecimal notation. For example, the character ø is represented within an XML document as the Unicode character with hexadecimal value
00F8 and decimal value
0248. For exporting an XML document containing this character, it may be represented by the character (or entity) reference
ø respectively, with the 'x' indicating that what follows is a hexadecimal value. References of this type do not need to be predefined, since the underlying character encoding for XML is always the same.
For legibility purposes, however, it is also possible to refer to this character by use of a mnemonic name, such as
oslash., provided that each such name is mapped to the required Unicode value by means of an ENTITY declaration.
The ENTITY declaration uses a non-XML syntax inherited from SGML and starts with an opening delimiter
< followed by an exclamation mark
! signalling that this is a declaration. The keyword
ENTITY names that an entity is being declared here. What follows next is the entity name - here the mnemonic name
ø - for which a declaration is given and the declaration itself inside quotation marks. In this example, it is the hexadecimal value of the character.
The same character can also be declared in the following ways.
Character entities must also be used in XML to escape the less than sign
< and the ampersand
& which are illegal because the XML parser could mistake them for markup.
Gimme pepper & salt!
A < B
Entities are not only capable to refer to character declarations but can also refer to strings of text with an unlimited extent. This way repetitive keying of repeated information can be avoided (aka string substitution), or standard expressions or formulae can be kept up to date. The first is useful, for instance, for the expansion of &TBE; to "TEI by Example" before the test is validated. This corresponding ENTITY declaration is as follows:
The second is used in contracts, books of laws etc. in which updating would otherwise mean the complete rekeying of the same (extensive) string of text. For example, the expression "This contract is concluded between &party1; and &party2; for the duration of 10 years starting from &date;" in legal texts can be updated simply by changing the value of the ENTITY declarations:
The substitution of the entities by their values in the given example results in the following expression "This contract is concluded between Rev Knyff and Lt Rosen for the duration of 10 years starting from 2007-01-01"
ENTITY declarations are placed inside a DOCTYPE declaration which follows the XML declaration at the beginning of the XML document.
Since the DOCTYPE declaration is a processing instruction, it starts with the opening delimiter
<! which is followed by the keyword
DOCTYPE. Next part is the name of the root element of the document. In the case of a TEI document, this will be
TEI. The entity references which must be interpreted by the XML processor are put inside square brackets. An XML parser encountering this DOCTYPE declaration will expand the entities with the values given in the ENTITY declaration before the document itself is validated.
All text in an XML document will normally be parsed by a parser. When an XML element is parsed, the text between the XML tags is also parsed. The parser does that because XML elements can nest as in the following example:
The XML parser will break this string up into an element
An XML document often contains data which need not be parsed by an XML parser. For instance, characters like
& are illegal in XML elements because the parser will interpret them as the beginning of new elements or the beginning of an entity reference which will result in an error message. Therefore, these characters can be escaped by the use of the entity references
&; is included in an XML document, it should not be parsed by the XML parser. We can avoid this by treating it as unparsed character data or CDATA in the document:
A CDATA section starts with
<![CDATA[ and ends with
]]>. Everything inside a CDATA section is ignored by the parser.
Depending on the nature of your XML documents and what you want to use them for, you will need different tools, ranging from freely available open source tools to highly priced industrial software. In principle, the simplest plain text editor suffices to author or edit XML. In order to validate or transform XML, additional tools will be needed which often come included in dedicated XML editors: validating parser, XSLT processor, tree-structure viewer etc. For publishing purposes, XML documents may be transformed to other XML formats, HTML or PDF - to name just a few of the possibilities - using XSLT and XSLFO scripts which are processed by an off the shelf or custom made XSL processor. These published documents can be viewed in generic web browsers or PDF viewers where it considers transformations to HTML or PDF. XML documents can further be indexed, excerpted, questioned and analysed with tools specifically designed for the job.