Licensed under a Creative Commons Attribution ShareAlike 3.0 License
TEI By Example offers a series of freely available online tutorials walking individuals through the different stages of marking up a document in TEI (Text Encoding Initiative). Besides a general introduction to text encoding, step-by-step tutorial modules provide example-based introductions to eight different aspects of electronic text markup for the humanities. Each tutorial module is accompanied by a dedicated examples section, illustrating actual TEI encoding practice with real-life examples. The theory of the tutorial modules can be tested in interactive tests and exercises.
Computers can only process texts whose characters are represented by a system that maps onto the binary system computers can interpret. This is called character encoding. One such character encoding scheme based on the English alphabet is ASCII (American Standard Code for Information Interchange). Character encoding facilitates the storage of text in computers and the transmission of text through telecommunication networks. Character encoding, however, does not say anything about the semantics, interpretation, or structure of a text. Such information on a text is called meta-information. If we want to add any meta-information to a text so that it can be processed by computers, we need to encode or mark up the text. We can do this by inserting natural language expressions (or codes representing them) in the text with the same character encoding the text is using, but separated from the text by specific markers. One such expression is called a tag. All of the tags used to encode a text together constitute a markup language. The application of a markup language to a text is called text encoding.
The Text Encoding Initiative (TEI) is a standard for the representation of textual material in digital form through the means of text encoding. This standard is the collaborative product of a community of scholars, chiefly from the humanities, social sciences, and linguistics, who are organized in the TEI Consortium (TEI-C, http://www.tei-c.org). The TEI Consortium is a non-profit membership organization and governs a wide variety of activities such as the development, publication, and maintenance of the text encoding standard documented in the TEI Guidelines, the discussion and development of the standard on the TEI mailing list (TEI-L) and in Special Interest Groups (SIGs), the gathering of the TEI community at yearly Members' Meetings, and the promotion of the standard in publications and at workshops, training courses, colloquia, and conferences. These activities are generally open to non-members as well.
By ‘TEI Guidelines’ one may refer both to the markup language and tag set proposed by the TEI Consortium and to its documentation online or in print. Informally, ‘TEI Guidelines’ is often abbreviated to ‘TEI’. In this article ‘TEI Guidelines’ is used as the general term for the encoding standard. The TEI Guidelines are widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. Since the TEI is expressed in terms of the eXtensible Markup Language (XML), and since it provides procedures and mechanisms for adapting it to one's own project needs, the TEI Guidelines define an open standard that is generally applicable to any text and purpose.
This introductory module first presents the concepts of text encoding and markup languages in the humanities and then introduces the TEI encoding principles. Next, the article provides a brief historical survey of the TEI Guidelines and ends with a presentation of the Consortium's organization.
Since the earliest uses of computers and computational techniques in the
humanities at the end of the 1940s, scholars, projects, and research groups
had to look for systems that could provide representations of data which the
computer could process. Computers, as Michael Sperberg-McQueen has reminded
us, are binary machines that ‘can contain and operate on patterns of
electronic charges, but they cannot contain numbers, which are abstract
mathematical objects not electronic charges, nor texts, which are complex,
abstract cultural and linguistic objects.’
Although several projects were able to produce meaningful scholarly results with such internally consistent, project-specific encoding schemes, the particular nature of each set of conventions had many disadvantages. Texts prepared in a proprietary scheme by one project could not readily be used by other projects; software developed for the analysis of such texts could hence not be used outside the project, due to incompatible encoding schemes and the non-standardization of hardware. However, as more and more texts were prepared in machine-readable form, the call for an economic use of resources grew as well. Already in 1967, Michael Kay argued in favour of a ‘standard code in which any text received from an outside source can be assumed to be.’
When human beings read texts, they perceive both the information stored in the linguistic code of the text and the meta-information which is inferred from the appearance and interpretation of the text. By convention, italics are, for instance, used as a code signalling a title of a book, play, or movie; a foreign word or phrase; or emphatic use of the language. Through their cognitive abilities, readers usually have no problem selecting the most appropriate interpretation of an italic string of text. Computers, however, need to be informed about these issues in order to be able to process them. This can be done by way of a markup language that provides rules to formally separate information (the text in a document) from meta-information (information about the text in a document). Whereas the markup languages in use in the typesetting community were mainly of a procedural nature, that is, they indicate procedures that a particular application should follow (e.g. printing a string of text in italics), the humanities were also, and mainly, concerned with descriptive markup that identifies the entity type of tokens (e.g. identifying a string of text as the title of a book or as a foreign word). Unlike procedural or presentational markup, descriptive markup establishes a one-to-one mapping between logical elements in the text and their markup.
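For instance, the difference might be sketched as follows (the tag names here are invented for the example):

<!-- procedural markup: instructs an application to render the string in italics -->
<italic>Hamlet</italic>
<!-- descriptive markup: identifies what the string is -->
<title>Hamlet</title>
<foreign>Zeitgeist</foreign>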
Some sort of standardization of markup for the encoding and analysis of literary texts was reached with the COCOA encoding scheme, originally developed for the COCOA program in the 1960s and 1970s.
The call for a markup language that could guarantee reusability,
interchange, system- and software-independence, portability and
collaboration in the humanities was answered by the publication of the
Standard Generalized Markup Language (SGML) as an ISO standard in 1986.
In order to achieve universal exchangeability and software and platform independence, SGML made exclusive use of ASCII codes. As mentioned above, SGML is not a markup language itself, but a metalanguage by which one can create separate markup languages for separate purposes. This means that SGML defines the rules and procedures to specify the vocabulary and the syntax of a markup language in a formal Document Type Definition (DTD). Such a DTD is a formal description of, for instance, names for all elements, names and default values for their attributes, rules about how elements can nest and how often they can occur, and names for re-usable pieces of data (entities). The DTD enables full control, parsing, and validation of SGML-encoded documents. By far the most popular SGML DTD is the Hypertext Markup Language (HTML), developed for the exchange of graphical documents over the internet.
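For instance, a DTD fragment of this kind might read (all names are invented for the example):

<!-- element declarations: names, nesting rules, and how often elements may occur -->
<!ELEMENT poem (title?, stanza+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>
<!-- attribute declaration with a default value -->
<!ATTLIST poem type CDATA "sonnet">
<!-- a named, re-usable piece of data (entity) -->
<!ENTITY tei "Text Encoding Initiative">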
A markup scheme with all these qualities was exactly what the humanities
were looking for in their quest for a descriptive encoding standard for
the preparation and interchange of electronic texts for scholarly
research. There was a strong consensus among the computing humanists
that SGML offered a better foundation for research-oriented text encoding
than other such schemes.
A further milestone was the publication of the eXtensible Markup Language (XML) 1.0 as a W3C recommendation in 1998.
Because of its advantages and widespread popularity, XML became the metalanguage of choice for expressing the rules for descriptive text encoding in TEI.
XML is a metalanguage by which one can create separate markup languages for separate purposes. It is platform-, software-, and system-independent, and no one 'owns' XML, although specific XML markup languages can be owned by their creators. Generally speaking, XML empowers the content provider and facilitates data integration, exchange, maintenance, and extraction. XML is currently the de facto standard on the World Wide Web, partly because HTML (Hypertext Markup Language) was rephrased as an XML encoding language. XML is edited and managed by the W3C, which published the specification as a recommendation in 1998.
The big selling point of XML is that it is text-based. This means that each XML encoding language is entirely built up from ASCII (American Standard Code for Information Interchange), or plain text, and can be created and edited using a simple text editor like Notepad or its equivalents on other platforms. However, when you start working with XML, you will soon find that it is better to edit XML documents using a professional XML editor. While plain text editors don't know that you're writing TEI, XML editors will help you write error-free XML documents, validate your XML against a DTD or schema, force you to stick to a valid XML structure, and enable you to perform transformations.
Since ASCII only provides for characters commonly found in the English language, different character encoding systems have been designed such as Isolat-1 (ISO-8859-1) for Western languages and Unicode (UTF-8 and UTF-16). By using these character encoding systems, non-ASCII characters such as French é, à, ç, Norwegian æ ø å, or Hebrew ק can be used in XML documents. These systems rely on ASCII notation for the expression of these non-ASCII characters. The French à, for instance, is represented by the string 'agrave' in Isolat-1 and by the number '00E0' in Unicode.
Any XML encoding language consists of five components.
For example, a simple two-paragraph document could be encoded as follows in XML:
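<?xml version="1.0" encoding="UTF-8"?>
<!-- a sketch: the element names document and p are chosen for illustration -->
<document>
  <p>This is the first paragraph.</p>
  <p>This is the second paragraph.</p>
</document>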
An XML document is introduced by the XML declaration, here <?xml version="1.0" encoding="UTF-8"?>. The question mark ? in the XML declaration signals that this is a processing instruction. The following bits state that what follows is XML which complies with version 1.0 of the recommendation, and that the character encoding used is UTF-8 (Unicode).
The two-paragraph document above is an example of an XML document, representing both information and meta-information. Information (plain text) is contained in XML elements, delimited by start tags (e.g. <p>) and end tags (e.g. </p>). Comments, which are ignored by processing software, are delimited by start markers (<!--) and end markers (-->).
Entity references are predefined strings of data that a parser must resolve before parsing the XML document. An entity reference starts with an ampersand (&) and closes with a semicolon (;); the entity name is the string between these two symbols. For instance, the entity reference for the less-than sign (<) is &lt;, and the entity reference for the ampersand (&) is &amp;.
Not all computers support the Unicode encoding scheme XML works with. Portability of individual characters from the Unicode system, however, is supported by entity references that refer to their numeric (hexadecimal or decimal) notation. For example, the character ø is represented within an XML document as the Unicode character with hexadecimal value 00F8 and decimal value 0248. For exporting an XML document containing this character, it may be represented by the character (or entity) reference &#x00F8; or &#0248; respectively, with the 'x' indicating that what follows is a hexadecimal value. References of this type do not need to be predefined, since the underlying character encoding for XML is always the same.
For legibility purposes, however, it is also possible to refer to this character by use of a mnemonic name, such as &oslash;, provided that each such name is mapped to the required Unicode value by means of an ENTITY declaration.
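For example, the following declaration maps the mnemonic name oslash to the required value:

<!ENTITY oslash "&#x00F8;">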
The ENTITY declaration uses a non-XML syntax inherited from SGML and starts with an opening delimiter < followed by an exclamation mark ! signalling that this is a declaration. The keyword ENTITY indicates that an entity is being declared here. What follows next is the entity name - here the mnemonic name oslash - for which a declaration is given, and then the declaration itself inside quotation marks. In this example, it is the hexadecimal value of the character.
The same character can also be declared in the following ways, for instance with its decimal character reference or with the literal character itself:
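<!ENTITY oslash "&#0248;">
<!ENTITY oslash "ø">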
Character entities must also be used in XML to escape the less-than sign (<) and the ampersand (&), which are illegal in text content because the XML parser could mistake them for markup. Consider the following strings:
Gimme pepper & salt!
A < B
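Encoded in XML, these strings would read:

Gimme pepper &amp; salt!
A &lt; B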
Entities are not only capable of referring to character declarations but can also refer to strings of text of unlimited extent. This way, repetitive keying of repeated information can be avoided (string substitution), or standard expressions or formulae can be kept up to date. The first is useful, for instance, for the expansion of &TBE; to "TEI by Example" before the text is validated. The corresponding ENTITY declaration would look like this:
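<!ENTITY TBE "TEI by Example">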
The second is used in contracts, books of laws, etc., in which updating would otherwise mean the complete rekeying of the same (extensive) string of text. For example, the expression "This contract is concluded between &party1; and &party2; for the duration of 10 years starting from &date;" in legal texts can be updated simply by changing the values of the ENTITY declarations:
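<!ENTITY party1 "Rev Knyff">
<!ENTITY party2 "Lt Rosen">
<!ENTITY date "2007-01-01">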
The substitution of the entities by their values in the given example results in the following expression: "This contract is concluded between Rev Knyff and Lt Rosen for the duration of 10 years starting from 2007-01-01".
ENTITY declarations are placed inside a DOCTYPE declaration, which follows the XML declaration at the beginning of the XML document, for example:
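<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI [
<!-- the ENTITY declarations appear between the square brackets -->
<!ENTITY TBE "TEI by Example">
]>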
The DOCTYPE declaration starts with the opening delimiter <! which is followed by the keyword DOCTYPE. The next part is the name of the root element of the document; in the case of a TEI document, this will be TEI. The entity declarations which must be interpreted by the XML processor are put inside square brackets. An XML parser encountering this DOCTYPE declaration will expand the entities with the values given in the ENTITY declarations before the document itself is validated.
All text in an XML document will normally be parsed by a parser. When an XML element is parsed, the text between the XML tags is also parsed. The parser does that because XML elements can nest as in the following example:
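<!-- element names chosen for illustration -->
<p>This is a paragraph with a <hi>highlighted</hi> phrase.</p>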
The XML parser will break this string up into an element p which contains both plain text and a nested element hi, each of which is parsed in turn.
An XML document often contains data which need not, or should not, be parsed by an XML parser. For instance, characters like < and & are illegal inside XML elements because the parser will interpret them as the beginning of a new element or of an entity reference, which will result in an error message. These characters can be escaped by the use of the entity references &lt; and &amp;. When a whole passage containing such characters is included in an XML document, however, it should not be parsed by the XML parser at all. We can avoid this by treating it as unparsed character data, or CDATA, in the document:
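<!-- the element name 'example' is chosen for illustration -->
<example><![CDATA[ A < B & B < C ]]></example>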
A CDATA section starts with
<![CDATA[ and ends with
]]>. Everything inside a CDATA section is ignored by the parser.
Depending on the nature of your XML documents and what you want to use them for, you will need different tools, ranging from freely available open-source tools to highly priced industrial software. In principle, the simplest plain text editor suffices to author or edit XML. In order to validate or transform XML, additional tools are needed, which often come included in dedicated XML editors: a validating parser, an XSLT processor, a tree-structure viewer, etc. For publishing purposes, XML documents may be transformed to other XML formats, HTML, or PDF - to name just a few of the possibilities - using XSLT and XSL-FO scripts which are processed by an off-the-shelf or custom-made XSL processor. In the case of transformations to HTML or PDF, the published documents can be viewed in generic web browsers or PDF viewers. XML documents can further be indexed, excerpted, queried, and analysed with tools specifically designed for the job.
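For instance, a minimal XSLT sketch along the following lines (assuming TEI input in the TEI namespace; it handles only paragraph elements) would copy the content of every TEI p element into an HTML p element:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- turn every TEI paragraph into an HTML paragraph -->
  <xsl:template match="tei:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>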
The conclusions and the work of the TEI community are formulated as
guidelines, rules, and recommendations rather than standards, because it
is acknowledged that each scholar must have the freedom of expressing
their own theory of text by encoding the features they think important
in the text. A wide array of possible solutions to encoding matters is
demonstrated in the TEI Guidelines which therefore should be considered
a reference manual rather than a tutorial. Mastering the complete TEI
encoding scheme implies a steep learning curve, but few projects require
a complete knowledge of the TEI. Therefore, a manageable subset of the
full TEI encoding scheme was published as TEI Lite.
A significant part of the rules in the TEI Guidelines apply to the expression of descriptive and structural meta-information about the text. Yet, the TEI defines concepts to represent a much wider array of textual phenomena, amounting to a total of 503 elements and 210 attributes. These are organized into 21 modules, grouping related elements and attributes.
Each of these modules and the use of the elements they define are discussed extensively in a dedicated chapter of the TEI Guidelines.
Among other, more technical advantages, Steven DeRose pointed out substantial benefits of XML to the TEI community: by allowing for more flexible automatic parsing strategies and easy delivery of electronic documents with cheap, ubiquitous tools such as web browsers, XML could spread the notion of descriptive markup to a wide audience that would thus become acquainted with the concepts articulated in the TEI Guidelines.
In order to use TEI for the encoding of texts, users must make sure that their texts belong to the TEI
namespace (http://www.tei-c.org/ns/1.0) and adhere to the requirements of
the text model proposed by the TEI. In order to facilitate this conformance,
it is possible (and strongly suggested) to associate TEI texts with formal
representations of this text model. These formal
structural grammars of a TEI compatible model of the text can be
expressed in a number of ways, commonly referred to as a TEI schema.
Technically, a TEI schema can be expressed in a variety of formal languages
such as Document Type Definition (http://www.w3.org/TR/REC-xml/#dt-doctype), W3C
XML Schema (http://www.w3.org/XML/Schema), or the RELAX
NG schema language (http://www.relaxng.org/). It is important to note that no such
thing as 'the TEI schema' exists. Rather, users are expected to select their
desired TEI elements and attributes from the TEI modules, possibly with
alterations or extensions where required. In this way, TEI offers a stable
base with unambiguous means for the representation of basic textual
phenomena, while providing standardized mechanisms for user customization
for uncovered features. It is a particular feature of TEI that these
abstract text models themselves can be expressed as TEI texts, using the
documentation elements defined in the dedicated module Documentation Elements. A minimal TEI customization file, such as one generated by the form at http://www.tei-c.org/Roma/, records its own documentation (for instance, that it is 'for use by whoever wants it' and was 'created on Thursday 24th July 2008 10:20:17 AM') together with a schema specification stating that 'My TEI Customization starts with modules tei, core, header, and textstructure'.
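Sketched in full, such a file might look as follows (the schema identifier myTEI is invented for this example):

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>My TEI Customization</title>
      </titleStmt>
      <publicationStmt>
        <p>for use by whoever wants it</p>
      </publicationStmt>
      <sourceDesc>
        <p>created on Thursday 24th July 2008 10:20:17 AM by the form at http://www.tei-c.org/Roma/</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>My TEI Customization starts with modules tei, core, header, and textstructure</p>
      <schemaSpec ident="myTEI">
        <moduleRef key="tei"/>
        <moduleRef key="core"/>
        <moduleRef key="header"/>
        <moduleRef key="textstructure"/>
      </schemaSpec>
    </body>
  </text>
</TEI>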
Besides the common minimal TEI structure, this customization selects four modules: the tei infrastructure module, the core module with all common TEI elements, the header module defining all teiHeader elements, and the textstructure module defining the elements representing the minimal structure of TEI texts.
In the vein of literate programming (http://www.literateprogramming.com/), a TEI customisation file not only contains the formal declaration of TEI elements but can also contain prose documentation of the customization. Such files are therefore called ODD files (One Document Does it all), because they serve as a single source for the derivation of both formal schemas and human-readable documentation.
In order to accommodate the process of creating customised TEI schemas and prose documentation, the TEI has developed a dedicated piece of software called Roma (http://www.tei-c.org/Roma/). This is a dedicated ODD processor, offering an intuitive web-based interface for the creation and basic editing of ODD files, and for the generation of corresponding TEI schemas and prose documentation in a number of presentation formats.
A TEI schema, stating all structural conditions and constraints for the elements and attributes in TEI texts, can then be used to automatically validate actual TEI documents with an XML parser. Consider, for example, the following fragments:
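(The two fragments are sketched here as plausible examples; the essential difference is that [A] contains the mandatory teiHeader, while [B] omits it.)

[A]
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample document</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text.</p>
    </body>
  </text>
</TEI>

[B]
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>Some text.</p>
    </body>
  </text>
</TEI>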
When validated against a TEI schema derived from the previous ODD file, file [A] will be recognised as a valid TEI document, while file [B] will not, since it lacks the mandatory teiHeader element.
After the concise overview of the most recent version of TEI (P5) in the preceding section, this section explains the historical development of the TEI Guidelines.
Shortly after the publication of the SGML specification as an ISO
Standard, a diverse group of 32 humanities computing scholars gathered
at Vassar College in Poughkeepsie, New York in a two-day meeting (11
& 12 November 1987) called for by the Association for Computers and
the Humanities (ACH http://www.ach.org), funded by the National Endowment for the
Humanities (NEH), and convened by Nancy Ide and Michael Sperberg-McQueen.
The main topic of the meeting was the question whether and how
an encoding standard for machine-readable texts intended for scholarly
research should be developed. Amongst the delegates were representatives
from the main European text archives and from important North American
academic and commercial research centres. Contrary to the disappointing
outcomes of other such meetings in San Diego in 1977 or in Pisa in 1980,
this meeting did reach its goal with the formulation and the agreement
on the following set of methodological principles – the so called
Poughkeepsie Principles – for the preparation of text encoding
guidelines for literary, linguistic, and historical research.
For the implementation of these principles the ACH was joined by the
Association for Literary and Linguistic Computing (ALLC http://www.allc.org) and the
Association for Computational Linguistics (ACL http://www.aclweb.org/).
Together they established the Text Encoding Initiative (TEI) whose
mission it was to develop the
Principles into workable text-encoding guidelines. The Text
Encoding Initiative very soon came to adopt SGML, published a year
before as ISO standard, as its framework. Initial funding was provided
by the US National Endowment for the Humanities, Directorate General
XIII of the Commission of the European Communities, the Canadian Social
Science and Humanities Research Council, and the Andrew W. Mellon Foundation.
From the Poughkeepsie Principles the TEI derived a set of design goals which the TEI Guidelines should meet.
A Steering Committee consisting of representatives of the ACH, the ACL, and the ALLC appointed Michael Sperberg-McQueen as editor-in-chief and Lou Burnard as European editor of the Guidelines.
The first public proposal for the TEI Guidelines was published in July
1990 under the title
Guidelines for the Encoding and
Interchange of Machine-Readable Texts with the TEI document
number TEI P1 (for Proposal 1). This version was reprinted, with minor
changes and corrections, as version 1.1 in November 1990.
The following step was the publication of the TEI P3 Guidelines for Electronic Text Encoding and Interchange in 1994.
Recognizing the benefits for the TEI community, the P4 revision of the
TEI Guidelines was published in 2002 by the newly formed TEI Consortium
in order to provide equal support for XML and SGML applications using
the TEI scheme.
In 2003 the TEI Consortium asked its membership to convene Special
Interest Groups (SIGs) whose aim would be to advise on the revision of
certain chapters of the Guidelines and to suggest changes and improvements
in view of P5. With the establishment of the new TEI Council, which
superintends the technical work of the TEI Consortium, it became
possible to agree on an agenda to enhance and modify the Guidelines more
fundamentally, which resulted in a full revision of the Guidelines
published as TEI P5.
The TEI Consortium was established in 2000 as a not-for-profit membership organization to sustain and develop the Text Encoding Initiative (TEI). The Consortium is supported by a number of host institutions. It is managed by a Board of Directors, and its technical work is overseen by an elected technical Council which takes responsibility for the content of the TEI Guidelines.
The TEI charter outlines the consortium’s goals and fundamental principles. Its goals are:
The Consortium honours four fundamental principles:
Involvement in the Consortium is possible in three categories: voting membership, which is open to individuals, institutions, or projects; non-voting subscription, which is open to individuals only; and sponsorship, which is open to individual or corporate sponsors. Only members have the right to vote on Consortium issues and in elections to the Board and the Council; they have access to a restricted website with pre-release drafts of Consortium working documents and technical reports, announcements and news, and a database of Members, Sponsors, and Subscribers with contact information; and they benefit from discounts on training, consulting, and certification. The Consortium members meet annually at a Members' Meeting where current critical issues in text encoding are discussed, and where members of the Council and of the Board of Directors are elected. The membership fee payable varies with the kind of project or institution and with its location, according to where the economy of the member's country falls in the four-part classification of Low, Lower-Middle, Upper-Middle, and High Income Economies defined by the World Bank.
Computers can only deal with explicit data. The function of markup is to represent textual material in digital form through the explicating act of text encoding. Descriptive markup reveals what the encoder thinks to be implicit or hidden aspects of a text, and is thus an interpretive medium which often documents scholarly research alongside structural information about the text. In order for this research to be exchangeable, analyzable, re-usable, and preservable, texts in the field of the humanities should be encoded according to a standard which defines a common vocabulary, grammar, and syntax, whilst leaving the implementation of the standard up to the encoder. A result of communal efforts among computing humanists, the Text Encoding Initiative documents such a standard in the TEI Guidelines. These Guidelines are fully adaptable and customizable to one's specific project whilst enhancing that project's compatibility with other projects employing the TEI. For over two decades, the TEI has been used extensively in projects from different disciplines, fields, and subjects internationally. The ongoing engagement of a broad user community through the organization of the TEI Consortium consolidates the importance of the text encoding standard and informs its continuous development and maintenance.
You have reached the end of this tutorial module providing an introduction to the TEI and text encoding for the humanities. You can now either test your knowledge of this theory in the interactive tests and exercises, or explore the accompanying examples section and move on to the next tutorial module.