Module 3: Prose
2. Structure #
Consider the following text:
Although its meaning might not be clear at first sight, we generally recognize this text as prose, irrespective of any knowledge about its contents or meaning. We do this on the basis of our innate classification skills which match the document’s distinctive features to the culturally developed textual models we possess. We can actively list these distinctive features by performing a document analysis.
Note
If this text is vaguely familiar to you, that’s because we took some passages from the TEI Guidelines and processed them in true Oulipo style with the N+7 Machine. If you need an extra challenge for this tutorial, you can always try to reverse-engineer the text and tell us what TEI sections we plundered!
Challenge
Make a list of all structural units you can distinguish in the text above and give them a name.
Solution
The list you have compiled provides a “passport” of the document type we call prose. In this document we distinguish the following structural units:
- Paragraphs
- Divisions
- Subdivisions
- The document
- Headings
- Document title
- Subtitle
- Lists
- Quotations
- Citations
- Bibliographic and general references
- Page numbers
- Figures
- Tables
For each one of these units there is a corresponding TEI element.
Here is where to find these units in the document:
2.1. Paragraphs #
The paragraph is generally recognized as a structural textual unit that is easy to spot. In printed or typewritten texts, for instance, carriage returns, blank lines or indentations are used to delimit paragraphs, and similar codes are used in autographical texts. The TEI element to encode a paragraph is simply <p>.
Note
Because <p> denotes a prose paragraph and prose can occur in all kinds of texts of different genres, <p> can be used to encode prose sections in texts of all genres as well.
The number of paragraphs in a text depends entirely on that text. Some texts have only one paragraph, whereas most contain several. In any case, paragraphs cannot nest within each other; they appear as siblings next to each other:
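A minimal sketch of sibling paragraphs, using invented placeholder text rather than the sample document:

```xml
<body>
  <!-- Two paragraphs as siblings; a <p> never contains another <p>. -->
  <p>This is the first paragraph of an invented sample text.</p>
  <p>This is the second paragraph, placed next to the first one.</p>
</body>
```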
There may be contexts in which the encoder doesn’t want to use <p> to encode units of text which are analogous to paragraphs. Then, <ab> can be used to encode so-called “anonymous blocks” of text. This can be useful to encode any unit of text with a paragraph-like structure for which no more specific markup is defined, or to which the encoder does not want to attach any specific meaning.
2.2. Divisions #
Several paragraphs (or anonymous blocks) can be grouped into hierarchical divisions and subdivisions such as documents, parts, chapters, sections, subsections, etc. Divisions of any sort are encoded using <div>. Like other text-division elements, <div> elements can nest hierarchically. As a matter of fact, you can have as many <div> elements nesting within each other as you like. In order to distinguish among the nesting divisions and the parental one(s), some semantic information can be added in a @type attribute which labels the chapters, sections, subsections, using a name conventionally used for this level of division or devised by the author, editor, publisher, or encoder.
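As an illustration, nested divisions could look roughly like this. The titles and text are invented placeholders, the @type values are only one possible naming choice, and headings are provisionally transcribed as anonymous blocks:

```xml
<div type="document">
  <ab>An Invented Document Title</ab>
  <div type="chapter" n="1">
    <ab>1. First Chapter</ab>
    <p>Text of the first chapter.</p>
    <div type="section" n="1.1">
      <ab>1.1. First Section</ab>
      <p>Text of the first section.</p>
    </div>
  </div>
  <div type="chapter" n="2">
    <ab>2. Second Chapter</ab>
    <p>Text of the second chapter.</p>
  </div>
</div>
```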
Note
The @type attribute can have any value defined by the encoder, although it is intended solely for conventional names of different classes of text blocks. These may vary according to the genre and period of the text. As the TEI Guidelines point out, “a major subdivision of an epic or of the Bible is generally called a ‘book,’ that of a report is usually called a ‘part’ or ‘section,’ that of a novel a ‘chapter’ — unless it is an epistolary novel, in which case it may be called a ‘letter.’ Even texts which are not organised as linear prose narratives, or not as narratives at all, will frequently be subdivided in a similar way: a drama into ‘acts’ and ‘scenes’; a reference book into ‘sections’; a diary or day book into ‘entries’; a newspaper into ‘issues’ and ‘sections,’ and so forth.” (TEI Guidelines, 4.1 Divisions of the Body)
As illustrated in the example above, some sort of numbering can be added in the @n attribute. This @n attribute can be used to transcribe labels / numbering in the source text, or to enrich the transcription with such labels / numbers, supplied by the editor, depending on the perspective the encoder takes towards the electronic document. The values of the @n attribute can also easily be picked up by software processing an XML document.
Alternatively, so-called “numbered divisions” can be used to encode divisions as belonging to one of seven hierarchical levels. Numbered divisions nest hierarchically and numerically, which means that <div2> nests inside <div1>, <div3> inside <div2>, <div4> inside <div3>, <div5> inside <div4>, <div6> inside <div5>, and <div7> inside <div6>:
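A short sketch of the first three numbered levels, with invented text and illustrative @type and @n values:

```xml
<div1 type="part" n="1">
  <p>Introductory text of the part.</p>
  <div2 type="chapter" n="1.1">
    <p>Text of the chapter.</p>
    <!-- A <div3> may only occur inside a <div2>, never directly in <div1>. -->
    <div3 type="section" n="1.1.1">
      <p>Text of the section.</p>
    </div3>
  </div2>
</div1>
```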
Overall, preference is given to unnumbered divisions (<div>), unless a strong case can be made in favour of numbered divisions. The two systems, however, cannot be mixed in one document.
Text divisions can also be preceded by introductory <p> elements.
However, <p> elements cannot follow <div> elements or occur in between divisions: this is a hard limitation of the text model defined by the TEI. Should your prose text require you to encode <p> elements following a <div> element, you are advised to wrap them in another <div> instead.
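The ordering rule can be sketched as follows. The text is invented, and "conclusion" is merely an assumed label for the extra wrapper division:

```xml
<div type="chapter">
  <!-- Paragraphs are allowed before the first subdivision. -->
  <p>Introductory paragraph of the chapter.</p>
  <div type="section">
    <p>Text of the section.</p>
  </div>
  <!-- A bare <p> at this point would be invalid, so it gets its own <div>. -->
  <div type="conclusion">
    <p>Closing paragraph that follows the section in the source.</p>
  </div>
</div>
```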
Summary
Text divisions of any kind can be encoded using <div> elements, which can nest to an arbitrary depth and whose type and numbering may be documented inside @type and @n attributes, respectively. Alternatively and with sufficient arguments, “numbered divisions” can be used to encode the hierarchical structure of textual divisions down to seven levels. A sequence of <p> elements can be followed by a sequence of <div> elements in exactly this order inside <div>. Yet, <p> cannot occur after a <div> element.
2.3. Headings #
The examples up to now do not represent the document truthfully, because all headings have so far been transcribed only very shallowly as anonymous blocks (<ab>). This is perfectly legal, but their specific semantics can be expressed with more specific elements. Time now to put this right. Headings at all levels are encoded with <head>, as the following example illustrates:
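A minimal sketch with invented headings, in which the numbering is transcribed as part of the heading text:

```xml
<div type="chapter">
  <head>1. First Chapter</head>
  <p>Text of the first chapter.</p>
  <div type="section">
    <head>1.1. First Section</head>
    <p>Text of the first section.</p>
  </div>
</div>
```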
As mentioned earlier, XML processing tools can take into account the value of the @n attribute (as well as many other pieces of information) for numbering text divisions when rendering a TEI document. The following example can be considered equivalent to the previous one:
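In this sketch (invented headings again), the numbers are no longer part of the heading text but are recorded in @n on the enclosing <div>, from which a stylesheet could regenerate them:

```xml
<div type="chapter" n="1">
  <head>First Chapter</head>
  <p>Text of the first chapter.</p>
  <div type="section" n="1.1">
    <head>First Section</head>
    <p>Text of the first section.</p>
  </div>
</div>
```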
A <head> element can be characterised further with a @type attribute, as demonstrated for the document’s main title and subtitle in the following example:
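For instance, with placeholder titles and with "main" and "sub" as assumed @type values:

```xml
<div type="document">
  <head type="main">An Invented Main Title</head>
  <head type="sub">With an Invented Subtitle</head>
  <p>First paragraph of the document.</p>
</div>
```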
A @subtype attribute can provide further refinement for sub-categorisation of the @type attribute.
Summary
Headings at all levels are encoded with <head>. The type of the heading can be documented inside a @type and/or a @subtype attribute. Whether or not to encode the numbering of headings as text in the document, or as the value of the @n attribute on the parent <div> element, is up to the encoder.
2.4. Lists #
Lists of any kind contain one or more items. A list is encoded with the element <list>, an item with the element <item>:
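In its simplest form, with invented items, this could look as follows:

```xml
<list>
  <item>first item</item>
  <item>second item</item>
  <item>third item</item>
</list>
```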
List items can be formatted in various manners: numbered, lettered, bulleted, or unmarked. Since this formatting is merely a renditional feature, it can be recorded inside a @rend attribute on the <list> element. The following is an example of a numbered list:
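A sketch with invented items, assuming "numbered" as the @rend value and keeping the numbers as part of the transcribed text:

```xml
<list rend="numbered">
  <item>1. first item</item>
  <item>2. second item</item>
  <item>3. third item</item>
</list>
```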
The following is an example of a bulleted list:
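Again with invented items and an assumed @rend value; the bullet characters themselves are not transcribed here:

```xml
<list rend="bulleted">
  <item>first item</item>
  <item>second item</item>
  <item>third item</item>
</list>
```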
Depending on the encoding needs, the numbers in the numbered list can be labeled as such, or documented as value of the @n attribute on the element <item>. Here is an example of the first option:
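One way to label the numbers as such is to pair each <item> with a preceding <label>. The items are invented for illustration:

```xml
<list rend="numbered">
  <label>1.</label>
  <item>first item</item>
  <label>2.</label>
  <item>second item</item>
</list>
```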
And here is the equivalent example using attribute values:
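In this variant of an invented numbered list, the numbers live in @n rather than in the text:

```xml
<list rend="numbered">
  <item n="1">first item</item>
  <item n="2">second item</item>
</list>
```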
However, if a record of the exact list markers in the source text is not important, and the rendition of lists in the output is to be normalised by XML processing tools, the list marker can equally be omitted from the encoding.
As mentioned earlier, <head> is also used to mark other units than <div>, and can equally be used to encode the heading of a list.
Lists can also be formatted inline, in the running text. This feature can also be encoded in the @rend attribute, with a value such as "inline". Multiple renditional features can be combined inside @rend:
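A possible encoding of an inline, lettered list. The sentence is invented, and the space-separated @rend values are one convention among many:

```xml
<p>The recipe calls for three things: <list rend="inline lettered">
    <item>(a) flour,</item>
    <item>(b) water, and</item>
    <item>(c) salt.</item>
  </list></p>
```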
Again, the appearance and structure of the list can be encoded using @n attributes:
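Sketched on an invented inline list, with the letters recorded in @n instead of in the text:

```xml
<p>The recipe calls for three things: <list rend="inline lettered">
    <item n="a">flour,</item>
    <item n="b">water, and</item>
    <item n="c">salt.</item>
  </list></p>
```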
Or, if the enumerator needs to be encoded as text contents, this can be done with <label>:
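A comparable sketch, with the enumerators of an invented inline list kept as text inside <label>:

```xml
<p>The recipe calls for three things: <list rend="inline lettered">
    <label>(a)</label>
    <item>flour,</item>
    <label>(b)</label>
    <item>water, and</item>
    <label>(c)</label>
    <item>salt.</item>
  </list></p>
```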
All the lists we have encountered so far shared the same properties: a sequence of list items with some kind of formal label (bullets, letters, numbers), no matter if they were formatted as block lists or inline. Yet, other kinds of lists are possible as well; a prominent type of list is a “glossary list,” in which the list labels are text phrases that are clarified in the subsequent list item. Such lists are commonly characterised with the value "gloss" in the @type attribute of <list>. They must consist of a sequence of <label> and <item> pairs. Even though there’s no such list in the example text, this is an example:
Notice how this example shows that lists can nest: inside a list <item>, further <list> elements are allowed. Those can be of different types. The previous example could be rendered as follows:
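A small glossary list of this kind might be encoded as below. The entries are invented, and the second item contains a nested list of a different kind:

```xml
<list type="gloss">
  <label>paragraph</label>
  <item>a basic unit of prose</item>
  <label>division</label>
  <item>a larger unit of text, for example:
    <list rend="bulleted">
      <item>a chapter</item>
      <item>a section</item>
    </list>
  </item>
</list>
```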
Summary
Lists are encoded with the <list> element and contain one or more <item> elements. Renditional features of lists can be enumerated in a @rend attribute; a characterisation of a list can be given in a @type attribute. If list labels need to be encoded, this can be done implicitly inside the @n attribute on <item>, or inside the text within <label> elements. Lists can nest: <item> elements can contain deeper-level <list> elements.
2.5. Quotation #
The use of quotation marks in a text can signal different things, such as direct or indirect speech or thought, technical terms, jargon, phrases which are mentioned but not used, citations from authorities, or indeed any part of the text attributed by the author or narrator to some agency other than the narrative voice. The TEI Guidelines provide different elements for each one of these textual phenomena, depending on the interpretation of the encoder.
2.5.1. Speech and Thought #
The general element for quotation is <q>. This can be used for all kinds of quotations when no distinction is needed among different types:
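A sketch with an invented sentence standing in for the passage from the sample text; the quotation marks are kept as part of the content:

```xml
<p>A true paranoid would put it like this: <q>“They are all in on it.”</q></p>
```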
The <q> element may be fine-tuned by a @type attribute. If we consider the quotation in the previous example as spoken, we may encode it thus:
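With an invented sentence, and "spoken" as the @type value:

```xml
<p>A true paranoid would put it like this: <q type="spoken">“They are all in on it.”</q></p>
```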
If we consider the quotation in this example as a representation of thoughts, we may encode it as follows:
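The same invented sentence, now characterised as thought:

```xml
<p>A true paranoid would put it like this: <q type="thought">“They are all in on it.”</q></p>
```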
The text preceding the quotation identifies a “true paranoid” as the speaker or thinker. This can be recorded inside a @who attribute on the <q> element. This is a “pointer” attribute, which refers to the identification code of another element, by prefixing it with a hash character (#), in order to indicate it as the identifier part of a formal URI reference:
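In the following sketch, the identifier "paranoid" is hypothetical; it is assumed to be declared elsewhere as the @xml:id of an element describing the speaker:

```xml
<!-- who="#paranoid" points to an element carrying xml:id="paranoid" (assumed). -->
<p>A true paranoid would put it like this:
  <q type="spoken" who="#paranoid">“They are all in on it.”</q></p>
```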
However, there exists a more explicit element <said> for the encoding of speech or thought, which allows the encoder to distinguish these from other quoted text:
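Using the same invented sentence and the hypothetical identifier "paranoid" for the speaker, this could read:

```xml
<p>A true paranoid would put it like this:
  <said who="#paranoid">“They are all in on it.”</said></p>
```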
In addition to the @who attribute, the <said> element may carry the attributes @aloud and @direct, whose values are "true", "false", "inapplicable", or "unknown". In the following example, the “true paranoid” is recorded as uttering the quoted words aloud in direct speech.
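A sketch of such an encoding; the sentence and the identifier "paranoid" are invented for illustration:

```xml
<p>A true paranoid would put it like this:
  <said who="#paranoid" aloud="true" direct="true">“They are all in on it.”</said></p>
```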
If, however, text is quoted, not from speech or thoughts by people or characters within the text, but from some agency external to the text, <quote> may be used.
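For example, with an invented sentence that quotes an external source:

```xml
<p>The manual is blunt on this point:
  <quote>“Paragraphs cannot nest within each other.”</quote></p>
```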
Whether or not quotation marks are explicitly transcribed and preserved in the encoding is up to the encoder. Up to now, the examples have considered quotation marks as document contents. Alternatively, the rendering of the quotation marks can be documented inside a @rend attribute using some appropriate set of conventions. A possible alternative for one of the examples above could be:
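In this sketch the quotation marks are dropped from the content and described in @rend instead. The value "double-quotes" is an assumed, project-specific convention, and the sentence and speaker identifier are invented:

```xml
<p>A true paranoid would put it like this:
  <said rend="double-quotes" who="#paranoid" aloud="true" direct="true">They are all in on it.</said></p>
```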
Yet, a more robust approach would be the definition of a standard rendition for quoted speech via the <rendition> element in the header, which can be referenced in the global @rendition attribute. For example:
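One way this could be set up, assuming CSS as the rendition scheme. The identifiers "quoteBefore", "quoteAfter", and "paranoid" as well as the sentence are invented:

```xml
<!-- In the TEI header, inside <encodingDesc>: -->
<tagsDecl>
  <rendition xml:id="quoteBefore" scheme="css" scope="before">content: '“';</rendition>
  <rendition xml:id="quoteAfter" scheme="css" scope="after">content: '”';</rendition>
</tagsDecl>

<!-- In the text, both rendition descriptions are referenced by their identifiers: -->
<p>A true paranoid would put it like this:
  <said rendition="#quoteBefore #quoteAfter" who="#paranoid">They are all in on it.</said></p>
```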
Reference
See Module 1: Common Structure, Elements, and Attributes, section 5.8 for a discussion of the @rendition attribute, and Module 2: The TEI Header, section 3.2.1 on documentation of the editorial practice.
Summary
Direct and indirect speech and thought can be encoded with the general <q> element carrying appropriate values for the @who and the @type attributes. Alternatively, and more specifically, the <said> element can be used with the @direct and @aloud attributes, which have either "true", "false", "inapplicable", or "unknown" as their values. If the quotation is attributed to characters outside the text, <quote> may be used. Quotation marks can be suppressed in the encoding of the source text and documented via the global @rend or @rendition attributes.
2.5.2. Citations #
A citation is a specific type of quotation where some other kind of document is quoted together with its bibliographic reference. This means that the elements <quote> and <bibl> are essential parts of <cit>:
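A minimal sketch of a citation. The quoted sentence and the page number are invented; author and year follow the identifier "Stroll2010" used later in this module:

```xml
<cit>
  <quote>“An invented sentence standing in for the quoted passage.”</quote>
  <bibl>(Stroll 2010, p. 12)</bibl>
</cit>
```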
As with lists, the rendering of the citation as a block or inline citation can be documented inside a @rend attribute:
Again, the question of how to treat quotation marks in the quoted text is determined by the editorial policy. See section 2.5.1 for possible approaches.
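For instance, with "block" as an assumed @rend value and invented content:

```xml
<cit rend="block">
  <quote>“An invented sentence standing in for the quoted passage.”</quote>
  <bibl>(Stroll 2010, p. 12)</bibl>
</cit>
```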
Summary
Citations can be encoded with the <cit> element, which groups the actual citation in a <quote> element, and a bibliographic reference in a <bibl> element. The rendering of the citation can be recorded inside a @rend attribute. Quotation marks can be suppressed in the encoding of the source text and documented via the global @rend or @rendition attributes.
2.5.3. Words or Phrases Mentioned #
The <mentioned> element is used to mark words or phrases mentioned but not used in the text. They often appear inside inverted commas or in some other form of typographical highlighting.
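An invented sentence in which two words are talked about rather than used:

```xml
<p>The element name <mentioned>div</mentioned> is short for
  <mentioned>division</mentioned>.</p>
```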
2.5.4. Disclaimed Responsibility #
Where the author or narrator disclaims responsibility over words or phrases and distances himself or herself from the words in question without even attributing them to any other voice in particular, the <soCalled> element can be used. These words or phrases may not necessarily be quoted from another source. So-called “scare quotes” or italics are often used to mark these cases.
Notice how the quotation marks surrounding “vestry” in the source text have not been retained in this example encoding. Again, this is an editorial decision.
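A sketch in which the word “vestry” is marked as a scare-quoted term; the surrounding sentence is invented:

```xml
<p>We were shown into the <soCalled>vestry</soCalled>, which turned out
  to be a cupboard under the stairs.</p>
```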
2.5.5. Technical Terms, Jargon and Glosses #
Technical terms and jargon may consist of a single word, an acronym, a phrase, or a symbol and can be encoded with <term>. Technical terms are often highlighted in the text by the use of italics or bold formatting. Their explanation or gloss <gloss> is often given in quotation marks. These elements may occur in combination with each other or on their own.
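A term and its gloss could be combined as follows. The sentence is invented, and "italic" is an assumed @rend value:

```xml
<p>An <term rend="italic">anonymous block</term> is
  <gloss>“a unit of text with a paragraph-like structure”</gloss>.</p>
```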
2.5.6. Summary #
Quotation marks are used to signal speech and thought (<q>, <said>), quotations (<quote>), citations (<cit> with <quote> and <bibl>), words or phrases mentioned (<mentioned>), words or phrases over which the author or narrator disclaims responsibility (<soCalled>), terminology (<term>), and glosses (<gloss>). Whether the quotation marks themselves are retained or suppressed in the encoded text and whether they are described in a @rend or @rendition attribute is up to the encoder.
2.6. Bibliographic and General References #
The discussion of citations in section 2.5.2 already touched on another important textual feature: references of all sorts. Although references are not unique to prose, they tend to be more common in prose than in other text genres because of its more referential nature. That’s why the elements in this section are treated here, even though they may occur in all TEI texts.
2.6.1. Bibliographic References #
As seen in section 2.5.2, citations often are accompanied by some sort of bibliographic reference. TEI provides means to encode bibliographic information in a number of ways, depending on the required level of detail:
- <bibl>: a loose bibliographic description
- <biblStruct>: a structured bibliographic description
- <biblFull>: a fully structured bibliographic description
Since bibliographic descriptions form a mandatory part of the <sourceDesc> section of the TEI header, a full discussion of these elements is provided in Module 2: The TEI Header, section 3.1.7. Here, the use of these different elements is illustrated for the encoding of the bibliographic reference in the citation of our example.
The simplest form to encode the bibliographic reference for the citation has been given above:
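In sketch form, with an invented page number and with author and year following the identifier "Stroll2010" used later in this section:

```xml
<bibl>(Stroll 2010, p. 12)</bibl>
```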
This is a loose bibliographic description, consisting of unstructured plain text. Though the work may not be known to us, the typographic conventions we’re used to in such references enable us to distinguish a couple of bibliographic categories, such as the author, publication date, and page referenced:
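These categories could be made explicit as follows; the page number is an invented placeholder:

```xml
<bibl>(<author>Stroll</author>
  <date when="2010">2010</date>,
  <biblScope unit="page">p. 12</biblScope>)</bibl>
```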
Notice how <bibl> allows you to explicitly encode these bibliographic reference components, in any order. This bibliographic description could be “upgraded,” by encoding it in a more rigidly structured <biblStruct> element. This requires a <monogr> element describing the work as a monograph:
This form of reference inevitably requires more structure and detail: at least the title of the work is required in <title>. Moreover, all plain text has to be removed from <biblStruct>, which only takes elements as content. The last option, <biblFull>, would impose the structure of more or less a full <fileDesc> TEI header section on the description of the work (see Module 2: The TEI Header, section 3.1). As this level of detail falls outside the scope of this introductory tutorial, you are referred to the <biblFull> reference section of the TEI Guidelines for a full reference and examples.
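A sketch of such a structured description. The title is a bracketed placeholder because it is not given in the reference itself, the page number is invented, and the date is placed in <imprint>, which the TEI content model for <monogr> also expects:

```xml
<biblStruct>
  <monogr>
    <author>Stroll</author>
    <!-- Placeholder: the actual title would have to be supplied by the encoder. -->
    <title>[Title supplied by the encoder]</title>
    <imprint>
      <date when="2010">2010</date>
    </imprint>
    <biblScope unit="page">12</biblScope>
  </monogr>
</biblStruct>
```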
Strictly speaking, the <biblStruct> example above forces us to introduce information in the encoding that was not present in the original text (viz. the title, which is a mandatory element of <monogr>). Depending on the editorial principles, this may or may not be desired. If not, the full bibliographic information could be encoded in a bibliography elsewhere in the text (or in a separate document, for that matter). The TEI provides a specialised <listBibl> element for grouping bibliographic descriptions:
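Such a bibliography might be sketched like this. The entry carries an @xml:id so that it can be pointed at from elsewhere; its title is a placeholder:

```xml
<listBibl>
  <biblStruct xml:id="Stroll2010">
    <monogr>
      <author>Stroll</author>
      <title>[Title supplied by the encoder]</title>
      <imprint>
        <date when="2010">2010</date>
      </imprint>
    </monogr>
  </biblStruct>
  <!-- Further bibliographic descriptions would follow here. -->
</listBibl>
```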
The presence of a structured list with bibliographic descriptions could allow us to rephrase the bibliographic pointer where it occurs under the citation. This mechanism is introduced in section 2.6.2.
Summary
Bibliographic descriptions may be provided in one of the bibliographic elements <bibl> (for loose bibliographic descriptions), <biblStruct> (for structured bibliographic descriptions), or <biblFull> (for exhaustive bibliographic descriptions). Bibliographic descriptions may be grouped in a <listBibl> element.
2.6.2. References and Pointers #
Strictly speaking, the bibliographic reference under the citation in our example is an abbreviated reference, pointing at a bibliographic item, namely the book mentioned. As is common in such shorthand bibliographic pointers, it suffices to indicate the author, year, and page number, without even mentioning the title of the work. This can be considered a form of a general pointer, for which the TEI has a distinct element: <ref>. Instead of <bibl>, it could equally be encoded as follows:
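A sketch of this alternative, with invented quotation text and page number:

```xml
<cit>
  <quote>“An invented sentence standing in for the quoted passage.”</quote>
  <ref>(Stroll 2010, p. 12)</ref>
</cit>
```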
The same element can be used to encode any kind of reference. For example, in the second paragraph of the section labeled “1. Paranoids,” the phrase “described in 16.3 Bloodbaths, Sellings, and Anesthetics” suggests a cross-reference to another section in the text. It could be encoded as follows:
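Only the phrase inside <ref> is taken from the sample text; the rest of the sentence is invented:

```xml
<p>The procedure is <ref>described in 16.3 Bloodbaths, Sellings, and
  Anesthetics</ref> and is not repeated here.</p>
```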
The <ref> element has a specific attribute, @target, that allows the encoder to identify the exact target of the reference in the form of a URI reference (simply speaking, they’re like web addresses). Like any of the TEI pointing attributes, it can refer to:
- the identification code of an element in the same document: the value then consists of the # sign, followed by the @xml:id value of the target element
- the identification code of an element in another document: the value then consists of the path to that document, suffixed with the # sign and the @xml:id value of the target element
- an entire remote document: the value then just consists of the path to that document
For example, the previous references could be formally anchored to their referents as follows:
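Sketched with the file name and identifiers discussed in this section; the page number is invented:

```xml
<!-- Pointer to an element in another document: path, then #, then the xml:id. -->
<ref target="bibliography.xml#Stroll2010">(Stroll 2010, p. 12)</ref>

<!-- Pointer to an element in the same document: # followed by the xml:id. -->
<ref target="#div16.3">described in 16.3 Bloodbaths, Sellings, and Anesthetics</ref>
```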
Here, the bibliographic reference assumes a complete bibliography in a document named bibliography.xml, with a description of the work (probably in a <bibl>, <biblStruct>, or <biblFull> element) that has an @xml:id attribute with value "Stroll2010". In the second example, the reference points to the @xml:id value of another element in the same document (most likely a <div> element), which has been uniquely identified as "div16.3".
Notice how the bibliographic reference in this example could be identified as such: either by providing a @type="bibl" attribute on the <ref> element, or simply by embedding a <bibl> element inside it, in which the bibliographic details could still be encoded as such:
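Both options in sketch form, with an invented page number:

```xml
<!-- Option 1: characterise the reference with @type. -->
<ref type="bibl" target="bibliography.xml#Stroll2010">(Stroll 2010, p. 12)</ref>

<!-- Option 2: embed a <bibl> whose components can still be marked up. -->
<ref target="bibliography.xml#Stroll2010">
  <bibl>(<author>Stroll</author> <date when="2010">2010</date>,
    <biblScope unit="page">p. 12</biblScope>)</bibl>
</ref>
```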
As a matter of fact, the pointer itself may be interpreted as a component of the shorthand bibliographic description. Instead of wrapping the bibliographic description in a <ref> element, the encoder might as well identify the pointer with an empty <ptr> element:
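A sketch of this approach, again with an invented page number; the empty <ptr> sits inside the <bibl> it belongs to:

```xml
<bibl>(<author>Stroll</author> <date when="2010">2010</date>,
  <biblScope unit="page">p. 12</biblScope>)
  <ptr target="bibliography.xml#Stroll2010"/></bibl>
```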
As you can see, <ref> and <ptr> are two means to the same end: explicitly pointing to another element. There’s one important difference:
- <ref> can have content, which can be considered the “label” for the formal reference that is identified in the @target attribute. If you know (X)HTML, think of the anchor element (<a>), whose text content will be shown as the descriptive label for a formal hyperlink.
- <ptr> must be empty. You could compare it to a kind of footnote marker in a printed text.
Summary
References to other identified parts of an electronic document, or to other documents as a whole, can be encoded with the <ref> and <ptr> elements. Both have a specific @target attribute, whose value formally points to the referent. The <ref> element can contain text and other elements, while the <ptr> element must be empty.
2.7. Page Breaks #
Page breaks may be encoded with the <pb> element. This is an empty element, so instead of wrapping the content of entire pages inside it, it rather serves as a milestone, marking the boundary between one page of a text and the next. Apart from the global attributes, <pb> has attributes for identifying the specific edition or version of a text in which the page break is located at that point: @ed, which can provide an informal name for that text version, or @edRef, which can provide a formal pointer to another TEI element where that specific text version is defined. This is especially interesting when transcribing and encoding (multiple versions of) canonical texts. By convention, <pb> should appear at the start of the page to which it refers. The page number can be recorded as value of an @n attribute. In the following example, the <pb> element is placed at the start of page 2:
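A minimal sketch with invented text, in which a paragraph runs across the page boundary:

```xml
<p>These are the last words on page 1, and the sentence
  <!-- Milestone marking the start of page 2. -->
  <pb n="2"/>
  continues with the first words on page 2.</p>
```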