Module 3: Prose
4. Advanced Encoding
4.1. Segments
For various kinds of analysis, it is often convenient to distinguish smaller units inside paragraphs or anonymous blocks. In the linking module, TEI defines two “neutral” container elements that don’t have any implied meaning: <ab> (anonymous block) and <seg> (segment). An <ab> element can occur in the same contexts as <p>, but does nothing more than mark a block of text. If spans of text are to be identified at the phrase level, below the paragraph level, this can be done with <seg>. Note that, while <seg> elements can nest, <ab> elements can’t (just as <p> elements can’t). For example, the output of an automatic parsing system in linguistic analysis may use <seg> to mark up linguistically significant constituents such as sentences, phrases, and words in a theory-neutral manner.
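As an illustration of the nesting described above, here is a minimal sketch. The sample sentence and the @type values ("sentence", "phrase") are hypothetical choices for this example; TEI does not prescribe them.

```xml
<!-- Hypothetical sample text; the type values are illustrative only -->
<ab>
  <seg type="sentence">
    <seg type="phrase">The quick fox</seg>
    <seg type="phrase">jumped over the fence.</seg>
  </seg>
</ab>
```

The outer <seg> marks a sentence and the inner ones mark its phrases; the <ab> itself cannot contain another <ab>.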
Note
Specialized “linguistic segment category” elements are defined in section 17.1 Linguistic Segment Categories of the TEI Guidelines.
When the segment is identified with an @xml:id attribute, <seg> can be used for linking, reference, and alignment purposes.
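A small sketch of how an identified segment might serve as a link target. The identifier "seg1" and the sample text are hypothetical; <ref> with a @target attribute is just one of several TEI pointing mechanisms that could be used here.

```xml
<!-- "seg1" is a hypothetical identifier chosen for this sketch -->
<p>
  <seg xml:id="seg1">A segment that other elements can point to.</seg>
</p>
<p>This paragraph refers back to <ref target="#seg1">the segment above</ref>.</p>
```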
Note
See section 16.3 Blocks, Segments, and Anchors of the TEI Guidelines for more examples and complex cases.
4.2. Figures
Graphical elements may be indicated with the empty <graphic> element. On its own, this merely points out the presence of a graphical element. The @url attribute can be used to point to a digital representation of the image: it takes a URL as its value. If a digital facsimile of the image in the example text is available, this could be encoded as follows:
In this case, the URL points to a file hi_elk.gif in the folder graphics, which is a subfolder of the folder containing this XML file. This is a so-called relative URL; alternatively, an absolute URL could be used (e.g., file:///F:/TBE/images/hi_elk.gif).
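A minimal sketch of such an encoding, using the relative path to hi_elk.gif in the graphics folder that the text describes:

```xml
<!-- Empty element: the image itself lives in an external file -->
<graphic url="graphics/hi_elk.gif"/>
```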
However, if we look closely at the image in our example, we see there’s more to it: it has a kind of heading above it, and some associated caption text. Both of these structural elements are connected to the image on the page and should ideally be encoded as such. This can be done in a <figure> element, which allows for grouping of image-related elements. The <figure> element is defined in the figures module. Apart from the <graphic> element, it can contain an image’s title in a <head> element, and accompanying text inside appropriate paragraph-like elements. For our example, this could look as follows:
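A sketch of this grouping, under the assumption that the caption is encoded as a plain paragraph. The heading and caption strings below are placeholders, not the text of the actual example image.

```xml
<figure>
  <!-- Placeholder heading and caption; substitute the text printed with the image -->
  <head>Heading printed above the image</head>
  <graphic url="graphics/hi_elk.gif"/>
  <p>Caption text printed with the image.</p>
</figure>
```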
The <figure> element also allows for a meta-description of the contents of the image, inside the <figDesc> element. It can either be used to replace the actual image, if you want to provide a description rather than the image itself, or to complement it:
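Both uses might be sketched as follows. The description text is a placeholder written for this illustration, since the content of the example image is not reproduced here.

```xml
<!-- Variant 1: the description complements the image -->
<figure>
  <head>Heading printed above the image</head>
  <graphic url="graphics/hi_elk.gif"/>
  <figDesc>A prose description of what the image shows.</figDesc>
</figure>

<!-- Variant 2: the description stands in for the image -->
<figure>
  <head>Heading printed above the image</head>
  <figDesc>A prose description of what the image shows.</figDesc>
</figure>
```

Note that <figDesc> describes the image for the reader of the encoding; it is not a transcription of text printed on the page.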
Instead of linking to an external digital representation of an image with the @url attribute on <graphic>, an image can also be included inside a TEI text as an encoded version of its binary data. This can be done inside a <binaryObject> element, whose @encoding attribute can specify the format of this binary encoding, so that XML processing tools can interpret it correctly. If no format is specified, Base64 is assumed. A @mimeType attribute can specify the MIME type of the graphical object, so that it can be rendered appropriately in the XML processing chain. For example, this is how a Base64 ASCII representation of the binary JPEG scan of the image in our example text can be encoded:
Notice that, just like <graphic>, <binaryObject> can be used without a <figure> wrapper as well.
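A sketch of the embedded form. The Base64 string below is deliberately truncated: it shows only the characters that a typical JFIF-style JPEG file begins with, and the remainder of the data is omitted rather than invented. The @encoding value "base64" is spelled out here for clarity, although the text above notes it is the assumed default.

```xml
<figure>
  <binaryObject mimeType="image/jpeg" encoding="base64">
    /9j/4AAQSkZJRgAB<!-- remainder of the Base64 data omitted in this sketch -->
  </binaryObject>
</figure>
```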
Note
If these specific TEI elements for graphical elements are insufficient for your needs, it is perfectly possible to make use of more advanced representation standards like SVG in TEI. For more information, have a look at section 22.6 Combining TEI and Non-TEI Modules of the TEI Guidelines.
Summary
The presence of graphical elements in a document can be indicated with the empty <graphic> element. A digital representation can be pointed to in its @url attribute. Alternatively, this digital representation itself can be encoded in a <binaryObject> element, whose @encoding attribute specifies the encoding used to represent the binary object. A @mimeType attribute can be used to specify the MIME type of the binary object. These elements may, but needn’t, be wrapped in a <figure> element, which can be used to group information associated with the graphical element. Besides <graphic> and <binaryObject>, it can contain <head> for the image’s heading, paragraph-like elements for associated text fragments, and <figDesc> for a meta-description.
4.3. Tables
Tables can be encoded in TEI with the <table> element. Tables are first organised in rows, and rows contain a number of cells. Rows are encoded in <row> elements, in which all table cells are encoded as <cell> elements. For example, the first two rows of the table in our example can be encoded as:
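A sketch of a two-row, three-column table in this style. The cell contents ("Column A", "Row 1", and so on) are placeholders invented for this illustration, not the contents of the example table.

```xml
<table>
  <row>
    <cell/>                      <!-- empty top-left cell -->
    <cell>Column A</cell>
    <cell>Column B</cell>
  </row>
  <row>
    <cell>Row 1</cell>
    <cell>value A1</cell>
    <cell>value B1</cell>
  </row>
</table>
```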
Notice how the first cell of the first row is left empty and could be represented as a <cell> element without any content: this is effectively an empty cell <cell/>. The other rows contain three cells. As we see, the first row as well as the first column are set apart from the rest of the cells. As is common in tables, these cells contain the labels for which the other cells provide values. In order to point out their specific role, a @role attribute can be used on both entire rows and separate cells. Suggested values are "label" and "data" (default):
The third row deviates from the previous two. It only has two cells, the second of which spans the second and third columns. This can be recorded with a @cols attribute on this specific cell. Its value is the total number of columns occupied by this cell.
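The use of @role might be sketched like this, again with placeholder cell contents. Cells without an explicit @role fall back to the default value "data".

```xml
<table>
  <row role="label">             <!-- the whole first row holds labels -->
    <cell/>
    <cell>Column A</cell>
    <cell>Column B</cell>
  </row>
  <row>
    <cell role="label">Row 1</cell>   <!-- only the first cell is a label -->
    <cell>value A1</cell>
    <cell>value B1</cell>
  </row>
</table>
```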
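A sketch of such a row, to be placed inside the <table> element; the cell contents are placeholders:

```xml
<row>
  <cell role="label">Row 2</cell>
  <!-- one cell occupying the second and third columns -->
  <cell cols="2">value spanning columns A and B</cell>
</row>
```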
Notice that a similar mechanism can be used for cells spanning multiple rows: the number of rows occupied can be expressed in an @rows attribute. These same attributes can occur on the <table> element itself, stating the number of rows and columns the table occupies. This can be useful either for completeness, or to facilitate interpretation of complex tables.
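As a sketch of both points, here is a hypothetical two-by-two table (unrelated to the example table) that declares its dimensions and contains a cell spanning two rows:

```xml
<table rows="2" cols="2">
  <row>
    <cell rows="2">spans both rows</cell>
    <cell>first row, second column</cell>
  </row>
  <row>
    <!-- the first column of this row is already occupied by the spanning cell -->
    <cell>second row, second column</cell>
  </row>
</table>
```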
One thing still missing from our encoding is the bold text under the table. This can be considered the table’s heading. Again, the generic <head> element can be used to capture this information:
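Putting the pieces together, the complete table might be sketched as follows. The heading text and cell contents are placeholders, and the <head> is placed first inside <table>, before any <row>.

```xml
<table rows="3" cols="3">
  <head>Heading text printed in bold under the table</head>
  <row role="label">
    <cell/>
    <cell>Column A</cell>
    <cell>Column B</cell>
  </row>
  <row>
    <cell role="label">Row 1</cell>
    <cell>value A1</cell>
    <cell>value B1</cell>
  </row>
  <row>
    <cell role="label">Row 2</cell>
    <cell cols="2">value spanning columns A and B</cell>
  </row>
</table>
```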
Notice, however, that <head>, as a member of the model.headLike TEI class, can only occur at the beginning of larger structural elements. Therefore, in this example we have to abstract away from the physical position of the table’s heading (after the table) and encode it before the first <row> instead.