Module 0: Introduction

2. Text Encoding in the Humanities

Since the earliest uses of computers and computational techniques in the humanities at the end of the 1940s, scholars, projects, and research groups have had to look for systems that could represent their data in a form the computer could process. Computers, as Michael Sperberg-McQueen has reminded us, are binary machines that ‘can contain and operate on patterns of electronic charges, but they cannot contain numbers, which are abstract mathematical objects not electronic charges, nor texts, which are complex, abstract cultural and linguistic objects.’ [1] This is clearly seen in the mechanics of early input devices such as punched cards, where a hole at a certain coordinate actually meant a 1 or 0 (true or false) for the character or numeral represented by that coordinate according to the specific character set of the computer used. Because different computers used different character sets with a different number of characters, texts first had to be transcribed into that character set. All characters, punctuation marks, diacritics, and significant changes of type style had to be encoded with an inadequate budget of characters. This resulted in a complex of ‘flags’ for distinguishing upper-case and lower-case letters, for coding accented characters, and for marking the start of a new chapter, paragraph, sentence, or word. These ‘flags’ were also used for adding analytical information to the text, such as word classes and morphological, syntactic, and lexical information. Ideally, each project used its own set of conventions consistently throughout. Since such a set of conventions was usually designed on the basis of an analysis of the textual material to be transcribed into machine-readable form, another corpus of textual material would possibly require another set of conventions.
The design of these conventions also depended heavily on the nature and infrastructure of each project: the type of computers and software available, and devices such as magnetic tapes of a particular kind.
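The kind of project-specific flag convention described above can be sketched in a few lines of code. The flags used here (‘*’ for upper case, three-character codes such as "/e'" for accented characters) are invented for illustration and not taken from any particular historical project:

```python
# A hypothetical flag convention of the kind described above:
# '*' marks the next letter as upper case, and three-character
# codes such as "/e'" stand for accented characters that the
# machine's limited character set could not represent directly.
ACCENT_FLAGS = {"/e'": "é", "/e`": "è"}

def decode(flagged: str) -> str:
    """Expand a flag-encoded transcription into readable text."""
    out = []
    i = 0
    while i < len(flagged):
        if flagged[i] == "*" and i + 1 < len(flagged):
            out.append(flagged[i + 1].upper())          # upper-case flag
            i += 2
        elif flagged[i:i + 3] in ACCENT_FLAGS:
            out.append(ACCENT_FLAGS[flagged[i:i + 3]])  # accent flag
            i += 3
        else:
            out.append(flagged[i])
            i += 1
    return "".join(out)

print(decode("*the caf/e' reopened"))  # -> The café reopened
```

Because every project chose its own flags, a decoder like this was valid for one corpus only; a text flagged under another project's scheme would decode into nonsense.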
Although several projects were able to produce meaningful scholarly results with this internally consistent approach, the particular nature of each set of conventions or encoding scheme brought many disadvantages. Texts prepared in such a proprietary scheme by one project could not readily be used by other projects; software developed for the analysis of such texts could hence not be used outside the project, due to incompatible encoding schemes and non-standardized hardware. However, as more and more texts were prepared in machine-readable form, the call for an economical use of resources grew as well. As early as 1967, Martin Kay argued in favour of a ‘standard code in which any text received from an outside source can be assumed to be.’ [2] This code would act as an exchange format, allowing users to apply their own conventions at input and at output. [3]
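Kay's exchange-format idea can be illustrated with a toy sketch: rather than every project writing converters for every other project's scheme, each project needs only a routine into and out of one shared code. The two flag tables below are hypothetical, not drawn from any real project:

```python
# A toy illustration of an exchange format. Two projects with
# different (hypothetical) local flag conventions both convert to
# and from one shared interchange representation -- here plain
# Unicode strings -- rather than directly into each other's scheme.

PROJECT_A = {"*a": "ä", "*o": "ö"}   # project A's accent flags
PROJECT_B = {'"a': "ä", '"o': "ö"}   # project B's accent flags

def to_exchange(text: str, flags: dict) -> str:
    """Convert a project's local encoding into the shared code."""
    for flag, char in flags.items():
        text = text.replace(flag, char)
    return text

def from_exchange(text: str, flags: dict) -> str:
    """Convert the shared code back into a project's local encoding."""
    for flag, char in flags.items():
        text = text.replace(char, flag)
    return text

# A text prepared by project A reaches project B via the shared code.
shared = to_exchange("G*odel", PROJECT_A)   # -> Gödel
print(from_exchange(shared, PROJECT_B))     # -> G"odel
```

With a shared code, each of n projects maintains only its own pair of converters instead of separate converters for every other scheme in existence.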

Bibliography

[1] Sperberg-McQueen, C.M. Text in the Electronic Age: Textual Study and Text Encoding, with Examples from Medieval Texts. Literary and Linguistic Computing 1991, 6 (1): 34-46 (34).
[2] Kay, M. Standards for Encoding Data in a Natural Language. Computers and the Humanities 1967, 1 (5): 170-177 (171).
[3] Kay, M. Standards for Encoding Data in a Natural Language. Computers and the Humanities 1967, 1 (5): 170-177 (172).
[4] Russell, D.B. COCOA: A Word Count and Concordance Generator for Atlas. Atlas Computer Laboratory: Chilton, 1967.
[5] Hockey, S. Oxford Concordance Program Users’ Manual. Oxford University Computing Service: Oxford, 1980.
[6] Lancashire, I.; Bradley, J.; McCarty, W.; Stairs, M.; Woolridge, T.R. Using TACT with Electronic Texts. Modern Language Association of America: New York, 1996.
[7] Berkowitz, L.; Squitier, K.A. Thesaurus Linguae Graecae, Canon of Greek Authors and Works. Oxford University Press: New York/Oxford, 1986.
[8] Goldfarb, C.E. The SGML Handbook. Clarendon Press: Oxford, 1990.
[9] Barnard, D.T.; Fraser, C.A.; Logan, G.M. Generalized Markup for Literary Texts. Literary and Linguistic Computing 1988, 3 (1): 26-31 (28-29).
[10] Barnard, D.T.; Fraser, C.A.; Logan, G.M.. Generalized Markup for Literary Texts. Literary and Linguistic Computing 1988, 3 (1): 26-31.
[11] Barnard, D.T.; Hayter, R.; Karababa, M.; Logan, G.; McFadden, J. SGML-Based Markup for Literary Texts: Two Problems and Some Solutions. Computers and the Humanities 1988, 22 (4): 265-276.
[12] Barnard, D.T.; Hayter, R.; Karababa, M.; Logan, G.; McFadden, J. SGML-Based Markup for Literary Texts: Two Problems and Some Solutions. Computers and the Humanities 1988, 22 (4): 265-276.
[13] Bray, Tim; Paoli, Jean; Sperberg-McQueen, C.M. Extensible Markup Language (XML) 1.0. W3C Recommendation 10 February 1998. http://www.w3.org/TR/1998/REC-xml-19980210 (accessed September 2008)
[14] Bray, Tim; Paoli, Jean; Sperberg-McQueen, C.M. Extensible Markup Language (XML) 1.0. W3C Recommendation 10 February 1998. http://www.w3.org/TR/1998/REC-xml-19980210 (accessed September 2008)
[15] Burnard, L.; Sperberg-McQueen, C.M. TEI Lite: Encoding for Interchange: An Introduction to the TEI. Revised for TEI P5 release, February 2006. http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html
[16] DeRose, Steven J. XML and the TEI. Computers and the Humanities 1999, 33 (1-2): 11-30 (19).
[17] Burnard, L. Report of Workshop on Text Encoding Guidelines. Literary and Linguistic Computing 1988, 3 (2): 131-133 (132-133).
[18] Ide, N.M.; Sperberg-McQueen, C.M. Development of a Standard for Encoding Literary and Linguistic Materials. In Cologne Computer Conference 1988. Uses of the Computer in the Humanities and Social Sciences. Volume of Abstracts. Cologne, Germany, Sept 7-10 1988, p. E.6-3-4 (E.6-4).
[19] Ide, N.; Sperberg-McQueen, C.M. The TEI: History, Goals, and Future. Computers and the Humanities 1995, 29 (1): 5-15 (6).
[20] Sperberg-McQueen, C.M.; Burnard, L. (eds.). TEI P1: Guidelines for the Encoding and Interchange of Machine Readable Texts. ACH-ALLC-ACL Text Encoding Initiative: Chicago/Oxford, 1990. Available from http://www.tei-c.org/Vault/Vault-GL.html (accessed October 2008)
[21] Sperberg-McQueen, C.M.; Burnard, L. (eds.). TEI P2: Guidelines for the Encoding and Interchange of Machine Readable Texts. Draft P2 (published serially 1992-1993); Draft Version 2 of April 1993: 19 chapters. Available from http://www.tei-c.org/Vault/Vault-GL.html (accessed October 2008)
[22] Sperberg-McQueen, C.M.; Burnard, L. (eds.) (1994). Guidelines for Electronic Text Encoding and Interchange. TEI P3. Text Encoding Initiative: Oxford, Providence, Charlottesville, Bergen, 1994.
[23] Sperberg-McQueen, C.M.; Burnard L. (eds.). Guidelines for Electronic Text Encoding and Interchange. TEI P3. Revised reprint. Text Encoding Initiative: Oxford, Providence, Charlottesville, Bergen, 1999.
[24] Sperberg-McQueen, C.M.; Burnard, L. (eds.). TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML-compatible edition. XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Text Encoding Initiative Consortium: Oxford, Providence, Charlottesville, Bergen, 2002. http://www.tei-c.org/P4X/ (accessed October 2008)
[25] TEI Consortium (eds.). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium: Oxford, Providence, Charlottesville, Nancy. http://www.tei-c.org/Guidelines/P5/ (accessed October 2008).