This file is freely available and you are hereby authorised to copy, modify, and redistribute it in any way without further reference or permissions.
Made from scratch.
Slovenian Research Infrastructure for Language Resources and Tools CLARIN.SI.
A TEI schema for linguistically annotated corpora, primarily meant as an example of good practice for CLARIN.SI. This is a very general TEI schema - for actual practice, the companion document "tei_clarin_example.xml" should be consulted.
This element is required. It is customary to specify the TEI namespace http://www.tei-c.org/ns/1.0 on it, using the xmlns attribute.
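As a minimal sketch of what the namespace declaration means in practice, the snippet below parses a very small TEI document with Python's standard library and reports the namespace of the root element. The sample document and the helper name `root_namespace` are illustrative assumptions, not part of the schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Illustrative document only; titles and text are placeholders.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
      <publicationStmt><p>Unpublished demonstration file.</p></publicationStmt>
      <sourceDesc><p>No source: this is an original work.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample text.</p></body></text>
</TEI>"""


def root_namespace(xml_text):
    """Return (namespace URI, local name) of the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}local".
    if root.tag.startswith("{"):
        uri, local = root.tag[1:].split("}", 1)
        return uri, local
    return "", root.tag


uri, local = root_namespace(SAMPLE)
print(local, "is in the TEI namespace:", uri == TEI_NS)
```

A root element written without the xmlns attribute would come back with an empty namespace, which is an easy way to spot a missing declaration.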
First published as part of TEI P2, this is the P5 version using a namespace.
No source: this is an original work.
This is about the shortest TEI document imaginable.
Unpublished demonstration file.
No source: this is an original work.
If abbreviations are expanded silently, this practice should be documented in the <editorialDecl>, either with a <normalization> element or a <p>.
As with other culturally-constructed traits such as sex, the way in which this concept is described in different cultural contexts may vary. The normalizing attributes are provided as a means of simplifying that variety to Western European norms and should not be used where that is inappropriate. The content of the element may be used to describe the intended concept in more detail, using plain text.
Any number of alternations, pointers or extended pointers.
A consistent format should be adopted
Available for academic research purposes only.
In the public domain
Available under licence from the publishers.
The MIT License applies to this document.
Copyright (C) 2011 by The University of Victoria
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The value attribute may take any value permitted for attributes of the W3C datatype Boolean: this includes, for example, the strings true or 1, which are equivalent.
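The equivalence of the lexical forms can be sketched as follows. This is a small illustration, assuming the four lexical forms of the W3C boolean datatype (true, false, 1, 0); the function name `parse_xs_boolean` is hypothetical.

```python
def parse_xs_boolean(text):
    """Map a W3C boolean lexical form to a Python bool."""
    value = text.strip()  # surrounding whitespace is not significant
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    # Forms such as "TRUE" or "yes" are not valid lexical forms.
    raise ValueError("not a valid boolean lexical form: %r" % text)


# "true" and "1" denote the same value.
print(parse_xs_boolean("true") == parse_xs_boolean("1"))
```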
The scheme attribute needs to be supplied only if more than one taxonomy has been declared.
The who attribute may be used to point to any other element, but will typically specify a <respStmt> or <person> element elsewhere in the header, identifying the person responsible for the change and their role in making it.
It is recommended that changes be recorded with the most recent first. The status attribute may be used to indicate the status of a document following the change documented.
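The recommended ordering can be checked mechanically. The sketch below assumes that every <change> carries a full yyyy-mm-dd value on its when attribute, so that plain string comparison matches date order; the sample content, the pointer values #ed1 and #ed2, and the function name `newest_first` are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical revision history, most recent change first.
SAMPLE = """<revisionDesc xmlns="http://www.tei-c.org/ns/1.0">
  <change when="2024-03-01" who="#ed2" status="published">Final check.</change>
  <change when="2024-01-15" who="#ed1" status="draft">First draft.</change>
</revisionDesc>"""


def newest_first(revision_desc_xml):
    """Return True if the change elements are listed most recent first."""
    root = ET.fromstring(revision_desc_xml)
    dates = [change.get("when") for change in root.findall("tei:change", NS)]
    # Full yyyy-mm-dd strings sort lexicographically in date order.
    return dates == sorted(dates, reverse=True)


print(newest_first(SAMPLE))
```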
The function of this element seems to overlap with both the org attribute on <div> and the <samplingDecl> in the <encodingDesc>.
The conversion element is designed to store information about converting from one unit of measurement to another. The formula attribute holds an XPath expression that indicates how the measurement system in fromUnit is converted to the system in toUnit. Do not confuse the usage of the dating attributes (from and to) in the examples with the attributes (fromUnit and toUnit) designed to reference units of measure.
May be used to note the results of proof reading the text against its original, indicating (for example) whether discrepancies have been silently rectified, or recorded using the editorial tags described in section 3.4. Simple Editorial Changes.
Errors in transcription controlled by using the WordPerfect spelling checker, with a user defined dictionary of 500 extra words taken from Chambers Twentieth Century Dictionary.
For derivative texts, details of the ancestor may be included in the source description.
When used in a specification element such as
This element is intended primarily for use in document production or manipulation, rather than in the transcription of pre-existing materials; it makes it easier to specify the location of indices, tables of contents, etc., to be generated by text preparation or word processing software.
...
Cf. the general <date> element in the core tag set. This specialized element is provided for convenience in marking and processing the date of the documents, since it is likely to require specialized handling for many applications. It should be used only for the date of the entire document, not for any subset or part of it.
Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose.
The list presented here is primarily for illustrative purposes.
The content of <f> may be textual, with the assumption that the data type of the feature value is determined by the schema—this is the approach used in many language-technology-oriented projects and recommendations.
Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose
For many literary texts, a simple binary opposition between fiction and fact is naïve in the extreme; this parameter is not intended for purposes of subtle literary analysis, but as a simple means of characterizing the claimed fictiveness of a given text. No claim is made that works characterized as fact are in any sense true.
Where running heads are consistent throughout a chapter or section, it is usually more convenient to relate them to the chapter or section, e.g. by use of the rend attribute. The <fw> element is intended for cases where the running head changes from page to page, or where details of page layout and the internal structure of the running heads are of paramount importance.
The name
ctlig
like the following: per-glyph
as follows:
The <gap>, <unclear>, and <del> core tag elements may be closely allied in use with the <damage> and <supplied> elements, available when using the additional tagset for transcription of primary sources. See section 11.3.3.2. Use of the gap, del, damage, unclear, and supplied Elements in Combination for discussion of which element is appropriate for which circumstance.
The <gap> tag simply signals the editor's decision to omit or inability to transcribe a span of text. Other information, such as the interpretation that text was deliberately erased or covered, should be indicated using the relevant tags, such as <del> in the case of deliberate deletion.
The <handShift> element may be used to denote a shift in the document hand (as from one scribe to another, or one writing style to another), or a shift within a document hand, such as a change of writing style, character or ink. Like other milestone elements, it should appear at the point of transition from some other state to the state which it describes.
End-of-line hyphenation silently removed where appropriate
May contain character data and phrase-level elements. Typical content will be
<idno> should be used for labels which identify an object or concept in a formal cataloguing system such as a database or an RDF store, or in a distributed system such as the World Wide Web. Some suggested values for type on <idno> are ISBN, ISSN, DOI, and URI.
#sym
.How does it go?
...
Southern dialect (my own variety, at least) has only "I done gone", "I done went" whereas Negro Non-Standard basilect has both these and "I done go". White Southern dialect also has "I've done gone", "I've done went" which, when they occur in Negro dialect, should probably be considered as borrowings from other varieties of English.
Each individual keyword (including compound subject headings) should be supplied as a <term> element directly within the <keywords> element. An alternative usage, in which each <term> appears within an <item> inside a <list> is permitted for backwards compatibility, but is deprecated.
If no control list exists for the keywords used, then no value should be supplied for the scheme attribute.
British English and French
Particularly for sublanguages, an informal prose characterization should be supplied as content for the element.
Labels specifically relating to usage should be tagged with the special-purpose <usg> element rather than with the generic <lbl> element.
May contain an optional heading followed by a series of items, or a series of label and item pairs, the latter being optionally preceded by one or two specialized headings.
These decrees, most blessed Pope Hadrian, we propounded in the public council ... and they
confirmed them in our hand in your stead with the sign of the Holy Cross, and afterwards
inscribed with a careful pen on the paper of this page, affixing thus the sign of the Holy
Cross.
When this element appears within the <creation> element it documents the set of revision campaigns or stages identified during the evolution of the original text. When it appears within the <revisionDesc> element, it documents only changes made during the evolution of the encoded representation of that text.
The type attribute may be used to indicate the type of morpheme, taking values such as clitic, prefix, stem, etc. as appropriate.
The attributes available for this element are not appropriate in all cases. For example, it makes no sense to specify the temporal duration of a graphic. Such errors are not currently detected.
The mimeType attribute must be used to specify the MIME media type of the resource specified by the url attribute.
Where both upper- and lower-case i, j, u, v, and vv have been normalized, to modern
20th century typographical practice, the
Spacing between words and following punctuation has been regularized to zero spaces; spacing between words has been regularized to one space.
Spelling converted throughout to Modern American usage, based on Webster's 9th Collegiate dictionary.
Detailed analyses of quantities and units of measure in historical documents may also use the feature structure mechanism described in chapter 18. Feature Structures. The <num> element is intended for use in simple applications.
I reached
Light travels at 10
It is an error to supply the max attribute in the absence of a value for the value attribute.
The Royalof Arts
The content of this element may be used as an alternative to the more formal specification made possible by its attributes; it may also be used to supplement the formal specification with commentary or clarification.
Although the simplest form of a path is a straight line between two points, a line with more than two points may bend at any point. The order of coordinates in points is significant, because the line follows the coordinate sequence.
To specify a closed polygon, use the <zone> element rather than the <path> element.
May contain either a prose description organized as paragraphs, or a sequence of more specific demographic elements drawn from the
Female respondent, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2.
May contain a prose description organized as paragraphs, or any sequence of demographic elements in any combination.
The global xml:id attribute should be used to identify each speaking participant in a spoken text if the who attribute is specified on individual utterances.
Note that a persona is not the same as a role. A role may be assumed by different people on different occasions, whereas a persona is unique to a particular person, even though it may resemble others. Similarly, when an actor takes on or enacts the role of a historical person, they do not thereby acquire a new persona.
The abbreviated pointer may be dereferenced to produce either an absolute or a relative URI reference. In the latter case it is combined with the value of xml:base in force at the place where the pointing attribute occurs to form an absolute URI in the usual manner as prescribed by XML Base.
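The dereferencing step can be sketched as follows. This is an illustration under stated assumptions: the private scheme is described by a prefix, a regular expression for the part after the prefix, and a replacement pattern in which $1 stands for the first captured group; the function name `dereference`, the pattern, and the example.org locations are all hypothetical.

```python
import re
from urllib.parse import urljoin


def dereference(pointer, ident, match_pattern, replacement_pattern, base):
    """Expand an abbreviated pointer and resolve it against a base URI."""
    prefix = ident + ":"
    if not pointer.startswith(prefix):
        # Not in the private scheme: resolve as an ordinary URI reference.
        return urljoin(base, pointer)
    match = re.fullmatch(match_pattern, pointer[len(prefix):])
    if match is None:
        raise ValueError("pointer does not match the declared pattern")
    # Substitute $1, $2, ... with the corresponding captured groups.
    expanded = re.sub(
        r"\$(\d)", lambda g: match.group(int(g.group(1))), replacement_pattern
    )
    # A relative result is combined with the base in force (xml:base);
    # an absolute result passes through urljoin unchanged.
    return urljoin(base, expanded)


print(dereference(
    "ref:abc123", "ref", r"([a-z]+[0-9]+)", "../refs/entries.xml#$1",
    "http://example.org/corpus/texts/doc1.xml",
))
```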
In the context of this project, private URIs with
the prefix "ref" point to
All punctuation marks in the source text have been retained and represented using the
appropriate Unicode code point. In cases where a punctuation mark and nearby markup convey
the same information (for example, a sentence ends with a question mark and is also tagged
as
I would agree with Saint Augustine that “An unjust law is no law at all
.”
I would agree with Saint Augustine that “An unjust law is no law at all.”
Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose
May be used to indicate that a passage is distinguished from the surrounding text for reasons concerning which no claim is made. When used in this manner, <q> may be thought of as syntactic sugar for <hi> with a value of rend that indicates the use of such mechanisms as quotation marks.
Tübingen: to enter the letter u with an umlaut hold down the option key and press 0 0 f c
No quotation marks have been retained. Instead, the
All quotation marks are retained in the text and are represented by appropriate Unicode characters.
The dur attribute is used to indicate the original duration of the recording.
Recorded on a Sony TR444 walkman by unknown participants; remastered
to digital tape at
Recorded from FM Radio to digital tape
g_bl
using strikethroughg_t
)
If the target attribute is used to reference the related bibliographic item, the element must be empty.
Only one of the attributes active and mutual may be supplied; the attribute passive may be supplied only if the attribute active is supplied. Not all of these constraints can be enforced in all schema languages.
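Because not every schema language can express these co-occurrence constraints, a project may wish to check them in a separate step. The following is a minimal sketch of such a check over a plain dictionary of attribute names and values; the function name `relation_attribute_errors` and the sample pointer values are hypothetical.

```python
def relation_attribute_errors(attrs):
    """Return a list of constraint violations for one set of attributes."""
    errors = []
    # active and mutual are mutually exclusive.
    if "active" in attrs and "mutual" in attrs:
        errors.append("only one of active and mutual may be supplied")
    # passive is meaningful only alongside active.
    if "passive" in attrs and "active" not in attrs:
        errors.append("passive may be supplied only if active is supplied")
    return errors


print(relation_attribute_errors({"active": "#p1", "passive": "#p2"}))
print(relation_attribute_errors({"mutual": "#p1 #p2", "passive": "#p3"}))
```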
Harry gripped the edges of the stool and thought,
May contain character data mixed with any other elements defined in the dictionary tag set.
La valeur n'attend pas le nombre des années
As with other culturally-constructed traits such as age, the way in which this concept is described in different cultural contexts may vary. The normalizing attributes are provided only as an optional means of simplifying that variety to one or more external standards for purposes of interoperability, or project-internal taxonomies for consistency, and should not be used where that is inappropriate or unhelpful. The content of the element may be used to describe the intended concept in more detail, using plain text.
"Elizabeth" is spoken loudly, the words "Yes" and "Come and try this" with normal volume, and the words "come on" very loudly.
The content of this element may be used as an alternative to the more formal specification made possible by its attributes; it may also be used to supplement the formal specification with commentary or clarification.
This element should be used wherever it is desired to record an unusual space in the source text, e.g. space left for a word to be filled in later, for later rubrication, etc. It is not intended to be used to mark normal inter-word space or the like.
(The "aftermath" starts here)
(The "aftermath" continues here)
(The "aftermath" ends in this paragraph)
aftermath
The who attribute may be used to indicate more precisely the person or persons participating in the action described by the stage direction.
The <damage>, <gap>, <del>, <unclear> and <supplied> elements may be closely allied in use. See section 11.3.3.2. Use of the gap, del, damage, unclear, and supplied Elements in Combination for discussion of which element is appropriate for which circumstance.
The <surface> element represents any two-dimensional space on some physical surface forming part of the source material, such as a piece of paper, a face of a monument, a billboard, a scroll, a leaf etc.
The coordinate space defined by this element may be thought of as a grid lrx - ulx units wide and uly - lry units high.
The <surface> element may contain graphic representations or transcriptions of written zones, or both. The coordinate values used by every <zone> element contained by this element are to be understood with reference to the same grid.
Where it is useful or meaningful to do so, any grouping of multiple <surface> elements may be indicated using the <surfaceGrp> element.
Contains an optional heading and a series of rows.
Any rendition information should be supplied using the global rend attribute, at the table, row, or cell level as appropriate.
Should contain one TEI header for the corpus, and a series of <TEI> elements, one for each text.
This element should not be used to document the languages or writing systems used for the bibliographic or manuscript description itself: as for all other TEI elements, such information should be provided by means of the global xml:lang attribute attached to the element containing the description.
In all cases, languages should be identified by means of a standardized
The attributes key and ref, inherited from the class
Prose and a mixture of speech elements
Although individual transcriptions may consistently use <u> elements for turns or other units, and although in most cases a <u> will be delimited by pause or change of speaker, <u> is not required to represent a turn or any communicative event, nor to be bounded by pauses or change of speaker. At a minimum, a <u> is some phonetic production by a given speaker.
The same element is used for all cases of uncertainty in the transcription of element content, whether for written or spoken material. For other aspects of certainty, uncertainty, and reliability of tagging and transcription, see chapter 21. Certainty, Precision, and Responsibility.
The <damage>, <gap>, <del>, <unclear> and <supplied> elements may be closely allied in use. See section 11.3.3.2. Use of the gap, del, damage, unclear, and supplied Elements in Combination for discussion of which element is appropriate for which circumstance.
The hand attribute points to a definition of the hand concerned, as further discussed in section 11.3.2.1. Document Hands.
A definitive list of current Unicode property names is provided in The Unicode Standard.
A definitive list of current Unihan property names is provided in the Unicode Han Database.
On this element, the global xml:id attribute must be supplied to specify an identifier for this point in time. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence.
The <writing> element will usually be short and most simply transcribed as a character string; the content model also allows a sequence of paragraphs and paragraph-level elements, in case the writing has enough internal structure to warrant such markup. In either case the usual phrase-level tags for written text are available.
May contain character data and phrase-level elements; usually contains a <ref> or a <ptr> element.
This element encloses both the actual indication of the location referred to, which may be tagged using the <ref> or <ptr> elements, and any accompanying material which gives more information about why the reader is being referred there.
The position of every zone for a given surface is always defined by reference to the coordinate system defined for that surface.
A graphic element contained by a zone represents the whole of the zone.
A zone may be of any shape. The attribute points may be used to define a polygonal zone, using the coordinate system defined by its parent surface.
A zone is always a closed polygon. Repeating the initial coordinate at the end of the sequence is optional. To encode an unclosed path, use the <path> element.
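The difference between an open path and a closed zone can be made concrete with a short sketch. It assumes that the value of points is a whitespace-separated list of x,y pairs; the function names and the sample rectangle are hypothetical.

```python
import math


def parse_points(points):
    """Turn 'x,y x,y ...' into a list of (x, y) tuples, in document order."""
    coords = []
    for pair in points.split():
        x, y = pair.split(",")
        coords.append((float(x), float(y)))
    return coords


def path_length(coords):
    """Length of an open path: the line follows the coordinate sequence."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))


def zone_outline(coords):
    """A zone is closed; repeat the first point only if it is not repeated."""
    return coords if coords[0] == coords[-1] else coords + [coords[0]]


def zone_area(coords):
    """Area of the closed polygon, using the shoelace formula."""
    ring = zone_outline(coords)
    total = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(total) / 2


corners = parse_points("0,0 4,0 4,3 0,3")
print(path_length(corners), zone_area(corners))
```

The same four points describe three sides when read as a path, but a full rectangle when read as a zone, whether or not the initial coordinate is repeated at the end.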
This
The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by ISO 8601, using the Gregorian calendar.
If both when-iso and dur-iso are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. That is,
In providing a
The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by
The most commonly-encountered format for the date portion of a temporal attribute is yyyy-mm-dd, but yyyy, --mm, ---dd, yyyy-mm, or --mm-dd may also be used. For the time part, the form hh:mm:ss is used.
Note that this format does not currently permit use of the value 0000 to represent the year 1 BCE; instead the value -0001 should be used.
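The formats listed above can be told apart by shape alone. The sketch below is a rough classifier, not a validator: it ignores time zones and does not check that months or days are in range. The names `PATTERNS` and `classify` are hypothetical.

```python
import re

# The year may be negative and may have more than four digits.
YEAR = r"(?P<year>-?\d{4,})"

PATTERNS = [
    ("yyyy-mm-dd", re.compile(YEAR + r"-\d{2}-\d{2}")),
    ("yyyy-mm", re.compile(YEAR + r"-\d{2}")),
    ("yyyy", re.compile(YEAR)),
    ("--mm-dd", re.compile(r"--\d{2}-\d{2}")),
    ("--mm", re.compile(r"--\d{2}")),
    ("---dd", re.compile(r"---\d{2}")),
    ("hh:mm:ss", re.compile(r"\d{2}:\d{2}:\d{2}")),
]


def classify(value):
    """Return the name of the matching format, or None if none matches."""
    for name, pattern in PATTERNS:
        match = pattern.fullmatch(value)
        if match:
            year = match.groupdict().get("year")
            # The year 0000 is not permitted; 1 BCE is written -0001.
            if year is not None and year.lstrip("-") == "0000":
                return None
            return name
    return None


for sample in ("1950-01-12", "--05", "-0001", "0000"):
    print(sample, classify(sample))
```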
ISO 12620:2009 is a standard describing the data model and procedures for a Data Category Registry (DCR). Data categories are defined as elementary descriptors in a linguistic structure. In the DCR data model each data category gets assigned a unique Persistent IDentifier (PID), i.e., a URI. Linguistic resources or preferably their schemas that make use of data categories from a DCR should refer to them using this PID. For XML-based resources, like TEI documents, ISO 12620:2009 normative Annex A gives a small Data Category Reference XML vocabulary (also available online at http://www.isocat.org/12620/), which provides two attributes, dcr:datcat and dcr:valueDatcat.
The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter 15.3. Associating Contextual Information with a Text. Only one element of a particular type may have a default attribute with a value of true.
The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter 15.3. Associating Contextual Information with a Text.
If both when and dur or dur-iso are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. In order to represent a time range by a duration and its ending time the when-iso attribute must be used.
In providing a
If both when and dur are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. In order to represent a time range by a duration and its ending time the when-iso attribute must be used.
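Reading when and dur together as a span can be sketched as follows. This is a simplified illustration that assumes durations built only from days, hours, minutes, and seconds (for example P3D or PT90M); years and months are left out because they need calendar arithmetic. The names `parse_duration` and `span` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Simplified duration pattern: days, then an optional time part.
DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_duration(text):
    """Convert a simple duration such as 'P3D' or 'PT90M' to a timedelta."""
    match = DURATION.fullmatch(text)
    parts = {}
    if match:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError("unsupported duration: %r" % text)
    return timedelta(**parts)


def span(when, dur):
    """Interpret when as the starting point and dur as the length of the span."""
    start = datetime.fromisoformat(when)
    return start, start + parse_duration(dur)


start, end = span("2024-01-30", "P3D")
print(start.date(), "to", end.date())
```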
In providing a
The members of this attribute class are typically used to represent any kind of editorial intervention in a text, for example a correction or interpretation, or to date or localize manuscripts etc.
Each pointer on the source (if present) corresponding to a witness or witness group should reference a bibliographic citation such as a
Looking into the future aeons from the supreme moment of
the cosmos, I saw the populations still with all their
strength maintaining the
The global n attribute may be used to encode the homograph numbers attached to entries for homographs.
This attribute class provides the formula attribute for use in defining a value used in mathematical calculation. It can be used to store a mathematical operation needed to convert from one system of measurement to another. We use the teidata.xpath datatype to express this value in order to communicate mathematical operations on an XML node or nodes. The $fromUnit variable notation simplifies referencing of the fromUnit attribute on the parent <conversion> element. Note that div is required to express the division operator in XPath.
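To show how such a formula might be applied, the sketch below evaluates simple arithmetic formulas that use $fromUnit, numbers, parentheses, and the operators +, -, * and div. It is not an XPath engine, only a small stand-in for illustration; the function name `evaluate_formula` and the sample formulas are hypothetical.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, from_unit):
    """Evaluate a simple arithmetic formula with $fromUnit bound to a number."""
    # Rewrite the XPath spellings into Python syntax before parsing.
    expr = formula.replace("$fromUnit", "fromUnit")
    expr = re.sub(r"\bdiv\b", "/", expr)
    tree = ast.parse(expr, mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id == "fromUnit":
            return from_unit
        # Anything else (function calls, other names) is rejected.
        raise ValueError("unsupported expression in formula")

    return walk(tree)


# Hypothetical conversion: twelve smaller units make one larger unit.
print(evaluate_formula("$fromUnit div 12", 36))
```

Walking the parsed expression tree, rather than passing the string to `eval`, keeps the sketch limited to arithmetic and rejects anything else.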
All name-only attributes need an xs:boolean attribute value inside value.
As Willard McCarty (‘Collaboration’ is a problematic and should be a contested
term.
Grammatical theories are in flux, and the more we learn, the
less we seem to know.
Usually either script or scriptRef, and similarly, either scribe or scribeRef, will be supplied.
This attribute class provides an attribute for describing a computer resource, typically available over the internet, using a value taken from a standard taxonomy. At present only a single taxonomy is supported, the Multipurpose Internet Mail Extensions (MIME) Media Type system. This typology of media types is defined by the Internet Engineering Task Force in RFC 2046. The list of types is maintained by the Internet Assigned Numbers Authority (IANA). The mimeType attribute must have a value taken from this list.
It needs to be stressed that the two attributes in this class are meant for strictly lexicographic and linguistic uses, and not for editorial interventions. For the latter, the mechanism based on <choice>, <orig>, and <reg> needs to be employed.
These attributes make it possible to encode simple language corpora and to add a layer of linguistic information to any tokenized resource. See section 17.4.2. Lightweight Linguistic Annotation for discussion.
This attribute class provides a triplet of attributes that may be used either to regularize the values of the measurement being encoded, or to normalize them with respect to a standard measurement system.
The unit should normally be named using the standard symbol for an SI unit (see further lines or characters.
The span is defined as running in document order from the start of the content of the pointing element to the end of the content of the element pointed to by the spanTo attribute (if any). If no value is supplied for the attribute, the assumption is that the span is coextensive with the pointing element. If no content is present, the assumption is that the starting point of the span is immediately following the element itself.
When appropriate, values from an established typology should be used. Alternatively a typology may be defined in the associated TEI header. If values are to be taken from a project-specific list, this should be defined using the