This API is intended to serve as a backend to XML-based annotation APIs or as middleware to annotations in other formats where the use of XML-based tools is needed for the purposes of querying and creation of annotation bases. We now describe the subset of tasks on annotation bases that can be handled by this API and those that cannot.
There is an important class of queries that CANNOT be handled. This is the "online corpus-level query", i.e. a query that runs over the entire corpus or a large subset of it and whose results must be obtained immediately. This is not a limitation of this API per se; it arises because XML-based representations of corpora are much larger than textual representations.
Annotations encode structure in some way, and representing this structure arbitrarily in memory causes us to lose the ability to use standard APIs to work with them. To address this issue, we use the notion of a "View" (from Relational Database Management Systems, RDBMS). A view is any structure computed from the underlying base. In RDBMSs the only structure available was the table, and hence all views were tables derived from an underlying set of tables.
We define a View to be any structure computed from the underlying annotation base. It is not possible for an API to handle all possible views, but we aim to define a set of base types from which views can be derived rather easily. This version of the API provides support for trees and graphs. We aim to add types for tables and collections in future versions once issues regarding their encoding and use are better understood.
This kind of typing is achieved through the XML Schema type definition language, and the core schema for views is defined as follows. See the schema docs for documentation in another format.
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:vw="http://www.annotation.org/agschema/atlas/ag/view"
        targetNamespace="http://www.annotation.org/agschema/atlas/ag/view"
        elementFormDefault="qualified">
  <annotation>
    <documentation>
      This lays out the base types for views of an annotation. The view chart
      contains a list of elements corresponding to the roots of your annotations
      and a list of views. This need not be the annotation graph format but it
      must be an instance of some XML Schema type. Any number of views can be
      attached to this view chart. This allows one to store annotations in
      multiple files and put them together for the purposes of querying and
      editing on a ViewChart. See also the Reference type and the
      org.annotation.agschema.atlas.ag.loadstore package in the Java API.
      @author Nikhil Dinesh
    </documentation>
  </annotation>
  <complexType name="View" abstract="true">
    <annotation>
      <documentation>
        A view can be anything but it should derive from this type by extension.
        See the tree view in namespace
        http://www.annotation.org/agschema/atlag/ag/view/tree
      </documentation>
    </annotation>
    <attribute name="id" type="ID" />
    <attribute name="type" type="string" />
  </complexType>
  <complexType name="ViewChart">
    <annotation>
      <documentation>
        The view chart contains the root of a set of annotations as well as a
        list of views. Note that instances of this type should not be the
        persistent form of your data. The views should be computed in memory.
        If a frequently desired view is computationally very expensive, then
        you might want to rethink your annotation format.
      </documentation>
    </annotation>
    <sequence>
      <any namespace="##other" minOccurs="1" maxOccurs="unbounded" />
      <element name="View" type="vw:View" minOccurs="0" maxOccurs="unbounded" />
    </sequence>
  </complexType>
  <element name="ViewChart" type="vw:ViewChart" />
  <complexType name="Reference">
    <annotation>
      <documentation>
        This is a utility type from which most of our structure types like
        trees and graphs derive. The attribute is named idRef because it is
        usually a reference to some annotation which has an ID. However for
        offline queries we might need to serialize views independent of
        annotations for browsing or other purposes and so it has the type
        NMTOKEN.
      </documentation>
    </annotation>
    <sequence>
      <any namespace="http://www.annotation.org/agschema/atlas/ag/view/userobject"
           minOccurs="0" maxOccurs="1" />
    </sequence>
    <attribute name="idRef" type="NMTOKEN" use="required" />
  </complexType>
</schema>
What this means is that multiple annotations residing in different locations can be put together on the view chart. The views you define should derive from the type View. See the definitions for trees and graphs in the schema docs. The Reference type is used in views to refer to annotations in the ViewChart. Most of the subtypes of view provided here rely on the Reference type. For example in the TreeView each TreeNode is a Reference type. The TreeNode objects we compute implement both org.w3c.dom.Node and javax.swing.tree.TreeNode so you can switch between either.
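As a rough illustration of how a Reference points back into the ViewChart, the sketch below builds a small chart by hand with the JDK's DOM API, indexes the annotation roots by an id attribute, and resolves the idRef of a tree node. It does not use this API's classes. The class name ReferenceLookupSketch, the namespaces urn:example:annotations and urn:example:treeview, the element names, and the assumption that annotations carry a plain "id" attribute are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class ReferenceLookupSketch {
    static final String VIEW_NS = "http://www.annotation.org/agschema/atlas/ag/view";
    static final String ANN_NS = "urn:example:annotations";  // hypothetical annotation namespace
    static final String TREE_NS = "urn:example:treeview";    // hypothetical stand-in for the tree view namespace

    /** Builds a chart with one annotation root and one view holding one reference. */
    static Document buildChart() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element chart = doc.createElementNS(VIEW_NS, "vw:ViewChart");
            doc.appendChild(chart);

            Element annotation = doc.createElementNS(ANN_NS, "ann:Annotation");
            annotation.setAttribute("id", "a1");
            annotation.setTextContent("NP");
            chart.appendChild(annotation);

            Element view = doc.createElementNS(VIEW_NS, "vw:View");
            view.setAttribute("id", "v1");
            Element treeNode = doc.createElementNS(TREE_NS, "tree:TreeNode");
            treeNode.setAttribute("idRef", "a1");
            view.appendChild(treeNode);
            chart.appendChild(view);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Indexes elements with an "id" attribute under the annotation roots, skipping vw:View children. */
    static Map<String, Element> indexAnnotations(Element chart) {
        Map<String, Element> index = new HashMap<>();
        for (Node c = chart.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (!(c instanceof Element)) continue;
            Element e = (Element) c;
            boolean isView = VIEW_NS.equals(e.getNamespaceURI()) && "View".equals(e.getLocalName());
            if (!isView) collect(e, index);
        }
        return index;
    }

    private static void collect(Element e, Map<String, Element> index) {
        if (e.hasAttribute("id")) index.put(e.getAttribute("id"), e);
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) collect((Element) c, index);
        }
    }

    /** Returns the annotation a reference element points at, or null if the idRef is unknown. */
    static Element resolve(Map<String, Element> index, Element reference) {
        return index.get(reference.getAttribute("idRef"));
    }

    public static void main(String[] args) {
        Document doc = buildChart();
        Map<String, Element> index = indexAnnotations(doc.getDocumentElement());
        Element ref = (Element) doc.getElementsByTagNameNS(TREE_NS, "TreeNode").item(0);
        System.out.println(resolve(index, ref).getTextContent());
    }
}
```

Because idRef is typed as NMTOKEN rather than IDREF, a lookup like this can legitimately return null when a view has been serialized without its annotations, so callers should be prepared for that case.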
A schema with targetNamespace="http://www.annotation.org/agschema/atlas/ag/view" would have its implementation in the package org.annotation.agschema.atlas.ag.view. Any annotation API must have a default representation of annotations so that convenience methods can be provided. Our default is Annotation Graphs [1], chosen because almost anything can be encoded in them. The in-memory representation of this is given by the schema http://www.annotation.org/agschema/atlas/ag/rep. If you choose to go with something other than Annotation Graphs, you would have to do the following two steps:
1. Define your in-memory representation as an XML Schema and provide Java APIs which handle it. The classes provided should extend AGComplexTypeBase. Providing classes is optional; the same effect can be achieved by setting the org.annotation.agschema.atlas.ag.AGDocument to non-strict mode, but you would then be working with DOM for the most part.
2. Define ContentHandlers and LexicalHandlers for your data. You should be able to use a few of the default ones provided in the org.annotation.agschema.atlas.ag.loadstore package to aid you.
All the other packages, including the signal handling mechanism, are generic but provide convenience methods for Annotation Graphs which you may or may not be able to use. Also note that it is entirely possible to extend the Annotation Graph Schema representation in any way you see fit, but these are not stable standards yet, so preserve your persistent representation external to the schemas defined by this API.
The following conventions are adopted but not enforced:
a. Unique type attribution - XML Schema allows you to distinguish between elements by name and type, but our Java API only distinguishes by type. So in XML Schema you could define:
  <complexType name="foo">
    <sequence>
      <element name="a" type="bar" />
      <element name="b" type="bar" />
    </sequence>
  </complexType>
But the way our Java API is set up, we do not treat names as significant, so the elements a and b above would not be told apart. The reason for this is complicated and I will not get into that here. However, we do not prevent you from using such types; it is only that no convenience methods can be provided for them. We also avoid the use of anonymous types.
b. One namespace, one schema - The way our XmlEntityResolver is set up, it requires a schema to be uniquely identified by its targetNamespace. For the schemas to be truly portable, we avoid the use of the schemaLocation attribute anywhere. If you notice it in the schemas you obtain, it is only there for the schema documentation tool used and serves no other purpose. As with everything else in the API, this is not a strict rule. You may set up your own XmlEntityResolver. Our default resolver also requires that these schemas be present on the classpath. See AGXMLEntityResolver for details. DTDs are resolved by publicId. The only DTD that comes with this API is the Annotation Graph DTD.
c. For the Annotation Graph Schema, Features whose names have the prefix ag.er are assumed to contain a list of IDREFS, and Features whose names have the prefix ag.ec are assumed to contain a list of NMTOKENS. This is so that we know where all the references are and multiple documents can be united easily. Note that as per the AG DTD the Feature element can only contain Text children, but I have observed people using markup in there. That is why the Feature has anyType as opposed to a string type. The default feature is http://www.annotation.org/agschema/atlas/ag/rep/feature:TextFeature and it is recommended that only that be used, so that the Annotation Graph Schema is always interoperable with the DTD.
d. All the schema representations are assumed to be in-memory representations. It is recommended that the persistent form be whatever representation you usually use. For annotation graphs, methods are provided to serialize to a form which respects the DTD. The DOCTYPE declaration recommended is <!DOCTYPE AGSet PUBLIC "-//LDC/DTD AG 1.0//EN" "ag.dtd">. The publicId is defined by this API and is not official. We keep the systemId as "ag.dtd" so that other applications written for Annotation Graphs can be used without using our methods of resolving entities.
e. Validation - PSVI stands for Post Schema-Validation Infoset [4], which is provided by the Xerces parser [5]. Validation of instance documents against schemas is expensive, and so only our configuration APIs depend on PSVI. Everything else is non-validating and PSVI independent. We rely on Xerces to do the validation, and it is up to you to decide when and if you want to validate. A good place to use validation is in the pre-serialization phase, because the end user will not notice the delay, and it is also a good debugging tool.
The following are requirements:
a. Interpretation of the ViewChart "any" element - Note that the definition of the ViewChart as it stands allows you to include subtypes of views as the "any" elements, because all we require is that they be from some other namespace. The "any" elements are for your annotations and these should NOT be instances of Views. Doing this will cause the API to malfunction.
b. Reserved namespaces - All namespaces starting with http://www.annotation.org/agschema are reserved for internal use, except http://www.annotation.org/agschema/atlas/ag/view/userobject.
The packages that correspond to the schemas are easily accessible. We discuss the packages which are not.
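Two of the rules above are mechanical enough to sketch in a few lines: the correspondence between a targetNamespace and its Java package, and the reserved-namespace requirement. The code below is only an illustration. The mapping rule (drop a leading "www.", reverse the host, append the path segments) is inferred from the single example given above and may not match what the API actually does, and the class and method names are hypothetical.

```java
import java.net.URI;

final class NamespaceConventionsSketch {
    static final String RESERVED_PREFIX = "http://www.annotation.org/agschema";
    static final String USER_OBJECT_NS =
            "http://www.annotation.org/agschema/atlas/ag/view/userobject";

    /** True if a user-defined schema should not use this namespace (requirement b). */
    static boolean isReserved(String namespace) {
        return namespace != null
                && namespace.startsWith(RESERVED_PREFIX)
                && !namespace.equals(USER_OBJECT_NS);
    }

    /**
     * Assumed mapping from an http targetNamespace to a package name:
     * reversed host (without "www.") followed by the path segments.
     */
    static String toPackageName(String namespace) {
        URI uri = URI.create(namespace);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in namespace: " + namespace);
        }
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        String[] hostParts = host.split("\\.");
        StringBuilder pkg = new StringBuilder();
        for (int i = hostParts.length - 1; i >= 0; i--) {
            if (pkg.length() > 0) pkg.append('.');
            pkg.append(hostParts[i]);
        }
        for (String segment : uri.getPath().split("/")) {
            if (!segment.isEmpty()) pkg.append('.').append(segment);
        }
        return pkg.toString();
    }

    public static void main(String[] args) {
        String ns = "http://www.annotation.org/agschema/atlas/ag/view";
        System.out.println(toPackageName(ns));
        System.out.println(isReserved(ns));
        System.out.println(isReserved(USER_OBJECT_NS));
    }
}
```

A check like isReserved is most useful when registering a schema of your own, before any documents are loaded against it.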
As with any standoff architecture, it is assumed that the SignalInterval suffices to retrieve your data. The units of the end points of these intervals are up to you. This assumption may not hold for speech and video signals, where something like the frame rate may also be needed. Hopefully a simple extension of the SignalInterval carrying the additional information will suffice for these.
Note that XPath will not be the whole answer to all your queries. For nested queries and the like you will need a combination of XPath and programming through the DOM interface or the methods provided by the API. You can switch freely between all these modes. From my tests, Xalan's XPath API is faster if you want to do multiple queries on a static DOM. If you need to do XPath on a rapidly changing DOM, the performance gap closes and Jaxen is more memory efficient.
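The following sketch shows one way to mix the two modes on a static DOM: an XPath expression finds candidate nodes, ordinary DOM navigation filters them, and a second precompiled expression is evaluated relative to each surviving node. It uses the JDK's standard javax.xml.xpath package as a stand-in rather than the Xalan or Jaxen APIs mentioned above, and the toy document and the class name XPathPlusDomSketch are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathPlusDomSketch {
    // Toy parse tree for "dogs chase cats"; not an Annotation Graph document.
    static final String XML =
            "<S><NP><N>dogs</N></NP><VP><V>chase</V><NP><N>cats</N></NP></VP></S>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the noun of every NP whose parent is a VP. */
    static List<String> nounsInsideVerbPhrases(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Compile once; both expressions can be reused across many evaluations.
            XPathExpression allNps = xpath.compile("//NP");
            XPathExpression nounText = xpath.compile("N/text()");

            NodeList hits = (NodeList) allNps.evaluate(doc, XPathConstants.NODESET);
            List<String> nouns = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element np = (Element) hits.item(i);
                // DOM step: keep only NPs directly under a VP.
                Node parent = np.getParentNode();
                if (!(parent instanceof Element)) continue;
                if (!"VP".equals(((Element) parent).getTagName())) continue;
                // Nested XPath step, evaluated with the NP as context node.
                nouns.add((String) nounText.evaluate(np, XPathConstants.STRING));
            }
            return nouns;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(nounsInsideVerbPhrases(parse(XML)));
    }
}
```

The parent test here could also be written in XPath; it is done through the DOM only to show the hand-off between the two styles.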
To minimize the amount of programming needed and to keep speed high, use the setUserData methods on the nodes. This is different from the setUserObject methods, which are intended for XML structure. For example, if you have two AGTreeViews and need pointers from one to the other, the setUserObject method can be used for labels and other information local to the node, and setUserData can be used to maintain the pointers. This will be especially useful while editing or showing a graphical display.
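A minimal sketch of the pointer idea, using only the standard DOM Level 3 method Node.setUserData on plain JDK documents rather than AGTreeViews: two nodes in different trees are linked to each other under a shared key. The key string, the element names, and the class name UserDataLinkSketch are assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class UserDataLinkSketch {
    // Hypothetical key; any string unique to your application will do.
    static final String LINK_KEY = "example.alignedNode";

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stores a pointer in each node to the other; nothing is written into the XML. */
    static void link(Node a, Node b) {
        a.setUserData(LINK_KEY, b, null); // null handler: no callbacks on clone/import
        b.setUserData(LINK_KEY, a, null);
    }

    /** Returns the node linked to n, or null if there is none. */
    static Node linked(Node n) {
        return (Node) n.getUserData(LINK_KEY);
    }

    public static void main(String[] args) {
        Document first = newDocument();
        Element nodeA = first.createElement("TreeNode");
        first.appendChild(nodeA);

        Document second = newDocument();
        Element nodeB = second.createElement("TreeNode");
        second.appendChild(nodeB);

        link(nodeA, nodeB);
        System.out.println(linked(nodeA) == nodeB);
    }
}
```

Since user data lives outside the document content, these links are not serialized and have to be rebuilt after a reload.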
There is one type of query that cannot be handled by this API: the online multi-file query. That is, a single query is run against a large part of the annotation base and the results are wanted immediately. In my opinion no XML-based annotation API will be able to support this kind of query for some time to come, because serialization and object creation times are bottlenecks. We can get away with it if results are dense, so that further results can be fetched while the user looks at the current ones (this may be the case in early annotation correction phases), but in general this kind of query should not be attempted with this API.
Tables and Collection views will be implemented if someone is willing to send us examples of use. Also, the graph API should respect some generic interface which has a lot of algorithms implemented.
Support for simple types needs to be added. For example, when we have a list of IDREFs it seems unreasonable to have to split it and create new Strings. Rather, a list view of the one String should be maintained. But this is a little tricky to implement, in that it is not quite clear whether such views should have live DOM support and whether they should be created through the DOM events API or by some other mechanism.
It would be nice to have an internal mechanism (by implementing some interface) for XPath, so that we can speed it up and use it a little more carelessly.
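One possible shape for such a list view is sketched below: the whitespace-separated tokens of a single String are exposed as a java.util.List without allocating a new String per token, by recording offsets once and handing out CharBuffer windows over the original text. This is a static snapshot only; it does not attempt the live DOM support discussed above, and the class name TokenListView is hypothetical.

```java
import java.nio.CharBuffer;
import java.util.AbstractList;

final class TokenListView extends AbstractList<CharSequence> {
    private final String source;
    private final int[] starts; // inclusive start offset of each token
    private final int[] ends;   // exclusive end offset of each token

    TokenListView(String source) {
        this.source = source;
        // First pass: count tokens so the offset arrays can be sized exactly.
        int count = 0;
        boolean inToken = false;
        for (int i = 0; i < source.length(); i++) {
            boolean space = isXmlSpace(source.charAt(i));
            if (!space && !inToken) count++;
            inToken = !space;
        }
        starts = new int[count];
        ends = new int[count];
        // Second pass: record where each token begins and ends.
        int token = -1;
        inToken = false;
        for (int i = 0; i < source.length(); i++) {
            boolean space = isXmlSpace(source.charAt(i));
            if (!space && !inToken) starts[++token] = i;
            if (!space) ends[token] = i + 1;
            inToken = !space;
        }
    }

    private static boolean isXmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    /** Returns a read-only window over the original String; no characters are copied. */
    @Override
    public CharSequence get(int index) {
        return CharBuffer.wrap(source, starts[index], ends[index]);
    }

    @Override
    public int size() {
        return starts.length;
    }

    public static void main(String[] args) {
        TokenListView ids = new TokenListView("  a1 a2\n a3 ");
        for (CharSequence id : ids) {
            System.out.println(id);
        }
    }
}
```

The elements are CharSequences rather than Strings, so comparisons against a String should go through String.contentEquals or an explicit toString().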
[1] Steven Bird and Mark Liberman. 1999. Annotation graphs as a framework for multidimensional linguistic data analysis. In Towards Standards and Tools for Discourse Tagging -- Proceedings of the Workshop, pages 1--10. Somerset, NJ: Association for Computational Linguistics. [xxx.lanl.gov/abs/cs.CL/9907003]
[2] XML Schema Specification