The Annotation Graph Schema API

This work is a derivative of the Annotation Graph definition and implementation. The terms XML, DTD, XML Schema, DOM, XSLT, XPath, XQuery and XPointer refer to the specifications of the World Wide Web Consortium. This document supplements the API Documentation and Schema Documentation and is under revision. We would like to thank Tom Morton and Jeremy Lacivita for granting us permission to use the annotation.org namespace.

Introduction

This API is intended to serve as a backend to XML-based annotation APIs, or as middleware to annotations in other formats where XML-based tools are needed for querying and creating annotation bases. We now describe the subset of tasks on annotation bases that can be handled by this API and those that cannot.

Online tasks - These are tasks such as editing, browsing and querying, for which the results of a user-initiated action must be displayed immediately (online). This API aims to reduce the time needed to deploy editors and browsers by handling the representational components in a general and extensible way.
There is an important class of queries that CANNOT be handled: the "online corpus-level query", i.e. a query that runs over the entire corpus or a large subset of it and whose results must be obtained immediately. This is not a limitation of this API per se; it arises because XML-based representations of corpora are much larger than textual representations.
Offline tasks - These are tasks such as testing features for machine learning, or any queries where the user is willing to wait. For these tasks the API offers a lot of convenience and lets you easily plug in standard querying/selection mechanisms like XPath. A significant amount of effort has gone into making these tasks as fast as possible, so the API can also handle much of the grey area between online and offline tasks.

Here are some features of this API:

Annotations can be stored in any format as long as they are read into memory as XML. This lets the API integrate with whatever representation you have. Our mechanism for doing this is the LoadStore package, which is quite efficient and should be simple to use if you have worked with TrAX (javax.xml.transform) before.
Annotations in multiple XML files can be united in memory. This lets you keep various annotation projects separate and work with any combination of them together.
We have picked the Annotation Graph representation as our in-memory default so that utility methods can be provided, but it would be reasonably easy for you to make this work with any other representation.

The View Chart

Annotations encode structure in some way, and representing this structure arbitrarily in memory causes us to lose the ability to use standard APIs to work with them. To address this issue, we use the notion of a "View" (from Relational Database Management Systems (RDBMS)). A view is any structure computed from the underlying base. In an RDBMS the only structure available is the table, and hence all views are tables derived from an underlying set of tables.

We define a View to be any structure computed from the underlying annotation base. It is not possible for an API to handle all possible views, but we aim to define a set of base types from which views can be derived rather easily. This version of the API provides support for trees and graphs. We aim to add types for tables and collections in future versions once issues regarding their encoding and use are better understood.

This kind of typing is achieved through the XML Schema type definition language, and the core schema for viewing is defined as follows. See the schema docs for documentation in another format.

 <schema xmlns="http://www.w3.org/2001/XMLSchema" 
        xmlns:vw="http://www.annotation.org/agschema/atlas/ag/view"
        targetNamespace="http://www.annotation.org/agschema/atlas/ag/view" 
        elementFormDefault="qualified">

   <annotation>
      <documentation>

        This lays out the base types for views of an annotation. The view chart
        contains a list of elements corresponding to the roots of your
        annotations and a list of views.

        This need not be the annotation graph format but it must be an instance
        of some XML Schema type. Any number of views can be attached to this view chart.

        This allows one to store annotations in multiple files and put them together for
        the purposes of querying and editing on a ViewChart. 

        See also the Reference type and the org.annotation.agschema.atlas.ag.loadstore package in
        the Java API.

        @author Nikhil Dinesh

      </documentation>
   </annotation>

   <complexType name="View" abstract="true" >
      <annotation>
        <documentation>

          A view can be anything but it should derive from this type by extension. 

          See the tree view in namespace http://www.annotation.org/agschema/atlas/ag/view/tree

        </documentation>
      </annotation>
      <attribute name="id" type="ID" />
      <attribute name="type" type="string" /> 
   </complexType>


   <complexType name="ViewChart">
     <annotation>
       <documentation>
         
         The view chart contains the roots of a set of annotations as well as a list of 
         views. Note that instances of this type should not be the persistent form 
         of your data. The views should be computed in memory. If a frequently desired 
         view is computationally very expensive, then you might want to rethink your
         annotation format.
        
       </documentation>
     </annotation>
     <sequence>
       <any namespace="##other" minOccurs="1" maxOccurs="unbounded"/>
       <element name="View" type="vw:View" minOccurs="0" maxOccurs="unbounded" />
     </sequence>
   </complexType>


   <element name="ViewChart" type="vw:ViewChart" />
   
   <complexType name="Reference">
     <annotation>
       <documentation>
        
        This is a utility type from which most of our structure types like trees
        and graphs derive. The attribute is named idRef because it is usually
        a reference to some annotation which has an ID. However for offline
        queries we might need to serialize views independent of annotations
        for browsing or other purposes and so it has the type NMTOKEN.
         
       </documentation>
     </annotation>
     <sequence>
       <any namespace="http://www.annotation.org/agschema/atlas/ag/view/userobject" minOccurs="0" maxOccurs="1" />
     </sequence>
     <attribute name="idRef" type="NMTOKEN" use="required" />
   </complexType>

</schema>

What this means is that multiple annotations residing in different locations can be put together on the view chart. The views you define should derive from the type View; see the definitions for trees and graphs in the schema docs. The Reference type is used in views to refer to annotations in the ViewChart, and most of the subtypes of View provided here rely on it. For example, in the TreeView each TreeNode is a Reference type. The TreeNode objects we compute implement both org.w3c.dom.Node and javax.swing.tree.TreeNode, so you can switch between the two.
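The content model of the ViewChart can be illustrated with plain DOM, independently of this API's own classes. The following is only a sketch under stated assumptions: the class ViewChartSketch, the namespace urn:example:annotations and the Corpus and Token elements are hypothetical, and a real instance would carry an xsi:type on each View naming a concrete subtype, since View is abstract. It separates the children of a ViewChart into annotation roots (elements from other namespaces) and views.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ViewChartSketch {
    static final String VIEW_NS = "http://www.annotation.org/agschema/atlas/ag/view";

    // Hypothetical instance: one annotation root from a foreign namespace,
    // followed by one view. Not validated against the schema.
    static final String SAMPLE =
        "<vw:ViewChart xmlns:vw='" + VIEW_NS + "' xmlns:my='urn:example:annotations'>"
      + "<my:Corpus><my:Token id='t1'/></my:Corpus>"
      + "<vw:View id='v1' type='demo'/>"
      + "</vw:ViewChart>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to tell namespaces apart
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the views if wantViews is true, otherwise the annotation roots.
    static List<Element> childElements(Element chart, boolean wantViews) {
        List<Element> out = new ArrayList<>();
        for (Node n = chart.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            boolean isView = VIEW_NS.equals(n.getNamespaceURI())
                && "View".equals(n.getLocalName());
            if (isView == wantViews) {
                out.add((Element) n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Element chart = parse(SAMPLE).getDocumentElement();
        System.out.println("annotation roots: " + childElements(chart, false).size());
        System.out.println("views: " + childElements(chart, true).size());
    }
}
```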

Layout of the API

There are two components to this API: the schema definitions, which define the types of all the representations computed, and the Java API, which implements these definitions and provides other functionality. We distinguish between two kinds of representations: the persistent representation and the in-memory representation. The API does not recommend any particular persistent representation. Unless explicitly stated otherwise, the representation we are talking about is the in-memory representation. Before we describe what is in each package, here is a discussion of the various conventions adopted.

A schema with targetNamespace="http://www.annotation.org/agschema/atlas/ag/view" would have its implementation
in the package org.annotation.agschema.atlas.ag.view. Any annotation API must have a default representation of
annotations so that it can provide convenience methods. Our default is Annotation Graphs [1], which
was chosen because we can encode almost anything in it. The in-memory representation of this is
given by the schema http://www.annotation.org/agschema/atlas/ag/rep. If you choose to go with something other
than Annotation Graphs, you would have to do the following two steps:

1. Define your in-memory representation as an XML Schema and provide Java APIs which handle it. The
classes provided should extend AGComplexTypeBase. Providing classes is optional: the same effect
can be achieved by setting the org.annotation.agschema.atlas.ag.AGDocument to non-strict mode, but you would then
be working with DOM for the most part.

2. Define ContentHandlers and LexicalHandlers for your data. You should be able to use a few of the
default ones provided in the org.annotation.agschema.atlas.ag.loadstore package to aid you.

All the other packages including the signal handling mechanism are generic but provide convenience
methods for Annotation Graphs which you may or may not be able to use. Also note that it is completely
possible to extend the Annotation Graph Schema representation in any way you see fit but these are
not stable standards yet, so preserve your persistent representation external to schemas
defined by this API.

The following conventions are adopted but not enforced:

a. Unique type attribution - XML Schema allows you to distinguish between elements by name and type,
but our Java API only distinguishes by type. So in XML Schema you could define two elements with
different names and the same type, and the Java API would treat them alike, because the way it is
set up we do not treat names as significant. The reason for this is complicated and I will not get
into it here. However, we do not prevent you from using such types; it is only that no convenience
methods can be provided for them. We also avoid the use of anonymous types.

b. One namespace, one schema - The way our XmlEntityResolver is set up, it requires a schema to
be uniquely identified by its targetNamespace. For the schemas to be truly portable, we avoid the
use of the schemaLocation attribute anywhere. If you notice it in the schemas you obtain, it is
only for the schema documentation tool used and serves no other purpose. As with everything else
in the API this is not a strict rule. You may set up your own XmlEntityResolver. Our default resolver
also requires that these schemas be present on the classpath. See AGXmlEntityResolver for details.
DTDs are resolved by publicId. The only DTD that comes with this API is the Annotation Graph DTD.

c. For the Annotation Graph Schema, Features with names having prefixes ag.er are assumed
to contain a list of IDREFS and names with prefixes ag.ec are assumed to contain a list of NMTOKENS.
This is so that we know where all the references are and multiple documents can be united easily.

Note that as per the AG DTD the Feature element can only contain Text children but I have observed
people using markup in there. That is why the Feature has anyType as opposed to a string type. The
default feature is http://www.annotation.org/agschema/atlas/ag/rep/feature:TextFeature and it is recommended
that only that be used so that the Annotation Graph Schema is always interoperable with the DTD.

d. All the schema representations are assumed to be in-memory representations. It is recommended
that the persistent form be whatever representation you usually use. For annotation graphs, methods
are provided to serialize to a form which respects the DTD. The DOCTYPE declaration recommended
is <!DOCTYPE AGSet PUBLIC "-//LDC/DTD AG 1.0//EN" "ag.dtd">. The publicId is defined by this
API and is not official. We keep the systemId as "ag.dtd" so that other applications written for
Annotation Graphs can be used without using our methods of resolving entities.

e. Validation - PSVI stands for Post Schema-Validation Infoset [4], which the Xerces parser [5]
provides. Validation of instance documents against schemas is expensive, so only our configuration
APIs depend on PSVI. Everything else is non-validating and PSVI independent. We rely on Xerces
to do the validation, and it is up to you to decide when and if you want to validate. A good place to
use validation is in the pre-serialization phase, because the end user will not notice the
delay; it is also a good debugging tool.
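Conventions d and e can be pictured together with JDK classes alone: validate in memory just before serializing, then write the document out with the recommended DOCTYPE. This is a minimal sketch, not this API's own serialization method. The class ValidateThenSerializeSketch is hypothetical, and the schema in it is a toy stand-in (a root AGSet with a required id) rather than the schema at http://www.annotation.org/agschema/atlas/ag/rep.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ValidateThenSerializeSketch {
    // Toy stand-in schema (assumption): AGSet with a required id attribute.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:complexType name='AGSetStandIn'>"
      + "<xs:attribute name='id' type='xs:ID' use='required'/>"
      + "</xs:complexType>"
      + "<xs:element name='AGSet' type='AGSetStandIn'/>"
      + "</xs:schema>";

    // Compiled once and reused, so repeated validation does not re-read the grammar.
    static final Schema SCHEMA = compile(XSD);

    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The default error handling of Validator throws SAXException on a validity error.
    static boolean isValid(Document doc) {
        try {
            SCHEMA.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Validates first, then serializes with the DOCTYPE recommended in convention d.
    static String serialize(Document doc) {
        if (!isValid(doc)) {
            throw new IllegalArgumentException("document is not schema-valid");
        }
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, "-//LDC/DTD AG 1.0//EN");
            identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "ag.dtd");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(parse("<AGSet id='a1'/>")));
    }
}
```

Keeping the compiled Schema object around plays a role similar to preparsing grammars into a pool: the cost of reading the schema is paid once.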

The following are requirements:

a. Interpretation of the ViewChart "any" element - Note that the definition of the ViewChart as it
stands allows you to include subtypes of views as the "any" elements because all we require is that
they be from some other namespace. The "any" elements are for your annotations and these should
NOT be instances of Views. Doing this will cause the API to malfunction.

b. Reserved namespaces - All namespaces starting with http://www.annotation.org/agschema are reserved
for internal use except http://www.annotation.org/agschema/atlas/ag/view/userobject.

The packages that correspond to the schemas are easily accessible. We discuss the packages which are not.

XmlFactory Package

The XmlFactory package configures the EntityResolvers and, through them, the parsers and Document instances that are created. The important classes here are the following:

AGXmlEntityResolver - This lets the parsers and validators know where all the entities are. The entities are usually grammars, either schemas or DTDs. The parser queries the resolver with a namespace and/or publicId and/or systemId. However, since we never use the schemaLocation attribute anywhere and systemIds in instance documents are always ignored, all we use is the namespace and/or publicId (usually only the namespace for schemas and only the publicId for DTDs); the resolver looks these up in the configuration file and passes back the input source. See the schema http://www.annotation.org/agschema/atlas/ag/xmlentities for more information.
AGXmlGrammarPool - The parser usually compiles the DTD or schema into some model so it doesn't have to read it in each time. Optionally, you can preparse all the grammars and put them in the pool so that validation, if needed, will be faster. Xerces XNI users should be familiar with this.
AGElementProviderImpl - This is the implementation of the org.annotation.agschema.atlas.ag.AGElementProvider interface, which decides what type of Elements should be created by the AGDocument. The configuration is given by the schema http://www.annotation.org/agschema/atlas/ag/elementprovider. In other words, when the AGDocument's createElementNS method is invoked, it usually means that some instance of AGComplexTypeBase must be created. We cannot tell which one from the namespaceURI and qName alone. The AGElementProvider uses the namespace, localName, xsiTypeNS and xsiTypeLocalName. The last two are given by the xsi:type attribute on the element, which must always be set for subtypes; "xsi" is the prefix for the XMLSchema-instance namespace. If the conventions recommended above are followed, this should always suffice. However, if you also allow anonymous typing, the configuration mechanism will need more information, and you can extend the interface appropriately.
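The four values the element provider works from can be recovered from any namespace-aware DOM element. The sketch below uses only standard DOM calls and is not the AGElementProviderImpl logic itself; the class XsiTypeLookupSketch, the method lookupKey, the namespace urn:example:tree and the placement of TreeView in it are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XsiTypeLookupSketch {
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Returns {namespace, localName, xsiTypeNS, xsiTypeLocalName};
    // the last two are null when the element has no xsi:type attribute.
    static String[] lookupKey(Element e) {
        String raw = e.getAttributeNS(XSI_NS, "type").trim();
        String typeNs = null;
        String typeLocal = null;
        if (!raw.isEmpty()) {
            int colon = raw.indexOf(':');
            // An unprefixed type name resolves against the default namespace.
            String prefix = colon < 0 ? null : raw.substring(0, colon);
            typeLocal = raw.substring(colon + 1);
            typeNs = e.lookupNamespaceURI(prefix);
        }
        return new String[] {e.getNamespaceURI(), e.getLocalName(), typeNs, typeLocal};
    }

    static String demo() {
        String xml =
            "<vw:View xmlns:vw='http://www.annotation.org/agschema/atlas/ag/view'"
          + " xmlns:t='urn:example:tree'"
          + " xmlns:xsi='" + XSI_NS + "' xsi:type='t:TreeView' id='v1'/>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return String.join("|", lookupKey(root));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```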

LoadStore Package

The most important mechanism here is the PassThroughHandler. If you have worked with org.xml.sax.XMLFilter before, this is quite similar. A PassThroughHandler passes SAX events on to some other handler (the destination) after applying some transformations. For example, if we have an AG DTD document and want to read it in as an AG Schema instance, some namespace declarations must be changed. Another use is ID generation: when a document is read in, IDs may need to be recreated, either for editing or for putting multiple documents under a single ViewChart. You could of course use XSLT for this, but these mechanisms are much faster. If your persistent annotations are in some non-XML form, this is also one way to load them in as XML. We usually create a chain of PassThroughHandlers in which the destination of the last one is a TransformerHandler that does our SAX2DOM transformation. Creating the document this way gives us access to the xsi:type information, so that abstract types can be instantiated. Note that it is possible to get the AGDocument to work with a DOMParser; this is intentionally not implemented because the schemas here should not be your persistent form. The AGSource and AGResult interfaces and their implementations are very similar to the sources and results of TrAX (Transformation API for XML), that is, the javax.xml.transform package.
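The same idea can be sketched with the JDK's own XMLFilterImpl standing in for a PassThroughHandler. This is an analogy under stated assumptions, not the LoadStore implementation: the class IdPrefixFilterSketch and the choice to rewrite only unqualified id attributes are hypothetical. The filter prefixes every id on the way through and hands the events to a TransformerHandler that builds a DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class IdPrefixFilterSketch extends XMLFilterImpl {
    private final String prefix;

    IdPrefixFilterSketch(XMLReader parent, String prefix) {
        super(parent);
        this.prefix = prefix;
    }

    // Rewrites the unqualified "id" attribute, then passes the event on.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        AttributesImpl copy = new AttributesImpl(atts);
        int index = copy.getIndex("", "id");
        if (index >= 0) {
            copy.setValue(index, prefix + copy.getValue(index));
        }
        super.startElement(uri, localName, qName, copy);
    }

    // Parser -> filter -> identity TransformerHandler -> DOM.
    static Document load(String xml, String prefix) {
        try {
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();

            SAXTransformerFactory transformerFactory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler toDom = transformerFactory.newTransformerHandler();
            DOMResult result = new DOMResult();
            toDom.setResult(result);

            IdPrefixFilterSketch filter = new IdPrefixFilterSketch(reader, prefix);
            filter.setContentHandler(toDom);
            filter.parse(new InputSource(new StringReader(xml)));
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = load("<a id='x'><b id='y'/></a>", "d1.");
        System.out.println(doc.getDocumentElement().getAttribute("id"));
    }
}
```

Several such filters can be chained by passing one filter as the parent of the next. XMLFilterImpl does not forward lexical events (comments, CDATA boundaries), so a fuller version would also need a LexicalHandler.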

Signal Handler Package

An annotation is assumed to refer to an interval in some signal, for example (0,10]. The endpoints of these intervals can be arbitrary strings which have units. A SignalHandler handles intervals on a SignalDescriptor. Note that this SignalDescriptor is similar to, but not the same as, the Annotation Graph Signal element. It just contains a description of the signal and is not part of any Annotation Graph document, so even if you represent signals in your XML differently, this SignalDescriptor should suffice to describe them. The signal handler takes a SignalInterval as input and returns an Object. For example, our SmallTextAsCharStreamHandler will return a CharSequence. The SignalManager lets you configure the set of SignalHandlers to use, since a particular instance document may refer to more than one signal and one SignalHandler will be created for each Signal.

As with any standoff architecture, it is assumed that the SignalInterval suffices to retrieve your data. The units of the end points of these intervals are up to you. This assumption may not be correct for speech and video signals (where something like the frame rate may also be needed). Hopefully a simple extension of the SignalInterval carrying the additional information will suffice for these.
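For a text signal, the interval-to-data step can be sketched as follows. None of these types come from the API: TextSignalSketch, Interval and Handler are hypothetical stand-ins for SignalInterval and SignalHandler, and the reading of (0,10] as the first ten characters of the signal is an assumption about the interval convention.

```java
public class TextSignalSketch {
    // Hypothetical stand-in for a SignalInterval: endpoints are strings,
    // here interpreted as character offsets.
    static final class Interval {
        final String start;
        final String end;

        Interval(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical stand-in for a SignalHandler producing values of type T.
    interface Handler<T> {
        T resolve(Interval interval);
    }

    // Assumption: (start, end] over 1-based character positions, which
    // corresponds to subSequence(start, end) over 0-based offsets.
    static Handler<CharSequence> charHandler(CharSequence signal) {
        return interval -> {
            int from = Integer.parseInt(interval.start);
            int to = Integer.parseInt(interval.end);
            return signal.subSequence(from, to);
        };
    }

    static String textOf(String signal, String start, String end) {
        return charHandler(signal).resolve(new Interval(start, end)).toString();
    }

    public static void main(String[] args) {
        System.out.println(textOf("Annotation graphs", "0", "10"));
    }
}
```

A handler for speech or video would need more than two offsets, which is the kind of extension of the interval type the text anticipates.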

Test Package

The test package is the easiest place to look for examples of usage. There is approximately one test for each package, and together they show how to use each piece of functionality. The command line argument for each test is a single XML file, which can be found in {sys path to install dir}/testcmdline/testName.xsi. The command line schema is http://www.annotation.org/agschema/atlas/ag/cmdline (see schema docs); see also the org.annotation.agschema.atlas.ag.cmdline package if you would like to use it for other things as well.

Configuration Files

The config files and the schemas are part of agResources.jar. Two configuration files are needed to set up everything:

AGXmlEntityResolverConfig.xsi - This tells us the location of schemas and DTDs on the classpath. If you add additional schemas or DTDs, you must change this configuration. See the API docs for AGXmlEntityResolver for ways to accomplish this.
AGElementProviderConfig.xsi - This tells the API what objects to create when createElementNS is invoked. If you subtype any of the classes, you will need to modify this file. See AGXmlEntityResolver in the Java API for ways to do this.
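What the entity resolver configuration amounts to, for DTDs, is a lookup from publicId to an input source. The following sketch shows that lookup with the standard SAX EntityResolver interface. It is not AGXmlEntityResolver: the class PublicIdResolverSketch is hypothetical, the table is held in memory instead of being read from a configuration file or the classpath, and the DTD text is a two-line stand-in, not the Annotation Graph DTD.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class PublicIdResolverSketch implements EntityResolver {
    private final Map<String, String> dtdByPublicId = new HashMap<>();

    void register(String publicId, String dtdText) {
        dtdByPublicId.put(publicId, dtdText);
    }

    // Resolution is by publicId only; the systemId is just carried along.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = dtdByPublicId.get(publicId);
        if (dtd == null) {
            return null; // let the parser fall back to its default behaviour
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses a document whose DTD exists only in the resolver's table and
    // returns an attribute value that the stand-in DTD supplies by default.
    static String demoVersion() {
        PublicIdResolverSketch resolver = new PublicIdResolverSketch();
        resolver.register("-//LDC/DTD AG 1.0//EN",
            "<!ELEMENT AGSet EMPTY>\n<!ATTLIST AGSet version CDATA '1.0'>");
        String xml = "<!DOCTYPE AGSet PUBLIC '-//LDC/DTD AG 1.0//EN' 'ag.dtd'><AGSet/>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(resolver);
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("version");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoVersion());
    }
}
```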

If you change the setup, you might want to run one or more of our test cases to see if everything works. This will require you to edit the command line arguments file to provide system-dependent arguments such as paths to the input files. The test command line files will be in {sys path to install dir}/testcmdline. Unfortunately, because of licensing issues, I cannot distribute my test data with this package. If you are licensed to access the Penn Treebank and/or the Penn Discourse Treebank, send me email at nikhild at seas dot upenn dot edu and we can set up a way for you to get at the test data. If not, any sample Annotation Graph XML files should do.

Querying

Once the views are computed, or in order to compute them, the XML base needs to be queried. The goal of all this is to let you plug in standard mechanisms like XPath and XQuery to do so. There is no XPath API included with our distribution, but I have tested the XPath APIs included with Xalan and Jaxen. Java™ 1.5 includes an XPath API in the default distribution, so that can be used as well. I do not know of any open source XQuery implementation. The DOM Level2 Traversal and Range APIs also provide less expressive but quicker ways to query XML structure.
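As an illustration of plugging in a standard mechanism, here is a minimal query using the javax.xml.xpath package that ships with the JDK. The class XPathQuerySketch and the Corpus and Token sample data are hypothetical; nothing here depends on this API's classes, only on a DOM Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathQuerySketch {
    // Hypothetical annotation data.
    static final String SAMPLE =
        "<Corpus>"
      + "<Token id='t1' pos='DT'>the</Token>"
      + "<Token id='t2' pos='NN'>dog</Token>"
      + "<Token id='t3' pos='NN'>cat</Token>"
      + "</Corpus>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the ids of all tokens carrying the given part-of-speech tag.
    static List<String> idsWithPos(Document doc, String pos) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind $pos as a variable instead of pasting it into the expression.
            xpath.setXPathVariableResolver(
                name -> "pos".equals(name.getLocalPart()) ? pos : null);
            NodeList hits = (NodeList) xpath.evaluate(
                "//Token[@pos = $pos]/@id", doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(hits.item(i).getNodeValue());
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idsWithPos(parse(SAMPLE), "NN"));
    }
}
```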

Note that XPath will not be the whole answer to all your queries. For nested queries and the like you will need a combination of XPath and programming through the DOM interface or the methods provided by the API. You can switch seamlessly between all these modes. From my tests, Xalan's XPath API is faster if you want to do multiple queries on a static DOM. If you need to do XPath on a rapidly changing DOM, the performance gap closes and Jaxen is more memory efficient.
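A nested query of this kind might look like the sketch below, which groups results per sentence, something XPath 1.0 cannot express on its own. An outer compiled expression selects the sentences, ordinary Java and DOM code iterates over them, and an inner compiled expression is evaluated relative to each one. The class NestedQuerySketch and the S and Token sample data are hypothetical, and the JDK's XPath implementation is used only for convenience.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NestedQuerySketch {
    // Hypothetical annotation data.
    static final String SAMPLE =
        "<Corpus>"
      + "<S id='s1'><Token pos='NN'>dog</Token><Token pos='VB'>runs</Token></S>"
      + "<S id='s2'><Token pos='DT'>the</Token></S>"
      + "<S id='s3'><Token pos='NN'>cat</Token><Token pos='NN'>mat</Token></S>"
      + "</Corpus>";

    // Maps each sentence id to the nouns it contains; sentences without nouns are omitted.
    static Map<String, List<String>> nounsBySentence(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Compile once, evaluate many times.
            XPathExpression sentences = xpath.compile("//S");
            XPathExpression nouns = xpath.compile("Token[@pos='NN']");

            Map<String, List<String>> out = new LinkedHashMap<>();
            NodeList sentenceNodes = (NodeList) sentences.evaluate(doc, XPathConstants.NODESET);
            for (int i = 0; i < sentenceNodes.getLength(); i++) {
                Element sentence = (Element) sentenceNodes.item(i);
                // The inner expression is relative to the current sentence.
                NodeList nounNodes = (NodeList) nouns.evaluate(sentence, XPathConstants.NODESET);
                List<String> words = new ArrayList<>();
                for (int j = 0; j < nounNodes.getLength(); j++) {
                    words.add(nounNodes.item(j).getTextContent());
                }
                if (!words.isEmpty()) {
                    out.put(sentence.getAttribute("id"), words);
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(nounsBySentence(SAMPLE));
    }
}
```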

To minimize the amount of programming needed and to keep speed high, use the setUserData methods on the nodes. This is different from the setUserObject methods, which are intended for XML structure. For example, if you have two AGTreeViews and need pointers from one to the other, the setUserObject method can be used for labels and other information local to the node, and setUserData can be used to maintain the pointers. This will be especially useful while editing or showing a graphical display.
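The pointer-keeping role of setUserData can be shown with two plain DOM documents standing in for two tree views. Node.setUserData and Node.getUserData are standard DOM Level 3 methods; the class UserDataLinkSketch, the key name linkedNode and the sample elements are hypothetical, and setUserObject is not shown because it belongs to this API rather than to DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class UserDataLinkSketch {
    // Hypothetical key under which the cross-tree pointer is stored.
    static final String LINK_KEY = "linkedNode";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Links a node in one tree to a node in another and follows the link back.
    static String demo() {
        Document syntax = parse("<S><NP id='np1'/></S>");
        Document discourse = parse("<Rel><Arg id='arg1'/></Rel>");
        Node np = syntax.getDocumentElement().getFirstChild();
        Node arg = discourse.getDocumentElement().getFirstChild();

        // User data lives beside the node; it is not part of the XML structure
        // and is not written out when the document is serialized.
        np.setUserData(LINK_KEY, arg, null);
        arg.setUserData(LINK_KEY, np, null);

        Node linked = (Node) np.getUserData(LINK_KEY);
        Node back = (Node) linked.getUserData(LINK_KEY);
        return linked.getNodeName() + "->" + back.getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```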

There is one type of query that cannot be handled by this API: the online multi-file query. That is, there is one query which we want to run against a large part of the annotation base, and we wish to view the results immediately. In my opinion, no XML-based annotation API will be able to support this kind of query for some time to come; serialization and object creation times are the bottlenecks. We can get away with it if results are dense, so that further results can be fetched while the user looks at the current ones (this may be the case in early annotation correction phases), but in general this kind of thing cannot be attempted with this API.

Limitations and TODOs

As mentioned above there is only direct support for handling signals of text. Handling other signals like speech and video is not currently on our agenda and you may be better served by the AGTK.

Tables and Collection views will be implemented if someone is willing to send us examples of use. Also, the graph API should respect some generic interface which has a lot of algorithms implemented.

Support for simple types needs to be added. For example when we have a list of IDREFs it seems unreasonable to have to split it and create new Strings. Rather a list view of one String should be maintained. But this is a little tricky to implement in that it is not quite clear whether these should have live DOM support and whether they should be created through the DOM events API or by some other mechanism.

It would be nice to be able to have an internal mechanism for XPath (by implementing some interface) so we can speed it up and use it a little more carelessly.

References

[1] Steven Bird and Mark Liberman. Annotation graphs as a framework for multidimensional linguistic data analysis. In Towards Standards and Tools for Discourse Tagging: Proceedings of the Workshop, pages 1-10. Somerset, NJ: Association for Computational Linguistics, 1999. [xxx.lanl.gov/abs/cs.CL/9907003]

[2] XML Schema Specification
[3] XPath Recommendation
[4] Post Schema-Validation Infoset
[5] Xerces2 Java Parser
[6] XML Beans
[7] XML
[8] The World Wide Web Consortium
[9] Annotation Graph Implementation
[10] Schema Docs
[11] API Docs
[12] SAX
[13] Xalan
[14] Jaxen
[15] DOM
[16] DOM Level2 Traversal and Range