The Annotation Graph Schema API
This work is a derivative of the Annotation Graph definition
and implementation. The terms XML, DTD, XML Schema, DOM,
XSLT, XPath, XQuery and XPointer refer to the specifications of the World Wide Web Consortium.
This document is intended to serve as a supplement to the API Documentation
and Schema Documentation and is under revision. We would
like to thank Tom Morton and
Jeremy Lacivita for granting us permission to use the
annotation.org namespace.
Introduction
This API is intended to serve as a backend to XML-based annotation APIs or as
middleware to annotations in other formats where the use of XML-based tools
is needed for the purposes of querying and creation of annotation bases. We now describe
the subset of tasks on annotation bases that can be handled by this API and those that
cannot.
-
Online tasks - These are tasks like editing, browsing and queries for which the results of a
user-initiated action must be displayed immediately (online). This API aims to reduce
the time needed to deploy editors and browsers by handling the representational components in
a general and extensible way.
There is an important class of queries that CANNOT be handled. This is the "online corpus-level query",
i.e. the query is over the entire corpus or a large subset of it and results must be obtained immediately.
This is not a limitation of this API per se; it arises because XML-based representations of corpora
are much larger than textual representations.
-
Offline tasks - These are tasks like testing features for machine learning or any queries
where the user is willing to wait. For these kinds of tasks the API gives you a lot of
convenience and the ability to easily plug in standard querying/selection mechanisms
like XPath to aid you in your queries. A significant amount of effort
has gone into making these tasks as fast as possible, so the API can also
handle many tasks in the grey area between online and offline.
Here are some features of this API:
-
Annotations can be stored in any format as long as they are read into memory as XML. This
enables this API to integrate with whatever representation you have. Our mechanism for
doing this is the LoadStore package, which is quite efficient and should be simple to use
if you've worked with TrAX (javax.xml.transform) before.
-
Annotations in multiple XML files can be united in memory. This gives you the ability to
keep various annotation projects separate and work with any combination of them together.
-
We have picked the Annotation Graph representation as our in-memory default, so
utility methods can be provided. But it would be reasonably easy for you to make this work
with any other representation.
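As a rough illustration of uniting several XML sources in memory, the sketch below uses only standard JAXP/DOM calls to import the roots of two documents under a single ViewChart element. It does not use the LoadStore package; the class name UniteAnnotations and the urn:example namespaces are hypothetical, and the inputs are given as strings rather than files to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Illustrative only: unites the roots of several XML documents under one in-memory root. */
final class UniteAnnotations {
    // Namespace of the ViewChart element, as given by the view schema in this document.
    static final String VIEW_NS = "http://www.annotation.org/agschema/atlas/ag/view";

    static Document unite(String... xmlSources) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            Document chartDoc = builder.newDocument();
            Element chart = chartDoc.createElementNS(VIEW_NS, "vw:ViewChart");
            chartDoc.appendChild(chart);

            for (String xml : xmlSources) {
                Document part = builder.parse(new InputSource(new StringReader(xml)));
                // importNode makes a deep copy owned by the chart document.
                chart.appendChild(chartDoc.importNode(part.getDocumentElement(), true));
            }
            return chartDoc;
        } catch (Exception e) {
            throw new IllegalStateException("could not unite documents", e);
        }
    }

    public static void main(String[] args) {
        Document doc = unite(
            "<a:Project xmlns:a='urn:example:syntax'><a:Annotation id='s1'/></a:Project>",
            "<b:Project xmlns:b='urn:example:discourse'><b:Annotation id='d1'/></b:Project>");
        // Number of annotation roots now sitting under the single chart element.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Note that a real loader would also have to keep ids unique across the united documents, which this sketch ignores.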
The View Chart
Annotations encode
structure in some way, and representing this structure arbitrarily
in memory causes us to lose the ability to use standard APIs to
work with them. To address this issue, we use the notion of a
"View" (from Relational Database Management Systems (RDBMS)). A view is
any structure computed from the underlying base. In an RDBMS the only
structure available is the table, and hence all views are tables derived
from an underlying set of tables.
We define a View to be any structure computed from the underlying
annotation base. It is not possible for an API to handle all possible
views, but we aim to define a set of base types from which views can be
derived rather easily. This version of the API provides support for trees
and graphs. We aim to add types for tables and collections in future versions
once issues regarding their encoding and use are better understood.
This kind of typing is achieved through the XML Schema type definition
language, and the core schema for viewing is defined as follows. See the schema docs
for documentation in another format.
<schema xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:vw="http://www.annotation.org/agschema/atlas/ag/view"
targetNamespace="http://www.annotation.org/agschema/atlas/ag/view"
elementFormDefault="qualified">
<annotation>
<documentation>
This lays out the base types for views of an annotation. The view chart
contains a list of elements corresponding to the roots of your
annotations and a list of views.
This need not be the annotation graph format but it must be an instance
of some XML Schema type. Any number of views can be attached to this view chart.
This allows one to store annotations in multiple files and put them together for
the purposes of querying and editing on a ViewChart.
See also the Reference type and the org.annotation.agschema.atlas.ag.loadstore package in
the Java API.
@author Nikhil Dinesh
</documentation>
</annotation>
<complexType name="View" abstract="true" >
<annotation>
<documentation>
A view can be anything but it should derive from this type by extension.
See the tree view in namespace http://www.annotation.org/agschema/atlas/ag/view/tree
</documentation>
</annotation>
<attribute name="id" type="ID" />
<attribute name="type" type="string" />
</complexType>
<complexType name="ViewChart">
<annotation>
<documentation>
The view chart contains the roots of a set of annotations as well as a list of
views. Note that instances of this type should not be the persistent form
of your data. The views should be computed in memory. If a frequently desired
view is computationally very expensive, then you might want to rethink your
annotation format.
</documentation>
</annotation>
<sequence>
<any namespace="##other" minOccurs="1" maxOccurs="unbounded"/>
<element name="View" type="vw:View" minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
<element name="ViewChart" type="vw:ViewChart" />
<complexType name="Reference">
<annotation>
<documentation>
This is a utility type from which most of our structure types like trees
and graphs derive. The attribute is named idRef because it is usually
a reference to some annotation which has an ID. However for offline
queries we might need to serialize views independent of annotations
for browsing or other purposes and so it has the type NMTOKEN.
</documentation>
</annotation>
<sequence>
<any namespace="http://www.annotation.org/agschema/atlas/ag/view/userobject" minOccurs="0" maxOccurs="1" />
</sequence>
<attribute name="idRef" type="NMTOKEN" use="required" />
</complexType>
</schema>
What this means is that multiple annotations residing in different locations can be put together
on the view chart. The views you define should derive from the type View. See the definitions for
trees and graphs in the schema docs. The Reference type is used in views to refer to annotations
in the ViewChart. Most of the subtypes of View provided here rely on the Reference type. For example,
in the TreeView each TreeNode is a Reference type. The TreeNode objects we compute
implement both org.w3c.dom.Node and javax.swing.tree.TreeNode, so you can work with them through either interface.
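To make the role of the Reference type more concrete, here is a minimal sketch of following an idRef from a view back to an annotation on the same ViewChart, using plain DOM. The class ReferenceLookup, the ann namespace, and the exact TreeNode element name are assumptions made for illustration; the API's own TreeView classes are not used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: resolves an idRef found in a view against the annotations on the chart. */
final class ReferenceLookup {
    static final String VIEW_NS = "http://www.annotation.org/agschema/atlas/ag/view";

    // Hypothetical instance: two annotations and one tree view that refers to the second.
    static final String SAMPLE =
        "<vw:ViewChart xmlns:vw='" + VIEW_NS + "' xmlns:t='" + VIEW_NS + "/tree'"
        + " xmlns:ann='urn:example:annotations'>"
        + "<ann:Annotation id='a1' type='NP'/>"
        + "<ann:Annotation id='a2' type='VP'/>"
        + "<vw:View id='v1' type='tree'><t:TreeNode idRef='a2'/></vw:View>"
        + "</vw:ViewChart>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    /** Returns the element outside the view namespaces whose "id" equals idRef, or null. */
    static Element resolve(Document chart, String idRef) {
        NodeList all = chart.getElementsByTagNameNS("*", "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element candidate = (Element) all.item(i);
            String ns = candidate.getNamespaceURI();
            boolean belongsToView = ns != null && ns.startsWith(VIEW_NS);
            if (!belongsToView && idRef.equals(candidate.getAttribute("id"))) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element target = resolve(parse(SAMPLE), "a2");
        System.out.println(target.getAttribute("type"));
    }
}
```

A linear scan is used only for brevity; an id-to-element map built once per chart would be the more natural choice for repeated lookups.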
Layout of the API
There are two components to this API. The first is the set of schema definitions, which define the types of all
the representations computed, and the second is the Java API, which implements these definitions and
provides other functionality. We distinguish between two kinds of representations - the persistent
representation and the in-memory representation. The API does not recommend any particular persistent
representation. Unless explicitly stated otherwise, the representation we are talking about is the
in-memory representation. Before we describe what is in each package, here is a discussion of the
various conventions adopted.
A schema with targetNamespace="http://www.annotation.org/agschema/atlas/ag/view" would have its implementation
in the package org.annotation.agschema.atlas.ag.view. Any annotation API must have a default representation of
annotations so we can provide convenience methods. Our default is Annotation Graphs [1], which
was chosen because almost anything can be encoded in it. The in-memory representation of this is
given by the schema http://www.annotation.org/agschema/atlas/ag/rep. If you choose to go with something other
than Annotation Graphs you would have to do the following two steps:
1. Define your in-memory representation as an XML Schema and provide Java APIs which handle it. The
classes provided should extend AGComplexTypeBase. Providing classes is optional: the same effect
can be achieved by setting the org.annotation.agschema.atlas.ag.AGDocument to non-strict mode, but you would then
be working with DOM for the most part.
2. Define ContentHandlers and LexicalHandlers for your data. You should be able to use a few of the
default ones provided in the org.annotation.agschema.atlas.ag.loadstore package to aid you.
All the other packages including the signal handling mechanism are generic but provide convenience
methods for Annotation Graphs which you may or may not be able to use. Also note that it is completely
possible to extend the Annotation Graph Schema representation in any way you see fit but these are
not stable standards yet, so preserve your persistent representation external to schemas
defined by this API.
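The correspondence between a targetNamespace and its implementing package, mentioned at the start of this discussion, can be sketched as a small helper. The rule below (drop a leading "www.", reverse the host name, append the path segments) is inferred from the single example given above and is an assumption; NamespaceToPackage is a hypothetical name, not a class of the API.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: derives a Java package name from a schema targetNamespace. */
final class NamespaceToPackage {
    static String toPackage(String targetNamespace) {
        URI uri = URI.create(targetNamespace);
        String host = uri.getHost();
        if (host.startsWith("www.")) {
            host = host.substring(4); // assumption: the "www" label is not part of the package
        }
        // Reverse the host labels: annotation.org becomes org, annotation.
        List<String> parts = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(parts);
        // Append each non-empty path segment in order.
        for (String segment : uri.getPath().split("/")) {
            if (!segment.isEmpty()) {
                parts.add(segment);
            }
        }
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        System.out.println(toPackage("http://www.annotation.org/agschema/atlas/ag/view"));
    }
}
```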
The following conventions are adopted but not enforced:
a. Unique type attribution - XML Schema allows you to distinguish between elements by name and type.
But our Java API only distinguishes by type. So in XML Schema, you could define:
<complexType name="foo">
<sequence>
<element name="a" type="bar" />
<element name="b" type="bar" />
</sequence>
</complexType>
But the way our Java API is set up we do not treat names as significant. The reason for this is
complicated and I will not get into that here. However we do not prevent you from using such types.
It is only that no convenience methods can be provided for them. We also avoid the use of
anonymous types.
b. One namespace, one schema - The way our XmlEntityResolver is set up, it requires a schema to
be uniquely identified by its targetNamespace. For the schemas to be truly portable, we avoid the
use of the schemaLocation attribute anywhere. If you notice it in the schemas you obtain, it is
only for the schema documentation tool used and serves no other purpose. As with everything else
in the API this is not a strict rule. You may set up your own XmlEntityResolver. Our default resolver
also requires that these schemas be present on the classpath. See AGXmlEntityResolver for details.
DTDs are resolved by publicId. The only DTD that comes with this API is the Annotation Graph DTD.
c. For the Annotation Graph Schema, Features with names having prefixes ag.er are assumed
to contain a list of IDREFS and names with prefixes ag.ec are assumed to contain a list of NMTOKENS.
This is so that we know where all the references are and multiple documents can be united easily.
Note that as per the AG DTD the Feature element can only contain Text children but I have observed
people using markup in there. That is why the Feature has anyType as opposed to a string type. The
default feature is http://www.annotation.org/agschema/atlas/ag/rep/feature:TextFeature and it is recommended
that only that be used so that the Annotation Graph Schema is always interoperable with the DTD.
d. All the schema representations are assumed to be in-memory representations. It is recommended
that the persistent form be whatever representation you usually use. For annotation graphs, methods
are provided to serialize to a form which respects the DTD. The DOCTYPE declaration recommended
is <!DOCTYPE AGSet PUBLIC "-//LDC/DTD AG 1.0//EN" "ag.dtd">. The publicId is defined by this
API and is not official. We keep the systemId as "ag.dtd" so that other applications written for
Annotation Graphs can be used without using our methods of resolving entities.
e. Validation - PSVI stands for Post Schema-Validation Infoset [4] that is provided by the Xerces
parser [5]. Validation of instance documents against schemas is expensive and so only our configuration
APIs depend on PSVI. All the other stuff is non-validating and PSVI independent. We rely on Xerces
to do the validation and it is up to you to decide when and if you want to validate. A good place to
use validation is in the pre-serialization phase because the end user will not notice the
delay and it is also a good debugging tool.
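For convention (d), the following sketch shows one way to emit the recommended DOCTYPE declaration when serializing a DOM with the standard javax.xml.transform identity transformer. It is not the serialization method provided by this API; DoctypeSerialization is a hypothetical class, and the empty AGSet document stands in for real annotation data.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

/** Illustrative only: serializes a DOM with the DOCTYPE recommended in convention (d). */
final class DoctypeSerialization {
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // The publicId is the one defined by this API; the systemId stays "ag.dtd".
            identity.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, "-//LDC/DTD AG 1.0//EN");
            identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "ag.dtd");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    /** Builds a placeholder document whose root is AGSet, the root named in the DOCTYPE. */
    static Document emptyAgSet() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("AGSet"));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not create document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(emptyAgSet()));
    }
}
```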
The following are requirements:
a. Interpretation of the ViewChart "any" element - Note that the definition of the ViewChart as it
stands allows you to include subtypes of views as the "any" elements because all we require is that
they be from some other namespace. The "any" elements are for your annotations and these should
NOT be instances of Views. Doing this will cause the API to malfunction.
b. Reserved namespaces - All namespaces starting with http://www.annotation.org/agschema are reserved
for internal use except http://www.annotation.org/agschema/atlas/ag/view/userobject.
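Requirement (b) is easy to check mechanically before registering a schema of your own. The helper below is a minimal sketch under that reading of the rule; ReservedNamespaces is a hypothetical class, not part of the API.

```java
/** Illustrative only: tests whether a namespace falls in the range reserved by requirement (b). */
final class ReservedNamespaces {
    static final String RESERVED_PREFIX = "http://www.annotation.org/agschema";
    static final String USER_OBJECT_NS =
        "http://www.annotation.org/agschema/atlas/ag/view/userobject";

    static boolean isReserved(String namespaceUri) {
        return namespaceUri != null
            && namespaceUri.startsWith(RESERVED_PREFIX)
            && !namespaceUri.equals(USER_OBJECT_NS); // the one documented exception
    }

    public static void main(String[] args) {
        System.out.println(isReserved("http://www.annotation.org/agschema/atlas/ag/view"));
        System.out.println(isReserved(USER_OBJECT_NS));
        System.out.println(isReserved("urn:example:my-annotations"));
    }
}
```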
The packages that correspond to the schemas are easily accessible. We discuss the packages which are
not.
XmlFactory Package
The XmlFactory package configures the EntityResolvers and through that the parsers and Document
instances that are created. The important classes here are the following:
-
AGXmlEntityResolver - The parsers and validators need to know where all the entities are, and this class tells them. The entities
are usually grammars, either Schemas or DTDs. The parser queries the resolver with a namespace and/or
publicId and/or systemId. However, since we never use the schemaLocation attribute anywhere and systemIds
in instance documents are always ignored, all we use is the namespace and/or publicId (usually only the namespace
for schemas and only the publicId for DTDs). The resolver looks up the configuration file and passes in
the input source. See the schema http://www.annotation.org/agschema/atlas/ag/xmlentities for more information.
-
AGXmlGrammarPool - The parser usually compiles the DTD or Schema into some model so it doesn't have to
read it in each time. Optionally you can preparse all the grammars and put them in the pool so that validation,
if needed, will be faster. Xerces XNI users should be familiar with this.
-
AGElementProviderImpl - This is the implementation of the org.annotation.agschema.atlas.ag.AGElementProvider interface
which decides what type of Elements should be created by the AGDocument. The configuration is given by
the schema http://www.annotation.org/agschema/atlas/ag/elementprovider. In other words, when the AGDocument's createElementNS
method is invoked it usually means that some instance of AGComplexTypeBase must be created. We cannot tell
what this is from the namespaceURI and qName alone. The AGElementProvider uses the namespace, localName,
xsiTypeNS and xsiTypeLocalName (these are given by the xsi:type attribute on the element, which must always
be set for subtypes, where "xsi" is the prefix for the XMLSchema-instance namespace). If the conventions
recommended above are followed then this should always suffice. However, if you also allow anonymous typing,
the configuration mechanism will need more information and you can extend the interface appropriately.
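The idea of resolving a DTD by publicId while ignoring the systemId can be sketched with the standard org.xml.sax.EntityResolver interface. This is not AGXmlEntityResolver: PublicIdResolver is a hypothetical class, the grammar is held in memory rather than looked up on the classpath, and the one-line DTD is a placeholder, not the real Annotation Graph DTD.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Illustrative only: resolves DTDs purely by publicId, ignoring the systemId. */
final class PublicIdResolver implements EntityResolver {
    static final String AG_PUBLIC_ID = "-//LDC/DTD AG 1.0//EN";
    static final String SAMPLE =
        "<!DOCTYPE AGSet PUBLIC \"" + AG_PUBLIC_ID + "\" \"ag.dtd\"><AGSet/>";

    private final Map<String, String> grammarsByPublicId = new HashMap<>();
    int lookups = 0; // counts successful lookups, for demonstration

    void register(String publicId, String grammarText) {
        grammarsByPublicId.put(publicId, grammarText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String grammar = grammarsByPublicId.get(publicId);
        if (grammar == null) {
            return null; // unknown publicId: let the parser fall back to the systemId
        }
        lookups++;
        InputSource source = new InputSource(new StringReader(grammar));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    static PublicIdResolver demoResolver() {
        PublicIdResolver resolver = new PublicIdResolver();
        resolver.register(AG_PUBLIC_ID, "<!ELEMENT AGSet ANY>"); // placeholder grammar
        return resolver;
    }

    static Document parse(String xml, EntityResolver resolver) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(resolver);
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        PublicIdResolver resolver = demoResolver();
        Document doc = parse(SAMPLE, resolver);
        System.out.println(doc.getDocumentElement().getTagName() + " " + resolver.lookups);
    }
}
```

Because the resolver supplies the grammar itself, the document parses even though no file named ag.dtd exists at the systemId location.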
LoadStore Package
The most important mechanism here is that of the PassThroughHandler. If you have worked with org.xml.sax.XMLFilter
before, this is quite similar. A PassThroughHandler passes on SAX events to some other handler
(the destination)
after applying some transformations. For example, if we have an AG DTD document and we want to read it in as an
AG Schema instance, some namespace declarations must be changed. Another use is for ID generation. When
a document is read in, ids may need to be recreated either for editing or for putting multiple documents
under a single ViewChart. You could of course use XSLT for this but these mechanisms are much faster. Also,
if your persistent annotations are in some non-XML form this is one way to load them into XML. We usually
create a chain of PassThroughHandlers with the destination of the last one as a TransformerHandler which
does our SAX2DOM transformation. Creation in this way allows us to access the xsi:type information so that
abstract types can be instantiated. Note that it is possible to get the AGDocument to work with a DOMParser.
This is intentionally not implemented because the schemas here should not be your persistent form. The AGSource
and AGResult interfaces and their implementations are very similar to TrAX (Transformation API for XML)
sources and results, that is, the classes in javax.xml.transform.
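The pass-through pattern described above can be approximated with standard classes: an org.xml.sax.helpers.XMLFilterImpl that rewrites events, feeding a TransformerHandler that builds a DOM. The sketch below is not the PassThroughHandler implementation; NamespaceRewriteDemo and NamespaceRewriter are hypothetical names, and it assumes for illustration that the input elements carry no namespace and should be moved into the rep namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Illustrative only: a SAX filter that changes namespaces on the way into a DOM. */
final class NamespaceRewriteDemo {
    static final String REP_NS = "http://www.annotation.org/agschema/atlas/ag/rep";

    /** Moves every element that has no namespace into targetNs and forwards the event. */
    static final class NamespaceRewriter extends XMLFilterImpl {
        private final String targetNs;

        NamespaceRewriter(String targetNs) {
            this.targetNs = targetNs;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri.isEmpty() ? targetNs : uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            super.endElement(uri.isEmpty() ? targetNs : uri, localName, qName);
        }
    }

    static Document load(String xml) {
        try {
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();

            // The destination: an identity TransformerHandler that turns SAX events into a DOM.
            SAXTransformerFactory transformerFactory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler toDom = transformerFactory.newTransformerHandler();
            DOMResult result = new DOMResult();
            toDom.setResult(result);

            // The chain: parser -> rewriter -> DOM builder.
            NamespaceRewriter rewriter = new NamespaceRewriter(REP_NS);
            rewriter.setParent(reader);
            rewriter.setContentHandler(toDom);
            rewriter.parse(new InputSource(new StringReader(xml)));
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("load failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = load("<AGSet><AG id='ag1'/></AGSet>");
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```

Further filters (for example one that regenerates ids) could be inserted into the same chain by making each filter the parent of the next.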
Signal Handler Package
An annotation is assumed to refer to an interval in some signal. For example (0,10]. The endpoints of these
intervals can be arbitrary strings which have units. A SignalHandler handles intervals on a
SignalDescriptor. Note that this SignalDescriptor is similar but not the same
as the Annotation Graph Signal element. It just contains a description of the signal but it is not a
part of any Annotation Graph document. So even if you represent signals in your XML differently, this
Signal should suffice to describe it. The signal handler takes as input a SignalInterval and returns
an Object. For example, our SmallTextAsCharStreamHandler will return a CharSequence. The SignalManager
lets you configure the set of SignalHandlers to use as a particular instance document may refer to
more than one signal and one SignalHandler will be created for each Signal.
So with any standoff architecture it is assumed that the SignalInterval suffices to retrieve your
data. The units of the end points of these intervals are up to you. This assumption may not be
correct for speech and video signals (where potentially something like the frame rate may be needed).
Hopefully a simple extension of the SignalInterval carrying the additional information will suffice for these.
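As a rough picture of what a handler for a small text signal does, the sketch below maps an interval with string endpoints onto a CharSequence. All names here (TextSignalSketch, Interval, CharOffsetHandler) are hypothetical stand-ins and not the API's SignalInterval or SmallTextAsCharStreamHandler. It also assumes that endpoints are character offsets counted between characters, so that the interval (0,10] covers the first ten characters.

```java
/** Illustrative only: retrieves the text covered by an interval of character offsets. */
final class TextSignalSketch {
    /** Hypothetical stand-in for an interval whose endpoints are strings. */
    static final class Interval {
        final String start;
        final String end;

        Interval(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical handler over an in-memory text signal. */
    static final class CharOffsetHandler {
        private final CharSequence signal;

        CharOffsetHandler(CharSequence signal) {
            this.signal = signal;
        }

        /** Interprets both endpoints as character offsets and returns the covered text. */
        CharSequence retrieve(Interval interval) {
            int from = Integer.parseInt(interval.start);
            int to = Integer.parseInt(interval.end);
            return signal.subSequence(from, to);
        }
    }

    public static void main(String[] args) {
        CharOffsetHandler handler = new CharOffsetHandler("The quick brown fox");
        System.out.println(handler.retrieve(new Interval("0", "10")));
    }
}
```

A handler for speech or video would need more than two offsets, which is the extension point the paragraph above alludes to.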
Test Package
The test package is the easiest place to look for examples of usage. There is approximately one test
for each package, and together they show how to use each piece of functionality. The command line argument for each test
is a single xml file which can be found in {sys path to install dir}/testcmdline/testName.xsi. The
command line schema is http://www.annotation.org/agschema/atlas/ag/cmdline (see schema docs);
see also the org.annotation.agschema.atlas.ag.cmdline package if you would like to use it for other things as well.
Configuration Files
The config files and the schemas are part of agResources.jar. Two configuration files are needed to set
up everything:
-
AGXmlEntityResolverConfig.xsi - This tells us the location of schemas and DTDs on the classpath. If
you add additional schemas or DTDs you must change this configuration. See apidocs for
AGXmlEntityResolver for ways to accomplish this.
-
AGElementProviderConfig.xsi - This tells the API what objects to create
when createElementNS is invoked. If you subtype any of the classes, you will need to modify this
file. See AGXmlEntityResolver in the Java API for ways to do this.
If you change the setup you might want to run one or more of our test cases to see if
everything works. This will require you to edit the command line arguments file to provide
the system-dependent arguments like paths to the input files. The test
command line files will be in {sys path to install dir}/testcmdline. Unfortunately, because of
licensing issues I cannot distribute my test data with this package. If you are licensed to
access the Penn Treebank and/or the Penn Discourse Treebank send me email at
nikhild at seas dot upenn dot edu and we can set up a way for you to get at the test data. If
not, any sample Annotation Graph xml files should do.
Querying
The XML base needs to be queried both to compute the views and once the views are computed. The goal
of all this is to let you plug in standard mechanisms like XPath and XQuery
to do this. There is no XPath API included with our distribution, but I have
tested the XPath APIs included with Xalan and Jaxen. With
Java™ 1.5 there is an XPath API in the default distribution, so that can be used as well. I
do not know of any open source XQuery implementation. The DOM Level 2 Traversal and Range
API also provides less expressive but quicker ways to query XML structure.
Note that XPath will not be the whole answer to all your queries. For nested queries and the like
you will need a combination of XPath and programming through the DOM interface or the methods
provided by the API. You can switch freely between all these modes. From
my tests, Xalan's XPath API is faster if you want to do multiple queries on a static DOM. If
you need to do XPath on a rapidly changing DOM, the performance gap closes and Jaxen is more
memory efficient.
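As an example of plugging in a standard mechanism, the sketch below runs an XPath query with the javax.xml.xpath API that ships with Java 1.5 and later. The class XPathQuerySketch and the tiny sample document are hypothetical; the element and attribute names follow the Annotation Graph DTD form, and no namespaces are used to keep the expression short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: selects annotation ids by type with the JDK's XPath API. */
final class XPathQuerySketch {
    // Hypothetical sample data in the style of an Annotation Graph document.
    static final String SAMPLE =
        "<AGSet><AG id='ag1'>"
        + "<Annotation id='a1' type='NP' start='n0' end='n1'/>"
        + "<Annotation id='a2' type='VP' start='n1' end='n2'/>"
        + "<Annotation id='a3' type='NP' start='n2' end='n3'/>"
        + "</AG></AGSet>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    static List<String> idsOfType(Document doc, String type) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind $type through a variable resolver instead of pasting it into the expression.
            xpath.setXPathVariableResolver(
                name -> "type".equals(name.getLocalPart()) ? type : null);
            NodeList hits = (NodeList) xpath.evaluate(
                "//Annotation[@type = $type]/@id", doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(hits.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idsOfType(parse(SAMPLE), "NP"));
    }
}
```

For repeated queries on a static DOM, compiling the expression once with XPath.compile and reusing the resulting XPathExpression avoids re-parsing the expression each time.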
To minimize the amount of programming needed, and to keep speed high, use the setUserData methods
on the nodes. This is different from the setUserObject methods, which are intended for XML structure.
For example, if you have two AGTreeViews and need pointers from one to the other, the setUserObject
method can be used for labels and other information local to the node, and setUserData can
be used to maintain the pointers. This will be especially useful while editing or showing a
graphical display.
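The pointer idea can be sketched with the DOM Level 3 methods Node.setUserData and Node.getUserData, under the assumption that these are the setUserData methods meant above. The class CrossViewPointers and the key name are hypothetical, and plain DOM elements stand in for nodes of two AGTreeViews.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Illustrative only: keeps cross-view pointers in DOM user data, outside the XML structure. */
final class CrossViewPointers {
    static final String PEER_KEY = "example.peer"; // hypothetical key name

    /** Links two nodes to each other without adding anything to either tree. */
    static void link(Node a, Node b) {
        a.setUserData(PEER_KEY, b, null);
        b.setUserData(PEER_KEY, a, null);
    }

    static Node peerOf(Node node) {
        return (Node) node.getUserData(PEER_KEY);
    }

    /** Creates a fresh document with a single element, standing in for a node of one view. */
    static Node newNode(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            return doc.appendChild(doc.createElement(name));
        } catch (Exception e) {
            throw new IllegalStateException("could not create node", e);
        }
    }

    public static void main(String[] args) {
        Node syntaxNode = newNode("syntaxNode");
        Node discourseNode = newNode("discourseNode");
        link(syntaxNode, discourseNode);
        System.out.println(peerOf(syntaxNode).getNodeName());
    }
}
```

User data is not serialized with the document, which fits its use for transient pointers during editing or display.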
There is one type of query that cannot be handled by this API. This is the online multi-file
query: a single query which we want to run against a large part of the annotation
base, with the results viewed immediately. In my opinion no XML-based annotation
API will be able to support this kind of query for some time to come. Serialization and
object creation times are bottlenecks. We can get away with it if results are dense, so that further results
can be fetched while the user looks at the current ones (this may be the case in early annotation
correction phases), but in general this kind of thing cannot be attempted with this API.
Limitations and TODOs
As mentioned above there is only direct support for handling signals of text. Handling other
signals like speech and video is not currently on our agenda and you may be better served by the
AGTK.
Tables and Collection views will be implemented if someone is willing to send us examples of
use. Also, the graph API should respect some generic interface which has a lot of algorithms
implemented.
Support for simple types needs to be added. For example when we have a list of IDREFs it seems
unreasonable to have to split it and create new Strings. Rather a list view of one String should
be maintained. But this is a little tricky to implement in that it is not quite clear whether
these should have live DOM support and whether they should be created through the DOM events
API or by some other mechanism.
It would be nice to have an internal mechanism (by implementing some interface) for XPath so
we can speed it up and use it a little more carelessly.
References
[1]
@inproceedings{BirdLiberman99dtag,
author={Steven Bird and Mark Liberman},
title={Annotation graphs as a framework for multidimensional
linguistic data analysis},
year=1999,
booktitle={Towards Standards and Tools for Discourse Tagging --
Proceedings of the Workshop},
publisher={Somerset, NJ: Association for Computational Linguistics},
pages={1--10},
note={[xxx.lanl.gov/abs/cs.CL/9907003]}
}
[2] XML Schema Specification
[3] XPath Recommendation
[4] Post Schema-Validation Infoset
[5] Xerces2 Java Parser
[6] XML Beans
[7] XML
[8] The World Wide Web Consortium
[9] Annotation Graph Implementation
[10] Schema Docs
[11] API Docs
[12] SAX
[13] Xalan
[14] Jaxen
[15] DOM
[16] DOM Level2 Traversal and Range