JANUARY 2004
CONTENTS
CATALOGUING THE LATIN INSCRIPTIONS IN THE BRITISH MUSEUM: IN XML? BY JONATHAN PRAG
THE MARK-UP OF ARCHAEOLOGICAL EXCAVATION REPORTS USING THE DTD OF THE TEXT ENCODING INITIATIVE (TEI) BY CHRISTIANE MECKSEPER
GETTING YOUR DATA BACK; AN INTRODUCTION TO XQUERY BY MARK BELL
Mark Bell
Welcome to the second edition of the newsletter. I am glad to
say there was positive feedback on the first issue. I am also glad
to say this issue has articles by authors other than me.
Submissions, however long or short, are always welcome. Please keep
the comments coming. I hope to produce another newsletter soon
after Easter, when there should be some reports from the CAA
conference in Prato – see below.
In the last newsletter I said there was going to be a specific
session on XML at CAA 2004 in Prato. James Landrum emailed me to
say “There will not be a technology-specific session on XML
and Databases at CAA2004, however there will be a number of
sessions in which XML and databases play roles, and therefore there
may be papers relevant to XML and related markup issues in a
variety of sessions”. Further details of papers should be
posted on the CAA2004 website. The conference will run from
13-17 April 2004.
CATALOGUING THE LATIN INSCRIPTIONS IN THE BRITISH MUSEUM: IN XML?
Jonathan Prag
School of Archaeology and Ancient History
University of Leicester
Although this project is temporarily stalled through lack of
time, I hope to return to cataloguing the Latin inscriptions of the
British Museum's Greek and Roman department in the near future. At
present the project is dependent on the research time I personally
have available while holding a teaching post - so not very
much…
The project presents a number of difficulties, and XML seems to offer
the best potential for solving them. Firstly, there is a wide range of
data which needs integrating, ranging from images to acquisition
records to object data.
Secondly, epigraphic texts present a particular problem when it
comes to database storage. Ideally one wants a wholly searchable
text; however epigraphic conventions, sigla, etc. are not conducive
to searchable texts; furthermore they are not readily entered into
many standard database packages. XML offers probably the best
solution to this range of problems.
The potential of XML for epigraphic texts is currently being
explored through the EPAPP
(http://www.kcl.ac.uk/humanities/cch/epapp/)
and EpiDoc (http://epidoc.sourceforge.net/)
projects. Those interested in the details of the problems are
directed to the EpiDoc website.
In essence a word such as "servus" ('slave') could appear in an
epigraphic text, by way of example, as oddly as: s]er<v>u[s-,
where additionally the letters e and u have subscript dots beneath
them (and you try including those in a plain text or HTML e-mail,
let alone in a database field). Including all that information
while still permitting a text search of the field which would find
the word 'servus' is not easy. It is intended that this specific
project will be an adjunct to the EPAPP project, with the latter
acting in an advisory and support role.
Thirdly, the Museum's own cataloguing system presents a range of
difficulties when it comes to integration of databases. The existing
digital catalogue (which does not include the Latin epigraphic
material) is a much more minimal catalogue than the specific
requirements of this material demand. However, the medium-term plans
for the general catalogue envisage the 'bolting on' of local specific
catalogues such as the one under discussion here. In this case the
choice of base database format is perhaps both more problematic and
less immediately obvious. Indeed it may be that the core data is
collated first in an Access database, from where it can then be
exported in minimal form to the existing BM catalogue and used as the
base for the fuller XML catalogue, which in turn will perhaps be more
readily appended to the BM catalogue (the XML interface of which is
undergoing gradual development). It should be noted, however, that
this aspect of the project is still very much at the discussion
stage. Additionally, a further project at the discussion stage
involves the integration of details of all classical inscriptions
currently held in the UK; this too will raise further problems of
data-type integration.
Fourthly, there is the preferred form of publication. Again, the
EPAPP project presents the possible 'ideal' format for such a
database, since a paper publication is unlikely. But here too, as
EPAPP has been demonstrating, one advantage of XML is the potential
for very varied output with considerable ease. Whether the
catalogue will ultimately be hosted online, made available in
digital form through other means, or not made generally available
in digital form but used simply to generate a hard copy end-product
is also still under discussion. Here too XML would seem to possess
greater flexibility than 'traditional' database packages (granted,
however, that this is a more debatable point).
The initial cataloguing is envisaged to take in the region of 2-3
years (depending upon available time), and the final finished product
will probably be a year or two longer in the making, although it is
hoped to have the work essentially complete in time for the
International Epigraphic Congress being hosted by London and Oxford
in 2007.
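Returning to the 'servus' example above, the following is a minimal Python sketch of how element-based mark-up might keep the editorial information while still allowing a plain-text search. The element names (w, supplied, unclear, corrected) and the reading given to each siglum are assumptions made for illustration only; they are not taken from the EpiDoc or EPAPP guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of the word printed as s]er<v>u[s-, with e and u dotted.
# The element names are illustrative assumptions, not a real project's schema.
WORD = (
    "<w>"
    "<supplied>s</supplied>"    # letter inside square brackets in print
    "<unclear>e</unclear>"      # dotted letter
    "r"
    "<corrected>v</corrected>"  # letter inside angle brackets in print
    "<unclear>u</unclear>"      # dotted letter
    "<supplied>s</supplied>"    # letter inside square brackets in print
    "</w>"
)


def searchable_text(xml_fragment):
    """Return the bare letters of a marked-up word, ignoring the mark-up."""
    element = ET.fromstring(xml_fragment)
    return "".join(element.itertext())


def editorial_status(xml_fragment):
    """List (status, letters) pairs so the editorial detail is not lost."""
    element = ET.fromstring(xml_fragment)
    pairs = []
    if element.text:
        pairs.append(("clear", element.text))
    for child in element:
        pairs.append((child.tag, child.text or ""))
        if child.tail:
            pairs.append(("clear", child.tail))
    return pairs


if __name__ == "__main__":
    print(searchable_text(WORD))
    print(editorial_status(WORD))
```

The same stored text serves both purposes: a search runs against the output of searchable_text, while a printed edition could be generated from the pairs returned by editorial_status.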
THE MARK-UP OF ARCHAEOLOGICAL EXCAVATION
REPORTS USING THE DTD OF THE TEXT ENCODING INITIATIVE
(TEI)
Christiane Meckseper
In September 2000 I undertook research for a Masters degree in
Information Systems at the University of Sheffield into the
feasibility of using XML as a mark-up language for the publication
of archaeological field reports produced by commercial units,
so-called 'grey literature' (Meckseper & Warwick 2003). The basis for the dissertation was an article by Gray and Walford
(1999) published in Internet Archaeology, who had first recommended
the use of XML for the publication of archaeological data.
The main purpose of my research was to find a suitable DTD, or
XML schema, that could be used for the description of
archaeological data. Gray and Walford had suggested a rudimentary
custom archaeological DTD but I felt that it would be beyond the
scope of a Masters dissertation to build an archaeological DTD from
scratch. I surveyed several DTDs with an archaeological background
or connection, most notably David Schloen's XSTAR project
(http://www.oi.uchicago.edu/OI/PROJ/XSTAR/XSTAR.html).
Other DTDs that could be used to mark up aspects of an
archaeological report are the Geography Mark-up Language (GML) for
geographical information (http://www.opengis.org/techno/specs/00-029/GML.html);
the Historical Event Mark-up and Linking (HEML) project (http://heml.mta.ca/heml-cocoon/);
and the DTD based on SPECTRUM, an established museum process and
documentation standard (Degenhart Drenth 2001) developed by the
Museum Documentation Association (MDA), working in collaboration
with CIMI and other organizations.
However, as no single DTD at the time seemed suitable for the
task, and as I was almost exclusively dealing with written textual
material I decided to use the DTD of the Text Encoding Initiative
(TEI) for the mark-up of the archaeological reports (http://www.tei-c.org/P4X/).
The TEI is an international project to develop guidelines for the
encoding of textual material in electronic form for research
purposes (http://www.tei-c.org/). The TEI DTD is often used to produce electronic facsimiles of original paper publications and the elements defined in the TEI DTD therefore concentrate largely on the structural rendition of a text. However, the TEI DTD also provides extensive means of marking up names, place names and locations, dates and even levels of certainty, all of which are useful elements to mark up aspects of archaeological data and reports.
It is also possible to extend the TEI DTD with custom-made
elements, which could be used to add a further level of
archaeological detail to the mark-up of a report. Using XML's
extensive possibilities of defining elements with attributes, it
would also be very easy to introduce a controlled vocabulary into
the archaeological descriptions based on the English Heritage
Thesauri, MIDAS or other wordlists. For example an element could
take the form of:
"The <monument schema="English Heritage Thesaurus of
Monument Types" type="inhumation">articulated
skeletons</monument> were all aligned E-W and laid out with
arms across the pelvis."
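As a rough illustration of why such attributes are useful, the Python sketch below reads the example sentence and lists the controlled-vocabulary terms it contains. The wrapping p element is an assumption added only to make the fragment well-formed, and the monument element is the custom element proposed in the text, not part of the standard TEI DTD.

```python
import xml.etree.ElementTree as ET

# The example sentence, wrapped in a hypothetical <p> element so that it
# forms a well-formed XML fragment.
SENTENCE = (
    '<p>The <monument schema="English Heritage Thesaurus of Monument Types" '
    'type="inhumation">articulated skeletons</monument> were all aligned E-W '
    'and laid out with arms across the pelvis.</p>'
)


def controlled_terms(xml_fragment):
    """Return (schema, type, marked-up text) for every <monument> element."""
    root = ET.fromstring(xml_fragment)
    return [
        (m.get("schema"), m.get("type"), "".join(m.itertext()))
        for m in root.iter("monument")
    ]


if __name__ == "__main__":
    for schema, term, text in controlled_terms(SENTENCE):
        print(f"{term!r} ({schema}): {text!r}")
```

A search for the thesaurus term "inhumation" would then find this sentence even though the word itself never appears in the running text.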
Another method of including a richer level of archaeological
mark-up could be the use of elements from another, purely
archaeological DTD, or DTDs like the GML mentioned above, in the
form of namespaces (http://www.w3.org/TR/REC-xml-names/).
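The sketch below suggests how namespaces might let a geographical element sit inside a report without its name clashing with the report's own elements. The rep namespace and its elements are hypothetical, the coordinates are invented, and the GML namespace URI and element names are assumed here from the GML specification and should be checked against it.

```python
import xml.etree.ElementTree as ET

NS = {
    "rep": "http://example.org/report",   # hypothetical report vocabulary
    "gml": "http://www.opengis.net/gml",  # assumed GML namespace URI
}

# A made-up report fragment mixing the two vocabularies.
DOCUMENT = """
<rep:report xmlns:rep="http://example.org/report"
            xmlns:gml="http://www.opengis.net/gml">
  <rep:p>Trench 1 was located at
    <gml:Point><gml:coordinates>435000,387000</gml:coordinates></gml:Point>.
  </rep:p>
</rep:report>
"""


def point_coordinates(xml_text):
    """Return the text of every gml:coordinates element in the document."""
    root = ET.fromstring(xml_text)
    return [c.text.strip() for c in root.findall(".//gml:coordinates", NS)]


if __name__ == "__main__":
    print(point_coordinates(DOCUMENT))
```

Because each element name is qualified by its namespace, software that understands only the geographical vocabulary can pick out the point and ignore the rest of the report.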
Using the TEI’s structural text mark-up, separate sections
of reports, like specialist reports, abstracts or structural
descriptions, could also be extracted from a body of archaeological
reports and re-compiled. For example, a specialist would be able to
extract all human bone reports from a corpus of archaeological
texts. This is one of the recommendations for archaeological
publication by the PUNS survey, a survey into the user needs of
archaeological publication, published by the Council for British
Archaeology (Jones, S., MacSween, A., Jeffrey, S., Morris, R.,
Heyworth, M. 2001).
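The extraction of one kind of section from a body of reports can be sketched in Python as follows. The two tiny reports are invented stand-ins, and the values of the type attribute ("human-bone", "pottery", "abstract") are hypothetical labels, not values prescribed by the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

# Two tiny, invented stand-ins for marked-up reports.
CORPUS = {
    "site_a.xml": """<text><body>
        <div type="abstract"><p>Evaluation of a medieval cemetery.</p></div>
        <div type="human-bone"><p>Twelve articulated skeletons were recorded.</p></div>
    </body></text>""",
    "site_b.xml": """<text><body>
        <div type="pottery"><p>Forty sherds of Roman greyware.</p></div>
        <div type="human-bone"><p>One disturbed inhumation.</p></div>
    </body></text>""",
}


def extract_sections(corpus, section_type):
    """Collect the text of every <div> of the given type, keyed by report."""
    found = {}
    for name, xml_text in corpus.items():
        root = ET.fromstring(xml_text)
        sections = root.findall(f".//div[@type='{section_type}']")
        if sections:
            # Join the text nodes and collapse runs of whitespace.
            found[name] = [" ".join("".join(d.itertext()).split()) for d in sections]
    return found


if __name__ == "__main__":
    for report, texts in extract_sections(CORPUS, "human-bone").items():
        print(report, texts)
```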
As part of my research I marked up a very small sample of field
reports produced by ARCUS, the commercial unit based at the
University of Sheffield. However, I think what is needed in
relation to XML and archaeological excavation reports is a pilot
project that would mark up a body of texts using the TEI or another
suitable DTD, and then carry out extensive testing of their
usability and searchability.
Further postgraduate research is currently being undertaken at
the University of York (Falkingham, pers. comm.) into the potential
of XML for archaeological grey literature reports and a
'multi-layered' presentation once a document has been marked up.
This research is also considering the development of XML mark-up for
reports that would allow a degree of interoperability. For example a suitable
mark-up could allow data to be extracted from reports in a useful
format for input into SMR database fields and OASIS records. The
development of the FISH interoperability toolkit will also be an
interesting project to follow in this context.
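To give a feel for what extracting data "in a useful format for input into database fields" could involve, here is a small Python sketch that flattens a marked-up sentence into one row of comma-separated values. The element names, the column names and the sample report are all hypothetical; they do not reflect the actual SMR or OASIS field definitions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# An invented report fragment with hypothetical element names.
REPORT = """<report>
  <p>Excavation at <placeName>Example Farm</placeName> in
     <date value="2003">2003</date> revealed a
     <monument type="inhumation">burial</monument>.</p>
</report>"""

# Hypothetical mapping: column name -> (element path, attribute or None).
FIELDS = {
    "site_name": (".//placeName", None),
    "year": (".//date", "value"),
    "monument_type": (".//monument", "type"),
}


def report_to_record(xml_text):
    """Build one flat record from the first match for each mapped field."""
    root = ET.fromstring(xml_text)
    record = {}
    for column, (path, attribute) in FIELDS.items():
        element = root.find(path)
        if element is None:
            record[column] = ""
        elif attribute is None:
            record[column] = (element.text or "").strip()
        else:
            record[column] = element.get(attribute, "")
    return record


def records_to_csv(records):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(records_to_csv([report_to_record(REPORT)]))
```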
A small non-representative survey of commercial archaeological
units, undertaken as part of the dissertation, showed that even
though there is considerable willingness to publish material in an
electronic format, it is still beyond the financial capabilities of
many units to implement a system of electronic publication, let
alone to incorporate extensive XML mark-up. However, hopefully with
the increasing acceptance of electronic publication and the
development of necessary tools to easily integrate different kinds
of archaeological electronic data and to more cost-effectively
process existing electronic versions of texts (for example in the
form of inserting semi-automated mark-up), funding methods can be
found that will allow the integration of richer electronic
publication into a commercial environment.
Bibliography
Gray, J., Walford, K. (1999). “One Good Site Deserves
Another: Electronic Publishing in Field Archaeology.” In:
Internet Archaeology 7. [Online]. Available at:
http://intarch.ac.uk/journal/issue7/gray_toc.html.
[Accessed: 07.01.2004].
Degenhart Drenth, B. (2001). Building on the mda SPECTRUM-XML
DTD for Collections Management Data Interchange. [Online].
http://www.archimuse.com/mw2001/papers/degenhart/degenhart.html.
[Accessed: 07.01.2004].
Jones, S., MacSween, A., Jeffrey, S., Morris, R., Heyworth, M.
(2001). From the Ground Up. The Publication of Archaeological
Projects: a user needs survey. (CBA Publications).
Meckseper, C. & Warwick, C. (2003). "The Publication of
Archaeological Excavation Reports using XML." In: Journal for
Literary and Linguistic Computing, Volume 18, Issue 1. pp. 63
http://www3.oup.co.uk/litlin/hdb/Volume_18/Issue_01/pdf/180063.pdf
[Accessed: 07.01.2004].
GETTING YOUR DATA BACK; AN INTRODUCTION TO XQUERY
Mark Bell
There are two parts to data management. The first part is
structuring and storing your data. XML as a data format is ideal
for this. The second and more important part of the process is
retrieving information. Until very recently there were few tools
around for data retrieval, but the publication of a number of
standards has led to the creation of some useful tools.
The World Wide Web Consortium (http://www.w3.org/) published
working drafts of XPath version 2.0, XSLT version 2.0 and XQuery
version 1.0 on 12 November 2003.
XPath is a way of describing where nodes are in an XML document
and is used by the other standards to navigate through XML
documents.
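As a small illustration of path expressions, the Python sketch below uses the standard library's ElementTree module, which supports only a limited subset of XPath 1.0 (not the 2.0 draft discussed here). The bibliography data is invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up bibliography used only for this illustration.
BIB = """<bib>
  <book year="1994"><title>First Example Title</title><author>A. Author</author></book>
  <book year="2000"><title>Second Example Title</title><author>B. Writer</author></book>
</bib>"""

root = ET.fromstring(BIB)

# Path steps are relative to the root <bib> element.
all_titles = [t.text for t in root.findall("book/title")]

# A predicate in square brackets narrows the selection by attribute value.
titles_2000 = [t.text for t in root.findall("book[@year='2000']/title")]

# ".//" searches at any depth below the current node.
authors = [a.text for a in root.findall(".//author")]

if __name__ == "__main__":
    print(all_titles)
    print(titles_2000)
    print(authors)
```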
XSLT (Extensible Stylesheet Language Transformations) is used
for transforming XML data into other formats such as HTML or,
by way of XSL-FO, PDF.
XQuery is to XML documents what SQL is to a database – a
standard way to query a document.
The heart of XQuery is FLWOR (pronounced "flower"). FLWOR comes
from For, Let, Where, Order by, and Return.
The for and let clauses specify a sequence of values, which can
then be filtered with a where clause and ordered using an order by
clause. The return clause shows what should be returned. An XQuery
expression to return title and author tags from an XML file of book
information will look something like this:
<results>
  {
    for $b in doc("http://bstore1.example.com/bib.xml")/bib/book
    return
      <result>
        { $b/title }
        { $b/author }
      </result>
  }
</results>
(taken from the W3C use cases – see below).
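For readers more familiar with general-purpose programming, the following Python sketch mimics the roles of the five FLWOR clauses step by step. It is an analogy only, not XQuery, and it does not show how any XQuery processor works internally. The sample data and the min_year filter are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data standing in for a bibliography file.
BIB = """<bib>
  <book year="2000"><title>Second Example Title</title><author>B. Writer</author></book>
  <book year="1994"><title>First Example Title</title><author>A. Author</author></book>
  <book year="1992"><title>Third Example Title</title><author>C. Scribe</author></book>
</bib>"""


def flwor(xml_text, min_year):
    """Imitate for / let / where / order by / return over <book> elements."""
    bib = ET.fromstring(xml_text)
    selected = []
    for book in bib.findall("book"):        # for: iterate over each book
        year = int(book.get("year"))        # let: bind a value to a name
        if year >= min_year:                # where: keep only matching books
            selected.append(book)
    selected.sort(key=lambda b: b.findtext("title"))  # order by: sort on title

    results = ET.Element("results")
    for book in selected:                   # return: build the output elements
        result = ET.SubElement(results, "result")
        result.append(book.find("title"))
        result.append(book.find("author"))
    return ET.tostring(results, encoding="unicode")


if __name__ == "__main__":
    print(flwor(BIB, 1993))
```

The point of comparison is brevity: the XQuery expression states the same kind of selection declaratively in a few lines, with the output tags written directly into the query.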
An important point is that tags can be embedded in the return
expression so the output can be valid XML or HTML. XQuery comes
with a huge range of predefined functions for text and mathematical
operations as well as the option for users to create their own
functions.
There are now several good introductions to XQuery on the web.
For example: X Is for XQuery by Jason Hunter (http://otn.oracle.com/oramag/oracle/03-may/o33devxml.html)
or What is XQuery? (http://www.xml.com/pub/a/2002/10/16/xquery.html).
The W3C set of XQuery use cases (http://www.w3.org/TR/xquery-use-cases)
are easy to follow and show what can be done with XQuery using
examples.
Conclusion
As XML becomes a more widely used standard, users are going to
need better ways to find information in their XML documents. XQuery
is already a powerful tool for data retrieval and no doubt is
going to become as popular as SQL.
A note on programming tools for XQuery
XQuery is only a standard, and must be implemented by
programming tools. One such tool set is SAXON. It is an XSLT and
XQuery Processor, and an open source product. It can be downloaded
from http://saxon.sourceforge.net/.
It is written in Java and can be called from a Java program or run
from a command line. Only the latest version, 7.8, includes support
for XQuery 1.0 as well as XSLT 2.0. Version 7.8 updates the product
to align it with the 12 November 2003 working drafts.
Microsoft is offering a free set of XQuery class libraries for
use with the 1.0 and 1.1 releases of the Microsoft .NET Framework
software development kit. These can be downloaded from http://xqueryservices.com/.
The .NET development kit is also available for free from
http://msdn.microsoft.com/library/default.asp?url=/downloads/list/netdevframework.asp.
This is the command line version.
Using the class libraries an XQuery file can be run against an
XML document and the results either sent to a file or to the
command line.
Links
XML Path Language (XPath) Version 2.0. Available at: http://www.w3.org/TR/xpath20/
XSLT 2.0 Available at: http://www.w3.org/TR/xslt20/
XQuery 1.0: An XML Query Language Available at: http://www.w3.org/TR/xquery/
END OF NEWSLETTER NUMBER 2
This newsletter is copyright © Mark Bell and the individual authors, 2004.
Please contact the editor before reproducing material from this newsletter.