Archaeology and XML Newsletter 3

May 2004

CONTENTS OF THIS NEWSLETTER

INTRODUCTION

NEWS

XML FOR ARCHAEOLOGY: A ROUNDTABLE DISCUSSION

THE HERITAGE EXCHANGE PROTOCOL

AN AMATEUR ARCHAEOLOGIST'S USE OF XML

INTRODUCTION

Mark Bell

Welcome to newsletter number 3 for May 2004. Just after Easter I was at the

Computer Applications in Archaeology Conference in Prato, Italy. I was interested to

see that there were plenty of papers on XML and related subjects. Two calls for

feedback and information that circulated at the conference are listed in this newsletter.

I hope to bring you some more items arising from this CAA conference in future

newsletters.

Please note the important changes to the newsletter sign-up process given in the News

section.

The next newsletter is planned for July/August, depending on feedback and

contributions of course.

Mark Bell

NEWS

Important changes to the newsletter sign-up.

I have set up a new domain at www.archweb.co.uk and the newsletter will be sent out

from there in future. To change your subscription details you need to go to the new

site and log in. The username and password you used for the newsletter sign-up have

been moved to the new site. Once you have logged in you can then go to mailing list

preferences and change your subscription options.

There is now an option to receive the newsletter either in HTML format or in plain

text. As the HTML format seems to mangle web addresses at the moment, I have set

everyone up to receive a plain text newsletter. I will let you know when this has been

fixed.

The old newsletters are archived on the site and there will soon be an option to leave

comments about the newsletters on the site.

(Technical note – the new site is built using the phpWebSite content management

system, an open source product, written in PHP by Appalachian State University.

For further details see http://phpwebsite.appstate.edu/).

XML FOR ARCHAEOLOGY: A ROUNDTABLE DISCUSSION

Dr William Kilbride

Abstract:

Published in 1998, the eXtensible Markup Language (XML) promises to

be the first step towards the next generation Web, allowing

communities to design languages that suit their particular needs and

integrate them harmoniously into a general infrastructure. In the

five years since being launched, a number of disciplines have taken

the initiative by creating their own mark-up specifications to render

and process domain specific information. MathML is used to share

mathematical expressions; CML is used to share molecular information

in Chemistry and a number of languages now exist to share musical

notation.

In contrast, archaeologists have not risen to the challenges and

opportunities of XML to the same degree. Various projects and

services like HEIRPORT, ArcheoBlog, Spectrum and OASIS use XML to

generate, render or share files, but only do so under specific and

restricted conditions. XML tools have been relatively slow to develop

partly because the standards upon which they are based have also been

slow to emerge. Moreover, variations in organisational structures and

intellectual traditions mean that such tools and standards often have

only limited relevance and application.

This roundtable is intended as a discussion forum for those

interested in XML for archaeology. A position paper will present a

number of case studies of XML applications, and the management and

strategic context of these applications. It will highlight the

presumed benefits of a wider XML development as against the implied

costs, and will identify possible areas for long term, middle term

and short term development.

Participants will be presented with a number of discussion points to

which they will be asked to respond. The roundtable will end with a

series of recommendations on how XML can be exploited more fully for

archaeology.

Nota Bene: The success of this roundtable and its recommendations

depends in part on the expertise of the whole group present.

Participants will be asked to contribute to this discussion and are

expected to have a grasp of the issues in advance. An expert panel

will present some grounding in the topic, but the hope of the

organisers is to stimulate informed discussion.

Dr William Kilbride

Assistant Director

Archaeology Data Service

Dept of Archaeology

University of York  

England YO1 7EP, UK

HEEP: HISTORICAL ENVIRONMENT EXCHANGE PROTOCOL

THE FORUM FOR INFORMATION STANDARDS IN HERITAGE (FISH)

Dr Tyler Bell

Technical Director

Oxford ArchDigital Ltd

The Historical Environment Exchange Protocol:

A CRM-based Web Service for the querying, amalgamation and exchange

of heritage information between heterogeneous data sources.

The Forum for Information Standards in Heritage (FISH) has

commissioned The Historical Environment Exchange Protocol (HEEP) as

part of the FISH Toolkit, a series of XML-based applications designed

specifically for the heritage sector.  The Historical Environment

Exchange Protocol forms the core of the Heritage Web Service, an open

standard designed to facilitate interoperability in the heritage

sector.  The HEEP will be released in the summer of 2004; all

components of the FISH Toolkit will be available for public use by

September 2004.

The HEEP is a transport-independent architecture which standardises

the manner in which heritage information is queried by client

applications, and the format in which the requested data is

delivered. It also standardises how HEEP-enabled servers report their

capabilities, and the format in which exceptions are reported.  The

protocol does not dictate the manner, format or technical platform in

which the data is stored and managed.  The Historical Environment

Exchange Protocol simply acts as a platform-independent 'connector'

between heritage datasets and the XML schema used to transport the

data.

The XML schema underlying The Historical Environment Exchange

Protocol is based on a mapping of the MIDAS standard to the CIDOC

Conceptual Reference Model (CRM). MIDAS is a data standard for

historic environment information, developed by English Heritage and

now maintained under the auspices of FISH. The CRM is a next-

generation semantic framework, developed specifically for "describing

the implicit and explicit concepts and relationships used in cultural

heritage documentation"; it is soon to be ratified as an ISO Standard.

FISH is a consortium of UK heritage institutions formed in 2001 to

"co-ordinate, develop, maintain and promote standards for the

recording of heritage information".  Contributing Organisations

include The National Trust, English Heritage, The Archaeology Data

Service, The Royal Commission on the Ancient and Historical Monuments

of Scotland, The Museum Documentation Association, and several

others.

The HEEP and other elements of the FISH Toolkit are being developed

by Oxford ArchDigital Ltd.  Contributions and comments are welcome

throughout the development process.  All questions should be

addressed to the FISH Toolkit Project Manager, Edmund Lee

edmund.lee@english-heritage.org.uk.

Further Information:

FISH: http://www.fish-forum.info

Oxford ArchDigital:  http://oxarchdigital.com

MIDAS:  http://www.jiscmail.ac.uk/files/FISH/web_midasintro.htm

The CRM:  http://cidoc.ics.forth.gr/index.html

Note that HEEP (Historical Environment Exchange Protocol) was formerly called HEP

(Heritage Exchange Protocol).

AN AMATEUR ARCHAEOLOGIST'S USE OF XML

John Palmer

I would like to describe briefly the XML set-up that I use to

maintain the work-in-progress files for my research on the Roman

Purbeck stone industry.

Origins

This study began in 1996 when I found myself employed only three days

a week but fortunately not suffering a corresponding reduction of

income; I decided to use some of my spare time studying archaeology

at King Alfred's College in Winchester, in which city I had lived for

25 years. For my first long project (over the summer vacation) I put

forward a proposal in these terms:

Proposed area of study:

Shale, stone {and salt} industries of Purbeck in the Roman period

Primary sources:

Dorset County Museum collection

Poole and Wimborne museums

Sites: asking advice from County Museum

Secondary sources:

Royal Commission on Historic Monuments inventory, Dorset South-east,

1952

This being accepted (on the understanding that the braces round

{salt} meant that I would only go into this subject if I ran out of

sources for shale and stone), I spent some time that summer visiting

Dorset and exploring both the field and the library sources for the

Roman stone and shale industries. From this came a paper which was

duly submitted as coursework. (You can read it at

www.palmyra.uklinux.net/purbeck1996.html).

It was about two years after this that (being fully employed once

more) I returned to the subject of the stone industry. By this time I

had dropped the Kimmeridge Shale industry, feeling that it was

already well covered by other workers. (For an introduction to

Kimmeridge Shale, try Calkin 1953.) On the other hand Purbeck Stone

was relatively neglected, the last major review of the subject being

30 years old (Beavis 1970). I determined from the start to put my

provisional findings and working notes on the World Wide Web, so that

others with related interests might note what I was doing and

hopefully make suggestions, corrections and comments and maybe join

in the project. I have certainly never regretted this decision, and I

am very grateful to the people who have shown interest and helped

guide my efforts.

At this point it would be well worth your while to view my current

presentation of the data at www.palmyra.uklinux.net/pur-preface.html.

You should bear in mind that in its present form it is far larger and

more complex than when I began it. At the beginning the matter on the

Web consisted of two files only: the database, being basically a list

of Roman Purbeck stone artefacts, and the bibliography, basically the

reference-list from my 1996 paper augmented by other citations which

I had added since then. These two were quite easy to maintain as HTML

files, being each no more than a few tens of thousands of characters

long.

The database was basically a long unordered list <ul>, containing

items of the following general kind:

<li>

 <ul>

   <li><b>name</b> ..identifying name of artefact.. </li>

   <li><b>site</b> ..where found and when.. </li>

   <li><b>publ</b> <a href="..">..reference to publication..</a> </li>

   <li><b>desc</b> ..description.. </li>

   <!-- other properties of the artefact added here -->

 </ul>

</li>

The bibliography was also a long <ul> but contained items like this

one:

<li>

 <a name="Bidwell1979">Bidwell PT 1979</a>,

 <em>The legionary Bath House and Basilica and Forum at Exeter</em>,

 Exeter City Council and Univ of Exeter:

 Exeter Archaeological Reports <b>1</b>

</li>

Most of the hrefs in the database were naturally to items in the

bibliography, but from the start I allowed myself unlimited

references from anywhere in my files, to anywhere else in my own

website and also to other resources on the WWW, as I felt then as now

that these were important guides that would assist my own analysis of

the data and could also be useful to other readers.

Soon these two simple files grew. By early 2000 I had moved to Dorset

and into semi-retirement. I was now in the actual country of my study

and had easy access to the excellent library of the Dorset Natural

History and Archaeological Society, which I had joined back in 1996.

The database had split into several files, such as

Mortars (stone grinding bowls)

Other vessels (baths, basins, etc.)

Other portable artefacts

Roofing tiles of stone

Paving material

Other architectural stone

Inscriptions

Quarry sites

etc.

Moreover the internal organisation of the data on each artefact had

become quite varied. Naturally there were often many publ citations

for each artefact, and often several desc descriptions, sometimes one

from each author cited. Not every artefact even had a distinctive

name; but new properties of items, like map-references, location

(i.e. in what museum), substance (real Purbeck marble, other Purbeck

stone, etc.), and date (1st century, etc.), had been added in many

cases. The number and order of these properties varied greatly, and

this was making it difficult to study and to update the data, which

by now were becoming a resource of some archaeological importance, as

I described in a paper in the Dorset Proceedings (Palmer 2001).

The problem of keeping these data in order was not assisted by the

fact that the syntax of HTML (any version) is designed for specifying

logical subdivisions of a text, but not the significant properties of

any particular kind of subject-matter (such as stone artefacts).

Although I was accustomed to using nsgmls (James Clark) to verify the

conformance of my HTML to the appropriate DTD (document type

definition), I decided that I needed a DTD more closely related to

the subject I was studying.

This DTD took shape in the summer of 2001.

The articles recorded in each file constitute a collection. Each

article in the collection is an item. I allow myself to group the

items by inserting subheads at suitable points in the list, but this

is little more than a presentational device.

<!ELEMENT collection - - ( (subhead | item)* ) >

For convenience in defining elements in the DTD I introduce an

entity:

<!ENTITY % textvar "(#PCDATA|br|em|b|a|code|img)*" >

And also this entity, to allow myself some non-ascii characters:

<!ENTITY % ISOlat1 SYSTEM "/usr/html/sgml-lib/ISOlat1.ent">

%ISOlat1;

A subhead is just a few words:

<!ELEMENT subhead - - (%textvar;) >

An item, however, has a fixed structure in which the subdivisions

always appear in the same order. This to me is an important aid to

reading and understanding the data. (In the old HTML notation there

was nothing to enforce this order.)

<!ELEMENT item - - ( name?,

number?,

cat*,

site+,

grid*,  

source*, publ*, desc*,

loc*, subst*, date*,

interp*, comment*, cont* ) >

The meanings and uses of the inner elements are given on my website at http://www.palmyra.uklinux.net/.

You'll observe that an item must have at least one site, but all the other

parts are optional; more than one is allowed of all parts except name and

number. (Actually, number is not used at all and is only in

the DTD in case I should want to start cataloguing artefacts in the

style of the great corpuses (corpora?) like RIB (Collingwood and

Wright 1965).)
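
To make the ordering rule concrete, the content model of item can be
restated as a regular expression over the names of an item's child
elements. The following is only a minimal sketch, written in Python rather
than the Perl used elsewhere in this article; the names ITEM_MODEL and
item_is_valid are invented for the illustration, and in the set-up
described here the real checking is done by nsgmls against the DTD.

```python
import re

# Content model of <item>, copied from the DTD declaration above:
# (element name, occurrence indicator).
ITEM_MODEL = [
    ("name", "?"), ("number", "?"), ("cat", "*"), ("site", "+"),
    ("grid", "*"), ("source", "*"), ("publ", "*"), ("desc", "*"),
    ("loc", "*"), ("subst", "*"), ("date", "*"),
    ("interp", "*"), ("comment", "*"), ("cont", "*"),
]

# One regular expression over a space-terminated list of child names,
# i.e. (?:name )?(?:number )?(?:cat )*(?:site )+ ... (?:cont )*
ITEM_PATTERN = re.compile(
    "".join(f"(?:{name} ){occurs}" for name, occurs in ITEM_MODEL)
)


def item_is_valid(children):
    """Return True if the sequence of child element names fits the model."""
    sequence = "".join(child + " " for child in children)
    return ITEM_PATTERN.fullmatch(sequence) is not None


print(item_is_valid(["name", "site", "publ", "publ", "desc"]))  # fits the model
print(item_is_valid(["desc", "site"]))  # parts out of order
```

An item with no site, or with its parts out of order, fails the match,
which is the discipline that the plain HTML list could not enforce.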

br, em and b are mere presentational devices and mean what they do in

HTML, i.e. linebreak, emphasise, and bold-face.

<!ELEMENT br - - (#PCDATA) --will normally be empty-->

<!ELEMENT em - - (%textvar;) >

<!ELEMENT b - - (%textvar;) >

a corresponds to its namesake in HTML and has some of the same

attributes. It is a bit old-fashioned in using "name" rather than

"id" for the label that is the target of a link.

<!ELEMENT a - - (%textvar;) >

<!ATTLIST a

 href CDATA #IMPLIED

 name CDATA #IMPLIED

 target CDATA #IMPLIED >

"target" is another merely presentational device: as in HTML, it

hints to the displaying program that it is worth opening a secondary

window. code is also presentational and corresponds to its namesake

in HTML.

<!ELEMENT code - - (%textvar;) >

img introduces a picture, as in HTML.

<!ELEMENT img - - (#PCDATA) --will normally be empty-->

All the elements listed above from br to img can be used inside any

of the elements listed below, which are the main categories of

information about an item. For the meaning and use of the latter, see

my website at http://www.palmyra.uklinux.net/.

<!ELEMENT name - - (%textvar;) >

<!ELEMENT number - - (%textvar;) >

<!ELEMENT cat - - (%textvar;) >

<!ELEMENT site - - (%textvar;) >

<!ELEMENT grid - - (%textvar;) >

<!ELEMENT source - - (%textvar;) >

<!ELEMENT publ - - (%textvar;) >

<!ELEMENT desc - - (%textvar;) >

<!ELEMENT loc - - (%textvar;) >

<!ELEMENT subst - - (%textvar;) >

<!ELEMENT date - - (%textvar;) >

<!ELEMENT interp - - (%textvar;) >

<!ELEMENT comment - - (%textvar;) >

<!ELEMENT cont - - (%textvar;) >

<!--finis-->

(The above data-structure is sufficiently restrictive for my purpose,

which was to help me to be regular and consistent in the recording of

my data. Observant eyes will note that it does permit me to do some

things that make little sense, for instance to put one a element

inside another, or to insert some textual content into br or img

elements. However I feel no inclination to do these things and don't

need the added complication of the code necessary to forbid them.)

Having chosen a data-structure, the first problem was to convert the

existing HTML data to the new form, bearing in mind that the

component parts of each item had to be forced into a new order to fit

the restrictions of the new DTD. There are many ways of doing this,

and if mine seems odd, the reader should bear in mind that I was

familiar with programming in Perl and inclined to stick to the

techniques that I knew best.

My ad-hoc program html2xml reads the HTML data and converts to the

new DTD; it uses the SGML parser nsgmls to convert the HTML to a

canonical form and creates a structure of Perl objects corresponding

to the elements of the HTML; these are then picked off in the

appropriate order to create new items with correctly ordered inner

parts. Apart from the time in 2001 when I first introduced the new

DTD, I have not used my html2xml again except on one occasion when I

removed (deleted) one of my XML files by mistake!
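
The reordering step at the heart of this conversion can be illustrated
separately from the parsing. The sketch below is in Python, not Perl, and
is not the html2xml program itself: it assumes the parts of one item have
already been extracted as (tag, text) pairs, and the names PART_ORDER and
reorder_parts are hypothetical.

```python
# Order of the parts of an item, as fixed by the DTD.
PART_ORDER = ["name", "number", "cat", "site", "grid", "source", "publ",
              "desc", "loc", "subst", "date", "interp", "comment", "cont"]
RANK = {tag: position for position, tag in enumerate(PART_ORDER)}


def reorder_parts(parts):
    """Sort (tag, text) pairs into DTD order.

    sorted() is stable, so repeated parts (several publ or desc entries)
    keep the relative order they had in the old HTML.
    """
    return sorted(parts, key=lambda part: RANK[part[0]])


# Made-up sample data in the haphazard order the old HTML allowed.
old_order = [("desc", "first description"), ("site", "where found"),
             ("publ", "first reference"), ("name", "artefact name"),
             ("desc", "second description"), ("publ", "second reference")]
for tag, text in reorder_parts(old_order):
    print(tag, text)
```

Because the sort is stable, two descriptions by different authors stay in
the sequence in which they were originally entered.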

I now had my data stored in XML in my new DTD in files called *.xml.

From summer 2001 onwards, all amendments and additions to the data

have been made by editing the XML files; this has kept a degree of

discipline in my data which was hard to achieve using raw HTML. Of

course, every time I amend an XML file, I have to ensure good order

by validating it against the DTD described above; I do this with

nsgmls, which is so quick and convenient I can use it many times over

within a single data-entry session.

I have not attempted to put my XML on the Web directly, as I think it

is important not to assume that all my readers will be using the very

latest in Web-browsing software! In fact, after amending any of my

XML master-files, I create a corresponding file in HTML by means of a

Perl program which goes by the name updatehtml. (Although this

program will produce correct HTML provided that the master-file is

correct XML, I occasionally verify the generated HTML using nsgmls.)

The automatically-generated HTML is, at the time of writing, XHTML

1.0.

The conversion to HTML is much simpler than the conversion out of it,

for it involves little more than a succession of string-

substitutions, the style of which will be familiar to anyone who has

used Perl or any of its antecedent programs like sed or vi. The

program works on tags, not on elements, which is satisfactory in this

case provided that matching operations are performed on both the

start- and the end-tag for the same element.

For instance, <collection> becomes <ul>:

$_ =~ s/<collection>/<ul>/;

$_ =~ s/<\/collection>/<\/ul>/;

<item> becomes a <ul> inside a <li>:

$_ =~ s/<item>/<li>\ <ul>/;

$_ =~ s/<\/item>/<\/ul><\/li>/;

The various parts of an item are all treated alike: first the start-

tags:

$_ =~ s/<name> */<li><b>name<\/b> /;

$_ =~ s/<number> */<li><b>number<\/b> /;

$_ =~ s/<cat> */<li><b>cat<\/b> /;

$_ =~ s/<site> */<li><b>site<\/b> /;

$_ =~ s/<grid> */<li><b>grid<\/b> /;

$_ =~ s/<source> */<li><b>source<\/b> /;

$_ =~ s/<publ> */<li><b>publ<\/b> /;

$_ =~ s/<desc> */<li><b>desc<\/b> /;

$_ =~ s/<loc> */<li><b>loc<\/b> /;

$_ =~ s/<subst> */<li><b>subst<\/b> /;

$_ =~ s/<date> */<li><b>date<\/b> /;

$_ =~ s/<interp> */<li><b>interp<\/b> /;

$_ =~ s/<comment> */<li><b>comment<\/b> /;

$_ =~ s/<cont> */<li><b>cont<\/b> /;

and the end-tags:

$_ =~ s/<\/name>/<\/li>/;

$_ =~ s/<\/number>/<\/li>/;

$_ =~ s/<\/cat>/<\/li>/;

$_ =~ s/<\/site>/<\/li>/;

$_ =~ s/<\/grid>/<\/li>/;

$_ =~ s/<\/source>/<\/li>/;

$_ =~ s/<\/publ>/<\/li>/;

$_ =~ s/<\/desc>/<\/li>/;

$_ =~ s/<\/loc>/<\/li>/;

$_ =~ s/<\/subst>/<\/li>/;

$_ =~ s/<\/date>/<\/li>/;

$_ =~ s/<\/interp>/<\/li>/;

$_ =~ s/<\/comment>/<\/li>/;

$_ =~ s/<\/cont>/<\/li>/;
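
Since the fourteen start-tag and end-tag substitutions all follow one
pattern, they could also be generated from a list of element names. The
following is a minimal sketch of that idea in Python rather than Perl; it
is not the updatehtml program, and the names PARTS and xml_line_to_html
are invented for the illustration. Like the Perl statements above, which
have no g modifier, it replaces only the first occurrence of each tag on a
line.

```python
import re

PARTS = ["name", "number", "cat", "site", "grid", "source", "publ",
         "desc", "loc", "subst", "date", "interp", "comment", "cont"]


def xml_line_to_html(line):
    """Apply the tag-for-tag substitutions described above to one line."""
    line = line.replace("<collection>", "<ul>", 1)
    line = line.replace("</collection>", "</ul>", 1)
    line = line.replace("<item>", "<li> <ul>", 1)
    line = line.replace("</item>", "</ul></li>", 1)
    for part in PARTS:
        # A start-tag plus any following spaces becomes a labelled list entry.
        line = re.sub(f"<{part}> *", f"<li><b>{part}</b> ", line, count=1)
        # The matching end-tag simply closes the list entry.
        line = line.replace(f"</{part}>", "</li>", 1)
    return line


print(xml_line_to_html("<desc>  fragment of mortar</desc>"))
```

The sample description is made up. Working on tags rather than elements
keeps the code this short, at the price of relying on the XML having
already been validated.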

As hinted before, I try to remain compatible with older browsers

while not neglecting new W3C recommendations, so I ensure that each a

element that is the target of a link has both "id" and "name"

attributes, both with the same value:

$_ =~ s/ name=(".*?")/ id=$1 name=$1/g; # 2003-01-14
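
For readers more at home in Python than Perl, the same substitution can be
sketched as follows. The function name add_id_attributes is hypothetical;
the regular expression is the one in the Perl line above.

```python
import re


def add_id_attributes(line):
    """Give every name="..." attribute a matching id="..." with the same value."""
    # The non-greedy .*? stops at the first closing quote, as in the Perl version.
    return re.sub(r' name=(".*?")', r' id=\1 name=\1', line)


print(add_id_attributes('<a name="Bidwell1979">Bidwell PT 1979</a>'))
```

Applying the substitution twice to the same line would add a second id
attribute, so it suits a program that always starts afresh from the XML
master-file.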

It remains for the program to copy the front- and back-matter from

the old version of the HTML file, changing only minor details (most

importantly the date of revision wherever it appears).

Spotmaps

The front matter of many of my HTML files includes a sketch-map of

the province of Britannia indicating the geographical distribution of

the relevant class of artefacts. This is generated from the XML files

in the following way: a program spotmap scans the file for grid-

references (element grid), and generates TeX code that places a

suitable symbol at the appropriate spot on the map according to the

National Grid. The map is then drawn and annotated using TeX,

including a coastal outline, which was obtained from the website of

the (United States) National Oceanic and Atmospheric Administration

and is stated by NOAA to be in the public domain.

(Owing to differences of geographic projection between these data and

the British National Grid, there may be small errors in the placement

of some points on the maps. This will ultimately put a limit to the

usability of the NOAA coastline data. Coastlines on a true National

Grid basis can be obtained from the British Ordnance Survey, but at

present I prefer to avoid their licensing procedures and possible

charges.)
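
The first step in placing a spot is turning a grid reference into
coordinates. The sketch below shows the usual letter-pair arithmetic for
the National Grid, in Python rather than Perl. It is not the spotmap
program: the article does not say how references are written in the grid
elements, so the two-letter form used here (for example SY 968 823) is an
assumption, the function name grid_to_metres is invented, and the
generation of TeX code is left out.

```python
def grid_to_metres(ref):
    """Convert a reference such as 'SY 968 823' to (easting, northing)
    in metres from the origin of the National Grid."""
    ref = ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not all("A" <= c <= "Z" and c != "I" for c in letters):
        raise ValueError(f"bad grid letters in {ref!r}")
    if len(digits) % 2 != 0 or len(digits) > 10 or (digits and not digits.isdigit()):
        raise ValueError(f"bad grid digits in {ref!r}")

    # The letters index a 5 x 5 layout of A..Z with the letter I omitted.
    def index(letter):
        i = ord(letter) - ord("A")
        return i - 1 if i > 7 else i

    l1, l2 = index(letters[0]), index(letters[1])
    # First letter picks a 500 km square, second a 100 km square within it.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    half = len(digits) // 2
    # Pad e.g. '968' to '96800': a six-figure reference is precise to 100 m.
    easting = int(digits[:half].ljust(5, "0")) if half else 0
    northing = int(digits[half:].ljust(5, "0")) if half else 0
    return e100km * 100_000 + easting, n100km * 100_000 + northing


print(grid_to_metres("SY 968 823"))
```

The resulting pair of numbers can then be scaled to the dimensions of the
map and written out as a positioning command for TeX.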

TeX is of course the typesetting program devised by Donald Knuth. For

an introduction try http://www.tug.org/, the site of the TeX User

Group.

Printable versions

Just a few words on the most recent enhancement. Besides the Web-

presentation of my data I need a printed version in a ring-binder,

which I can carry about with me and refer to when working in a

library or in the field. I began by using the printing facilities of

my Web-browser, but rapidly felt the need for something that would

rewrite the data in a more compact form. I now have another Perl

program that rewrites the XML data as input to LaTeX, which gives a

more compact layout than anything I've managed to achieve using a

Web-browser; it reduced the thickness of the file I sometimes carry

about by about a half.

(LaTeX is an application of TeX, invented by Leslie Lamport (LaTeX, a

document preparation system, 2nd ed., Addison-Wesley 1996) which has

been much extended by the later contributions of users.)

One word about XSL, XSLT and all that: I feel somewhat guilty about

not having used them but I really haven't yet felt the need. I find

that as I'm reasonably fluent in Perl, and my source XML has a very

simple structure, I can more easily make an ad-hoc program in Perl to

convert the XML to whatever I want. One incidental benefit is that I

can even carry the comments in the XML over into the output file!

Future developments

At the moment my bibliography is kept as a simple HTML file which is

hand-edited rather than created from a source file in some other

notation. This has been satisfactory up till now, but the increasing

size of the bibliography (it now holds over 600 citations) makes me

think of improvements.

The ideal would be to rewrite the bibliography as a BibTeX database,

from which I could generate

1. a full listing typeset with LaTeX,

2. an HTML version of the above, for presentation on the Web, and

3. a list of references for any paper I may write (using LaTeX of

course), in whatever style I (or the journal I was aiming at) wanted.

The main thing that has made me defer this plan till now is that

converting the bibliography from the present HTML form to BibTeX is

not straightforward and cannot be done by a simple conversion

program; the problem is analogous to converting a page-description

(such as a wordprocessor document or a PostScript file) to a

logically structured notation (such as LaTeX, or a relational

database). Probably I should take this task in hand before the

bibliography becomes any larger!

END OF NEWSLETTER 3

This newsletter is copyright © Mark Bell and the individual authors, 2004.

Please contact the editor before reproducing material from this newsletter.
