
    Linked Data choices for historic places

    November 5th, 2010

    We’ve had some fitful conversation about modelling historic place-names extracted from the English Place Name Survey as Linked Data, on the Chalice mailing list.
    It would be great to get more feedback from others where we have common ground. Here’s a quick summary of the main issues we face and our key points of reference, to start discussion, and we can go into more detail on specific points as we work more with the EPNS data.

    Re-use, reduce, recycle?

We should be making direct re-use of others’ vocabularies where we can. In some areas this is easy. For example, to represent the containment relations between places (a township contains a parish, a parish contains a sub-parish) we can re-use some of the Ordnance Survey Research work on linked data ontologies – specifically their vocabulary describing “Mereological Relations”, where “mereological” is a fancy word for “containment relationships”.
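To make the idea concrete, here is a toy sketch (plain Python, not the OS ontology itself) of the kind of containment relations we want to express: each place lists its direct parts, and transitive containment falls out by walking down the hierarchy. The place-names are illustrative only.

```python
# Direct containment relations, following the hierarchy described above.
# These example names are illustrative, not extracted EPNS data.
contains = {
    "Cheshire": ["Macclesfield Hundred"],       # county contains hundred
    "Macclesfield Hundred": ["Bosley"],         # hundred contains township
    "Bosley": ["Bosley Minn"],                  # township contains a minor name
}

def contained_within(place, container):
    """True if `place` lies (transitively) within `container`."""
    for child in contains.get(container, []):
        if child == place or contained_within(place, child):
            return True
    return False
```

A Linked Data version would state only the direct relations as triples and leave the transitive reasoning to the consumer, which is exactly why re-using a shared containment vocabulary matters.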

    Adapting other schemas into a Linked Data model

One project which provides a great example of a more link-oriented, less geometry-oriented approach to describing ancient places is the Pleiades collection of geographic information about the Classical ancient world. Over the years, Pleiades has developed with scholars an interesting set of vocabularies, which don’t take a Linked Data approach but could easily be adapted to do so. They encounter issues of vagueness and uncertainty that geographical information systems concerning the contemporary world can overlook. For example, the Pleiades attestation/confidence vocabulary expresses the certainty of scholars about the conclusions they are drawing from evidence.

    So an approach we can take is to build on work done in research partnerships by others, and try to build mind-share about Linked Data representations of existing work. Pleiades also use URIs for places…

    Use URIs as names for things

    One interesting feature of the English Place Name Survey is the index of sources for each set of volumes. Each different source which documents names (old archives, previous scholarship, historic maps) has an abbreviation, and every time a historic place-name is mentioned, it’s linked to one of the sources.

As well as creating a namespace for historic place-names, we’ll create one for the sources (centred on the five volumes covering Cheshire, which is where the bulk of work on text correction and data extraction has been done). Generally, if anything has a name, we should be looking to give it a URI.
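As a sketch of what minting those URIs might look like: the namespace roots below are placeholders (the post doesn’t fix the final Chalice namespaces), but the pattern – slugify a name or a source abbreviation and append it to a namespace – is the general one.

```python
import re

# Hypothetical namespace roots -- placeholders for illustration only;
# the real Chalice namespaces were still under discussion.
PLACE_NS = "http://example.org/chalice/place/"
SOURCE_NS = "http://example.org/chalice/source/"

def slugify(name):
    """Lower-case, strip punctuation, hyphenate: 'Bosley Minn' -> 'bosley-minn'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def place_uri(name):
    """URI for a historic place-name documented by EPNS."""
    return PLACE_NS + slugify(name)

def source_uri(abbreviation):
    """URI for an EPNS source abbreviation (an old map, archive, chronicle...)."""
    return SOURCE_NS + slugify(abbreviation)
```

So `place_uri("Bosley Minn")` yields `http://example.org/chalice/place/bosley-minn`, and every mention of a name in a source can point at both a place URI and a source URI.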

    Date ranges

    Is there a rough consensus (based on volume of data published, or number of different data sources using the same namespace) on what namespace to use to describe dates and date ranges as Linked Data? At one point there were several different versions of iCal, hCal, xCal vocabularies all describing more or less the same thing.

    We’ve also considered other ways to describe date ranges – talking to Pleiades about mereological relations between dates – and investigating the work of Common Eras on user-contributed tags representing date ranges. It would be hugely valuable to learn about, and converge on, others’ approaches here.

    How same is the same?

    We propose to mint a namespace for historic place-names documented by the English Place Name Survey. Each distinct place-name gets its own URI.

    For some of the “major names”, we’ve been able to use the Language Technology Group’s georesolution tool to make a link between the place-name and the corresponding entry in geonames.org.

    Some names can’t be found in geonames, but can be found, via Unlock Places gazetteer search, in some of the Ordnance Survey open data sources. Next week we’ll be looking at using Unlock to make explicit links to the Ordnance Survey Linked Data vocabularies. One interesting side-effect of this is that, via Chalice, we’ll create links between geonames and the OS Linked Data, that weren’t there before.

    Kate Byrne raised an interesting question on the Chalice mailing list – is the ‘sameAs’ link redundant? For example, if we are confident that Bosley in geonames.org is the same as Bosley in the Cheshire volumes of English Place Name Survey, should we re-use the geonames URI rather than making a ‘sameAs’ link between the two?

How same, in this case, is the same? We may have two or more different sets of coordinates which approximately represent the location of Bosley. Is it “correct”, in Linked Data terms, to state that they are all “the same” when the locations are subtly different?
    This is before we even get into the conceptual issues around whether a set of coordinates really has meaning as “the location” of a place. Geonames, in this sense, is a place to start working out towards more expressive descriptions of where a place is, rather than a conclusion.
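One way to make the “how same is the same” judgement operational is to flag a candidate ‘sameAs’ link only when the names match and the two gazetteers place the point within some tolerance. A minimal sketch, assuming a simple name match and an arbitrary 2 km threshold (both assumptions, not project policy):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def candidate_same_as(name_a, coord_a, name_b, coord_b, tolerance_km=2.0):
    """Flag a candidate 'sameAs' link when the names match and the two
    gazetteers locate the place within an (arbitrary) distance tolerance."""
    return (name_a.lower() == name_b.lower()
            and haversine_km(*coord_a, *coord_b) <= tolerance_km)
```

A human would still review the flagged candidates; the threshold only filters out pairings that are clearly not the same place.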

    Long-term preservation

Finally, we want to make sure that any URIs we mint are going to be preserved on a really long time horizon. I discussed this briefly on the Unlock blog last year. University libraries, or cultural heritage memory institutions, may be able to delegate a sub-domain whose long-term persistence we can agree to – but the details of such an agreement, and its periodic renewal in the face of infrastructural, organisational and technological change, form a much bigger issue than I think we recognise.


    Connecting archives with linked geodata – Part I

    October 22nd, 2010

This is the first half of the talk I gave at FOSS4G 2010 covering the Chalice project and the Unlock services. Part II to follow shortly…

    My starting talk title, written in a rush, was “Georeferencing archives with Linked Open Geodata” – too many geos; though perhaps they cancel one another out, and just leave *stuff*.

    In one sense this talk is just about place-name text mining. Haven’t we seen all this before? Didn’t Schuyler talk about Gutenkarte (extracting place-names from classical texts and exploring them using a map) in like, 2005, at OSGIS before it was FOSS4G? Didn’t Metacarta build a multi-million business on this stuff and succeed in getting bought out by Nokia? Didn’t Yahoo! do good-enough gazetteer search and place-name text mining with Placemaker? Weren’t *you*, Jo, talking about Linked Data models of place-names and relations between them in 2003? If you’re still talking about this, why do you still expect anyone to listen?

    What’s different now? One word: recursion. Another word: potentiality. Two more words: more people.

Before I get too distracted, I want to talk about a couple of specific projects that I’m organising.

    One of them is called Chalice, which stands for Connecting Historical Authorities with Linked Data, Contexts, and Entities. Chalice is a text-mining project, using a pipeline of Natural Language Processing and data munging techniques to take some semi-structured text and turn the core of it into data that can be linked to other data.

The target is a beautiful production called the English Place Name Survey. This is a definitive-as-possible guide to place-names in England, their origins, the names by which things were known, going back through a thousand years of documentary evidence, reflecting at least 1500 years of the movement of people and things around the geography of England. There are 82 volumes of the English Place Name Survey, which started in 1925, and is still being written (and once it’s finished, new generations of editors will go back to the beginning, and fill in more missing pieces).

    Place-name scholars amaze me. Just by looking at words and thinking about breaking down their meanings, place-name scholars can tell you about drainage patterns, changes in the order of political society, why people were doing what they were doing, where. The evidence contained in place-names helps us cross the gap between the archaeological and the digital.

    So we’re text mining EPNS and publishing the core (the place-name, the date of the source from which the name comes, a reference to the source, references to earlier and later names for “the same place”). But why? Partly because the subject matter, the *stuff*, is so very fascinating. Partly to make other, future historic text mining projects much more successful, to get a better yield of data from text, using the one to make more sense of the other. Partly just to make links to other *stuff*.

In newer volumes the “major names”, i.e. the contemporary names (or the last documented name for places that have become forgotten), have neat grid references, point-based, so they come geocoded. The earliest volumes have no such helpful metadata. But we have the technology; we can infer it. Place-name text mining, as my collaborators at the Language Technology Group in the School of Informatics in Edinburgh would have it, is a two-phase process. The first phase is “geo-tagging”, the extraction of the place-names themselves, using techniques that are either rule-based (“glorified regular expressions”) or machine-learning based (“neural networks” for pattern recognition, like spam filters, which need a decent volume of training data).
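To caricature the rule-based end of geo-tagging: the “glorified regular expression” below picks up capitalised tokens following prepositions that often introduce place-names. The real LTG pipeline is far richer; this is only a sketch of the idea.

```python
import re

# A caricature of rule-based "geo-tagging": capitalised word sequences
# that follow prepositions which commonly introduce place-names.
PLACE_PATTERN = re.compile(
    r"\b(?:at|in|near|of)\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*)"
)

def tag_places(text):
    """Return candidate place-name strings found in `text`."""
    return PLACE_PATTERN.findall(text)
```

Running it over “the chapel at Bosley, in the parish of Prestbury” picks out Bosley and Prestbury – and would equally happily pick out false positives, which is why training data and machine learning earn their keep.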

The second phase is “geo-resolution”: given a set of place-names and relations between them, figuring out where they are. The assumption is that places cluster together in space much as they do in words, and on the whole that works out better than other assumptions. As far as I can see, the state of the research art in Geographic Information Retrieval is still fairly limited to point-based data, projections onto a Cartesian plane. This is partly about data availability, in the sense of access to data (lots of research projects use geonames data for its global coverage, open licence, and linked data connectivity). It’s partly about data availability in the sense of access to thinking. Place-name gazetteers look point-based, because the place-name on a flat map begins at a point on a Cartesian plane. (So many place-name gazetteers are derived visually from the location of strings of text on maps; they are for searching maps, not for searching *stuff*.)
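The clustering assumption can be sketched very simply (this is not the LTG georesolution algorithm, just the idea): each name has several candidate coordinates from a gazetteer, and we pick one candidate per name so that the chosen points sit close together.

```python
from itertools import product

def resolve(candidates):
    """Pick one candidate location per name so the chosen points cluster:
    minimise the sum of pairwise squared (planar) distances. Brute force,
    which is fine for the handful of names in one EPNS entry.

    candidates: {name: [(x, y), ...]} -> {name: (x, y)}
    """
    names = list(candidates)
    best, best_cost = None, float("inf")
    for combo in product(*(candidates[n] for n in names)):
        cost = sum(
            (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(names, best))
```

Given an ambiguous name with one candidate near the rest of the entry’s places and one far away, the near candidate wins – which is exactly the “places cluster in space as they do in words” assumption at work.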

    So next steps seem to involve

    • dissolving the difference between narrative, and data-driven, representations of the same thing
    • inferring things from mereological relations (containment-by, containment-of) rather than sequential or planar relations

    On the former – data are documents, documents are data.

On the latter, this helps explain why I am still talking about this, because it’s still all about access to data. Amazing things, that I barely expected to see so quickly, have happened since I started along this path 8 years ago. We now have a significant amount of UK national mapping data available on properly open terms, enough to do 90% of things. OpenStreetMap is complete enough to base serious commercial activity on; MapQuest is investing itself in supporting and exploiting OSM. Ordnance Survey Open Data combines to add a lot of as yet hardly tapped potential…

    Read more, if you like, in Connecting archives with linked geodata – Part II which covers the use of and plans for the Unlock service hosted at the EDINA data centre in Edinburgh.


    Chalice poster from AHM 2010

    October 22nd, 2010

Chalice had a poster presentation at the All Hands Meeting in Cardiff; the poster session was an evening over drinks in the National Museum of Wales, and all very pleasant.

    Chalice poster

View the poster on Scribd and download it from there if you like, but be aware the full-size version is rather large.

I’ve found the poster very useful; I projected it instead of presentation slides while I talked at FOSS4G and at the Place-Names workshop in Nottingham on September 3rd.


    Quality of text correction analysis from CDDA

    October 21st, 2010

    The following post is by Elaine Yeates, project manager at the Centre for Data Digitisation and Analysis in Belfast. Elaine and her team have been responsible for taking scans of a selection of volumes of the English Place Name Survey and turning them into corrected OCR’d text, for later text mining to extract the data structures and republish them as Linked Data.

    “I’ve worked up some figures based on an average character count from Cheshire, Buckinghamshire, Cambridgeshire and Derbyshire.

    We had two levels of quality control:

1st QA, spelling and font: on completion of the OCR process, and based on 40 pages averaging 4,000 characters per page, the error rate was 346 character errors (an average of 8.65 per page) = 0.22%.

1st QA, Unicode: on completion of the OCR process, and based on 40 pages averaging 4,000 characters per page, the error rate was 235 character errors (an average of 5.87 per page) = 0.14%.

TOTAL error rate: 0.36%

2nd QA: encompasses all of the 1st QA; based on 40 pages averaging 4,000 characters per page, the error rate was 18 character errors (an average of 0.45 per page) = 0.01%.

Through the pilot we identified quite a few Unicode characters unique to this material. CDDA developed an in-house online Unicode database for analysts; they can view and update the capture file, and raise new codes when found. I think for a more substantial project we might direct our QA process through an online audit system, where we could identify issues with the material, the OCR of same, macros, and the 1st and 2nd stages of quality control.

We are pleased with these figures, and it looks encouraging for a larger-scale project.”

    Elaine also wrote in response to some feedback on markup error rates from Claire Grover on behalf of the Language Technology Group:

“Thanks for these. Our QA team are primarily looking for spelling errors; from your list the few issues seem to be bold, spaces and small caps.

Of course when tagging, especially automated tagging, you’re looking for certain patterns; moving forward, I feel this error rate is very encouraging, and it helps our QA team to know what patterns might be searchable for future capture.

Looking at your issues so far on Part IV (5 issues e-mailed) and a total word count of 132,357, that’s an error rate of 0.00003.”

I am happy to have these numbers, as they let us observe consistency of quality over iterations, as means are found to work with more volumes of EPNS.
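The arithmetic behind the QA figures quoted above checks out, assuming 40 pages × 4,000 characters = 160,000 characters per sample and reading each rate as a percentage of characters checked:

```python
# Checking the arithmetic in the quoted QA figures, assuming a sample of
# 40 pages x 4,000 characters = 160,000 characters, and reading each
# error rate as a percentage of characters checked.
chars = 40 * 4000

spelling_font = 346 / chars * 100   # ~0.216% -> the quoted 0.22
unicode_errs = 235 / chars * 100    # ~0.147% -> quoted as 0.14 (truncated)
total = (346 + 235) / chars * 100   # ~0.363% -> the quoted total of 0.36
second_qa = 18 / chars * 100        # ~0.011% -> the quoted 0.01

per_page = 346 / 40                 # 8.65, matching the quoted per-page average
```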


    Musings on the first Chalice Scrum

    October 18th, 2010

For a while I’ve been hearing enthusiastic noises about how Scrum development practice can focus productivity and improve morale, and have been agitating within EDINA to try it out. So Chalice became the guinea-pig first project for a “Rapid Application Development” team; we did three weeks between September 20th and October 7th. In the rest of this post I’ll talk about what happened, what seemed to work, and what seemed soggy.

    What happened?

    • We worked as a team 4 days a week, Monday-Thursday, with Fridays either to pick up pieces or to do support and maintenance work for other projects.
    • Each morning we met at 9:45 for 15 minutes to review what had happened the day before and what would happen that day
    • Each item of work-in-progress went on a post-it note in our meeting room
    • The team was of 4+1 people – four software developers, with a database engineer consulting and sanity checking
    • We had three deliverables –
          a data store and data loading tools
          a RESTful API to query the data
          a user interface to visualise the data as a graph and map

    In essence, this was it. We slacked on the full Scrum methodology in several ways:

    • No estimates.

    Why no estimates? The positive reason: this sprint was mostly about code re-use and concept re-design; we weren’t building much from scratch. The data model design, and API to query bounding boxes in time and space, were plundered and evolved from Unlock. The code for visualising queries (and the basis for annotating results) was lifted from Addressing History. So we were working with mostly known quantities.

    • No product owner

This was mostly an oversight, going into the process without much preparation time. I put myself in the “Scrum master” role by instinct, whereas other project managers might be more comfortable playing “product owner”. With hindsight, it would have been great to have a team member from a different institution (the user-facing folk at CeRch) or our JISC project officer visit for a day and play product owner.

    What seemed to work?

    The “time-boxed” meeting (every morning for 15 minutes at 9:45) seemed to work very well. It helped keep the team focused and communicating. I was surprised that team members actually wanted to talk for longer, and broke up into smaller groups to discuss specific issues.

The team got to share knowledge on fundamentals that should be reusable across many other projects and services – for example, the optimum use of Hibernate to move objects around in Java, decoupled from the original XML sources and the database implementation.

    Emphasis on code re-use meant we could put together a lot of stuff in a compressed amount of time.

    Where did things go soggy?

    From this point we get into some collective soul-searching, in the hope that it’s helpful to others for future planning.

The start and end were both a bit halting – so out of the 12 days available, we were actually “on” for only 7 or 8. The start went a bit awkwardly because:

    • We didn’t have the full team available ’til day 3 – holidays scheduled before the Scrum was planned
    • It wasn’t clear to other project managers that the team were exclusively working on something else, so a couple of team members were yanked off to do support work before we could clearly establish our rules (e.g. “you’ll get yours later”).

    We could address the first problem through more upfront public planning. If the Scrum approach seems to work out and EDINA sticks with it for other projects and services, then a schedule of intense development periods can be published with a horizon of up to 6 months – team members know which times to avoid – and we can be careful about not clashing with school holidays.

We could address the second problem by broadcasting more, internally to the organisation, about what’s being worked on and why. Other project managers will hopefully feel happier with the arrangements once they’ve had a chance to work with the team. It is a sudden adjustment in development practice, where the norm has been one or two people full-time for a longish stretch on one service or project.

    The end went a bit awkwardly because:

    • I didn’t pin down a definite end date – I wasn’t sure if we’d need two or three weeks to get enough done, and my own dates for the third week were uncertain
    • Non-movable requirements for other project work came up right at the end, partly as a by-product of this

The first problem meant we didn’t really build to a crescendo, but rather turned up at the beginning of week 3 and looked at how much of the post-it-note map we still had to cover. Then we lost a team member, and the last couple of days turned into a fest of testing and documentation. This was great in the sense that one cannot overstate the importance of tests and documentation. It was less great in that the momentum somewhat trickled away.

    On the basis of this, I imagine that we should:

    • Schedule up-front more, making sure that everyone involved has several months advance notice of upcoming sprints
    • Possibly leave more time than the one review week between sprints on different projects
    • Wait until everyone, or almost everyone, is available, rather than make a halting start with 2 or 3 people

    We were operating in a bit of a vacuum as to end-user requirements, and we also had somewhat shifting data (changing in format and quality during the sprint). This was another scheduling fail for me – in an ideal world we would have waited another month, seen some in-depth use case interviews from CeRch and had a larger and more stable collection of data from LTG. But when the chance to kick off the Scrum process within the larger EDINA team came up so quickly, I just couldn’t postpone it.

We plan a follow-up sprint, with the intense development time between November 15th and 25th. The focus here will be on

    • adding annotation / correction to the user interface and API (the seeds already existing in the current codebase)
    • adding the ability to drop in custom map layers

    Everything we built at EDINA during the sprint is in Chalice’s subversion repository on Sourceforge – which I’m rather happy with.


    Visualisation of some early results

    August 20th, 2010

    Claire showed us some early results from the work of the Language Technology Group, text mining volumes of the English Place Name Survey to extract geographic names and relations between them.

    LTG visualisation of some Chalice data

What you see here (or in the full-size visualisations – start with the files *display.html) is the set of names extracted from an entry in EPNS (one town name, and associated names of related or contained places). Note that this is just a display; the data structures are not published here at the moment – we’ll talk next week about that.

    The names are then looked up in the geonames place-name gazetteer, to get a set of likely locations; then the best-match locations are guessed at based on the relations of places in the document.

Looking at one sample, for Ellesmere – five names are found in geonames, five are not. Of the five that are found, only two are certainly located, i.e. we can tell that the place in EPNS and the place in geonames are the same, and establish a link.

What will help improve the quantity of samenesses we can establish is filtering searches by county – either detailed boundaries, or bounding boxes that will definitely contain the county. Contemporary county data is now there for free re-use through Unlock Places, which is a place to start.
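The bounding-box filter itself is trivial – the work is in getting good boxes. A sketch, with illustrative lat/lon values only (a real box would come from Unlock Places, not these made-up numbers):

```python
# Illustrative bounding box only -- a real one would come from Unlock Places.
# Format: (min_lat, min_lon, max_lat, max_lon)
CHESHIRE_BBOX = (52.9, -3.2, 53.4, -1.9)

def within_bbox(lat, lon, bbox):
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def filter_candidates(candidates, bbox):
    """Drop gazetteer matches that fall outside the county's bounding box."""
    return [(lat, lon) for lat, lon in candidates if within_bbox(lat, lon, bbox)]
```

A candidate in Cheshire survives; a candidate in London does not, so the georesolution step has far fewer false matches to choose between.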

    Note – the later volumes of EPNS do provide OS National Grid coordinates for town names; the earlier ones do not; we’re still not sure when this starts, and will have to check in with EPNS when we all meet there on September 3rd.

    How does this fit expectations? We know from past investigations with mixed sets of user-contributed historic place-name data that geonames does well, but not typically above 50% of things located. Combining geonames with OS Open Data sources should help a bit.

The main thing I’m looking to find out now is what proportion of the set of all names will be left floating without a georeference, and how many hops or links we’ll have to traverse to connect floating place-names with something that does have a georeference; how important it will be to convey uncertainty about measurements; and what the cost/benefit will be of making interfaces that allow one to annotate and correct the locations of place-names against different historic map data sources.

    Clearly the further back we go the squashier the data will be; some of the most interesting use cases that CeRch have been talking to people about, involve Anglo-Saxon place references. No maps – not a bad thing – but potentially many hops to a “certain” reference. Thinking about how we can re-use, or turn into RDF namespaces, some of the Pleiades Ancient World GIS work on attestation/confidence of place-names and locations.


    Posters and presentations

    July 23rd, 2010

    Happy to have had CHALICE accepted as a poster presentation for the e-Science All Hands Meeting in Cardiff this September. It will be good to have a glossy poster. Pleased to have been accepted at all, as the abstract was rather scrappy and last-minute. I had a chance to revise it, and have archived the PDF abstract.

I’m also doing a talk on CHALICE, related work and future dreams, at the FOSS4G 2010 conference in Barcelona a few days earlier. Going to be a good September, I hope.


    Visiting the English Place Name Survey

    June 23rd, 2010

    I was in Nottingham for OSGIS at the Centre for Geospatial Sciences on Tuesday; skipped out between lunch and coffee break to visit the English Place Name Survey in the same leafy campus.

    A card file at EPNS

    Met with Paul Cavill, who dropped me right in to the heart of the operation – stacks of index cards in shoe boxes. Each major name has a set of annotation cards, describing different related names and their associations and sources – which range from Victorian maps to Anglo-Saxon chronicles.

The editing process takes the card sets and turns them right into print-ready manuscript. The manuscript follows very consistent layout conventions – capitalisation, indentation. This is going to make our work of structure mining a lot easier.

    Another bonus I wasn’t expecting was the presence of OSGB grid references for all the major names. The task of making links becomes a snap – I was imagining a lot of iterative guesswork based on clustering and closeness to names in other sources. (There are four Waltons in the UK in geonames, dozens in the EPNS).
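Those OSGB grid references can be turned into coordinates mechanically: the National Grid letter-pair scheme is just arithmetic. A sketch of the standard conversion (two grid letters plus an even number of digits) to eastings/northings in metres:

```python
def osgb_to_en(gridref):
    """Convert an OS National Grid reference like 'SJ 9165 6540' to the
    (easting, northing) in metres of its south-west corner. Handles the
    standard two-letter prefix plus an even number of digits."""
    ref = gridref.replace(" ", "").upper()
    e, n = 0, 0
    for letter, scale in zip(ref[:2], (500_000, 100_000)):
        idx = ord(letter) - ord("A")
        if letter > "I":          # the letter I is skipped in the grid
            idx -= 1
        e_off = idx % 5           # column within the 5x5 letter grid
        n_off = 4 - idx // 5      # row, counted up from the bottom
        if scale == 500_000:      # first letter: 500 km squares, 'S' at the false origin
            e += (e_off - 2) * scale
            n += (n_off - 1) * scale
        else:                     # second letter: 100 km squares
            e += e_off * scale
            n += n_off * scale
    digits = ref[2:]
    if digits:
        half = len(digits) // 2
        unit = 10 ** (5 - half)   # resolution implied by the digit count
        e += int(digits[:half]) * unit
        n += int(digits[half:]) * unit
    return e, n
```

So a six- or eight-figure reference in an EPNS volume drops straight out as a point we can match against geonames or OS Open Data, with no clustering guesswork needed.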

On this basis I reckon the entity recognition will be a breeze; LTG will hardly have to stretch their muscles, which means we can ask them to work on grammars and machine-learning recognisers for parts of other related archives within the scope of CHALICE.

And we would have freedom in the EDINA team’s time to do more – specifically, to look at using the National Map Library of Scotland’s map rectifier tools to correlate the gazetteer with detailed line-drawn maps also created by the late H. D. G. Foxall. Digitisations of these maps live in the Shropshire Records Office. We must talk with them about their plans (the Records Office holds copyright in the scans).

The eye-opener for me was the index of sources, or rather the bibliography. Each place-name variant is marked with a few letters identifying the source of the name. So the index itself provides a key to old maps and gazetteers and archival records. To use Ant Beck’s phrase, the EPNS looks like a “decoupled synthesis” of place-name knowledge in all these sources. If we extract its structure, we are recoupling the synthesis and the sources, and then know where to look next to go text mining and digitising.

So we have the Shropshire Hundreds as a great place to start, as this is what the EPNS are working on now and the volumes are “born digital”. Back at CDDA, Paul Ell has some of the very earliest volumes digitised, and if we find a sample from the middle, we can produce grammar rules that we can be pretty confident will extract the right structure from the whole set, when the time comes to digitise and publish the entire 80+ volume, and growing, set.

But now I’m fascinated by the use of the EPNS-derived data as a concordance to so many associated archives documenting historic social patterns. Towards the end of our chat Paul Cavill was speculating about reconstructing Anglo-Saxon England by means of text mining and georeferencing archives – we could provide a reference map to help archaeologists understand what they are finding, or even help them focus on where to look for interesting archaeology.

Paul had been visited by the time-travelling Mormons, digitising everything, a couple of weeks previously, and will hopefully offer an introduction – I would really, really like to meet them.


    CHALICE: Our Budget

    June 10th, 2010

    This is the last of the seven blog posts we were asked to complete as participants in a #jiscexpo project. I like the process. This is a generalised version of our project budget. More than half goes to the preparation and annotation of digitised text from scans, both manually and using named entity recognition tools.

The other half is for software development and user engagement; we hope to work together closely here. Of course we hope to over-deliver. We also have a small amount allocated for people to travel to a workshop. There’s another, independently supported JISC workshop planned to happen at EPNS on September 3rd.

    Institution (Apr 2010 – Mar 2011):

    EDINA National Datacentre, University of Edinburgh (project management, design, software development) – £21,129
    Language Technology Group, School of Informatics, University of Edinburgh (text mining archival work, named entity recognition toolkit development) – £19,198
    Centre for Data Digitisation and Analysis, Queen’s University Belfast (preparation of corrected digitised texts for use in archival text mining – the EPNS in a set schedule of volumes) – £15,362
    Centre for e-Research, King’s College London (backup project management, user needs and use case gathering, interviews, dissemination) – £12,365

    Amount requested from JISC – £68,054

    CHALICE: The Plan of Work

    June 10th, 2010

    DRAFT

A Gantt-like chart showing the interconnection between different work packages and participants in CHALICE – not a very high-quality scan, sorry. When there are shifts and revisions in the workplan, Jo will rub out the pencil markings and scan the chart in again, but more clearly this time.

As far as software development goes, we aspire to do Scrum, though given the resources available it will be more of a “Scrum, but…”. Depending on how many people we can get to Scrum, we may have to compress the development schedule in the centre – spike one week, deliver the next, pretty much – then have an extended maintenance and integration period with just one engineer involved.

    The preparation of structured versions of digitised text with markup of extracted entities will be more of a long slog, but perhaps I can ask CDDA and LTG to write something about their methodologies.

    The use case gathering and user engagement parts of the project will develop on the form already used in the TextVRE project.