
    Next: Digitisation and Exposure of English Place-names

    April 5th, 2012

    Chalice was a short project, funded by JISC, to extract a digital gazetteer, in Linked Data form, from selected volumes of the English Place-Name Survey.

    Happily, the same group of partners, with the addition of the Institute for Name Studies, secured significant funding from JISC to complete the scanning, OCR, error correction and text mining of all the existing published volumes of the Survey.

    The project, known as DEEP – Digitisation and Exposure of English Place-names – will run until 2013, when the resulting data will be made available through the JISC-supported Unlock Places geographic search API.

    List of outcomes of the Chalice project

    July 8th, 2011

    We put together this long list of different things that happened during the Chalice project, for our last bi-weekly project meeting, on 28th April 2011. The final product post offers an introduction to Chalice.


    These are the pieces of work completed as part of the project:

    • Corrected OCR for 5 EPNS volumes (*not* open licensed)
    • Quality assessment of the OCR
    • Extracted data in XML
    • Report on the text-mining and georeferencing process
    • RDF representation of extracted data, Open Database License
    • Searchable JSON API for the extracted data
    • Two prototype visualisations
    • Source code for the preceding 4 items
    • Two use case assessments
    • Supporting material for the use case assessments
    • Simple web service for alt-names for ADS
    • Sample Integration with GBHGIS data


    These are less concrete but equally valuable side-effects of the project work:

    • A set of sameAs assertions for Cheshire names, linking geonames and the Ordnance Survey 50K gazetteer
    • Historic place-name data with which those gazetteers could potentially be enhanced
    • Improvements to the Edinburgh Geoparser and the Unlock Text service
    • Pushed forward open source release of the Geoparser
    • Refactoring of the Unlock Places service
    • Discussions and potential alignment with other projects (SPQR, Pleiades, GBHGIS)
    • Discussions with other place-name surveys (SPNS – Wales?)

    Talks / Dissemination

    Final Product Post: Chalice: past places and use cases

    June 29th, 2011

    This is our “final product post” as required by the #jiscexpo project guidelines. Image links somehow got broken; they are fixed now, so please re-view.

    Chalice – Past Places

    Chalice is for anyone working with historic material – be that archives of records, objects, or ideas. Everything happens somewhere. We aimed to provide a historic place-name gazetteer covering a thousand years of history, linked to attestations in old texts and maps.

    Place-name scholarship is fascinating; looking at names, a scholar can describe the lay of the land and trace political developments. We would like to pursue further funding to work with the English Place-Name Survey on an expert-crowdsourced service consuming the other 80+ volumes and extracting the detailed information – etymology, field-names.

    Linked to other archival sources, the place-name record has the potential to reveal connections between them, and in turn feed into deeper coverage in the place-name survey.

    There is a Past Places browser to help illustrate the data and provide a Linked Data view of it.

    Stuart Dunn did a series of interviews and case studies with different archival sources, making suggestions for integration. The report on our use case for the Clergy of the Church of England Database may be found here, and that on our study of the Victoria County History is here. We also had valuable discussions with the Archaeology Data Service, which were reported in a previous post.

    Rather than a classical ‘user needs’ approach, targeting groups such as historians, linguists and indeed place-name scholars, it was decided to look in detail at other digital resources containing reference material. This allowed us to start considering various ways in which a digitized, linkable EPNS could be automatically related to such resources. The problems are not only the ones we anticipated, of usability and semantic crossover between the placename variants listed in EPNS and elsewhere, but also ones of data structure, domain terminology and the relationship of secondary references across such corpora. We hope these considerations will help inform future development of placename digitization.

    Project blog

    This covers the work of the four partners in the project.

    CeRch at KCL developed use cases through interviews with maintainers of different historic sources. There are blog descriptions of conversations with the Clergy of the Church of England Database, the Victoria County History and the Archaeology Data Service.

    LTG did some visualisations for these use cases and, more substantially, the text mining of the semi-structured text of different sample volumes of the English Place Name Survey.

    The extraction of corrected text from previously digitised pages was done by CDDA in Belfast. There is a blog report on the final quality of the work; however, the full resulting text is neither openly licensed nor distributed through Chalice.

    EDINA took care of project management and software development. We used the opportunity to try out a Scrum-style “sprint” way of working with a larger team.

    TOC to project blog – here is an Atom feed of all the project blog posts, categorised by project partner.

    Project tag: chaliced

    Full project name: Connecting Historical Authorities with Links, Contexts and Entities

    Short description: Creating and re-using a linked data historic gazetteer through text mining.

    Longer description: Text mining volumes of the English Place Name Survey to produce a Linked Data historic gazetteer for areas of England, which can then be used to improve the quality of georeferencing in other archives. The gazetteer is linked to other placename sources on the Linked Data web via geonames and Ordnance Survey Open Data. Intensive user engagement with archive projects that can benefit from the open data gazetteer and open source text mining tools.

    Key deliverables: Open source tools for text mining archives; Linked open data gazetteer, searchable through JISC’s Unlock service; studies of further integration potential.

    Lead Institution: University of Edinburgh

    Person responsible for documentation: Jo Walsh

    Project Team: EDINA: Jo Walsh (Project Manager), Joe Vernon (Software Developer), Jackie Clark (UI design), David Richmond (Infrastructure), CDDA: Paul Ell (WP1 Coordinator), Elaine Yates (Administration), David Hardy (Technician), Karleigh Kelso (Clerical), LTG: Claire Grover (Senior Researcher), Kate Byrne (Researcher), Richard Tobin (Researcher), CeRch: Stuart Dunn (WP3 Coordinator).

    Project partners and roles: Centre for Data Digitisation and Analysis, Belfast – preparing digitised text, Centre for e-Research, Kings College London – user engagement and dissemination, Language Technology Group, School of Informatics, Edinburgh – text mining research and tools.

    This is the Chalice project blog and you can follow an Atom feed of blog posts (there are more to come).

    The code produced during the Chalice project is free software; it is available under the GNU Affero GPL v3 license. You can get the code from our project sourceforge repository. The text mining code is available from LTG – please contact Claire Grover for a distribution…

    The Linked Data created by text mining volumes of the English Place Name Survey – mostly covering Cheshire – is available under the Open Database License, a share-alike license for data by Open Data Commons.

    The contents of this blog itself are available under a Creative Commons Attribution-ShareAlike 3.0 Unported license.



    Link to technical instructional documentation

    Project started: July 15th 2010
    Project ended: April 30th 2011
    Project budget: £68054

    Chalice was supported by JISC as a project in its #jiscexpo programme. See its PIMS project management record for information about where responsibility fits in at JISC.

    Geo-linking EPNS to other sources

    May 13th, 2011

    We’re wrapping up the loose ends on the Chalice project now, preparing to publish all the final material.

    Claire Grover at LTG did some interesting map renderings of the English Place-Name Survey names that we’ve managed to link to names in geonames and the Ordnance Survey Linked Data.

    Claire writes: Following last Thursday’s discussion, I’ve pulled out some figures about the georeferences in the Chalice data.

    I’ve also mapped the georeferences for each of the files – see the .display.html files. The primary.display.html ones (example: Cheshire Vol. 44) contain only the places that were identified as primary sub-townships, while the all.display.html ones (example: Cheshire Vol. 44) contain all the places that have at least one grid reference. Note that the colour of the grid references and markers in the display indicates the source: green ones are from Unlock, red ones are from geonames, and blue ones were provided by EPNS (known-gridref – only in Cheshire and Shropshire).

    It’s not easy to draw any firm conclusions from this, but I tend to agree with Paul [Ell, of CDDA] that it would be better not to georeference smaller places (secondary sub-townships) but instead to assign them the grid reference of the larger place they are contained in or associated with.

    ADS use case

    March 1st, 2011

    Jo and I recently met with Stuart Jeffrey and Michael Charno at the Archaeology Data Service in York, to discuss a putative third CHALICE use case. The ADS is the main repository for archaeological data in the UK, and thus has many potential crossovers with CHALICE, and faces many comparable issues in terms of delivering the kind of information services its users want.

    Much of the ADS’s discovery metadata, as far as topography is concerned, is based on the National Monument Record (NMR), and therefore on modern placenames. The ADS’s ArchSearch facility is based on a faceted classification principle: users can come into the system from a national perspective and use parameters of ‘what’, ‘when’ and ‘where’ to pare the data down until they have a result set that conforms to their interests, with the indexing and classification into facets undertaken by ADS staff during the accession process.

    In parallel with this, the ADS has experimented with Natural Language Processing (NLP) algorithms to extract place types – types of monument, site, feature etc. – from so-called ‘grey literature’, employing the MIDAS period terms. The principle of using NLP to build metadata is not in itself unproblematic: many depositors prefer to be certain that *they* are responsible for creating, and signing off, the descriptive metadata for their records. As with other organizations that we’ve spoken to, Stuart noted that georeferencing collections according to county > district > parish can create problems due to boundary changes; also, many users do not necessarily approach administrative units in a systematic way. For example, most people would not, in their searching behaviour, characterize ‘Blackpool’ as a subunit of ‘Lancashire’. This throws up interesting structural parallels with what we heard from the CCED project.

    Another good example the ADS recently encountered is North Lincolnshire, which is described by Wikipedia as “a unitary authority area in the region of Yorkshire and the Humber in England… [and] for ceremonial purposes it is part of Lincolnshire.” This came up while they were creating a web service for the Heritage Gateway. It was assumed that users would naturally look for North Lincolnshire in Lincolnshire; however, the Heritage Gateway used the official hierarchy, which put North Lincolnshire in Yorkshire and the Humber. They were working on addressing that in the next version of their interface.

    It was strongly agreed that there is a very good case to be made for using CHALICE to enrich ADS metadata with historical variants, and that those wishing to search the collections via location would benefit from such enrichment. This view of things sits well alongside the CCED case (which focuses on connections of structure and georeferencing) and VCH (which focuses on connections between semantic entities). What is interesting is that all three cases have different implications for the technology, costs and research use: in the next three months or so the project will work on describing and addressing these implications.

    Linked Data for places – any advice?

    January 6th, 2011

    We’d really benefit from advice about what Linked Data namespaces to use to describe places and the relationships between them. We want to re-use as much of others’ work as possible, and use vocabularies which are likely to be well and widely understood.

    Here’s a sample of a “vanilla” rendering of a record for a place-name in Cheshire as extracted from the English Place Name Survey – see this as a rough sketch.

    <chalice:Place rdf:about="/place/cheshire/prestbury/bosley/bosley">
      <chalice:parish rdf:resource="/place/cheshire/prestbury/bosley"/>
      <chalice:parent rdf:resource="/place/cheshire/prestbury/bosley"/>
      <georss:point>53.1862392425537 -2.12721741199493</georss:point>
      <owl:sameAs rdf:resource=""/> <!-- geonames URI elided -->
    </chalice:Place>

    We could re-use as much as possible of the geonames ontology. It defines gn:Feature to indicate that a thing is a place, and gn:parentFeature to indicate that one place contains another.
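    As a rough sketch of what that re-use might look like – the chalice.example URIs are placeholders of our own, and the namespace spellings should be checked against the published ontologies – here is the Bosley record rebuilt with rdflib in Python:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    GN = Namespace("http://www.geonames.org/ontology#")
    GEORSS = Namespace("http://www.georss.org/georss/")

    g = Graph()
    g.bind("gn", GN)

    # Our own (illustrative) Chalice URIs for Bosley and its containing place
    bosley = URIRef("http://chalice.example/place/cheshire/prestbury/bosley/bosley")
    parent = URIRef("http://chalice.example/place/cheshire/prestbury/bosley")

    g.add((bosley, RDF.type, GN.Feature))      # "this thing is a place"
    g.add((bosley, GN.parentFeature, parent))  # "that place contains this one"
    g.add((bosley, GEORSS.point, Literal("53.1862392425537 -2.12721741199493")))

    print(g.serialize(format="turtle"))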

    Ordnance Survey

    Ordnance Survey publish some geographic ontologies: there are some within their Linked Data site, and there’s some older work, including a vocabulary for mereological (i.e. containment) relations with isPartOf and hasPart. But the status of this vocabulary is unclear – is its use still advised?

    The Administrative Geography ontology defines a ‘parish’ relation – this is the inverse of how we’re currently using ‘parish’ (i.e. Prestbury contains Bosley). (And our concepts of historic parish and sub-parish are terrifically vague…)

    For place-names found in the 1:50K gazetteer, the OS use the NamedPlace class – but it feels odd to re-use a vocabulary explicitly designed for the 50K gazetteer.


    Are there other wide-spread Linked Data vocabularies for places and their names which we could be re-using? Are there other ways in which we could improve the modelling? Comments and pointers to others’ work would be greatly appreciated.

    Reflections on the second Chalice scrum

    January 6th, 2011

    We had a second two-week Scrum session on code for the Chalice project. This was a follow-up to the first Chalice scrum, during which we made solid progress.

    During the second Scrum the team ran into some blocks and progress slowed. The following is quite a soul-searching post, in accordance with the project documentation instructions: “don’t forget to post the FAIL(s) as well: telling people where things went wrong so they don’t repeat mistakes is priceless for a thriving community.”

    Our core problem was the relative inflexibility of the relational database backend. We’d chosen to use an RDBMS rather than an RDF triplestore mainly for the benefits of code-reuse and familiarity, as this enabled us to repurpose code from a couple of similar EDINA projects, Unlock and Addressing History.

    However, when the time came to revise the model based on updated data extracted from EPNS volumes, this created a chain of dependencies – updates to the data model, then the API, then the prototype visualisation. Progress slowed, and not much changed in the course of the second sprint.

    A second problem was the lack of really clearly defined use cases, especially for a visual interface to the Chalice data. Here we have a bit of a chicken-and-egg situation: the work exploring how different archive projects can re-use the Chalice data to enhance their collections is still going on. This is something on which we have placed more emphasis during the latter part of the project.

    So on the one hand there’s a need for a working prototype to be able to integrate Chalice data with other resources; and on the other, a need to know how those resources will re-use the Chalice data to inform the prototype.

    So what would we do differently if we did it again?

    • More of a design phase before the Scrum proper starts – with time to experiment with different data storage backends
    • More work developing detailed use cases before software development starts
    • More active collaboration between people talking to end users and people developing the backend (made more difficult because the project partners are distributed in space)

    Below are some detailed comments from two of the Scrum team members, Ross and Murray.

    Ross: I found Scrum useful and efficient – great for noticing what others are doing, when you’re heading down the wrong path, and when you need further meetings, as was the case a few times early in the process. The whiteboard idea developed later on was also very useful. I don’t think the bottlenecks were anything to do with the use of Scrum, just with the amount of information and quality of data we had available to us; maybe this is due partially to the absence of requirements gathering in Scrum.

    The data we received had to be reverse engineered in some respects. As well as figuring out what everything in the given format was for (such as regnal dates, alternative names, contained places and their location relative to parent) and what parts were important to us (such as which of the many date formats we were going to store, i.e. start, end and/or approximations), we also had no direct control over it.

    In order for the database, interface and API to work, we had to decide on a structure quickly and get data into the database. Learning how to install and operate a triple store (the recommended method), or spending time figuring out how to get Hibernate (a more adaptable database access technology) to work with the decided structure, would have delayed everything, so a trade-off was made: manually write code to parse the data from XML and enter it into a familiar relational database. This caused us more problems later on. One of these was that the data continued to change with every generation; elements being added, removed or completely changed meant changing the parsing, then the domain objects, then the database and lastly the database insertion code.
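    For illustration only, the kind of hand-rolled load Ross describes looks something like this in Python – the element names and table layout here are invented, not the real EPNS extraction schema:

    import sqlite3
    import xml.etree.ElementTree as ET

    # NB: "place", "name", "parish" and "gridref" are hypothetical element
    # names; the real EPNS extraction schema differed (and kept changing).
    doc = ET.parse("cheshire-extract.xml")

    db = sqlite3.connect("chalice.db")
    db.execute("""CREATE TABLE IF NOT EXISTS place
                  (id INTEGER PRIMARY KEY, name TEXT, parish TEXT, gridref TEXT)""")

    for el in doc.iter("place"):
        db.execute("INSERT INTO place (name, parish, gridref) VALUES (?, ?, ?)",
                   (el.findtext("name"), el.findtext("parish"), el.findtext("gridref")))
    db.commit()

    The fragility follows directly: every change to the XML means touching the parser, the table definition and the insert code in step.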

    Lack of use cases: from the start we were developing an app without knowing what it should look like or how it should function. We were unsure what data we should or would need to store, and how much control users of the service would have over the data in the database. We were unsure how to query the database and display API request responses so as to best fit the needs of the intended users in an efficient, useful way. We are slightly clearer on this now, but more information on how the product will be used would be greatly helpful.

    And as for future development… if we are sticking with the relational database model, I definitely think it’s wise to get rid of all the database reading/writing code in favour of a Hibernate solution. This would be tricky with our database structure, but more adaptable and symmetrical, so that changes to the input method are also made to the output and only one change needs to be made. Some sort of XML-POJO relational tool may also be useful to further improve adaptability, although it would make importing new datasets more complex (perhaps using XSLT). As well as that, some more specific use cases mentioning inputs and required outputs would be very useful.

    Murray: My comment would be that we possibly should have worked on a Hibernate ORM first, before creating the database. As soon as we had natural keys, triggers and stored procedures in the database, it became too cumbersome to reverse engineer them.

    If we had created an ORM mapping first, we could automatically generate the db schema from that, rather than the other way round. I presume we could write the searches, even the spatial ones, in Hibernate rather than stored procedures. Then it would be easier to cope with all the shifts in the XML structure: propagating changes through the tiers would be a case of regenerating the db and domain objects from the mappings, rather than doing it by hand.

    The generated domain objects could be reused across the data loading, API and search. The default lazy loading in Hibernate would have been good enough to deal with the hierarchical nature of the data to an indiscriminate depth.
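    Hibernate is a Java ORM, but Murray’s mapping-first suggestion is easy to sketch in Python with SQLAlchemy’s declarative style – an illustrative sketch, not what the project actually built:

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Place(Base):
        """Adjacency-list mapping for the place hierarchy (illustrative)."""
        __tablename__ = "place"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        parent_id = Column(Integer, ForeignKey("place.id"))
        children = relationship("Place")  # lazy-loaded, so deep hierarchies stay cheap

    # The schema is generated from the mapping, not reverse-engineered from the db
    engine = create_engine("sqlite:///chalice.db")
    Base.metadata.create_all(engine)

    The point of the pattern is that coping with shifts in the incoming structure becomes a matter of editing the mapping and regenerating the schema, rather than reverse-engineering keys, triggers and stored procedures by hand.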

    Chalice at WhereCamp

    November 23rd, 2010

    I was lucky enough to get to WhereCamp UK last Friday/Saturday, mainly because Jo couldn’t make it. I’ve never been to one of these unconferences before but was impressed by the friendly, anything-goes atmosphere, and emboldened to give an impromptu talk about CHALICE. I explained the project setup, its goals and some of the issues encountered, at least as I see them –

    • the URI minting question
    • the appropriateness (or lack of it) of only having points to represent regions instead of polygons
    • the scope for extending the nascent historical gazetteer we’re building and connecting it to others
    • how the results might be useful for future projects.

    I was particularly looking for feedback on the last two points: ideas on how best to grow the historical gazetteer and who has good data or sources that should be included if and when we get funding for a wider project to carry on from CHALICE’s beginnings; and secondly, ideas about good use cases to show why it’s a good idea to do that.

    We had a good discussion, with a supportive and interested audience. I didn’t manage to make very good notes, alas. Here’s a flavour of the discussion areas:

    • dealing with variant spellings in old texts – someone pointed out that the sound of a name tends to be preserved even though the spelling evolves, and maybe that can be exploited (see the sketch after this list);
    • using crowd-sourcing to correct errors from the automatic processes, and to gather further info on variant names;
    • copyright and IPR, and the fact that being out of print copyright doesn’t mean there won’t be issues around digital copyright in the scanned page images;
    • whether or not it would be possible – in a later project – to do useful things with the field names from EPNS;
    • the idea of parsing out the etymological references from EPNS, to build a database of derivations and sources;
    • using the gazetteer to link back to the scanned EPNS pages, to assist an online search application.
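    On the variant-spellings point: sound-preserving matching is cheap to prototype with a phonetic key. Here is a simplified Soundex sketch in Python (the historic spellings below are invented for illustration):

    def soundex(name, length=4):
        """Simplified Soundex: reduce a name to its consonant-sound skeleton."""
        groups = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
        codes = {c: str(i) for i, letters in enumerate(groups, 1) for c in letters}
        name = name.upper()
        key, prev = name[0], codes.get(name[0])
        for c in name[1:]:
            code = codes.get(c)
            if code and code != prev:
                key += code
            if c not in "HW":  # treat H and W as 'transparent' between consonants
                prev = code
        return (key + "000")[:length]

    # Variant spellings of the same place collapse to a single key
    for spelling in ("Baschurch", "Bascherche", "Basschirche"):
        print(spelling, soundex(spelling))  # all print B262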

    Plenty of use cases were suggested, and here are some that I remember, plus ideas about related projects that it might be good to tie up with:

    • a good gazetteer would aid research into the location of places that no longer exist, e.g. from the Domesday period – if you can locate historical placenames mentioned in the same text, you can start narrowing down the likely area for the mystery places;
    • the library world is likely to be very interested in good historical gazetteers, a case mentioned being the Alexandria Library project sponsored by the Library of Congress amongst others;
    • there are overlaps and ideas to share with similar historical placename projects like Pleiades, Hestia and GAP (Google Ancient Places).

    I mentioned that, being based in Edinburgh, we’re particularly keen to include Scottish historical placenames. There are quite a few sources and people who have been working for ages in this area – that’s probably one of the next things to take forward, to see if we can tie up with some of the existing experts for mutual benefit.

    There were loads of other interesting presentations and talk at WhereCamp… but this post is already too long.

    Linking historic places: looking at Victoria County History

    November 19th, 2010

    Stuart Dunn mentioned the Victoria County History in his writeup of discussions with the Clergy of the Church of England Database project. Both resources are rich in place-name mentions and historic depth; as part of the Chalice project we’re investigating ways to make such resources more searchable by extracting historic place-names and linking them to our gazetteer.

    Here’s a summary of some email conversation between Stuart, Claire Grover, Ross Drew at EDINA and myself while looking at some sample data from VCH.

    The idea is to explore the possibilities in how Chalice data could enhance / complement semi-structured information like VCH (or more structured database-like sources such as CCED).

    It would be very valuable, I think, to do an analysis of how much effort and preparation of the (target) data is needed to link CHALICE to VCH, and to a more structured dataset like CCED. By providing georeferences and toponym links, we’re bringing all that EPNS documentary evidence to VCH, thus enriching it.

    It would be very interesting if we were able to show how text-mining techniques could be used to add to the work of EPNS – extracting place references that aren’t listed, and suggesting them to editors along with suggested attestations (source and date).

    In the more immediate future, this is about adding links to Chalice place-references in other resources, which would allow us to cross-reference them and search them in interesting ways.

    Text mining isn’t absolutely necessary to map the EPNS place names to the VCH text. On the other hand, LTG have all the processing infrastructure to convert formats, tokenise the text etc., so we could put something in place very quickly. It wouldn’t be perfect, but it would demonstrate the point. I’ve not seen the CCED data, so don’t know how complex that would be.
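    A crude demonstration of that quick-and-dirty approach: gazetteer headwords matched as whole words against running text. The names, links and sample sentence here are invented for illustration:

    import re

    # A toy gazetteer of EPNS headwords mapped to hypothetical Chalice links
    gazetteer = {
        "Baschurch": "/place/shropshire/baschurch",
        "Bosley": "/place/cheshire/prestbury/bosley/bosley",
    }

    text = "The parish of Baschurch is first attested in the tenth century."

    # Whole-word matching of headwords; crude, but enough to show the idea
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, gazetteer)) + r")\b")
    for match in pattern.finditer(text):
        print(match.group(1), "->", gazetteer[match.group(1)])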

    Here’s a sample reference to a volume of VCH that may have some overlap with the Shropshire content we have in “born-digital” form from EPNS. There’s the intriguing prospect of adding historic place-name text mining/search at the digitisation phase, so resources can be linked to other references as soon as they’re published.

    Structuring a Linked Data namespace for places

    November 10th, 2010

    Thoughts on structuring a namespace for historic English places, for our prototype Linked Data version of the English Place Name Survey; how do others do it? Our options seem to be:

    1. give each placename a numeric identifier that can be part of the link
    2. create a more human-readable identifier based on the name, to use as part of the link.

    Numeric identifiers for places look like common practice. geonames uses numbers to create links for places – so a numeric URI “is”, or refers to, Baschurch in Shropshire. Though the coordinates of the point may change, the number is associated with the name, and it remains the same.

    Ordnance Survey Linked Data also uses a numeric ID to create its link that stands for (the same) Baschurch.

    The Linked Data Patterns online book has a set of patterns for identifier URIs. The patterns are focused on use with systems that are already database-based, with some design thought having gone into how IDs look, how they can be looked up, and how their persistence is guaranteed.

    The point here is that the numeric identifiers still need careful curation – an organisational guarantee that the identifiers will stay the same for the predictable future.

    We’re using a relational database (PostGIS) rather than a triplestore to hold the Chalice data (because the data model won’t really change or expand). We can’t just use IDs that are created automatically by the database when items are inserted into it, because those might change if the names are inserted in a different order.

    During Chalice we’re not building a be-all-and-end-all system, but rather prototyping an approach in which text mining and georeferencing of places can be used to turn an amazing hand-created resource into a 21st-century Linked Data gazetteer, leaving behind open source tools to make sure the process can be repeated with more digitised text.

    But we’re not building something to throw away; we want to make sure the links we create can be preserved – that they won’t be broken and won’t change their meanings. So it may be better for us to structure our namespace using the EPNS names themselves, and the order in which they occur in the printed volumes of EPNS.

    The EPNS volumes are arranged county-by-county – each county has its own editor, and so may have different layout, style guidelines, level of detail for things like field-names, and the presence or absence of OS Grid coordinates, more or less according to the whims of the county editor. (We’ve focused on Cheshire, but LTG have been developing test parsers for samples of several different counties.)

    So it makes sense to include the county name in our namespace. This also helps with disambiguation – which Walton is this Walton? But there will still be cases where several places, in quite different locations but within the same county, share a name. In this case, we’d also give the places a numeric identifier (Walton-1, Walton-2) in the order in which they appear in the EPNS text.
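    A minimal sketch of that naming rule in Python – the slugify rule and the link layout are our own illustrative choices, not a settled design:

    from collections import Counter

    def slugify(name):
        return name.lower().replace(" ", "-")

    def make_identifiers(county, names_in_epns_order):
        """Yield /place/<county>/<name> links, numbering only ambiguous names."""
        totals = Counter(slugify(n) for n in names_in_epns_order)
        seen = Counter()
        for name in names_in_epns_order:
            slug = slugify(name)
            if totals[slug] > 1:  # several Waltons: number them in EPNS order
                seen[slug] += 1
                slug = f"{slug}-{seen[slug]}"
            yield f"/place/{county}/{slug}"

    print(list(make_identifiers("cheshire", ["Walton", "Bosley", "Walton"])))
    # ['/place/cheshire/walton-1', '/place/cheshire/bosley', '/place/cheshire/walton-2']

    Because the numbering follows the order of the printed volumes rather than database insertion order, regenerating the data cannot silently renumber the links.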

    Some volumes of EPNS give us OS National Grid coordinates for the “major names”, others don’t. Where the “major name” exists in one or more gazetteers (geonames, OS Open Data), the LTG’s georesolver tool can create some of the missing links using the Unlock Places gazetteer cross-search.

    There is more potentially useful context in the work of the UK Location Programme on Linked Data namespaces for places – a recent Guide to Linked Data and the UK Location Strategy, and last year’s guidance on Designing URI sets for Location.

    One more potential complication, which is a fairly subtle issue of semantics – does a link identify a place, or a description of a place? Ordnance Survey Research try to make the difference clear by using different namespaces for ‘IDs for places’ (/id/) and ‘IDs for documents describing places’ (/doc/).
    So the /id/ URI “is” Baschurch, and the /doc/ URI “is” the description of Baschurch. To make sure we’re properly confused, when a human looks up the /id/ link using a web browser, the browser is redirected to the human-readable /doc/ page. To actually get hold of the Linked Data description of Baschurch (including the coordinates for it in the 50K gazetteer), one has to specifically request the machine-readable, rather than human-readable, version of the link, like this:

    curl -L -H "Accept: application/rdf+xml" [the /id/ URI] :) – but now you know that!
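    The same content negotiation scripted in Python with the requests library – a sketch; fill in whichever /id/ URI you want to resolve:

    import requests

    # Placeholder: the /id/ URI for the place you are resolving
    id_uri = "http://data.ordnancesurvey.co.uk/id/..."

    # Without the Accept header the redirect lands on the human-readable /doc/
    # page; with it, the server returns the machine-readable RDF/XML instead.
    response = requests.get(id_uri,
                            headers={"Accept": "application/rdf+xml"},
                            allow_redirects=True)
    print(response.text)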

    This took me a little while, and some back-and-forth with John Goodwin from OS Research on Twitter, to figure out, which is why I thought it worth writing down here.