
    Visiting the English Place Name Survey

    June 23rd, 2010

    I was in Nottingham for OSGIS at the Centre for Geospatial Sciences on Tuesday, and skipped out between lunch and the coffee break to visit the English Place Name Survey on the same leafy campus.

    A card file at EPNS


    Met with Paul Cavill, who dropped me right into the heart of the operation – stacks of index cards in shoe boxes. Each major name has a set of annotation cards, describing different related names and their associations and sources – which range from Victorian maps to Anglo-Saxon chronicles.

    The editing process takes the card sets and turns them right into print-ready manuscript. The manuscript then has very consistent layout conventions – capitalisation, indentation. This is going to make our work of structure mining a lot easier.

    Another bonus I wasn’t expecting was the presence of OSGB grid references for all the major names. The task of making links becomes a snap – I was imagining a lot of iterative guesswork based on clustering and closeness to names in other sources. (There are four Waltons in the UK in geonames, and dozens in the EPNS.)
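
    To make the point about grid references concrete, here is a minimal sketch of linking same-named places by distance. The record layout, the `gn:` identifiers, the 5 km threshold and every coordinate are invented for illustration; none of them comes from the EPNS or from geonames.

```python
import math

# Hypothetical EPNS records: (name, easting, northing) in OSGB metres.
# All coordinates here are invented for illustration.
epns_entries = [
    ("Walton", 350000, 310000),
    ("Walton", 420000, 250000),
]

# Hypothetical geonames records: (name, id, easting, northing).
geonames_entries = [
    ("Walton", "gn:1", 350400, 310300),
    ("Walton", "gn:2", 419000, 251000),
    ("Walton", "gn:3", 500000, 150000),
]


def nearest_match(name, easting, northing, candidates, max_distance_m=5000):
    """Return the id of the closest same-named candidate within
    max_distance_m metres, or None if there is no such candidate."""
    best_id, best_dist = None, max_distance_m
    for cand_name, cand_id, cand_e, cand_n in candidates:
        if cand_name.lower() != name.lower():
            continue
        # Straight-line distance on the national grid, in metres.
        dist = math.hypot(cand_e - easting, cand_n - northing)
        if dist <= best_dist:
            best_id, best_dist = cand_id, dist
    return best_id


links = [nearest_match(n, e, no, geonames_entries) for n, e, no in epns_entries]
```

    With a grid reference on each side, telling one Walton from another is a distance comparison rather than guesswork.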

    On this basis I reckon the entity recognition will be a breeze; LTG will hardly have to stretch their muscles, which means we can ask them to work on grammars and machine learning recognisers for parts of other related archives within the scope of CHALICE.

    And we would have freedom in the EDINA team’s time to do more – specifically to look at using the National Map Library of Scotland’s map rectifier tools to correlate the gazetteer with detailed line-drawn maps also created by the late H. D. G. Foxall. Digitisations of these maps live in the Shropshire Records Office. We must talk with them about their plans (the Records Office holds copyright in the scans).

    The eye-opener for me was the index of sources, or rather the bibliography. Each placename variant is marked with a few letters identifying the source of the name. So the index itself provides a key to old maps and gazetteers and archival records. To use Ant Beck’s phrase, the EPNS looks like a “decoupled synthesis” of placename knowledge in all these sources. If we extract its structure, we are recoupling the synthesis and the sources, and we then know where to look next to go text mining and digitising.
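
    A rough sketch of what "recoupling" could look like in code: resolve the source letters on a variant against a bibliography key. The line layout (spelling, year, abbreviation), the abbreviations and the sample variant are assumptions made for illustration, not the actual EPNS conventions.

```python
import re

# Hypothetical bibliography key: abbreviation -> full source title.
SOURCES = {
    "DB": "Domesday Book",
    "ASC": "Anglo-Saxon Chronicle",
    "OS": "Ordnance Survey map",
}

# Assumed variant layout: "<spelling> <year> <source abbreviation>".
VARIANT = re.compile(
    r"^(?P<form>\S+)\s+(?P<year>\d{3,4})\s+(?P<source>[A-Za-z]+)$"
)


def resolve(variant_line):
    """Parse one variant line and expand its source abbreviation.
    Returns None when the line does not fit the assumed layout."""
    match = VARIANT.match(variant_line.strip())
    if match is None:
        return None
    abbreviation = match.group("source")
    return {
        "form": match.group("form"),
        "year": int(match.group("year")),
        "source": SOURCES.get(abbreviation, "unknown source: " + abbreviation),
    }


record = resolve("Waletone 1086 DB")
```

    Each resolved record points back at a source, which is what would tell us where to go text mining next.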

    So we have the Shropshire Hundreds as a great place to start, as this is what the EPNS are working on now and the volumes are “born digital”. Back at CDDA, Paul Ell has some of the very earliest volumes digitised, and if we find a sample from the middle, we can produce grammar rules that we can be pretty confident will extract the right structure from the whole set, when the time comes to digitise and publish the entire, and growing, set of 80+ volumes.

    But now I’m fascinated by the use of the EPNS derived data as a concordance to so many associated archives documenting historic social patterns. Towards the end of our chat Paul Cavill was speculating about reconstructing Anglo-Saxon England by means of text mining and georeferencing archives – we could provide a reference map to help archaeologists understand what they are finding, or even help them focus on where to look for interesting archaeology.

    Paul had been visited by the time-travelling Mormons digitising everything a couple of weeks previously, and will hopefully offer an introduction – I would really, really like to meet them.


    CHALICE: Our Budget

    June 10th, 2010

    This is the last of the seven blog posts we were asked to complete as participants in a #jiscexpo project. I like the process. This is a generalised version of our project budget. More than half goes to the preparation and annotation of digitised text from scans, both manually and using named entity recognition tools.

    The other half is for software development and user engagement; we are hoping to work together closely here. Of course we hope to over-deliver. We also have a small amount allocated for people to travel to a workshop. There’s another, independently supported JISC workshop planned to happen at EPNS on September 3rd.

    Institution | Apr10-Mar11
    EDINA National Datacentre, University of Edinburgh (project management, design, software development) | £21129
    Language Technology Group, School of Informatics, University of Edinburgh (text mining archival work, named entity recognition toolkit development) | £19198
    Centre for Data Digitisation and Analysis, Queen’s University Belfast (preparation of corrected digitised texts for use in archival text mining – the EPNS in a set schedule of volumes) | £15362
    Centre for e-Research, King’s College London (backup project management, user needs and use case gathering, interviews, dissemination) | £12365
    Amount Requested from JISC | £68054
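
    As a quick arithmetic check on the figures above, the sketch below sums the four allocations and works out the share going to text preparation and annotation. Treating the LTG and CDDA lines together as that share is an assumption about how the split was meant; the short dictionary keys are just labels.

```python
# Allocations in pounds, as listed in the budget above.
budget = {
    "EDINA": 21129,
    "LTG": 19198,
    "CDDA": 15362,
    "CeRch": 12365,
}

total = sum(budget.values())

# Assumption: LTG and CDDA together cover preparation and annotation of text.
text_preparation = budget["LTG"] + budget["CDDA"]
share = text_preparation / total
```

    On that reading the text work comes to £34560 of £68054, a little over half.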

    CHALICE: The Plan of Work

    June 10th, 2010

    DRAFT

    GANTT-like chart showing the interconnection between different work packages and participants in CHALICE – not a very high-quality scan, sorry. When there are shifts and revisions in the workplan, Jo will rub out the pencil markings and scan the chart in again, but clearer this time.

    As far as software development goes we aspire to do a Scrum, though given the resources available it will be more of a Scrum-but. Depending on how many people we can get to Scrum, we may have to compress the development schedule in the centre – spike one week, deliver the next pretty much – then have an extended maintenance and integration period with just one engineer involved.

    The preparation of structured versions of digitised text with markup of extracted entities will be more of a long slog, but perhaps I can ask CDDA and LTG to write something about their methodologies.

    The use case gathering and user engagement parts of the project will build on the form already used in the TextVRE project.



    CHALICE: Team Formation and Community Engagement

    June 10th, 2010

    Institutional and Collective Benefits describes who, at an institutional level, is engaged with the CHALICE project. We have three work packages split across four institutions – the Centre for Data Digitisation and Analysis at Queen’s University Belfast; the Language Technology Group at the School of Informatics, and the EDINA National Datacentre, both at the University of Edinburgh; and the Centre for e-Research at King’s College London.

    The Chalice team page contains more detailed biographical data about the researchers, developers, technicians and project managers involved in putting the project together.

    The community engagement aspect of CHALICE will focus on gathering requirements from the academic community on how a linked data gazetteer would be most useful to historical research projects concerned with different time periods. Semi-structured interviews will be conducted with relevant projects, and the researchers involved will be invited to critically review existing gazetteer services, such as geonames, with a view to identifying how they could get the most out of such a service. This will follow the same principles as, and be based loosely on, the methodology employed by the TEXTvre project. The project will also seek to engage with providers of services and resources. CHALICE will be able to enhance such resources, but also link them together: in particular the project will collaborate with services funded by JISC to gather evidence as to how these services could make use of the gazetteer. A rapid analysis of the information gathered will be prepared, and a report published within six months of the project’s start date.

    When a first iteration of the system is available, we will revisit these projects, and develop brief case studies that illustrate practical instances of how the resource can be used.

    The evidence base thus produced will substantially inform design of the user interface and the scoping and implementation of its functionalities.

    Gathering this information will be the responsibility of project staff at CeRch.

    We would love to be more specific about exactly which archive projects will yield to CHALICE at this point; but a lot will depend both on the spatial focus of the gazetteer, and the investigation and outreach during the course of the project. So we have a half dozen candidates in mind right now, but the detailed conversations and investigations will have to wait some months… see the next post on the project plan describing when and how things will happen.


    CHALICE: Open Licensing

    June 10th, 2010

    Software

    We commit to making source code produced as part of CHALICE available under a free software license – specifically, the GNU Affero General Public License Version 3. This is the license that was suggested to the Unlock service during consultation with OSS Watch, the open source advisory service for UK research.

    GPL is a ShareAlike kind of license, implying that if someone adapts and extends the CHALICE code for use in a project or service, they should make their enhancements available to others. The Affero flavour of GPLv3 invokes the ShareAlike clause if the software is used over a network.

    Data

    We plan to use the Open Database License from Open Data Commons to publish the data structures extracted from EPNS – and other sources where we have the freedom to do this. ODbL is a ShareAlike license for data – the OpenStreetMap project is moving to use this license, which is especially relevant to geographic factual data.

    As far as we know this will be the first time ODbL has been used for a research project of this kind – if there are other examples, would love to hear about them. We’ll seek advice from JISC Legal and from the Edinburgh Research and Innovation office legal service, as to the applicability of ODbL to research data, just to be sure.


    CHALICE: Risk Analysis and Success Plan

    June 10th, 2010

    This post should attempt to forecast both the risks or hurdles that might arise as the project progresses, and how the project will manage success if its outputs become extremely popular.

    Risk Analysis

    Risk | Probability (1-5) | Severity (1-5) | Score (P x S) | Action to Prevent/Manage Risk
    Staffing | 2 | 3 | 6 | Secure by contract lock-in. Ensuring knowledge is continually disseminated amongst those involved. Rapid project life-cycle reduces likelihood of ill effects due to staff departure.
    Community buy-in | 2 | 3 | 6 | Engagement with community via existing digital humanities network and geospatial semantic web communities; consultation with researchers on quality of work; dissemination through geographic information retrieval community workshops.
    Technical | 2 | 3 | 6 | Testing quality of extracted names against historic census data supplied by UKDA. Evaluating other techniques to find and resolve names.
    Software | 2 | 2 | 4 | Continue consultation with JISC OSSWatch. Build on experiences and outputs from the GeoDigRef project; maintenance is accounted for as support work on the Unlock service.
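
    The Score column is simply probability multiplied by severity. A small sketch that recomputes it from the probability and severity values in the table and ranks the risks; the tuple layout is just one convenient way to hold the rows.

```python
# (risk, probability 1-5, severity 1-5), taken from the table above.
risks = [
    ("Staffing", 2, 3),
    ("Community buy-in", 2, 3),
    ("Technical", 2, 3),
    ("Software", 2, 2),
]

# Score = P x S for each risk.
scores = {name: probability * severity for name, probability, severity in risks}

# Highest score first; ties keep their table order because sorted() is stable.
ranked = sorted(scores, key=scores.get, reverse=True)
```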

    CHALICE: Institutional and Collective Benefits

    June 10th, 2010

    DRAFT – needs CeRch+CDDA detail plus specific end user engagements, though we can go on about the latter topic in later posts.

    At this point we should talk a bit more about who is involved in CHALICE and what we’re hoping to gain from it.

    The project is led by the EDINA National Datacentre at the University of Edinburgh. EDINA is almost entirely supported by JISC, and runs the flagship Digimap service which provides UK HE/FE access to national mapping data for the UK.

    EDINA also maintains the Unlock service, which provides search across different placename gazetteers, and extraction of placenames from text using different gazetteers to “ground” references to place at definite locations. Unlock started life as the GeoCrossWalk project, and it was our involvement in the “Embedding GeoCrossWalk” project that sparked this interest in using text mining techniques to generate placename authority files from historic texts.

    The Language Technology Group at the School of Informatics in Edinburgh were partners in this, and have moved on with us to CHALICE. They created the Edinburgh Geoparser that sits behind the Unlock Text web service. Their text mining magic extends much deeper than we’ve really made use of yet, as far as being able to extract events and relations from text, as well as references to people and concepts.

    CHALICE should be a fun challenge in an as yet under-explored research area of historic text mining – tuning grammar rules to do markup that can then be used to train machine learning recognisers, and comparing the results. Through their work with CDDA we hope to gain insight into the best balance between manual annotation and manually-corrected automatic annotation, in terms of cost of work, cost savings for others’ future work, and benefits of the different approaches to named entity recognition.

    CeRch

    CDDA



    CHALICE: Aims, Objectives, Outcomes

    June 10th, 2010

    Hello and welcome to the CHALICE project blog. We’re re-stating our project’s aims and means in seven blog posts, and this is the first.

    What are we trying to achieve with CHALICE?

    We want to help create an historic placename gazetteer for the UK, publish it as Linked Data and link it to other widely-used sources of placename reference information on the semantic web – starting with geonames.org, as that is linked to many other sources.

    geonames

    We’ll use Named Entity Recognition techniques to extract placename and timescale reference information from texts, using digitised text from the English Place Name Survey.

    Once we’ve bootstrapped our historic gazetteer, it can be used to greatly improve the quality of future historic text mining efforts.

    What else do we hope will result from CHALICE?

    We hope to improve the quality and viability of semi-structured data extraction from non-narrative text, and gain insight into how this process can be more easily repeated by others. We’ll experiment with both grammar-based and machine-learning based named entity recognition systems to see if there’s benefit in combining both approaches, and whether we can identify specific strengths and weaknesses.
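
    One way the comparison between the two kinds of recogniser might be scored is with precision and recall against a hand-checked set of names. The sketch below is only an illustration: the gold set and both recognisers' outputs are invented, and combining by union or intersection is just the simplest possible scheme.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted names against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example data: hand-checked names and two recognisers' outputs.
gold = {"Walton", "Shrewsbury", "Ludlow", "Wenlock"}
grammar_hits = {"Walton", "Shrewsbury", "Hundred"}
ml_hits = {"Walton", "Ludlow", "Wenlock", "Survey"}

results = {
    "grammar": precision_recall(grammar_hits, gold),
    "ml": precision_recall(ml_hits, gold),
    # Accept a name if either recogniser found it.
    "union": precision_recall(grammar_hits | ml_hits, gold),
    # Accept a name only if both recognisers found it.
    "intersection": precision_recall(grammar_hits & ml_hits, gold),
}
```

    On this toy data the union finds every gold name at the cost of more false hits, while the intersection makes no false hits but misses most names; real results would of course depend on the texts.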

    We plan to engage with projects and people outwith the usual scope of geographic information specialists; people who are holding archives rich with implicit structured data, or people for whom a geographic means of exploring their archives could hold a lot of benefit.

    What will we leave behind?

    • A gazetteer with dense historic coverage (in theory back to the Domesday era) for all recorded placenames and variants within a few key areas.
    • A Linked Data version of this gazetteer
    • A simple web interface to annotate and correct the gazetteer data and semi-automatically created links to other entities on the semantic web
    • A short series of case studies demonstrating use of the gazetteer and its potential application to other, similar archives and services
    • Search through the gazetteer data using the JISC-supported Unlock Places service
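
    To give a flavour of the Linked Data output, here is a minimal sketch that writes a link from a gazetteer entry to a geonames entry as N-Triples. Both URIs are placeholders (the example.org path and the all-zero geonames id are invented), and using owl:sameAs as the linking predicate is an assumption; a looser predicate may suit historic names better.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def link_triple(gazetteer_uri, geonames_uri):
    """One N-Triples line stating that the two URIs identify the same place."""
    return "<{}> <{}> <{}> .".format(gazetteer_uri, OWL_SAME_AS, geonames_uri)


def label_triple(gazetteer_uri, name):
    """One N-Triples line giving the place a plain-text label."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return '<{}> <{}> "{}" .'.format(gazetteer_uri, RDFS_LABEL, escaped)


# Placeholder URIs, invented for illustration only.
place = "http://example.org/chalice/place/walton-1"
geonames = "http://sws.geonames.org/0000000/"

lines = [label_triple(place, "Walton"), link_triple(place, geonames)]
document = "\n".join(lines) + "\n"
```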