Sunday, September 30, 2012

Introducing "CLAVIN" (Cartographic Location And Vicinity INdexer)

What is a Geotagger?


If your work involves finding meaning in unstructured text, you may have, at one time or another, worked with semantic technologies like entity extraction.  An Entity Extractor promotes words in text to concepts; this is typically realized in the form of entity tagging, where a category from an ontology is associated with a word or phrase (e.g. PERSON, PLACE, TIME, ORGANIZATION).  Once entities have been "tagged", the next step is to "resolve" them to a global concept or entity (entity resolution).  For instance, we not only want to know that "Barack Obama" is a PERSON, we also want any reference to Barack Obama to point to one "identifier".  This way, we can associate an entity with all of the documents it has occurred in, allowing us to do things like build a global graph of concepts or perform faceted searches against those concepts.
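
To make the difference between tagging and resolving concrete, here is a minimal sketch in Java.  The class name, the alias table, and the identifier strings are all hypothetical, invented for this illustration; a real resolver would consult a knowledge base rather than a hard-coded map.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Toy entity resolver: maps surface forms (aliases) to one canonical identifier. */
class EntityResolverSketch {

    // Hypothetical alias table; the identifiers are made up for this example.
    private final Map<String, String> aliasToId = new HashMap<>();

    void addAlias(String surfaceForm, String canonicalId) {
        aliasToId.put(normalize(surfaceForm), canonicalId);
    }

    /** Returns the canonical identifier for a tagged mention, if one is known. */
    Optional<String> resolve(String mention) {
        return Optional.ofNullable(aliasToId.get(normalize(mention)));
    }

    // Builds a case- and whitespace-insensitive lookup key.
    private static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        EntityResolverSketch resolver = new EntityResolverSketch();
        resolver.addAlias("Barack Obama", "person:barack-obama");
        resolver.addAlias("President Obama", "person:barack-obama");

        // Two different mentions resolve to the same identifier.
        System.out.println(resolver.resolve("barack  obama"));
        System.out.println(resolver.resolve("President Obama"));
        // A mention with no known alias stays unresolved.
        System.out.println(resolver.resolve("Springfield"));
    }
}
```

Once every mention carries the same identifier, grouping documents by entity or faceting on it becomes a plain key lookup.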

Perhaps one of the most important forms of entity resolution is associating locations with geographic coordinates, commonly known as Geotagging (which may also encompass the entity resolution step).  For instance, we not only want to know that New York City is a LOCATION, but also that its center latitude and longitude are 40.7142° N, 74.0064° W, that the location is in the "New York State" administrative district, and that it is in the United States of America.  More sophisticated resolution techniques might even include the polygonal boundaries of the location, and an association to a semantic graph of concepts related to the city and its history.
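
Coordinates written with hemisphere letters, like the ones above, usually end up stored as signed decimal degrees, which is the form you'll see in the console output further down.  The sketch below shows that conversion; the class and method names are hypothetical and not part of any library, and it only handles plain decimal degrees (no minutes or seconds).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Converts hemisphere notation (a number, a degree sign, and N/S/E/W) to signed decimal degrees. */
class CoordinateSketch {

    // A decimal number, an optional degree sign, then a hemisphere letter.
    private static final Pattern COORD =
        Pattern.compile("^\\s*(\\d+(?:\\.\\d+)?)\\s*\u00B0?\\s*([NSEW])\\s*$");

    static double toSignedDegrees(String text) {
        Matcher m = COORD.matcher(text.toUpperCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized coordinate: " + text);
        }
        double degrees = Double.parseDouble(m.group(1));
        char hemisphere = m.group(2).charAt(0);
        // South latitudes and west longitudes are negative by convention.
        return (hemisphere == 'S' || hemisphere == 'W') ? -degrees : degrees;
    }

    public static void main(String[] args) {
        // New York City's center, as quoted in the text.
        double latitude = toSignedDegrees("40.7142\u00B0 N");
        double longitude = toSignedDegrees("74.0064\u00B0 W");
        System.out.println(latitude + ", " + longitude);
    }
}
```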

The Problem


For years the Geotagging market has been dominated by a very small number of commercial products; many entity extractors can identify locations, but few actually resolve that location to a fixed point in space.   As far as I'm aware, the most widely used is MetaCarta (http://www.metacarta.com/), one I have personally used on a number of projects.  MetaCarta is really good in terms of accuracy and features; in fact, most systems I've seen MetaCarta deployed in only use about 25% of its features.  The problem with MetaCarta is that it is expensive (I don't have figures, I've only seen my customers cringe when talking about price).  

Yahoo also has a Geotagger in the form of a web service offering called Placemaker (http://developer.yahoo.com/geo/placemaker/).   For us, Placemaker has never been a viable solution since it can't be deployed on an internal network, and doesn't fit well with our architectural use cases.  Placemaker also doesn't seem to be well tuned to our corpora, meaning it has lower than ideal precision in extraction. 

Outside of the commercial space, there are no viable open source alternatives.  A quick Google Search for open source Geotaggers will probably return Geodict (https://github.com/petewarden/geodict), a GitHub project by Pete Warden.  Pete combines a Gazetteer (geospatial dictionary) with some simple rules for locating potential places in a sentence (presence of key words like "in" or "near") in a brute force approach to solving the problem.  Unfortunately, Geodict's approach doesn't take semantic meaning of the sentence into account when locating potential "place" words, and it doesn't perform the resolution step (differentiating locations by context: the "Springfield" problem).
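
To show what a keyword-triggered gazetteer lookup looks like, and where it falls short, here is a minimal Java sketch.  This is not Geodict's code; the trigger words, the three-entry gazetteer, and all names are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy keyword-triggered gazetteer lookup (an illustration, not Geodict's actual code). */
class KeywordGazetteerSketch {

    // Hypothetical trigger words and a tiny in-memory gazetteer.
    private static final Set<String> TRIGGERS = new HashSet<>(Arrays.asList("in", "near"));
    private static final Set<String> GAZETTEER =
        new HashSet<>(Arrays.asList("chicago", "springfield", "west virginia"));
    private static final int MAX_NAME_TOKENS = 2;

    static List<String> findPlaces(String sentence) {
        // Lowercase, then split on anything that is not a letter or an apostrophe.
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("[^a-z']+");
        List<String> places = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!TRIGGERS.contains(tokens[i])) {
                continue;
            }
            // After a trigger word, try the longest candidate name first.
            for (int len = MAX_NAME_TOKENS; len >= 1; len--) {
                int end = i + 1 + len;
                if (end > tokens.length) {
                    continue;
                }
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i + 1, end));
                if (GAZETTEER.contains(candidate)) {
                    places.add(candidate);
                    break;
                }
            }
        }
        return places;
    }

    public static void main(String[] args) {
        System.out.println(findPlaces(
            "I visited the Sears Tower in Chicago and found attractions in Springfield "
            + "before driving to West Virginia."));
    }
}
```

The sketch finds "chicago" and "springfield" but misses "west virginia", because "to" is not one of the trigger words.  It also has no way of saying which Springfield was meant; a string match is all it returns.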

Introducing CLAVIN


Necessity is the mother of invention.  - Unknown

Early this year, our company found itself desperately in need of a Geotagger.  Our enterprise search application, built around geospatial faceting, lacked the geospatial entities we needed to do the faceting.  We had documents, just no geospatial tagging.  Architecturally, an upstream component in the ETL pipeline was supposed to provide this capability (using MetaCarta), but for one reason or another, the team producing that capability was not going to be able to make the delivery timeframe (I should add that this was no fault of the MetaCarta product).

One of Berico Technologies' data scientists, Charlie Greenbacker, was working on another, unrelated problem involving the resolution of country names across a collection of structured datasets (Excel and CSV documents) that we were "mashing together" so we could do analysis across datasets. Recognizing that both problems had a similar solution, Charlie began work on a homegrown geotagger that eventually became CLAVIN: Cartographic Location And Vicinity INdexer (http://clavin.bericotechnologies.com/).

What is CLAVIN?


It's a geotagger (and resolver).  Architecturally, CLAVIN is extremely simple.  CLAVIN is written in Java, and it can be bundled in a Java Web Application as a web service, allowing any application to access it (as seen in our CLAVIN-Web demonstration).  

CLAVIN has a simple workflow.  An EntityTagger is used to find unresolved (string) PEOPLE, ORGANIZATIONS, and LOCATIONS in a string (multiline, complex quotation, etc.).  Once those entities are extracted from the text, the locations are passed to a LocationResolver that returns the most confident match (ResolvedLocation) for each location in the set.

The default EntityTagger implementation in CLAVIN is built on the Apache OpenNLP framework.  Apache OpenNLP is the most license-friendly framework we could bundle with CLAVIN; the most accurate EntityTagger we have implemented is one utilizing the Stanford NER, which we don't provide outside of service contracts since it's GPL.

Our default LocationResolver uses a custom Apache Lucene index of the GeoNames Gazetteer (http://www.geonames.org/).  The LocationResolver includes tunable algorithms for performing fuzzy and probabilistic matching of locations.  Since you sacrifice performance for accuracy and vice versa, the LocationResolver is a great abstraction for a number of strategies you may need to employ in your system.  Another benefit of CLAVIN is that we maintain the resolver index (one less thing you need to worry about).
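
To give a feel for the accuracy trade-off in fuzzy matching, here is a minimal sketch in plain Java.  It does not use Lucene or CLAVIN's API; it assumes a tiny in-memory list of names and uses Levenshtein edit distance with a hypothetical maxEdits tolerance parameter.  All class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy fuzzy gazetteer lookup using Levenshtein edit distance (no Lucene involved). */
class FuzzyLookupSketch {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns gazetteer names within maxEdits of the query, closest first. */
    static List<String> lookup(String query, List<String> gazetteer, int maxEdits) {
        final String q = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String name : gazetteer) {
            if (editDistance(q, name.toLowerCase(Locale.ROOT)) <= maxEdits) {
                hits.add(name);
            }
        }
        hits.sort(Comparator.comparingInt(
            (String name) -> editDistance(q, name.toLowerCase(Locale.ROOT))));
        return hits;
    }

    public static void main(String[] args) {
        List<String> gazetteer =
            Arrays.asList("Chicago", "Springfield", "Indiana", "India", "Washington");
        // A misspelling is recovered once one edit is tolerated.
        System.out.println(lookup("Sprngfield", gazetteer, 1));
        // A looser tolerance also lets near-miss names through.
        System.out.println(lookup("Indiana", gazetteer, 0));
        System.out.println(lookup("Indiana", gazetteer, 2));
    }
}
```

With a tolerance of zero, "Indiana" matches only itself; with a tolerance of two, "India" comes back as a candidate as well.  A looser setting catches more misspellings but admits more wrong candidates (and costs more comparisons), which is the kind of knob a resolver has to expose.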

Code Example


This is how simple it is to use CLAVIN under the current API.  Keep in mind there will be some changes before its official release in mid-October.
// Path to the serialized Stanford NER classifier model
String classifierModelPath = 
  "/location/of/classifier/all.3class.distsim.crf.ser.gz";
    
// Needed by Stanford NER Implementation
SequenceClassifierProvider classifierProvider 
  = new ExternalSequenceClassiferProvider(classifierModelPath);

// Initialize the tagger.  (Sorry, I'm demonstrating the Stanford NER
// tagger at the moment; will update with OpenNLP ASAP.)
EntityTagger entityTagger = new NerdEntityTagger(classifierProvider);

// Location of the Location Resolver Index
String locationResolverIndexPath = "/location/of/index/IndexDirectory";

// Instantiate the Location Resolver
LocationResolver locationResolver 
  = new LocationResolver(
    new File(locationResolverIndexPath), 3, 5);

// Nothing magic here, just a couple of sentences.
String text = getText();

// Tag the text
TaggedDocument taggedDocument = entityTagger.tagDocument(text);

System.out.println(String.format("%s locations found",
  taggedDocument.getLocations().size()));

// Resolve the locations from the extracted locations
List<ResolvedLocation> resolvedLocations = 
    locationResolver.resolveLocations(taggedDocument.getLocations());

for(ResolvedLocation resolvedLocation : resolvedLocations){
  
  System.out.println(
    String.format("%s (%s, %s)", 
      resolvedLocation.matchedName, 
      resolvedLocation.geoname.latitude, 
      resolvedLocation.geoname.longitude));
}

The getText() method simply returns the following string:


I visited the Sears Tower in Chicago only to find out there were exciting attractions in Springfield.  After Springfield, Chuck and I drove east through Indiana to West Virginia, stopping in Harper's Ferry.  We finally made it to our destination in Washington, DC on Tuesday.


And we get the following results on the console:


7 locations found
Chicago (41.85003, -87.65005)
Springfield (39.80172, -89.64371)
Springfield (39.80172, -89.64371)
Indiana (40.00032, -86.25027)
West Virginia (38.50038, -80.50009)
Washington (38.89511, -77.03637)
DC (38.91706, -77.00025)



More Information


If you want to know more about CLAVIN, Charlie will be speaking at GEOINT 2012 in Orlando, FL (October 8-11) and hopefully at Strata Santa Clara next year (a proposal video was posted to YouTube).



If you have any other questions, just leave me a comment.  

3 comments:

  1. Great background info, Richard! With the latest version of CLAVIN, the API is now even simpler: simply instantiate the GeoParser class, then call the parse() method on your text to get a list of ResolvedLocations.

  2. Hi all,
    can anyone please tell me whether CLAVIN (using the Geonames.org gazetteer) supports resolving full or partial addresses? The example in this article resolved only cities. Is it possible to store e.g. data from openstreetmap.org in CLAVIN's Lucene database and successfully resolve addresses? (Has anyone tried it?)
    Thank you very much for any advice and recommendations.
    Michal

    Replies
    1. Hi Michal, it sounds like you're really looking for a geocoder (http://en.wikipedia.org/wiki/Geocoding) rather than a geoparser (http://en.wikipedia.org/wiki/Geoparsing) like CLAVIN. There are a number of geocoding software packages available, and Google even offers a geocoding API: https://developers.google.com/maps/documentation/geocoding/
