Thursday, April 19, 2012

Practical Apache Pig Recipes - Round 1


Apache Pig is an incredibly productive framework for manipulating datasets on the Hadoop platform.  Coupled with a couple of third-party libraries (PiggyBank and DataFu), Pig has helped me break the habit of writing custom Java apps for what should be trivial tasks with structured data.  In this post, I'm going to share a few useful recipes that demonstrate the power of the platform.

Installing and Running Pig without a Hadoop Cluster


These instructions are for a *NIX-based operating system with Java installed.

1.  Download Pig from http://pig.apache.org/releases.html and place the archive in a convenient directory.
2.  Extract the archive:
# If this is a later release, please change the version number.
tar xzvf pig-0.9.2.tar.gz
3.  Create an alias for Pig in local mode (i.e., no Hadoop cluster):
# Assuming you are still in the same directory you extracted pig into.
# If you use Pig often, you should set this in your BASH profile.
alias pig="`pwd`/pig-0.9.2/bin/pig -x local"
4a.  Start "Grunt", the Pig Shell:
pig
4b.  Execute a Pig script:
pig {scriptname}.pig
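
Once the alias is in place, a quick sanity check in Grunt confirms that local mode is working; 'sample.csv' below is just a placeholder for any small CSV you have on hand:

grunt> lines = LOAD 'sample.csv' USING PigStorage(',') AS (a:chararray, b:chararray);
grunt> DUMP lines;
grunt> quit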

Stripping, Adding and Reordering Columns in a CSV File


Sometimes you will have to deal with datasets that have columns you don't care about, or columns that are out of order.  This recipe shows how to fix both in Pig with only three lines of Pig Latin.

Princeton University has an awesome website with sample CSV datasets that include the latitudes and longitudes of chain stores (McDonald's, Starbucks, Walmart, etc.) in America.

The schema of the CSV file is:  longitude, latitude, store name, store address.

I have an external application that requires the coordinates to be in the form latitude, longitude instead of the longitude, latitude order they are in now.  The store name is also a little verbose (usually the name of the chain followed by a number); since I don't need the individual store name, but still need a normalized identifier, I'm going to drop the store name column and add a constant field with the normalized value.
Download "strip_add_remove.pig"


The console produces the following output at the end of the job:

...Truncated...
(41.5667,-70.6227,Starbucks,"28 Davis Straights/ Route 28; Falmouth)
(41.61734,-70.49054,Starbucks,"6 Market Street; Mashpee)
(43.50562,-70.43833,Starbucks,"509 Main Street; Saco)
(43.63285,-70.3631,Starbucks,"200 Running Hill Road; South Portland)
(43.63416,-70.33794,Starbucks,"364 Maine Mall Road N135; South Portland)
(43.65202,-70.3097,Starbucks,"1001 Westbrook St; Portland)
(43.65412,-70.26322,Starbucks,"594 Congress Street; Portland)
(43.65762,-70.25521,Starbucks,"176 Middle St.; Portland)
(44.1207,-70.23047,Starbucks,"35 Mount Auburn Avenue; Auburn)
(43.85869,-70.1018,Starbucks,"49 Main Street; Freeport)
(43.9374,-69.9809,Starbucks,"125 Topsham Fair Mall Rd; Topsham)
(43.90637,-69.91558,Starbucks,"10 Gurnet Drive; Brunswick)
(44.5629,-69.64249,Starbucks,"2 Waterville Commons Drive; Waterville)
(44.83537,-68.74344,Starbucks,"38 Bangor Mall Blvd; Bangor)
(44.83959,-68.7426,Starbucks,"60 Longview Dr; Bangor)

The magic is in the FOREACH statement; all we did was create a new "projection" of the data by reordering lat and lon, providing the normalized field 'Starbucks', and omitting the original name field.


Creating a Bounding Box Filter


A bounding box is really simple when you think about it.  It's just the top- and bottom-most latitudes and the left- and right-most longitudes.  Here, we will use Pig to filter store locations to those in Northern Virginia (latitudes 38 to 39, longitudes -78 to -77), store those locations, and count the results.
Download "bounding_box.pig"

Here are the results from the Pig job:

...Truncated...
(38.8997,-77.0262,Starbucks,"1000 H St NW; Washington)
(38.96875,-77.0261,Starbucks,"6500 Piney Branch Rd NW; Washington)
(38.99644,-77.02586,Starbucks,"915 Ellsworth Drive C-19; Silver Spring)
(38.89848,-77.02389,Starbucks,"701 9th Street NW; Washington)
(38.90243,-77.02389,Starbucks,"999 9th St NW; Washington)
(38.89402,-77.02191,Starbucks,"325 Seventh Street NW Suite 100; Washington)
(38.89977,-77.02191,Starbucks,"800 7th Street NW Suite 305; Washington)
(38.91223,-77.02191,Starbucks,"443-C 7th Street)
(38.91921,-77.02191,Starbucks,"2225 Georgia Avenue)
(38.72444,-77.01967,Starbucks,"952 E Swan Creek Rd; Fort Washington)
(38.9094,-77.01791,Starbucks,"1 Avaition Cir; Washington)
(38.8969,-77.00672,Starbucks,"40 Massachusetts Avenue Amtrak Baggage Area; Washington)
(38.88736,-77.00297,Starbucks,"237 Pennsylvania Ave SE; Washington)

...Hadoop output omitted...

(158)

The FILTER command creates a new relation from the dataset, keeping only those records within the bounding box we specified (pretty simple, huh?).


Calculate the Distance Between Locations


In this recipe, we will calculate the distance between a target location (latitude and longitude supplied on the command line) and every location in the dataset.  To perform this calculation, we will use a User Defined Function from the DataFu library from LinkedIn.
Download "distance.pig"

To run this example, you will need to supply a center point by setting the $USERLAT and $USERLON parameters:
# 38.8977, -77.0366 is the White House, Washington, D.C.
pig -p USERLAT=38.8977 -p USERLON=-77.0366  distance.pig
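
Under the hood, the script registers the DataFu JAR, defines the Haversine distance UDF, and computes a distance for every row.  Here is a sketch of the approach (the JAR version, input file name, and sort order are assumptions based on the output shown below):

-- Register DataFu and give the UDF a short name.
REGISTER 'datafu-0.0.4.jar';
DEFINE HaversineDistInMiles datafu.pig.geo.HaversineDistInMiles();

stores = LOAD 'starbucks.csv' USING PigStorage(',') AS (lon:double, lat:double, name:chararray, address:chararray);

-- Distance in miles from the supplied center point to each store.
with_distance = FOREACH stores GENERATE lat, lon, 'Starbucks', HaversineDistInMiles($USERLAT, $USERLON, lat, lon) AS miles, address;

-- Keep stores within 10 miles of the center point, nearest first.
nearby = FILTER with_distance BY miles <= 10.0;
sorted = ORDER nearby BY miles ASC;
DUMP sorted;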

The results of our distance calculation and subsequent filter (locations within 10 miles):
...Truncated...
(38.99644,-77.02586,Starbucks,6.8466316029936705,"915 Ellsworth Drive C-19; Silver Spring)
(38.98825,-77.09609,Starbucks,7.02586063143866,"7700 Norfolk Ave; Bethesda)
(38.80397,-76.9843,Starbucks,7.061133827376486,"6171-A Oxon Hill Road; Oxon Hill)
(38.9953,-77.07719,Starbucks,7.0874660476019535,"8542 Connecticut Avenue; Chevy Chase)
(38.83643,-77.15642,Starbucks,7.71170369522629,"6365 Columbia Pike; Falls Church)
(38.93301,-77.17824,Starbucks,7.995811205809961,"1438 Chain Bridge Road; McLean)
(38.89369,-77.18887,Starbucks,8.192941429466497,"1218 West Broad Street; Falls Church)
(39.02039,-77.0126,Starbucks,8.574554264635223,"10103 Colesville Rd; Silver Spring)
(38.92895,-77.19746,Starbucks,8.913496579837988,"1961 Chain Bridge Rd; McLean)
(38.90366,-77.20421,Starbucks,9.02192730760362,"7501 H Leesburg Pike; Falls Church)
(38.7715,-77.08137,Starbucks,9.046367728152426,"6754 Richmond Hwy Unit 4; Alexandria)
(39.02905,-77.00707,Starbucks,9.213012865407007,"10731 Colesville Road; Silver Spring)
(38.99768,-76.90967,Starbucks,9.707743952951674,"7541 Greenbelt Rd. Space 16; Greenbelt)
(39.03881,-77.05704,Starbucks,9.811380298967933,"2800 W. University Blvd. E; Wheaton)
(39.03927,-77.05418,Starbucks,9.827010981193256,"11160 Veirs Mill Road 139; Wheaton)
(39.01629,-76.93112,Starbucks,9.962700766542332,"4750 Cherry Hill Road; College Park)
(38.83159,-77.20134,Starbucks,9.970542690585502,"7414 Little River Turnpike; Annandale)

This recipe demonstrates two important features of Pig.  First, we register a JAR and a User Defined Function with Pig so the function can be called in one of the projections.  Second, we use command-line arguments to supply dynamic values to the script.