Saturday, January 22, 2011

Calculating Similarity (Part 2): Jaccard, Sørensen and Jaro-Winkler Similarity

In my last post on calculating string similarity, I focused on one algorithm (Cosine Similarity) and some of the concepts involved in making its calculation.  In this post, I want to introduce three more ways of calculating similarity.  The three methods I present today are all a little less intensive to calculate than cosine similarity, and one in particular is probably more accurate.

The three algorithms I will be presenting today are:
  • Jaccard Similarity
  • Sørensen Similarity
  • Jaro-Winkler Similarity

I'm going to present the implementation of each of these algorithms, and then at the end of the post, the results they return for a couple example sets of strings. Let's begin in order, starting with Jaccard Similarity.

Jaccard Similarity is very easy to calculate and actually required no extra programming on my part to implements (all the utility functions required were created for the Cosine Similarity implementation).  The equation is sometimes called the Jaccard Distance, or Jaccard Similarity Coefficient, originating from the late 19th, early 20th century botantist Paul Jaccard (Paul Jaccard, 2011).

The equation is remarkably simple (length of the intersect divided by the length of the union):

(Jaccard index, 2011).

Here is my implementation of the Jaccard Similarity Coefficient:

/**
  * Find the Jaccard Similarity of two strings
  * @param stringOne the first string to compare
  * @param stringTwo the second string to compare
  * @return the Jaccard Similarity (intersect(A,B) / union(A, B))
  */
 @Override
 public double calculate(String stringOne, String stringTwo) {
   return (double) intersect(stringOne, stringTwo).size() /
          (double) union(stringOne, stringTwo).size();
 }

The intersect and union functions are both defined in the com.berico.similarity.CharacterVectorUtils class, which you can, along with the rest of the Eclipse project, at the bottom of this post.

The next similarity algorithm is Sørensen Similarity (or Sørensen index, or Sørensen Similarity Coefficient). The equation comes from Thorvald Sørensen, a turn-of-the-century Danish botanist (Thorvald Sørensen 2011). Like the Jaccard index, Sørensen Similarity is very easy to calculate.

The equation for Sørensen Similarity:
(Sørensen similarity index, 2011)
In the equation, 'A' represents the size of the first sample, 'B' the size of a second, and 'C' the size of the intersect between sample one and two (I don't know why someone posted it in Wikipedia that way, as opposed to [ 2 * |A ∩ B| / |A| + |B| ]).  Here is my implementation of the Sørensen similarity index:

/**
 * Calculate the Sorensen Similarity of two strings.
 * Equation: (2 * intersect(A, B)) / (|A| + |B|)
 * @param stringOne First String
 * @param stringTwo Second String
 * @return The Sorensen similarity of two strings.
 */
 @Override
 public double calculate(String stringOne, String stringTwo) {
   return  (double) (2 * intersect(stringOne, stringTwo).size()) /
           (double) (stringOne.length() + stringTwo.length());
 }

Once again, a very simple algorithm to implement. The next algorithm is not so simple. The Jaro-Winkler Similarity index is the first algorithm I will demonstrate in which the order of occurrence is an essential determination of similarity. In a way, Jaro-Winkler similarity is very similar to the ubiquitous Levenshtein Distance, which is used in many Natural Language Processing applications like Lucene and PostgresSQL (Levenshtein distance, 2011). In many domains, the order in which properties occur are just as important as their existence; this is certainly true for strings. Because of this, Jaro-Winkler is a lot more accurate than Cosine Similarity.

The Jaro-Winkler Similarity Coefficient is an addition to the Jaro Distance; I don't know if Winkler worked in tandem with Jaro, but Winkler added an important component to the Jaro distance algorithm that weighted or penalized strings based on their similarity at the beginning of the string (the prefix). The algorithm for the Jaro-Winkler distance is a little complicated, in part because it requires iteration over the smallest string to determine if a non-matching character fits within a predetermined "window" of the other string. For instance, the strings "martha" and "marhta" are considered a complete match because the transposed "th" and "ht" are within 2 characters of each other.

Finding matches is probably the most complicated part of this algorithm, but fortunately for you, I have done all of the hard work:

/**
  * Instead of making two seperate functions for matching
  * and transposes (which would be terrible since you
  * find the transpose while doing matching), I created this
  * little class to hold the results of both.
  */
 public static class MatchResults {
   public int numberOfMatches = 0;
   public int numberOfTransposes = 0;
 }

/**
  * Find the all of the matching and transposed characters in 
  * two strings
  * @param stringOne First String
  * @param stringTwo Second String
  * @return A Match Result with both the number of matches and
  * number of transposed characters
  */
 public static MatchResults determineMatchesAndTransposes(
                  String stringOne, String stringTwo){

   //Create the match result instance
   MatchResults matchResults = new MatchResults();
   //Find the matching window (how far left and right to
   //look for matches)
   int window = matchingWindow(stringOne, stringTwo);
   //We need to find the shortest and longest character string
   //because we iterate over the shortest
   char[] shortest, longest;
   //If string one is less than or equal to string two
   if(stringOne.length() <= stringTwo.length()){
     //use string one as the shortest
     shortest = stringOne.toCharArray();
     longest = stringTwo.toCharArray();
   } else {
     //otherwise use string two as the shortest
     shortest = stringTwo.toCharArray();
     longest = stringOne.toCharArray();
   }
   //we need to find the number of times we find a match
   //out of position (ex: the 4th character of string one
   //matches the 5th character of string two
   int matchedOutOfPosition = 0; 
   //Iterate over the shortest string
   for(int i = 0; i < shortest.length; i++){
     //If we have an in-position match
     if(shortest[i] == longest[i]){
       //increment the number of matches
       matchResults.numberOfMatches++;
       //go to the next iteration
       continue;
     }
     //Set the boundary of how far back we
     //can look at the longest string
     //given the string's size and the
     //window size
     int backwardBoundary = 
       ((i - window) < 0)? 0 : i - window;
     //Set the boundary of how far we
     //can look ahead at the longest string
     //given the string's size and the
     //window size
     int forwardBoundary = 
       ((i + window) > (longest.length - 1))? 
          longest.length - 1 : i + window;
     //Starting at the back-most point, and
     //moving to the forward-most point of the
     //longest string, iterate looking for matches
     //of our current character on the shortest 
     //string
     for(int b = backwardBoundary; 
             b <= forwardBoundary; 
             b++){
       //If theres a match within our window
       if(longest[b] == shortest[i]){
         //increment the number of matches
         matchResults.numberOfMatches++;
         //but note that this is out of
         //position
         matchedOutOfPosition++;
         //break out of this inner loop if we
         //aren't on the last element
         break;
       }
     }
   }
   //Determine the number of transposes by halving
   //the number of out-of-position matches.
   //This works because if we had two strings:
   //"martha" and "marhta", the 'th' and 'ht' are
   //transposed, but would be counted twice in the
   //process above.
   matchResults.numberOfTransposes = matchedOutOfPosition / 2;
   //return the match result
   return matchResults;
 }

Transposed characters in the Jaro-Winkler algorithm, however, do not get the full weight that a 1-for-1 match would. The penalty for transposed characters is a part of the Jaro Distance algorithm provided below:
(Jaro-Winkler distance, 2011)
dj = Jaro Distance
m = number of matching characters
t = number of transposed characters
|s1| = length of the first string
|s2| = length of the second string

The penalty for transposed characters can be seen by the (m - t) / m factor.  The more transposes found between the two strings, the smaller the overall matching weight.

/**
 * Determine the Jaro Distance.  Winkler stole some of Jaro's
 * thunder by adding some more math onto his algorithm and getting
 * equal parts credit!  However, we still need Jaro's Distance
 * to get the Jaro-Winkler Distance.
 * Equation: 1/3 * ((|A| / m) + (|B| / m) + ((m - t) / m))
 * Where: |A| = length of first string
 *        |B| = length of second string
 *         m  = number of matches
 *         t  = number of transposes
 * @param numMatches Number of matches
 * @param numTransposes Number of transposes
 * @param stringOneLength Length of String one
 * @param stringTwoLength Length of String two
 * @return Jaro Distance
 */
 public static double jaroDistance(
       int numMatches, int numTransposes, 
       int stringOneLength, int stringTwoLength){
   
   //I hate Java's facility for math.  
   //We have to cast these int's as doubles to
   //be able to properly retrieve the decimal result
   double third = 1d / 3d;
   // (|A| / m)
   double stringOneNorm = 
      (double)numMatches / (double)stringOneLength;
   // (|B| / m)
   double stringTwoNorm = 
      (double)numMatches / (double)stringTwoLength;
   // ((m - t) / m)
   double matchTransNorm = 
      ((double)numMatches - (double)numTransposes) / (double)numMatches;
   // 1/3 * ((|A| / m) + (|B| / m) + ((m - t) / m))
   return third * (stringOneNorm + stringTwoNorm + matchTransNorm);
 }

Another important equation to show is how we determine the matching window.  The matching window is how far to the left and right of our current position on the longest string we will look for matching characters.  The window can be calculated with the following equation:
(Jaro-Winkler distance, 2011)

As you can see, the window is generally a little less than half the size of the longest string.

/**
 * Determine the maximum window size to use when looking for matches.
 * The window is basically a little less than the half the longest
 * string's length.
 * Equation: [ Max(A, B) / 2 ] -1
 * @param stringOne First String
 * @param stringTwo Second String
 * @return Max window size
 */
 public static int matchingWindow(String stringOne, String stringTwo){
   return 
    (Math.max(stringOne.length(), stringTwo.length()) / 2) - 1;
 }

Winkler's contribution to the algorithm was the addition of the weighted prefix.  Winkler introduced the notion that strings starting with the same characters (exact, not out of order matching) should be more heavily weighted.  The caveat is that Winkler considers the maximum prefix size to be 4.  This means that all matching characters past the first four are weighted equally.  The number of matching characters in the prefix are multiplied by a constant, the standard being 0.1 (this is what Winkler used).

/**
  * Find the Winkler Common Prefix of two strings.  In Layman's terms,
  * find the number of character that match at the beginning of the
  * two strings, with a maximum of 4.
  * @param stringOne First string
  * @param stringTwo Second string
  * @return Integer between 0 and 4 representing the number of
  * matching characters at the beginning of both strings.
  */
 public static int winklerCommonPrefix(
                    String stringOne, String stringTwo){
  
   int commonPrefix = 0;
   //Find the shortest string (we don't want an index out of bounds
   //exception).
   int boundary = (stringOne.length() <= stringTwo.length())? 
                    stringOne.length() : stringTwo.length();
   //iterate until the boundary is hit (shortest string length)
   for(int i = 0; 
           i < boundary;
           i++){
      //If the character at the current position matches
      if(stringOne.charAt(i) == stringTwo.charAt(i)){
         //increment the common prefix
         commonPrefix++;
      } else {
         //otherwise, continue no further, we are done.
 break;
      }
      //If we hit our max number of matches, bust out
      //of the loop.
      if(commonPrefix == 4){ break; }
   }
   //return the number of matches at the beginning of 
   //both strings
   return commonPrefix;
 }

Now that we understand Winkler's contribution, I can show you the full Jaro-Winkler equation:
(Jaro-Winkler distance, 2011)
dw = Jaro-Winkler Distance
dj = Jaro Distance
l = the prefix length (number of starting characters in both strings that matched, max of 4)
p = the prefix weight (default = 0.1)

The best way to understand how this algorithm works is to look at its implementation in code:

//Bonus weighting for string starting with the same characters
//(e.g.: prefix scaling factor)
public static double PREFIX_SCALING_FACTOR = 0.1;
  
/**
 * Calculate the Jaro-Winkler Similarity of two strings
 * @param stringOne First String
 * @param stringTwo Second String
 * @return Jaro-Winkler similarity value
 */
 @Override
 public double calculate(String stringOne, String stringTwo) {
   //Get Matches and Transposes
   MatchResults matchResults = 
      determineMatchesAndTransposes(stringOne, stringTwo);
   //Get the Jaro Distance
   double jaroDistance = 
      jaroDistance(
         matchResults.numberOfMatches, 
         matchResults.numberOfTransposes, 
         stringOne.length(), 
         stringTwo.length());
   //Find the Winkler common prefix length (maxes at 4 characters)
   int winklerCommonPrefix = 
      winklerCommonPrefix(stringOne, stringTwo);
   //Find the Jaro-Winkler Distance
   // = Jd + (l * p * ( 1 - Jd));
   double jaroWinklerDistance = 
      jaroDistance + (winklerCommonPrefix * PREFIX_SCALING_FACTOR)
         * (1 - jaroDistance);
   //Return the distance
   return jaroWinklerDistance;
 }

And that's it. Jaro-Winkler in its entirety. Now let's see the differences between these similarity implementations. The following is the "runner" I'm using to execute these similarity functions (I refactored the old implementation of this class, which was displayed in the first post):

public class SimilarityRunner {

/**
 * @param args
 */
 public static void main(String[] args) {  
  
   String one = "apple";
   String two = "applet";
  
   printSimilarity(new CosineSimilarity(), one, two);
   printSimilarity(new JaccardSimilarity(), one, two);
   printSimilarity(new SorensenSimilarity(), one, two);
   printSimilarity(new JaroWinklerSimilarity(), one, two);
 }

 private static void printSimilarity(
       ISimilarityCalculator simCalculator, 
                    String one, String two){
  
   double percentSimilar = 
     simCalculator.calculate(one, two) * 100;

   System.out.println(
     String.format("[%s] %s and %s are %s%% similar",    
       simCalculator.getClass().getSimpleName(), 
       one, two, percentSimilar));
 }
 
}

I will be changing the values of strings 'one' and 'two' to demonstrate the differences between their calculations. Let's begin by using the strings currently set in the code: "apple" and "applet".

[CosineSimilarity] apple and applet are 93.54143466934852% similar
[JaccardSimilarity] apple and applet are 80.0% similar
[SorensenSimilarity] apple and applet are 72.72727272727273% similar
[JaroWinklerSimilarity] apple and applet are 96.66666666666667% similar

The strings "apple" and "applet" are obviously very similar. Both the Cosine and Jaro-Winkler similarity algorithms, in my opinion, do a good job of noting the similarity. Now let's look at the string "string" compared to it's reverse "gnirts":

[CosineSimilarity] string and gnirts are 100.00000000000003% similar
[JaccardSimilarity] string and gnirts are 100.0% similar
[SorensenSimilarity] string and gnirts are 100.0% similar
[JaroWinklerSimilarity] string and gnirts are 38.888888888888886% similar

Isn't that interesting...

As you can see, Jaro-Winkler does a much better job at determining the similarity of strings because it takes order into account. In my next installment of this series, I will tackle the Levenshtein distance algorithm, which not only tells you how similar two strings are, but also what needs to change in order for both strings to be exact (can anyone say DIFF?).

I hope you found this post useful and/or enjoyable. If you want the source code from this post, please look at the resources section below.

Resources:

I'm publishing all of my stuff on GitHub now. The following is the repository for the StringSimilarity stuff I have been presenting: https://github.com/rclayton/StringSimilarity

References:

Wikipedia contributors. "Jaccard index." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 13 Dec. 2010. Web. 23 Jan. 2011.

Wikipedia contributors. "Jaro–Winkler distance." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 9 Jan. 2011. Web. 23 Jan. 2011. 

Wikipedia contributors. "Levenshtein distance." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 27 Dec. 2010. Web. 23 Jan. 2011.

Wikipedia contributors. "Paul Jaccard." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 17 Nov. 2010. Web. 23 Jan. 2011. 

Wikipedia contributors. "Sørensen similarity index." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 8 Nov. 2010. Web. 23 Jan. 2011.

Wikipedia contributors. "Thorvald Sørensen." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 1 Jan. 2011. Web. 23 Jan. 2011.

7 comments:

  1. Hey Richard. Yes, that Todd. I had a quick question for you regarding your Jaro-Winkler implementation. I have conducted research on this myself. I am seeing conflicts between several other implementations and yours specifically when I compare "Todd" to "Brad" or something similar.

    The difference between implementations is that you increment the match counter when a transposed value is found and other algorithms don't. This causes the values to come out different. Your implementation is considering "Todd" and "Brad" much more similar because of this.

    I tried to find information regarding this particular situation but couldn't. Do you have any thoughts on the subject?

    ReplyDelete
    Replies
    1. Todd, great to hear from you. It's completely possible I did something wrong and need to update the implementation. I'm going to pull down the repo and see what's going on. My guess is that the match counter is a little too overzealous. I've haven't calculated the difference between "Todd" and "Brad" yet, would you mind telling me what value you are getting?

      Delete
    2. With other implementations, that don't increment the match counter when a transposed value is found, I get .5. With yours I am getting 0.33. I have flipped the scale so 0 is an exact match (or no distance). The true value would be about 0.67 which seems high for "Todd" and "Brad" (only one match).

      It seems your algorithm is counting the "d" in "Brad" as a match for the the last "d" in "Todd" AND a transposition for the preceding "d". It might have something to do with the fact that "Todd" has two consecutive "d"s. I see similar discrepancies if I test other values where one has duplicate letters.

      I did some research but couldn't find anything specific regarding how to best handle two consecutive letters or if the algorithm should even take care of it. Therefore, it seems to come down to the fact that you increment that match counter for a transposed value. So, is a transposed letter a match or is it a transposed letter? I don't think it can be both. My understanding is that transposed letters are matches, but matches of lesser value.

      When you get down to it, if we visually evaluate the two strings, we can see that we should have one match (the last "d"). I don't know if we are supposed to consider the last d also a transposition of the other "d". Most the algorithms I tested do this. I get back 1 match and 1 transposition. With yours I get 2 matches and 1 transposition.

      Technically speaking, I don't think the last "d" should count as a transposition since it matched. I can't find information on how these cases should be handled. I guess it would be possible to, while looking for transposed values, check if the next value is a match.

      In your research, did you come across any information on how to appropriately deal with duplicate characters? Also, if your research showed that incrementing the match count for a transposition is correct, just let me know. It is always possible that I am missing something.

      Delete
  2. Hey great article! And the example helps to put things in perspective!

    ReplyDelete
  3. Hi,

    First of all, thank you for the article and the code.

    To the point...
    Probably, I missed something, but "Given the strings DWAYNE and DUANE", the result should be "0.84".
    That's what appears in Wikipedia.

    Your implementation produces "0.765"...
    The amount of transpositions is 2 (in your impl), and this makes the results different.
    Was it by design for some reason?

    ReplyDelete
    Replies
    1. Mark,

      First, thank you for your comments. There is a bug in my implementation that I need to fix to ensure I don't have more than one transposition. I need to go back and fix it and update this post. Unfortunately, I haven't had much time to make the correction.

      Thanks,

      Richard

      Delete