Saturday, June 25, 2011

Calculating Similarity (Part 3): Damerau-Levenshtein Distance

I promised this post a while ago and have unfortunately too busy to complete it.  I noticed a couple of people had searched Google explicitly for this post, so that encouraged me to complete it!

According to Wikipedia:
"Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters." [1]
In Computer Science, we commonly call algorithms like Damerau-Levenshtein Distance "Edit Distance", since the distance and resultant matrix tell us the number of transpositions, insertions, deletions, etc. necessary to make two strings identical.

The algorithm to calculate Damerau-Levenshtein is remarkably simple. Please follow the inline comments for a better understanding of how the algorithm works.

DamerauLevenshteinDistance.java

package com.berico.similarity;

/**
 * Damerau-Levenshtein Distance
 * Based on the algorithms provided at the following websites:
 * 
 * http://snippets.dzone.com/posts/show/6942
 * http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
 * 
 * @author Richard Clayton (Berico Technologies)
 * @date June 25, 2011
 */
public class DamerauLevenshteinDistance 
                   implements IDistanceCalculator {

  /**
   * Calculate the Damerau-Levenshtein Distance (edit distance)
   * between two strings.
   * @param source Source input string
   * @param target Target input string
   * @return The number of substitutions it would take
   *          to make the source string identical to the target
   *         string
   */
  public int calculate(String source, String target){
    //If both strings are empty, I'm of the opinion that
    //this is an error (technically the distance is zero).
    assert( !(source.isEmpty() && target.isEmpty()));
    
    //If the source string is empty, the distance is the
    //length of the target string.
    if(source.isEmpty()){
      return target.length();
    }
    
    //If the target string is empty, the distance is the
    //length of the source string.
    if(target.isEmpty()){
      return source.length();
    }
    
    //Delegate the calculation to the method that produces the matrix
    //and distance, but then only return the distance
    return calculateAndReturnFullResult(source, target).getDistance();
  }
  
  /**
   * Perform the distance calculation, but also return the
   * resulting matrix and distance.
   * @param source Source input string
   * @param target Target input string
   * @return A simple object with the matrix and distance
   */
  public DameauLevenshteinDistanceResult 
           calculateAndReturnFullResult(String source, String target){

    //If both strings are empty, I'm of the opinion that
    //this is an error (technically the distance is zero).
    assert( !(source.isEmpty() && target.isEmpty()));
    
    //We are going to construct a matrix of distances
    int[][] distanceMatrix 
      = new int[source.length() + 1][target.length() + 1];
    
    //We need indexers from 0 to the length of the source string.
    //This sequential set of numbers will be the row "headers"
    //in the matrix.
    for(int sourceIndex = 0; 
      sourceIndex <= source.length(); 
      sourceIndex++){
      
      //Set the value of the first cell in the row
      //equivalent to the current value of the iterator
      distanceMatrix[sourceIndex][0] = sourceIndex;  
    }
    
    //We need indexers from 0 to the length of the target string.
    //This sequential set of numbers will be the 
    //column "headers" in the matrix.
    for(int targetIndex = 0;
      targetIndex <= target.length();
      targetIndex++){
      
      //Set the value of the first cell in the column
      //equivalent to the current value of the iterator
      distanceMatrix[0][targetIndex] = targetIndex;
    }
    
    //We'll use this to add a penalty
    //to some operations.
    int cost = 0;
    
    //Iterate over all characters in the source
    //string.
    for(int sourceIndex = 1; 
    sourceIndex <= source.length(); 
    sourceIndex++){
      
      //Iterate over all characters in the target
      //string.
      for(int targetIndex = 1;
      targetIndex <= target.length();
      targetIndex++){
        
        //If the current characters in both strings are equal
        if(source.charAt(sourceIndex - 1)
               == target.charAt(targetIndex - 1))
        {
          //There is no penalty.
          cost = 0;
        }
        else 
        {
          //Not equal, there is a penalty.
          cost = 1;
        }
        
        //We want to find the current distance by determining
        //the shortest path to a match (hence the 'minimum'
        //calculation on distances).
        distanceMatrix[sourceIndex][targetIndex] 
          = minimum(
           //Character match between current character in 
           //source string and next character in target
           distanceMatrix[sourceIndex - 1][targetIndex] + 1, 
           //Character match between next character in
           //source string and current character in target
           distanceMatrix[sourceIndex][targetIndex - 1] + 1,
           //No match, at current, add cumulative penalty
           distanceMatrix[sourceIndex - 1][targetIndex - 1] + cost);
        
        //We don't want to do the next series of calculations on
        //the first pass because we would get an index out of bounds
        //exception.
        if(sourceIndex == 1 || targetIndex == 1){
          continue;
        }
        
        //transposition check (if the current and previous 
        //character are switched around (e.g.: t[se]t and t[es]t)...
        if(source.charAt(sourceIndex - 1) 
              == target.charAt(targetIndex - 2)
          && source.charAt(sourceIndex - 2) 
              == target.charAt(targetIndex - 1)){
          
          //What's the minimum cost between the current distance
          //and a transposition.
          distanceMatrix[sourceIndex][targetIndex] 
            = minimum(
               //Current cost
             distanceMatrix[sourceIndex][targetIndex],
             //Transposition
             distanceMatrix[sourceIndex - 2][targetIndex - 2] + cost);
        }
      }
    }
    
    //Return the matrix and distance as the result
    return new DameauLevenshteinDistanceResult(distanceMatrix);
  }
  
  /**
   * Calculate the minimum value from an array of values.
   * @param values Array of values.
   * @return minimum value of the provided set.
   */
  private static int minimum(int... values){
    
    //Hopefully, everything should be smaller
    //than the max int value!
    int currentMinimum = Integer.MAX_VALUE;
    
    //Iterate over all provided values
    for(int value : values){
      
      //Take the minimum value between the current
      //minimum and the current value of the
      //iteration
      currentMinimum = Math.min(value, currentMinimum);
    }
    
    //return the minimum value.
    return currentMinimum;
  }
  
  /**
   * Simple container for the result of the Dameau-Levenshtein
   * Distance calculation
   * @author Richard Clayton (Berico Technologies)
     * @date June 25, 2011
   */
  public class DameauLevenshteinDistanceResult {
    
    //Distance matrix
    private int[][] distanceMatrix;
    
    /**
     * Instantiate the object with the resulting distance matrix
     * @param distanceMatrix Matrix of distances between edits
     */
    public DameauLevenshteinDistanceResult(int[][] distanceMatrix){
      this.distanceMatrix = distanceMatrix;
    }

    /**
     * Get the Distance Matrix
     * @return Matrix of edit distances
     */
    public int[][] getDistanceMatrix() {
      return distanceMatrix;
    }
    
    /**
     * Get the Edit Distance
     * @return number of changes to make before
     *         both strings are identical
     */
    public int getDistance(){
      return 
        distanceMatrix
          [distanceMatrix.length - 1][distanceMatrix[0].length - 1];
    }

    /**
     * Get a string representation of this class
     * @return A friendly display of the distance and matrix
     */
    @Override
    public String toString() {
      
      StringBuilder sb = new StringBuilder();
      
      sb.append(
         String.format(
           "Distance: %s \n", this.getDistance()));
      sb.append("Matrix: \n\n");
      
      for(int i = 0; i < this.distanceMatrix.length; i++){
        
        sb.append("| ");
        
        for(int j = 0; j < this.distanceMatrix[0].length; j++){
        
          sb.append(String.format("\t%s", this.distanceMatrix[i][j]));
        }
        
        sb.append(" |\n");
      }
      
      return sb.toString();
    }
  }
}

Here are some examples of using the distance calculator.

IDistanceCalculator distanceCalc = new DamerauLevenshteinDistance();
      
String distOne = "snapple";
String distTwo = "apple";
      
int editDistance = distanceCalc.calculate(distOne, distTwo);
      
System.out.println(
  String.format("The distance between %s and %s is %s",
    distOne, distTwo, editDistance));

And the output from the console:

The distance between snapple and apple is 2

I've also added a method for getting the full result back from the calculator (matrix and distance). There is an example below, but remember, you will not be able to access the method if your are using the interface's type to reference the calculator.

String distOne = "snapple";
String distTwo = "apple";

DamerauLevenshteinDistance distanceCalc2 
     = new DamerauLevenshteinDistance();

DameauLevenshteinDistanceResult result 
     = distanceCalc2.calculateAndReturnFullResult(distOne, distTwo);

System.out.println(result);

And the output from the console:

Distance: 2 
Matrix: 

|  0 1 2 3 4 5 |
|  1 1 2 3 4 5 |
|  2 2 2 3 4 5 |
|  3 2 3 3 4 5 |
|  4 3 2 3 4 5 |
|  5 4 3 2 3 4 |
|  6 5 4 3 2 3 |
|  7 6 5 4 3 2 |

Once again, you can access the Eclipse project at the following link: http://dl.dropbox.com/u/12311372/StringSimilarity.zip.

If you have any questions or comments, I would love to hear them.

Richard

[1]. Wikipedia contributors. "Damerau–Levenshtein distance." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 20 Jun. 2011. Web. 25 Jun. 2011.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.