(Dot Product, 2010). |

I once worked on a project that was attempting to fix OCR errors by comparing the similarity of the erroneous string to potential candidate strings. In the process, I did a lot of research on string comparison, especially in the area of similarity. I came across, and eventually implemented, a number of mathematical formulas for determining similarity and want to share these techniques with you. The first formula I'm going to demonstrate for you is "cosine similarity".

Cosine similarity is literally the angular difference between two vectors. Cosine Similarity is expressed by the formula:

(Cosine Similarity, 2010) |

For those of you allergic that don't speak "mad mathematician" (like me), to calculate the cosine similarity, we need to:

- Take the dot product of vectors A and B.
- Calculate the magnitude of Vector A.
- Calculate the magnitude of Vector B.
- Multiple the magnitudes of A and B.
- Divide the dot product of A and B by the product of the magnitudes of A and B.

There is still a problem. We need to figure out a way to represent a string as an vector of numbers, since we need to calculate the dot product and magnitude of two vectors.

Let's eliminate the first question that might arise: What is a vector?

*A vector is a geometric structure with both length (magnitude) and direction.*

On a 2-dimensional Cartesian coordinate plane, you might see a vector like (2, 1), which indicate a line starting at the origin (0, 0) and extending to 2 units "right" on the x-axis and 1 unit "up" on the y-axis. Adding a 3rd dimension to a vector is easy, simply add another point to the vector: (2, 1, 3). In the previous example, we've added a value of "3" to the z-axis.

We can perform mathematical operations on vectors like we can scalar values. Vectors can support standard mathematical operations (addition, multiplication, division) if and only if the vectors the operations are being performed on have an equal number of dimensions (you will see that this is what causes the

**VectorMathException**in the code submitted with this post).

Vectors also feature a couple of special operations, which we will be using for the purposes of this demonstration: magnitude and dot product.

For those of you who have "data-dumped" the magnitude and dot product operations of vectors, here is a short reminder:

Magnitude - literally the length of a vector.

(Magnitude, 2010). |

Here the example on Wikipedia: "the magnitude of [4, 5, 6] is √(4

^{2}+ 5

^{2}+ 6

^{2}) = √77 or about 8.775" (Magnitude, 2010).

Dot Product - the inner product of two vectors.

(Dot Product, 2010). |

Here is an example of the Dot Product of two vectors:

(Dot Product, 2010). |

Now that we have gotten through the obligatory mathematics necessary in understanding how this algorithm works, lets see how we can use it to compare the similarity of strings.

The first task we have is to convert the strings into vectors. We say before that we can represent three dimensional Cartesian coordinate space by three variables: x, y, and z. We can do the same thing with a string by representing each character 'a', 'b', 'c', etc. as a dimension. If we impose the limitation that we are going to only use lower-case letters as the possible dimensions of our vector, we are left with 26 dimensions.

If we store only the occurrence of a letter in the vector, we have a

*binary occurrence vector*. If we store the frequency in which a letter occurs in the vector, we have a

*frequency of occurrence vector*.

Consider the term "apple":

Dimension keys in order

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z

*Binary Occurrence Vector of "apple"*

1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0

*Frequency of Occurrence Vector of "apple"*

1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0

Of course, there are much easier ways to deal with strings than storing all of those zeros. When comparing strings, we can simply take the "union" of dimensions in both strings, reducing the number of dimensions we need to calculate.

Taking the union of the strings "apple" and "applet", we get the vector: (a, e, l, p, t).

For this demonstration, we are going to use

*frequency of occurrence vectors*since they more accurately measure similarity. Keep in mind, if you don't care how often a dimension occurs, the

*binary occurrence vector*can be used for this formula.

Converting our previous words "apple" and "applet" into

*frequency of occurrence vectors*:

apple = (1, 1, 1, 2, 0) applet = (1, 1, 1, 2, 1)

We can now perform the cosine similarity calculation:

- Dot Product = (1 * 1) + (1 * 1) + (1 * 1) + (2 * 2) + (0 * 1) = 1 + 1 + 1 + 4 + 0 = 7
- Magnitude of apple = √(1
^{2}+ 1^{2}+ 1^{2}+ 2^{2}+ 0^{2}) = √(1 + 1 + 1 + 4 + 0) = √(7) - Magnitude of applet = √(1
^{2}+ 1^{2}+ 1^{2}+ 2^{2}+ 1^{2}) = √(1 + 1 + 1 + 4 + 1) = √(8) - Products of magnitudes A & B = √(7) * √(8) = √(56) = 7.48331477
- Divide the dot product of A & B by the product of magnitude of A & B = 7 / 7.48331477 = .935414347 (or about 94% similar).

*frequency of occurrence*vectors.

*Here's the link to the Eclipse Project: StringSimilarity.zip*

Here's the signature of the API I've created to calculate similarity:

public static void main(String[] args) { ISimilarityCalculator similarity = new CosineSimilarity(); String one = "apple"; String two = "applet"; double percentageSimilar = similarity.calculate(one, two) * 100; System.out.println( String.format("%s and %s are %s%% similar", one, two, percentageSimilar)); }And here is the output from the command line:

apple and applet are 93.54143466934852% similarAs you can see, we're getting results similar to our manually calculated similarity.

Going back through the steps, here is how I've accomplished this algorithm.

Convert string to vector:

/** * Convert a string to a set of characters * @param string input string * @return set of characters */ public static Collection<Character> stringToCharacterSet(String string){ Collection<Character> charSet = new HashSet<Character>(); for(Character character : string.toCharArray()){ charSet.add(character); } return charSet; }

Take the union of two character vectors:

/** * Get the union of characters from two strings, * the set of characters are sorted alphabetically. * @param string1 First String * @param string2 Second String * @return the unique set of characters * occurring in both strings */ public static Collection<Character> union(String string1, String string2){ Collection<Character> mergedVector = new TreeSet<Character>(); mergedVector.addAll(stringToCharacterSet(string1)); mergedVector.addAll(stringToCharacterSet(string2)); return uniqueCharacters(mergedVector); } /** * Get the unique set of characters from a string. * @param vector input string vector * @return set of unique characters */ public static Collection<Character> uniqueCharacters(Collection<Character> vector){ Collection<Character> uniqueSet = new HashSet<Character>(); for(Character c : vector){ if(!uniqueSet.contains(c)){ uniqueSet.add(c); } } return uniqueSet; }Create a Frequency of Occurrence Vector:

/** * Get the Frequency of Occurrence Vector * from the Union Set and target string * @param string Input String * @param unionSet Set of all Character-dimensions * @return Frequency of Occurrence Vector */ private static Collection<Integer> createFrequencyOfOccurrenceVector( String string, Collection<Character> unionSet){ Collection<Integer> occurrenceVector = new Vector<Integer>(); for(Character c : unionSet){ occurrenceVector.add( (Integer)countCharacter(string, c)); } return occurrenceVector; } /** * Count the number of times a character * occurs in a string * @param string Input String * @param character Character to be counted * @return Frequency of Occurrence */ private static int countCharacter( String string, Character character){ int count = 0; for(Character c : string.toCharArray()){ if(c.equals(character)){ count++; } } return count; }Calculate the Dot Product:

/** * Calculate the Dot Product * (inner product) of two vectors * @param vectorOne Vector * @param vectorTwo Vector * @return Dot Product * @throws VectorMathException Thrown if vectors * are not of equal length */ public static int dotp( Integer[] vectorOne, Integer[] vectorTwo) throws VectorMathException { if(vectorOne.length != vectorTwo.length){ throw new VectorMathException( "Input Vectors do not have the" + "same number of dimensions."); } int dotProduct = 0; for(int i = 0; i < vectorOne.length; i++){ dotProduct += (vectorOne[i] * vectorTwo[i]); } return dotProduct; }

Calculate the Magnitude:

/** * Calculate the Magnitude of a vector * @param vector Vector * @return Magnitude of the Vector */ public static double magnitude(Integer[] vector){ double magnitude = 0; for(int i = 0; i < vector.length; i++){ magnitude += Math.pow(vector[i], 2); } return Math.sqrt(magnitude); }And finally, here is the Cosine Similarity function:

/** * Calculate the similarity of two * strings using Cosine Similarity * @param stringOne First input string * @param stringTwo Second input string * @return cosine of the two angles * (percentage of similarity) */ @Override public double calculate( String stringOne, String stringTwo) { Collection<Character> unionOfStringsOneAndTwo = union(stringOne, stringTwo); Collection<Integer> stringOneOccurrenceVector = createFrequencyOfOccurrenceVector( stringOne, unionOfStringsOneAndTwo); Collection<Integer> stringTwoOccurrenceVector = createFrequencyOfOccurrenceVector( stringTwo, unionOfStringsOneAndTwo); int dotProduct = 0; //This should be an unnecessary exception //since we're submitting the union //of both strings try { dotProduct = dotp( stringOneOccurrenceVector, stringTwoOccurrenceVector); } catch (VectorMathException e){ e.printStackTrace(); System.out.println(stringOneOccurrenceVector); System.out.println(stringTwoOccurrenceVector); return -2; } double vectorOneMagnitude = magnitude(stringOneOccurrenceVector); double vectorTwoMagnitude = magnitude(stringTwoOccurrenceVector); return dotProduct / (vectorOneMagnitude * vectorTwoMagnitude); }That's it. The rest of the code can be found attached to this post (and that includes some initial unit tests. As a side note, I know the use of Java collections are probably pretty confusing. I really do not like the Java collection API, but did not want to add any new dependencies (like Apache Commons Collections or the Google Collections).

Conclusion.

The Cosine Similarity algorithm is useful for determining if the composition of two strings are similar, and does not take the order of the strings into account. There are better, though more computationally expensive algorithms for calculating a more accurate percentage of similarity. In later posts, I hope to introduce those algorithms and how they work.

Thank you for your time.

Richard

Resources

Eclipse Project: StringSimilarity.zip

Works Cited

Wikipedia contributors. "Cosine similarity."

*Wikipedia, The Free Encyclopedia*. Wikipedia, The Free Encyclopedia, 3 Dec. 2010. Web. 5 Dec. 2010.

Wikipedia contributors. "Dot product."

*Wikipedia, The Free Encyclopedia*. Wikipedia, The Free Encyclopedia, 18 Nov. 2010. Web. 5 Dec. 2010.

Wikipedia contributors. "Magnitude (mathematics)."

*Wikipedia, The Free Encyclopedia*. Wikipedia, The Free Encyclopedia, 20 Nov. 2010. Web. 5 Dec. 2010.

I just wanted to say that I am your one hundredth visitor!

ReplyDeleteHi, I found this explanation of how the cosine similarity works very nice. I seem to understand it now which is great since wiki page was a bit, eck!

ReplyDeleteI'm doing a project where I'm comparing a few similarity algorithms with eachother. Since I have to make a demo too I was thinking of using this algorithm as one of the algorithm.

After running the code I saw that this code takes the letters and count them as unique "words" instead of the actual words. So I was wondering what I should do to make it work on strings instead of words.

Instead of "apple" and " applet" but like this expample: "I was out running yesterday night and then fell down the hill." and "You were out running yesterday night and then fell down the hill."

Should I make a new class for words or can I change the char out with strings? Programming isn't my major so sorry if I sound too stupid, I'm still in the middle of learning java.^^

@n00programmer - Thank you for your comment. Instead of determine how many of the characters in two words are similar, you want to know how many words in two sentences/paragraphs are similar.

ReplyDelete1. First, create a union of all words in both sentences (unique word instances).

Richard has blue eyes.

Richard has big blue eyes.

The union therefore is:

['Richard', 'has', 'big', 'blue', 'eyes']

2. Create frequency of occurrence vectors for each sentence:

Sentence 1: [1, 1, 0, 1, 1]

Sentence 2: [1, 1, 1, 1, 1]

3. Calculate the dot product of the two vectors:

1*1 + 1*1 + 0*1 + 1*1 + 1*1 = 4

4. Calculate the magnitude of the two vectors:

Sentence 1: √(1^2 + 1^2 + 0^2 + 1^2 + 1^2) = √4 = 2

Sentence 2: √(1^2 + 1^2 + 1^2 + 1^2 + 1^2) = √5

5. Find the product of the magnitudes:

2 * √5 = 4.47213

6. Divide the dot product (step 3) by the product of the magnitudes (step 4):

4 / 4.47213 = 0.8944

(the sentences were 89% similar).

Keep in mind, we are only asking if a word exists in both sentences, not if they exist in the same order. Jaro-Winkler Similarity is much better at this. I would also use the String.equalsIgnoreCase() variant of the equals method unless case is important.

You can easily take the API I've provided in the example and convert the functions from using Characters to Strings (Words).

Let me know if you need any more help.

Richard

Thanks for answering my questions.^^

ReplyDeleteYes, exactly; how many words in two sentences/paragraphs are similar.

"1. First, create a union of all words in both sentences (unique word instances).

Richard has blue eyes.

Richard has big blue eyes.

The union therefore is:

['Richard', 'has', 'big', 'blue', 'eyes']"

Shouldn't I start with making the sentence to an String array and then make to vector? Or is there no need for that since it's words?

About String.equalsIgnoreCase(), I used String.toLowerCase()instead but it should give the same result, I hope.

About Jaro Winkler, the algorithm seems to be best for short sentences likes names. It okay, I know that cosine will count the words without looking much into the sentence structure.

@n00programmer

ReplyDeleteFirst, I'm assuming that you are doing this for something like a school project. If not, I would encourage you not to "reinvent the wheel", but rather to use a Natural Language Processing library. Stanford University and the University of Illinois have a great set of tools, and there's also the Apache OpenNLP project (amongst others). Also, instead of using the classic Java collections libraries, you might find Google Collections or the Apache Commons Collections ( Generics version: http://sourceforge.net/projects/collections/) more appropriate for your project.

If you want to do it on your own:

Yes, you want to create vectors of both instances. You are going to have to consider a couple of factors:

1. What is your "tokenization" strategy (e.g.: how are you going to actually split the sentence into a vector)?

The obvious solution is to split the string on whitespace (spaces, tabs, new lines). In this case of string comparison, I would use a generic collection like array list, because it will offer you a numer of nice functions:

String sentence = "This is my sentence.";

List words = Arrays.asList(sentence.split(" "));

Now you can do things like:

if(words.contains("sentence")){

//Do something

}

2. Do you need to normalize the data?

Consider the presence of commas and periods in your sentence. The strings "end" and "end." will not be considered a match in a binary decision mechanism (either yes or no, not % of similarity). You may also consider character case. The capitalization of characters can really affect the algorithm; "bush" has a diffent context in the phrases "President George Bush" and "trimming a bush". Of course, this will depend on your dataset, and I wouldn't spend too much time trying to solve this problem since many academics are still struggling with this topic.

Yes, you could call it a schoolproject. I don't know if I'm allowed to use the libraries or not. I think I'm not.

ReplyDeleteNow to answering your questions before asking mine.

1.

Yes, I will do it the easy way and split the string by words since I'm using the words in a string as tokens/vectors to compare with the other string.

2.

About normalizing the data: Well I was thinking of going by lowercase and removing all ",.?! from the string before running the algorithm on the strings. Of course this will happen via the String class' methods like String.toLowerCase() and String.replaceAll(>symbol<, " ").

About the Bush and bush example. I figured that I won't have to do something that hard, like you say it perfectly yourself. Many academics are still struggling with this topic. Though I shall refer to the problem and may go in depth with it a little. I expect it not to be that big a problem since the content of the text should be enough for the algorithm to "see" that we arent talking about the president but a plant.

Questions:

As I said before I'm not that great in programming so I may sound stupid, I apologize beforehand.

1.

I replaced all "Character" with "String", Collection with Collection and all string.toCharArray() to string.split() and somehow it worked perfectly on lines. I'm shocked at how easy it was so I'm going to ask this question: Can it really be that easy to make the algorithm to work for sentences instead of words by just doing that? :O

2.

"charSet" is that the set of tokens which should be vectors, right? Can I use that to pick out words if I have to and put in other words instead? Like say, we got the sentence "The President Bush got trimmed" then take out the word President and instead australian instead so we get "The australian bush got trimmed". Can I do that with the vector set "charSet"?

3.

Is there a reason for you to writing overloading methods in the classes, like the ones in the VectorMath and the one who makes the charSet?

@n00programmer

ReplyDelete-Sorry about the removed post (I made an error and posted the same paragraph twice).

Answering your questions in order:

1. For building vectors with "non-unique" elements, yes this will work.

2. On this topic, if you are talking about replacing the value of one element in the set with another, yes you could remove the element and insert another (these are functions on the Java Collection interface), but order will not be preserve (if this matters to you). You may need to look at using a different type of Generic collection. Also, you will need to change the collections type from Character to String.

3. I have reviewed this code in a while, but at one time I was using it with another project. I believe those overloads might have been used for it.

-It's okay I never the removed post (I dont have net at home atm so I can only check it when I'm at school)

ReplyDelete1.

Hmmm...I can't see the need for unique elements since it compares strings fine, so great. :D

2.

Super. I haven't read anything about collection class and vectors so far in the book I'm using to learning Java so I guess I should look more into these types.

3.

I see. Then I can delete those ;)

Apropos order; You put your tokens into alphabetically order but I don't see the function called anywhere in the code. Is it being used at all(could be that it was used with another project too) and is there really a need to but them in alphabetically order in this cosine similarity example?

Same thing with as above with the intersect method. I don't seem to see it's importance in this code....so is it from the old project too?

@n00programmer

ReplyDeleteDon't worry about the (removed post comment), that was my mistake. I posted, realized I made a mistake, temporarily removed the post, added a new post (with the comment), and then figured out how to permanently delete the original. Sorry for the confusion.

I remembered the purpose of those functions not being used in the demo. The first time I wrote this post, I only had those functions I needed to perform Cosine Similarity. I have a subsequent post about calculating similarity where I expanded on the API to perform Jaccard and Jaro-Winkler. If the function is not listed on this blog post, it is probably not used for Cosine Similarity (I can't say definitively).

Thanks for answering and no need to be sorry about the deleted post, accidents happen^^

ReplyDeleteAfter having this conversation on your blog about this I seem to have gotten to know cosine similarity much better than compared to what I've reading on other people's blogs. For that I want to thanks you a lot.

I tried to implement the cosine similarity using your code as a skeleton and it seems to work fine. After accomplished that I seem to have found out how to use programming in math (than just adding/subtracting/dividing and multiplying)and what possibilities it open up for. Thanks for that insight.

I still want to thank you for putting up with me for all the stupid questions I may have asked and hope that you didn't found them too annoying. ^^

After all the thanks I will give you an advice on Levenshtein,rather one of it's "brothers"since I read that would be your next topic in similarity. A thing which most people seems to miss out on when they program.

The pseudocode for Levenshtein-Damerau is not the algorithm written first on wiki. It's the restricted LD which is called OptimalStringAlignment Distance. The unrestricted algorithm comes under it in Actionscript (God knows why it's actionscript and not normal pseudocode).

Lastly, again thanks for being so much help for me. I really appreciate it.

PS: I will be clicking and checking your blog from now on for goodies.

Nicely done, thanks for sharing.

ReplyDeleteGlad I could help.

Deletethank you very much !!! this is exactly what I was looking for !

ReplyDeleteMy pleasure.

Deletehi. Thanks for the post.I'm also working on similar thing. I need big help on this. I need to compare similarity between two documents instead. So how can i change your code to achieve that target?Please help me to overcome this problem. Give me code sample that can compare two documents instead of two words in cosine similarity. thank you

ReplyDeleteExcellent article, thanks for this. Can't wait to check out the rest of the blog!

ReplyDeleteHi Richard, excellent article. I have a question, how can you compare two vectors giving each value a relevance o ranking. For example, comparing vector1 {a, b, c, d} vs. vector2 {a, f, g, h} gives me a similary of 0.25. But how can I get my first value a bigger relevance? I'm trying to identify an element for a set of 10 values and three of them are the most important.

ReplyDeleteThanks in advance.

I learned more from this .. Thanks a lot..

ReplyDeleteI studied in some sites that cosine similarity is used in calculating the weight of the data in dataset...

Is it possible?

Please say how it is happening?

Raje,

ReplyDeleteNo sure what you mean by calculating the weight of data. Would you mind elaborating?

Thanks,

Richard

Many Thanks for the post, it realy helped...

ReplyDeleteI Wish if you can explain other alogrithms (some of them are mentioned below) as well.

Tanimoto

Minkowski

Bhattacharya

This comment has been removed by the author.

ReplyDeleteThanks for sharing this .This is very helpful.Please tell me how to find similarity between two documents.

ReplyDeletevery nice article !! :)

ReplyDeletei jus wanted to know if the output has to be printed in matrix format

ReplyDeleteI'm not certain I understand the question. What are you trying to do?

DeleteThis comment has been removed by the author.

ReplyDelete