Wednesday, December 29, 2010

JBoss Rules (Drools) "Brush" for Syntax Highlighter

After completing my last two JBoss Rules (Drools) posts, I decided to create a Syntax Highlighter brush for the Drools language. The syntax was largely inspired by Alan Daniel's vim syntax for Drools: http://www.vim.org/scripts/script.php?script_id=1940.  If anyone with a more stable hosting solution wouldn't mind hosting the file, please leave me a comment on this post.

Creating a new brush for Syntax Highlighter is really quite simple.  Here's all it takes to get support for Drools:

SyntaxHighlighter.brushes.Drools = function()
{

  var keywords    = 'assert duration end eval function' 
                  + 'modify query salience '
                  + 'retract return rule then when ' 
                  + 'activation-group '
                  + 'agenda-group auto-focus no-loop';
  var external    = 'native package import static';
  var conditional = 'if else switch and excludes '
                  + 'exists or not matches';
  var operator    = 'new instanceof ->';
  var type        = 'boolean char byte short int '
                  + 'long float double void';
 
  this.regexList = [
    { regex: /\s*#.*/gm, 
       css: 'comments' }, 
    { regex: SyntaxHighlighter.regexLib.singleLineCComments, 
       css: 'comments' }, 
    { regex: /\/\*([^\*][\s\S]*)?\*\//gm, 
       css: 'comments' }, 
    { regex: /\/\*(?!\*\/)\*[\s\S]*?\*\//gm, 
       css: 'preprocessor' }, 
    { regex: SyntaxHighlighter.regexLib.doubleQuotedString, 
       css: 'string' }, 
    { regex: SyntaxHighlighter.regexLib.singleQuotedString, 
       css: 'string' }, 
    { regex: /\b([\d]+(\.[\d]+)?|0x[a-f0-9]+)\b/gi, 
       css: 'value' }, 
    { regex: new RegExp(this.getKeywords(keywords), 'gm'), 
       css: 'keyword' }, 
    { regex: new RegExp(this.getKeywords(external), 'gm'), 
       css: 'color3' },
    { regex: new RegExp(this.getKeywords(conditional), 'gm'), 
       css: 'constants' }, 
    { regex: new RegExp(this.getKeywords(operator), 'gm'), 
       css: 'color2' }, 
    { regex: new RegExp(this.getKeywords(type), 'gm'), 
       css: 'color2' } 
 ];
   
  this.forHtmlScript({
    left : /(&lt;|<)%[@!=]?/g, 
    right : /%(&gt;|>)/g 
  });
};
SyntaxHighlighter.brushes.Drools.prototype 
   = new SyntaxHighlighter.Highlighter();
SyntaxHighlighter.brushes.Drools.aliases  = ['drools', 'drl'];

You will need to place this in a JavaScript file and link to it after you have imported the mandatory Syntax Highlighter JavaScript files.

<script type='text/javascript' 
  src='http://path/to/drools/brush/shBrushDrools.js' />

Now all you need to do is wrap your Drools Expert code in "pre" tags, referencing the appropriate brush class:

<pre class="brush: drools">
... include rules here ...
</pre>

Here's the "LocalAssertionsNotPingPong.drl" example from the source package of the Drools 5.1.1 framework:

/**
 * Copyright 2010 JBoss Inc
 *
 * Licensed under the Apache License, 
 * Version 2.0 (the "License");
 * you may not use this file except in 
 * compliance with the License.
 * You may obtain a copy of the License at
 *
 *  http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or 
 * agreed to in writing, software
 * distributed under the License is distributed 
 * on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 
 * either express or implied.
 * See the License for the specific 
 * language governing permissions and
 * limitations under the License.
 */

package org.drools.test;

import java.lang.String;
import java.lang.Integer;
import java.util.List;

global java.lang.Integer i;
global java.lang.String s;
global java.util.List l;

rule "not s then i"
  when
    not String()
  then
    if (l.size() < 10) {
      l.add(new Integer(0));
      System.out.println("Logical assert i");
      insertLogical(i);
    }
end
rule "i then s"
  when
    Integer()
  then
    if (l.size() < 10) {
      l.add(new Integer(0));
      System.out.println("Logical assert s");      
      insertLogical(s);
    }
end

That's it.  Enjoy.

Richard

Tuesday, December 28, 2010

Using JBoss Rules (Drools) in Clojure

On December 11th, I posted a demonstration of how developers could interact with the JBoss Rules framework in Scala (not only to instantiate and run the rules engine, but also to provide model objects for consideration in the rules process). This week, I want to demonstrate the same principles, but with the JVM-based Lisp variant, Clojure (http://clojure.org/).

The Clojure programming language was created by Rich Hickey in 2007 in an effort to bring a capable functional language to the JVM.  Hickey's Lisp implementation was design to address two issues specifically.  First, Hickey wanted a language that was completely interoperable with Java (Clojure uses Java primitives and compiles to byte code).  Second, Clojure was designed to address problems commonly encountered in concurrent programming (via actor model and software transactional memory).  For a more detailed description of the Clojure language, I suggest you visit the language's website. [1]

For those unfamiliar with the JBoss Rules framework (Drools), I highly suggest you visit the project's website:  http://www.jboss.org/drools.  In short, Drools enables developers to abstract business logic from their "middle tier" (Business layer).  The Drools framework is a rules engine, business process orchestrator and event processor all rolled into one.  For today's demonstration, however, we will only concern ourselves with the rules engine aspect of the framework, Drools Expert.

For today's demonstration, we will mirror the last post's functionality of evaluating Temperature objects.  As a reminder,  if the temperature is greater than 85 degrees, a "HEAT WARNING" should be set. Alternatively, if the temperature is less than 32 degrees, a "FREEZE WARNING" will be declared.

If you read the previous article, you will probably notice that this is the same rules file used for the Scala demonstration:

package com.berico.rules
 
import com.berico.model.weather.Temperature

rule "Too Hot"
   dialect "mvel"
   when
      temp : Temperature( value > 85 )
   then
      System.out.println(
        temp.value.toString() + " F is too hot."
        + " Declare HEAT WARNING!"); 
end

rule "Too Cold"
   dialect "mvel"
   when
      temp : Temperature( value < 32 )
   then
      System.out.println(
         temp.value.toString() + " F is too cold."
         + " Declare FREEZE WARNING!"); 
end

Clojure is probably one of the best languages (next to Groovy) in terms of interoperability with Java; this is especially surprising considering how dissimilar both languages are to each other. A program written completely in Clojure doesn't really need classes or interfaces. To interoperate with other JVM languages, Clojure has a method for class and interface generation [2]. The following code is a fully-compatible class written in Clojure.

*Note: lines starting with double semi-colons ";;" are comments (sorry syntax highlighting is not working at the moment).

(ns com.berico.model.weather.Temperature
  (:gen-class
   :init init

   ;; Define the method `getValue` 
   ;; which will be accessible to Java
   :methods [[getValue [] double]]

   ;; The double brackets indicate a default 
   ;; (parameterless) constructor
   :constructors {[double] []}

   ;; Store instance variables in 
   ;; the state variable
   :state state))

;; Constructor (accepts a double `value`)
(defn -init [value]

  ;; we are actually going to store reference to the
  ;; value as a key-value pair in the variable `state`
  [[] (ref {:value value})])

;; Get the value of the Temperature object 
;; (e.g. temperature in Fahrenheit)
(defn -getValue [this]

  ;; The get function allows us to pull the 
  ;; key `value` from the state variable
  (get @(. this state) :value))

Just like in Java, we need to declare our namespace (package) and class imports that exist in this Clojure file's scope. Note the use of vectors for the org.drools and org.drools.builder namespaces; we can use this 'syntax sugar' to include multiple classes from a namespace on an import (without the need for 'greedy' inclusion).

(ns com.berico.rules.RuleRunner
  (:gen-class
   :require ['[
     [org.drools KnowledgeBase 
                 KnowledgeBaseFactory]
     [org.drools.builder KnowledgeBuilder 
                         KnowledgeBuilderError 
                         KnowledgeBuilderErrors 
                         KnowledgeBuilderFactory]
      org.drools.builder.ResourceType 
      org.drools.io.ResourceFactory 
      org.drools.runtime.StatefulKnowledgeSession]]))

The following is a function to construct the KnowledgeSession we will need for inserting facts and running the Weather rules.

* Clojure does not support the "aliasing" of Java classes [yet] (though, it is possible for Clojure namespaces).  This is the reason why we are using fully qualified class names in the code.


;; Construct a Knowledge Session 
;; from the 'WeatherRules.drl' definition.
(defn build-knowledge-session [] 

  ;; Create the KnowledgeBuilder object
  (def knowledge-builder 
    (org.drools.builder.KnowledgeBuilderFactory/newKnowledgeBuilder))

  ;; Adding the Weather Rules definition 
  ;; to the KnowledgeBuilder
  (.add knowledge-builder
    (org.drools.io.ResourceFactory/newClassPathResource 
       "WeatherRules.drl")
     org.drools.builder.ResourceType/DRL)

  ;; Creating the KnowledgeBase
  (def knowledge-base 
    (org.drools.KnowledgeBaseFactory/newKnowledgeBase))

  ;; Adding Knowledge Packages to the KnowledgeBase
  (.addKnowledgePackages knowledge-base 
    (. knowledge-builder getKnowledgePackages))

  ;; Create a Stateful Knowledge Session 
  ;; and return it to the calling function
  (. knowledge-base newStatefulKnowledgeSession))

And finally, the entry point of our application. We begin by creating the KnowledgeSession. The next two statements create Temperature objects, initializing their values to 100 and 20 degrees Fahrenheit, respectively. Finally, using the "doto" method, we call 3 functions on the KnowledgeSession in sequence, inserting the Temperatures as facts and finally firing all rules.

;; Start the Application
(defn app-start []

    ;; Create the Knowledge Session
    (def knowledge-session (build-knowledge-session))

    ;; Create a Temperature fact
    (def shouldBeTooHot 
      (new com.berico.model.weather.Temperature 100))

    ;; Create another Temperature fact
    (def shouldBeTooCold 
      (new com.berico.model.weather.Temperature 20))

    ;; Insert the facts & fire all rules
    (doto knowledge-session 
      (.insert shouldBeTooHot) 
      (.insert shouldBeTooCold) 
      (.fireAllRules)))

I'm running this application using the Clojure REPL. You initiate the application using the "app-start" method. Here's the output from the console:

Clojure 1.2.0
1:1 user=> #<Namespace com.berico.rules.RuleRunner>
1:2 com.berico.rules.RuleRunner=> (app-start)
20.0 F is too cold. Declare FREEZE WARNING!
100.0 F is too hot. Declare HEAT WARNING!
#<StatefulKnowledgeSessionImpl 
    org.drools.impl.StatefulKnowledgeSessionImpl@3ab28980>
1:3 com.berico.rules.RuleRunner=>

As you can see, developers are not limited to Java when using the JBoss Rules framework.  I hope you found this post helpful, but more importantly encouraging (you are not limited to Java on the JVM!). 

Finally, I wanted to thank Mark and the rest of the Drools team for recognizing my Dec 11th post.  You guys doubled my all time page views in a day (and it was a holiday)!  More importantly, thanks for your extremely impressive framework.

Richard Clayton 

References:

[1]. http://en.wikipedia.org/wiki/Clojure
[2]. http://clojure.org/compilation

Resources:

I personally own, and recommend, the following books on Clojure and JBoss Rules:

Friday, December 17, 2010

Task Parallel Library: Concurrency made easy in .NET

The .NET 4.0 Framework brought many notable features, but the highlight of the release was definitely Task Parallel Library (TPL).

Concurrency is an issue many of us have had to tackle at one point or another.  The way concurrency is handled in many frameworks feels very unnatural.  Typically, these frameworks require the use of classes with a parameterless function that does not return any value.  Execution of this function is often handled on another thread, which in a multi-core/processor environment could potentially be ran on a separate processor.  Synchronizing the state of applications using these special "Runnable" (Java) classes becomes very difficult, and like so many other things in these languages, requires more scaffolding to accomplish than really should be needed.

Concurrency is such an issue, many languages have been designed to addressed this issue.  Scala and Erlang are specific examples, designed around an "actor" model, where state is transferred using message-passing.  Clojure takes a different approach by using "Software Transactional Memory", enforcing an ACID model on updating state.  These approaches to concurrency are extremely compelling alternatives to the traditional thread model, but do us no good when we are creating software on highly-adopted languages like Java and C#.

Task Parallel Library is Microsoft's answer to the concurrency issue.  TPL makes the creation of highly-concurrent applications trivial, leaving the engineer to focus on modelling the domain and implementing business logic.  However, to understand the importance of the new framework, it's important to understand where we were at prior to TPL.

.NET's model for concurrency has morphed over time, we can definitely see that Microsoft has learned from the inadequacies of Java's thread model.  .NET 1.1 supported the passing of a static function to the Thread.Start() method by wrapping the method in a ThreadStart object.

using System;
using System.Threading;

public class ThreadExample {

    public static void ThreadProc() {
        for (int i = 0; i < 10; i++) {
            Console.WriteLine("ThreadProc: {0}", i);
            Thread.Sleep(0);
        }
    }

    public static void Main() {
        Thread t = new Thread(
           new ThreadStart(ThreadProc));
        t.Start();
    }
}
Excerpt from MSDN (System.Threading.Thread)


While we don't have to implement a special interface to achieve asynchonicity, the code is still pretty ugly.  This demo does not even demonstrate the trouble one would encounter trying to pull resulting data from the operation, and it certainly doesn't show a transfer of state from the main class to the wrapped function.


TPL answers the issue of concurrency by adding two distinct features to the .NET Framework:  Task Parallelism and Data ParallelismTask Parallelism addresses the execution of independent tasks in parallel, while Data Parallelism focuses on the use of parallel operations against a collection.  To support parallelism, the .NET Framework introduces a new set of concurrent collections:

System.Collections.Concurrent.BlockingCollection<T>
System.Collections.Concurrent.ConcurrentBag<T>
System.Collections.Concurrent.ConcurrentDictionary<TKey, TValue>
System.Collections.Concurrent.ConcurrentQueue<T>
System.Collections.Concurrent.ConcurrentStack<T>

Concurrent collections are excellent for working on a collection type from within a parallel loop or set of tasks.  Much of what we want done tends to be some sort of mapping of values to a distilled set, possibly followed by a reduction into a summary variable.  Most of the core .NET collections are not thread safe.  Even if you know there will be no conflicts in your data (e.g.: inserting on a single List collection from multiple threads where every key is unique), you will encounter threading issues.

Perhaps the most pervasive issue is the IndexOutOfRangeException which occurs when multiple threads attempt to insert on a non-thread-safe collection.  The real cause of the problem is that the List class must preallocate space for the collection, which is backed by an array of that type.  Gabe talks about the underlying reason behind the error on Stack Overflow:  http://stackoverflow.com/questions/4007248/index-out-of-range-exception-when-using-parallel-for-loop.

Let start by introducing Data Parallelism, which encompasses Parallel For and ForEach loops, as well as, PLINQ (Parallel LINQ).

* Both features in TPL can be found in the System.Threading.Tasks namespace.

First, let's look at the Parallel.For loop.  In this demonstration, we are going to add a million Guids to a ConcurrentBag in Parallel:

//Our concurrent collection
ConcurrentBag<guid> bagOfGuids 
   = new ConcurrentBag<guid>();

//Parallel For's signature in this example is:
// (int startingInt, 
//  int endingInt, 
//  Func(int iteration) action)
Parallel.For(1, 1000000, 
   (i) => {
      //Every ten thousandth key let's output a
      //status message.
      Debug.WriteLineIf((i % 10000) == 0, 
         string.Format("{0} Keys Created.", i));
      //Add the Guid to the collection
      bagOfGuids.Add(Guid.NewGuid());
   });

And here is the output. Notice how the messages are out of order? This is happening because the messages are being outputted by different thread (some finishing faster than others).

10000 Keys Created.
20000 Keys Created.
500000 Keys Created.
510000 Keys Created.
520000 Keys Created.
530000 Keys Created.
30000 Keys Created.
40000 Keys Created.
540000 Keys Created.
50000 Keys Created.
60000 Keys Created.
70000 Keys Created.
550000 Keys Created.
80000 Keys Created.
90000 Keys Created.
100000 Keys Created.
560000 Keys Created.
570000 Keys Created.
120000 Keys Created.
130000 Keys Created.
140000 Keys Created.
580000 Keys Created.
110000 Keys Created.
160000 Keys Created.
150000 Keys Created.
170000 Keys Created.
590000 Keys Created.
600000 Keys Created.
610000 Keys Created.
180000 Keys Created.
620000 Keys Created.
200000 Keys Created.
210000 Keys Created.
630000 Keys Created.
640000 Keys Created.
650000 Keys Created.
220000 Keys Created.
230000 Keys Created.
190000 Keys Created.
240000 Keys Created.
660000 Keys Created.
670000 Keys Created.
680000 Keys Created.
250000 Keys Created.
690000 Keys Created.
700000 Keys Created.
710000 Keys Created.
260000 Keys Created.
720000 Keys Created.
270000 Keys Created.
290000 Keys Created.
730000 Keys Created.
300000 Keys Created.
280000 Keys Created.
320000 Keys Created.
740000 Keys Created.
330000 Keys Created.
340000 Keys Created.
750000 Keys Created.
350000 Keys Created.
360000 Keys Created.
760000 Keys Created.
380000 Keys Created.
370000 Keys Created.
770000 Keys Created.
390000 Keys Created.
400000 Keys Created.
780000 Keys Created.
410000 Keys Created.
420000 Keys Created.
430000 Keys Created.
440000 Keys Created.
790000 Keys Created.
450000 Keys Created.
460000 Keys Created.
470000 Keys Created.
480000 Keys Created.
800000 Keys Created.
490000 Keys Created.
810000 Keys Created.
820000 Keys Created.
830000 Keys Created.
840000 Keys Created.
860000 Keys Created.
850000 Keys Created.
870000 Keys Created.
880000 Keys Created.
890000 Keys Created.
900000 Keys Created.
910000 Keys Created.
920000 Keys Created.
930000 Keys Created.
940000 Keys Created.
950000 Keys Created.
960000 Keys Created.
310000 Keys Created.
970000 Keys Created.
990000 Keys Created.
980000 Keys Created.

Parallel ForEach Loops work a lot similar, but obviously require some IEnumberable<T> collection as its iteration source. In the next demo, we will partition the ConcurrentBag of 1,000,000 Guids using a consistent hashing strategy I use on Redis (the remainder of first byte of the Guid divided by the desired number of partitions [very effective!]).


//Create a concurrent dictionary to store the
//partitions
ConcurrentDictionary<int, ConcurrentBag<Guid>> partitions 
   = new ConcurrentDictionary<int, ConcurrentBag<Guid>>();

//Iterate in parallel over the previous collection
Parallel.ForEach(bagOfGuids,
  (guid) =>
   {
      //This is my key partitioning strategy
      //creating 20 partitions of keys.
      int index = guid.ToByteArray()[0] % 20;
      //Try adding the partition to the dictionary
      //this will fail if the key already exists
      //*Note: Using if(!dictionary.ContainsKey(...)) 
      //does not work very well in a concurrent environment.
      //This is why the concurrent dictionary has the
      //TryAdd function.
      partitions.TryAdd(index, new ConcurrentBag<Guid>());
      //Add the key to the appropriate partition
      partitions[index].Add(guid);
   });

//Here's a little LINQ magic to show you how the keys
//were distributed in the bag.
partitions.ToList().ForEach(
   (kvp) => { 
      Debug.WriteLine("Partition {0} has {1} keys.", 
         kvp.Key, kvp.Value.Count); 
   });

Partition 0 has 50786 keys.
Partition 1 has 50835 keys.
Partition 2 has 50760 keys.
Partition 3 has 50591 keys.
Partition 4 has 50365 keys.
Partition 5 has 51199 keys.
Partition 6 has 51138 keys.
Partition 7 has 50950 keys.
Partition 8 has 50912 keys.
Partition 9 has 50891 keys.
Partition 10 has 50655 keys.
Partition 11 has 50723 keys.
Partition 12 has 50564 keys.
Partition 13 has 50743 keys.
Partition 14 has 50803 keys.
Partition 15 has 50418 keys.
Partition 16 has 46620 keys.
Partition 17 has 47033 keys.
Partition 18 has 46961 keys.
Partition 19 has 47052 keys.

In addition to traditional Data Parallelism, TPL also includes parallel extensions to LINQ. Here's a little taste as to how parallelism can be applied to LINQ:

IEnumerable<Guid> partitionZero =
   //Note the use of AsParallel() on the
   //source collection
   from guid in bagOfGuids.AsParallel() 
   where ((guid.ToByteArray()[0] % 20) == 0) 
   select guid;

Debug.WriteLine("There are {0} keys in partition zero.", 
   partitionZero.Count());

Here is the output from the code above. You will notice that we get the same count as we did in the previous demo.

There are 50786 keys in partition zero.

Now that we have seen how parallelism can be applied to data, let's look at how we can use TPL to address problems traditionally handled by the Thread class. Let's demonstrate the basics of Task and Task.Factory objects which form the foundation of Task Parallelism in .NET 4.0:

//Store a collection of concurrent tasks.
//You can just as easily create a task without
//keeping a reference of it, but you will
//learn later that their are benefits to
//remembering these tasks.
Task[] tasks = new Task[]{

  //Create the first parallel task
   Task.Factory.StartNew(()=> { 

      for(int i = 0; i < 1000000; i++){
         Debug.WriteLineIf((i % 50000) == 0, 
            string.Format(
            "Hit {0} iterations on the first task.", i));
      }

   }),

   //Create the second parallel task
   Task.Factory.StartNew(()=> {

      for (int i = 0; i < 1000000; i++){
         Debug.WriteLineIf((i % 50000) == 0, 
            string.Format(
            "Hit {0} iterations on the second task.", i));
      }

   }),

   //Create the third parallel task
   Task.Factory.StartNew(()=> {

         for (int i = 0; i < 1000000; i++)
         {
         Debug.WriteLineIf((i % 50000) == 0, 
            string.Format(
            "Hit {0} iterations on the third task.", i));
         }

    })
};

And the output of the three concurrent tasks:

Hit 0 iterations on the first task.
Hit 0 iterations on the second task.
Hit 50000 iterations on the first task.
Hit 50000 iterations on the second task.
Hit 100000 iterations on the first task.
Hit 100000 iterations on the second task.
Hit 150000 iterations on the first task.
Hit 150000 iterations on the second task.
Hit 200000 iterations on the first task.
Hit 200000 iterations on the second task.
Hit 250000 iterations on the first task.
Hit 250000 iterations on the second task.
Hit 300000 iterations on the first task.
Hit 300000 iterations on the second task.
Hit 350000 iterations on the first task.
Hit 350000 iterations on the second task.
Hit 400000 iterations on the first task.
Hit 400000 iterations on the second task.
Hit 450000 iterations on the first task.
Hit 450000 iterations on the second task.
Hit 500000 iterations on the first task.
Hit 500000 iterations on the second task.
Hit 550000 iterations on the first task.
Hit 550000 iterations on the second task.
Hit 600000 iterations on the first task.
Hit 600000 iterations on the second task.
Hit 650000 iterations on the first task.
Hit 650000 iterations on the second task.
Hit 700000 iterations on the first task.
Hit 700000 iterations on the second task.
Hit 750000 iterations on the first task.
Hit 750000 iterations on the second task.
Hit 800000 iterations on the first task.
Hit 800000 iterations on the second task.
Hit 850000 iterations on the first task.
Hit 850000 iterations on the second task.
Hit 900000 iterations on the first task.
Hit 900000 iterations on the second task.
Hit 950000 iterations on the first task.
Hit 950000 iterations on the second task.
Hit 0 iterations on the third task.
Hit 50000 iterations on the third task.
Hit 100000 iterations on the third task.
Hit 150000 iterations on the third task.
Hit 200000 iterations on the third task.
Hit 250000 iterations on the third task.
Hit 300000 iterations on the third task.
Hit 350000 iterations on the third task.
Hit 400000 iterations on the third task.
Hit 450000 iterations on the third task.
Hit 500000 iterations on the third task.
Hit 550000 iterations on the third task.
Hit 600000 iterations on the third task.
Hit 650000 iterations on the third task.
Hit 700000 iterations on the third task.
Hit 750000 iterations on the third task.
Hit 800000 iterations on the third task.
Hit 850000 iterations on the third task.
Hit 900000 iterations on the third task.
Hit 950000 iterations on the third task.

Sometimes it's necessary to wait for all of the concurrent tasks to complete before regular execution can continue. TPL provides this functionality using the Task.WaitAll(params Task[] tasks) function. You will see an demonstration of WaitAll in the next example; I have staggered the number of operations with the intention of the first two tasks finishing long before the third.

Task[] waitForAllTasks = new Task[]{
   Task.Factory.StartNew(()=> { 
      for(int i = 0; i < 1000; i++){
         Debug.WriteLineIf((i % 100) == 0, 
          string.Format(
           "Hit {0} iterations on the first task.", i));
      }
   }),
   Task.Factory.StartNew(()=> {
      for (int i = 0; i < 10000; i++)
      {
         Debug.WriteLineIf((i % 1000) == 0, 
          string.Format(
           "Hit {0} iterations on the second task.", i));
      }
   }),
   Task.Factory.StartNew(()=> {
      for (int i = 0; i < 100000; i++)
      {
         Debug.WriteLineIf((i % 10000) == 0, 
          string.Format(
           "Hit {0} iterations on the third task.", i));
      }
   })
};

//This will be printed before all tasks are finished
Debug.WriteLine("Waiting for all tasks to finish");

//Block until all concurrent tasks have finished.
Task.WaitAll(waitForAllTasks);

//This should be printed after all tasks have finished
Debug.WriteLine("Finished all tasks");

Here is the output:

Hit 0 iterations on the first task.
Hit 0 iterations on the second task.
Hit 100 iterations on the first task.
Hit 1000 iterations on the second task.
Hit 200 iterations on the first task.
Hit 2000 iterations on the second task.
Hit 300 iterations on the first task.
Hit 3000 iterations on the second task.
Hit 400 iterations on the first task.
Hit 4000 iterations on the second task.
Hit 500 iterations on the first task.
Hit 5000 iterations on the second task.
Hit 600 iterations on the first task.
Hit 6000 iterations on the second task.
Hit 700 iterations on the first task.
Hit 7000 iterations on the second task.
Hit 800 iterations on the first task.
Hit 8000 iterations on the second task.
Hit 900 iterations on the first task.
Hit 9000 iterations on the second task.
Hit 0 iterations on the third task.
Hit 10000 iterations on the third task.
Hit 20000 iterations on the third task.
Waiting for all tasks to finish
Hit 30000 iterations on the third task.
Hit 40000 iterations on the third task.
Hit 50000 iterations on the third task.
Hit 60000 iterations on the third task.
Hit 70000 iterations on the third task.
Hit 80000 iterations on the third task.
Hit 90000 iterations on the third task.
Finished all tasks

Sometimes we don't care if all tasks have finished before continuing synchronous execution. If we only need a subset of those tasks to finish, we can use the Task.WaitAny(params Task[] tasks) method to force the main thread to wait until the input tasks have exited.

Task[] notWaitingForLongRunningTask = new Task[]{
   Task.Factory.StartNew(()=> { 
      for(int i = 0; i < 1000; i++){
         Debug.WriteLineIf((i % 100) == 0, 
          string.Format(
           "Hit {0} iterations on the first task.", i));
      }
   }),
   Task.Factory.StartNew(()=> {
      for (int i = 0; i < 10000000; i++)
      {
         Debug.WriteLineIf((i % 1000000) == 0, 
          string.Format(
           "Hit {0} iterations on the long running task.", i));
      }
   })
};

//Wait only for the first task to finish
Task.WaitAny(notWaitingForLongRunningTask[0]);

Debug.WriteLine("Only waiting for the first task to finish.");

//Wait for all remaining tasks
Task.WaitAll(notWaitingForLongRunningTask);

Debug.WriteLine("All tasks finished.");

And the output...

Hit 0 iterations on the first task.
Hit 0 iterations on the long running task.
Hit 100 iterations on the first task.
Hit 200 iterations on the first task.
Hit 1000000 iterations on the long running task.
Hit 300 iterations on the first task.
Hit 400 iterations on the first task.
Hit 500 iterations on the first task.
Hit 600 iterations on the first task.
Hit 2000000 iterations on the long running task.
Hit 700 iterations on the first task.
Hit 800 iterations on the first task.
Hit 900 iterations on the first task.
Only waiting for the first task to finish.
Hit 3000000 iterations on the long running task.
Hit 4000000 iterations on the long running task.
Hit 5000000 iterations on the long running task.
Hit 6000000 iterations on the long running task.
Hit 7000000 iterations on the long running task.
Hit 8000000 iterations on the long running task.
Hit 9000000 iterations on the long running task.
All tasks finished.

My final demonstration is how to get data out of Parallel tasks. Fortunately, this is a lot simpler than using traditional thread models. We can use a special form of Task<T> to accomplish this functionality.

//We want Tasks that return a List of Guids
Task<List<Guid>>[] tasksReturningValues 
   = new Task<List<Guid>>[]{

   Task<List<Guid>>.Factory.StartNew(()=>{
      List<Guid> guids = new List<Guid>();
      for(int i = 0; i < 1000; i++){
         guids.Add(Guid.NewGuid());
      }
      //Your delegate must return the specified
      //template type
      return guids;
   }),

   Task<List<Guid>>.Factory.StartNew(()=>{
      List<Guid> guids = new List<Guid>();
      for(int i = 0; i < 1000; i++){
         guids.Add(Guid.NewGuid());
      }
      return guids;
   }),

};

//Let's wait for all tasks to finish
Task.WaitAll(tasksReturningValues);

//Create a list of Guids
List<Guid> returnedGuids = new List<Guid>();

//Add all Guids from the first task
returnedGuids.AddRange(tasksReturningValues[0].Result);

//Add all Guids from the second task
returnedGuids.AddRange(tasksReturningValues[1].Result);

//Display the count (we should have 2000)
Debug.WriteLine("I have {0} Guids in my bag.", 
  returnedGuids.Count);

And Finally...

I have 2000 Guids in my bag.

As you can see, Task Parallel Library allows Engineers to easy convert traditional synchronous applications into highly concurrent ones with little effort.

Richard

Sunday, December 12, 2010

Aspect-Oriented Programming in JBoss AOP: Method Interception

Aspect-Oriented Programming is a modern design paradigm that focuses on the separation of "unrelated concerns" from the implementation of a piece of functionality.  AOP is especially useful in Domain Driven Design because it allows engineers to focus on business logic without needing replicate tedious logging, security, and persistence related code (not that these are not important). 

My goal for this post is not to describe the merits or technical rationale of AOP.  If you do not understand the principles of AOP, I suggest search for the topic in Google or evaluating a number of the key AOP frameworks:

Java
  • AspectJ (utilized by the Spring Framework)
  • JBoss AOP
There are many others for Java, but these are the two dominant (in my opinion).

.NET
  • Spring.NET (similar to AspectJ)
  • Enterprise Library 3.0:  Policy Injection Application Block
  • PostSharp
  • Aspect#
  • LOOM.NET
There are also a number of books on the subject, but I think you will find the AspectJ in Action title from Manning excellent (don't worry about the framework, the principles are identical and API's very similar):  AspectJ in Action (2nd Ed):  http://www.manning.com/laddad2/.

AOP is an important concept because it is used by almost all of the important enterprise frameworks.  You typically find AOP in those frameworks that perform Dependency Injection;  Spring and J2EE serve as perfect examples.  In this demonstration, I'm going to use JBoss AOP.  JBoss AOP is an excellent framework, and is functionally equivalent to the better known AspectJ framework.  My rationale for choosing JBoss AOP over AspectJ is primarily due to JBoss AOP's integration on the JBoss Application Server (no need to add the libraries since they already exist).  This demo, however, will not require the use of a J2EE application server.

In this post, I'm going to demonstrate how to perform method interception using the JBoss AOP framework.  For those who don't know what Method Interception is, AOP frameworks allow engineers to intercept a method before or after it is executed.  Commonly used examples include logging/auditing and transactions.  What if there was a requirement to log the execution of a particular method, recording the time, user and outcome (operation successful?).  We also need to wrap the logic in a transaction, because the method Account.withdraw needs to be rolled back if it fails.

A traditional approach to this requirement might look something like this:

public bool withdrawMoney(Account account, 
     User user, Currency amountToWithdraw){

   if(account.hasSufficientFunds(amountToWithdraw)){
      
      Transaction transaction = GetTransaction();

      try {

        transaction.begin();

        account.withdraw(user, amountToWithdraw);

        logger.info("Customer %s tried successfully "
                + "attempted to withdraw %s from " 
                + "account %s at %s.", 
                account.getId(), user.getFullName(),
                amountToWithdraw.getDollarAmount(),
                (new Date()).toString());

      }  catch (Exception e){

        transaction.rollback();

        logger.info("Customer %s tried unsuccessfully "
                + "attempted to withdraw %s from " 
                + "account %s at %s.  "  
                + "Details: %s.",
                account.getId(), 
                user.getFullName(),
                amountToWithdraw.getDollarAmount(),
                (new Date()).toString()
                e.getMessage());

      } finally {

        transaction.close();
      }
      
      return transaction.wasSuccessful();
   } else {

     logger.info("Customer %s tried unsuccessfully "
                + "attempted to withdraw %s from " 
                + "account %s at %s.  "  
                + "Details: %s.",
                account.getId(), 
                user.getFullName(),
                amountToWithdraw.getDollarAmount(),
                (new Date()).toString()
                "Insufficient Funds.");

      return false;
   }
}

The AOP approach to this method is to remove the "logging" and "transaction" concerns from the business logic (withdrawing money). Let's take a look at the AOP version of withdrawMoney.

public bool withdrawMoney(Account account, 
     User user, Currency amountToWithdraw){

   if(account.hasSufficientFunds(amountToWithdraw)){

        account.withdraw(user, amountToWithdraw);

        return true;
   }

   return false;
}

What!? Clearly this method does not meet the requirement of logging and using a transaction on the Account.withdraw method. AOP allows us to intercept the withdrawMoney method, applying Advice to it before, after, or around its execution. Advice is the formal term for the code (typically a class) responsible for applying functionality onto a method call.

In our AOP example, we would add the following advice to the withdrawMoney method:

import org.jboss.aop.joinpoint.MethodInvocation;

public class LoggingAndTransactionAdvice {

   public Object invoke(MethodInvocation invocation) 
      throws Throwable {  
     
      Object returnValue;

      Transaction transaction = GetTransaction();

      try {

        transaction.begin();

        // Call the original method and collect
        // its return value
        returnValue = invocation.invokeNext();

        LogInfo(invocation.getMethod().getName(),
                 invocation.getArguments());

      }  catch (Exception e){

        transaction.rollback();

        LogError(invocation.getMethod().getName(),
                 e.getMessage(),
                 invocation.getArguments());

      } finally {

        transaction.close();
      }
      
      // return the method's original value
      return returnValue;
   }


   private void LogInfo(String template, 
           Object[] methodParameters){

       //Get the correct template (method name) 
       //and apply the parameters
   }

   private void LogError(String template, 
           String errorMessage, Object[] methodParameters){

       //Get the correct template (method name) 
       //and apply the error message and parameters
   }
 
}

Under the hood, the AOP framework is going to generate proxy objects that wrap your object with another, creating chains of interceptors that each get a chance at applying their functionality to your method call. The previous example is actually not appropriate because we probably would not want to mix transactions and logging in the same Advice. You should probably thinking right now, "how do we proxy the advice onto the method?" That's an excellent question. JBoss AOP allows us to define the methods we want to apply advice to by a number of properties: classes, types, packages, arguments of methods, etc. In my opinion, the best way to apply this advice is through XML configuration, however, you can also do this via Annotation. As a side note, if you use Annotation, this class is commonly called an Aspect, not an Advice because it contains the pointcut description with its Advice. Wait a minute, what is a "pointcut"!?

A pointcut is a place where we need to separate concerns. In the case of the previous example, it was "around" the withdrawMoney method. Sometimes, you may not want to entirely wrap a method. Pointcuts can also be "Before" or "After" method execution, or during "Execeptional" circumstances that occur within a method. In JBoss AOP, we can define Pointcuts in XML:

<?xml version="1.0" encoding="UTF-8"?>
<aop>
   <aspect 
      class=
      "com.berico.aop.LoggingAndTransactionAdvice" 
      scope="PER_VM" />
   <bind 
      pointcut=
      "execution(* com.berico.banking.->*(..))" >
      <around name="invoke" 
      aspect=
      "com.berico.aop.LoggingAndTransactionAdvice" />
   </bind>
</aop>

The pointcut expression, execution(* com.berico.banking.->*(..)) basically says we want to intercept execution on all methods in the com.berico.banking namespace, with any return type (the * at the beginning of the expression, and any argument signature (the two periods ".." in the parenthesis of the expression). This expression syntax gives engineers quite a bit of flexibility in applying AOP against any method or methods in their application.

The last thing that we need to do to get this hypothetical application to run is to apply an environment parameter to the Java run path:
-javaagent:/path/to/jboss-aop-2.1.8.GA/lib/jboss-aop.jar
-Djboss.aop.path=/path/to/pointcuts/jboss-aop.xml

That's it. Now we have AOP integrated into our application. Of course, the previous example was a non-compilable demonstration. Let me show you a real example of AOP, and introduce an interesting feature you may not have thought about regarding AOP.

What if I am working on a system, and I want logging or transaction support on an API I do not have the source for? AOP allows us to apply Advice onto classes in any library, not just the ones we have created. In this next example, I'm going to demonstrate apply Advice onto the JODA Time Library: http://joda-time.sourceforge.net/.

Consider the following application:
package com.berico.aop;

import org.joda.time.Chronology;
import org.joda.time.DateTime;

public class InterceptionDemo {

   public static void main(String... args){
  
      DateTime dt = new DateTime(
          Chronology.getGregorianUTC());

      int year = dt.getYear();

      System.out.println(
         String.format(
            "Welcome to the year %s.", year));
  
  }
 
}

Here is the output from the console:

CALLING METHOD getYear 
  ON org.joda.time.DateTime
CALLING METHOD getYear 
  ON org.joda.time.chrono.GregorianChronology
METHOD HAS RETURNED VALUE: 2010
METHOD HAS RETURNED VALUE: 2010
Welcome to the year 2010.

Obviously, there's a lot more information than was defined in the main method of the InterceptionDemo class.  Let's look at the Advice producing this extra information:

package com.berico.aop;

import org.jboss.aop.joinpoint.MethodInvocation;

public class LoggingAspect {

   public Object invoke(MethodInvocation invocation) 
         throws Throwable {
      
      System.out.println(String.format(
         "CALLING METHOD %s \n ON %s", 
         invocation.getMethod().getName(), 
         invocation.getTargetObject()
                   .getClass().getName()));

      Object returnValue = invocation.invokeNext();

      System.out.println(
         String.format("METHOD HAS RETURNED VALUE: %s",
         (returnValue != null)? 
             returnValue : "No return value"));

      return returnValue;
   }
 
}

And finally, here is our pointcut definition using the JBoss AOP xml format.
<?xml version="1.0" encoding="UTF-8"?>
<aop>
   <aspect 
   class="com.berico.aop.LoggingAspect" 
   scope="PER_VM" />
   <bind 
   pointcut=
     "execution(* org.joda.time.*->getYear(..))" >
   <around 
      name="invoke" 
      aspect="com.berico.aop.LoggingAspect" />
   </bind>
</aop>

*One thing to note is that this piece of advice actually captures two methods from different classes in the Joda Time framework!

AOP offers engineers considerable flexibility, while separating unnecessary concerns from the business logic of their application. AOP can also be used to enhance the functionality of external code without requiring that code to be modified in anyway. JBoss AOP is an incredibly useful framework, and in a subsequent post, I will use it to demonstrate a more powerful capability of AOP called "introductions".

Rich

Resources

Saturday, December 11, 2010

Using JBoss Rules (Drools) in Scala


The JBoss Rules Framework, more often referred to as "Drools", is an excellent tool for combining Rules, Workflow, and Events into what the framework authors call "the Business Logic Integration Platform" (BLIP) [1].  We use Drools extensively on our project; the framework enabled us to quickly construct a rich Event Driven Architecture (SOA 2.0) in a very short time frame.  In fact, we like Drools so much, we've been considering using it in the realm of NLP as a way for applying rules to Entity Extraction and Resolution (I should mention this was originally John's idea).

The downside to Drools is that the framework is written in Java.  Sure, Mark Proctor and gang have put a lot effort in making the platform accessible to other languages through web services, but this is far from the level of integration I would enjoy (like a C# version).  I think I should take the time an caveat the fact that I consider myself to be a Java programmer (first and foremost) and then a C# developer, but after a couple of solid weeks in .NET, I find it difficult to go back.  The one thing I do love about Java (more than anything else) is the language's amazing Open Source community.  So instead of embracing the "Dark Side" and becoming a Microsoft evangelist, I have decided to look into the newer languages written for the JVM.

The two JVM languages I find the most intriguing are Clojure and Scala.  Both offer a number of benefits over Java, especially in the area of concurrency (probably one of the most important features for my line of work).  Of course, both also have their disadvantages.  Clojure has proven very difficult to learn, at least for me, since I do not have a solid LISP background.  Scala feels much more natural,  after a couple of chapters of Martin Odersky's book on the language [2] I was writing some minor applications, but it does have some interoperability issues with Java.   For instance, Scala cannot access static properties of Java objects [3].

Since I like both languages, I wanted to see how compatible they were with Drools.  In this post (sorry for the long introduction), I will demonstrate how Scala can be used to not only run the Drools engine, but also be used to define domain objects that can be used within the engine.

I was a meteorologist in another life, so I'm going to draw from that domain.  Let me start by introducing our one, and only, domain model object "Temperature".

package com.berico.model.weather

trait Temperature {
 def value : Double
}

I admit that this is a terrible example of a domain object because I am not constraining the "value" to a specific temperature unit.

For the purpose of this demo, we are going to write some rules (in Drools Expert) that will evaluate this temperature to determine if we need to declare a "heat warning" or a "freeze warning".  The Rule Definition (a text file called "WeatherRules.drl" is listed below.

package com.berico.rules
 
import com.berico.model.weather.Temperature

rule "Too Hot"
   dialect "mvel"
   when
      temp : Temperature( value > 85 )
   then
      System.out.println(
        temp.value.toString() + " F is too hot."
        + " Declare HEAT WARNING!"); 
end

rule "Too Cold"
   dialect "mvel"
   when
      temp : Temperature( value < 32 )
   then
      System.out.println(
         temp.value.toString() + " F is too cold."
         + " Declare FREEZE WARNING!"); 
end

We have now established that if a temperature greater than 85 degrees, or less than 32, is inserted into the Drool Knowledge Session, we should see an message declaring a HEAT WARNING or FREEZE WARNING, respectively.

The next step is to load the rule file (WeatherRules.drl) into the Drools runtime.  We'll wrap this logic into a function that returns the StatefulKnowledgeSession, which is primary object you use to insert facts into and run rules against in the Drools API.

def GetKnowledgeSession() : 
          StatefulKnowledgeSession = {
   
   // Get a new knowledge builder instance
   var kbuilder : KnowledgeBuilder 
      = KnowledgeBuilderFactory
          .newKnowledgeBuilder()

   // Add our ruleset (which will be parsed 
   // and compiled) to the knowledge builder
   kbuilder.add(ResourceFactory
      .newClassPathResource("WeatherRules.drl"),
         ResourceType.DRL)

   // Create a new knowledgebase
   var kbase : KnowledgeBase 
     = KnowledgeBaseFactory.newKnowledgeBase()

   // Add the compiled rule sets and workflows 
   // into the knowledgebase
   kbase.addKnowledgePackages(
      kbuilder.getKnowledgePackages())

   // Create the knowledge session
   var ksession : StatefulKnowledgeSession 
      = kbase.newStatefulKnowledgeSession()

   // If you want to see what's going on within
   // the Drools Engine just uncomment this line
   //var logger : KnowledgeRuntimeLogger = 
   //  KnowledgeRuntimeLoggerFactory
   //    .newConsoleLogger(ksession)

   // Return the knowledge session
   ksession
} 

Finally, we will retrieve the StatefulKnowledgeSession, insert two Temperature objects (called facts in Drools) and fire the ruleset.

def main(args : Array[String]) : Unit = {

   println("Creating Knowledge Session")
   
   // Retrieve the knowledge session
   var ksession : StatefulKnowledgeSession 
      = GetKnowledgeSession()
    
   println("Creating and insertng Temperature")
    
   // Here is our first temperature object
   val shouldBeTooHot = new Temperature {
      def value = 100 
   }
    
   // Here is our second temperature object
   val shouldBeTooCold = new Temperature {
      def value = 20
   }
    
   // Insert the temperatures into the 
   // knowledge session
   ksession.insert(shouldBeTooHot)
   ksession.insert(shouldBeTooCold)
    
   println("Firing all rules")
    
   // Fire the rules
   ksession.fireAllRules()
}

That's it. The output at the console should read (copied this from Eclipse):

Creating Knowledge Session
Creating and insertng Temperature
Firing all rules
20.0 F is too cold. Declare FREEZE WARNING!
100.0 F is too hot. Declare HEAT WARNING!

As you can see, it is completely possible to use Scala in conjunction with JBoss Rules, and not necessarily as a second-class citizen.

In a later article, I hope to demonstrate the exact same tutorial in Clojure.

Rich

Resources:

Eclipse Project  http://dl.dropbox.com/u/12311372/DroolsScala.zip

References:

[1].  http://www.jboss.org/drools
[2].  http://www.artima.com/shop/programming_in_scala
[3].  http://www.scala-lang.org/faq/4

Sunday, December 5, 2010

Calculating Similarity (Part 1): Cosine Similarity

(Dot Product, 2010).

I once worked on a project that was attempting to fix OCR errors by comparing the similarity of the erroneous string to potential candidate strings. In the process, I did a lot of research on string comparison, especially in the area of similarity. I came across, and eventually implemented, a number of mathematical formulas for determining similarity and want to share these techniques with you. The first formula I'm going to demonstrate for you is "cosine similarity".


Cosine similarity is literally the angular difference between two vectors.  Cosine Similarity is expressed by the formula:

(Cosine Similarity, 2010)





For those of you allergic that don't speak "mad mathematician" (like me), to calculate the cosine similarity, we need to:

  1. Take the dot product of vectors A and B.
  2. Calculate the magnitude of Vector A.
  3. Calculate the magnitude of Vector B.
  4. Multiple the magnitudes of A and B.
  5. Divide the dot product of A and B by the product of the magnitudes of A and B.
The result of this calculation will always be a value between 0 and 1, where 0 means 0% similar, and the 1 means 100% similar.  Pretty convenient, huh?

There is still a problem.  We need to figure out a way to represent a string as an vector of numbers, since we need to calculate the dot product and magnitude of two vectors.

Let's eliminate the first question that might arise: What is a vector? 


A vector is a geometric structure with both length (magnitude) and direction.


On a 2-dimensional Cartesian coordinate plane, you might see a vector like (2, 1), which indicate a line starting at the origin (0, 0) and extending to 2 units "right" on the x-axis and 1 unit "up" on the y-axis.  Adding a 3rd dimension to a vector is easy, simply add another point to the vector: (2, 1, 3).  In the previous example, we've added a value of "3" to the z-axis.


We can perform mathematical operations on vectors like we can scalar values.  Vectors can support standard mathematical operations (addition, multiplication, division) if and only if the vectors the operations are being performed on have an equal number of dimensions (you will see that this is what causes the VectorMathException in the code submitted with this post).


Vectors also feature a couple of special operations, which we will be using for the purposes of this demonstration:  magnitude and dot product.


For those of you who have "data-dumped" the magnitude and dot product operations of vectors, here is a short reminder:


Magnitude - literally the length of a vector.

(Magnitude, 2010).





Here the example on Wikipedia: "the magnitude of [4, 5, 6] is √(42 + 52 + 62) = √77 or about 8.775" (Magnitude, 2010).


Dot Product - the inner product of two vectors.

(Dot Product, 2010).





Here is an example of the Dot Product of two vectors:

(Dot Product, 2010).




Now that we have gotten through the obligatory mathematics necessary in understanding how this algorithm works, lets see how we can use it to compare the similarity of strings.

The first task we have is to convert the strings into vectors.  We say before that we can represent three dimensional Cartesian coordinate space by three variables: x, y, and z.  We can do the same thing with a string by representing each character 'a', 'b', 'c', etc. as a dimension.  If we impose the limitation that we are going to only use lower-case letters as the possible dimensions of our vector, we are left with 26 dimensions.

If we store only the occurrence of a letter in the vector, we have a binary occurrence vector.  If we store the frequency in which a letter occurs in the vector, we have a frequency of occurrence vector.

Consider the term "apple":

Dimension keys in order
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z

Binary Occurrence Vector of "apple"
1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0

Frequency of Occurrence Vector of "apple"
1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0

Of course, there are much easier ways to deal with strings than storing all of those zeros.  When comparing strings, we can simply take the "union" of dimensions in both strings, reducing the number of dimensions we need to calculate.

Taking the union of the strings "apple" and "applet", we get the vector: (a, e, l, p, t).

For this demonstration, we are going to use frequency of occurrence vectors since they more accurately measure similarity.  Keep in mind, if you don't care how often a dimension occurs, the binary occurrence vector can be used for this formula.

Converting our previous words "apple" and "applet" into frequency of occurrence vectors:

apple = (1, 1, 1, 2, 0)    applet = (1, 1, 1, 2, 1)

We can now perform the cosine similarity calculation:
  1. Dot Product = (1 * 1) + (1 * 1) + (1 * 1) + (2 * 2) + (0 * 1) = 1 + 1 + 1 + 4 + 0 = 7
  2. Magnitude of apple = √(12 + 12 + 12 + 22 + 02) = √(1 + 1 + 1 + 4 + 0) = √(7)
  3. Magnitude of applet = √(12 + 12 + 12 + 22 + 12) = √(1 + 1 + 1 + 4 + 1) = √(8)
  4. Products of magnitudes A & B =  √(7) * √(8) = √(56) = 7.48331477
  5. Divide the dot product of A & B by the product of magnitude of A & B = 7 / 7.48331477 = .935414347 (or about 94% similar).
Very cool.  Now we have a framework for mathematically comparing strings.  As you can see, there's not a whole lot of programming required to implement this functionality.  We simply need to convert our strings to vectors, take the union of those vectors to create a shared dimensional space, and then apply the equation to the resulting frequency of occurrence vectors.

Here's the link to the Eclipse Project: StringSimilarity.zip

Here's the signature of the API I've created to calculate similarity:

public static void main(String[] args) {  
   ISimilarityCalculator similarity = 
      new CosineSimilarity();
   String one = "apple";
   String two = "applet";
   double percentageSimilar = 
      similarity.calculate(one, two) * 100;
   System.out.println(
      String.format("%s and %s are %s%% similar", 
         one, two, percentageSimilar));
}
And here is the output from the command line:

apple and applet are 93.54143466934852% similar
As you can see, we're getting results similar to our manually calculated similarity.

Going back through the steps, here is how I've accomplished this algorithm.

Convert string to vector:

/**
 * Convert a string to a set of characters
 * @param string input string
 * @return set of characters
 */
public static Collection<Character> 
               stringToCharacterSet(String string){

   Collection<Character> charSet = 
      new HashSet<Character>();
   for(Character character : string.toCharArray()){
      charSet.add(character);
   }
   return charSet;
}

Take the union of two character vectors:
/**
 * Get the union of characters from two strings, 
 * the set of characters are sorted alphabetically.
 * @param string1 First String
 * @param string2 Second String
 * @return the unique set of characters 
 *         occurring in both strings
 */
public static Collection<Character> 
                union(String string1, String string2){
 
   Collection<Character> mergedVector = 
      new TreeSet<Character>();
   mergedVector.addAll(stringToCharacterSet(string1));
   mergedVector.addAll(stringToCharacterSet(string2));
   return uniqueCharacters(mergedVector);
}

/**
 * Get the unique set of characters from a string.
 * @param vector input string vector
 * @return set of unique characters
 */
public static Collection<Character>  
       uniqueCharacters(Collection<Character> vector){
  
   Collection<Character> uniqueSet = 
      new HashSet<Character>();
   for(Character c : vector){
      if(!uniqueSet.contains(c)){
         uniqueSet.add(c);
      }
   }
   return uniqueSet;
}
Create a Frequency of Occurrence Vector:

/**
 * Get the Frequency of Occurrence Vector 
 * from the Union Set and target string
 * @param string Input String
 * @param unionSet Set of all Character-dimensions
 * @return Frequency of Occurrence Vector
 */
private static Collection<Integer> 
   createFrequencyOfOccurrenceVector(
     String string, Collection<Character> unionSet){
  
   Collection<Integer> occurrenceVector 
      = new Vector<Integer>();
   for(Character c : unionSet){
      occurrenceVector.add(
         (Integer)countCharacter(string, c));
   }
   return occurrenceVector;
}

/**
 * Count the number of times a character 
 * occurs in a string
 * @param string Input String
 * @param character Character to be counted
 * @return Frequency of Occurrence
 */
 private static int countCharacter(
   String string, Character character){

   int count = 0;
   for(Character c : string.toCharArray()){
      if(c.equals(character)){
         count++;
      }
   }
   return count;
}
Calculate the Dot Product:

/**
 * Calculate the Dot Product 
 * (inner product) of two vectors
 * @param vectorOne Vector
 * @param vectorTwo Vector
 * @return Dot Product
 * @throws VectorMathException Thrown if vectors 
 *         are not of equal length
 */
public static int dotp(
   Integer[] vectorOne, Integer[] vectorTwo) 
   throws VectorMathException {

   if(vectorOne.length != vectorTwo.length){
      throw new VectorMathException(
         "Input Vectors do not have the" + 
         "same number of dimensions.");
   }
 
   int dotProduct = 0;
   for(int i = 0; i < vectorOne.length; i++){
      dotProduct += (vectorOne[i] * vectorTwo[i]);
   }
   return dotProduct; 
}

Calculate the Magnitude:
/**
 * Calculate the Magnitude of a vector
 * @param vector Vector
 * @return Magnitude of the Vector
 */
public static double magnitude(Integer[] vector){
   double magnitude = 0;
   for(int i = 0; i < vector.length; i++){
      magnitude += Math.pow(vector[i], 2);
   }
   return Math.sqrt(magnitude);
}
And finally, here is the Cosine Similarity function:

/**
 * Calculate the similarity of two 
 * strings using Cosine Similarity
 * @param stringOne First input string
 * @param stringTwo Second input string
 * @return cosine of the two angles 
 *         (percentage of similarity)
 */
@Override
public double calculate(
   String stringOne, String stringTwo) {

   Collection<Character> unionOfStringsOneAndTwo = 
     union(stringOne, stringTwo);

   Collection<Integer> stringOneOccurrenceVector = 
     createFrequencyOfOccurrenceVector(
        stringOne, unionOfStringsOneAndTwo);

   Collection<Integer> stringTwoOccurrenceVector = 
     createFrequencyOfOccurrenceVector(
        stringTwo, unionOfStringsOneAndTwo);

   int dotProduct = 0;

   //This should be an unnecessary exception 
   //since we're submitting the union
   //of both strings
   try {

      dotProduct = dotp(
        stringOneOccurrenceVector, 
        stringTwoOccurrenceVector);

   } catch (VectorMathException e){
      e.printStackTrace();
      System.out.println(stringOneOccurrenceVector);
      System.out.println(stringTwoOccurrenceVector);
      return -2;
   }
  
   double vectorOneMagnitude = 
      magnitude(stringOneOccurrenceVector);
   double vectorTwoMagnitude = 
      magnitude(stringTwoOccurrenceVector);
   
   return dotProduct / 
           (vectorOneMagnitude * vectorTwoMagnitude);
}
That's it. The rest of the code can be found attached to this post (and that includes some initial unit tests. As a side note, I know the use of Java collections are probably pretty confusing. I really do not like the Java collection API, but did not want to add any new dependencies (like Apache Commons Collections or the Google Collections).

Conclusion.

The Cosine Similarity algorithm is useful for determining if the composition of two strings are similar, and does not take the order of the strings into account. There are better, though more computationally expensive algorithms for calculating a more accurate percentage of similarity. In later posts, I hope to introduce those algorithms and how they work.

Thank you for your time.

Richard

Resources

Eclipse Project:  StringSimilarity.zip

Works Cited


Wikipedia contributors. "Cosine similarity." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 3 Dec. 2010. Web. 5 Dec. 2010.

Wikipedia contributors. "Dot product." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 18 Nov. 2010. Web. 5 Dec. 2010.

Wikipedia contributors. "Magnitude (mathematics)." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 20 Nov. 2010. Web. 5 Dec. 2010.