Sunday, November 21, 2010

Performance Comparison: String split() vs substr() on Composite Keys

So I have a very interesting engineering problem. I have "buckets" of information, consider a bucket as a unique data source, and each bucket contains a collection of items. I'm going to use UUID's for the item ids (I want to be able to generate ids outside of the database), but I will still need to be able to associate each item to its corresponding bucket.

In code, an item might look like:
public interface Item {
   //This represents a UUID, but since item is a domain
   //object that we use in both .NET and Java, we keep it a
   //string value
   String getId();
   String getBucketId();
}

I need to have extremely fast access to this data, so I'm going to store Items in a cache cluster (I prefer Redis). However, to cluster, I need to come up with a partitioning strategy. The natural strategy is to partition the dataset by bucket, and then route the request to the appropriate Redis box. However, I must retain the bucket id with the item ids, so I'm going to need to concatenate the both ids together to create a composite id.

*Note: there are some other constraints I'm leaving out to keep this explanation simple, so let's assume this is way the keys have to be structured.

Now, once I pull a bunch of keys out of the cache (possibly 1,000,000+ at any given time), I'm going to need to parse the composite ids to get back the bucket and item ids. So the question is, what is the fastest way to parse these keys? I'm going to have a huge amount of keys and every nanosecond will count! Keep in mind, I'm allowed to structure the composite key anyway I want.

So I've come up with two ways to construct this id:

1. Concatenate the bucket id to either the front or back of the item id. Since the UUID's a fixed length, all I need to do is perform a substring operation to get both ids.
def CreateSubstrId(bucketId : Long) : String = {
    val itemId : String = CreateUUID
    String.format("%019d%s", long2Long(bucketId), itemId)
}

2. Concatenate the bucket and item id together and use a delimiter to separate the ids so I can use a split operation to get both ids.
def CreateSplitId(bucketId : Long) : String = {
    val itemId : String = CreateUUID
    String.format("%s~%s", bucketId.toString, itemId)
}

The CreateUUID function simply uses the Java UUID Generator library to create a random UUID, returning the value as a String
def CreateUUID() : String = {
    val nic : EthernetAddress 
         = EthernetAddress.fromInterface()
    val uuidGenerator : TimeBasedGenerator 
         = Generators.timeBasedGenerator(nic)
    uuidGenerator.generate().toString
}

The following are the functions I will use to deconstruct the composite ids back into a bucket id and an item id, and compare the bucket key to the original bucket key that was originally used to create it (for accuracy sake). The primary mission for these functions is to record the time necessary to execute the operation, not to return the keys! (At this point, I only want to know what parsing strategy is faster).

1. For the fixed-length id, using a call to substring:
def CalculatePadZeros(strBucketId : String, 
                          matchId : Long) : Long = {
    //Record start time
    val start = System.nanoTime
    //Take the substring of the key to find the bucket id
    val bucketId = strBucketId.substring(0,19)
    //Compare the bucket id
    val equals = IsSame(bucketId, matchId)
    //Calculate total time
    val total = System.nanoTime - start
    //If the id's weren't equal, display a message
    if(!(equals)) println(String.format
        ("PadZero -> Not Equals: %s & %s", 
        strBucketId, matchId.toString))
    //Return the total time it took to perform the operation
    total
}

2. For the delimited id, using a string split:
def CalculateStringSplit(strBucketId : String, 
                             matchId : Long) : Long = {
    //Record start time
    val start = System.nanoTime
    //Perform a string split on the tilda '~' delimiter 
    //and return the first item in the array
    val bucketId = strBucketId.split('~')(0)
    //Compare the bucket id
    val equals = IsSame(bucketId, matchId)
    //Calculate total time
    val total = System.nanoTime - start
    //If the id's weren't equal, display a message
    if(!(equals)) println(String.format
       ("StringSplit -> Not Equals: %s & %s",
        strBucketId, matchId.toString))
    //Return the total time it took to perform the operation
    total
}

The comparison function IsSame() is fairly straight forward:
def IsSame(bucketIdAsStr : String, comparedBucketId : Long) : Boolean = {
    //Parse the string
    val bucketId : Long = bucketIdAsStr.toLong
    //Compare the two values
    bucketId == comparedBucketId
}

Now that we've seen all the methods necessary to produce and parse the ids, here is my main function that will perform the test some designated number of iterations (in this case 100,000) outputting the results to the console:
def main(args : Array[String]) : Unit = {
    //Sum of all times in the substr test
    var substrCombinedTotal : Long = 0L
    //Sum of all times in the split test
    var stringSplitCombinedTotal : Long = 0L
    //Total number of iterations to run the test
    var totalIterations : Int = 100000
    //Iterate 
    for(i <- 1 to totalIterations){
 //Create a Bucket Id
 var randomBucketId = new Random().nextLong
 //Make sure this is a positive number
 randomBucketId = Math.abs(randomBucketId)
 //Run the substr test, store the time taken
 substrCombinedTotal += CalculatePadZeros(
             CreateSubstrId(randomBucketId), 
             randomBucketId)
 //Run the split test, store the time taken
 stringSplitCombinedTotal 
             += CalculateStringSplit(
                     CreateSplitId(randomBucketId), 
                     randomBucketId)
    }
    //Calculate the average time for the substr 
    //test and output to the console
    println(String.format("Substr averages %s ns", 
        (substrCombinedTotal / totalIterations)
            .toString))
    //Calculate the average time for the split test and 
    //output to the console
    println(String.format("String Split averages %s ns", 
        (stringSplitCombinedTotal / totalIterations)
            .toString))
}
The results are pretty interesting. Here is the output of the application from the console at 100, 1000, 10000, 100000 iterations:

100 Iterations
Substr averages 30940 ns
String Split averages 116860 ns
1000 Iterations
Substr averages 6907 ns
String Split averages 56163 ns
10000 Iterations
Substr averages 2631 ns
String Split averages 23613 ns
100000 Iterations
Substr averages 557 ns
String Split averages 5687 ns

As you can see, the Substring strategy is significantly faster, especially when the number of iterations is low. Part of my research requirement was to determine if the performance difference between the two methods was trivial (since we suspected the substring would be faster). These numbers demonstrate that the cost is not trivial and that you could potentially save an enormous amount of time by not using delimiters and string splits.

Hope you found this interesting.

Rich

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.