05 April 2009

insert performance variance

Disclaimer: I've done some research here based purely on measurement and assumed use cases. The performance discrepancies shown are on very large data volumes (1 million elements), so the findings may not apply at smaller scales.



Here's the premise that started me down the path of measurement:



  • I see sorted collections, specifically TreeSet objects, being used rampantly to maintain an ordered set of data
  • Typically, the ordering is comparison based (driven by the object implementing Comparable)
  • Often, the contents of the collection, once populated, are not tampered with
  • A frequent use of the contents of the collection is to iterate over it in sequence



Considering the above, it seems like we (I know I've done this) generally feel comfortable incurring some cost upon adding elements to a collection in order to recover them in sorted fashion quickly. The specific cost of an insert is log(n), where n is the number of elements already in the set. For a clean explanation of why the cost is log(n), read about Binary Search Trees.
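To make that trade-off concrete, here's a minimal illustrative sketch (class name is mine) of the TreeSet behavior described above: each add() pays the log(n) tree walk, and iteration then yields the elements in sorted order for free:

```java
import java.util.TreeSet;

public class TreeSetDemo {
    public static void main(String[] args) {
        // Each add() walks the red-black tree: O(log n) per insert.
        TreeSet<Long> set = new TreeSet<Long>();
        for (long v : new long[] {42L, 7L, 19L, 3L}) {
            set.add(v);
        }
        // Iteration recovers the elements in sorted order "for free".
        System.out.println(set); // prints [3, 7, 19, 42]
    }
}
```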



Here's my dilemma, though: add operations on lists cost very little (amortized constant time, for all practical purposes). And sorting a list that can manage a swap operation quickly is likely to be very fast too. A simple quicksort should theoretically achieve n log(n) performance if the data is reasonably well distributed.



So the question that I wanted to tackle was: For large enough data sets, would it be quicker to get a sorted collection by adding data to a simple list and then sorting it, or would it be quicker to work with a data structure that sorts with each insert? I decided to play with the ArrayList and TreeSet to dig into this.



Here's the ArrayList code:



ArrayList<Long> arrayList = new ArrayList<Long>();
Random r = new Random(1L);
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    long l = r.nextLong();
    arrayList.add(l);
    if (arrayList.size() % 10000 == 0) {
        long now = System.currentTimeMillis();
        System.out.println(now - start);
    }
}



Here's the TreeSet:



TreeSet<Long> treeSet = new TreeSet<Long>();
Random r = new Random(1L);
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    long l = r.nextLong();
    treeSet.add(l);
    if (treeSet.size() % 10000 == 0) {
        long now = System.currentTimeMillis();
        System.out.println(now - start);
    }
}



Here's a graph of the performance:





Not surprisingly, the ArrayList is way quicker. Looking at the cumulative cost though, you get a better picture of the total time it takes to add a lot of data to a sorted vs unsorted collection:





Again, not terribly surprising. But now let's try to level the playing field a bit. Let's transcribe the QuickSort algorithm to code, and then run it on the ArrayList after we've added everything to it. Here's the QuickSort:




private void sort(List<Long> arr, int start, int end) {
    if (start < end) {
        // use the first element of the range as the pivot
        int pivot = start;
        int newPivot = partition(arr, pivot, start, end);
        sort(arr, start, newPivot - 1);
        sort(arr, newPivot + 1, end);
    }
}

private int partition(List<Long> arr, int pivot, int start, int end) {
    long pivotValue = arr.get(pivot);
    swap(arr, pivot, end);        // park the pivot at the end of the range
    int newPivot = start;
    for (int x = start; x < end; x++) {
        if (arr.get(x) <= pivotValue) {
            swap(arr, x, newPivot);
            newPivot++;
        }
    }
    swap(arr, newPivot, end);     // move the pivot into its final position
    return newPivot;
}

/**
 * Helper to manage a 3-op swap.
 */
private void swap(List<Long> arr, int from, int to) {
    long tmp = arr.get(from);
    arr.set(from, arr.get(to));
    arr.set(to, tmp);
}

public void sort(List<Long> arr) {
    sort(arr, 0, arr.size() - 1);
}



And, tweaking the ArrayList insert code to leverage the sort:



ArrayList<Long> arrayList = new ArrayList<Long>();
Random r = new Random(1L);
QuickSort qs = new QuickSort();
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    long l = r.nextLong();
    arrayList.add(l);
    if (arrayList.size() % 10000 == 0) {
        long now = System.currentTimeMillis();
        System.out.println(now - start);
    }
}
qs.sort(arrayList);
long now = System.currentTimeMillis();
System.out.println(now - start);



Our performance numbers are:



Even with the separate sort, costing n log(n), the ArrayList won handsomely (by a factor of approximately 20%). Granted, the quicksort wasn't very general, but this does seem to raise a few questions. Perhaps for large volumes, this is a viable strategy?
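Worth noting: the JDK already ships a general-purpose sort, Collections.sort (a tuned merge sort), so the load-then-sort strategy doesn't even require a hand-rolled quicksort. A minimal sketch of that variant, under the same setup as the code above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Random;

public class SortAfterLoad {
    public static void main(String[] args) {
        ArrayList<Long> list = new ArrayList<Long>();
        Random r = new Random(1L);
        long start = System.currentTimeMillis();
        for (int x = 0; x < 1000000; x++) {
            list.add(r.nextLong());
        }
        // One O(n log n) sort after all inserts, instead of log(n) per insert.
        Collections.sort(list);
        System.out.println(System.currentTimeMillis() - start);
    }
}
```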

2 comments:

  1. Oh, well, yeah, you're only sorting the ArrayList once.

    A true comparison would be qs.sort on each iteration. Then TreeSet would probably win. :-)

    I guess the question is how often you need to sort vs. how often you need to insert. If inserts outweigh sorts by some (large?) factor, go array list + late-bound sort.

    It would be interesting to build a List that sorted on-demand and then cached the result until it was invalidated by an add/remove. So you could add 10k things to it, and not pay any sorting costs, until .iterate() (or whatever) is called, then it does a quick sort, and caches the result.

    This would get the same speedup as your ArrayList + quicksort, but encapsulated so that callers didn't know about it.

    Good post.

    ReplyDelete
  2. A true comparison would be qs.sort on each iteration. Then TreeSet would probably win. :-)

    No doubt about that: the cost of addition on the array list with this technique would be n log(n), while the cost with the tree set would only be log(n).

    The concept of a list that sorted lazily is quite compelling. In fact, it seems odd that a "sorted list" doesn't seem to exist out of the box that encompasses this behavior.
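A rough sketch of the lazily-sorted list idea from these comments (LazySortedList is a hypothetical name; it leans on Collections.sort rather than the quicksort above):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical class: sorts on demand and caches the result until invalidated.
public class LazySortedList<T extends Comparable<T>> implements Iterable<T> {
    private final List<T> backing = new ArrayList<T>();
    private boolean dirty = false;

    public void add(T item) {
        backing.add(item);   // amortized O(1); no sorting cost paid here
        dirty = true;        // invalidate the cached ordering
    }

    public Iterator<T> iterator() {
        if (dirty) {
            Collections.sort(backing); // one O(n log n) sort, then cached
            dirty = false;
        }
        return backing.iterator();
    }
}
```

Callers just add() freely and iterate; the sort cost is deferred to the first iteration after a mutation, which is exactly the encapsulation the first comment describes.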

    ReplyDelete