Disclaimer: I've done some research here based purely on measurement and some assumed use cases. The performance discrepancies shown are for very large data volumes (1 million elements), so the findings may or may not apply to your situation.
Here's the premise that got me measuring in the first place:
- I see sorted collections, specifically TreeSet objects, used rampantly to maintain an ordered set of data
- Typically, the ordering is comparison-based (driven by the object implementing Comparable)
- Often, the contents of the collection, once populated, are never modified again
- A frequent use of the collection is to iterate over its contents in sorted order
Considering the above, it seems like we (I know I've done this) generally feel comfortable incurring some cost on each insert in order to be able to get the elements back in sorted order quickly. That insert cost is O(log n), where n is the number of elements already in the set. For a clean explanation of why the cost is log(n), read up on balanced binary search trees (a TreeSet is backed by a red-black tree).
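To make the use case concrete, here's a minimal sketch of the pattern described above (plain Strings stand in for whatever Comparable type you'd actually store): populate a TreeSet once, then iterate over it in sorted order.
TreeSet<String> names = new TreeSet<String>();
// Each add costs O(log n) as the underlying red-black tree rebalances
names.add("Carol");
names.add("Alice");
names.add("Bob");
// Once populated, the set is typically just read back in sorted order
for (String name : names) {
    System.out.println(name); // prints Alice, Bob, Carol
}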
Here's my dilemma though: add operations on lists cost very little (amortized constant time for all practical purposes), and sorting a list that can manage a swap operation quickly is likely to be very fast too. A simple quicksort should theoretically achieve n log(n) performance if the data is reasonably well distributed. In other words, the two strategies are asymptotically the same: inserting n elements into a TreeSet costs roughly the sum of log(i) for i from 1 to n, which is on the order of n log(n), while filling a list is O(n) and sorting it afterwards is O(n log(n)). On paper, the difference comes down to constant factors.
So the question that I wanted to tackle was: For large enough data sets, would it be quicker to get a sorted collection by adding data to a simple list and then sorting it, or would it be quicker to work with a data structure that sorts with each insert? I decided to play with the ArrayList and TreeSet to dig into this.
Here's the ArrayList code:
ArrayList<Long> arrayList = new ArrayList<Long>();
Random r = new Random(1L);
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    long l = r.nextLong();
    arrayList.add(l);
    if (arrayList.size() % 10000 == 0) {
        long now = System.currentTimeMillis();
        System.out.println(now - start);
    }
}
Here's the TreeSet:
TreeSet<Long> treeSet = new TreeSet<Long>();
Random r = new Random(1L);
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    long l = r.nextLong();
    treeSet.add(l);
    if (treeSet.size() % 10000 == 0) {
        long now = System.currentTimeMillis();
        System.out.println(now - start);
    }
}
Here's a graph of the performance:
Not surprisingly, the ArrayList is way quicker. Looking at the cumulative cost though, you get a better picture of the total time it takes to add a lot of data to a sorted vs unsorted collection:
Again, not terribly surprising. But now let's try to even the playing field a bit. Let's transcribe the quicksort algorithm to code, and then run it on the ArrayList after we've added everything to it. Here's the quicksort:
private void sort(List<Long> arr, int start, int end, int d) {
    if (start < end) {
        int pivot = start;
        int newPivot = partition(arr, pivot, start, end);
        // Recurse on the elements on either side of the pivot's final position
        // (d just tracks the recursion depth)
        sort(arr, start, newPivot - 1, d + 1);
        sort(arr, newPivot + 1, end, d + 1);
    }
}
private int partition(List<Long> arr, int pivot, int start, int end) {
    long pivotValue = arr.get(pivot);
    // Park the pivot at the end, then sweep everything <= pivotValue to the front
    swap(arr, pivot, end);
    int newPivot = start;
    for (int x = start; x < end; x++) {
        if (arr.get(x) <= pivotValue) {
            swap(arr, x, newPivot);
            newPivot++;
        }
    }
    // Put the pivot into its final sorted position
    swap(arr, newPivot, end);
    return newPivot;
}
/**
 * Helper to manage a 3-op swap
 */
private void swap(List<Long> arr, int from, int to) {
    long tmp = arr.get(from);
    arr.set(from, arr.get(to));
    arr.set(to, tmp);
}

public void sort(List<Long> arr) {
    sort(arr, 0, arr.size() - 1, 0);
}
And, tweaking the ArrayList insert code to leverage the sort:
ArrayList<Long> arrayList = new ArrayList<Long>();
Random r = new Random(1L);
QuickSort qs = new QuickSort();
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    long l = r.nextLong();
    arrayList.add(l);
    if (arrayList.size() % 10000 == 0) {
        long now = System.currentTimeMillis();
        System.out.println(now - start);
    }
}
// Sort once after all the adds; the final print includes the sort in the total time
qs.sort(arrayList);
long now = System.currentTimeMillis();
System.out.println(now - start);
Our performance numbers are:
Even with the separate sort at the end, costing n log(n), the ArrayList won handsomely (by roughly 20%). Granted, the quicksort wasn't very general, but this does seem to raise a few questions. Perhaps for large volumes, add-then-sort is a viable strategy?
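For what it's worth, the add-then-sort strategy doesn't require a hand-rolled quicksort at all: the JDK's Collections.sort works on any list of Comparable elements and has O(n log(n)) worst-case behavior, whereas a naive first-element-pivot quicksort can degrade badly on already-sorted input. Here's a sketch of that variant (I haven't benchmarked it here, so no claims about its numbers):
ArrayList<Long> arrayList = new ArrayList<Long>();
Random r = new Random(1L);
long start = System.currentTimeMillis();
for (int x = 0; x < 1000000; x++) {
    arrayList.add(r.nextLong());
}
// One general-purpose sort at the end instead of the hand-rolled quicksort
Collections.sort(arrayList);
long now = System.currentTimeMillis();
System.out.println(now - start);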