Recently, Robert Hundt of Google published a paper titled “Loop Recognition in C++/Java/Go/Scala”
The paper compares various aspects (run-time performance, code size, etc.) of the implementation of the same algorithm in different languages. See the paper for a detailed analysis. The final (best) performance results for the various languages are:
Benchmark Time [sec] C++ : 23 Java : 89 Scala : 58 Go : 126
I ported the elegantly written Scala code by Google’s Daniel Mahler to F# and then further optimized it with the help of Visual Studio 2010 Profiler (which makes profiling almost effortless) to achieve:
F# Performance (updated) 31 sec on Core Duo 2.00 Ghz 20 sec on Core i5
Updated: Independent verification of results here.
While the results are not directly comparable because of the different hardware/OS used, it does show that F# (and by subsumption .Net / C#) offer comparable performance to Java and Scala.
The F# binary and source can be downloaded from http://loopfinder.codeplex.com/
(update: To get the source go to “View all downloads” at the above link)
The equivalent Scala code is here.
Two significant F# optimizations done are:
- “for” loops are used in performance critical regions (instead of Seq.iter, or Seq.fold)
- Object.Equals method was used for equality / inequality testing instead of “=” and “<>” operators
As with Scala, the F# version uses the mutable platform collections (List<> and HashSet<>) instead of F# immutable collections.
The source code size for F# and Scala is about the same. Scala has a “short cut” lambda syntax that make for very concise code when writing simple lambda expressions.
Note: Subsequent to the paper, another team optimized Go and C++ code to run in less than 4 seconds however the optimizations were very specific to the benchmark and cannot be applied generally.
Here is the blog post that talks about C++ and Go improvements : http://blog.golang.org/2011/06/profiling-go-programs.html
It has links to the C++ source which is here: http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.cc
Reading this post I get the impression that when we have tight loops then even a little bit of allocation makes a big difference.
The main improvements to go and c++ are the reduction in garbage by reusing cached objects when inside tight loops.
The way the benchmark is written it first constructs a graph (of about 20K nodes) and then finds loops in it 50 times.
For each loop-find iteration most implementations generate some garbage. The benchmark is very sensitive to the amount of garbage generated in each iteration.
I think a better benchmark would have been to generate the graph and find the loops in it once and then repeat this cycle 50 times without caching anything across cycles. Recreating the graph and other objects for each cycle is a more realistic benchmark.
The only sensible conclusion we can draw is that F# on Windows seems to be faster than Scala on Linux because the F# code is a direct port of the Scala code (I believe Google used a Core i7 processor for the benchmarks as indicated here: http://www.theregister.co.uk/2011/07/01/go_v_cpluplus_redux/).
Update: Google used an old Pentium for the benchmark study so the results are not comparable with the F# numbers. However read the comments for some additional information.
Also, I ran the same F# code on Windows 8 and .Net 4.5 (Core i5; HP Pavillion g7) and consistently get 11-12 second runs.