Jesper Hald and others recently did a benchmark of a certain algorithm to figure out, which was fastest. It evolved into a kind of competition to make the fastest algorithm to solve this problem:
Fill a data structure with 1,000,000 random, unsorted values from 0-100
Run through this data structure 200 times, counting
a) number of values = 42
b) average for all 1,000,000*200 values
The benchmark was run on a new T5800 (2GHz, 800MHz) Toshiba laptop running 32-bit Windows Vista. Nothing had been done to the Vista to make it faster or behave in a special way.
The results were interesting, and our conclusions were:
* The fastest algorithm was made in Delphi (2009 version), was reasonably easy to read, and achieved to count the 42-values amongst 200,000,000 values in 55ms. That's pretty fast: 0.275 nanoseconds per value or about 0.5 clock cycles per value.
* Some of the first attempts in C# were 30-50 times slower than the winner.
* C# and Java were generally about 1.5 times slower than Delphi. Using normal generic lists in C# would make it 13 times slower than a simple Delphi implementation with static arrays. Is this comparison fair? Well, that's how the first results were made by the various programmers.
* Using unsafe code in C# seemed obvious, but actually wasn't necessary. It was possible to make it approximately as fast in C# without going unsafe.
* Delphi was approximately same speed inside and outside the IDE, whereas C# was almost 4-5 times slower when running the code inside the IDE.
* PHP was 1000-2500 times slower than Delphi.
* We gave up BASH scripting because it took too long time to fill the array with values (!)
Please do not generalize from this example, because there are many other things in this world than counting and averaging integer values. I'm not saying that Delphi is faster than C# or Java, and always remember, that a performance ratio of less than 2 in a specific algorithm is too little to make a difference in a large application.
64 comments:
I think it would be interesting to see the source, especially of the "as fast" C# vs the fastest Delphi code.
Just as a performance ratio of 2:1 is a practical minimum for any real advantage (depending on context of course), continued maintainability of code is also key.
Plus of course how easy it is to achieve the speed in the first place - if it demands arcane constructs and weird syntax or techniques that the majority of developers won't have at their fingertips and which aren't easily intuited, then the practical implication is that those performance levels won't be achieved "in the wild".
As I say, it would be interesting to see the sources of the best performing solutions, but also to put the performance of the "average" solutions in the context of the legibility of the code.
A comparable floating point example could be interesting.
PHP was 1000-2500 times slower than Delphi
Making PHP that much slower seems an achievement :-)
Was the PHP program using a different slower algorithm?
Was the PHP program written by an experienced PHP programmer?
C# and Java were generally about 1.5 times slower than Delphi
That's the sort of difference that seems to show up for tiny programs - although of course there's wide variation from program to program.
So, do you do those steps 200 times? Each iteration, do you regenerate the random values in the list/array? If not, why bother with the extra iterations, when you could just multiply the number of counter 42's by 200?
As far as I know, one of the C# programmers picked the conditions, and I don't know why. However, most solutions are not slowed down by the main RAM because the data is available in the CPU cache.
I participated as the Delphi programmer, and I wrote this code, which seemed impossible to beat by Java and C#:
for i:=0 to 199 do begin
c:=0;
for j:=0 to length(arr)-1 do begin
if arr[j]=42 then
inc (c);
end;
end;
This could run in approx. 220ms, or 2 clock cycles per value (C# and Java were at about 330ms at this stage). Then I asked if multithreading was allowed etc., and Jesper told me that all techniques may be employed. Then I removed branch prediction failures by repeating the inner part of the loop 8 times and by replacing the "if" with an array lookup, replaced the if branch with an array lookup, made it check 2 values at a time and added parallelism, exploiting both cores of the CPU, and got it down to 55ms.
ParallelExecute (
procedure
var i:integer;
j:integer;
c:integer;
begin
for i:=0 to 99 do begin
c:=0;
for j:=0 to (length(arrw) div 8)-1 do begin
inc (c,arr42w[arrw[j*8]]);
inc (c,arr42w[arrw[j*8+1]]);
inc (c,arr42w[arrw[j*8+2]]);
inc (c,arr42w[arrw[j*8+3]]);
inc (c,arr42w[arrw[j*8+4]]);
inc (c,arr42w[arrw[j*8+5]]);
inc (c,arr42w[arrw[j*8+6]]);
inc (c,arr42w[arrw[j*8+7]]);
end;
end;
c1:=c;
end,
procedure
var i:integer;
j:integer;
c:integer;
begin
for i:=0 to 99 do begin
c:=0;
for j:=0 to (length(arrw) div 8)-1 do begin
inc (c,arr42w[arrw[j*8]]);
inc (c,arr42w[arrw[j*8+1]]);
inc (c,arr42w[arrw[j*8+2]]);
inc (c,arr42w[arrw[j*8+3]]);
inc (c,arr42w[arrw[j*8+4]]);
inc (c,arr42w[arrw[j*8+5]]);
inc (c,arr42w[arrw[j*8+6]]);
inc (c,arr42w[arrw[j*8+7]]);
end;
end;
c2:=c;
end);
The arrw array is a word-array on top of the original byte-array, and ParallelExecute() is explained in one of my previous blog posts.
These latest optimizations are not Delphi-specific, I just mention them here, in case somebody else wants to redo it.
It was interesting to see how Delphi optimized it's assembler generation. For instance, in the example above, the loop is changed so that j is a variable that is incremented by 8 for each iteration.
If anybody can make a C# or Java program do the same stuff with the same speed, I would very much like to see the source code.
Regarding PHP: Scripting languages like PHP and Python are extremely slow at handling large data structures. Seriously.
It would be interesting to see how faster is FreePascal than Delphi and how faster is C++ Visual Studio than Free Pascal ;-)
Actually, the fastest C# implementation, that we have, is still 3 times slower than the fastest non-threaded Delphi implementation, and 6 times slower than the fastest threaded Delphi implementation, so I am still curious to see some suggestions for fast C# and Java implementations.
Here is a list of the current best results. If you can improve any of these, please submit your code...
http://spreadsheets.google.com/ccc?key=ry6ptc4zgXSvLFpFP-wwE6A
I'd also agree that it would be interesting to see (1) Delphi compared with C/C++ here and (2) a floating point example (because the Delphi compiler only creates FPU, but no XMM code).
It would be interesting to see the source code and exact steps how the performance is measured (skip the first run etc.). In any case, you should compare code which does the exactly same thing. Is your Delphi source code as much safe as the C# version ? Does it always check the array index range ? If not, the C# unsafe version is the real candidate for comparison.
In real-world programming there are much more interesting aspects to compare. Just take some functionality from RTL/VCL and compare its performance with similar one from .NET. Lets start with a 100 MB XML file creating (TXMLDocument versus XmlWriter) ;-)
I know the XML parsing problem - we got very tired of Microsoft's solution, so we implemented our own XML parser in Delphi, which parses a 98.3 MB XML file in 0.81 seconds (loading from disk is not included).
This was my first naive python attempt:
import numpy.random as RNG
data = RNG.random_integers(0, 101, 1000000)
def test42():
....return (data == 42).sum()
def testmean():
....return data.mean()
if __name__=='__main__':
....from timeit import Timer
....t = Timer("test42()", "from __main__ import test42")
....print 'Value=42: %.2fms' % (t.timeit(200)*1e3)
....t = Timer("testmean()", "from __main__ import testmean")
....print 'Average : %.2fms' % (t.timeit(200)*1e3)
Please replace leading periods (....) with spaces, in order to preserve the indentation when you execute it.
FWIW timings on my fast machine are:
Value=42: 2315.21ms
Average : 2059.34ms
You can find a multithreaded java implementation on http://niels.dybdahl.dk/public/Speed.java and http://niels.dybdahl.dk/public/Speed.jar
You can start the jar file with:
java -jar Speed.jar
As an option you can enter the number of threads you want to run.
I was the one Jesper originally talked to about this, and it started with just an "I wonder how fast a random task would take in Delphi vs C#".
We just picked the task to create 1 mio random numbers, loop them 200 times and find how many times 42 existed, simply because it was something that was fast to write and execute.
The first C# solution was written using Generic Lists - and that, well, that took time. Then we started to use arrays, and it speeded up quite a bit. Then it turned out, that if you execute the program from within the IDE (even in release-mode) it's nearly 3 times as fast, as if you execute the .exe file manually from cmd.exe.
I'm not surprised that Delphi beats C#, because C# tries to abstract a lot of things that you have to make yourself in Delphi. As with everything else, it's how fast it is to execute vs how fast it is to write. If we wanted to, I'm sure it could be made insanely fast if written in C.
However, I must say I'm amazed that Delphi counts the 1 mio numbers 200 times in only 55 ms, which right now is the number to beat. The best C# I've been able to write, is about 290 ms - 6 times faster.
So in this simple task, Delphi wins on speed. I'm still gonna keep using C# since I love it's simplicity when doing everyday tasks. :)
If anyone is interested, here is the first test of how to search the array. NumbersList is the array containing 1 mio random numbers, from 1-100:
int count = 0;
DateTime start = DateTime.Now;
for (int i = 0; i < 200;i++)
for (int j = 0; j < numbersList.Length; j++)
{
if(numbersList[j] == 42)
count++;
}
end = DateTime.Now;
TimeSpan ts = end - start;
Console.WriteLine("avg42 : " + ts.TotalMilliseconds.ToString() + " ms");
In an earlier post, I mentioned that running the application in releasemode within the IDE, was 3 times faster than if the .exe was run from the outside.
Of course it was 3 times faster outside, than inside the IDE.
My bad :)
Lars D > Regarding PHP: Scripting languages like PHP and Python are extremely slow at handling large data structures. Seriously.
Comparing a hash table in PHP with a statically allocated array in some other language might give you that impression ;-)
Let's do PHP much better for number of values = 42
srand();
$n = 1000000;
$data = str_repeat("0",$n);
for ($i = 0; $i < $n; $i++) { $data[$i] = chr( rand(0,100) ); }
for ($r = 0; $r < 200; $r++) {
# chr(42) == *
preg_match_all('/\*/',$data,$matches);
$c = sizeof( $matches[0] );
}
Hm, editing doesn't seem to be something this place uses.
Anyway, I ran the php code and got some much better results than before.
I ran it using php-5.2.10-Win32.
The biggest problem seems to be in measurement.
$starttime = microtime();
// the reading part of your code here
$endtime = microtime();
echo $endtime-$starttime;
I get values ranging from -245 ms to +823 ms. So if you have a better way to measure this, it would be nice.
Otherwise I think we can just conclude that it's <1 second now using php.
Instead of
for j:=0 to (length(arrw) div 8)-1 do begin
you could write
for j:=0 to length(arrw)-1 step 8 do begin
That would eliminate the need for "j*8" below. But as you say, the compiler seems to be smart enough to produce the same assembly code either way.
As for adding a comparison with C/C++: yes that would be interesting. With C you should be able to come up with the same performance as Delphi and even better when you use the Intel optimizing compiler with all bells and whistles. But when it comes to writing very fast, easily readable code, in a very short time, Delphi wins hands down!
In Delphi 2009 I tried using a combination of ansistring, PosEx and Lars' ParallelExecute:
const
L = 1000000;
var
Z: integer;
C, C1, C2: integer;
Arr: ansistring;
begin
// FILL ARR
Randomize;
SetLength(Arr,L);
for Z := 1 to L do
Arr[Z] := ansichar(chr(Random(101)));
// START MEASURING TIME
ParallelExecute(
procedure
var
I, J, K: integer;
begin
K := -1;
I := 0;
repeat
J := 1;
repeat
J := PosEx(#42, Arr, J) + 1;
Inc(K);
until J = 1;
inc(I);
until I = 100;
C1 := K;
end,
procedure
var
I, J, K: integer;
begin
K := -1;
I := 0;
repeat
J := 1;
repeat
J := PosEx(#42, Arr, J) + 1;
Inc(K);
until J = 1;
inc(I);
until I = 100;
C2 := K;
end);
C := C1 + C2;
// STOP MEASURING TIME
On my laptop it runs about 2.5 times faster than the original 'for I := 0 to 199 do for J := 0 to 999999' algorithm. This seems on par with the current record, or maybe even better.
The ansistrings.posex() solution is interesting - on my computer it beats the simplest static array algorithm by almost 1.5 times, but it is still almost 2 times slower than the fastest algorithm, and it is less readable because you store the values as char values. However, it still beats all the examples from other programming languages, and on my computer it also beats the php preg_match_all, which can be measured reasonably precisely on my computer, but not on Jesper's reference laptop.
@Lars D: Thank you for your quick follow up. You are of course right that it is less readable :)
Hald > I get values ranging from -245 ms to +823 ms
Looks like overflow doesn't it?
function secs() {
list($usec, $sec) = explode(" ", microtime());
return ((float)$usec + (float)$sec);
}
$starttime = secs();
// the reading part of your code here
$endtime = secs();
echo $endtime-$starttime;
Let's do PHP much much better for number of values = 42
srand();
$data = str_repeat("0",$n);
for ($i = 0; $i < 1000000; $i++) { $data[$i] = chr( rand(0,100) ); }
for ($r = 0; $r < 200; $r++) $counts = count_chars($data);
echo $counts[42],"\n";
I'm not a PHP programmer. I think a PHP programmer would have thought of this solution really quickly.
Is PHP 1000-2500 times slower than Delphi? ;-)
What about c/c++?
I think any test should include c/c++ as well.
The Delphi runtime library is a mix of assembler code and Delphi code. The PHP runtime library is very much written in C/C++. If you want to compare the speed of the languages, and not the runtime libraries, then the task probably has to be more advanced than just calling a runtime library function.
The currently fastest algorithm does not call a runtime library function. It's pure Delphi source code.
Previously you wrote - "and Jesper told me that all techniques may be employed".
Yes - but Jesper's rules do not imply that the performance of one PHP runtime library function describes the speed of the PHP programming language.
The "current best results" spreadsheet describes sajbar's PHP programs as using "static array".
Perhaps I'm mistaken but my understanding is that PHP "arrays" are hash tables.
Do "Jesper's rules" imply that the performance of hash-tables against static arrays describes the speed of a programming language? :-)
Is it "all techniques may be employed" in Delphi? :-)
I notice that there is a reason, why the Delphi runtime is written in Delphi, but the php runtime library is not written in php. I fully acknowledge, that the php runtime solves many problems very efficiently, but even though Jesper's benchmark allows all tricks, my comment was purely targeted at the php programming language, and not php as a tool.
If you can write a php example, which does not let code written in other programming languages solve the problem, and which executes faster than 50 seconds on Jespers machine, then you will have proven that I was wrong.
Take the challenge! :-)
Lars D > purely targeted at the php programming language, and not php as a tool
I don't think it makes much sense to talk "purely" about the performance of programming languages - as if how the way the language is implemented didn't matter.
Is C faster than PHP?
Not when using a C interpreter ;-)
Incidentally, please note my typo
$data = str_repeat("0",$n);
should be
$data = str_repeat("0",1000000);
I published my solution, and I would like to see other people's solution, for inspiration. When programmers get pushed to do their best, interesting solutions come up. I just saw that Jesper updated the spreadsheet with unsafe, multithreaded C# code from "Misha", which takes 131ms, which is a new record for C#. I hope that I will see the source code :-)
I just heard from Jesper, that the fastest C# algorithm uses pointers and goto-statements. It is the use of pointers that makes the code unsafe.
Lars D. > I published my solution, and I would like to see other people's solution, for inspiration.
And how fast would your solution run on a Delphi interpreter? :-)
I like to compare programs too - I'd be interested in seeing what a real PHP programmer would do - but let's not confuse this with some notion of "purely" comparing programming languages.
I'm not sure what your point is, but I know which language I would pick for writing a complex algorithm that needs to be fast with big amounts of data, and it's not PHP.
My point was simply that "the Delphi runtime is written in Delphi, but the php runtime library is not written in php" is just an arbitrary distinction, when you haven't limited your comparison to self-hosted language implementations.
Is the C# language implemented in C# or C++?
Is the Java language implemented in Java or C++?
I very much doubt that preg_match_all is written in PHP.
"I very much doubt that preg_match_all is written in PHP." - totally my point.
I also doubt that PHP
for ($i = 0; $i < $n; $i++)
is implemented in PHP so I don't see how your point applies to the comparisons you have been making?
This seems more "obvious" a solution than preg_match_all
srand();
$data = str_repeat("0",1000000);
for ($i = 0; $i < 1000000; $i++) { $data[$i] = chr( rand(0,100) ); }
for ($r = 0; $r < 200; $r++) $c = sizeof( explode(chr(42),$data) ) - 1;
echo $c,"\n";
Victor:
for j:=0 to length(arrw)-1 step 8 do
Wich version of Delphi do you use?
PHP times updated with the secs function.
Time is 1.8 seconds
Hald > PHP times updated with the secs function. Time is 1.8 seconds
Pleased that worked for you; so ~36x times slower than the fastest.
And I would guess the new program using count_chars is more like 0.8 seconds on your machine?
If I understood the requirement correctly, here for completeness is PHP average for all 1,000,000*200 values
function secs() {
list($usec, $sec) = explode(" ", microtime());
return ((float)$usec + (float)$sec);
}
$n = 1000000;
$repeat = 200;
srand();
$data = str_repeat("0",$n);
for ($i = 0; $i < $n; $i++) { $data[$i] = chr( rand(0,100) ); }
$t0 = secs();
$sum = 0;
for ($r = 0; $r < $repeat; $r++){
$c = count_chars($data);
for ($i = 0; $i <= 100; $i++) { $sum += $c[$i] * $i; }
}
$mean = (double)$sum / ($n * $repeat);
$t = secs() - $t0;
printf("%d\n%0.3f\n%0.3fs\n",$c[42],$mean,$t);
Jesper, the latest PHP function, that does counting & averaging, is actually less than 600ms on my PC, so it should be around 300-400ms on your laptop. Would it be possible to publish the algorithms from the spreadsheet?
If I understood the requirements correctly, that last PHP script should also be the fastest PHP solution to "number of values = 42".
Although it might be a tiny little bit faster for that problem to comment out one line
// for ($i = 0; $i <= 100; $i++) { $sum += $c[$i] * $i; }
I've updated the spreadsheet to show which codes has been tested to get the results.
The latest php from Isaac is also put in, as is a vbs version, another python version and c# using a COM object
I've updated the spreadsheet to include links to the code which has been run to get the results.
I've also updated with the latest php from Isaac, some new vbs and python and c# using a com object.
Instead of -
PHP was 1000-2500 times slower than Delphi.
- the data now shows PHP was 8-1100 times slower, and I'll suggest that extreme range is down to using or ignoring what comes standard with PHP :-)
Lars D > I know which language I would pick for writing a complex algorithm that needs to be fast with big amounts of data, and it's not PHP.
A good answer to that is often - Use whichever language you know best!
But if we wanted a better answer then is speed to access contiguous memory locations really that relevant, or is it just a proxy to show that the language can go low-level?
Isaac: I'm a big fan of php, but I also know its limits. Comparing PHP to Delphi by benchmarking a C/C++ library function does not impress me.
Try this one:
http://test.tvsori.com/bench2.cpp - veeeeery simple, strait & stupid implementation, easiest to read.
And executable (please check for viruses who knows what I have):
http://test.tvsori.com/Bench_intel_autovec_autoparr.exe
Bench_intel_autovec_autoparr . exe
Nothing to say more...
Victor:
for j:=0 to length(arrw)-1 step 8 do
Wich version of Delphi do you use?
OMG! I'm so sorry. That isn't Pascal, it's Basic!
for i = 0 to n step 8
..
next
Well, I guess not all is bad about Basic either= :-)
In Delphi you would have to use a while loop and increment the iterator yourself. Actually, it might be worth a try to see if it's faster to start the index at the end and count back to zero. That way the while test condition would be while (i > 0). That saves a single subtraction operation for the test. It's not much, but everything helps. Question is whether the memory access pattern would cause slowdown.
Actually, I tried that. In Delphi, performance gets slower if you try to substitute a for-loop with a while-loop...
Delphi knows all kinds of for-loop optimizations, including reversing the order if the order doesn't matter, changing the loop variable to be 8 times bigger, arranging machine code instructions for optimal timing etc.
You can observe the changes in loop variable behavior directly in the debugger, because the debugger reports the optimized loop variable value when you hold the cursor over the variable name during debugging.
And, please, don't forget about OpenMP, it gives linear speedup without any additional efforts :)
DWORD myavg1;
stime = timeGetTime();
#pragma omp parallel for
for (int cnt=0; cnt < 200; ++cnt) {
int lmyavg1 = 0;
for (int i=0; i < 1000000; ++i) {
lmyavg1 += arrw[i];
}
myavg1 = lmyavg1;
}
etime = timeGetTime();
printf("Single Thread Average: %u Time %u\n",myavg1/1000000,etime-stime);
Lars D > Comparing PHP to Delphi by benchmarking a C/C++ library function does not impress me.
But you do seem to be impressed by comparing Delphi to PHP on a task that is so obviously suited to the low-level programming available in a language like C - the odd thing is that Delphi isn't being compared to C or C++.
Incidentally, did you check that loop-through-array.php gives a sensible answer?
Unfortunately Delphi isn't listed at http://shootout.alioth.debian.org but Free Pascal is there (and it does reasonably well IMHO)
It is not correct to compare such different languages as Delphi and PHP(Static Typing versus Dynamic Typing, Strong Typing versus Weak Typing). This is the dishonest comparison.
In the tests given above, are compared the languages from different categories(native code versus byte code/JIT compiler output). Why not to compare Delphi and C++ (both uses native code)?
>Delphi knows all kinds of for-loop optimizations, including reversing the order if the order doesn't matter...
... and nevertheless preserves variable into the memory, even if it is not used
PS: Sorry for my English
--
Anton Golinko
http://www.grimes.demon.co.uk/dotnet/man_unman.htm
this one compares C++ and C# and C# won! I also have seen a lot of other articles saying that managed code is on par or might be faster than unmanaged code.
So it's pretty strange that Delphi beat C# to the punch... Something is simply not taken in account.
That page compiles C++ for .net, and not natively. Delphi 2009 compiles natively.
If you would try a native C/C++ compiler it would beat C++ for .net.
Lars, that example you posted isn't multi-threading.
Multi-Threading is using multiple threads. Yours only uses 1 thread. Try searching up the CreateThread() API.
My example is multithreading, but the actual thread creation is hidden in the procedure call, which is implemented using TThread.
One few comments from my side.
IMHO it helps to differentiate between problem, algorithm and implementation. I think you are all using the same algorithm, i.e. walk through the array and count the occurrence of a certain value/compute the mean.
The problem: given a set of random numbers in 1 to 100 give me the count of the ones with a certain value (without loss of generality 42).
The algorithm: For this problem, I am proposing a different algorithmic approach, i.e. keep the array and a seperate data structure with the count of the individual values. A query for the count is then a simple lookup. You circumvent all memory and computational problems. It works like a cache.
If the use-case is what I am imagining, i.e. a long series of integer that you keep and frequently query for the count. Of course this is a typical example of a time/space tradeoff, as this solution uses more space (especially for a large set of possible values). If the tracking of the counts in a seperate datastructure cannot happen alongside, but the ratio between setting and quering is 1:1, then this approach obviously does not work.
I implemented the algorithm using a dictionary for the count. For this example it could be implemented even faster using an array. Obviously this approach is very fast...
The original Python solution using numpy:
Value=42: 2863.27ms
Average : 2606.65ms
My Python solution:
Value=42: 1.50ms
Average : 0.81ms
-----
Source code here:
import numpy.random as RNG
class intdict:
def __init__(self):
self.data = RNG.random_integers(0, 101, 1000000)
self.count = {}
for k in self.data:
if k in self.count:
self.count[k] += 1
else:
self.count[k] = 1
self.vmean = self.data.mean()
def test(self,k):
if k in self.count:
return self.count[k]
else:
return 0
def mean(self):
return self.mean
data = intdict()
def test42():
return data.test(42)
def testmean():
return data.mean()
if __name__=='__main__':
from timeit import Timer
t = Timer("test42()", "from __main__ import test42")
print 'Value=42: %.2fms' % (t.timeit(200)*1e3)
t = Timer("testmean()", "from __main__ import testmean")
print 'Average : %.2fms' % (t.timeit(200)*1e3)
Post a Comment