I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.
The Benchmark

    #include <chrono>
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>
    #include <x86intrin.h>

    int main(int argc, char* argv[])
    {
        using namespace std;
        if (argc != 2) {
            cerr << "usage: array_size in MB" << endl;
            return -1;
        }

        // Buffer size in bytes; the argument is given in MB.
        uint64_t size = atol(argv[1])<<20;
        uint64_t* buffer = new uint64_t[size/8];
        char* charbuffer = reinterpret_cast<char*>(buffer);
        // Fill the buffer with random bytes.
        for (unsigned i=0; i<size; ++i)
            charbuffer[i] = rand()%256;

        uint64_t count,duration;
        chrono::time_point<chrono::system_clock> startP,endP;
        {
            startP = chrono::system_clock::now();
            count = 0;
            for( unsigned k = 0; k < 10000; k++){
                // Tight unrolled loop with unsigned
                for (unsigned i=0; i<size/8; i+=4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i+1]);
                    count += _mm_popcnt_u64(buffer[i+2]);
                    count += _mm_popcnt_u64(buffer[i+3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
            cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
                 << (10000.0*size)/(duration) << " GB/s" << endl;
        }
        {
            startP = chrono::system_clock::now();
            count=0;
            for( unsigned k = 0; k < 10000; k++){
                // Tight unrolled loop with uint64_t
                for (uint64_t i=0;i<size/8;i+=4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i+1]);
                    count += _mm_popcnt_u64(buffer[i+2]);
                    count += _mm_popcnt_u64(buffer[i+3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
            cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
                 << (10000.0*size)/(duration) << " GB/s" << endl;
        }

        // The buffer was allocated with new[], so release it with delete[] rather than free().
        delete[] buffer;
    }

As you can see, we create a buffer of random data whose size is x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the time each of the two blocks takes. In the upper case, the inner loop variable is unsigned; in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.
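
To make the comparison explicit, here is a minimal sketch of the inner loop as a template whose only parameter is the type of the loop variable. The name popcount_unrolled is made up for this illustration and is not part of the benchmark, and the sketch uses the GCC/Clang builtin __builtin_popcountll in place of _mm_popcnt_u64 so that it builds without any -march flag. It assumes the number of words is a multiple of 4, as in the benchmark.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the original benchmark.
// Sums the set bits of `words` with a 4x unrolled loop.
// `Index` is the type of the inner loop variable, which is the only thing
// that differs between the two timed blocks of the benchmark.
// Assumes words.size() is a multiple of 4.
template <typename Index>
std::uint64_t popcount_unrolled(const std::vector<std::uint64_t>& words)
{
    const std::uint64_t* buffer = words.data();
    const std::size_t n = words.size();
    std::uint64_t count = 0;
    for (Index i = 0; i < n; i += 4) {
        // GCC/Clang builtin used instead of _mm_popcnt_u64 (an assumption of
        // this sketch, chosen for portability across compiler flags).
        count += __builtin_popcountll(buffer[i]);
        count += __builtin_popcountll(buffer[i + 1]);
        count += __builtin_popcountll(buffer[i + 2]);
        count += __builtin_popcountll(buffer[i + 3]);
    }
    return count;
}
```

Calling popcount_unrolled<unsigned>(words) and popcount_unrolled<std::uint64_t>(words) corresponds to the two cases above. Both must return the same count; only the generated machine code may differ.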
The (absolutely crazy) results
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

    g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz (http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29#Desktop_processors), running test 1 (so 1 MB of random data):
- unsigned 41959360000 0.401554 sec 26.113 GB/s
- uint64_t 41959360000 0.759822 sec 13.8003 GB/s
As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3, http://en.wikipedia.org/wiki/Clang):

    clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result of running test 1:
- unsigned 41959360000 0.398293 sec 26.3267 GB/s
- uint64_t 41959360000 0.680954 sec 15.3986 GB/s
So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

    uint64_t size = atol(argv[1]) << 20;

to

    uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:
- unsigned 41959360000 0.509156 sec 20.5944 GB/s
- uint64_t 41959360000 0.508673 sec 20.6139 GB/s
Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant with a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:
- unsigned 41959360000 0.677009 sec 15.4884 GB/s
- uint64_t 41959360000 0.676909 sec 15.4906 GB/s
Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant with a constant value even led to slow code in both cases for Clang!
I asked a colleague with an Ivy Bridge CPU (http://en.wikipedia.org/wiki/Ivy_Bridge_%28microarchitecture%29) to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
More madness, please!
Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

    static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:
- unsigned 41959360000 0.396728 sec 26.4306 GB/s
- uint64_t 41959360000 0.509484 sec 20.5811 GB/s
Yay, yet another alternative. We still have the fast 26 GB/s with u32 (the unsigned loop), but we managed to get u64 (the uint64_t loop) at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.
My question
Can you explain these results? Especially:
- How can there be such a difference between u32 and u64?
- How can replacing a non-constant by a constant buffer size trigger less optimal code?
- How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!
I know that optimization is tricky territory. However, I never thought that such small changes could lead to a 100% difference in execution time, or that small factors like a constant buffer size could again mix up the results completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy and paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the best performance?
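
One variation that could be added to the list of alternatives to benchmark, short of inline assembly, is to give each of the four unrolled popcounts its own accumulator and only add them up after the loop. The sketch below is an assumption for illustration, not something measured in this post: the function name popcount_four_accumulators is hypothetical, __builtin_popcountll (a GCC/Clang builtin) stands in for _mm_popcnt_u64, and nothing here claims that it reaches the 26 GB/s figure. The compiler is also free to rearrange it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical variant of the benchmark's inner loop, for experimentation only.
// Each unrolled popcount adds into its own accumulator; the four partial sums
// are combined once after the loop. Assumes n is a multiple of 4.
std::uint64_t popcount_four_accumulators(const std::uint64_t* buffer, std::size_t n)
{
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        c0 += __builtin_popcountll(buffer[i]);
        c1 += __builtin_popcountll(buffer[i + 1]);
        c2 += __builtin_popcountll(buffer[i + 2]);
        c3 += __builtin_popcountll(buffer[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}
```

The result is the same as with a single count variable; whether the generated code is any more stable across compilers would have to be checked by timing it and reading the disassembly, just as with the versions above.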
The Disassembly
Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

    0x400af8:
    lea    0x1(%rdx),%eax
    popcnt (%rbx,%rax,8),%r9
    lea    0x2(%rdx),%edi
    popcnt (%rbx,%rcx,8),%rax
    lea    0x3(%rdx),%esi
    add    %r9,%rax
    popcnt (%rbx,%rdi,8),%rcx
    add    $0x4,%edx
    add    %rcx,%rax
    popcnt (%rbx,%rsi,8),%rcx
    add    %rcx,%rax
    mov    %edx,%ecx
    add    %rax,%r14
    cmp    %rbp,%rcx
    jb     0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

    0x400c00:
    popcnt 0x8(%rbx,%rdx,8),%rcx
    popcnt (%rbx,%rdx,8),%rax
    add    %rcx,%rax
    popcnt 0x10(%rbx,%rdx,8),%rcx
    add    %rcx,%rax
    popcnt 0x18(%rbx,%rdx,8),%rcx
    add    $0x4,%rdx
    add    %rcx,%rax
    add    %rax,%r12
    cmp    %rbp,%rdx
    jb     0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

    0x400e50:
    popcnt (%r15,%rcx,8),%rdx
    add    %rbx,%rdx
    popcnt 0x8(%r15,%rcx,8),%rsi
    add    %rdx,%rsi
    popcnt 0x10(%r15,%rcx,8),%rdx
    add    %rsi,%rdx
    popcnt 0x18(%r15,%rcx,8),%rbx
    add    %rdx,%rbx
    add    $0x4,%rcx
    cmp    %rbp,%rcx
    jb     0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize:

    0x400a68:
    popcnt (%rbx,%rdx,1),%rax
    popcnt 0x8(%rbx,%rdx,1),%rcx
    add    %rax,%rcx
    popcnt 0x10(%rbx,%rdx,1),%rax
    add    %rax,%rcx
    popcnt 0x18(%rbx,%rdx,1),%rsi
    add    $0x20,%rdx
    add    %rsi,%rcx
    add    %rcx,%rbp
    cmp    $0x100000,%rdx
    jne    0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize:

    0x400dd0:
    popcnt (%r14,%rcx,8),%rdx
    add    %rbx,%rdx
    popcnt 0x8(%r14,%rcx,8),%rsi
    add    %rdx,%rsi
    popcnt 0x10(%r14,%rcx,8),%rdx
    add    %rsi,%rdx
    popcnt 0x18(%r14,%rcx,8),%rbx
    add    %rdx,%rbx
    add    $0x4,%rcx
    cmp    $0x20000,%rcx
    jb     0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?
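
One possible reading of why only the unsigned version uses lea, offered as an assumption rather than a confirmed explanation: with an unsigned loop variable, i+1 is computed in 32-bit arithmetic and wraps modulo 2^32 before it is used as an index, so the compiler has to produce each index in a 32-bit register (for example lea 0x1(%rdx),%eax) instead of folding the offset into the address. With a uint64_t index no such wrap is needed at 32 bits, and the offset can appear as a displacement, as in popcnt 0x8(%rbx,%rdx,8),%rcx. The sketch below only demonstrates the language rule; the function names are made up for this illustration, and it says nothing about why one form runs faster.

```cpp
#include <cstdint>

// This sketch assumes a 32-bit unsigned, as on x86-64 Linux.
static_assert(sizeof(unsigned) == 4, "sketch assumes 32-bit unsigned");

// Hypothetical helpers: the index the loop would use for buffer[i+1].
// With an unsigned i, the addition is done in 32 bits and wraps modulo 2^32
// before being widened to a 64-bit array index.
std::uint64_t next_index_u32(unsigned i)
{
    return i + 1;
}

// With a uint64_t i, the addition is done in 64 bits, so nothing wraps at 2^32.
std::uint64_t next_index_u64(std::uint64_t i)
{
    return i + 1;
}
```

For i = 0xFFFFFFFF the first function yields 0 while the second yields 0x100000000, which is why the two loops are not interchangeable from the compiler's point of view even though the buffer here is far too small for the difference to matter.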
Lessons learned
No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any connection to the hot code. I had never thought about what type to use for a loop variable, but as you can see, such a minor change can make a 100% difference! Even the storage class of a variable can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.
It is also interesting that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.