I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // buffer came from new[], so free() would be undefined behaviour
}
As you see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the time it takes. In the upper case, the inner loop variable is unsigned; in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.
The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz (http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29#Desktop_processors), running test 1 (so 1 MB random data):

- unsigned  41959360000  0.401554 sec  26.113 GB/s
- uint64_t  41959360000  0.759822 sec  13.8003 GB/s
As you see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3, http://en.wikipedia.org/wiki/Clang):

clang++ -O3 -march=native -std=c++11 test.cpp -o test
Result of test 1:

- unsigned  41959360000  0.398293 sec  26.3267 GB/s
- uint64_t  41959360000  0.680954 sec  15.3986 GB/s
So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

- unsigned  41959360000  0.509156 sec  20.5944 GB/s
- uint64_t  41959360000  0.508673 sec  20.6139 GB/s
Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s; thus, replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:
- unsigned  41959360000  0.677009 sec  15.4884 GB/s
- uint64_t  41959360000  0.676909 sec  15.4906 GB/s
Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even led to slow code in both cases for Clang!
More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;
Here are my results in g++:

- unsigned  41959360000  0.396728 sec  26.4306 GB/s
- uint64_t  41959360000  0.509484 sec  20.5811 GB/s
Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.
My question

Can you explain these results? Especially:

- How can there be such a difference between u32 and u64?
- How can replacing a non-constant by a constant buffer size trigger less optimal code?
- How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!
I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, or that small factors like a constant buffer size could mix up the results again so completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?
The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea     0x1(%rdx),%eax
popcnt  (%rbx,%rax,8),%r9
lea     0x2(%rdx),%edi
popcnt  (%rbx,%rcx,8),%rax
lea     0x3(%rdx),%esi
add     %r9,%rax
popcnt  (%rbx,%rdi,8),%rcx
add     $0x4,%edx
add     %rcx,%rax
popcnt  (%rbx,%rsi,8),%rcx
add     %rcx,%rax
mov     %edx,%ecx
add     %rax,%r14
cmp     %rbp,%rcx
jb      0x400af8
13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt  0x8(%rbx,%rdx,8),%rcx
popcnt  (%rbx,%rdx,8),%rax
add     %rcx,%rax
popcnt  0x10(%rbx,%rdx,8),%rcx
add     %rcx,%rax
popcnt  0x18(%rbx,%rdx,8),%rcx
add     $0x4,%rdx
add     %rcx,%rax
add     %rax,%r12
cmp     %rbp,%rdx
jb      0x400c00
15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt  (%r15,%rcx,8),%rdx
add     %rbx,%rdx
popcnt  0x8(%r15,%rcx,8),%rsi
add     %rdx,%rsi
popcnt  0x10(%r15,%rcx,8),%rdx
add     %rsi,%rdx
popcnt  0x18(%r15,%rcx,8),%rbx
add     %rdx,%rbx
add     $0x4,%rcx
cmp     %rbp,%rcx
jb      0x400e50
20 GB/s version from g++ / u32&u64 / const bufsize:

0x400a68:
popcnt  (%rbx,%rdx,1),%rax
popcnt  0x8(%rbx,%rdx,1),%rcx
add     %rax,%rcx
popcnt  0x10(%rbx,%rdx,1),%rax
add     %rax,%rcx
popcnt  0x18(%rbx,%rdx,1),%rsi
add     $0x20,%rdx
add     %rsi,%rcx
add     %rcx,%rbp
cmp     $0x100000,%rdx
jne     0x400a68
15 GB/s version from clang++ / u32&u64 / const bufsize:

0x400dd0:
popcnt  (%r14,%rcx,8),%rdx
add     %rbx,%rdx
popcnt  0x8(%r14,%rcx,8),%rsi
add     %rdx,%rsi
popcnt  0x10(%r14,%rcx,8),%rdx
add     %rsi,%rdx
popcnt  0x18(%r14,%rcx,8),%rbx
add     %rdx,%rbx
add     $0x4,%rcx
cmp     $0x20000,%rcx
jb      0x400dd0
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?
Lessons learned

No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I had never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.
Answer

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt src, dest

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 for Haswell (https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf) and SKL029 for Skylake (https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf).

Skylake fixed this for lzcnt and tzcnt (https://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter). Cannon Lake (and Ice Lake) fixed this for popcnt.

bsf/bsr have a true output dependency: output unmodified for input=0. (But there is no way to take advantage of that with intrinsics (https://stackoverflow.com/questions/41351564/vs-unexpected-optimization-behavior-with-bitscanreverse64-intrinsic/41352456#41352456): only AMD documents it and compilers don't expose it.)
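At the source level, none of this is visible: portable code calls a builtin or intrinsic and leaves the register choice (and any erratum workaround) to the compiler. A minimal sketch for illustration, assuming GCC/Clang's __builtin_popcountll, which lowers to the popcnt instruction when the target supports it:

```cpp
#include <cstdint>

// Count the set bits across an array of 64-bit words.
// With -mpopcnt/-march=native this becomes the POPCNT instruction; a compiler
// that knows about the erratum can emit the dependency-breaking xor on its own.
uint64_t popcount_buffer(const uint64_t* buf, uint64_t n) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < n; ++i)
        count += static_cast<uint64_t>(__builtin_popcountll(buf[i]));
    return count;
}
```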
(Yes, these instructions all run on the same execution unit: https://stackoverflow.com/questions/28802692/how-is-popcnt-implemented-in-hardware.)

This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.
The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns the registers to the variables. In your case, the speeds are a direct result of what is stuck to the (false) dependency chain, depending on what the register allocator decided to do.
- 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
- 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
- 20 GB/s has a chain: popcnt-popcnt → next iteration
- 26 GB/s has a chain: popcnt-popcnt → next iteration
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.

To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.
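In plain C++ the accumulator splitting looks roughly like this sketch (for illustration only; it uses __builtin_popcountll instead of inline assembly, so the compiler still picks the registers, which is exactly the part the inline-assembly tests pin down):

```cpp
#include <cstdint>

// Four independent accumulators: even if each popcnt falsely depends on its
// destination register, four separate chains can overlap in the out-of-order
// core instead of serializing into one long chain.
uint64_t popcount_split(const uint64_t* buf, uint64_t n) {  // n: multiple of 4
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i < n; i += 4) {
        c0 += static_cast<uint64_t>(__builtin_popcountll(buf[i + 0]));
        c1 += static_cast<uint64_t>(__builtin_popcountll(buf[i + 1]));
        c2 += static_cast<uint64_t>(__builtin_popcountll(buf[i + 2]));
        c3 += static_cast<uint64_t>(__builtin_popcountll(buf[i + 3]));
    }
    return c0 + c1 + c2 + c3;
}
```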
Here are the results:

Sandy Bridge Xeon @ 3.5 GHz (full test code can be found at the bottom):

- GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
- Ubuntu 12
Different Registers: 18.6195 GB/s

.L4:
movq    (%rbx,%rax,8), %r8
movq    8(%rbx,%rax,8), %r9
movq    16(%rbx,%rax,8), %r10
movq    24(%rbx,%rax,8), %r11
addq    $4, %rax

popcnt  %r8, %r8
add     %r8, %rdx
popcnt  %r9, %r9
add     %r9, %rcx
popcnt  %r10, %r10
add     %r10, %rdi
popcnt  %r11, %r11
add     %r11, %rsi

cmpq    $131072, %rax
jne     .L4
Same Register: 8.49272 GB/s

.L9:
movq    (%rbx,%rdx,8), %r9
movq    8(%rbx,%rdx,8), %r10
movq    16(%rbx,%rdx,8), %r11
movq    24(%rbx,%rdx,8), %rbp
addq    $4, %rdx

# This time reuse "rax" for all the popcnts.
popcnt  %r9, %rax
add     %rax, %rcx
popcnt  %r10, %rax
add     %rax, %rsi
popcnt  %r11, %rax
add     %rax, %r8
popcnt  %rbp, %rax
add     %rax, %rdi

cmpq    $131072, %rdx
jne     .L9
Same Register with broken chain: 17.8869 GB/s

.L14:
movq    (%rbx,%rdx,8), %r9
movq    8(%rbx,%rdx,8), %r10
movq    16(%rbx,%rdx,8), %r11
movq    24(%rbx,%rdx,8), %rbp
addq    $4, %rdx

# Reuse "rax" for all the popcnts.
xor     %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
popcnt  %r9, %rax
add     %rax, %rcx
popcnt  %r10, %rax
add     %rax, %rsi
popcnt  %r11, %rax
add     %rax, %r8
popcnt  %rbp, %rax
add     %rax, %rdi

cmpq    $131072, %rdx
jne     .L14
So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of them.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.
(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
Why does the CPU have such a false dependency?

We can speculate: it runs on the same execution unit as bsf/bsr, which do have an output dependency (How is POPCNT implemented in hardware? https://stackoverflow.com/questions/28802692/how-is-popcnt-implemented-in-hardware). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.

AMD processors do not appear to have this false dependency.
The full test code is below for reference:
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>

int main(int argc, char* argv[]) {

    using namespace std;

    uint64_t size = 1<<20;

    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; i++)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            for (uint64_t i=0; i<size/8; i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %4  \n\t"
                    "add %4, %0     \n\t"
                    "popcnt %5, %5  \n\t"
                    "add %5, %1     \n\t"
                    "popcnt %6, %6  \n\t"
                    "add %6, %2     \n\t"
                    "popcnt %7, %7  \n\t"
                    "add %7, %3     \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();
        cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            for (uint64_t i=0; i<size/8; i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();
        cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            for (uint64_t i=0; i<size/8; i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();
        cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // buffer came from new[], so free() would be undefined behaviour
}
An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnts that are in the (false) dependency chain.

False Chain 0:  41959360000  0.57748 sec   18.1578 GB/s
False Chain 1:  41959360000  0.585398 sec  17.9122 GB/s
False Chain 2:  41959360000  0.645483 sec  16.2448 GB/s
False Chain 3:  41959360000  0.929718 sec  11.2784 GB/s
False Chain 4:  41959360000  1.23572 sec   8.48557 GB/s
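These numbers fit a rough latency model (a back-of-envelope sketch only; the 3-cycle popcnt latency and the 3.5 GHz clock are assumed figures, taken from common instruction tables and the test machine above). With a fully serialized chain, the loop retires roughly one 8-byte popcnt result per popcnt latency:

```cpp
// Rough throughput estimate for a fully serialized popcnt chain.
// clock_hz and latency_cycles are assumed inputs, not measured values.
double serialized_popcnt_gbps(double clock_hz, double latency_cycles) {
    const double bytes_per_popcnt = 8.0;  // one 64-bit word per popcnt
    return bytes_per_popcnt * clock_hz / latency_cycles / 1e9;
}
```

With clock_hz = 3.5e9 and latency_cycles = 3, this gives about 9.3 GB/s, the same ballpark as the 8.49 GB/s measured for "False Chain 4"; the unchained 18 GB/s runs are limited by popcnt throughput instead.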