Saturday, 25 November 2017

c++ - Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

itemprop="text">

I was looking for the fastest way to
popcount large arrays of data. I encountered a very
weird
effect: Changing the loop variable from
unsigned to uint64_t made the
performance drop by 50% on my PC.



The
Benchmark



#include


#include
#include


int main(int argc, char* argv[])
{

using namespace std;
if (argc != 2) {
cerr
<< "usage: array_size in MB" << endl;
return -1;

}


uint64_t size = atol(argv[1])<<20;

uint64_t* buffer = new uint64_t[size/8];
char* charbuffer =
reinterpret_cast(buffer);
for (unsigned i=0; i ++i)
charbuffer[i] = rand()%256;

uint64_t
count,duration;
chrono::time_point
startP,endP;
{

startP =
chrono::system_clock::now();
count = 0;
for( unsigned k = 0; k
< 10000; k++){
// Tight unrolled loop with unsigned
for
(unsigned i=0; i count +=
_mm_popcnt_u64(buffer[i]);
count += _mm_popcnt_u64(buffer[i+1]);

count += _mm_popcnt_u64(buffer[i+2]);
count +=
_mm_popcnt_u64(buffer[i+3]);
}

}
endP =
chrono::system_clock::now();
duration =
chrono::duration_cast(endP-startP).count();

cout << "unsigned\t" << count << '\t' << (duration/1.0E9)
<< " sec \t"
<< (10000.0*size)/(duration) << " GB/s"
<< endl;
}
{
startP =
chrono::system_clock::now();
count=0;
for( unsigned k = 0; k <
10000; k++){

// Tight unrolled loop with uint64_t
for
(uint64_t i=0;i count +=
_mm_popcnt_u64(buffer[i]);
count += _mm_popcnt_u64(buffer[i+1]);

count += _mm_popcnt_u64(buffer[i+2]);
count +=
_mm_popcnt_u64(buffer[i+3]);
}
}
endP =
chrono::system_clock::now();
duration =
chrono::duration_cast(endP-startP).count();


cout << "uint64_t\t" << count << '\t' << (duration/1.0E9)
<< " sec \t"
<< (10000.0*size)/(duration) << " GB/s"
<< endl;
}


free(charbuffer);
}


As
you see, we create a buffer of random data, with the size being
x megabytes where x is read from the
command line. Afterwards, we iterate over the buffer and use an unrolled version of the
x86 popcount intrinsic to perform the popcount. To get a more
precise result, we do the popcount 10,000 times. We measure the times for the popcount.
In the upper case, the inner loop variable is unsigned, in the
lower case, the inner loop variable is uint64_t. I thought that
this should make no difference, but the opposite is the
case.




The (absolutely crazy)
results



I compile it like this (g++ version:
Ubuntu 4.8.2-19ubuntu1):



g++ -O3
-march=native -std=c++11 test.cpp -o
test


Here are the
results on my rel="noreferrer">Haswell href="http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29#Desktop_processors"
rel="noreferrer">Core i7-4770K CPU @ 3.50 GHz, running test
1
(so 1 MB random
data):





  • unsigned
    41959360000 0.401554 sec
    26.113 GB/s

  • uint64_t
    41959360000 0.759822 sec
    13.8003 GB/s



As
you see, the throughput of the uint64_t version is
only half the one of the unsigned
version! The problem seems to be that different assembly gets generated, but why? First,
I thought of a compiler bug, so I tried clang++ (Ubuntu href="http://en.wikipedia.org/wiki/Clang" rel="noreferrer">Clang version
3.4-1ubuntu3):



clang++ -O3
-march=native -std=c++11 teest.cpp -o
test


Result:
test
1





  • unsigned
    41959360000 0.398293 sec 26.3267
    GB/s

  • uint64_t 41959360000 0.680954 sec
    15.3986
    GB/s



So,
it is almost the same result and is still strange. But now it gets super
strange.
I replace the buffer size that was read from input with a constant
1, so I
change:



uint64_t size =
atol(argv[1]) <<
20;



to



uint64_t
size = 1 <<
20;


Thus, the compiler
now knows the buffer size at compile time. Maybe it can add some optimizations! Here are
the numbers for
g++:




  • unsigned
    41959360000 0.509156 sec
    20.5944 GB/s


  • uint64_t
    41959360000 0.508673 sec
    20.6139 GB/s



Now,
both versions are equally fast. However, the unsigned
got even slower! It dropped from
26 to 20 GB/s, thus replacing a
non-constant by a constant value lead to a deoptimization.
Seriously, I have no clue what is going on here! But now to
clang++ with the new
version:




  • unsigned
    41959360000 0.677009 sec
    15.4884 GB/s

  • uint64_t
    41959360000 0.676909 sec
    15.4906 GB/s




Wait,
what?
Now, both versions dropped to the slow
number of 15 GB/s. Thus, replacing a non-constant by a constant value even lead to slow
code in both cases for
Clang!



I asked a colleague with an href="http://en.wikipedia.org/wiki/Ivy_Bridge_%28microarchitecture%29"
rel="noreferrer">Ivy Bridge CPU to compile my benchmark. He got similar
results, so it does not seem to be Haswell. Because two compilers produce strange
results here, it also does not seem to be a compiler bug. We do not have an AMD CPU
here, so we could only test with Intel.



More
madness, please!



Take the first example (the
one with atol(argv[1])) and put a
static before the variable,
i.e.:



static uint64_t
size=atol(argv[1])<<20;



Here
are my results in
g++:




  • unsigned 41959360000
    0.396728 sec 26.4306
    GB/s

  • uint64_t 41959360000 0.509484 sec
    20.5811
    GB/s



Yay,
yet another alternative
. We still have the fast 26 GB/s with
u32, but we managed to get u64 at
least from the 13 GB/s to the 20 GB/s version! On my collegue's PC, the
u64 version became even faster than the
u32 version, yielding the fastest result of all.

Sadly, this only works for g++,
clang++ does not seem to care about
static.




My
question



Can you explain these results?
Especially:




  • How can there
    be such a difference between u32 and
    u64?

  • How can replacing a
    non-constant by a constant buffer size trigger less optimal
    code
    ?

  • How can the insertion of the
    static keyword make the u64 loop
    faster? Even faster than the original code on my collegue's
    computer!




I
know that optimization is a tricky territory, however, I never thought that such small
changes can lead to a 100% difference in execution time and
that small factors like a constant buffer size can again mix results totally. Of course,
I always want to have the version that is able to popcount 26 GB/s. The only reliable
way I can think of is copy paste the assembly for this case and use inline assembly.
This is the only way I can get rid of compilers that seem to go mad on small changes.
What do you think? Is there another way to reliably get the code with most
performance?



The
Disassembly



Here is the disassembly for the
various results:



26 GB/s version from
g++ / u32 / non-const
bufsize
:



0x400af8:
lea
0x1(%rdx),%eax

popcnt (%rbx,%rax,8),%r9
lea
0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea
0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add
$0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add
%rcx,%rax

mov %edx,%ecx
add %rax,%r14
cmp
%rbp,%rcx
jb
0x400af8


13 GB/s
version from g++ / u64 / non-const
bufsize
:



0x400c00:
popcnt
0x8(%rbx,%rdx,8),%rcx

popcnt (%rbx,%rdx,8),%rax
add
%rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add
%rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add
%rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb
0x400c00



15 GB/s
version from clang++ / u64 / non-const
bufsize
:



0x400e50:
popcnt
(%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt
0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt
0x10(%r15,%rcx,8),%rdx

add %rsi,%rdx
popcnt
0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp
%rbp,%rcx
jb
0x400e50


20 GB/s
version from g++ / u32&u64 / const
bufsize
:




0x400a68:
popcnt
(%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add
%rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add
%rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add
%rsi,%rcx
add %rcx,%rbp

cmp $0x100000,%rdx
jne
0x400a68


15 GB/s
version from clang++ / u32&u64 / const
bufsize
:



0x400dd0:
popcnt
(%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt
0x8(%r14,%rcx,8),%rsi

add %rdx,%rsi
popcnt
0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt
0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp
$0x20000,%rcx
jb
0x400dd0



Interestingly,
the fastest (26 GB/s) version is also the longest! It seems to be the only solution that
uses lea. Some versions use jb to
jump, others use jne. But apart from that, all versions seem to
be comparable. I don't see where a 100% performance gap could originate from, but I am
not too adept at deciphering assembly. The slowest (13 GB/s) version looks even very
short and good. Can anyone explain
this?



Lessons
learned



No matter what the answer to this
question will be; I have learned that in really hot loops every
detail can matter, even details that do not seem to have any association to
the hot code
. I have never thought about what type to use for a loop
variable, but as you see such a minor change can make a 100%
difference! Even the storage type of a buffer can make a huge difference, as we saw with
the insertion of the static keyword in front of the size
variable! In the future, I will always test various alternatives on various compilers
when writing really tight and hot loops that are crucial for system
performance.



The interesting thing is also that
the performance difference is still so high although I have already unrolled the loop
four times. So even if you unroll, you can still get hit by major performance
deviations. Quite interesting.


class="post-text" itemprop="text">
class="normal">Answer




Culprit: False Data
Dependency
(and the compiler isn't even aware of
it)




On Sandy/Ivy Bridge and Haswell
processors, the
instruction:



popcnt src,
dest


appears to have a
false dependency on the destination register dest. Even though
the instruction only writes to it, the instruction will wait until
dest is ready before executing. This false dependency is (now)
documented by Intel as erratum href="https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf"
rel="noreferrer">HSD146 (Haswell) and href="https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf"
rel="noreferrer">SKL029
(Skylake)



href="https://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter">Skylake
fixed this for lzcnt and
tzcnt.
Cannon Lake (and Ice Lake) fixed
this for popcnt. />bsf/bsr have a true output
dependency: output unmodified for input=0. (But href="https://stackoverflow.com/questions/41351564/vs-unexpected-optimization-behavior-with-bitscanreverse64-intrinsic/41352456#41352456">no
way to take advantage of that with intrinsics - only AMD documents it and
compilers don't expose it.)



(Yes, these
instructions all run href="https://stackoverflow.com/questions/28802692/how-is-popcnt-implemented-in-hardware">on
the same execution unit).




/>

This dependency doesn't just hold up the 4
popcnts from a single loop iteration. It can carry across loop
iterations making it impossible for the processor to parallelize different loop
iterations.



The
unsigned vs. uint64_t and other tweaks
don't directly affect the problem. But they influence the register allocator which
assigns the registers to the variables.



In your
case, the speeds are a direct result of what is stuck to the (false) dependency chain
depending on what the register allocator decided to
do.





  • 13 GB/s has
    a chain:
    popcnt-add-popcnt-popcnt
    → next iteration

  • 15 GB/s has a chain:
    popcnt-add-popcnt-add
    → next iteration

  • 20 GB/s has a chain:
    popcnt-popcnt → next
    iteration

  • 26 GB/s has a chain:
    popcnt-popcnt → next
    iteration



The difference
between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing.
Either way, the processor starts to hit other bottlenecks once you reach this
speed.



/>


To test this, I used inline assembly to
bypass the compiler and get exactly the assembly I want. I also split up the
count variable to break all other dependencies that might mess
with the benchmarks.



Here are the
results:



Sandy Bridge Xeon @ 3.5
GHz:
(full test code can be found at the
bottom)




  • GCC 4.6.3:
    g++ popcnt.cpp -std=c++0x -O3 -save-temps
    -march=native

  • Ubuntu
    12




Different
Registers: 18.6195
GB/s



.L4:

movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq
16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4,
%rax


popcnt %r8, %r8
add %r8,
%rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10,
%r10
add %r10, %rdi
popcnt %r11, %r11
add %r11,
%rsi

cmpq $131072, %rax

jne
.L4


Same Register:
8.49272
GB/s



.L9:

movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq
16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp

addq
$4, %rdx

# This time reuse "rax" for all the popcnts.

popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add
%rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp,
%rax

add %rax, %rdi

cmpq $131072,
%rdx
jne
.L9


Same Register with
broken chain: 17.8869
GB/s



.L14:

movq (%rbx,%rdx,8), %r9

movq 8(%rbx,%rdx,8), %r10
movq
16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4,
%rdx

# Reuse "rax" for all the popcnts.
xor %rax, %rax
# Break the cross-iteration dependency by zeroing "rax".
popcnt %r9,
%rax
add %rax, %rcx
popcnt %r10, %rax

add
%rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp,
%rax
add %rax, %rdi

cmpq $131072, %rdx
jne
.L14



/>

So what went wrong with the
compiler?



It seems that neither
GCC nor Visual Studio are aware that popcnt has such a false
dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter
of whether the compiler is aware of
it.



popcnt isn't
exactly the most used instruction. So it's not really a surprise that a major compiler
could miss something like this. There also appears to be no documentation anywhere that
mentions this problem. If Intel doesn't disclose it, then nobody outside will know until
someone runs into it by
chance.



(Update:
rel="noreferrer">As of version 4.9.2, GCC is aware of this false-dependency
and generates code to compensate it when optimizations are enabled. Major compilers from
other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this
microarchitectural erratum and will not emit code that compensates for
it.)




Why does the CPU
have such a false dependency?



We
can speculate: it runs on the same execution unit as bsf /
bsr which do have an output dependency.
( href="https://stackoverflow.com/questions/28802692/how-is-popcnt-implemented-in-hardware">How
is POPCNT implemented in hardware?). For those instructions, Intel documents
the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually
gives a stronger guarantee to avoid breaking old software: output unmodified. AMD
documents this behaviour.



Presumably it was
somehow inconvenient to make some uops for this execution unit dependent on the output
but others not.



AMD processors do not appear to
have this false dependency.



/>


The full test code is below for
reference:



#include

#include
#include


int main(int argc, char* argv[])
{

using namespace std;
uint64_t
size=1<<20;


uint64_t* buffer = new
uint64_t[size/8];
char*
charbuffer=reinterpret_cast(buffer);
for (unsigned
i=0;i
uint64_t
count,duration;
chrono::time_point
startP,endP;
{
uint64_t c0 = 0;
uint64_t c1 =
0;

uint64_t c2 = 0;
uint64_t c3 = 0;
startP
= chrono::system_clock::now();
for( unsigned k = 0; k < 10000;
k++){
for (uint64_t i=0;i uint64_t r0 = buffer[i
+ 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i +
2];
uint64_t r3 = buffer[i + 3];
__asm__(


"popcnt %4, %4 \n\t"
"add %4, %0 \n\t"
"popcnt %5, %5
\n\t"
"add %5, %1 \n\t"
"popcnt %6, %6 \n\t"
"add %6,
%2 \n\t"
"popcnt %7, %7 \n\t"
"add %7, %3 \n\t"
: "+r"
(c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r"
(r3)

);
}
}
count = c0 + c1 + c2
+ c3;
endP = chrono::system_clock::now();

duration=chrono::duration_cast(endP-startP).count();

cout << "No Chain\t" << count << '\t' << (duration/1.0E9)
<< " sec \t"
<< (10000.0*size)/(duration) << " GB/s"
<< endl;
}
{

uint64_t c0 =
0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 =
0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k
< 10000; k++){
for (uint64_t i=0;i uint64_t
r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 =
buffer[i + 2];

uint64_t r3 = buffer[i + 3];

__asm__(
"popcnt %4, %%rax \n\t"
"add %%rax, %0 \n\t"

"popcnt %5, %%rax \n\t"
"add %%rax, %1 \n\t"
"popcnt %6, %%rax
\n\t"
"add %%rax, %2 \n\t"
"popcnt %7, %%rax \n\t"
"add
%%rax, %3 \n\t"

: "+r" (c0), "+r" (c1), "+r" (c2), "+r"
(c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
: "rax"

);
}
}
count = c0 + c1 + c2 + c3;
endP =
chrono::system_clock::now();

duration=chrono::duration_cast(endP-startP).count();

cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9)
<< " sec \t"

<< (10000.0*size)/(duration) << "
GB/s" << endl;
}
{
uint64_t c0 = 0;

uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;

startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000;
k++){
for (uint64_t i=0;i
uint64_t r0
= buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 =
buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(

"xor %%rax, %%rax \n\t" // <--- Break the chain.
"popcnt %4, %%rax
\n\t"
"add %%rax, %0 \n\t"
"popcnt %5, %%rax \n\t"
"add
%%rax, %1 \n\t"

"popcnt %6, %%rax \n\t"
"add %%rax, %2
\n\t"
"popcnt %7, %%rax \n\t"
"add %%rax, %3 \n\t"
:
"+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2),
"r" (r3)
: "rax"
);
}

}

count = c0 + c1 + c2 + c3;
endP =
chrono::system_clock::now();

duration=chrono::duration_cast(endP-startP).count();

cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9)
<< " sec \t"
<< (10000.0*size)/(duration) << " GB/s"
<< endl;
}


free(charbuffer);
}



/>

An equally interesting benchmark can be found here:
rel="noreferrer">http://pastebin.com/kbzgL8si

This
benchmark varies the number of popcnts that are in the (false)
dependency chain.



False Chain 0:
41959360000 0.57748 sec 18.1578 GB/s
False Chain 1: 41959360000 0.585398 sec
17.9122 GB/s
False Chain 2: 41959360000 0.645483 sec 16.2448
GB/s
False Chain 3: 41959360000 0.929718 sec 11.2784
GB/s

False Chain 4: 41959360000 1.23572 sec 8.48557
GB/s

No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print ...