Saturday 25 November 2017

c++ - Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs


I was looking for the fastest way to
popcount large arrays of data. I encountered a very
weird
effect: Changing the loop variable from
unsigned to uint64_t made the
performance drop by 50% on my PC.



The Benchmark



#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[])
{
    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1])<<20;

    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();

        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();

        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    free(charbuffer);
}


As
you see, we create a buffer of random data, with the size being
x megabytes where x is read from the
command line. Afterwards, we iterate over the buffer and use an unrolled version of the
x86 popcount intrinsic to perform the popcount. To get a more
precise result, we do the popcount 10,000 times. We measure the times for the popcount.
In the upper case, the inner loop variable is unsigned, in the
lower case, the inner loop variable is uint64_t. I thought that
this should make no difference, but the opposite is the
case.




The (absolutely crazy) results



I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test


Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):





  • unsigned 41959360000 0.401554 sec 26.113 GB/s

  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s



As you see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):



clang++ -O3 -march=native -std=c++11 test.cpp -o test


Result: test 1





  • unsigned 41959360000 0.398293 sec 26.3267 GB/s

  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s



So,
it is almost the same result and is still strange. But now it gets super
strange.
I replace the buffer size that was read from input with a constant
1, so I
change:



uint64_t size = atol(argv[1]) << 20;



to



uint64_t size = 1 << 20;


Thus, the compiler
now knows the buffer size at compile time. Maybe it can add some optimizations! Here are
the numbers for
g++:




  • unsigned 41959360000 0.509156 sec 20.5944 GB/s

  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s



Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:




  • unsigned 41959360000 0.677009 sec 15.4884 GB/s

  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s




Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even led to slow code in both cases for Clang!



I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.



More madness, please!



Take the first example (the
one with atol(argv[1])) and put a
static before the variable,
i.e.:



static uint64_t size=atol(argv[1])<<20;



Here
are my results in
g++:




  • unsigned 41959360000 0.396728 sec 26.4306 GB/s

  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s



Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all.

Sadly, this only works for g++; clang++ does not seem to care about static.




My question



Can you explain these results?
Especially:




  • How can there be such a difference between u32 and u64?

  • How can replacing a non-constant by a constant buffer size trigger less optimal code?

  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!




I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could again mix up the results totally. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?



The Disassembly



Here is the disassembly for the
various results:



26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea     0x1(%rdx),%eax
popcnt  (%rbx,%rax,8),%r9
lea     0x2(%rdx),%edi
popcnt  (%rbx,%rcx,8),%rax
lea     0x3(%rdx),%esi
add     %r9,%rax
popcnt  (%rbx,%rdi,8),%rcx
add     $0x4,%edx
add     %rcx,%rax
popcnt  (%rbx,%rsi,8),%rcx
add     %rcx,%rax
mov     %edx,%ecx
add     %rax,%r14
cmp     %rbp,%rcx
jb      0x400af8


13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt  0x8(%rbx,%rdx,8),%rcx
popcnt  (%rbx,%rdx,8),%rax
add     %rcx,%rax
popcnt  0x10(%rbx,%rdx,8),%rcx
add     %rcx,%rax
popcnt  0x18(%rbx,%rdx,8),%rcx
add     $0x4,%rdx
add     %rcx,%rax
add     %rax,%r12
cmp     %rbp,%rdx
jb      0x400c00



15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt  (%r15,%rcx,8),%rdx
add     %rbx,%rdx
popcnt  0x8(%r15,%rcx,8),%rsi
add     %rdx,%rsi
popcnt  0x10(%r15,%rcx,8),%rdx
add     %rsi,%rdx
popcnt  0x18(%r15,%rcx,8),%rbx
add     %rdx,%rbx
add     $0x4,%rcx
cmp     %rbp,%rcx
jb      0x400e50


20 GB/s version from g++ / u32&u64 / const bufsize:

0x400a68:
popcnt  (%rbx,%rdx,1),%rax
popcnt  0x8(%rbx,%rdx,1),%rcx
add     %rax,%rcx
popcnt  0x10(%rbx,%rdx,1),%rax
add     %rax,%rcx
popcnt  0x18(%rbx,%rdx,1),%rsi
add     $0x20,%rdx
add     %rsi,%rcx
add     %rcx,%rbp
cmp     $0x100000,%rdx
jne     0x400a68


15 GB/s version from clang++ / u32&u64 / const bufsize:

0x400dd0:
popcnt  (%r14,%rcx,8),%rdx
add     %rbx,%rdx
popcnt  0x8(%r14,%rcx,8),%rsi
add     %rdx,%rsi
popcnt  0x10(%r14,%rcx,8),%rdx
add     %rsi,%rdx
popcnt  0x18(%r14,%rcx,8),%rbx
add     %rdx,%rbx
add     $0x4,%rcx
cmp     $0x20000,%rcx
jb      0x400dd0



Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?



Lessons learned



No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.



The interesting thing is also that
the performance difference is still so high although I have already unrolled the loop
four times. So even if you unroll, you can still get hit by major performance
deviations. Quite interesting.


Answer




Culprit: False Data Dependency (and the compiler isn't even aware of it)




On Sandy/Ivy Bridge and Haswell
processors, the
instruction:



popcnt src, dest


appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).



Skylake fixed this for lzcnt and tzcnt. Cannon Lake (and Ice Lake) fixed this for popcnt. bsf/bsr have a true output dependency: output unmodified for input=0. (But there is no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)



(Yes, these instructions all run on the same execution unit.)





This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.



The
unsigned vs. uint64_t and other tweaks
don't directly affect the problem. But they influence the register allocator which
assigns the registers to the variables.



In your
case, the speeds are a direct result of what is stuck to the (false) dependency chain
depending on what the register allocator decided to
do.





  • 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration

  • 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration

  • 20 GB/s has a chain: popcnt-popcnt → next iteration

  • 26 GB/s has a chain: popcnt-popcnt → next iteration


The difference
between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing.
Either way, the processor starts to hit other bottlenecks once you reach this
speed.





To test this, I used inline assembly to
bypass the compiler and get exactly the assembly I want. I also split up the
count variable to break all other dependencies that might mess
with the benchmarks.



Here are the
results:



Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)




  • GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native

  • Ubuntu 12




Different Registers: 18.6195 GB/s



.L4:
movq    (%rbx,%rax,8), %r8
movq    8(%rbx,%rax,8), %r9
movq    16(%rbx,%rax,8), %r10
movq    24(%rbx,%rax,8), %r11
addq    $4, %rax

popcnt  %r8, %r8
add     %r8, %rdx
popcnt  %r9, %r9
add     %r9, %rcx
popcnt  %r10, %r10
add     %r10, %rdi
popcnt  %r11, %r11
add     %r11, %rsi

cmpq    $131072, %rax
jne     .L4


Same Register: 8.49272 GB/s



.L9:
movq    (%rbx,%rdx,8), %r9
movq    8(%rbx,%rdx,8), %r10
movq    16(%rbx,%rdx,8), %r11
movq    24(%rbx,%rdx,8), %rbp
addq    $4, %rdx

# This time reuse "rax" for all the popcnts.
popcnt  %r9, %rax
add     %rax, %rcx
popcnt  %r10, %rax
add     %rax, %rsi
popcnt  %r11, %rax
add     %rax, %r8
popcnt  %rbp, %rax
add     %rax, %rdi

cmpq    $131072, %rdx
jne     .L9


Same Register with broken chain: 17.8869 GB/s



.L14:
movq    (%rbx,%rdx,8), %r9
movq    8(%rbx,%rdx,8), %r10
movq    16(%rbx,%rdx,8), %r11
movq    24(%rbx,%rdx,8), %rbp
addq    $4, %rdx

# Reuse "rax" for all the popcnts.
# Break the cross-iteration dependency by zeroing "rax".
xor     %rax, %rax
popcnt  %r9, %rax
add     %rax, %rcx
popcnt  %r10, %rax
add     %rax, %rsi
popcnt  %r11, %rax
add     %rax, %r8
popcnt  %rbp, %rax
add     %rax, %rdi

cmpq    $131072, %rdx
jne     .L14




So what went wrong with the compiler?



It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of them.



popcnt isn't
exactly the most used instruction. So it's not really a surprise that a major compiler
could miss something like this. There also appears to be no documentation anywhere that
mentions this problem. If Intel doesn't disclose it, then nobody outside will know until
someone runs into it by
chance.



(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)




Why does the CPU have such a false dependency?



We can speculate: it runs on the same execution unit as bsf / bsr, which do have an output dependency (How is POPCNT implemented in hardware?). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.



Presumably it was
somehow inconvenient to make some uops for this execution unit dependent on the output
but others not.



AMD processors do not appear to
have this false dependency.





The full test code is below for
reference:



#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[])
{
    using namespace std;
    uint64_t size=1<<20;

    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer=reinterpret_cast<char*>(buffer);
    for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            for (uint64_t i=0;i<size/8;i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %4  \n\t"
                    "add %4, %0     \n\t"
                    "popcnt %5, %5  \n\t"
                    "add %5, %1     \n\t"
                    "popcnt %6, %6  \n\t"
                    "add %6, %2     \n\t"
                    "popcnt %7, %7  \n\t"
                    "add %7, %3     \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration=chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();

        cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            for (uint64_t i=0;i<size/8;i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration=chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();

        cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            for (uint64_t i=0;i<size/8;i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration=chrono::duration_cast<chrono::nanoseconds>(endP-startP).count();

        cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    free(charbuffer);
}




An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si

This
benchmark varies the number of popcnts that are in the (false)
dependency chain.



False Chain 0:  41959360000  0.57748 sec   18.1578 GB/s
False Chain 1:  41959360000  0.585398 sec  17.9122 GB/s
False Chain 2:  41959360000  0.645483 sec  16.2448 GB/s
False Chain 3:  41959360000  0.929718 sec  11.2784 GB/s
False Chain 4:  41959360000  1.23572 sec    8.48557 GB/s
