performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

Monday, 27 November 2017

performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

This is related, but not the same, as this question: href="https://stackoverflow.com/q/18113995/3403507">Performance optimisations of
x86-64 assembly - Alignment and branch prediction and is slightly related to
my previous question: href="https://stackoverflow.com/questions/26798761/unsigned-64-bit-to-double-conversion-why-this-algorithm-from-g">Unsigned
64-bit to double conversion: why this algorithm from
g++

The following is a
not real-world test case. This primality testing algorithm
is not sensible. I suspect any real-world algorithm would never
execute such a small inner-loop quite so many times (num is a
prime of size about 2**50). In
C++11:

using nt = unsigned long
            long;
bool is_prime_float(nt num)
{
 for (nt n=2;
            n<=sqrt(num); ++n) {

 if ( (num%n)==0 ) { return false;
            }
 }
 return
            true;
}

Then
g++ -std=c++11 -O3 -S produces the following, with RCX
containing n and XMM6 containing
sqrt(num). See my previous post for the remaining code (which
is never executed in this example, as RCX never becomes large enough to be treated as a
signed negative).

jmp
            .L20
.p2align 4,,10

.L37:
pxor %xmm0,
            %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36
            // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx,
            %rax
divq %rcx
testq %rdx, %rdx

je .L30 //
            Failed divisibility test
addq $1, %rcx
jns .L37
// Further
            code to deal with case when ucomisd can't be
            used

I time this using
std::chrono::steady_clock. I kept getting weird performance
changes: from just adding or deleting other code. I eventually tracked this down to an
alignment issue. The command .p2align 4,,10 tried to align to a
2**4=16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to
balance between alignment and code size.

I wrote
a Python script to replace .p2align 4,,10 by a manually
controlled number of nop instructions. The following scatter
plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the
x-axis:

src="https://i.stack.imgur.com/cyip1.png" alt="Scatter
plot">

From objdump
with no padding, the pxor instruction will occur at offset 0x402f5f. Running on a
laptop, Sandybridge i5-3210m, turboboost disabled, I found
that

For 0 byte padding,
slow performance (0.42 secs)

For 1-4 bytes padding
(offset 0x402f60 to 0x402f63) get slightly better (0.41s, visible on the
plot).

For 5-20 bytes padding (offset 0x402f64 to
0x402f73) get fast performance (0.37s)

For 21-32 bytes
padding (offset 0x402f74 to 0x402f7f) slow performance (0.42
secs)

Then cycles on a 32 byte
sample

So a
16-byte alignment doesn't give the best performance-- it puts us in the slightly better
(or just less variation, from the scatter plot) region. Alignment of 32 plus 4 to 19
gives the best performance.

Why am I seeing this performance difference? Why does this seem to violate the
rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation
manual)

I don't
see any branch-prediction problems. Could this be a uop cache
quirk??

By changing the C++ algorithm
to cache sqrt(num) in an 64-bit integer and then make the loop
purely integer based, I remove the problem-- alignment now makes no difference at
all.

Blog

Monday, 27 November 2017

performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file