Monday, 27 November 2017

performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

This is related, but not the same, as this question: href="https://stackoverflow.com/q/18113995/3403507">Performance optimisations of
x86-64 assembly - Alignment and branch prediction and is slightly related to
my previous question: href="https://stackoverflow.com/questions/26798761/unsigned-64-bit-to-double-conversion-why-this-algorithm-from-g">Unsigned
64-bit to double conversion: why this algorithm from
g++



The following is a
not real-world test case. This primality testing algorithm
is not sensible. I suspect any real-world algorithm would never
execute such a small inner-loop quite so many times (num is a
prime of size about 2**50). In
C++11:



using nt = unsigned long
long;
bool is_prime_float(nt num)
{
for (nt n=2;
n<=sqrt(num); ++n) {

if ( (num%n)==0 ) { return false;
}
}
return
true;
}


Then
g++ -std=c++11 -O3 -S produces the following, with RCX
containing n and XMM6 containing
sqrt(num). See my previous post for the remaining code (which
is never executed in this example, as RCX never becomes large enough to be treated as a
signed negative).



jmp
.L20
.p2align 4,,10

.L37:
pxor %xmm0,
%xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36
// Exit the loop
.L20:
xorl %edx, %edx
movq %rbx,
%rax
divq %rcx
testq %rdx, %rdx

je .L30 //
Failed divisibility test
addq $1, %rcx
jns .L37
// Further
code to deal with case when ucomisd can't be
used


I time this using
std::chrono::steady_clock. I kept getting weird performance
changes: from just adding or deleting other code. I eventually tracked this down to an
alignment issue. The command .p2align 4,,10 tried to align to a
2**4=16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to
balance between alignment and code size.



I wrote
a Python script to replace .p2align 4,,10 by a manually
controlled number of nop instructions. The following scatter
plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the
x-axis:




src="https://i.stack.imgur.com/cyip1.png" alt="Scatter
plot">



From objdump
with no padding, the pxor instruction will occur at offset 0x402f5f. Running on a
laptop, Sandybridge i5-3210m, turboboost disabled, I found
that




  • For 0 byte padding,
    slow performance (0.42 secs)

  • For 1-4 bytes padding
    (offset 0x402f60 to 0x402f63) get slightly better (0.41s, visible on the
    plot).

  • For 5-20 bytes padding (offset 0x402f64 to
    0x402f73) get fast performance (0.37s)

  • For 21-32 bytes
    padding (offset 0x402f74 to 0x402f7f) slow performance (0.42
    secs)

  • Then cycles on a 32 byte
    sample




So a
16-byte alignment doesn't give the best performance-- it puts us in the slightly better
(or just less variation, from the scatter plot) region. Alignment of 32 plus 4 to 19
gives the best performance.





Why am I seeing this performance difference? Why does this seem to violate the
rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation
manual)




I don't
see any branch-prediction problems. Could this be a uop cache
quirk??




By changing the C++ algorithm
to cache sqrt(num) in an 64-bit integer and then make the loop
purely integer based, I remove the problem-- alignment now makes no difference at
all.

No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print ...