This is related to, but not the same as, this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction (https://stackoverflow.com/q/18113995/3403507), and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++ (https://stackoverflow.com/questions/26798761/unsigned-64-bit-to-double-conversion-why-this-algorithm-from-g).
The following is a
not real-world test case. This primality testing algorithm
is not sensible. I suspect any real-world algorithm would never
execute such a small inner-loop quite so many times (num
is a
prime of size about 2**50). In
C++11:
#include <cmath>

using nt = unsigned long long;

bool is_prime_float(nt num)
{
   for (nt n = 2; n <= sqrt(num); ++n) {
      if ( (num % n) == 0 ) { return false; }
   }
   return true;
}
Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).
    jmp     .L20
    .p2align 4,,10
.L37:
    pxor    %xmm0, %xmm0
    cvtsi2sdq %rcx, %xmm0
    ucomisd %xmm0, %xmm6
    jb      .L36              // Exit the loop
.L20:
    xorl    %edx, %edx
    movq    %rbx, %rax
    divq    %rcx
    testq   %rdx, %rdx
    je      .L30              // Failed divisibility test
    addq    $1, %rcx
    jns     .L37
// Further code to deal with the case when ucomisd can't be used
I time this using std::chrono::steady_clock. I kept getting weird performance changes just from adding or deleting other code, and I eventually tracked this down to an alignment issue. The directive .p2align 4,,10 tries to align to a 2**4 = 16 byte boundary, but uses at most 10 bytes of padding to do so, I guess to balance between alignment and code size.
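For reference, a minimal sketch of the kind of timing harness I mean (it assumes the is_prime_float() function and the nt alias from above are in scope; the constant is just an illustrative odd number near 2**50, not my actual input):

    #include <chrono>
    #include <iostream>

    int main()
    {
        const nt num = 1125899906842597ULL;  // placeholder candidate near 2**50

        const auto start = std::chrono::steady_clock::now();
        const bool result = is_prime_float(num);
        const auto stop = std::chrono::steady_clock::now();

        // Convert the steady_clock tick difference to seconds as a double.
        const std::chrono::duration<double> elapsed = stop - start;
        std::cout << num << (result ? " is prime" : " is composite")
                  << ", " << elapsed.count() << " s\n";
    }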
I wrote
a Python script to replace .p2align 4,,10
by a manually
controlled number of nop
instructions. The following scatter
plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the
x-axis:
src="https://i.stack.imgur.com/cyip1.png" alt="Scatter
plot">
From objdump, with no padding the pxor instruction occurs at offset 0x402f5f. Running on a laptop with a Sandy Bridge i5-3210M (turbo boost disabled), I found that:

- For 0 bytes of padding: slow performance (0.42 secs).
- For 1-4 bytes of padding (offset 0x402f60 to 0x402f63): slightly better (0.41s, visible on the plot).
- For 5-20 bytes of padding (offset 0x402f64 to 0x402f73): fast performance (0.37s).
- For 21-32 bytes of padding (offset 0x402f74 to 0x402f7f): slow performance (0.42 secs).
- The pattern then repeats over a 32-byte cycle.
So a 16-byte alignment doesn't give the best performance; it puts us in the slightly better (or, judging from the scatter plot, just lower-variance) region. The best performance comes from an alignment of 32 bytes plus an offset of 4 to 19 bytes.
Why am I seeing this performance difference? Why does this seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)? I don't see any branch-prediction problems. Could this be a uop-cache quirk?
By changing the C++ algorithm to cache sqrt(num) in a 64-bit integer and make the loop purely integer-based, I remove the problem: alignment now makes no difference at all.
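A sketch of what I mean by that variant (the exact code I used may differ; the function name is_prime_int is just a placeholder):

    #include <cmath>

    using nt = unsigned long long;

    // sqrt(num) is computed once and cached in a 64-bit integer, so the
    // hot inner loop contains only integer instructions (div/test/add).
    bool is_prime_int(nt num)
    {
        // num < 2**53, so the conversion to double is exact; the truncated
        // square root is an adequate loop bound for this test.
        const nt limit = static_cast<nt>(std::sqrt(static_cast<double>(num)));
        for (nt n = 2; n <= limit; ++n) {
            if ((num % n) == 0) { return false; }
        }
        return true;
    }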