Monday 15 January 2018

c++ - Why does this loop take 1.32 cycles per iteration

Consider this simple C++ function to calculate the href="https://en.wikipedia.org/wiki/Prefix_sum" rel="noreferrer">prefix sum
of an array:



void prefix_sum(const uint32_t* input, uint32_t*
output, size_t size) {
uint32_t total = 0;
for (size_t i = 0; i
< size; i++) {

total += input[i];
output[i] =
total;

}
}


The loop
compiles to the
following assembly on gcc
5.5:



.L5:
add ecx,
DWORD PTR [rdi+rax*4]

mov DWORD PTR [rsi+rax*4], ecx

add rax, 1
cmp rdx, rax
jne
.L5


I don't see
anything that would prevent this from running at 1 cycle per iteration, yet I
consistently measure it at 1.32 (+/- 0.01) cycles/iteration on my Skylake i7-6700HQ,
when running it against 8 KiB input/output
arrays.



The loop is served out of the uop cache
and doesn't cross any uop cache boundary and performance counters don't indicate any
front-end bottleneck.




It's 4 fused
uops1, and this CPU can sustain 4 fused
ops/cycle.



There are carried dependency chains
through ecx and rax, each of 1 cycle,
but these add uops can go to any of the 4 ALU ports, so seem
unlikely to conflict. The fused cmp needs to go to p6 which is
more of a concern, but I measure only 1.1 uops/iteration to p6. That would explain 1.1
cycles per iteration, but not 1.4. If I unroll the loop by 2x port pressure is much
lower: less than 0.7 uops to all of p0156, yet performance is still unexpectedly slow at
1.3 cycles per iteration.



There is one store per
iteration, but we can do one store per
cycle.



There is one load per iteration, but we
can do two of those per cycle.



There are two
complex AGUs per cycle, but we can do two of those per
cycle.




What's the bottleneck
here?



Interestingly I tried the href="http://3.18.198.23/predict" rel="noreferrer">Ithermal performance
predictor and it gets it almost exactly right: estimating 1.314 cycles versus
my measurement of 1.32.



/>

1 I confirmed macro and
micro-fusion fusion via the uops_issued.any counter which
counts in the fused domain and reads 4.0 fused uops per iteration for this
loop.

No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...