Tuesday 31 October 2017

c - Can x86's MOV really be "free"? Why can't I reproduce this at all?

itemprop="text">

I keep seeing people claim that the
MOV instruction can be free in x86, because of register renaming.



For the life of me, I can't verify this in a
single test case. Every test case I try debunks
it.



For example, here's the code I'm compiling
with Visual
C++:




#include

#include
#include


int main(void)
{
unsigned int
k, l, j;
clock_t tstart = clock();
for (k = 0, j = 0, l = 0; j
< UINT_MAX; ++j)
{

++k;
k = j; // <--
comment out this line to remove the MOV instruction
l += j;

}
fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 /
CLOCKS_PER_SEC));
fflush(stderr);
return (int)(k + j +
l);
}



This
produces the following assembly code for the loop (feel free to produce this however you
want; you obviously don't need Visual
C++):



LOOP:
add
edi,esi
mov ebx,esi
inc esi
cmp
esi,FFFFFFFFh
jc
LOOP



Now I
run this program several times, and I observe a pretty consistent 2% difference when the
MOV instruction is
removed:



Without MOV With
MOV
1303 ms 1358 ms
1324 ms 1363 ms
1310 ms 1345
ms
1304 ms 1343 ms
1309 ms 1334 ms
1312 ms 1336
ms
1320 ms 1311 ms

1302 ms 1350 ms
1319 ms
1339 ms
1324 ms 1338
ms


So what gives? Why
isn't the MOV "free"? Is this loop too complicated for x86?
Is there a
single example out there that can demonstrate MOV being free like
people claim?
If so, what is it? And if not, why does everyone keep claiming
MOV is free?


itemprop="text">
class="normal">Answer




The throughput of the loop in
the question doesn't depend on the latency of MOV, or (on Haswell)
the benefit of not using an execution
unit.



The loop is still only 4
uops for the front-end to issue into the out-of-order core.
(mov still has to be tracked by the out-of-order core even if
it doesn't need an execution unit, but cmp/jc macro-fuses into
a single uop).




Intel CPUs since Core2
have had an issue width of 4 uops per clock, so the mov doesn't
stop it from executing at (close to) one iter per clock on Haswell. It would also run at
one per clock on Ivybridge (with mov-elimination), but not on
Sandybridge (no mov-elimination). On SnB, it would be about one iter per
1.333c cycles, bottlenecked on ALU throughput because the mov
would always need one
. (SnB/IvB have only three ALU ports, while Haswell
has four).



Note that special handling in the
rename stage has been a thing for x87 FXCHG (swap st0 with
st1) for much longer than MOV. Agner Fog lists FXCHG as 0
latency on PPro/PII/PIII (first-gen P6 core).



/>

The loop in the question has two interlocking
dependency chains (the add edi,esi depends on EDI and on the
loop counter ESI), which makes it more sensitive to imperfect scheduling. A 2% slowdown
vs. theoretical prediction because of seemingly-unrelated instructions isn't unusual,
and small variations in order of instructions can make this kind of difference. To run
at exactly 1c per iter, every cycle needs to run an INC and an ADD. Since all the INCs
and ADDs are dependent on the previous iteration, out-of-order execution can't catch up
by running two in a single cycle. Even worse, the ADD depends on the INC in the previous
cycle, which is what I meant by "interlocking", so losing a cycle in the INC dep chain
also stalls the ADD dep chain.



Also,
predicted-taken branches can only run on port6, so any cycle where port6
doesn't executed a cmp/jc is a cycle of lost throughput
. This happens
every time an INC or ADD steals a cycle on port6 instead of running on ports 0, 1, or 5.
IDK if this is the culprit, or if losing cycles in the INC/ADD dep chains themselves is
the problem, or maybe some of
both.




Adding the extra
MOV doesn't add any execution-port pressure, assuming it's eliminated 100%, but it does
stop the front-end from running ahead of the execution core
. (Only 3 of
the 4 uops in the loop need an execution unit, and your Haswell CPU can run INC and ADD
on any of its 4 ALU ports: 0, 1, 5, and 6. So the bottlenecks are:




  • the front-end max
    throughput of 4 uops per clock. (The loop without MOV is only 3 uops, so the front-end
    can run ahead).

  • taken-branch throughput of one per
    clock.

  • the dependency chain involving
    esi (INC latency of 1 per
    clock)

  • the dependency chain involving
    edi (ADD latency of 1 per clock, and also dependent on the INC
    from the previous
    iteration)




Without
the MOV, the front-end can issue the loop's three uops at 4 per clock until the
out-of-order core is full. (AFAICT, href="https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of">it
"unrolls" tiny loops in the loop-buffer (Loop Stream Detector: LSD), so a loop with ABC
uops can issue in an ABCA BCAB CABC ... pattern. The perf counter for
lsd.cycles_4_uops confirms that it mostly issues in groups of 4
when it issues any uops.)



href="https://stackoverflow.com/questions/40681331/how-are-x86-uops-scheduled-exactly">Intel
CPUs assign uops to ports as they issue into the out-of-order core. The
decision is based on counters that track how many uops for each port are already in the
scheduler (aka Reservation Station, RS). When there are lots of uops in the RS waiting
to execute, this works well and should usually avoid scheduling INC or ADD to port6. And
I guess also avoids scheduling the INC and ADD such that time is lost from either of
those dep chains. But if the RS is empty or nearly-empty, the counters won't stop an ADD
or INC from stealing a cycle on port6.



I thought
I was onto something here, but any sub-optimal scheduling should let the front-end catch
up and keep the back-end full. I don't think we should expect the front-end to cause
enough bubbles in the pipeline to explain a 2% drop below max throughput, since the tiny
loop should run from the loop buffer at a very consistent 4 per clock throughput. Maybe
there's something else going on.



/>




I used
lea to construct a loop that only has one
mov per clock, creating a perfect demonstration where
MOV-elimination succeeds 100%, or 0% of the time with mov
same,same
to demonstrate the latency bottleneck that
produces.



Since the macro-fused
dec/jnz is part of the dependency chain
involving the loop counter, imperfect scheduling can't delay it.
This is
different from the case where cmp/jc "forks off" from the
critical-path dependency chain every
iteration.



_start:
mov
ecx, 2000000000 ; each iteration decrements by 2, so this is 1G iters
align 16
; really align 32 makes more sense in case the uop-cache comes into play, but alignment
is actually irrelevant for loops that fit in the loop
buffer.
.loop:
mov eax, ecx
lea ecx, [rax-1] ; we vary
these two instructions


dec ecx ; dec/jnz macro-fuses
into one uop in the decoders, on Intel
jnz
.loop

.end:
xor edi,edi ; edi=0
mov eax,231 ;
__NR_exit_group from /usr/include/asm/unistd_64.h
syscall ;
sys_exit_group(0)



On
Intel SnB-family, LEA with one or two components in the addressing mode runs with 1c
latency (See rel="noreferrer">http://agner.org/optimize/, and other links in the href="https://stackoverflow.com/questions/tagged/x86" class="post-tag" title="show
questions tagged 'x86'" rel="tag">x86 tag
wiki).



I built and ran this as a static binary
on Linux, so user-space perf-counters for the whole process are measuring just the loop
with negligible startup / shutdown overhead. (perf stat is
really easy compared to putting perf-counter queries into the program
itself)



$ yasm -felf64
-Worphan-labels -gdwarf2 mov-elimination.asm && ld -o mov-elimination
mov-elimination.o &&
objdump -Mintel -drwC mov-elimination
&&
taskset -c 1 ocperf.py stat
-etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread
-r2 ./mov-elimination

Disassembly of section
.text:


00000000004000b0 <_start>:

4000b0: b9 00 94 35 77 mov ecx,0x77359400
4000b5: 66 66 2e 0f 1f 84 00 00 00
00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000c0
<_start.loop>:
4000c0: 89 c8 mov eax,ecx
4000c2: 8d 48 ff
lea ecx,[rax-0x1]
4000c5: ff c9 dec ecx
4000c7: 75 f7 jne 4000c0
<_start.loop>


00000000004000c9
<_start.end>:
4000c9: 31 ff xor edi,edi
4000cb: b8 e7 00 00
00 mov eax,0xe7
4000d0: 0f 05 syscall

perf stat
-etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/
-r2 ./mov-elimination

Performance counter stats for
'./mov-elimination' (2 runs):

513.242841 task-clock:u (msec) #
1.000 CPUs utilized ( +- 0.05% )

0 context-switches:u # 0.000
K/sec
1 page-faults:u # 0.002 K/sec
2,000,111,934 cycles:u #
3.897 GHz ( +- 0.00% )
4,000,000,161 instructions:u # 2.00 insn per cycle (
+- 0.00% )
1,000,000,157 branches:u # 1948.396 M/sec ( +- 0.00% )

3,000,058,589 uops_issued_any:u # 5845.300 M/sec ( +- 0.00% )
2,000,037,900
uops_executed_thread:u # 3896.865 M/sec ( +- 0.00% )

0.513402352
seconds time elapsed ( +- 0.05%
)



As
expected, the loop runs 1G times (branches ~= 1 billion). The
"extra" 111k cycles beyond 2G is overhead that's present in the other tests, too,
including the one with no mov. It's not from occasional failure
of mov-elimination, but it does scale with the iteration count so it's not just startup
overhead. It's probably from timer interrupts, since IIRC Linux
perf doesn't mess around with perf-counters while handling
interrupts, and just lets them keep counting. (perf virtualizes
the hardware performance counters so you can get per-process counts even when a thread
migrates across CPUs.) Also, timer interrupts on the sibling logical core that shares
the same physical core will perturb things a
bit.



The bottleneck is the loop-carried
dependency chain involving the loop counter. 2G cycles for 1G iters is 2 clocks per
iteration, or 1 clock per decrement. The confirms that the length of the dep chain is 2
cycles. This is only possible if mov has zero
latency
. (I know it doesn't prove that there isn't some other bottleneck.
It really only proves that the latency is at
most
2 cycles, if you don't believe my assertion that latency is the only
bottleneck. There's a resource_stalls.any perf counter, but it
doesn't have many options for breaking down which microarchitectural resource was
exhausted.)



The loop has 3 fused-domain uops:
mov, lea, and href="https://stackoverflow.com/questions/31771526/x86-64-assembly-loop-conditions-and-out-of-order/31778403#31778403">macro-fused
dec/jnz. The 3G
uops_issued.any count confirms that: It counts in the fused
domain, which is all of the pipeline from decoders to retirement, except for the
scheduler (RS) and execution units. (macro-fused instruction-pairs stay as single uop
everywhere. It's only for micro-fusion of stores or ALU+load that 1 fused-domain uop in
the
ROB
tracks the progress of two unfused-domain
uops.)



2G
uops_executed.thread (unfused-domain) tells us that all the
mov uops were eliminated (i.e. handled by the issue/rename
stage, and placed in the ROB in an already-executed state). They still take up
issue/retire bandwidth, and space in the uop cache, and code-size. They take up space in
the ROB, limiting out-of-order window size. A mov
instruction is never free. There are many possible microarchitectural bottlenecks
besides latency and execution ports, the most important usually being the 4-wide issue
rate of the front-end.



On Intel
CPUs, being zero latency is often a bigger deal than not needing an execution unit,
especially in Haswell and later where there are 4 ALU ports. (But only 3 of them can
handle vector uops, so non-eliminated vector moves would be a bottleneck more easily,
especially in code without many loads or stores taking front-end bandwidth (4
fused-domain uops per clock) away from ALU uops. Also, scheduling uops to execution
units isn't perfect (more like oldest-ready first), so uops that aren't on the critical
path can steal cycles from the critical
path.)




If we put a
nop or an xor edx,edx into the loop,
those would also issue but not execute on Intel SnB-family
CPUs.



Zero-latency mov-elimination can be useful
for zero-extending from 32 to 64 bits, and for 8 to 64. ( href="https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to">movzx
eax, bl
is eliminated, movzx eax, bx
isn't).



/>

Without
mov-elimination



All current CPUs
that support mov-elimination don't support it for mov
same,same
, so pick different registers for zero-extending
integers from 32 to 64-bit, or vmovdqa xmm,xmm to zero-extend
to YMM in a rare case where that's necessary. (Unless you need the
result in the register it's already in. Bouncing to a different reg and back is normally
worse.) And on Intel, the same applies for movzx eax,al for
example. (AMD Ryzen doesn't mov-eliminated movzx.) Agner Fog's instruction tables show
mov as always being eliminated on Ryzen,
but I guess he means that it can't fail between two different regs the way it can on
Intel.




We can use this limitation to
create a micro-benchmark that defeats it on
purpose.



mov ecx, ecx # CPUs can't
eliminate mov same,same
lea ecx, [rcx-1]

dec
ecx
jnz .loop

3,000,320,972 cycles:u # 3.898 GHz ( +-
0.00% )

4,000,000,238 instructions:u # 1.33 insn per cycle ( +-
0.00% )
1,000,000,234 branches:u # 1299.225 M/sec ( +- 0.00% )

3,000,084,446 uops_issued_any:u # 3897.783 M/sec ( +- 0.00% )
3,000,058,661
uops_executed_thread:u # 3897.750 M/sec ( +- 0.00%
)


This
takes 3G cycles for 1G iterations, because the length of the dependency chain is now 3
cycles.



The fused-domain uop
count didn't change, still 3G.




What
did change is that now the unfused-domain uop count is the same as fused-domain. All the
uops needed an execution unit; none of the mov instructions
were eliminated, so they all added 1c latency to the loop-carried dep
chain.



(When there are micro-fused uops, like
add eax, [rsi], the uops_executed
count can be higher than uops_issued. But
we don't have that.)



/>

Without the mov at
all:



lea ecx,
[rcx-1]


dec ecx
jnz
.loop


2,000,131,323 cycles:u # 3.896 GHz ( +- 0.00%
)
3,000,000,161 instructions:u # 1.50 insn per cycle

1,000,000,157 branches:u # 1947.876 M/sec
2,000,055,428 uops_issued_any:u #
3895.859 M/sec ( +- 0.00% )
2,000,039,061 uops_executed_thread:u # 3895.828
M/sec ( +- 0.00%
)



Now we're
back down to 2 cycle latency for the loop-carried dep
chain.



Nothing is
eliminated.






I
tested on a 3.9GHz i7-6700k Skylake. I get identical results on a Haswell i5-4210U (to
within 40k out of 1G counts) for all the perf events. That's about the same margin of
error as re-running on the same system.



Note
that if I ran perf as root1, and counted
cycles instead of cycles:u (user-space
only), it measure the CPU frequency as exactly 3.900 GHz. (IDK why Linux only obeys the
bios-settings for max turbo right after reboot, but then drops to 3.9GHz if I leave it
idle for a couple minutes. Asus Z170 Pro Gaming mobo, Arch Linux with kernel
4.10.11-1-ARCH. Saw the same thing with Ubuntu. Writing
balance_performance to each of
/sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
from /etc/rc.local fixes it, but writing
balance_power makes it drop back to 3.9GHz again
later.)




1: update: as a better
alternative to running sudo perf, I set sysctl
kernel.perf_event_paranoid = 0 in
/etc/syctl.d/99-local.conf



/>

You should get the same results on AMD Ryzen, since
it can eliminate integer mov. AMD Bulldozer-family can only
eliminate xmm register copies. (According to Agner Fog, ymm
register copies are an eliminated low-half and an ALU op for the high
half.)



For example, AMD Bulldozer and Intel
Ivybridge can sustain a throughput of 1 per clock
for



 movaps xmm0,
xmm1

movaps xmm2, xmm3
movaps xmm4, xmm5

dec
jnz
.loop


But Intel
Sandybridge can't eliminate moves, so it would bottleneck on 4 ALU uops for 3 execution
ports. If it was pxor xmm0,xmm0 instead of movaps, SnB could
also sustain one iteration per clock. (But Bulldozer-family couldn't, because
xor-zeroing still needs an execution unit on AMD, even though is independent of the old
value of the register. And Bulldozer-family only has 0.5c throughput for
PXOR.)



/>


Limitations of
mov-elimination



Two dependent MOV instructions
in a row exposes a difference between Haswell and
Skylake.



.loop:
mov
eax, ecx
mov ecx, eax

sub ecx, 2
jnz
.loop



Haswell:
minor run-to-run variability (1.746 to 1.749 c / iter), but this is
typical:



 1,749,102,925 cycles:u #
2.690 GHz
4,000,000,212 instructions:u # 2.29 insn per cycle

1,000,000,208 branches:u # 1538.062 M/sec
3,000,079,561 uops_issued_any:u #
4614.308 M/sec
1,746,698,502 uops_executed_core:u # 2686.531 M/sec

745,676,067 lsd_cycles_4_uops:u # 1146.896 M/sec




Not all
the MOV instructions are eliminated: about 0.75 of the 2 per iteration used an execution
port. Every MOV that executes instead of being eliminated adds 1c of latency to the
loop-carried dep chain, so it's not a coincidence that
uops_executed and cycles are very
similar. All the uops are part of a single dependency chain, so there's no parallelism
possible. cycles is always about 5M higher than
uops_executed regardless of run-to-run variation, so I guess
there are just 5M cycles being used up somewhere
else.



Skylake: more stable than HSW results, and
more mov-elimination: only 0.6666 MOVs of every 2 needed an execution
unit.



 1,666,716,605 cycles:u #
3.897 GHz
4,000,000,136 instructions:u # 2.40 insn per cycle

1,000,000,132 branches:u # 2338.050 M/sec
3,000,059,008 uops_issued_any:u #
7014.288 M/sec

1,666,548,206 uops_executed_thread:u # 3896.473
M/sec
666,683,358 lsd_cycles_4_uops:u # 1558.739
M/sec


On Haswell,
lsd.cycles_4_uops accounted for all of the uops. (0.745 * 4 ~=
3). So in almost every cycle where any uops are issued, a full group of 4 is issued
(from the loop-buffer. I probably should have looked at a different counter that doesn't
care where they came from, like uops_issued.stall_cycles to
count cycles where no uops issued).



But on SKL,
0.66666 * 4 = 2.66664 is less than 3, so in some cycles the
front-end issued fewer than 4 uops. (Usually it stalls until there's room in the
out-of-order core to issue a full group of 4, instead of issuing non-full
groups).



It's weird, IDK what the exact
microarchitectural limitation is. Since the loop is only 3 uops, each issue-group of 4
uops is more than a full iteration. So an issue group can contain up to 3 dependent
MOVs. Perhaps Skylake is designed to break that up sometimes, to allow more
mov-elimination?




update:
actually this is normal for 3-uop loops on Skylake.
uops_issued.stall_cycles shows that HSW and SKL issue a simple
3 uop loop with no mov-elimination the same way they issue this one. So better
mov-elimination is a side-effect of splitting up issue groups for some other reason.
(It's not a bottleneck because taken branches can't execute faster than 1 per clock
regardless of how fast they issue). I still don't know why SKL is different, but I don't
think it's anything to worry about.



/>

In a less extreme case, SKL and HSW
are the same, with both failing to eliminate 0.3333 of every 2 MOV
instructions:



.loop:

mov eax, ecx
dec eax
mov ecx, eax



sub ecx, 1
jnz
.loop





2,333,434,710 cycles:u # 3.897 GHz
5,000,000,185 instructions:u # 2.14 insn
per cycle
1,000,000,181 branches:u # 1669.905 M/sec


4,000,061,152 uops_issued_any:u # 6679.720 M/sec
2,333,374,781
uops_executed_thread:u # 3896.513 M/sec
1,000,000,942 lsd_cycles_4_uops:u #
1669.906 M/sec


All of
the uops issue in groups of 4. Any contiguous group of 4 uops will contain exactly two
MOV uops that are candidates for elimination. Since it clearly succeeds at eliminating
both in some cycles, IDK why it can't always do
that.






href="https://software.intel.com/en-us/articles/intel-sdm#optimization"
rel="noreferrer">Intel's optimization manual
says that overwriting the result of mov-elimination as early as possible frees up the
microarchitectural resources so it can succeed more often, at least for
movzx. See Example 3-25. Re-ordering Sequence to
Improve Effectiveness of Zero-Latency MOV
Instructions
.




So maybe
it's tracked internally with a limited-size table of ref-counts? Something has to stop
the physical register file entry from being freed when it's no longer needed as the
value of the original architectural register, if it's still needed as the value of the
mov destination. Freeing PRF entries as soon as possible is key, because href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/"
rel="noreferrer">PRF size can limit the out-of-order window to smaller than
ROB size.



I tried the examples on Haswell and
Skylake, and found that mov-elimination did in fact work significantly more of the time
when doing that, but that it was actually slightly slower in total cycles, instead of
faster. The example was intended to show the benefit on IvyBridge, which probably
bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the
dep chains and don't seem to be bothered by needing an ALU port for more of the
movzx instructions.



See
also href="https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures">Why
is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for
more research + guesswork about how mov-elimination works, and whether it could work for
xchg eax, ecx. (In practice xchg
reg,reg
is 3 ALU uops on Intel, but 2 eliminated uops on Ryzen. It's
interesting to guess whether Intel could have implemented it more
efficiently.)



/>

BTW, as a workaround for an erratum on Haswell,
Linux doesn't provide uops_executed.thread when hyperthreading
is enabled, only uops_executed.core. The other core was
definitely idle the whole time, not even timer interrupts, href="https://serverfault.com/a/489410/28665">because I took it offline with
echo 0 > /sys/devices/system/cpu/cpu3/online.
Unfortunately this can't be done before perf decides that HT is
enabled, and my Dell laptop doesn't have a BIOS option to disable HT. So I can't get
perf to use all 8 hardware PMU counters at once on that system,
only 4. :/



No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...