assembly - Micro fusion and addressing modes

itemprop="text">

I have found something unexpected (to
me) using the href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer"
rel="noreferrer">Intel® Architecture Code Analyzer
(IACA).



The following instruction using [base+index] addressing

addps xmm1, xmmword ptr [rsi+rax*1]

does not micro-fuse according to IACA. However, if I use [base+offset] like this

addps xmm1, xmmword ptr [rsi]

IACA reports that it does fuse.



Section 2-11 of the Intel optimization reference manual
(http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html)
gives the following as an example "of micro-fused micro-ops that can be handled by all decoders":

FADD DOUBLE PTR [RDI + RSI*8]



and Agner Fog's assembly optimization manual also gives examples of micro-op fusion using
[base+index] addressing. See, for example, Section 12.2 "Same example on Core2". So what's the
correct answer?



Answer




In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an
instruction with an immediate operand can't micro-fuse with a RIP-relative addressing mode).



But some combinations of uop and
addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel
SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename
stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop
count after un-lamination is what matters.



href="http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html"
rel="noreferrer">Intel's optimization manual describes un-lamination for
Sandybridge in Section 2.3.2.4: Micro-op Queue and the Loop Stream Detector
(LSD)
, but doesn't describe the changes for any later
microarchitectures.



UPDATE: Intel's manual now has a detailed section describing un-lamination for Haswell: see
Section 2.3.5, Unlamination. A brief description for Sandybridge is in Section 2.4.2.4.





The rules, as best I
can tell from experiments on SnB, HSW, and
SKL:




  • SnB (and I assume
    also IvB): indexed addressing modes are always un-laminated, others stay micro-fused.
    IACA is (mostly?) correct.

  • HSW, SKL: These only keep an indexed ALU instruction micro-fused if it has 2 operands and
    treats the dst register as read-modify-write. Here "operands" includes flags, meaning that
    adc and cmov don't micro-fuse. Most VEX-encoded instructions also don't fuse, since they
    generally have three operands (so paddb xmm0, [rdi+rbx] fuses but vpaddb xmm0, xmm0,
    [rdi+rbx] doesn't). Finally, the occasional 2-operand instruction whose first operand is
    write-only, such as pabsb xmm0, [rax + rbx], also doesn't fuse. IACA is wrong here, applying
    the SnB rules.




Related: simple (non-indexed) addressing modes are the only ones that the dedicated
store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to
avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a
single register, but the src with dst+(initial_src-initial_dst). Then you only have to increment
the dst register inside the loop; see the sketch below.)
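
A minimal sketch of that trick (NASM syntax; the copy loop, register choices, and loop bounds
are my own illustration, not from any measured test):

; rdi = dst, rsi = src, rdx = end of dst region (all assumed set up beforehand)
    sub    rsi, rdi           ; rsi = initial_src - initial_dst, a loop-invariant offset
.copy:
    movaps xmm0, [rdi + rsi]  ; load uses an indexed mode (loads can't use port7 anyway)
    movaps [rdi], xmm0        ; store uses one-register addressing: store-address uop can run on port7
    add    rdi, 16            ; only one pointer increment per iteration
    cmp    rdi, rdx
    jb     .copy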



Note that some instructions never
micro-fuse at all (even in the decoders/uop-cache). e.g. shufps xmm, [mem],
imm8
, or vinsertf128 ymm, ymm, [mem], imm8, are
always 2 uops on SnB through Skylake, even though their register-source versions are
only 1 uop. This is typical for instructions with an imm8 control operand plus the usual
dest/src1, src2 register/memory operands, but there are a few other cases. e.g.
PSRLW/D/Q xmm,[mem] (vector shift count from a memory operand)
doesn't micro-fuse, and neither does PMULLD.



See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL
when you read lots of registers: lots of micro-fusion with indexed addressing modes can lead to
slowdowns vs. the same instructions with fewer register operands: one-register addressing modes
and immediates. We don't know the cause yet, but I suspect some kind of register-read limit,
maybe related to reading lots of cold registers from the PRF.




Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if
they're later un-laminated.

# store
mov [rax], edi          SnB/HSW/SKL: 1 fused-domain, 2 unfused. The store-address uop can run on port7.
mov [rax+rsi], edi      SnB: un-laminated. HSW/SKL: stays micro-fused. (The store-address can't use port7, though.)
mov [buf + rax*4], edi  SnB: un-laminated. HSW/SKL: stays micro-fused.

# normal ALU stuff
add edx, [rsp+rsi]      SnB: un-laminated. HSW/SKL: stays micro-fused.
# I assume the majority of traditional/normal ALU insns are like add


Three-input instructions that HSW/SKL may have to un-laminate:

vfmadd213ps xmm0,xmm0,[rel buf]  HSW/SKL: stays micro-fused: 1 fused, 2 unfused.
vfmadd213ps xmm0,xmm0,[rdi]      HSW/SKL: stays micro-fused.
vfmadd213ps xmm0,xmm0,[0+rdi*4]  HSW/SKL: un-laminated: 2 uops in fused & unfused domains.
    (So an indexed addressing mode is still the condition for HSW/SKL, same as documented by
    Intel for SnB.)

# no idea why this one-source BMI2 instruction is un-laminated
# It's different from ADD in that its destination is write-only (and it uses a VEX encoding)
blsi edi, [rdi]        HSW/SKL: 1 fused-domain, 2 unfused.
blsi edi, [rdi+rsi]    HSW/SKL: 2 fused & unfused-domain.

adc eax, [rdi]         same as cmov r, [rdi]
cmove ebx, [rdi]       Stays micro-fused. (SnB?)/HSW: 2 fused-domain, 3 unfused-domain.
                       SKL: 1 fused-domain, 2 unfused.
# I haven't confirmed that this micro-fuses in the decoders, but I'm assuming it does,
# since a one-register addressing mode does.

adc eax, [rdi+rsi]     same as cmov r, [rdi+rsi]
cmove ebx, [rdi+rax]   SnB: untested, probably 3 fused & unfused-domain.
                       HSW: un-laminated to 3 fused & unfused-domain.
                       SKL: un-laminated to 2 fused & unfused-domain.


I
assume that Broadwell behaves like Skylake for
adc/cmov.



It's strange that HSW un-laminates
memory-source ADC and CMOV. Maybe Intel didn't get around to changing that from SnB
before they hit the deadline for shipping
Haswell.



Agner's insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that
doesn't match my experiments. The cycle counts I'm measuring match up with the fused-domain uop
issue count, for a 4 uops / clock issue bottleneck. Hopefully he'll double-check that and
correct the tables.




Memory-dest integer ALU:

add [rdi], eax        SnB: untested (Agner says 2 fused-domain, 4 unfused-domain (load + ALU +
                      store-address + store-data)).
                      HSW/SKL: 2 fused-domain, 4 unfused.
add [rdi+rsi], eax    SnB: untested, probably 4 fused & unfused-domain.
                      HSW/SKL: 3 fused-domain, 4 unfused. (I don't know which uop stays fused.)
                      HSW: about 0.95 cycles extra store-forwarding latency vs. [rdi] for the
                      same address used repeatedly. (6.98c per iter, up from 6.04c for [rdi])
                      SKL: 0.02c extra latency (5.45c per iter, up from 5.43c for [rdi]), again
                      in a tiny loop with dec ecx/jnz.

adc [rdi], eax        SnB: untested.
                      HSW: 4 fused-domain, 6 unfused-domain. (same-address throughput 7.23c with
                      dec, 7.19c with sub ecx,1)
                      SKL: 4 fused-domain, 6 unfused-domain. (same-address throughput ~5.25c
                      with dec, 5.28c with sub)
adc [rdi+rsi], eax    SnB: untested.
                      HSW: 5 fused-domain, 6 unfused-domain. (same-address throughput = 7.03c)
                      SKL: 5 fused-domain, 6 unfused-domain. (same-address throughput = ~5.4c
                      with sub ecx,1 for the loop branch, or 5.23c with dec ecx for the loop
                      branch.)


Yes, that's right: adc [rdi],eax / dec ecx / jnz runs faster than the same loop with add instead
of adc on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated
rewrites of the same address (store-forwarding latency higher than expected). See also this post
about repeated store/reload to the same address being slower than expected on SKL:
http://www.agner.org/optimize/blog/read.php?i=415#854




Memory-destination adc is so many uops because Intel P6-family (and apparently SnB-family)
can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an
extra uop to work around the problem case where the load and add complete, and then the store
faults, but the insn can't just be restarted because CF has already been updated. Interesting
series of comments from Andy Glew (@krazyglew):
https://stackoverflow.com/questions/17395557/observing-stale-instruction-fetching-on-x86-with-self-modifying-code#comment68191840_18388700



Presumably fusion in the decoders and un-lamination later saves us from needing microcode ROM to
produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg. (See
https://stackoverflow.com/questions/26907523/branch-alignment-for-loops-involving-micro-coded-instructions-on-intel-snb-famil/27687691#27687691)




Why SnB-family un-laminates:

Sandybridge simplified the internal uop format to save power and transistors (along with making
the major change to using a physical register file, instead of keeping input / output data in
the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop
in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and
later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add and adc are
taking full advantage of that, or if Intel had to get Haswell out the door with some
instructions still un-laminating when they might not need to.




Nehalem and earlier have
a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track
micro-fused uops with 3 input registers (the non-memory register operand, base, and
index).






So indexed stores and ALU+load instructions can still decode efficiently (not having to be the
first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages
of micro-fusion are essentially gone for tuning tight loops. "Un-lamination" happens before the
4-fused-domain-uops-per-cycle issue/retire width of the out-of-order core. The fused-domain
performance counters (uops_issued / uops_retired.retire_slots) count fused-domain uops after
un-lamination.



Intel's description of the
renamer (Section 2.3.3.1: Renamer) implies that it's the
issue/rename stage which actually does the un-lamination, so uops destined for
un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue /
loop-buffer (aka the IDQ).



TODO: test this. Make
a loop that should just barely fit in the loop buffer. Change something so one of the
uops will be un-laminated before issuing, and see if it still runs from the loop buffer
(LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf
counters to track where uops come from, so this should be
easy.
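
For reference, a sketch of how that first check might look (ocperf.py event names as I'd expect
them on HSW/SKL; verify with `ocperf.py list` before trusting them):

# Where do the loop's uops come from: the loop buffer (LSD) or the uop cache (DSB)?
ocperf.py stat -e lsd.uops,idq.dsb_uops,idq.mite_uops,uops_issued.any ./uop-test
# If lsd.uops accounts for ~all of uops_issued.any, the loop ran from the LSD;
# if idq.dsb_uops dominates instead, the uops were re-fetched from the uop cache.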




Harder TODO: if un-lamination
happens between reading from the uop cache and adding to the IDQ, test whether it can
ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage,
can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing
the first 4.)



/>

(See a previous version of this answer for some guesses based on tuning some LUT code, with some
notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop.)



Experimental testing on
SnB



The HSW/SKL numbers were measured on an
i5-4210U and an i7-6700k. Both had HT enabled (but the system idle so the thread had the
whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL
and Linux 4.8 on HSW, using ocperf.py. (The HSW laptop
NFS-mounted my SKL desktop's
/home.)




The SnB numbers were measured
as described below, on an i5-2500k which is no longer
working.



Confirmed by testing with performance
counters for uops and cycles.



I found a table of PMU events for Intel Sandybridge
(http://www.bnikolic.co.uk/blog/hpc-prof-events.html), for use with Linux's perf command.
(Standard perf unfortunately doesn't have symbolic names for most hardware-specific PMU events,
like uops.) I made use of it for a recent answer
(https://stackoverflow.com/a/31355086/224132).



href="https://github.com/andikleen/pmu-tools"
rel="noreferrer">ocperf.py provides symbolic names for these
uarch-specific PMU events, so you don't have to look up tables. Also, the same
symbolic name works across multiple uarches. I wasn't aware of it when I first wrote
this answer.
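
For example, a symbolic-name equivalent of the raw-hex perf command used below might look like
this (event spellings are my assumption from pmu-tools; check `ocperf.py list` on your machine):

# uops_dispatched.thread ~ r1b1, uops_issued.any ~ r10e,
# uops_retired.retire_slots ~ r2c2, uops_retired.all ~ r1c2
ocperf.py stat -e task-clock,cycles,instructions,uops_dispatched.thread,uops_issued.any,uops_retired.retire_slots,uops_retired.all ./uop-test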



To test for uop micro-fusion, I
constructed a test program that is bottlenecked on the 4-uops-per-cycle fused-domain
limit of Intel CPUs. To avoid any execution-port contention, many of these uops are
nops, which still sit in the uop cache and go through the
pipeline the same as any other uop, except they don't get dispatched to an execution
port. (An xor x, same, or an eliminated move, would be the
same.)




Test program: yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test



GLOBAL _start
_start:
    xor eax, eax
    xor ebx, ebx
    xor edx, edx
    xor edi, edi
    lea rsi, [rel mydata]   ; load pointer

    mov ecx, 10000000
    cmp dword [rsp], 2      ; argc >= 2
    jge .loop_2reg

ALIGN 32
.loop_1reg:
    or eax, [rsi + 0]
    or ebx, [rsi + 4]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_1reg
    ; xchg r8, r9           ; no effect on flags; decided to use NOPs instead

    jmp .out

ALIGN 32
.loop_2reg:
    or eax, [rsi + 0 + rdi]
    or ebx, [rsi + 4 + rdi]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_2reg

.out:
    xor edi, edi
    mov eax, 231            ; exit(0)
    syscall

SECTION .rodata
mydata:
    db 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff


I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle if the
loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's
microarch doc unfortunately wasn't clear on this limitation of the loop buffer. See
https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of
for more investigation on HSW/SKL. SnB may be worse than HSW in this case, but I'm not sure, and
I no longer have working SnB hardware.




I wanted to keep macro-fusion (compare-and-branch) out of the picture, so I used nops between
the dec and the branch. I used 4 nops, so with micro-fusion, the loop would be 8 uops, and fill
the pipeline at 2 cycles per iteration.



In the other version of the loop, using 2-register addressing modes that don't micro-fuse, the
loop will be 10 fused-domain uops, and run in 3 cycles.



Results
from my 3.3GHz Intel Sandybridge (i5 2500k).
I didn't do anything to get
the cpufreq governor to ramp up clock speed before testing, because cycles are cycles
when you aren't interacting with memory. I've added annotations for the performance
counter events that I had to enter in
hex.



testing the 1-reg addressing mode: no cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test

Performance counter stats for './uop-test':

     11.489620   task-clock (msec)         # 0.961 CPUs utilized
    20,288,530   cycles                    # 1.766 GHz
    80,082,993   instructions              # 3.95  insns per cycle
                                           # 0.00  stalled cycles per insn
    60,190,182   r1b1   ; UOPS_DISPATCHED: (unfused-domain. 1->umask 02 -> uops sent to execution ports from this thread)
    80,203,853   r10e   ; UOPS_ISSUED: fused-domain
    80,118,315   r2c2   ; UOPS_RETIRED: retirement slots used (fused-domain)
   100,136,097   r1c2   ; UOPS_RETIRED: ALL (unfused-domain)
       220,440   stalled-cycles-frontend   # 1.09% frontend cycles idle
       193,887   stalled-cycles-backend    # 0.96% backend cycles idle

   0.011949917 seconds time elapsed


testing the 2-reg addressing mode: with a cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test x

Performance counter stats for './uop-test x':

     18.756134   task-clock (msec)         # 0.981 CPUs utilized
    30,377,306   cycles                    # 1.620 GHz
    80,105,553   instructions              # 2.64  insns per cycle
                                           # 0.01  stalled cycles per insn
    60,218,693   r1b1   ; UOPS_DISPATCHED: (unfused-domain. 1->umask 02 -> uops sent to execution ports from this thread)
   100,224,654   r10e   ; UOPS_ISSUED: fused-domain
   100,148,591   r2c2   ; UOPS_RETIRED: retirement slots used (fused-domain)
   100,172,151   r1c2   ; UOPS_RETIRED: ALL (unfused-domain)
       307,712   stalled-cycles-frontend   # 1.01% frontend cycles idle
     1,100,168   stalled-cycles-backend    # 3.62% backend cycles idle

   0.019114911 seconds time elapsed


So, both versions ran 80M instructions, and dispatched 60M uops to execution ports. (or with a
memory source dispatches to an ALU for the or, and to a load port for the load, regardless of
whether it was micro-fused or not in the rest of the pipeline. nop doesn't dispatch to an
execution port at all.) Similarly, both versions retire 100M unfused-domain uops, because the
40M nops count here.



The difference is in the counters for the
fused-domain.





  1. The
    1-register address version only issues and retires 80M fused-domain uops. This is the
    same as the number of instructions. Each insn turns into one fused-domain
    uop.

  2. The 2-register address version issues 100M
    fused-domain uops. This is the same as the number of unfused-domain uops, indicating
    that no micro-fusion
    happened.



I suspect that
you'd only see a difference between UOPS_ISSUED and UOPS_RETIRED(retirement slots used)
if branch mispredicts led to uops being cancelled after issue, but before
retirement.



And finally, the performance impact is real. The non-fused version took 1.5x as many clock
cycles. This exaggerates the performance difference compared to most real cases. The loop has to
run in a whole number of cycles, and the 2 extra uops push it from 2 to 3. Often, an extra 2
fused-domain uops will make less difference. And potentially no difference, if the code is
bottlenecked by something other than 4-fused-domain-uops-per-cycle.
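
Spelling out that arithmetic from the uop counts above:

8 fused-domain uops  / 4 issued per cycle = 2 cycles per iteration      (micro-fused loop)
10 fused-domain uops / 4 issued per cycle = 2.5, rounded up to 3 cycles (un-fused loop)
3 cycles / 2 cycles = 1.5x as many clock cycles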




Still, code that makes a lot of memory references in a loop might be faster if implemented with
a moderate amount of unrolling and incrementing multiple pointers which are used with simple
[base + immediate offset] addressing, instead of using [base + index] addressing modes. A sketch
follows.
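
A minimal sketch of that shape (NASM syntax; the loop body, registers, and bounds are my own
illustration, not a measured test case):

; Hypothetical: sum two float arrays, unrolled 2x, with two separately incremented pointers
; so every memory operand is a simple [base + disp8] mode that stays micro-fused everywhere.
.sum:
    addps  xmm0, [rsi]        ; src1
    addps  xmm1, [rsi + 16]
    addps  xmm0, [rdx]        ; src2, via its own pointer instead of [rsi + rdi]
    addps  xmm1, [rdx + 16]
    add    rsi, 32            ; two increments instead of one shared index register
    add    rdx, 32
    cmp    rsi, rcx           ; rcx = end pointer (assumed set up before the loop)
    jb     .sum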



Further stuff:

RIP-relative with an immediate can't micro-fuse. Agner Fog's testing shows that this is the case
even in the decoders / uop-cache, so they never fuse in the first place (rather than being
un-laminated).



IACA gets this wrong, and claims that both of these micro-fuse:

cmp dword [abs mydata], 0x1b  ; fused counters != unfused counters (micro-fusion happened, and
                              ; wasn't un-laminated). Uses 2 entries in the uop-cache, according
                              ; to Agner Fog's testing.
cmp dword [rel mydata], 0x1b  ; fused counters ~= unfused counters (micro-fusion didn't happen)

RIP-rel does micro-fuse (and stay fused) when there's no immediate, e.g.:

or eax, dword [rel mydata]    ; fused counters != unfused counters, i.e. micro-fusion happens



Micro-fusion doesn't increase the
latency of an instruction
. The load can issue before the other input is
ready.




ALIGN 32
.dep_fuse:
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    dec ecx
    jg .dep_fuse

This loop runs at 5 cycles per iteration, because of the eax dep chain. No faster than a
sequence of or eax, [rsi + 0 + rdi], or mov ebx, [rsi + 0 + rdi] / or eax, ebx. (The unfused and
the mov versions both run the same number of uops.) Scheduling / dep checking happens in the
unfused-domain. Newly issued uops go into the scheduler (aka Reservation Station (RS)) as well
as the ROB. They leave the scheduler after dispatching (aka being sent to an execution unit),
but stay in the ROB until retirement. So the out-of-order window for hiding load latency is at
least the scheduler size: 54 unfused-domain uops in Sandybridge, 60 in Haswell, 97 in Skylake
(http://www.realworldtech.com/haswell-cpu/3/).



Micro-fusion doesn't have a shortcut for the base and index being the same register. A loop with
or eax, [mydata + rdi + 4*rdi] (where rdi is zeroed) runs as many uops and cycles as the loop
with or eax, [rsi+rdi]. This addressing mode could be used for iterating over an array of
odd-sized structs starting at a fixed address. This is probably never used in most programs, so
it's no surprise that Intel didn't spend transistors on allowing this special case of 2-register
modes to micro-fuse. (And Intel documents it as "indexed addressing modes" anyway, where a
register and scale factor are needed.)




Macro-fusion of a
cmp/jcc or
dec/jcc creates a uop that stays as a
single uop even in the unfused-domain. dec / nop / jge can
still run in a single cycle but is three uops instead of one.
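
For concreteness, a sketch of the two loop-end shapes just described (NASM syntax; labels are my
own illustration):

.fused:
    dec ecx
    jg  .fused       ; dec+jg macro-fuse: a single uop, even in the unfused domain

.not_fused:
    dec ecx
    nop              ; separating the pair defeats macro-fusion:
    jge .not_fused   ; can still run in a single cycle, but 3 uops instead of 1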


