Monday 20 November 2017

performance - Why are loops always compiled into "do...while" style (tail jump)?


When trying to understand assembly
(with compiler optimization on), I see this
behavior:



A very basic loop like
this



outside_loop;
while (condition) {
    statements;
}


is often compiled into (pseudocode):






; outside_loop
jmp loop_condition      ; unconditional
loop_start:
    loop_statements
loop_condition:
    condition_check
    jmp_if_true loop_start
; outside_loop


However,
if the optimization is not turned on, it compiles to normally understandable
code:



loop_condition:
    condition_check
    jmp_if_false loop_end
    loop_statements
    jmp loop_condition  ; unconditional
loop_end:


According to my understanding, the compiled code better resembles this:



goto condition;
do {
    statements;
condition:
} while (condition_check);


I
can't see a huge performance boost or code readability boost, so why is this often the
case? Is there a name for this loop style, for example "trailing condition
check"?



Answer




Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086) (https://stackoverflow.com/questions/28665528/while-do-while-for-loops-in-assembly-language-emu8086)




Fewer instructions / uops inside the
loop = better
. Structuring the code outside the loop to achieve this is
very often a good idea.



Sometimes this requires "loop rotation" (peeling opposite parts of the first and last iterations so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop, then after it exits do the last part of the final iteration. This is extra useful if the last iteration is a special case, e.g. a store you need to skip. It lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.
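As a concrete illustration (function names are made up; this is a hand-done version of what the compiler does), rotating a simple search loop in C might look like:

```c
#include <stddef.h>

// While-style version: condition check at the top, so compiled naively it
// needs a jump back to the top plus a conditional exit branch.
size_t count_while(const int *a) {
    size_t n = 0;
    while (a[n] != 0)
        n++;
    return n;
}

// Rotated version: the first condition check is peeled out of the loop,
// so the loop body ends with a single backwards conditional branch.
size_t count_rotated(const int *a) {
    size_t n = 0;
    if (a[0] != 0) {
        do {
            n++;
        } while (a[n] != 0);
    }
    return n;
}
```

Both return the same count; the rotated form is essentially what an optimizing compiler emits for the first.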



Some of these optimizations are related
to or enable software pipelining, e.g. loading something for the next iteration. (OoO
exec on x86 makes that not very important these days but it's useful for in-order cores
like many
ARM.)



do{}while() is the canonical / idiomatic structure for loops in asm on all architectures; get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization (https://en.wikipedia.org/wiki/Loop_optimization). You always do that.



This pattern is so widely used that on
CPUs that use static branch prediction for branches without an entry in the
branch-predictor caches, unknown forward conditional branches are predicted not-taken,
unknown backwards branches are predicted taken (because they're probably loop branches).
See Static branch prediction on newer Intel processors on Matt
Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch
PDF.



This answer ended up using x86 examples for
everything, but much of this applies across the board for all architectures. I wouldn't
be surprised if other superscalar / out-of-order implementations (like some ARM, or
POWER) also have limited branch-instruction throughput whether they're taken or not. But
fewer instructions inside the loop is nearly universal when all you have is a
conditional branch at the bottom, and no unconditional
branch.



/>

If the loop might need to run zero
times
, compilers more often put a test-and-branch outside the loop to
skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler
can't prove the loop condition is always true on the first
iteration).



BTW, this paper (http://www.cs.columbia.edu/~ecj2122/research/x86_loop_optimizations/x86_loop_optimizations.pdf) calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact: compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken" (https://en.wikipedia.org/wiki/SPECint#Critique). Most compilers can't invert loops in the general case, just ones that look almost exactly like the one in SPECint2006...
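A minimal C sketch of that nested-loop transformation (array dimensions are arbitrary choices here): both versions compute the same result, but the second walks the row-major array in memory order.

```c
#define ROWS 4
#define COLS 8

// Strided traversal: the inner loop jumps a whole row per step.
void bump_strided(int a[ROWS][COLS]) {
    for (int i = 0; i < COLS; i++)
        for (int j = 0; j < ROWS; j++)
            a[j][i]++;
}

// Interchanged loops: the inner loop touches contiguous memory.
void bump_sequential(int a[ROWS][COLS]) {
    for (int j = 0; j < ROWS; j++)
        for (int i = 0; i < COLS; i++)
            a[j][i]++;
}
```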




You can help the compiler make more compact asm
(fewer instructions outside the loop) by writing do{}while()
loops in C when you know the caller isn't allowed to pass
size=0 or whatever else guarantees a loop runs at least
once.
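For example (a made-up function; the caller is assumed to guarantee n >= 1), writing the source loop as do{}while lets the compiler omit the entry check entirely:

```c
#include <stddef.h>

// Assumes the caller guarantees n >= 1, so no zero-trip check is needed
// and the compiler can emit just the bottom-of-loop conditional branch.
int sum_nonempty(const int *a, size_t n) {
    int total = 0;
    size_t i = 0;
    do {
        total += a[i];
        i++;
    } while (i < n);
    return total;
}
```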



(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop every time you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned isn't (http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html). So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays. But if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size) while also convincing the compiler not to waste any instructions zero- or sign-extending, it can be tricky.
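The unsigned half of that is easy to demonstrate, since unsigned wraparound is well defined:

```c
#include <limits.h>

// Unsigned arithmetic wraps modulo 2^N, so an unsigned counter can never
// make i <= UINT_MAX false: while (i++ <= UINT_MAX) would never exit.
// (Signed overflow is UB instead, which is what lets the compiler assume
// a signed i++ <= bound loop terminates.)
unsigned wrap_after_max(void) {
    unsigned i = UINT_MAX;
    i++;            // well-defined wrap: UINT_MAX + 1 == 0
    return i;       // 0 <= UINT_MAX, so the loop condition would hold again
}
```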





"I can't see a huge performance boost"




Here's an
example where that optimization will give speedup of 2x on Intel CPUs before Haswell,
because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional
branches.



Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki (https://stackoverflow.com/questions/tagged/x86). See also Can x86's MOV really be "free"? Why can't I reproduce this at all? (https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all/44193770#44193770) for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.



You could also use Intel's IACA software (Intel Architecture Code Analyzer, https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it) to do static analysis on these loops.



; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.

; NASM syntax
ALIGN 16                    ; not required for max performance for tiny loops on most CPUs
.looptop:                   ; while (edi < end_pointer) {
    cmp     edi, esi        ; 32-bit code so this can macro-fuse on Core2
    jae     .done           ; 1 uop, port5 only  (macro-fused with cmp)
    paddd   xmm0, [edi]     ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16         ; 1 uop, p015
    jmp     .looptop        ; 1 uop, p5 only

    ; Sandybridge/Ivybridge ports each uop can use
.done:                      ; }


This is 4 total fused-domain uops (with macro-fusion of the cmp/jae: https://stackoverflow.com/questions/31771526/x86-64-assembly-loop-conditions-and-out-of-order/31778403), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.



More importantly, port5 pressure is the
bottleneck: This loop can execute at only one iteration per 2
cycles
because cmp/jae and jmp both need to run on port5. Other uops
stealing port5 could reduce practical throughput somewhat below
that.



Writing the loop
idiomatically for asm
, we
get:



ALIGN 16
.looptop:                   ; do {
    paddd   xmm0, [edi]     ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16         ; 1 uop, p015
    cmp     edi, esi        ; 1 uop, port5 only (macro-fused with jb)
    jb      .looptop        ; } while(edi < end_pointer);


Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (https://en.wikipedia.org/wiki/Classic_RISC_pipeline, like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).



Core2 and later should
run this at one iteration per clock
, twice as fast as the
while(){}-structured loop, if memory isn't a bottleneck (i.e.
assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per
clock).



This is only 3 fused-domain uops, so can
issue at better than one per clock on anything since Core2, or just one per clock if
issue groups always end with a taken branch.



But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly? (https://stackoverflow.com/questions/40681331/how-are-x86-uops-scheduled-exactly).



Most
CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute
tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't
run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power
CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per
clock. They even have loop buffers; Core2 has a loop buffer after instruction-length
decode but before regular decode. Nehalem and later recycle uops in the queue that feeds
the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable
the loop buffer because of a partial-register merging
bug.)



However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add edi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction).
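For reference, the asm loop above corresponds roughly to this C-with-intrinsics sketch (function name made up; assumes x86 SSE2, 16-byte-aligned data, and a multiple-of-4 element count; the horizontal sum is still left out, as in the asm):

```c
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

// Vector accumulate: four dword lanes of partial sums, like the asm's xmm0.
__m128i sum_vec(const int *p, size_t n) {
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(p + i)));
    return acc;
}
```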




If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes) (but stay fused on Haswell and later, unless you use the AVX form).



; index relative to the end of the array, with an index counting up towards zero
add     rdi, rsi            ; rdi = end_pointer
xor     eax, eax
sub     eax, esi            ; eax = -length, so [rdi+rax] = first element

.looptop:                   ; do {
    paddd   xmm0, [rdi + rax]
    add     eax, 16
    jl      .looptop        ; } while(idx+=16 < 0);  // or JNC still works


(It's usually
better to unroll some to hide the overhead of pointer increments instead of using
indexed addressing modes, especially for stores, partly because indexed stores can't use
the port7 store AGU on Haswell+.)



On
Core2/Nehalem add/jl don't macro-fuse, so this is 3
fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD
K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory
operand is 1 m-op / uop.



On SnB,
paddd un-laminates from the load, but add/jl macro-fuse, so
again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so
probably fewer resource conflicts reducing throughput of the
loop.)



On HSW and later, this is 2 fused-domain
uops because an indexed load can stay micro-fused with PADDD, and
add/jl macro-fuses. (Predicted-taken branches run on port 6, so
there are never resource conflicts.)



Of course,
the loops can only run at best 1 iteration per clock because of taken branch throughput
limits even for tiny loops. This indexing trick is potentially useful if you had
something else to do inside the loop, too.
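At the C level the same trick looks like this (hypothetical function; the compiler may or may not produce the asm above from it):

```c
#include <stddef.h>

// Index counts up from -len towards zero; the loop condition is just a
// sign test on the index, with no separate pointer-vs-end compare.
int sum_neg_index(const int *p, ptrdiff_t len) {
    const int *end = p + len;
    int total = 0;
    for (ptrdiff_t idx = -len; idx < 0; idx++)
        total += end[idx];
    return total;
}
```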




But all of these loops had no
unrolling



Yes, that exaggerates the effect of
loop overhead. But gcc doesn't unroll by default even at
-O3 (unless it decides to fully unroll).
It only unrolls with profile-guided optimization to let it know which loops are hot.
(-fprofile-use). You can enable
-funroll-all-loops, but I'd only recommend doing that on a
per-file basis for a compilation unit you know has one of your hot loops that needs it.
Or maybe even on a per-function basis with an __attribute__, if
there is one for optimization options like
that.



So this is highly relevant for
compiler-generated code. (But clang does default to unrolling
tiny loops by 4, or small loops by 2, and extremely importantly, using multiple
accumulators to hide latency.)
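A hand-unrolled sketch of that multiple-accumulator idea (assumes n is a multiple of 2 for brevity):

```c
#include <stddef.h>

// Two independent accumulators: the two adds per iteration don't form one
// serial dependency chain, so latency is hidden. (For integer adds on
// modern x86 this matters less than for FP, but the structure is the same.)
int sum_two_acc(const int *a, size_t n) {
    int s0 = 0, s1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}
```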




Benefits with very low iteration
count:



Consider what happens when the loop body
should run once or twice: There's a lot more jumping with anything other than
do{}while.




  • For
    do{}while, execution is a straight-line with no taken branches
    and one not-taken branch at the bottom. This is
    excellent.


  • For an if() {
    do{}while; }
    that might run the loop zero times, it's two not-taken
    branches. That's still very good. (Not-taken is slightly cheaper for the front-end than
    taken when both are correctly
    predicted).


  • For a jmp-to-the-bottom
    jmp; do{}while(), it's one taken unconditional branch, one
    taken loop condition, and then the loop branch is not-taken. This is kinda clunky but
    modern branch predictors are very
    good...


  • For a
    while(){} structure, this is one not-taken loop exit, one taken
    jmp at the bottom, then one taken loop-exit branch at the
    top.




With more
iterations, each loop structure does one more taken branch.
while(){} also does one more not-taken branch per iteration, so
it quickly becomes obviously worse.



The latter
two loop structures have more jumping around for small trip
counts.




Jumping to the bottom also has a disadvantage for
non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run
for a while. Code fetch / prefetch is good at bringing code to the front-end in a
straight line, but if prediction didn't predict the branch early enough, you might have
a code miss for the jump to the bottom. Also, parallel decode will probably have (or
could have) decoded some of the top of the loop while decoding the
jmp to the
bottom.



Conditionally jumping over a
do{}while loop avoids all that: you only jump forwards into
code that hasn't been run yet in cases where the code you're jumping over shouldn't run
at all. It often predicts very well because a lot of code never actually takes 0 trips
through the loop. (i.e. it could have been a do{}while, the
compiler just didn't manage to prove
it.)



Jumping to the bottom also means the core
can't start working on the real loop body until after the front-end chases two taken
branches.



There are cases with complicated loop
conditions where it's easiest to write it this way, and the performance impact is small,
but compilers often avoid it.




Loops with multiple exit
conditions:



Consider a
memchr loop, or a strchr loop: they
have to stop at the end of the buffer (based on a count) or the end of an
implicit-length string (0 byte). But they also have to break
out of the loop if they find a match before the
end.



So you'll often see a structure
like



do {
    if () break;

    blah blah;
} while(condition);


Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 6 / cmp eax, 18 / ja .outside_range, the unsigned-compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions: https://stackoverflow.com/questions/35932273/how-to-access-a-char-array-and-change-lower-case-letters-to-upper-case-and-vice/35936844#35936844) but sometimes you can't and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
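The unsigned-compare range trick translates directly to C (illustrative helper name):

```c
// True iff 5 < x && x < 25, i.e. x in [6, 24]. Subtracting 6 shifts the
// range to [0, 18]; values below 6 wrap to huge unsigned numbers, so one
// unsigned compare tests both bounds at once.
int in_range(int x) {
    return (unsigned)(x - 6) <= 18u;
}
```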












I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...