
assembly - Why does the number of uops per iteration increase with the stride of streaming loads?

itemprop="text">

Consider the following loop:

.loop:
    add rsi, OFFSET
    mov eax, dword [rsi]
    dec ebp
    jg .loop

where OFFSET is some non-negative integer and rsi contains a pointer to a buffer
defined in the bss section. This is the only loop in the code, and the buffer is
not initialized or touched before the loop. Presumably, on Linux, all of the
4K virtual pages of the buffer will be mapped on demand to the same physical page.
Therefore, the only limit on the buffer size is the number of virtual pages, so we can
easily experiment with very large buffers.
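
For concreteness, a minimal C analogue of the experiment might look like the
following sketch. The stride, the iteration count, and the use of a volatile read
to keep the load alive are illustrative assumptions; the actual measurements use
the assembly loop above.

    #define STRIDE 128UL        /* the OFFSET under test, in bytes */
    #define ITERS  10000000UL   /* assumed iteration count */

    /* Uninitialized buffer: it lands in .bss, so Linux maps each of its 4K
       pages on demand, all initially backed by the shared zero page. */
    static char buffer[(ITERS + 1) * STRIDE];

    int main(void)
    {
        const char *p = buffer;
        for (unsigned long i = 0; i < ITERS; i++) {
            p += STRIDE;                  /* add rsi, OFFSET      */
            *(volatile const int *)p;     /* mov eax, dword [rsi] */
        }
        return 0;
    }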



The loop consists of 4 instructions. Each instruction decodes into a single uop,
and dec and jg macro-fuse on Haswell, so the loop is 3 uops per iteration in the
fused domain. There is also a loop-carried dependency between successive instances
of add rsi, OFFSET. Therefore, under idle conditions where the load always hits in
the L1D, the loop should execute at about 1 cycle per iteration. For small offsets
(strides), this is expected thanks to the IP-based L1 streaming prefetcher and the
L2 streaming prefetcher. However, both prefetchers can only prefetch within a 4K
page, and the maximum stride supported by the L1 prefetcher is 2K. So for small
strides, there should be about 1 L1 miss per 4K page. As the stride increases, the
total number of L1 misses and TLB misses will increase and performance will
deteriorate accordingly.
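
In other words, if misses occur only at page boundaries, the expected number of
L1 misses per iteration would be roughly:

MIN(stride / 4096, 1)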




The following graph
shows various interesting performance counters (per iteration) for strides between 0 and
128. Note that the number of iterations is constant for all experiments. Only the buffer
size changes to accommodate the specified stride. In addition, only user-mode
performance events are counted.
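
Restricting the counts to user mode corresponds to setting the exclude_kernel bit
in perf_event_attr. As a rough sketch (using the generic architectural
instructions event rather than the exact Haswell events plotted below, and with
error handling elided), the counting setup looks something like this:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;   /* count user mode only */
        attr.exclude_hv = 1;

        /* Counter attached to this thread, on any CPU. */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the measured loop here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long count;
        read(fd, &count, sizeof(count));
        printf("user-mode instructions: %lld\n", count);
        return 0;
    }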



[Graph: per-iteration performance counters for strides 0 to 128 (https://i.stack.imgur.com/L1VRO.png)]



The only weird thing here is that the number of retired uops increases with the
stride. It goes from 3 uops per iteration (as expected) to 11 for stride 128. Why
is that?



Things only get weirder with larger strides, as the following graph shows. In this
graph, the strides range from 32 to 8192 in 32-byte increments. First, the number
of retired instructions increases linearly from 4 to 5 until stride 4096, after
which it remains constant. The number of load uops increases from 1 to 3, while
the number of L1D load hits remains 1 per iteration. Only the number of L1D load
misses makes sense to me for all strides.



[Graph: per-iteration performance counters for strides 32 to 8192 (https://i.stack.imgur.com/NBEX3.png)]




The two obvious
effects of larger strides
are:




  • The execution time
    increases and so more hardware interrupts will occur. However, I'm counting user-mode
    events, so interrupts should not interfere with my measurements. I've also repeated all
    experiments with taskset or nice and
    got the same results.

  • The number of page walks and page faults increases. (I've verified this but
    I'll omit the graphs for brevity.) Page faults are handled by the kernel in
    kernel mode. According to this answer
    (https://stackoverflow.com/questions/32256250/what-happens-after-a-l2-tlb-miss),
    page walks are implemented using dedicated hardware (on Haswell?), although
    the link that the answer is based on is dead.



To investigate
further, the following graph shows the number of uops from microcode assists. The number
of microcode assist uops per iteration increases until it reaches the maximum value at
stride 4096, just like with the other performance events. The number of microcode assist
uops per 4K virtual page is 506 for all strides. The "Extra UOPS" line plots the number
of retired uops minus 3 (the expected number of uops per
iteration).




[Graph: microcode assist uops and extra uops per iteration (https://i.stack.imgur.com/EdDPZ.png)]



The graph shows that the
number of extra uops is slightly larger than half of the number of microcode assist uops
for all strides. I don't know what this means, but it could be related to page walks and
could be the reason for the observed
perturbation.



Why are the numbers of retired
instructions and uops per iteration increasing for larger strides even though the number
of static instructions per iteration is the same? Where is the interference coming
from?






The
following graphs plot the number of cycles per iteration against the number of retired
uops per iteration for different strides. The number of cycles increases much more
quickly than the number of retired uops. By using linear regression, I
found:




cycles = 0.1773 * stride + 0.8521
uops   = 0.0672 * stride + 2.9277


Taking the
derivatives of both
functions:



d(cycles)/d(stride) = 0.1773
d(uops)/d(stride)   = 0.0672



This means that the number of cycles increases by 0.1773 and the number of retired
uops increases by 0.0672 with each 1-byte increment in stride. If interrupts and
page faults were indeed the (only) cause of perturbation, shouldn't both rates be
very close?



[Graph: cycles per iteration for different strides (https://i.stack.imgur.com/Jh3OD.png)]



[Graph: retired uops per iteration for different strides (https://i.stack.imgur.com/I17a3.png)]


itemprop="text">
class="normal">Answer



The effect you see repeatedly across many of the performance counters, where the
value increases linearly until stride 4096 after which it stays constant, makes
total sense if you assume the effect is purely due to increasing page faults with
increasing stride. Page faults affect the observed values because many counters
are not exact in the presence of interrupts, page faults and so on[1] (see this
paper: http://web.eece.maine.edu/~vweaver/projects/deterministic/deterministic_counters.pdf).



For example, take the instructions counter, which ramps from 4 to 5 as you
progress from stride 0 to 4096. We know from other sources
(https://github.com/deater/deterministic/blob/master/static/determinism.intel)
that each page fault on Haswell will count one extra instruction in user mode
(and one extra in kernel mode as well).




So the number of instructions
we expect is the base of 4 instructions in the loop, plus some fraction of an
instruction based on how many page faults we take per loop. If we assume each new 4 KiB
page causes a page fault, then the number of page faults per iteration
is:



MIN(OFFSET / 4096, 1)


Since each page fault counts an extra instruction, the expected instruction count
is:



4 + 1 * MIN(OFFSET / 4096, 1)



which is
in perfect agreement with your graph.
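
Plugging in a couple of strides makes the ramp and the knee explicit:

OFFSET = 2048: 4 + MIN(2048 / 4096, 1) = 4.5 instructions per iteration
OFFSET = 4096 (or larger): 4 + 1 = 5 instructions per iteration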



So the rough shape of the sloped graphs is explained for all the counters at once,
with the slope depending only on the amount of over-counting per page fault. The
only remaining question is why a page fault affects each counter in the way you
determined. We've covered instructions already, so let's take a peek at the other
ones:



MEM_LOAD_UOPS.L1_MISS



You get only 1 miss per page because only the load that touches the next page
misses anything (it takes a fault). I don't actually agree that it is the L1
prefetcher that results in no other misses: I think you'd get the same result if
you turned off the prefetchers. I think you get no more L1 misses because the
same physical page backs every virtual page, and once you've added the TLB entry
all lines are already in L1 (the very first iteration will miss, but I guess you
are doing many iterations).
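
One way to check the shared-zero-page claim is /proc/self/pagemap: reading the
page-frame number (bits 0-54 of each 64-bit entry) for two different virtual
pages of the buffer should show the same frame. A rough sketch, assuming read
access to pagemap (note that PFNs read back as 0 without CAP_SYS_ADMIN on
recent kernels):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static char buffer[1 << 20];   /* .bss buffer, read but never written */

    /* Read the pagemap entry for the page containing addr and return its
       page-frame number (bits 0-54 of the 64-bit entry). */
    static uint64_t pfn_of(int pagemap_fd, const void *addr)
    {
        uint64_t entry = 0;
        off_t off = (off_t)((uintptr_t)addr / 4096) * sizeof(entry);
        pread(pagemap_fd, &entry, sizeof(entry), off);
        return entry & ((1ULL << 55) - 1);
    }

    int main(void)
    {
        /* Fault in two different virtual pages with reads only. */
        *(volatile char *)&buffer[0];
        *(volatile char *)&buffer[8192];

        int fd = open("/proc/self/pagemap", O_RDONLY);
        printf("pfn(page 0) = %llx, pfn(page 2) = %llx\n",
               (unsigned long long)pfn_of(fd, &buffer[0]),
               (unsigned long long)pfn_of(fd, &buffer[8192]));
        return 0;
    }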



MEM_UOPS_RETIRED.ALL_LOADS




This
shows 3 uops (2 extra) per page-fault.



I'm not
100% sure how this event works in the presence of uop replay. Does it always count a
fixed number of uops based on the instruction, e.g., the number you'd see in Agner's
instruction -> uop tables? Or does it count the actual number of uops dispatched on
behalf of the instruction? These are usually the same, but loads replay their uops when
they miss at various cache levels.



For example, I have found on Haswell and Skylake[2] that when a load misses in
L1 but hits in L2, you see 2 uops total between the load ports (port2 and port3).
Presumably what happens is that the uop is dispatched with the assumption it will hit in
L1, and when this doesn't happen (the result is not ready when the scheduler expected
it), it gets replayed with new timing anticipating an L2 hit. This is "lightweight" in
that it doesn't require any kind of pipeline clear as no wrong-path instructions have
been executed.



Similarly for an L3 miss I have
observed 3 uops per load.



Given that, it seems
reasonable to assume the miss on the new page causes the load uop to be replayed twice
(as I have observed), and those uops show up in the
MEM_UOPS_RETIRED counter. One may reasonably argue that the
replayed uops are not retired, but in some sense retirement is more associated with
instructions than uops. Maybe this counter would be better described as "dispatched uops
associated with retired load
instructions".




UOPS_RETIRED.ALL
and IDQ.MS_UOPS



The
remaining weirdness is the large number of uops associated with each page. It seems
entirely possible that this is associated with the page-fault machinery. You could try a
similar test that misses in the TLB, but doesn't take the page-fault (make sure the
pages are already populated, e.g., using mmap with
MAP_POPULATE).
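
A minimal sketch of that variant might look like this (the 1 GiB size and the
page-sized stride are arbitrary assumptions); MAP_POPULATE prefaults all the
pages at mmap time, so the measured loop afterwards can miss the TLB without
ever taking a page fault:

    #include <sys/mman.h>
    #include <stdio.h>

    #define SIZE   (1UL << 30)   /* 1 GiB; an assumed buffer size */
    #define STRIDE 4096UL

    int main(void)
    {
        /* MAP_POPULATE asks the kernel to fault in all pages up front. */
        char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* The measured loop: for large strides the loads still miss the
           TLB, but no load ever takes a page fault. */
        for (unsigned long off = 0; off < SIZE; off += STRIDE)
            *(volatile char *)&buf[off];

        munmap(buf, SIZE);
        return 0;
    }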



The difference between MS_UOPS and UOPS_RETIRED doesn't seem that odd, since some
uops may not retire. They may also count in different domains (I forget whether
UOPS_RETIRED counts in the fused or unfused domain).



Maybe there is also leakage between
user and kernel mode counts in this
case.



Cycles versus uop
derivative




In the last part of your
question, you show that the "slope" of cycles versus offset is about 2.6x larger than
the slope of retired uops versus offset.



As above, the effect here stops at 4096, and we again expect it to be entirely due
to page faults. So the difference in slope just means that a page fault costs
2.6x more in cycles than it does in uops.
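
Using the fitted slopes, the per-page-fault costs work out to roughly:

extra cycles per fault ≈ 0.1773 * 4096 ≈ 726
extra uops per fault   ≈ 0.0672 * 4096 ≈ 275
ratio                  ≈ 0.1773 / 0.0672 ≈ 2.6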



You
say:




If interrupts
and page faults were indeed the (only) cause of perturbation, shouldn't both rates be
very
close?





I don't see why. The relationship between uops and cycles can vary widely, by
perhaps three orders of magnitude: the CPU might execute four uops per cycle, or
it might take hundreds of cycles to execute a single uop (such as a cache-missing
load).



The value of 2.6 cycles per uop is right
in the middle of this big range and doesn't strike me as odd: it is a bit high
("inefficient" if you were talking about optimized application code) but here we are
talking about page fault handling which is a totally different thing, so we expect long
delays.



Studies into
over-counting



Anyone interested in over-counting due to page faults and other events might be
interested in this github repository (https://github.com/deater/deterministic),
which has exhaustive tests for "determinism" of various PMU events, and where
many results of this nature have been noted, including on Haswell. It doesn't
however cover all the counters Hadi mentions here (otherwise we'd already have
our answer). Here's the associated paper
(http://web.eece.maine.edu/~vweaver/projects/deterministic/deterministic_counters.pdf)
and some easier-to-consume associated slides
(http://web.eece.maine.edu/~vweaver/projects/deterministic/ispass2013_deterministic_slides.pdf);
they mention in particular that one extra instruction is incurred per page fault.



Here's a quote for the results from Intel
(https://github.com/deater/deterministic/blob/master/static/determinism.intel):




Conclusions on the event determinism:

1. BR_INST_RETIRED.ALL (0x04C4)

   a. Near branch (no code segment change): Vince tested
      BR_INST_RETIRED.CONDITIONAL and concluded it as deterministic.
      We verified that this applies to the near branch event by using
      BR_INST_RETIRED.ALL - BR_INST_RETIRED.FAR_BRANCHES.

   b. Far branch (with code segment change): BR_INST_RETIRED.FAR_BRANCHES
      counts interrupts and page-faults. In particular, for all ring
      (OS and user) levels the event counts 2 for each interrupt or
      page-fault, which occurs on interrupt/fault entry and exit (IRET).
      For Ring 3 (user) level, the counter counts 1 for the interrupt/fault
      exit. Subtracting the interrupts and faults (PerfMon event 0x01cb and
      Linux Perf event - faults), BR_INST_RETIRED.FAR_BRANCHES remains a
      constant of 2 for all the 17 tests by Perf (the 2 count appears coming
      from the Linux Perf for counter enabling and disabling).
      Consequently, BR_INST_RETIRED.FAR_BRANCHES is deterministic.



So you expect one extra instruction (in particular, a branch instruction) per
page fault.





[1] In many cases this "inexactness" is still deterministic, in that the over- or
under-counting always behaves in the same way in the presence of the external
event, so you may be able to correct for it if you also track how many of the
relevant events have happened.



[2] I don't mean to limit it to those two micro-architectures: they just happen
to be the ones I've tested.

