Don't write it using inline assembly; that would only make the compiler's job harder. GCC has a built-in extension for prefetch (see the GCC builtins docs for more details) that you should use instead:
void __builtin_prefetch(const void *addr, ...)
This will generate code using the prefetch instructions of your target, but with more scope for the compiler to be smart about it.
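The builtin also takes two optional arguments: rw (0 = read, the default; 1 = write) and locality (0-3, where 3, the default, means high temporal locality). For example (the function name scale, the look-ahead distance of eight elements, and the locality value here are arbitrary choices for illustration, not tuned values):
void scale(double *d, unsigned len) {
    for (unsigned i = 0; i < len; ++i) {
        /* Prefetch a few elements ahead, with write intent and moderate
           temporal locality. Prefetch instructions don't fault, so running
           slightly past the end of the array is harmless. */
        __builtin_prefetch(&d[i + 8], 1, 2);
        d[i] = d[i] * d[i];
    }
}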
As a simple example of the difference between inline asm and GCC's builtin, consider the following two files, test1.c:
void foo(double *d, unsigned len) {
    for (unsigned i = 0; i < len; ++i) {
        __builtin_prefetch(&d[i]);
        d[i] = d[i] * d[i];
    }
}
And test2.c:
void foo(double *d, unsigned len) {
    for (unsigned i = 0; i < len; ++i) {
        asm("prefetcht0 (%0)"
            : /* no outputs */
            : "g"(&d[i])
            : /* no clobbers */
        );
        d[i] = d[i] * d[i];
    }
}
(Note that if you benchmark these, I'm 99% sure that a third version with no prefetch at all would be faster than both of the above, because the access pattern is entirely predictable; all the explicit prefetch really achieves here is adding more bytes of instructions and a few more cycles.)
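For completeness, that no-prefetch version would just be the plain loop, leaving the work to the hardware prefetcher (this is my sketch, not a file from the original comparison):
void foo(double *d, unsigned len) {
    /* No explicit prefetch: a linear sweep like this is exactly the pattern
       the hardware prefetcher already handles well on its own. */
    for (unsigned i = 0; i < len; ++i)
        d[i] = d[i] * d[i];
}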
If we compile test1.c and test2.c with -O3 on x86_64 and diff the generated assembly, we see the output below.
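(Something along these lines should reproduce the comparison; the exact invocation is my sketch, not the original author's:)
gcc -O3 -S test1.c
gcc -O3 -S test2.c
diff -y test1.s test2.s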
.file "test1.c" | .file "test2.c"
.text .text
.p2align 4,,15 .p2align 4,,15
.globl foo .globl foo
.type foo, @function .type foo, @function
foo: foo:
.LFB0: .LFB0:
.cfi_startproc .cfi_startproc
testl %esi, %esi # len testl %esi, %esi # len
je .L1 #, je .L1 #,
leal -1(%rsi), %eax #, D.1749 | leal -1(%rsi), %eax #, D.1745
leaq 8(%rdi,%rax,8), %rax #, D.1749 | leaq 8(%rdi,%rax,8), %rax #, D.1745
.p2align 4,,10 .p2align 4,,10
.p2align 3 .p2align 3
.L4: .L4:
movsd (%rdi), %xmm0 # MEM[base: _8, offset: 0B], D. | #APP
prefetcht0 (%rdi) # ivtmp.6 | # 3 "test2.c" 1
> prefetcht0 (%rdi) # ivtmp.6
> # 0 "" 2
> #NO_APP
> movsd (%rdi), %xmm0 # MEM[base: _8, offset: 0B], D.
addq $8, %rdi #, ivtmp.6 addq $8, %rdi #, ivtmp.6
mulsd %xmm0, %xmm0 # D.1748, D.1748 | mulsd %xmm0, %xmm0 # D.1747, D.1747
movsd %xmm0, -8(%rdi) # D.1748, MEM[base: _8, offset: | movsd %xmm0, -8(%rdi) # D.1747, MEM[base: _8, offset:
cmpq %rax, %rdi # D.1749, ivtmp.6 | cmpq %rax, %rdi # D.1745, ivtmp.6
jne .L4 #, jne .L4 #,
.L1: .L1:
rep ret rep ret
.cfi_endproc .cfi_endproc
.LFE0: .LFE0:
.size foo, .-foo .size foo, .-foo
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4" .ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",@progbits .section .note.GNU-stack,"",@progbits
Even in this simple case, the compiler in question (GCC 4.8.4) has taken advantage of the fact that it's allowed to reorder things and has chosen, presumably on the basis of an internal model of the target processors, to move the prefetch after the initial load. If I had to guess, issuing the load before the prefetch is slightly faster in some scenarios, perhaps because the penalty for a miss followed by a hit is lower in that order, or because it interacts better with branch prediction.

It doesn't really matter why the compiler chose to do this, though. The point is that it's exceedingly hard to fully understand the impact of even trivial changes to generated code on modern processors in real applications. By using builtin functions instead of inline assembly you benefit from the compiler's knowledge today, and from any improvements that show up in the future. Even if you spend two weeks studying and benchmarking this simple case, the odds are fairly good that you won't beat future compilers, and you may even end up with a code base that can't benefit from future improvements.
And all of that is before we even begin to discuss the portability of your code. With builtin functions, the usual behaviour on an architecture without support falls into one of two categories: graceful degradation or enabling emulation (GCC's __builtin_prefetch, for instance, simply emits no prefetch instruction on targets that lack one). By contrast, applications with lots of x86 inline assembly were harder to port to x86_64 when that architecture came along.
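If you need the prefetch hint in code that must also build with compilers lacking the builtin, a thin wrapper is the usual approach. A minimal sketch, assuming a no-op fallback is acceptable (the macro name MY_PREFETCH and the exact feature tests are my choices):
/* Use the builtin where it exists, otherwise degrade gracefully to nothing. */
#if defined(__has_builtin)                       /* clang, and GCC 10+ */
#  if __has_builtin(__builtin_prefetch)
#    define MY_PREFETCH(addr) __builtin_prefetch(addr)
#  endif
#endif
#if !defined(MY_PREFETCH) && defined(__GNUC__)   /* older GCC also has the builtin */
#  define MY_PREFETCH(addr) __builtin_prefetch(addr)
#endif
#ifndef MY_PREFETCH
#  define MY_PREFETCH(addr) ((void)(addr))       /* no prefetch available: do nothing */
#endif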