Sunday 8 September 2019

assembly - Why does breaking the "output dependency" of LZCNT matter?



While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following benchmarks:



  xor ecx, ecx
_benchloop:
lzcnt eax, edx

add ecx, 1
jnz _benchloop


And:



  xor ecx, ecx
_benchloop:
xor eax, eax ; this shouldn't help, but it does
lzcnt eax, edx

add ecx, 1
jnz _benchloop


The second version is much faster. It shouldn't be. There's no reason why LZCNT should have an input dependency on its output. Unlike BSR/BSF, the xZCNT instructions always overwrite their output.



I'm running this on a 4770K, so LZCNT and TZCNT aren't being executed as BSR/BSF.



What's going on here?


Answer




This is simply a limitation in the micro-architecture of your Intel Haswell CPU and several previous1 CPUs. It has been fixed for tzcnt and lzcnt as of Skylake-S (client), but the issue remained for popcnt until it was fixed in Cannon Lake.



On those micro-architectures the destination operand for tzcnt, lzcnt and popcnt is treated as an input dependency even though, semantically, it is not. Now I doubt this is a really a "bug": if it was simply an unintended behavior/oversight, I expect it would have been fixed in one of the several new micro-architectures that have been released since it was introduced.



More likely is that it is a design compromise based on one or both of the following two factors:




  • The hardware for popcnt, lzcnt and tzcnt is likely all shared with the existing bsf and bsr instructions. Now bsf and bsr do have a dependency on the previous destination value in practice2 for the special case of all-bits-zero input, since Intel chips leave the destination unmodified in that case. So it is entirely possible that the simplest design for the combined hardware resulted in the other similar instructions executing on the same unit inheriting the same dependency.


  • The vast majority of x86 two operand ALU instructions have an dependency on the destination operand, since it is used as a source as well. The three affected instructions are somewhat unique in that they are unary operators, but unlike existing unary operators like not and neg which have a single operand used as source and destination, they have distinct source and destination operands, making them superficially similar to most 2-input instructions. Perhaps the renamer/scheduler circuitry just doesn't distinguish the special case of these unary-with-two-register-operand versus the vast majority of plain shared source/destination 2-input instructions which don't have this dependency.





In fact, for the case of popcnt Intel has issued various errata covering the false dependency issue such as HSD146 for Haswell Desktop and SKL029 for Skylake, which reads:




POPCNT Instruction May Take Longer to Execute Than Expected



Problem POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non -dependent instructions have executed.



Implication Software using the POPCNT instruction may experience lower performance than expected.




Workaround None identified




I always found this erratum unusual since it isn't really identifying any type of functional defect or non-conformance to specification which is the case for essentially all the other errata. Intel doesn't really document a specific performance model for the OoO execution engine and there are a ton of other performance "gotchas" that have appeared and disappeared over the years (many with a much larger impact that this very minor issue) that don't get documented in errata. Still, this perhaps provides some evidence that it can be considered a bug. Oddly, the erratum was never extended to include tzcnt or lzcnt which had the same issue when they were introduced.






1 Well tzcnt and lzcnt only appeared in Haswell, but the problem exists for popcnt as well which was introduced in Nehalem - but the false dependency problem perhaps only exists for Sandy Bridge or later.




2 In practice, although not documented in the ISA docs, since the result for all-zero input was undefined in the Intel manuals. Most or all Intel chips implemented the behavior as leaving the destination register unchanged in this case, however.


No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...