Welcome to part 2 of a series doing a deep-dive into Java's C2 compiler output. In this instalment we'll discuss the challenges and limitations of microbenchmarking for compiler writers, and seek to demonstrate that for tight computational loops, counting cycles across different architectures can give a more accurate view. We will compare three different ways of estimating performance: manually counting cycles, using a fairly standard tool (llvm-mca), and using a newer tool (uiCA). If you'd like to understand how to dump and read the assembly from HotSpot, check out part 1, and if you're interested in how we will still use benchmarking to double-check our results, part 3 will dive into that. Once you're done with this article, check out part 4, where we put this into practice to analyse specific Java snippets.
When talking about how the JIT can make code faster, as opposed to how you can change your code to make it faster, we also lose out on some of the most profitable optimisations: better algorithm or data structure choice. As this series is focused on the JIT, we will largely take code at "face value" and assume it is code that could reasonably be encountered in production. Well-written code is often defined by its readability and maintainability — the more developers can focus on achieving bug-free code and selecting the best algorithms, and the less they have to focus on whether a StringBuilder or a String.format call is going to be faster, the better.
When applying what you may learn in this series in the real world, remember that working at a higher level of abstraction and looking to reduce the amount of work a program is doing (or having it do different work) is often better (and more maintainable!) than trying to get the work done faster.
When it comes to larger blocks of code, or to assessing and comparing algorithmic or application performance, real-world production metrics are the gold standard. Well-written benchmarks can provide a reasonable alternative where this is not feasible; however, that caveat (requiring the benchmark to be well-written) can prove difficult to meet, given that it is generally accepted that writing meaningful benchmarks is hard. It's perfectly OK to be a bit seat-of-your-pants if you're trying to measure a 10x difference between two approaches, but as you get down to making something 1–2% faster you absolutely must do everything right.
When you get down to the level of talking about assembly as we are, there's also another rather large elephant in the room: you and I have different computers, and neither of us is likely to have access to a large performance lab full of a dozen different CPU generations and configurations. For many companies, it is considered a luxury to have even a single server dedicated to benchmarking (or one that can be dedicated for short periods).
These reasons mean that we're going to largely eschew the use of benchmarks here after our first deep-dive in part 4. We'll still be confirming our analyses with benchmarks in many cases, but primarily we'll be combining the number of cache lines accessed by different versions of a function with a cycle simulation that estimates how many CPU cycles a given operation requires. For code that is primarily limited by memory latency or bandwidth, the exact instructions used can often be irrelevant (a main-memory access on a server can take more than 100 ns, which at 3 GHz is over 300 clock cycles, and a TLB miss can make this much worse), but where data fits into the L1 cache this approach lets us look across architectures that we don't necessarily have access to for benchmarking.
Check out this example from the JMH sample repository. I've included the results from running it on my laptop and the relevant bits of source below. There's a comment at the top of the linked source that explains that int[] and ArrayList<Integer> have different performance due to boxing, but no mention of why sumArrayListShuffled may be 6x slower than sumArrayList.
Benchmark              Mode  Cnt    Score    Error  Units
sumArray               avgt    3    3.290 ±  0.034  ms/op
sumArrayList           avgt    3   22.240 ±  3.455  ms/op
sumArrayListShuffled   avgt    3  142.465 ± 28.799  ms/op
@Setup
public void setup() {
Random random = new Random(1234);
for (int i = 0; i < N; i++) {
intArray[i] = random.nextInt();
intList.add(intArray[i]);
shuffledIntList.add(intArray[i]);
}
Collections.shuffle(shuffledIntList);
}
@Benchmark
public long sumArrayList() {
long sum = 0;
for (int i = 0; i < N; i++) {
sum += intList.get(i);
}
return sum;
}
// sumArray and sumArrayListShuffled have the same shape as the above, just summing
// intArray and shuffledIntList respectively (a sketch is given below).
// The interesting part is in the setup method.
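For reference, here is roughly how I'd expect the elided fields and benchmark methods to look. This is my own sketch based on the names in the snippet above; the value of N and the exact declarations in the real sample may differ.

private static final int N = 10_000_000;                   // assumed size, not from the sample
private final int[] intArray = new int[N];
private final List<Integer> intList = new ArrayList<>(N);
private final List<Integer> shuffledIntList = new ArrayList<>(N);

@Benchmark
public long sumArray() {
    long sum = 0;
    for (int i = 0; i < N; i++) {
        sum += intArray[i];                                 // primitive array, no unboxing
    }
    return sum;
}

@Benchmark
public long sumArrayListShuffled() {
    long sum = 0;
    for (int i = 0; i < N; i++) {
        sum += shuffledIntList.get(i);                      // same loop, shuffled boxed list
    }
    return sum;
}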
This shows a fairly big gotcha with benchmarks and another reason that when microbenchmarking you've got to really understand what it is you're measuring. To some readers, it may be immediately obvious what the difference is and why there's such a large gap between the two — to others (like myself), it may take some more reflection on the matter.
As a hint, changing random.nextInt() to random.nextInt(100) gave the following output on my laptop:
Benchmark              Mode  Cnt   Score   Error  Units
sumArray               avgt    3   3.309 ± 0.692  ms/op
sumArrayList           avgt    3  10.300 ± 2.720  ms/op
sumArrayListShuffled   avgt    3  10.608 ± 5.358  ms/op
Have a think about what could cause such a difference before checking out my personal answer to this quandary below:
During the setup phase, each Integer will be allocated from the TLAB and placed sequentially in memory. As most cache lines are 64 bytes and on my machine an Integer is 16 bytes, each cache-line fill is likely to bring the next 4 elements into the L1 cache, rather than just the single element being accessed. This type of (mostly sequential) memory access is also friendlier to the processor's prefetcher, as well as being better for the TLB (on my laptop, the L1 dTLB for 4K pages is 128 entries).
Though the TLAB may run out and require a refill during this process, most elements will still be adjacent to one another. You can verify this by running JMH with appropriate flags to have perf count cache-misses (see the sketch below). The shuffled variant has its memory accesses scattered through the heap, meaning most accesses will not be cached — that the shuffled variant is only 6x slower (on my machine) is a testament to the latency-hiding capabilities of modern CPUs. (The nextInt(100) variant "fixes" all of this by getting rid of the allocations and using the Integer cache.)
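As a sketch of what those flags might look like, the snippet below runs the benchmarks programmatically with JMH's Linux perf-based profiler attached. The class name and include pattern here are placeholders of mine; the command-line equivalent is passing -prof perfnorm to the benchmark jar, and both require a Linux machine with perf installed.

import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class CacheMissCheck {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("sumArrayList")                  // regex matching the benchmarks above
                .addProfiler(LinuxPerfNormProfiler.class) // normalised perf counters, incl. cache-misses, per op
                .build();
        new Runner(opt).run();
    }
}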
Which one is more indicative of the real world? Well, that depends on your specific case. If your integers are all going to be allocated sequentially, then the non-shuffled version is more indicative, while if they are truly random and shuffled (perhaps allocated over time or from different threads), you're going to see performance more like the shuffled variant. It's worth noting that real-world data often has far more 0s, 1s and other small integers, meaning Java's Integer cache will be used for many values even if others are "random". The main take-away is that when you're benchmarking, it's very easy to come up with numbers that will mislead or confound you when you try to replicate the effects in the real world (you did double-check your optimised code against production metrics, right?).
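As an aside on that Integer cache: by default, autoboxing goes through Integer.valueOf, which caches the values -128 to 127, so every value produced by nextInt(100) is a shared, long-lived instance rather than a fresh allocation. A minimal illustration (mine, not from the sample):

Integer a = 100, b = 100;   // both boxed via Integer.valueOf, which returns the cached instance
Integer c = 1000, d = 1000; // outside the default cache range, so two separate allocations
System.out.println(a == b); // true  - same cached object
System.out.println(c == d); // false (typically) - distinct objects; compare with equals() instead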
Here's a snippet we'll be optimising in part 4 of this series; it reads a little-endian integer from a byte[]. The assembly that C2 produces for it on my machine, and which we'll be analysing, is also shown below.
public static int readLittleEndian(byte[] aData, int aIndex) {
return (aData[aIndex] & 0x000000FF) |
((aData[aIndex + 1] & 0x000000FF) << 8) |
((aData[aIndex + 2] & 0x000000FF) << 16) |
((aData[aIndex + 3] & 0x000000FF) << 24);
}
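As a quick sanity check on the byte order (this little usage example is mine, not part of the original snippet), the least-significant byte comes first in the array:

byte[] data = {0x78, 0x56, 0x34, 0x12};
int value = readLittleEndian(data, 0);  // 0x12345678

The C2 assembly for this method follows: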
mov r10d, dword ptr [rsi+0xc]           ; r10 = aData.length ; implicit exception: dispatches to npe0
cmp edx, r10d
jae uncommon0                           ; if (aIndex >= aData.length) goto uncommon0
movzx r9d, byte ptr [rsi+rdx*1+0x10]    ; r9d = aData[aIndex]
lea r11d, [rdx+0x3]                     ; r11 = aIndex + 3
cmp r11d, r10d
jae uncommon1                           ; if (aIndex + 3 >= aData.length) goto uncommon1
movsxd r10, edx                         ; r10 = aIndex
movzx eax, byte ptr [rsi+r10*1+0x11]    ; eax = aData[aIndex + 1]
movsx r8d, byte ptr [rsi+r10*1+0x13]    ; r8d = aData[aIndex + 3]
movzx r10d, byte ptr [rsi+r10*1+0x12]   ; r10d = aData[aIndex + 2]
shl eax, 0x8                            ; eax = aData[aIndex + 1] << 8
or eax, r9d                             ; eax |= aData[aIndex]
shl r10d, 0x10                          ; r10d = aData[aIndex + 2] << 16
or eax, r10d                            ; eax |= (aData[aIndex + 2] << 16)
shl r8d, 0x18                           ; r8d = aData[aIndex + 3] << 24
or eax, r8d                             ; eax |= (aData[aIndex + 3] << 24)
As a straw-man approach to this problem that is simultaneously error-prone, labour-intensive, and inaccurate, let's first demonstrate how we might estimate the latency and throughput of this code on a Cypress Cove core by hand. We will assume all reads hit in L1 and that there are no branch mispredictions.
Id | Instruction | Size (bytes) | Latency (cycles) | Throughput (cycles) | Deps (Id) | Delay (cycles) |
---|---|---|---|---|---|---|
0 | mov r10d, dword ptr [rsi+0xc] | 4 | 5 | 0.5 | - | 5 |
1 | cmp edx, r10d | 3 | 1 | 0.25 | 0 | 6 |
2 | jae uncommon0 | 6 | Fused | - | 1 | 6 |
3 | movzx r9d, byte ptr [rsi+rdx*1+0x10] | 6 | 5 | 0.5 | - | 5 |
4 | lea r11d, [rdx+0x3] | 4 | 1 | 0.25 | - | 1 |
5 | cmp r11d, r10d | 3 | 1 | 0.25 | 0, 4 | 6 |
6 | jae uncommon1 | 6 | Fused | - | 5 | 6 |
7 | movsxd r10, edx | 3 | 1 | 0.25 | - | 1 |
8 | movzx eax, byte ptr [rsi+r10*1+0x11] | 6 | 5 | 0.5 | 7 | 6 |
9 | movsx r8d, byte ptr [rsi+r10*1+0x13] | 6 | 5 | 0.5 | 7 | 6 |
10 | movzx r10d, byte ptr [rsi+r10*1+0x12] | 6 | 5 | 0.5 | 7 | 6 |
11 | shl eax, 0x8 | 3 | 1 | 0.5 | 8 | 9 |
12 | or eax, r9d | 3 | 1 | 0.25 | 8, 11 | 10 |
13 | shl r10d, 0x10 | 4 | 1 | 0.5 | 10 | 7 |
14 | or eax, r10d | 3 | 1 | 0.25 | 12, 13 | 11 |
15 | shl r8d, 0x18 | 4 | 1 | 0.5 | 9 | 7 |
16 | or eax, r8d | 3 | 1 | 0.25 | 9, 14 | 12 |
If we naively assume that all instructions are dispatched at once and that the front-end pipeline adds no latency, we get a final latency of 12 cycles (the largest value in the Delay column above). The latency of a block is the number of cycles from when we start executing the block until its result is available for use. In more complex examples it's possible for the block to still be executing (for example, checking complex boundary conditions that are almost always fine) while the result is already available; in those cases, this series will always consider the latency of the result only. If the result is written out to memory, we will consider the latency that an instruction reading the result would see, taking store-forwarding delays into account.
For our throughput estimate we'll also be naive and assume that all instructions compete for the same resources, simply totalling the Throughput column across all instructions. This gives us a score of 5.75 cycles/iteration (meaning that if we ran this in a tight loop with no other overheads, we could begin executing a new copy every 5.75 cycles).
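To spell that sum out: the two macro-fused jae rows contribute nothing, leaving eight instructions at 0.5 cycles each and seven at 0.25 cycles each:

8 × 0.5 + 7 × 0.25 = 4.00 + 1.75 = 5.75 cycles/iteration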
Throughput is important for any code that will be run in a tight loop with predictable control flow, with latency being more important for code where the result is used as the decider for unpredictable control flow. As it will turn out later, for this specific example these rough guesses lead us to be fairly competitive with real tooling, though for this series we won't be doing any further hand-counting of cycles.
LLVM-MCA is part of the LLVM project and uses the same instruction timing information as is used for code optimisation alongside a simplified model of how the CPU dispatches, executes, and retires instructions. Notes from when the tool was introduced (search "known limitations") indicate that it is unaware of many micro-architectural details (some of these issues may have since been fixed), and this is evident when looking at the resulting timeline information it produces. Here is the output for the above assembly.
Looking at the trace above, we can see instructions being Issued (coming out of the decoder into the pipeline), Dispatched (sent to an execution unit) and Executed (coming out of the execution unit with a finished result) before being Retired (results made architecturally visible) in order. We can read a latency of 12 cycles off the timeline, from first dispatch to final retire, with the tool also telling us that the throughput of the above is 3 cycles/iteration.
The throughput was obtained by having the tool simulate many iterations of the block and dividing the total cycles by the number of iterations. The tool does provide a direct Block RThroughput value, however this does not align with the definition of throughput we will be using (see the example data below):
Iterations:            100     10000    1000000
Instructions:         1700    170000   17000000
Total Cycles:          315     30015    3000015
Total uOps:           1700    170000   17000000
Dispatch Width:          6         6          6
uOps Per Cycle:       5.40      5.66       5.67
IPC:                  5.40      5.66       5.67
Block RThroughput:     2.8       2.8        2.8
As the table shows, each additional iteration adds 3 whole cycles, rather than the 2.8 reported.
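Taking the difference between the 1,000,000- and 10,000-iteration runs makes this explicit:

(3,000,015 - 30,015) cycles / (1,000,000 - 10,000) iterations = 2,970,000 / 990,000 = 3.0 cycles/iteration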
Using reputable sources like Intel's optimisation reference, wikiChip and Agner Fog's microarchitecture guide, we can see that in a Sunny Cove core neither of the jae instructions in the above should consume a uop and that both should dispatch and retire as part of their corresponding cmp due to macro-fusion. From the above, however, llvm-mca does not seem to recognise this and shows the instructions executing sequentially. The tool also seems to indicate the ability to issue 6 instructions per cycle, however according to Chips & Cheese (another great source for micro-architectural details) and Agner Fog's testing, there exists a limitation of 5 instructions/cycle (only 4 if from the normal queue, but 5 if from the uOp cache (5 instructions consisting of up to 6 uOps)). Finally, cycle 8 shows 8 instructions retiring at once, while to the best of my knowledge Sunny Cove only supports 5 wide retirement. It seems there are some other open issues with LLVM-MCA that indicate there may be more shortcomings than are presented here.
While I don't feel qualified to do a cycle-by-cycle pipeline reconstruction by hand with any chance of being correct, given the holes I can poke in this tool's output I don't feel able to place much confidence in it.
The other tool we'll look at is uiCA, specifically the version available online (the reader at home can choose to use it offline or online). This tool hasn't been around as long as llvm-mca (circa 2022 vs 2018) but seems to be more accurate. One limitation is that it only supports Intel CPUs, and only up to Rocket Lake (released 2021), meaning that unless updated it will unfortunately age out fairly soon. Regardless, it is still the most accurate tool I could find at present.
This timeline shows a latency of 14 cycles, with the tool reporting a throughput of 5.22 cycles/iteration (this was run through the tool wrapped in a loop to ensure it was modelled using the Loop Stream Detector (LSD) — without the loop, the model shows the decode stage in more detail). Looking at the output I'm unable to pick a hole in it, so I'm inclined to trust this tool over llvm-mca.
The authors of uiCA claim that the simplifications made by the model used in llvm-mca cause errors to be on average around 25% — a margin compatible with the difference we observe here (52%). In our case, much of the error seems to come from the unlimited retire bandwidth llvm-mca models. Given the margins we're chasing would be swamped by such a large difference, I would err towards uiCA despite the narrower support for CPUs.
Relatedly and reassuringly, uiCA also seems to know about errata that are easy to miss, such as the fact that move elimination was disabled on Ice Lake after release due to an erratum.
Compare:
Move elimination is not modelled in llvm-mca (another limitation), so we've nothing to compare.
Between throughput and latency, the easier to model by hand is latency, as it requires much less knowledge about the internals of a given architecture. This is reflected in the summary below, where all three approaches roughly agree on the latency despite the large differences in predicted throughput. Given the likely higher accuracy of uiCA, it will be used in this series to form performance predictions and to try to explain the differences (or lack thereof) between different code snippets.
Tool | Latency (cycles) | Throughput (cycles/iteration) |
---|---|---|
Manual | 12 | 5.75 |
llvm-mca | 12 | 2.8 |
uiCA | 14 | 5.2 |
Even when we do use a benchmark and find a difference between two code snippets, that doesn't necessarily tell us why there is a difference between them. Sometimes there will be extra instructions in one snippet vs the other that we may wish to point to as the culprit, but my experiences so far have shown mixed success with taking the "obvious" answer — sometimes the extra instructions are "free" as the CPU is bottlenecking elsewhere and they're simply consuming otherwise idle resources (and I've seen an example where an extra uop actually makes the code faster on some architectures due to quirks of the decoder). A future entry in this series will look at limitations to HotSpot's loop unrolling and cycle counting will be very useful to explain what's happening there.