Welcome back to part 4 of this series, diving into potential optimisations to HotSpot's JIT. As a reminder, part 1 contains information on how to dump assembly out of HotSpot, part 2 details how we measure performance of assembly snippets, and part 3 goes over how to benchmark custom assembly.
This part will focus on the code generation for reading an integer from a byte[] or ByteBuffer.
Let's start with some benchmarks (error-prone as they may be) and compare at a surface level a few different ways to read a little-endian integer from a specific index of a byte[].
@State(Scope.Benchmark)
public class Article4 {
    // Imports assumed for this listing: java.nio.ByteBuffer, java.nio.ByteOrder,
    // org.openjdk.jmh.annotations.*, and jdk.internal.misc.Unsafe (which needs
    // --add-exports java.base/jdk.internal.misc=ALL-UNNAMED).
    private static final Unsafe UNSAFE = Unsafe.getUnsafe();
    private static final byte[] DATA = new byte[100];
    private static final ByteBuffer BUFFER_LE = ByteBuffer.wrap(DATA).order(ByteOrder.LITTLE_ENDIAN);

    public static int readIntegerLE_normal(byte[] aData, int aIndex) {
        return (aData[aIndex] & 0x000000FF) |
               ((aData[aIndex + 1] & 0x000000FF) << 8) |
               ((aData[aIndex + 2] & 0x000000FF) << 16) |
               ((aData[aIndex + 3] & 0x000000FF) << 24);
    }

    public static int readIntegerLE_wrapAndGet(byte[] aData, int aIndex) {
        return ByteBuffer.wrap(aData).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
    }

    public static int readIntegerLE_unsafe(byte[] aData, int aIndex) {
        if (aIndex < 0 || aIndex > aData.length - 4) {
            throw new ArrayIndexOutOfBoundsException(aIndex);
        }
        return UNSAFE.getInt(aData, Unsafe.ARRAY_BYTE_BASE_OFFSET + (long) aIndex * Unsafe.ARRAY_BYTE_INDEX_SCALE);
    }

    public static int readIntegerLE_ByteBuffer_stock(ByteBuffer aData, int aIndex) {
        return aData.getInt(aIndex); // Assumes aData set to little endian already
    }

    public static int readIntegerLE_ByteBuffer_rewrap(ByteBuffer aData, int aIndex) {
        return ByteBuffer.wrap(aData.array()).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_normal() {
        return readIntegerLE_normal(DATA, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_wrapAndGet() {
        return readIntegerLE_wrapAndGet(DATA, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_unsafe() {
        return readIntegerLE_unsafe(DATA, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_ByteBuffer_stock() {
        return readIntegerLE_ByteBuffer_stock(BUFFER_LE, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_ByteBuffer_rewrap() {
        return readIntegerLE_ByteBuffer_rewrap(BUFFER_LE, 8);
    }
}
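For completeness, a minimal JMH runner along these lines would drive the benchmarks; the mode and time unit match the avgt/ns/op output below, while the iteration and fork counts are assumptions rather than my exact configuration.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Article4Runner {
    public static void main(String[] args) throws Exception {
        new Runner(new OptionsBuilder()
                .include(Article4.class.getSimpleName())
                .mode(Mode.AverageTime)          // "avgt" in the results below
                .timeUnit(TimeUnit.NANOSECONDS)  // ns/op in the results below
                .warmupIterations(5)             // assumed warmup count
                .measurementIterations(5)        // Cnt = 5 in the results below
                .forks(1)                        // assumed fork count
                .build()).run();
    }
}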
On my (Lunar Lake) laptop, this gave the following results:
Benchmark                        Mode  Cnt  Score   Error  Units
readIntegerLE_ByteBuffer_rewrap  avgt    5  1.570 ± 0.032  ns/op
readIntegerLE_ByteBuffer_stock   avgt    5  1.688 ± 0.048  ns/op
readIntegerLE_normal             avgt    5  1.820 ± 0.018  ns/op
readIntegerLE_unsafe             avgt    5  1.575 ± 0.025  ns/op
readIntegerLE_wrapAndGet         avgt    5  1.578 ± 0.040  ns/op
Before we dive into the assembly in too much detail, let's talk briefly about our results.
It is worth mentioning that on my laptop a method that just returns an integer takes 0.77 ns to run, so subtracting this out as a floor gives a larger percentage difference between these strategies (approximating what might happen if we inlined them, barring issues that will be detailed in a future instalment, where I will show that some of these are hurt by optimisations like loop unrolling):
Benchmark | Time | Relative |
---|---|---|
BB rewrap | 0.80 ns | 130% |
Unsafe | 0.805 ns | 130% |
Wrap and Get | 0.808 ns | 130% |
ByteBuffer | 0.918 ns | 114% |
Normal | 1.050 ns | 100% |
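For reference, that 0.77 ns floor comes from a benchmark method that just returns an integer; a baseline of roughly this shape (a sketch, as the exact method isn't listed here) would sit alongside the other benchmarks in the same class:

// Hypothetical baseline: with DONT_INLINE this measures little more than the
// cost of the call itself, which is the floor subtracted out above.
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
@Benchmark
public int baseline_returnInt() {
    return 42;
}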
We will discuss the results in order from easiest to most complex to explain: first the variant using Unsafe, then the "normal" variant, then the one with an existing ByteBuffer, before finally turning to the last two, which create a new ByteBuffer.
In theory, there is no reason for Unsafe to be faster in this case than the "normal" way of doing things. This variant should include all the same error checking as is standard (the bounds check dereferences the array, so we should have the same sort of implicit null check alongside our index check), so an ideal JIT should be able to produce competitive code either way.
public static int readIntegerLE_unsafe(byte[] aData, int aIndex) {
    if (aIndex < 0 || aIndex > aData.length - 4) {
        throw new ArrayIndexOutOfBoundsException(aIndex);
    }
    return UNSAFE.getInt(aData, Unsafe.ARRAY_BYTE_BASE_OFFSET + (long) aIndex);
}
Let's dig into what this becomes and see if we can convince ourselves that this is all that's required:
test   edx, edx
jl     uncommon0
mov    ebp, dword ptr [rsi+0xc]
add    ebp, 0xfffffffc            ; ebp -= 4
cmp    edx, ebp
jg     uncommon1
movsxd r10, edx
mov    eax, dword ptr [rsi+r10*1+0x10]
This looks mostly good. The inclusion of movsxd is unfortunate, as it's an unnecessary instruction in the dependency chain: edx, holding aIndex, has already been checked to be non-negative, so a sign-extension is redundant. The use of r10 in this instruction is icing on the cake, as it necessitates a REX prefix on both the movsxd and the following mov, whereas using eax or any other "low" register would avoid the REX on the following mov (poor register selection leading to increased code size is a frequent issue in C2-generated code; this doesn't show up on small benchmarks, but in real applications increased code size can cause more frequent i-cache misses). Without this extra instruction we'd get >10% more throughput on some architectures (1.34 cycles/iteration vs 1.5 cycles/iteration on Rocket Lake) on the core of this code (the part that might end up inlined into a hot loop, if we weren't benchmarking one call at a time).
As a reminder, here was our common-sense implementation:
public static int readIntegerLE_normal(byte[] aData, int aIndex) {
    return (aData[aIndex] & 0x000000FF) |
           ((aData[aIndex + 1] & 0x000000FF) << 8) |
           ((aData[aIndex + 2] & 0x000000FF) << 16) |
           ((aData[aIndex + 3] & 0x000000FF) << 24);
}
Before we dive into this implementation, I want to look at the same sort of function run through a C++ compiler. I spent a few hours porting this code to C++ and ran the result through Clang (via Godbolt, using Clang 19.1.0 with -O3).
int readIntegerLE_normal(char *aData, int aIndex) {
    return (aData[aIndex] & 0x000000FF) |
           ((aData[aIndex + 1] & 0x000000FF) << 8) |
           ((aData[aIndex + 2] & 0x000000FF) << 16) |
           ((aData[aIndex + 3] & 0x000000FF) << 24);
}
readIntegerLE_normal(char*, int):
        movsxd rax, esi
        mov    eax, dword ptr [rdi + rax]
        ret
This is very promising: what we have is basically the same as our Unsafe variant, sans error checking. In Clang's case, as our index is signed and we don't have any code to prove it's non-negative, the sign-extension is unavoidable. Looking back at older versions (thanks again to Godbolt), it looks like this has worked by default (at -O3) since mid-2017 (LLVM 5.0), with pre-5.0 versions unable to recognise this with default options (I haven't bothered to check whether this was a new feature introduced in 5.0 or whether prior versions could do it with additional optimisation passes enabled). I'm going to count that as 8 years ago.
Now that we've got that out of the way let's see what Java (OpenJDK 23) makes of this:
mov    r10d, dword ptr [rsi+0xc]       ; r10 = aData.length ; implicit exception: dispatches to npe0
cmp    edx, r10d
jae    uncommon0                       ; if (aIndex >= aData.length) goto UncommonTrapBlob
movzx  r9d, byte ptr [rsi+rdx*1+0x10]  ; r9d = aData[aIndex]
lea    r11d, [rdx+0x3]                 ; r11 = aIndex + 3
cmp    r11d, r10d
jae    uncommon1                       ; if (aIndex + 3 >= aData.length) goto UncommonTrapBlob
movsxd r10, edx                        ; r10 = aIndex
movzx  eax, byte ptr [rsi+r10*1+0x11]  ; eax = aData[aIndex + 1]
movsx  r8d, byte ptr [rsi+r10*1+0x13]  ; r8d = aData[aIndex + 3]
movzx  r10d, byte ptr [rsi+r10*1+0x12] ; r10d = aData[aIndex + 2]
shl    eax, 0x8                        ; eax = aData[aIndex + 1] << 8
or     eax, r9d                        ; eax |= aData[aIndex]
shl    r10d, 0x10                      ; r10d = aData[aIndex + 2] << 16
or     eax, r10d                       ; eax |= (aData[aIndex + 2] << 16)
shl    r8d, 0x18                       ; r8d = aData[aIndex + 3] << 24
or     eax, r8d                        ; eax |= (aData[aIndex + 3] << 24)
Oh dear. It looks like this pattern isn't recognised by C2 at all. We've ended up with 4 reads from successive addresses and a whole slew of operations to assemble them together. I'd also poke at the register allocator's choice to use "higher"/REX-requiring registers while non-REX registers are still free, but this is a minor point in comparison. We also still have the unnecessary sign-extension of edx into r10.
I will also note the compiler's strange choice to use movsx over movzx for the final load into r8d. This doesn't have an impact on correctness, since that byte is subsequently shifted left by 24 bits and any sign-extended upper bits are discarded anyway; I'm just personally curious as to why this instruction was selected.
In terms of latency, the main delay (unsurprisingly) comes from the dependency chain involved in assembling the result. This is actually the snippet used in part 2 to introduce uiCA.
Now we're closing in on the surprising results (namely wrapping an array, and re-wrapping the ByteBuffer). We're not quite there yet, but that's coming up as soon as we've looked at the stock ByteBuffer implementation.
public static int readIntegerLE_ByteBuffer_stock(ByteBuffer aData, int aIndex) {
    return aData.getInt(aIndex); // Assumes aData set to little endian already
}
mov    r10d, dword ptr [rsi+0x8]  ; implicit exception: dispatches to npe0
cmp    r10d, 0x2b2dc8             ; {metadata('java/nio/HeapByteBuffer')}
jne    uncommon0
mov    ebp, dword ptr [rsi+0x24]  ; ebp = aData.segment
test   ebp, ebp
jne    uncommon1                  ; if (aData.segment != 0) goto uncommon1
mov    r11d, dword ptr [rsi+0x30] ; r11d = aData.hb
mov    ebp, dword ptr [rsi+0x1c]  ; ebp = aData.limit
add    ebp, 0xfffffffd            ; ebp = aData.limit - 3
test   ebp, ebp
jl     uncommon2                  ; if (aData.limit < 3) goto UncommonTrapBlob
cmp    edx, ebp
jae    uncommon3                  ; if (aIndex >= (aData.limit - 3)) goto UncommonTrapBlob
mov    r10, qword ptr [rsi+0x10]  ; r10 = aData.address
movzx  r8d, byte ptr [rsi+0x2d]   ; r8d = aData.bigEndian
movsxd rbp, edx                   ; rbp = aIndex
add    rbp, r10                   ; rbp = aData.address + aIndex
test   r11d, r11d
je     uncommon4                  ; if (aData.hb == null) goto UncommonTrapBlob
lea    r10, [r12+r11*8]           ; r10 = aData.hb (decompressed)
mov    eax, dword ptr [r10+rbp*1] ; eax = aData.hb[aData.address + aIndex]
test   r8d, r8d
jne    big_endian
epilogue:
    ; ...
big_endian:
    bswap eax
    jmp   epilogue
At first glance this code may look a lot slower than the previous code, and arguably it is slower (15% slower than the Unsafe variant in this specific scenario), but importantly the critical path is not as long as it appears. The litany of checks this code performs adds to our total latency only by taking up space; the checks don't form part of the dependency chain for our result. We can also see that, like our Unsafe variant at the start, the read happens via a single memory operation (after performing all these checks, ByteBuffer uses Unsafe to do the read).
I will still nitpick slightly and say that a peephole optimisation was missed here: the test ebp, ebp could be eliminated, as the preceding add already sets the flags for us (substituting js for jl if you wanted to completely mimic the original behaviour, depending on how aware the peephole optimiser is of the original intention). That would save a uop on architectures such as more recent Zen versions.
So why are the wrapping variants so fast? The short answer is that the JIT is able to promote the allocation to the stack, effectively eliminating it, and, thanks to the inliner resolving all the method calls, it can "see" the ByteBuffer for the entire time it is alive, allowing it to know the values of the fields and eliminate the branches doing the error checking.
public static int readIntegerLE_wrapAndGet(byte[] aData, int aIndex) {
    return ByteBuffer.wrap(aData).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
}
mov   r11d, dword ptr [rsi+0xc] ; implicit exception: dispatches to npe0
lea   ebp, [r11-0x3]
test  ebp, ebp
jl    uncommon0
cmp   edx, ebp
jae   uncommon1
mov   eax, dword ptr [rsi+rdx*1+0x10]
Note that this really requires cooperation from the inliner. In a production setting, as your call stacks get deeper, it can come to pass that the inliner decides to stop inlining something in here, and suddenly your nice fast code begins spewing temporaries everywhere and grinds to a halt. Still, the code benchmarks nicely.
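If you want to gauge how much this result leans on the optimiser, one experiment (my own sketch, not part of the benchmark class above) is to fork the same benchmark with escape analysis disabled, so the temporary HeapByteBuffer can no longer be scalar-replaced:

// Hypothetical companion benchmark: identical to readIntegerLE_wrapAndGet above,
// but run in a fork with -XX:-DoEscapeAnalysis so the wrapping ByteBuffer must
// actually be allocated. Any large slowdown shows how much of the fast result
// comes from allocation elimination.
@Fork(jvmArgsAppend = {"-XX:-DoEscapeAnalysis"})
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
@Benchmark
public int readIntegerLE_wrapAndGet_noEscapeAnalysis() {
    return readIntegerLE_wrapAndGet(DATA, 8);
}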
Re-wrapping the array from a ByteBuffer benefits similarly from the removal of all the extra branches and is nice and fast.
public static int readIntegerLE_ByteBuffer_rewrap(ByteBuffer aData, int aIndex) {
    return ByteBuffer.wrap(aData.array()).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
}
mov    r10d, dword ptr [rsi+0x30]      ; r10d = aData.array ; implicit exception: dispatches to npe0
mov    r8d, dword ptr [r12+r10*8+0xc]  ; r8d = aData.array.length ; implicit exception: dispatches to npe1
movzx  ebp, byte ptr [rsi+0x2c]        ; ebp = aData.isReadOnly
test   ebp, ebp
jne    uncommon0                       ; if (aData.isReadOnly) goto UncommonTrapBlob
lea    ebp, [r8-0x3]                   ; ebp = aData.array.length - 3
test   ebp, ebp
jl     uncommon1                       ; if (aData.array.length < 3) goto UncommonTrapBlob
cmp    edx, ebp
jae    uncommon2                       ; if (aIndex >= aData.array.length - 3) goto UncommonTrapBlob
shl    r10, 0x3                        ; r10 = aData.array (uncompressed)
mov    eax, dword ptr [r10+rdx*1+0x10] ; eax = ((int *) &aData.array)[aIndex]
This time we've avoided sign-extending our index, but we've got an unnecessary shl on our critical path (decompressing the compressed reference to the backing array), as well as some extra loads for having used a ByteBuffer to begin with.
Choosing to use DONT_INLINE lets us focus on the benchmarked code more easily, rather than spending time nitpicking how the optimiser has chosen to unroll the benchmark loop. Here's a quick comparison showing how my laptop performs with DONT_INLINE vs without; remember that which variant is "better" depends entirely on the context in which this code will exist in your production codebase. With DONT_INLINE our array/buffer and index both come from parameters, whereas without it they come from constants, as the inliner can "see" what we're doing (one way to avoid that is sketched after the table).
Benchmark | `DONT_INLINE` | Regular |
---|---|---|
BB rewrap | 0.80 ns | 0.564 ns |
Unsafe | 0.805 ns | 0.312 ns |
Wrap and Get | 0.808 ns | 0.339 ns |
ByteBuffer | 0.918 ns | 0.708 ns |
Normal | 1.050 ns | 0.610 ns |
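As mentioned above, here's one way (a sketch of my own, not from the benchmark class listed earlier) to benchmark the inlined variants without the inputs collapsing into compile-time constants: keep the array and index in non-final @State fields so the JIT can't treat them as constants, even once the call is inlined into the benchmark loop.

// Hypothetical additions to the Article4 state class: non-final fields defeat
// constant folding while still allowing the read to be inlined.
private byte[] data = DATA;
private int index = 8;

@Benchmark
public int readIntegerLE_unsafe_inlined() {
    // No DONT_INLINE here, so this call may be inlined into the benchmark loop,
    // but the inputs still come from mutable fields rather than constants.
    return readIntegerLE_unsafe(data, index);
}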
Ok, so we've had a look at a few different ways to read an integer out of a byte[]. Overall the differences between the three "winning" methods are all within measurement error, so I'd consider them equally fast. If I were using this in production for normal code, I'd choose the "normal" implementation or the "stock" ByteBuffer version, as these follow the principle of least surprise. If I were writing a performance-sensitive portion, I'd choose the Unsafe variant, as it is the most likely to perform reliably (it doesn't lean as heavily on the inliner being in a good mood; the others potentially introduce an allocation on the hot path that we may not want to trust the optimiser to always remove).
Benchmark | Time | Relative |
---|---|---|
BB rewrap | 0.80 ns | 130% |
Unsafe | 0.805 ns | 130% |
Wrap and Get | 0.808 ns | 130% |
ByteBuffer | 0.918 ns | 114% |
Normal | 1.050 ns | 100% |
What's the takeaway otherwise? Well, I guess it's that the optimiser can do some pretty surprising and powerful things, though it can sometimes miss smaller micro-optimisations, owing to the lack of a peephole optimiser and an occasional inability to determine that a variable does not require sign extension.