
Reading an Integer in Java

Welcome back to part 4 of this series, diving into potential optimisations to HotSpot's JIT. As a reminder, part 1 contains information on how to dump assembly out of HotSpot, part 2 details how we measure the performance of assembly snippets, and part 3 goes over how to benchmark custom assembly.

This part will focus on the code generation for reading an integer from a byte[] or ByteBuffer.

Reading an int from a byte[]

Let's start with some benchmarks (error-prone as they may be) and compare at a surface level a few different ways to read a little-endian integer from a specific index of a byte[].

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Imports are not shown in the original listing; Unsafe here is assumed to be
// jdk.internal.misc.Unsafe (which needs --add-exports java.base/jdk.internal.misc=ALL-UNNAMED).
import jdk.internal.misc.Unsafe;

@State(Scope.Benchmark)
public class Article4 {
    private static final Unsafe UNSAFE = Unsafe.getUnsafe();
    private static final byte[] DATA = new byte[100];
    private static final ByteBuffer BUFFER_LE = ByteBuffer.wrap(DATA).order(ByteOrder.LITTLE_ENDIAN);

    public static int readIntegerLE_normal(byte[] aData, int aIndex) {
        return (aData[aIndex] & 0x000000FF) |
                ((aData[aIndex + 1] & 0x000000FF) << 8) |
                ((aData[aIndex + 2] & 0x000000FF) << 16) |
                ((aData[aIndex + 3] & 0x000000FF) << 24);
    }

    public static int readIntegerLE_wrapAndGet(byte[] aData, int aIndex) {
        return ByteBuffer.wrap(aData).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
    }

    public static int readIntegerLE_unsafe(byte[] aData, int aIndex) {
        if (aIndex < 0 || aIndex > aData.length - 4) {
            throw new ArrayIndexOutOfBoundsException(aIndex);
        }

        return UNSAFE.getInt(aData, Unsafe.ARRAY_BYTE_BASE_OFFSET + (long) aIndex * Unsafe.ARRAY_BYTE_INDEX_SCALE);
    }

    public static int readIntegerLE_ByteBuffer_stock(ByteBuffer aData, int aIndex) {
        return aData.getInt(aIndex); // Assumes aData set to little endian already
    }

    public static int readIntegerLE_ByteBuffer_rewrap(ByteBuffer aData, int aIndex) {
        return ByteBuffer.wrap(aData.array()).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_normal() {
        return readIntegerLE_normal(DATA, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_wrapAndGet() {
        return readIntegerLE_wrapAndGet(DATA, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_unsafe() {
        return readIntegerLE_unsafe(DATA, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_ByteBuffer_stock() {
        return readIntegerLE_ByteBuffer_stock(BUFFER_LE, 8);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int readIntegerLE_ByteBuffer_rewrap() {
        return readIntegerLE_ByteBuffer_rewrap(BUFFER_LE, 8);
    }
}
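
For reference, here is roughly how these benchmarks could be launched so that the mode and units match the results below. This runner is a hypothetical sketch (the article doesn't show how the benchmarks were actually run), assuming Article4 sits in a standard JMH project:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Article4Runner { // hypothetical class name
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(Article4.class.getSimpleName())
                .mode(Mode.AverageTime)         // "avgt" in the results table
                .timeUnit(TimeUnit.NANOSECONDS) // ns/op
                .build();
        new Runner(opt).run();
    }
}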

On my (Lunar Lake) laptop, this gave the following results:

Benchmark                        Mode  Cnt  Score   Error  Units
readIntegerLE_ByteBuffer_rewrap  avgt    5  1.570 ± 0.032  ns/op
readIntegerLE_ByteBuffer_stock   avgt    5  1.688 ± 0.048  ns/op
readIntegerLE_normal             avgt    5  1.820 ± 0.018  ns/op
readIntegerLE_unsafe             avgt    5  1.575 ± 0.025  ns/op
readIntegerLE_wrapAndGet         avgt    5  1.578 ± 0.040  ns/op

Before we dive into the assembly in too much detail, let's talk briefly about our results.

  • All three "winners" are within measurement error on my laptop, meaning there is likely only a cycle or so between them (one cycle is approximately 0.33 ns on my laptop, and the differences are well within this; see the quick arithmetic after this list).
  • The normal variant appears the slowest, so is likely unoptimised (we will confirm this later).
  • The stock ByteBuffer variant does, on inspection of the source, use Unsafe under the hood, though with a fair bit of overhead on top.
  • The inliner is definitely coming into play for rewrap and wrapAndGet; we'll talk later about why I would avoid these variants in production.
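
To put the first bullet in rough numbers, using the scores above:

    1 cycle ≈ 0.33 ns, which implies an effective clock of roughly 3 GHz
    largest gap between the three "winners": 1.578 ns − 1.570 ns = 0.008 ns ≈ 0.02 cycles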

It is worth mentioning that on my laptop a method that just returns an integer takes 0.77 ns to run, so subtracting this out as a floor gives a larger percentage difference between these strategies (approximating what might happen if we inlined them, barring issues to be detailed in a future instalment, where I will show that some of these are hurt by optimisations like loop unrolling):

Benchmark      Time      Relative
BB rewrap      0.80 ns   130%
Unsafe         0.805 ns  130%
Wrap and Get   0.808 ns  130%
ByteBuffer     0.918 ns  114%
Normal         1.050 ns  100%
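
For example, the "BB rewrap" and "Normal" rows follow from the raw scores like so:

    1.570 ns (rewrap raw score) − 0.77 ns (empty-method floor) ≈ 0.80 ns
    1.820 ns (normal raw score) − 0.77 ns (empty-method floor) ≈ 1.05 ns
    1.05 ns / 0.80 ns ≈ 1.31, i.e. roughly 130% relative to "Normal"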

We will discuss the results in order from easiest to most complex to explain. We'll look at the variant using Unsafe, then the "normal" variant, then the one with an existing ByteBuffer, before finally turning to our last two that create a new ByteBuffer.

Why is Unsafe fast?

In theory, there is no reason for Unsafe to be faster in this case than the "normal" way of doing things. This variant should include all the same error checking as is standard (the bounds check dereferences the array, so we should have the same sort of implicit null check alongside our index check), so an ideal JIT should be able to produce competitive code either way.

public static int readIntegerLE_unsafe(byte[] aData, int aIndex) {
    if (aIndex < 0 || aIndex > aData.length - 4) {
        throw new ArrayIndexOutOfBoundsException(aIndex);
    }

    return UNSAFE.getInt(aData, Unsafe.ARRAY_BYTE_BASE_OFFSET + (long) aIndex);
}

Let's dig into what this becomes and see if we can convince ourselves that this is all that's required:

    test   edx, edx
    jl     uncommon0
    mov    ebp, dword ptr [rsi+0xc]
    add    ebp, 0xfffffffc                   ; ebp -= 4
    cmp    edx, ebp
    jg     uncommon1
    movsxd r10, edx
    mov    eax, dword ptr [rsi+r10*1+0x10]
[uiCA pipeline diagram: readIntegerLE_unsafe (Rocket Lake)]

This looks mostly good. The inclusion of movsxd is unfortunate, as it is an unnecessary instruction in the dependency chain: edx, holding aIndex, has already been checked to be non-negative, so a sign-extension is redundant. The use of r10 in this instruction adds insult to injury, as it necessitates a REX prefix on both the movsxd and the following mov, whereas using eax or any other "low" register would avoid the REX on the following mov (poor register selection leading to increased code size is a frequent issue in C2-generated code; it doesn't show up on small benchmarks, but in real applications increased code size can cause more frequent i-cache misses). Without this extra instruction we would get over 10% more throughput on some architectures (1.34 cycles/iteration vs 1.5 cycles/iteration on Rocket Lake) on the core of this code (the part that might end up inlined into a hot loop, if we weren't benchmarking one call at a time).

Why is "normal" slow?

As a reminder, here was our common-sense implementation:

public static int readIntegerLE_normal(byte[] aData, int aIndex) {
    return (aData[aIndex] & 0x000000FF) |
            ((aData[aIndex + 1] & 0x000000FF) << 8) |
            ((aData[aIndex + 2] & 0x000000FF) << 16) |
            ((aData[aIndex + 3] & 0x000000FF) << 24);
}

Before we dive into this implementation, I want to look at the same sort of function run through a C++ compiler. I spent a few hours porting this code to C++ and ran the result through Clang (via Godbolt, using Clang 19.1.0 with -O3).

int readIntegerLE_normal(char *aData, int aIndex) {
    return (aData[aIndex] & 0x000000FF) |
            ((aData[aIndex + 1] & 0x000000FF) << 8) |
            ((aData[aIndex + 2] & 0x000000FF) << 16) |
            ((aData[aIndex + 3] & 0x000000FF) << 24);
}
readIntegerLE_normal(char*, int):
    movsxd  rax, esi
    mov     eax, dword ptr [rdi + rax]
    ret

This is very promising: what we have is basically the same as our Unsafe variant, sans error checking. In Clang's case, as our index is signed and we have no code proving it is non-negative, the sign-extension is unavoidable. Looking back at older versions (thanks again to Godbolt), it looks like this has worked by default (at -O3) since mid-2017 (LLVM 5.0), with pre-5.0 versions not able to recognise the pattern with default options (I haven't checked whether this was a new feature in 5.0 or whether prior versions could do it with additional optimisation passes enabled). I'm going to count that as 8 years ago.

Now that we've got that out of the way let's see what Java (OpenJDK 23) makes of this:

    mov    r10d, dword ptr [rsi+0xc]      ; r10 = aData.length ; implicit exception: dispatches to npe0
    cmp    edx, r10d
    jae    uncommon0                      ; if (aIndex >= aData.length) goto UncommonTrapBlob
    movzx  r9d, byte ptr [rsi+rdx*1+0x10] ; r9d = aData[aIndex]
    lea    r11d, [rdx+0x3]                ; r11 = aIndex + 3
    cmp    r11d, r10d
    jae    uncommon1                      ; if (aIndex + 3 >= aData.length) goto UncommonTrapBlob
    movsxd r10, edx                       ; r10 = aIndex
    movzx  eax, byte ptr [rsi+r10*1+0x11] ; eax = aData[aIndex + 1]
    movsx  r8d, byte ptr [rsi+r10*1+0x13] ; r8d = aData[aIndex + 3]
    movzx  r10d, byte ptr [rsi+r10*1+0x12]; r10d = aData[aIndex + 2]
    shl    eax, 0x8                       ; eax = aData[aIndex + 1] << 8
    or     eax, r9d                       ; eax |= aData[aIndex];
    shl    r10d, 0x10                     ; r10d = aData[aIndex + 2] << 16
    or     eax, r10d                      ; eax |= (aData[aIndex + 2] << 16)
    shl    r8d, 0x18                      ; r8d = aData[aIndex + 3] << 24
    or     eax, r8d                       ; eax |= (aData[aIndex + 3] << 24)

Oh dear. It looks like this pattern isn't recognised by C2 at all. We've ended up with 4 reads from successive addresses and a whole slew of operations to assemble them together. I'd also poke at the register allocator's choice to use "higher"/REX-requiring registers while non-REX registers are still free, but this is a minor point in comparison. We also still have the unnecessary sign-extension of edx into r10.

I will also note the compiler's strange choice of movsx over movzx for the final load into r8d. This has no impact on correctness, since the subsequent shl by 24 discards the upper bits either way; I'm just personally curious as to why this instruction was selected.

In terms of latency, as seen below, the main delay (unsurprisingly) comes from the dependency chain involved in assembling the result. This is actually the snippet used in part 2 to introduce uiCA.

[uiCA pipeline diagram: readIntegerLE_normal (Rocket Lake)]

Why is using a ByteBuffer slow?

Now we're closing in on the surprising results (those being wrapping an array, and re-wrapping the ByteBuffer); we're not quite there yet, but that's coming up as soon as we've looked at the ByteBuffer implementation.

public static int readIntegerLE_ByteBuffer_stock(ByteBuffer aData, int aIndex) {
	return aData.getInt(aIndex); // Assumes aData set to little endian already
}
	mov    r10d, dword ptr [rsi+0x8]     ; implicit exception: dispatches to npe0
	cmp    r10d, 0x2b2dc8                ;   {metadata('java/nio/HeapByteBuffer')}
	jne    uncommon0
	mov    ebp, dword ptr [rsi+0x24]     ; ebp = aData.segment
	test   ebp, ebp
	jne    uncommon1                     ; if aData.segment != 0 goto uncommon1
	mov    r11d, dword ptr [rsi+0x30]    ; r11d = aData.hb
	mov    ebp, dword ptr [rsi+0x1c]     ; ebp = aData.limit
	add    ebp, 0xfffffffd               ; ebp = aData.limit - 3
	test   ebp, ebp
	jl     uncommon2                     ; if (aData.limit < 3) goto UncommonTrapBlob
	cmp    edx, ebp
	jae    uncommon3                     ; if (aIndex >= (aData.limit - 3)) goto UncommonTrapBlob
	mov    r10, qword ptr [rsi+0x10]     ; r10 = aData.address
	movzx  r8d, byte ptr [rsi+0x2d]      ; r8d = aData.bigEndian
	movsxd rbp, edx                      ; rbp = aIndex
	add    rbp, r10                      ; rbp = aData.address + aIndex
	test   r11d, r11d
	je     uncommon4                     ; if (aData.hb == null) goto UncommonTrapBlob
	lea    r10, [r12+r11*8]              ; r10 = aData.hb (decompressed)
	mov    eax, dword ptr [r10+rbp*1]    ; eax = aData.hb[aData.address + aIndex]
	test   r8d, r8d
	jne    big_endian
epilogue:
	; ...
big_endian:
	bswap  eax
	jmp    epilogue

At first glance this code may look a lot slower than the previous code, and arguably it is slower (15% slower than the Unsafe variant in this specific scenario), but importantly the critical path is not as long as it appears. The litany of checks this code performs adds to our total latency only by taking up issue slots; the checks don't form part of the dependency chain for our result. We can also see that, like our Unsafe variant at the start, the read happens via a single memory operation (after doing all these checks, ByteBuffer ends up using Unsafe to do the read).

I will still slightly nitpick this and say that a peephole optimisation was missed: the test ebp, ebp could be eliminated, as the preceding add already sets the flags for us (substituting js for jl if you wanted to completely mimic the original microarchitectural behaviour, depending on how aware the peephole optimiser was of the original intention), which would save a uop on architectures like more recent Zen versions.

[uiCA pipeline diagram: readIntegerLE_ByteBuffer_stock (Rocket Lake)]

Why is wrapping an array in a ByteBuffer not slow?

The short answer to the question posed above is that the JIT is able to promote the allocation to the stack, effectively eliminating it. Thanks to the inliner resolving all the method calls, we are able to "see" the ByteBuffer for the entirety of its life, which lets the JIT know the values of its fields and eliminate the branches doing the error checking. (If you want to confirm the allocation really is gone, JMH's GC profiler, -prof gc, should report an allocation rate of essentially zero bytes per operation for this benchmark.)

public static int readIntegerLE_wrapAndGet(byte[] aData, int aIndex) {
	return ByteBuffer.wrap(aData).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
}
mov    r11d, dword ptr [rsi+0xc]     ; implicit exception: dispatches to npe0
lea    ebp, [r11-0x3]
test   ebp, ebp
jl     uncommon0
cmp    edx, ebp
jae    uncommon1
mov    eax, dword ptr [rsi+rdx*1+0x10]
[uiCA pipeline diagram: readIntegerLE_wrapAndGet (Rocket Lake)]

Note that this really requires cooperation from the inliner. In a production setting, as your call stacks get deeper, it can come to pass that the inliner decides to stop inlining something in here, and suddenly your nice fast code will begin spewing temporaries everywhere and grind to a halt. Still, the code benchmarks nicely.
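
The same cliff appears whenever the temporary buffer escapes for any reason. A hypothetical variant (not from the article) that stores the buffer into a field is enough to defeat the allocation elimination, regardless of how well inlining goes:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EscapeExample { // hypothetical, for illustration only
    private ByteBuffer leaked;

    public int readIntegerLE_wrapAndGet_escaping(byte[] aData, int aIndex) {
        ByteBuffer b = ByteBuffer.wrap(aData).order(ByteOrder.LITTLE_ENDIAN);
        leaked = b; // the buffer escapes to the heap, so C2 must materialise the allocation
        return b.getInt(aIndex);
    }
}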

Ok, what about re-wrapping?

Re-wrapping the array from a ByteBuffer benefits similarly from the removal of all the extra branches and is nice and fast.

public static int readIntegerLE_ByteBuffer_rewrap(ByteBuffer aData, int aIndex) {
	return ByteBuffer.wrap(aData.array()).order(ByteOrder.LITTLE_ENDIAN).getInt(aIndex);
}
mov    r10d, dword ptr [rsi+0x30]    ; r10d = aData.array ; implicit exception: dispatches to npe0
mov    r8d, dword ptr [r12+r10*8+0xc]; r8d = aData.array.length ; implicit exception: dispatches to npe1
movzx  ebp, byte ptr [rsi+0x2c]      ; ebp = aData.isReadOnly
test   ebp, ebp
jne    uncommon0                     ; if (aData.isReadOnly) goto UncommonTrapBlob
lea    ebp, [r8-0x3]                 ; ebp = aData.array.length - 3
test   ebp, ebp
jl     uncommon1                     ; if (aData.array.length < 3) goto UncommonTrapBlob
cmp    edx, ebp
jae    uncommon2                     ; if (aIndex >= aData.array.length - 3) goto UncommonTrapBlob
shl    r10, 0x3                      ; r10 = aData.array (uncompressed)
mov    eax, dword ptr [r10+rdx*1+0x10] ; eax = *(int *) &aData.array[aIndex]
[uiCA pipeline diagram: readIntegerLE_ByteBuffer_rewrap (Rocket Lake)]

This time we've avoided sign-extending our index, but we've gained an unnecessary shl on the critical path, as well as some extra loads for having used a ByteBuffer to begin with.

What's with DONT_INLINE?

Choosing to use DONT_INLINE lets us focus on the benchmarked code more easily, rather than spending time nitpicking how the optimiser has chosen to unroll the benchmark loop. Here's a quick comparison of how my laptop performs with DONT_INLINE vs without; remember that which benchmark is "better" depends entirely on the context this code will live in within your production codebase. With DONT_INLINE, our array/buffer and index both come from parameters, whereas without it they come from constants (as the inliner can "see" what we're doing).

Benchmark      DONT_INLINE  Regular
BB rewrap      0.80 ns      0.564 ns
Unsafe         0.805 ns     0.312 ns
Wrap and Get   0.808 ns     0.339 ns
ByteBuffer     0.918 ns     0.708 ns
Normal         1.050 ns     0.610 ns
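
As an aside, the usual JMH alternative to DONT_INLINE for keeping inputs out of reach of constant folding is to read them from non-final @State fields rather than static finals. A minimal, hypothetical sketch (reusing the Article4 helper from earlier):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class Article4NonConstant { // hypothetical class name
    // Non-final instance fields: the JIT cannot treat these as compile-time constants,
    // so the benchmarked code still sees "real" inputs even when fully inlined.
    byte[] data = new byte[100];
    int index = 8;

    @Benchmark
    public int readIntegerLE_normal_inlined() {
        return Article4.readIntegerLE_normal(data, index);
    }
}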

Conclusion

Ok, so we've had a look at a few different ways to read an integer out of a byte[]. Overall the differences between the three "winning" methods are all within measurement error, so I'd consider them equally fast. If I were using this in normal production code, I'd choose the "normal" implementation or the "stock" ByteBuffer version, as these follow the principle of least surprise. If I were writing a performance-sensitive portion, I'd choose the Unsafe variant, as it is the most likely to perform reliably (it doesn't lean as heavily on the inliner being in a good mood; the others potentially introduce an allocation on the hot path that we may not want to trust the optimiser to always remove).

Benchmark      Time      Relative
BB rewrap      0.80 ns   130%
Unsafe         0.805 ns  130%
Wrap and Get   0.808 ns  130%
ByteBuffer     0.918 ns  114%
Normal         1.050 ns  100%

What's the takeaway otherwise? Well, I guess it's that the optimiser can do some pretty surprising and powerful things, though it can miss smaller micro-optimisations, partly because it lacks a peephole optimiser and partly because it sometimes cannot determine that a variable does not require sign extension.

Back to Series Overview
