Overly Enthusiastic

I like learning how stuff works.

A (not so) Brief Guide to HotSpot Assembly

This series focuses on analysing assembly output by HotSpot's C2 compiler, also known as "opto", and looking for headroom where performance can be further tuned to bring Java's performance closer to native code. The first article in this series — this article — will cover how to pop the hood on C2 to dump out the assembly it's producing and highlight a few HotSpot-specific parts that may be unfamiliar. To get the most out of this article, ideally the reader will already be familiar with x86–64 assembly.

Dumping Assembly

Dumping assembly from HotSpot requires that you add a library called hsdis to your installation. Luckily, Chris Newland provides pre-built binaries for all major platforms. I'll be using x64 Linux for this article (and will also pretend that 32–bit JVMs do not exist) as that's the most common server architecture I deal with. The overall takeaways from this series are applicable to non-x64 platforms, though the minutiae will, of course, vary.

Once you've got hsdis downloaded and put into your Java bin folder you can start dumping assembly right away, but you're likely to end up with more output than you're after unless you target specific classes or functions. To demonstrate dumping assembly we'll use the following sample program:

package com.overlyenthusiastic.examples;

import java.util.Random;

public class DumpAssembly {
    @SuppressWarnings("ResultOfMethodCallIgnored")
    public static void main(String[] aArgs) {
        final Random myRandom = new Random(1234);
        final String[] myStrings = new String[10];
        for (int i = 0; i < myStrings.length; i++) {
            myStrings[i] = String.valueOf(myRandom.nextInt());
        }

        for (int i = 0; i < 50_000; i++) {
            ExampleFunctions.stringLength(myStrings[myRandom.nextInt(myStrings.length)]);
        }
    }
}

package com.overlyenthusiastic.examples;

@SuppressWarnings("UnusedReturnValue")
public enum ExampleFunctions {
    ;

    public static int stringLength(String aString) {
        return aString.length();
    }
}

The way in which we use random data is largely not relevant, rather just that we call ExampleFunctions::stringLength at least -XX:Tier4CompileThreshold times (for my JDK, that defaults to 15000) to ensure it gets compiled.
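
To check the threshold on your own JDK, one option is to ask the running VM. The sketch below is a minimal example, assuming a HotSpot-based JDK that exposes com.sun.management.HotSpotDiagnosticMXBean. The class CompileThresholdProbe and its helper method are hypothetical names used only for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.OptionalLong;

// Hypothetical helper for illustration; not part of the article's sample project.
final class CompileThresholdProbe {
    /** Returns a numeric VM flag's current value, or empty if it cannot be read. */
    static OptionalLong readNumericFlag(String aFlagName) {
        try {
            final HotSpotDiagnosticMXBean myBean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (myBean == null) {
                return OptionalLong.empty();
            }
            return OptionalLong.of(Long.parseLong(myBean.getVMOption(aFlagName).getValue()));
        } catch (IllegalArgumentException e) {
            // Unknown flag, unsupported bean, or a non-numeric value
            // (NumberFormatException is a subclass of IllegalArgumentException).
            return OptionalLong.empty();
        }
    }

    public static void main(String[] aArgs) {
        System.out.println("Tier4CompileThreshold = " + readNumericFlag("Tier4CompileThreshold"));
    }
}
```

If the printed value is larger than the 50_000 iterations used in the sample program, the loop count would need raising for stringLength to reach C2.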

With this sample program, we'll dump the assembly by creating a compiler directives file like the one below. I called mine compiler_directives, but the name is unimportant as long as the same path is supplied on the command line (shown further down):

[
  {
    match: "com.overlyenthusiastic.examples.DumpAssembly::*",
    inline: [
      "-*::*"
    ]
  },
  {
    match: "com.overlyenthusiastic.examples.ExampleFunctions::*",
    c2: {
      PrintNMethods: true
    }
  }
]

This file does two things. It prevents inlining into our caller, which stops the optimiser from removing the unused return value and keeps the captured output clean, and it prints only the C2 compilations for our ExampleFunctions class. You can also print C1 output if you're interested in seeing it. Passing this file to the VM is fairly straightforward, though when you're dealing with more complex examples than this one it's important to select whichever GC is most relevant to you in production.

-XX:+UnlockDiagnosticVMOptions
-XX:CompilerDirectivesFile=./samples/src/main/resources/compiler_directives
-XX:PrintAssemblyOptions=intel
-XX:+UseG1GC

Capturing from Production vs Benchmarks

If you are interested in specific code from a production workload (for example, as highlighted by a profiler), it can be tempting to pull the code under test out into a benchmark or harness for study. However, type profile pollution, where common utilities (Objects.equal, Objects.hashCode, anything receiving different lambdas, etc.) or other classes receive many different classes of object, can lead to knock-on effects with the optimiser in real applications that will not always occur in a benchmark or toy program. For this reason, it is always preferable to capture from the system under test, at least initially, to ensure that whatever benchmark or harness you have closely matches production. The common benchmarking wisdom of running each benchmark in a fresh and "unpolluted" JVM can produce misleading findings compared with a more production-like setup for business-critical code.

As an example, consider a function that accepts a List. If in production it receives a mix of list types (new ArrayList, List.of(), List.of(x), List.of(x, x, x) and Arrays.asList are backed by several distinct implementation classes), it may fall into megamorphic patterns with virtual dispatch and no inlining for many of its invocations. A benchmark prepared for the function may pass a single type of List for all invocations and therefore not display behaviour indicative of production.
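
To see how many concrete classes sit behind those factories on a given JDK, a small census like the one below can help. This is only a sketch: ListTypeCensus is a hypothetical name, and the exact set of implementation classes is a JDK implementation detail that may differ between releases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
final class ListTypeCensus {
    /** Collects the concrete classes returned by a handful of common List factories. */
    static Set<Class<?>> distinctListClasses() {
        final Set<Class<?>> myClasses = new LinkedHashSet<>();
        myClasses.add(new ArrayList<String>().getClass());
        myClasses.add(List.of().getClass());
        myClasses.add(List.of("x").getClass());
        myClasses.add(List.of("x", "x", "x").getClass());
        myClasses.add(Arrays.asList("x", "x").getClass());
        return myClasses;
    }

    public static void main(String[] aArgs) {
        // A call site that sees more than two receiver classes is the kind of
        // site that can end up megamorphic.
        distinctListClasses().forEach(myClass -> System.out.println(myClass.getName()));
    }
}
```

A benchmark that only ever passes one of these classes would present the call site with a single receiver type, which is the mismatch with production described above.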

Keep an eye out for assembly that has been generated as part of an on-stack replacement (OSR) when collecting measurements, as such functions can be as confusing to the optimiser as they are to the reader. It is generally possible to disable this feature in real applications without loss of performance, though this is mostly because the feature is often not required once applications are fully warmed up. If OSR is causing you a problem when collecting measurements in production, it often means it is doing something useful for your application, and turning it off may alter performance.
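
As a rough illustration of the kind of code that tends to be compiled via on-stack replacement, consider a method that is invoked once but loops for a long time. The class below is a hypothetical sketch. Whether OSR actually triggers depends on the JVM and its thresholds, so treat it as something to experiment with (for example with -XX:+PrintCompilation, where OSR compilations are marked with a % sign, or with -XX:-UseOnStackReplacement to compare) rather than a guaranteed reproduction.

```java
// Hypothetical example for illustration only.
final class OsrCandidate {
    /** Sums the integers 0..aLimit-1 in one long-running loop. */
    static long sumBelow(int aLimit) {
        long mySum = 0;
        for (int i = 0; i < aLimit; i++) {
            mySum += i;
        }
        return mySum;
    }

    public static void main(String[] aArgs) {
        // Called exactly once: the method cannot reach an invocation-count threshold,
        // so any compiled code used for this call has to be entered mid-loop.
        System.out.println(sumBelow(200_000_000));
    }
}
```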

Initial Terminology

Object References, OOPs, cOOPs

A reference may also be called an "OOP", or "Ordinary Object Pointer", with a compressed reference being called a "cOOP" or "Compressed Ordinary Object Pointer".

When the heap size is less than 32 GB and it has not been explicitly disabled (with -XX:-UseCompressedOops), Java will store compressed references in the heap instead of the full value. This means that a field holding an object reference will take up 4 bytes instead of 8, resulting in lower memory usage for an application.

A compressed reference is just a pointer to a Java object with the lower 3 bits shifted away (as Java objects are always 8–byte aligned, these bits are always zero). The minimum object alignment (and therefore the shift value and maximum heap size for compressed references) is configurable on the command line (using -XX:ObjectAlignmentInBytes=8), but is typically not changed. This series will always assume that compressed oops are in use (as this is the more "complex" scenario) and that the default object alignment is being used. You can read more about compressed references here.
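
The arithmetic can be sketched in a few lines of Java. This is a simplified model with hypothetical names (CompressedOopMath and its methods) that ignores null handling. The sample values are taken from the setStringField listing later in this article, where the compressed constant 0xe1337efb corresponds to the object address 0x7099bf7d8 with a heap base of 0 and a shift of 3.

```java
// Hypothetical model of compressed-reference arithmetic, for illustration only.
final class CompressedOopMath {
    /** Expands a 32-bit compressed reference into a full address. */
    static long decode(int aCompressed, long aHeapBase, int aShift) {
        return aHeapBase + (Integer.toUnsignedLong(aCompressed) << aShift);
    }

    /** Compresses an aligned address into 32 bits. */
    static int encode(long aAddress, long aHeapBase, int aShift) {
        return (int) ((aAddress - aHeapBase) >>> aShift);
    }

    /** Largest heap addressable with a 32-bit reference and the given shift. */
    static long maxHeapBytes(int aShift) {
        return (1L << 32) << aShift;
    }

    public static void main(String[] aArgs) {
        System.out.println(Long.toHexString(decode(0xe1337efb, 0L, 3)));
        System.out.println(maxHeapBytes(3) / (1024L * 1024 * 1024) + " GB");
    }
}
```

With a shift of 3 the maximum works out to 32 GB, which is where the heap-size limit mentioned above comes from.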

Object Header

Every object in the Java heap begins with an 8–byte "mark word", followed by a reference to a structure describing the object type: the "klass word" (sometimes just "klass" or "class"). The klass word is either 4 or 8 bytes, depending on whether -XX:+UseCompressedClassPointers is enabled (it is enabled by default when compressed OOPs are in use, and older JDK releases could not enable it when compressed OOPs were disabled).

For the purposes of this series, the layout of an object header is unimportant (though we will assume 12–byte headers). Note that the size of the header shrinks and the layout changes with Lilliput, but in general this difference will not be relevant for our analyses.

Reading HotSpot Assembly

Given our stringLength function from before:

public static int stringLength(String aString) {
    return aString.length();
}

And a class layout diagram for java.lang.String (note that in HotSpot parlance, class is often spelled klass or klazz):

java/lang/String
  Offset       Contents
  0x00-0x03    Mark Word
  0x04-0x07    Mark Word (cont)
  0x08-0x0B    Compressed Klass Pointer
  0x0C-0x0F    byte[] value
  0x10-0x13    int hash
  0x14         byte coder
  0x15         boolean hashIsZero
  0x16-0x17    <padding / leftover space>

The raw assembly for this looks like this:

----------------------------------- Assembly -----------------------------------

Compiled method (c2) 289  218       4       com.overlyenthusiastic.examples.ExampleFunctions::stringLength (5 bytes)
 total in heap  [0x00007f36a00d8d88,0x00007f36a00d8f40] = 440
 relocation     [0x00007f36a00d8e60,0x00007f36a00d8e78] = 24
 main code      [0x00007f36a00d8e80,0x00007f36a00d8f08] = 136
 stub code      [0x00007f36a00d8f08,0x00007f36a00d8f20] = 24
 oops           [0x00007f36a00d8f20,0x00007f36a00d8f28] = 8
 metadata       [0x00007f36a00d8f28,0x00007f36a00d8f40] = 24
 immutable data [0x00007f3610079000,0x00007f3610079080] = 128
 dependencies   [0x00007f3610079000,0x00007f3610079008] = 8
 nul chk table  [0x00007f3610079008,0x00007f3610079020] = 24
 scopes pcs     [0x00007f3610079020,0x00007f3610079060] = 64
 scopes data    [0x00007f3610079060,0x00007f3610079080] = 32

[Disassembly]
--------------------------------------------------------------------------------
[Constant Pool (empty)]

--------------------------------------------------------------------------------

[Verified Entry Point]
  # {method} {0x00007f3630407f68} 'stringLength' '(Ljava/lang/String;)I' in 'com/overlyenthusiastic/examples/ExampleFunctions'
  # parm0:    rsi:rsi   = 'java/lang/String'
  #           [sp+0x20]  (sp of caller)
  0x00007f36a00d8e80:   mov    DWORD PTR [rsp-0x14000],eax
  0x00007f36a00d8e87:   push   rbp
  0x00007f36a00d8e88:   sub    rsp,0x10
  0x00007f36a00d8e8c:   cmp    DWORD PTR [r15+0x20],0x1
  0x00007f36a00d8e94:   jne    0x00007f36a00d8efe
  0x00007f36a00d8e9a:   mov    r11d,DWORD PTR [rsi+0x14]    ; implicit exception: dispatches to 0x00007f36a00d8ec0
  0x00007f36a00d8e9e:   mov    r10d,DWORD PTR [r12+r11*8+0xc]; implicit exception: dispatches to 0x00007f36a00d8ed4
  0x00007f36a00d8ea3:   movsx  r8d,BYTE PTR [rsi+0x10]
  0x00007f36a00d8ea8:   sarx   eax,r10d,r8d
  0x00007f36a00d8ead:   add    rsp,0x10
  0x00007f36a00d8eb1:   pop    rbp
  0x00007f36a00d8eb2:   cmp    rsp,QWORD PTR [r15+0x448]    ;   {poll_return}
  0x00007f36a00d8eb9:   ja     0x00007f36a00d8ee8
  0x00007f36a00d8ebf:   ret    
  0x00007f36a00d8ec0:   mov    esi,0xfffffff6
  0x00007f36a00d8ec5:   xchg   ax,ax
  0x00007f36a00d8ec7:   call   0x00007f369fb87b60           ; ImmutableOopMap {}
                                                            ;*invokevirtual length {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.overlyenthusiastic.examples.ExampleFunctions::stringLength@1 (line 8)
                                                            ;   {runtime_call UncommonTrapBlob}
  0x00007f36a00d8ecc:   nop    DWORD PTR [rax+rax*1+0x144]  ;   {other}
  0x00007f36a00d8ed4:   mov    esi,0xfffffff6
  0x00007f36a00d8ed9:   xchg   ax,ax
  0x00007f36a00d8edb:   call   0x00007f369fb87b60           ; ImmutableOopMap {}
                                                            ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.String::length@4 (line 1593)
                                                            ; - com.overlyenthusiastic.examples.ExampleFunctions::stringLength@1 (line 8)
                                                            ;   {runtime_call UncommonTrapBlob}
  0x00007f36a00d8ee0:   nop    DWORD PTR [rax+rax*1+0x1000158];   {other}
  0x00007f36a00d8ee8:   movabs r10,0x7f36a00d8eb2           ;   {internal_word}
  0x00007f36a00d8ef2:   mov    QWORD PTR [r15+0x460],r10
  0x00007f36a00d8ef9:   jmp    0x00007f369fb88be0           ;   {runtime_call SafepointBlob}
  0x00007f36a00d8efe:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x00007f36a00d8f03:   jmp    0x00007f36a00d8e9a
[Exception Handler]
  0x00007f36a00d8f08:   jmp    0x00007f369fc4e8e0           ;   {no_reloc}
[Deopt Handler Code]
  0x00007f36a00d8f0d:   call   0x00007f36a00d8f12
  0x00007f36a00d8f12:   sub    QWORD PTR [rsp],0x5
  0x00007f36a00d8f17:   jmp    0x00007f369fb87e80           ;   {runtime_call DeoptimizationBlob}
  0x00007f36a00d8f1c:   hlt    
  0x00007f36a00d8f1d:   hlt    
  0x00007f36a00d8f1e:   hlt    
  0x00007f36a00d8f1f:   hlt    
--------------------------------------------------------------------------------
[/Disassembly]

Future articles in this series will look at cleaned & more heavily annotated assembly, but here we will walk through the raw output and discuss the method prologue/epilogue, identify the safepoint code, and discuss thread-local and heap access.

Stack Banging

Execution for a compiled HotSpot method starts at the [Verified Entry Point]. The first instruction of this method, mov DWORD PTR [rsp-0x14000], eax, writes a dud value a fair way down the stack in order to verify that at least that much space is still available to use. If this access turns out to be invalid, the runtime will catch the resulting signal and issue a StackOverflowError for the current thread. You can read more about this here. This instruction is unrelated to our user code and generally speaking is unimportant from a performance point-of-view as no other instructions depend on it.
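
From the Java side, the only visible effect of all this is that running out of stack surfaces as an ordinary throwable. The sketch below (StackDepthProbe is a hypothetical name) recurses until the runtime raises StackOverflowError and also records the bang distance from the listing. The reading of 0x14000 as 20 pages assumes 4 KiB pages, which is an assumption about the platform rather than something the dump states.

```java
// Hypothetical probe for illustration only.
final class StackDepthProbe {
    /** Distance below rsp touched by the stack bang in the listing above. */
    static final int BANG_DISTANCE = 0x14000; // 81920 bytes, i.e. 20 pages if pages are 4 KiB

    private static int theDepth;

    private static void recurse() {
        theDepth++;
        recurse();
    }

    /** Recurses until the runtime reports stack exhaustion and returns the depth reached. */
    static int depthAtOverflow() {
        theDepth = 0;
        try {
            recurse();
        } catch (StackOverflowError e) {
            // Raised by the runtime once the guarded part of the stack is touched.
        }
        return theDepth;
    }

    public static void main(String[] aArgs) {
        System.out.println("Overflowed after " + depthAtOverflow() + " frames");
    }
}
```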

Prologue

Below, we see the stack being prepared as well as an "nmethod entry barrier" (nmethod is just HotSpot speak for a "native" Java method, i.e. one that's been compiled). This barrier is used when the runtime needs to patch embedded object references (direct references to heap objects in the machine code, which may be compressed or "raw") during a safepoint. Once execution resumes, threads that encounter the barrier and need to patch embedded object pointers can flush their instruction caches to ensure they don't see a stale object reference.

; Create space for our stack
  0x00007f36a00d8e87:   push   rbp
  0x00007f36a00d8e88:   sub    rsp,0x10
; Read the entry barrier flag
  0x00007f36a00d8e8c:   cmp    DWORD PTR [r15+0x20],0x1
  0x00007f36a00d8e94:   jne    0x00007f36a00d8efe
  0x00007f36a00d8e9a:   ; ... actual function start ...
; ... ...
; ... ...
; ... ...
  0x00007f36a00d8ef9:   jmp    0x00007f369fb88be0           ;   {runtime_call SafepointBlob}
  0x00007f36a00d8efe:   call   Stub::nmethod_entry_barrier

The r15 register always holds a pointer to the current thread's information, the structure of which is found here and here, among other places. In the JDK version used above, the offset 0x20 holds the field Thread::_nmethod_disarmed_guard_value (in future, more annotated examples, this information will be supplied in-line).

Epilogue

The end of a Java function starts with a fairly standard-looking stack adjustment, but continues with another read through the r15 register, labelled poll_return. This read checks whether this thread has been asked to safepoint and, if a request is pending, jumps to a small piece of code that responds to it. As we head off into the safepoint code, we save the address of the poll instruction (0x00007f36a00d8eb2 here) into [r15+0x460] (this is the field JavaThread::_saved_exception_pc) so the runtime knows where it came from and where it needs to go back to.

  0x00007f36a00d8ead:   add    rsp,0x10
  0x00007f36a00d8eb1:   pop    rbp
  0x00007f36a00d8eb2:   cmp    rsp,QWORD PTR [r15+0x448]    ;   {poll_return}
  0x00007f36a00d8eb9:   ja     0x00007f36a00d8ee8
  0x00007f36a00d8ebf:   ret   
  ...
  0x00007f36a00d8ee8:   movabs r10,0x7f36a00d8eb2           ;   {internal_word}
  0x00007f36a00d8ef2:   mov    QWORD PTR [r15+0x460],r10
  0x00007f36a00d8ef9:   jmp    0x00007f369fb88be0           ;   {runtime_call SafepointBlob}

Implicit null-pointer checks

Moving on to the user portion of the method, you may have noticed an "implicit exception" annotation on the first instruction of this section. This instruction loads the value field of the String, and does so without an explicit null check. As part of compilation, the JIT creates a look-up table for the runtime that basically says "if you crash with a null-pointer here, don't actually crash, just transfer execution over here". This is an overhead-free (when you're not throwing) way to implement null checks, and it moves the landing pad / handler code far away from the hotter code, which is more cache friendly. The handler loads a reason code into the esi register before jumping into the next part we'll talk about, the uncommon trap blob.

  0x00007f36a00d8e9a:   mov    r11d,DWORD PTR [rsi+0x14]    ; implicit exception: dispatches to 0x00007f36a00d8ec0
  ; ... ... ...
  ; ... ... ...
  0x00007f36a00d8ec0:   mov    esi,0xfffffff6
  0x00007f36a00d8ec5:   xchg   ax,ax
  0x00007f36a00d8ec7:   call   0x00007f369fb87b60           ; ImmutableOopMap {}
                                                            ;*invokevirtual length {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.overlyenthusiastic.examples.ExampleFunctions::stringLength@1 (line 8)
                                                            ;   {runtime_call UncommonTrapBlob}
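
None of this machinery changes what the Java program observes: a null argument still produces a NullPointerException. A minimal sketch (ImplicitNullCheckDemo is a hypothetical name, with stringLength copied from the sample) shows the behaviour the compiled code has to preserve.

```java
// Hypothetical demo for illustration only.
final class ImplicitNullCheckDemo {
    /** Same body as ExampleFunctions::stringLength from the sample program. */
    static int stringLength(String aString) {
        return aString.length();
    }

    /** Reports whether calling stringLength with the given argument throws an NPE. */
    static boolean throwsNpe(String aString) {
        try {
            stringLength(aString);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] aArgs) {
        System.out.println(stringLength("abc"));
        System.out.println(throwsNpe(null));
    }
}
```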

Uncommon Trap Blobs

The null-pointer check above doesn't actually end with the JIT'd code throwing a NullPointerException; instead it ends with a call into UncommonTrapBlob. If a NullPointerException was thrown here during the method's warm-up, the JIT will instead emit an explicit null check followed by an explicit runtime call to throw the NullPointerException. Calls to UncommonTrapBlob are fairly common and indicate that the JIT considers this path unlikely to execute, and so has not actually generated code to handle it. Instead, it simply fills in a small reason code and transfers control to the runtime to handle the uncommon operation, which may result in de-optimisation of the current method (the JIT'd code may be discarded) for re-profiling.
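
The reason code loaded into esi above (0xfffffff6) can be unpacked with a little bit-twiddling. The sketch below is an assumption-heavy illustration: the bit widths (3 bits of action, 5 bits of reason, stored bitwise-inverted) reflect my reading of HotSpot's deoptimization.hpp and may differ between JDK versions, and TrapRequestDecoder is a hypothetical name. Under those assumptions, both indices come out as 1, which I believe correspond to a null_check reason and a maybe_recompile action.

```java
// Hypothetical decoder; the bit layout is an assumption, not taken from the article.
final class TrapRequestDecoder {
    static final int ACTION_BITS = 3; // assumed width of the action field
    static final int REASON_BITS = 5; // assumed width of the reason field

    /** Extracts the action index from a negative (inverted) trap request. */
    static int action(int aTrapRequest) {
        return ~aTrapRequest & ((1 << ACTION_BITS) - 1);
    }

    /** Extracts the reason index from a negative (inverted) trap request. */
    static int reason(int aTrapRequest) {
        return (~aTrapRequest >>> ACTION_BITS) & ((1 << REASON_BITS) - 1);
    }

    public static void main(String[] aArgs) {
        final int myRequest = 0xfffffff6; // the value moved into esi in the listing
        System.out.println("reason index = " + reason(myRequest));
        System.out.println("action index = " + action(myRequest));
    }
}
```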

Heap Access

  0x00007f36a00d8e9a:   mov    r11d,DWORD PTR [rsi+0x14]    ; implicit exception: dispatches to 0x00007f36a00d8ec0
  0x00007f36a00d8e9e:   mov    r10d,DWORD PTR [r12+r11*8+0xc]; implicit exception: dispatches to 0x00007f36a00d8ed4

Refocusing on the user code once more, these two memory accesses demonstrate the two different ways memory is accessed in Java. The first has rsi holding a raw pointer to an object, which can be directly accessed to load the 4–byte compressed reference from the string. The second shows how we read through the compressed reference. The r12 register holds the address of the heap base (most commonly 0), while r11*8 decompresses the reference into an offset into the heap, with 0xC being the offset of the length field in a byte[]. These two lines together therefore represent the inlined operation aString.value.length. The JVM will prefer to place the heap base at 0, but will move it if it cannot be placed there due to platform restrictions or if it is requested to place it elsewhere on the command line.

Though the heap base cannot move at runtime, and the value will commonly be 0, this is only partially relied on by the JIT. The register will still be included in address calculations (even when not required) and is unavailable for allocation, though the JIT will sometimes use it as a "zero" register, e.g. for zeroing newly allocated memory.
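
The full listing also contains a movsx of a single byte from the String followed by a sarx, which the two loads above feed into. These correspond to the rest of String.length(), which in the JDK source is value.length >> coder(). The model below is a sketch with hypothetical names (StringLengthModel and its constants) that mirrors that calculation on plain byte arrays.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical model of String.length() for illustration only.
final class StringLengthModel {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char

    /** Mirrors value.length >> coder: the array length in bytes shifted by the coder. */
    static int length(byte[] aValue, byte aCoder) {
        return aValue.length >> aCoder;
    }

    public static void main(String[] aArgs) {
        final byte[] myLatin1 = "abc".getBytes(StandardCharsets.ISO_8859_1);
        final byte[] myUtf16 = "abc".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(length(myLatin1, LATIN1));
        System.out.println(length(myUtf16, UTF16));
    }
}
```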

Field Assignment

For our second example, let's check out what it's like to assign to a field. We'll only be briefly touching on this so that the reader is familiar with what may be seen in larger examples later in the series.

public static void setPrimitiveFields(Foo aFoo) {
    aFoo.theField = 123;
    aFoo.theDoubleField = 43.23;
}

public static void setStringField(Foo aFoo) {
    aFoo.theStringField = "foosball";
}

public static void setBoxedPrimitiveField(Foo aFoo) {
    aFoo.theBoxedInteger = 3;
}

public static class Foo {
    public int theField;
    public double theDoubleField;
    public String theStringField;
    public Integer theBoxedInteger;
}

com/overlyenthusiastic/examples/Foo
  Offset       Contents
  0x00-0x07    Mark Word
  0x08-0x0B    Compressed Klazz Pointer
  0x0C-0x0F    int theField
  0x10-0x17    double theDoubleField
  0x18-0x1B    String theStringField
  0x1C-0x1F    Integer theBoxedInteger

Starting with setPrimitiveFields, we get the following code (displayed here cleaned & annotated — as all future examples in the series will be):

constant_pool:
    0x3d_0a_d7_a3_70_9d_45_40             ; 43.23 as a `double`
; [Verified Entry Point]
;  {method} 'setPrimitiveFields' '(Lcom/overlyenthusiastic/examples/ExampleFunctions$Foo;)V'
;  parm0:    rsi:rsi   = 'com/overlyenthusiastic/examples/ExampleFunctions$Foo'
;            [sp+0x20]  (sp of caller)
    mov    dword ptr [rsp-0x14000], eax
    push   rbp
    sub    rsp, 0x10
    cmp    dword ptr [r15+0x20], 0x1      ; Thread::_nmethod_disarmed_guard_value
    jne    entry_barrier
method_start:
    mov    dword ptr [rsi+0xc], 0x7b      ; `aFoo.theField` = 123 ; implicit exception dispatches to @npe0
    vmovsd xmm0, qword ptr [rip+0xb7]     ; xmm0 = 43.23 (via rip-relative reference to @constant_pool)
    vmovsd qword ptr [rsi+0x10], xmm0     ; `aFoo.theDoubleField` = xmm0
    add    rsp, 0x10
    pop    rbp
safepoint0_check:
    cmp    rsp, qword ptr [r15+0x448]     ; JavaThread::_poll_data::_polling_word
    ja     safepoint0
    ret
npe0:
    mov    esi, 0xfffffff6
    call   Stub::UncommonTrapBlob
safepoint0:
    movabs r10, safepoint0_check
    mov    qword ptr [r15+0x460], r10     ; JavaThread::_saved_exception_pc
    jmp    Stub::SafepointBlob
entry_barrier:
    call   Stub::nmethod_entry_barrier
    jmp    method_start
; [Exception Handler]
    jmp    Stub::ExceptionHandler
; [Deopt Handler Code]
    call   deopt
deopt:
    sub    qword ptr [rsp], 0x5
    jmp    Stub::DeoptimizationBlob
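
The two constants in the listing above can be reproduced from Java, which is a handy sanity check when matching assembly back to source. The sketch below (ConstantEncoding is a hypothetical name) prints the bytes of a double in little-endian order, using the same underscore-separated form as the constant_pool line, alongside the hex form of 123.

```java
// Hypothetical helper for illustration only.
final class ConstantEncoding {
    /** Renders the IEEE 754 bits of a double as little-endian bytes, e.g. 0x3d_0a_... */
    static String littleEndianBytes(double aValue) {
        final long myBits = Double.doubleToLongBits(aValue);
        final StringBuilder myResult = new StringBuilder("0x");
        for (int i = 0; i < 8; i++) {
            if (i > 0) {
                myResult.append('_');
            }
            final int myByte = (int) ((myBits >>> (8 * i)) & 0xFF);
            myResult.append(Character.forDigit(myByte >>> 4, 16));
            myResult.append(Character.forDigit(myByte & 0xF, 16));
        }
        return myResult.toString();
    }

    public static void main(String[] aArgs) {
        System.out.println(littleEndianBytes(43.23));
        System.out.println("0x" + Integer.toHexString(123));
    }
}
```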

As before, we have a stack-check, a prologue, an entry barrier, an implicit null check, an epilogue, and a safepoint. The "active" portion of the function is simply:

    mov    dword ptr [rsi+0xc], 0x7b      ; implicit exception: dispatches to npe0
    vmovsd xmm0, qword ptr [rip+0xb7]     ; rip-relative reference to constant_pool
    vmovsd qword ptr [rsi+0x10], xmm0

This contains our first reference to a constant pool, created alongside the code, in which our double value is held. There isn't anything terribly interesting going on here, so let's move right along to what happens when we try to assign our String field:

public static void setStringField(Foo aFoo) {
    aFoo.theStringField = "foosball";
}

This produces the following assembly:

; [Verified Entry Point]
; {method} 'setStringField' '(Lcom/overlyenthusiastic/examples/ExampleFunctions$Foo;)V'
;  parm0:    rsi:rsi   = 'com/overlyenthusiastic/examples/ExampleFunctions$Foo'
;            [sp+0x20]  (sp of caller)
    mov    dword ptr [rsp-0x14000], eax
    push   rbp
    sub    rsp, 0x10
    cmp    dword ptr [r15+0x20], 0x1      ; Thread::_nmethod_disarmed_guard_value
    jne    entry_barrier
method_start:
    mov    rbx, rsi
    test   rsi, rsi
    je     npe0                           ; if (aFoo == null) goto npe0
    cmp    byte ptr [r15+0x38], 0x0       ; Thread::_gc_data::_satb_mark_queue::_active
    jne    satb_marking                   ; if (satb_marking_active) goto satb_marking
satb_marking_finished:
    mov    dword ptr [rbx+0x18], 0xe1337efb ; aFoo.theStringField = "foosball" {coop("foosball"{0x00000007099bf7d8})}
    mov    r10, rbx                       ; r10 = aFoo
    movabs r11, 0x7099bf7d8               ;   {oop("foosball"{0x00000007099bf7d8})}
    xor    r11, r10
    shr    r11, 0x15
    test   r11, r11                       ; Is aFoo in the same GC 2MB region as constant?
    je     epilogue
    shr    r10, 0x9                       ; r10 = aFoo / 512
    movabs rdi, 0x7fefaa944000
    add    rdi, r10
    cmp    byte ptr [rdi], 0x2
    jne    card_marking                   ; if cardTable[aFoo / 512] != 2, goto card_marking
epilogue:
    add    rsp, 0x10
    pop    rbp
safepoint0_check:
    cmp    rsp, qword ptr [r15+0x448]     ; JavaThread::_poll_data::_polling_word
    ja     safepoint0
    ret
satb_marking:
    mov    r11d, dword ptr [rsi+0x18]     ; r11 = aFoo.theStringField
    test   r11d, r11d
    je     satb_marking_finished          ; If aFoo.theStringField == null, do nothing
    mov    r10, qword ptr [r15+0x28]      ; Thread::_gc_data::_satb_mark_queue::_index
    mov    rdi, r11                       ; rdi = aFoo.theStringField (coop)
    shl    rdi, 0x3                       ; rdi = aFoo.theStringField (oop)
    test   r10, r10
    je     satb_marking_refill
    mov    r11, qword ptr [r15+0x30]      ; Thread::_gc_data::_satb_mark_queue::_buf
    mov    qword ptr [r11+r10*1-0x8], rdi ; add old value to mark queue
    add    r10, 0xfffffffffffffff8        ; r10 -= 8;
    mov    qword ptr [r15+0x28], r10      ; Thread::_gc_data::_satb_mark_queue::_index
    jmp    satb_marking_finished
card_marking:
    mov    r10, qword ptr [r15+0x40]      ; Thread::_gc_data::_dirty_card_queue::_index
    mov    r11, qword ptr [r15+0x48]      ; Thread::_gc_data::_dirty_card_queue::_buf
    lock add dword ptr [rsp-0x40], 0x0
    cmp    byte ptr [rdi], 0x0
    je     epilogue
    mov    byte ptr [rdi], r12b
    test   r10, r10
    jne    card_marking_no_refill
    mov    rsi, r15
    movabs r10, qword G1BarrierSetRuntime::write_ref_field_post_entry
    call   r10
    jmp    epilogue
card_marking_no_refill:
    mov    qword ptr [r11+r10*1-0x8], rdi
    add    r10, 0xfffffffffffffff8
    mov    qword ptr [r15+0x40], r10      ; Thread::_gc_data::_dirty_card_queue::_index
    jmp    epilogue
npe0:
    mov    esi, 0xfffffff6
    call   Stub::UncommonTrapBlob
satb_marking_refill:
    mov    rsi, r15
    movabs r10, qword G1BarrierSetRuntime::write_ref_field_pre_entry
    call   r10
    jmp    satb_marking_finished
safepoint0:
    movabs r10, qword safepoint0_check
    mov    qword ptr [r15+0x460], r10     ; JavaThread::_saved_exception_pc
    jmp    Stub::SafepointBlob
entry_barrier:
    call   Stub::nmethod_entry_barrier
    jmp    method_start
; [Exception Handler]
    jmp    Stub::ExceptionHandler
; [Deopt Handler Code]
    call   deopt
deopt:
    sub    qword ptr [rsp], 0x5
    jmp    Stub::DeoptimizationBlob

There's a surprisingly large amount of code here and a lot of branches. Fortunately, these will often be fairly predictable and much of the code won't actually run except during specific GC phases. What we're seeing here is a write barrier consisting of pre-write SATB marking and a post-write card table update. Collectors other than G1 may have less (or no) work to do during field updates (CMS/Epsilon), or may have even slower field updates (like ZGC) because they require more complex barriers. We'll briefly touch on these, but as we are focusing on the JIT rather than the GC we'll largely assume these are overheads and ignore them.

SATB Marking

Before a stored reference can be overwritten, depending on the phase of the collector, we may need to save the reference to a SATB queue. SATB is short for "snapshot-at-the-beginning", and is used to ensure that application writes that occur concurrently with GC marking don't result in objects erroneously being marked as dead. The GC is trying to create an atomic "snapshot" of what was live at the start of the marking process and must therefore save overwritten references to a look-aside buffer so they are not missed.
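
In Java-flavoured pseudocode, the pre-write part of the barrier in the listing behaves roughly like the sketch below. It is a simplified model under stated assumptions: SatbBarrierSketch, its queue and its flag are hypothetical stand-ins for the thread-local SATB state read through r15, and the refill path that calls into the runtime is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a SATB pre-write barrier, for illustration only.
final class SatbBarrierSketch {
    static final class Foo {
        String theStringField;
    }

    private boolean theMarkingActive;                          // stands in for the _active flag
    private final Deque<Object> theMarkQueue = new ArrayDeque<>(); // stands in for the mark queue

    void setMarkingActive(boolean aActive) {
        theMarkingActive = aActive;
    }

    int queuedCount() {
        return theMarkQueue.size();
    }

    /** Stores into the field, first recording the old value if marking is in progress. */
    void storeStringField(Foo aFoo, String aNewValue) {
        if (theMarkingActive) {
            final String myOldValue = aFoo.theStringField;
            if (myOldValue != null) {
                theMarkQueue.push(myOldValue); // keep the overwritten reference visible to the GC
            }
        }
        aFoo.theStringField = aNewValue;
    }
}
```

Outside of marking, the only cost on this path is the flag check, which matches the single cmp/jne pair at the top of the method body in the listing.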

You can check out the source that shows this barrier being emitted on GitHub. As this series focuses on the C2 JIT rather than the GC, we won't dive deeply into this assembly here. Primarily, this series will assume barriers are just a price we must pay for the conveniences Java provides, and will only bring them up again where the JIT has an opportunity to omit such barrier code. When measuring performance impact we will assume the barrier is always the cheapest kind it may be.

Card Marking

In addition to SATB marking, "dirty cards" are used after a field has been written to indicate which potentially old regions may reference young regions. As before, the source can be found on GitHub and this series will primarily ignore the contents of these barriers and only comment when the barrier could have been omitted but wasn't. Again, when measuring performance we will assume the fast-path is always taken.
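
The address arithmetic in the post-write barrier from the listing can be written out as follows. This is a sketch with hypothetical names (CardTableMath and its methods). The shift values are read directly from the listing (shr r11, 0x15 and shr r10, 0x9) and so reflect that particular run, since the region size in particular can vary with heap configuration.

```java
// Hypothetical model of the post-write barrier's address arithmetic, for illustration only.
final class CardTableMath {
    static final int REGION_SHIFT = 0x15; // from `shr r11, 0x15`: 2 MB regions in this run
    static final int CARD_SHIFT = 0x9;    // from `shr r10, 0x9`: 512-byte cards

    /** True when both addresses fall in the same region, so no card needs marking. */
    static boolean sameRegion(long aFieldHolder, long aNewValue) {
        return ((aFieldHolder ^ aNewValue) >>> REGION_SHIFT) == 0;
    }

    /** Index of the card covering the given address. */
    static long cardIndex(long aAddress) {
        return aAddress >>> CARD_SHIFT;
    }

    public static void main(String[] aArgs) {
        final long myConstant = 0x7099bf7d8L; // address of the "foosball" constant in the listing
        System.out.println(sameRegion(myConstant, myConstant + 8));
        System.out.println(cardIndex(myConstant));
    }
}
```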

Object Allocation

To finish up, let's briefly look at object allocation in HotSpot. We'll use the simplest case I can think of: auto-boxing a double (auto-boxing an int will result in the integer cache being used, and the same goes for long).

public static Double createDouble(double aValue) {
    return aValue;
}

The assembly for this looks as follows:

; [Verified Entry Point]
;  {method} 'createDouble' '(D)Ljava/lang/Double;' in 'com/overlyenthusiastic/examples/ExampleFunctions'
;  parm0:    xmm0:xmm0   = double
;            [sp+0x20]  (sp of caller)
    mov    dword ptr [rsp-0x14000], eax
    push   rbp
    sub    rsp, 0x10
    cmp    dword ptr [r15+0x20], 0x1      ; Thread::_nmethod_disarmed_guard_value
    jne    entry_barrier
method_start:
    vmovq  rbp, xmm0                      ; stash `aValue` in callee saved register
    mov    rax, qword ptr [r15+0x1b8]     ; Thread::_tlab::_top
    mov    r10, rax
    add    r10, 0x18
    cmp    r10, qword ptr [r15+0x1c8]     ; Thread::_tlab::_end
    jae    tlab_allocation_failure        ; If this happens, `aValue` will be preserved in rbp
    mov    qword ptr [r15+0x1b8], r10     ; Thread::_tlab::_top
    prefetchw byte ptr [r10+0x100]
    mov    qword ptr [rax], 0x1           ; mark word
    mov    dword ptr [rax+0x8], 0x18cd48  ; compressed class ptr {metadata('java/lang/Double')}
    mov    dword ptr [rax+0xc], r12d      ; zero out the 4 bytes of padding
allocation_success:
    vmovq  xmm0, rbp
    vmovsd qword ptr [rax+0x10], xmm0     ; Set the boxed value
    add    rsp, 0x10
    pop    rbp
safepoint0_check:
    cmp    rsp, qword ptr [r15+0x448]     ; JavaThread::_poll_data::_polling_word
    ja     safepoint0
    ret
tlab_allocation_failure:
    movabs rsi, 0x7fd6f618cd48            ;   {metadata('java/lang/Double')}
    call   Runtime::newInstance           ; ImmutableOopMap {}
                                          ;*new {reexecute=0 rethrow=0 return_oop=1}
                                          ; - java.lang.Double::valueOf@0 (line 924)
                                          ;   {runtime_call _new_instance_Java}
    jmp    allocation_success
allocation_failure:
    mov    rsi, rax
    add    rsp, 0x10
    pop    rbp
    jmp    Stub::RethrowJava
safepoint0:
    movabs r10, qword safepoint0_check
    mov    qword ptr [r15+0x460], r10     ; JavaThread::_saved_exception_pc
    jmp    Stub::SafepointBlob
entry_barrier:
    call   Stub::nmethod_entry_barrier
    jmp    method_start
; [Exception Handler]
    jmp    Stub::ExceptionHandler
; [Deopt Handler Code]
    call   deopt
deopt:
    sub    qword ptr [rsp], 0x5
    jmp    Stub::DeoptimizationBlob

TLAB Allocation

The TLAB, or Thread-Local Allocation-Buffer, is a thread-local contiguous buffer of free memory that the current thread can use for allocations by simply doing a pointer bump. I won't cover TLAB Allocations in great detail, as others have already covered them before. The only time we'll talk about them is when we could coalesce multiple allocations into a single TLAB allocation or avoid zero-ing parts of an allocation where we have a guaranteed write later. Note that there is no communication with the runtime (beyond modification of the TLAB pointers) during a TLAB allocation.

We can see 3 main parts of an allocation:

  • TLAB allocation attempt
  • If successful, object header initialisation (mark/class word) followed by zero-ing of any parts of the object the JIT can't definitely see it will initialise otherwise (this includes zero-ing array bodies in some cases)
  • If the TLAB allocation fails, a fall-back call to the runtime that will refill the TLAB and allocate our object.

GC-related allocation stalls (e.g. a GC triggered by memory pressure) will typically only occur after a failed TLAB allocation. For most allocations, especially small allocations, it can be expected that a TLAB allocation will succeed.
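
The pointer bump itself is small enough to model directly. The sketch below uses hypothetical names (TlabSketch and its members) and plain long values in place of real addresses. It mirrors the compare-and-branch in the listing, including the detail that jae sends an allocation ending exactly at the TLAB end down the slow path. The 0x18 size is the one added to the TLAB top in the createDouble listing: a 12-byte header, 4 bytes of padding and the 8-byte double.

```java
// Hypothetical model of TLAB bump allocation, for illustration only.
final class TlabSketch {
    /** Size of a boxed Double as seen in the listing (`add r10, 0x18`). */
    static final long DOUBLE_SIZE = 12 + 4 + 8;

    private long theTop;        // stands in for the TLAB top pointer
    private final long theEnd;  // stands in for the TLAB end pointer

    TlabSketch(long aStart, long aEnd) {
        theTop = aStart;
        theEnd = aEnd;
    }

    /** Returns the address of the new object, or -1 if the slow path would be taken. */
    long allocate(long aSizeInBytes) {
        final long myNewTop = theTop + aSizeInBytes;
        if (Long.compareUnsigned(myNewTop, theEnd) >= 0) { // mirrors `cmp` + `jae`
            return -1;
        }
        final long myObject = theTop;
        theTop = myNewTop; // the pointer bump
        return myObject;
    }
}
```

In the real code, the slow path is the call into the runtime that refills the TLAB, rather than a sentinel return value.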

Next Steps

For the rest of this series we'll only be looking at the "user" portion of functions that have been fed through a clean-up script and hand-annotated to make them easier to follow. It's recommended to have reference material like Intel's manuals handy, or access to other sources that mirror that information, if you're looking to follow some of the AVX sections in future (unless you've got a far better memory than me). An online assembler/disassembler can also be useful, and there are multiple sources that can be used to estimate the throughput and latency of non-memory-referencing instructions on various architectures, alongside tools like uiCA that can simulate specific microarchitectures. Libraries like JOL (or the JetBrains plugin for the same) can help determine which fields in memory are being referenced by a given instruction.

In part 2 we'll be discussing how to estimate the performance of a given assembly snippet, with part 3 leaning back on benchmarking to confirm the theory shown in part 2. Part 4 of this series is where we'll finally start discussing some potential alternatives to C2 code.

© 2024-2025 James Venning, All Rights Reserved

Any trademarks are properties of their respective owners. All content and any views or opinions expressed are my own and not associated with my employer. This site is not affiliated with Oracle®.