This series focuses on analysing assembly output by HotSpot's C2 compiler, also known as "opto", and looking for headroom where performance can be further tuned to bring Java's performance closer to native code. The first article in this series — this article — will cover how to pop the hood on C2 to dump out the assembly it's producing and highlight a few HotSpot-specific parts that may be unfamiliar. To get the most out of this article, ideally the reader will already be familiar with x86-64 assembly.
Dumping assembly from HotSpot requires you to add a library called hsdis to your installation. Luckily, Chris Newland provides pre-built binaries for all major platforms. I'll be using x64 Linux for this article (and will also pretend that 32-bit JVMs do not exist) as that's the most common server architecture I deal with. The overall takeaways from this series are applicable to non-x64 platforms, however the minutiae will, of course, vary.
Once you've got hsdis downloaded and put into your Java bin folder you can start dumping assembly right away, but you're likely to end up with more output than you're after unless you target specific classes or functions. To demonstrate dumping assembly we'll use the following sample program:
package com.overlyenthusiastic.examples;
import java.util.Random;
public class DumpAssembly {
@SuppressWarnings("ResultOfMethodCallIgnored")
public static void main(String[] aArgs) {
final Random myRandom = new Random(1234);
final String[] myStrings = new String[10];
for (int i = 0; i < myStrings.length; i++) {
myStrings[i] = String.valueOf(myRandom.nextInt());
}
for (int i = 0; i < 50_000; i++) {
ExampleFunctions.stringLength(myStrings[myRandom.nextInt(myStrings.length)]);
}
}
}
package com.overlyenthusiastic.examples;
@SuppressWarnings("UnusedReturnValue")
public enum ExampleFunctions {
;
public static int stringLength(String aString) {
return aString.length();
}
}
The way we use random data is largely irrelevant; what matters is that we call ExampleFunctions::stringLength at least -XX:Tier4CompileThreshold times (for my JDK, that defaults to 15000) to ensure it gets compiled by C2.
With this sample program, we'll dump the assembly by creating a compiler directives file like the one below. I called mine compiler_directives, but the name doesn't matter as long as it matches what we supply on the command line (see below):
[
{
match: "com.overlyenthusiastic.examples.DumpAssembly::*",
inline: [
"-*::*"
]
},
{
match: "com.overlyenthusiastic.examples.ExampleFunctions::*",
c2: {
PrintNMethods: true
}
}
]
This file prevents anything from being inlined into our DumpAssembly methods, which keeps stringLength as a standalone compilation (so its return value cannot be optimised away) and gives us clean output to capture, and it restricts printing to the C2 compilations of our ExampleFunctions class. You can also print output from C1 if you're interested in seeing that. Passing this file to the VM is fairly straightforward — though it's important to remember to select whichever GC is most relevant to you in production when you're dealing with more complex examples than this one.
-XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=./samples/src/main/resources/compiler_directives -XX:PrintAssemblyOptions=intel -XX:+UseG1GC
If you are interested in specific code from a production workload (for example, code highlighted by a profiler), it can be tempting to pull the code under test out into a benchmark or harness for study. Be careful: type-profile pollution, where common utilities (Objects.equals, Objects.hashCode, anything receiving different lambdas, etc.) or other methods receive many different classes of object, can have knock-on effects on the optimiser in real applications that will not always reproduce in a benchmark or toy program. For this reason, it is preferable, at least initially, to capture assembly from the system under test, to ensure that whatever benchmark or harness you build closely matches production. Common benchmarking wisdom, which says to run each benchmark in a fresh and "unpolluted" JVM, can produce erroneous findings versus a more production-like setup for business-critical code.
As an example, consider a function that accepts a List. If in production it receives a mix of list implementations (new ArrayList, List.of(), List.of(x), List.of(x, x, x) and Arrays.asList can all produce distinct concrete classes), the call site may become megamorphic, with virtual dispatch and no inlining for many of its invocations. A benchmark prepared for the function may pass a single type of List in for all invocations and therefore not display behaviour indicative of production.
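To see how easily this mix arises, here's a small self-contained sketch (the class and method names are my own, for illustration) that prints the concrete classes behind some common List factory calls; on current OpenJDK builds these come back as several distinct implementation classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListTypes {
    // Which concrete class is behind the List this call site receives?
    // A hot call site profiled against many of these goes megamorphic.
    public static String concreteType(List<String> aList) {
        return aList.getClass().getName();
    }

    public static void main(String[] aArgs) {
        Set<String> types = new HashSet<>();
        types.add(concreteType(new ArrayList<>()));
        types.add(concreteType(List.of()));
        types.add(concreteType(List.of("x")));
        types.add(concreteType(List.of("x", "x", "x")));
        types.add(concreteType(Arrays.asList("x")));
        // The exact class names are JDK implementation details, but the set
        // contains several distinct entries on current OpenJDK builds.
        System.out.println(types);
    }
}
```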
Keep an eye out for assembly that has been generated as part of an on-stack replacement when collecting measurements, as such functions can be as confusing to the optimiser as they are to the reader. It is generally possible in real applications to disable this feature without loss of performance, mostly because it is often no longer required once an application is fully warmed up. (If disabling it causes you a problem when collecting measurements in production, the feature is probably doing real work for your application, and turning it off may itself alter performance.)
A reference may also be called an "OOP", or "Ordinary Object Pointer", with a compressed reference being called a "cOOP" or "Compressed Ordinary Object Pointer".
When the heap size is less than 32 GB and it has not been explicitly disabled (with -XX:-UseCompressedOops), Java will store compressed references in the heap instead of the full value. This means that a field holding an object reference will take up 4 bytes instead of 8, resulting in lower memory usage for an application.
A compressed reference is just a pointer to a Java object with the lower 3 bits shifted away (as Java objects are always 8-byte aligned, these bits are always zero). The minimum object alignment (and therefore the shift value and maximum heap size for compressed references) is configurable on the command line (using -XX:ObjectAlignmentInBytes=8), but is typically not changed. This series will always assume that compressed oops are in-use (as this is the more "complex" scenario) and that the default object alignment is being used. You can read more about compressed references here.
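As a rough illustration, the encode/decode arithmetic for the default 8-byte alignment can be sketched in plain Java (the method names are mine; HotSpot performs the decode inside a single addressing mode, as we'll see in the dumps below). The example value is the "foosball" oop that appears in a later dump, whose raw address 0x7099bf7d8 compresses to 0xe1337efb:

```java
public class CoopMath {
    // Sketch only: raw = heapBase + (coop << 3) for the default shift of 3.
    public static long decompress(long aHeapBase, int aCompressedRef) {
        // Treat the 32-bit compressed reference as unsigned before shifting
        return aHeapBase + ((aCompressedRef & 0xFFFF_FFFFL) << 3);
    }

    public static int compress(long aHeapBase, long aRawAddress) {
        return (int) ((aRawAddress - aHeapBase) >>> 3);
    }

    public static void main(String[] aArgs) {
        long heapBase = 0L;                       // HotSpot prefers a zero base
        long rawOop = 0x7099bf7d8L;               // the "foosball" oop from a later dump
        int coop = compress(heapBase, rawOop);
        System.out.printf("coop=0x%x raw=0x%x%n", coop, decompress(heapBase, coop));
        // → coop=0xe1337efb raw=0x7099bf7d8
    }
}
```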
Every object in the Java heap begins with an 8-byte "mark word", followed by a reference to a structure describing the object's type (the "klass word"; sometimes just "klass" or "class"), which can be either 4 or 8 bytes depending on whether -XX:+UseCompressedClassPointers is enabled (it defaults to true when compressed OOPs are in use, and cannot be enabled when compressed OOPs are disabled).
For the purposes of this series, the exact layout of an object header is unimportant (though we will assume 12-byte headers). Note that the header shrinks and the layout changes under Project Lilliput, but in general this difference will not be relevant for our analyses.
Given our stringLength function from before:
public static int stringLength(String aString) {
return aString.length();
}
And a class layout diagram for java.lang.String (note that in HotSpot parlance, class is often spelled klass or klazz).
The raw assembly for this looks like this:
----------------------------------- Assembly -----------------------------------
Compiled method (c2)  289  218  4  com.overlyenthusiastic.examples.ExampleFunctions::stringLength (5 bytes)
 total in heap   [0x00007f36a00d8d88,0x00007f36a00d8f40] = 440
 relocation      [0x00007f36a00d8e60,0x00007f36a00d8e78] = 24
 main code       [0x00007f36a00d8e80,0x00007f36a00d8f08] = 136
 stub code       [0x00007f36a00d8f08,0x00007f36a00d8f20] = 24
 oops            [0x00007f36a00d8f20,0x00007f36a00d8f28] = 8
 metadata        [0x00007f36a00d8f28,0x00007f36a00d8f40] = 24
 immutable data  [0x00007f3610079000,0x00007f3610079080] = 128
 dependencies    [0x00007f3610079000,0x00007f3610079008] = 8
 nul chk table   [0x00007f3610079008,0x00007f3610079020] = 24
 scopes pcs      [0x00007f3610079020,0x00007f3610079060] = 64
 scopes data     [0x00007f3610079060,0x00007f3610079080] = 32
[Disassembly]
--------------------------------------------------------------------------------
[Constant Pool (empty)]
--------------------------------------------------------------------------------
[Verified Entry Point]
  # {method} {0x00007f3630407f68} 'stringLength' '(Ljava/lang/String;)I' in 'com/overlyenthusiastic/examples/ExampleFunctions'
  # parm0:    rsi:rsi = 'java/lang/String'
  #           [sp+0x20] (sp of caller)
  0x00007f36a00d8e80: mov    DWORD PTR [rsp-0x14000],eax
  0x00007f36a00d8e87: push   rbp
  0x00007f36a00d8e88: sub    rsp,0x10
  0x00007f36a00d8e8c: cmp    DWORD PTR [r15+0x20],0x1
  0x00007f36a00d8e94: jne    0x00007f36a00d8efe
  0x00007f36a00d8e9a: mov    r11d,DWORD PTR [rsi+0x14]      ; implicit exception: dispatches to 0x00007f36a00d8ec0
  0x00007f36a00d8e9e: mov    r10d,DWORD PTR [r12+r11*8+0xc] ; implicit exception: dispatches to 0x00007f36a00d8ed4
  0x00007f36a00d8ea3: movsx  r8d,BYTE PTR [rsi+0x10]
  0x00007f36a00d8ea8: sarx   eax,r10d,r8d
  0x00007f36a00d8ead: add    rsp,0x10
  0x00007f36a00d8eb1: pop    rbp
  0x00007f36a00d8eb2: cmp    rsp,QWORD PTR [r15+0x448]      ; {poll_return}
  0x00007f36a00d8eb9: ja     0x00007f36a00d8ee8
  0x00007f36a00d8ebf: ret
  0x00007f36a00d8ec0: mov    esi,0xfffffff6
  0x00007f36a00d8ec5: xchg   ax,ax
  0x00007f36a00d8ec7: call   0x00007f369fb87b60             ; ImmutableOopMap {}
                                                            ;*invokevirtual length {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.overlyenthusiastic.examples.ExampleFunctions::stringLength@1 (line 8)
                                                            ; {runtime_call UncommonTrapBlob}
  0x00007f36a00d8ecc: nop    DWORD PTR [rax+rax*1+0x144]    ; {other}
  0x00007f36a00d8ed4: mov    esi,0xfffffff6
  0x00007f36a00d8ed9: xchg   ax,ax
  0x00007f36a00d8edb: call   0x00007f369fb87b60             ; ImmutableOopMap {}
                                                            ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.String::length@4 (line 1593)
                                                            ; - com.overlyenthusiastic.examples.ExampleFunctions::stringLength@1 (line 8)
                                                            ; {runtime_call UncommonTrapBlob}
  0x00007f36a00d8ee0: nop    DWORD PTR [rax+rax*1+0x1000158] ; {other}
  0x00007f36a00d8ee8: movabs r10,0x7f36a00d8eb2             ; {internal_word}
  0x00007f36a00d8ef2: mov    QWORD PTR [r15+0x460],r10
  0x00007f36a00d8ef9: jmp    0x00007f369fb88be0             ; {runtime_call SafepointBlob}
  0x00007f36a00d8efe: call   Stub::nmethod_entry_barrier    ; {runtime_call StubRoutines (final stubs)}
  0x00007f36a00d8f03: jmp    0x00007f36a00d8e9a
[Exception Handler]
  0x00007f36a00d8f08: jmp    0x00007f369fc4e8e0             ; {no_reloc}
[Deopt Handler Code]
  0x00007f36a00d8f0d: call   0x00007f36a00d8f12
  0x00007f36a00d8f12: sub    QWORD PTR [rsp],0x5
  0x00007f36a00d8f17: jmp    0x00007f369fb87e80             ; {runtime_call DeoptimizationBlob}
  0x00007f36a00d8f1c: hlt
  0x00007f36a00d8f1d: hlt
  0x00007f36a00d8f1e: hlt
  0x00007f36a00d8f1f: hlt
--------------------------------------------------------------------------------
[/Disassembly]
Future articles in this series will look at cleaned & more heavily annotated assembly, but here we will walk through the raw output and discuss the method prologue/epilogue, identify the safepoint code, and discuss thread-local and heap access.
Execution for a compiled HotSpot method starts at the [Verified Entry Point]. The first instruction of this method, mov DWORD PTR [rsp-0x14000], eax, writes a dud value a fair way down the stack in order to verify that at least that much space is still available to use. If this access turns out to be invalid, the runtime will catch the resulting signal and issue a StackOverflowError for the current thread. You can read more about this here. This instruction is unrelated to our user code and generally speaking is unimportant from a performance point-of-view as no other instructions depend on it.
In the below, we see the stack being prepared as well as an "nmethod entry barrier" (nmethod is just HotSpot-speak for a "native" Java method — one that's been compiled). This barrier is used when the runtime needs to patch embedded object references (direct references to heap objects in the machine code — these may be compressed or "raw") during a safepoint. Once execution resumes, a thread entering the method hits the armed barrier and calls into the runtime, which fixes up the embedded references (and performs any required instruction-cache maintenance) so the thread cannot observe a stale object reference.
; Create space for our stack
0x00007f36a00d8e87: push rbp
0x00007f36a00d8e88: sub  rsp,0x10
; Read the entry barrier flag
0x00007f36a00d8e8c: cmp  DWORD PTR [r15+0x20],0x1
0x00007f36a00d8e94: jne  0x00007f36a00d8efe
0x00007f36a00d8e9a: ; ... actual function start ...
                    ; ...
                    ; ...
                    ; ...
0x00007f36a00d8ef9: jmp  0x00007f369fb88be0 ; {runtime_call SafepointBlob}
0x00007f36a00d8efe: call Stub::nmethod_entry_barrier
The r15 register always holds a pointer to the current thread's information, the structure of which is found here and here, among other places. In the JDK version used above, the offset 0x20 holds the field Thread::_nmethod_disarmed_guard_value (in future, more annotated examples, this information will be supplied in-line).
The end of a Java function starts with a fairly standard looking stack adjustment, but continues with another read through the r15 register labelled poll_return. This read is checking for whether this thread has been asked to safepoint, and if so this then jumps off to a little bit of code that actually responds to the request if one is pending. As we head off into the safepoint code, we save the return address into [r15+0x460] (this is the field JavaThread::_saved_exception_pc) so the runtime knows where it came from and where it needs to go back to.
0x00007f36a00d8ead: add  rsp,0x10
0x00007f36a00d8eb1: pop  rbp
0x00007f36a00d8eb2: cmp  rsp,QWORD PTR [r15+0x448] ; {poll_return}
0x00007f36a00d8eb9: ja   0x00007f36a00d8ee8
0x00007f36a00d8ebf: ret
...
0x00007f36a00d8ee8: movabs r10,0x7f36a00d8eb2      ; {internal_word}
0x00007f36a00d8ef2: mov  QWORD PTR [r15+0x460],r10
0x00007f36a00d8ef9: jmp  0x00007f369fb88be0        ; {runtime_call SafepointBlob}
Moving on to the user portion of the method, you may have noticed an "implicit exception" on the first line of this section. This instruction is loading the value field on String, and doing so without an explicit null check. As part of compilation, the JIT will create a look-up table for the runtime that basically says "if you crash with a null-pointer here, don't actually crash, just transfer execution over here". This is an overhead-free (when you're not throwing) way to implement null checks, and it moves the landing pad / handler code far away from the hotter code, which is more cache-friendly. The handler will load a reason code into the esi register before jumping into the next part we'll talk about, the uncommon trap blob.
0x00007f36a00d8e9a: mov  r11d,DWORD PTR [rsi+0x14] ; implicit exception: dispatches to 0x00007f36a00d8ec0
                    ; ...
                    ; ...
0x00007f36a00d8ec0: mov  esi,0xfffffff6
0x00007f36a00d8ec5: xchg ax,ax
0x00007f36a00d8ec7: call 0x00007f369fb87b60        ; ImmutableOopMap {}
                                                   ;*invokevirtual length {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - com.overlyenthusiastic.examples.ExampleFunctions::stringLength@1 (line 8)
                                                   ; {runtime_call UncommonTrapBlob}
The null-pointer check above doesn't actually end with the JIT'd code throwing a NullPointerException, instead it ends with a call into UncommonTrapBlob. If the exception is thrown during the method's warm-up, it will instead be an explicit null-check followed by an explicit runtime call to throw a NullPointerException. Calls to UncommonTrapBlob are fairly common and indicate that the JIT considers this path unlikely to execute, and so has not actually generated code to handle it. Instead, it simply fills in a small reason code and transfers control to the runtime to handle the uncommon operation, which may result in de-optimisation (the JIT'd code may be discarded) of the current method for re-profiling.
0x00007f36a00d8e9a: mov r11d,DWORD PTR [rsi+0x14]      ; implicit exception: dispatches to 0x00007f36a00d8ec0
0x00007f36a00d8e9e: mov r10d,DWORD PTR [r12+r11*8+0xc] ; implicit exception: dispatches to 0x00007f36a00d8ed4
Refocusing on the user code once more, these two memory accesses demonstrate the two different ways memory is accessed in Java. In the first, rsi holds a raw pointer to an object, which can be dereferenced directly to load the 4-byte compressed reference (the value field) from the String. The second shows how we read through that compressed reference: the r12 register holds the address of the heap base (most commonly 0), while r11*8 decompresses the reference into an offset into the heap, with 0xc being the offset of the length field in a byte[]. These two lines together therefore represent the inlined operation aString.value.length. The JVM prefers to place the heap base at 0, but will move it if platform restrictions prevent that or if it is requested to place it elsewhere on the command line.
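For reference, the remaining movsx/sarx pair completes the inlined java.lang.String.length(), which in compact-strings JDKs computes value.length >> coder (coder selects one byte per char for LATIN1, two for UTF-16). A Java-level sketch of the whole inlined body, with my own standalone names:

```java
public class StringLengthSketch {
    // Mirrors the inlined String.length(): the backing byte[] length shifted
    // right by the coder (0 = LATIN1, 1 = UTF-16). This is a sketch of the
    // movsx + sarx pair above, not real JDK code.
    public static int length(byte[] aValue, byte aCoder) {
        return aValue.length >> aCoder;
    }

    public static void main(String[] aArgs) {
        System.out.println(length(new byte[4], (byte) 0)); // → 4 (LATIN1 chars)
        System.out.println(length(new byte[4], (byte) 1)); // → 2 (UTF-16 chars)
    }
}
```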
Though the heap base cannot move at runtime, and its value will commonly be 0, the JIT only partially relies on this. The register is still included in address calculations (even when not required) and is unavailable for general allocation; however, the JIT will sometimes use it as a "zero" register, e.g. for zeroing newly allocated memory.
For our second example, let's check out what it's like to assign to a field. We'll only touch on this briefly, so that the reader is familiar with what may be seen in larger examples later in the series.
public static void setPrimitiveFields(Foo aFoo) {
aFoo.theField = 123;
aFoo.theDoubleField = 43.23;
}
public static void setStringField(Foo aFoo) {
aFoo.theStringField = "foosball";
}
public static void setBoxedPrimitiveField(Foo aFoo) {
aFoo.theBoxedInteger = 3;
}
public static class Foo {
public int theField;
public double theDoubleField;
public String theStringField;
public Integer theBoxedInteger;
}
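Assuming the 12-byte headers mentioned earlier and 4-byte compressed references, we can sketch the field offsets we'd expect for Foo. This is my own back-of-envelope calculator, not HotSpot code: HotSpot is free to reorder fields, and a tool like JOL reports the real layout, but the offsets below line up with the dumps that follow.

```java
public class FooLayout {
    // Round aOffset up to the next multiple of aTo (aTo must be a power of 2)
    public static int align(int aOffset, int aTo) {
        return (aOffset + aTo - 1) & -aTo;
    }

    public static void main(String[] aArgs) {
        int theField = 12;                           // int straight after the 12-byte header -> 0xc
        int theDoubleField = align(theField + 4, 8); // double needs 8-byte alignment -> 0x10
        int theStringField = theDoubleField + 8;     // 4-byte compressed ref -> 0x18
        int theBoxedInteger = theStringField + 4;    // 4-byte compressed ref -> 0x1c
        System.out.printf("0x%x 0x%x 0x%x 0x%x%n",
                theField, theDoubleField, theStringField, theBoxedInteger);
        // → 0xc 0x10 0x18 0x1c
    }
}
```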
Starting with setPrimitiveFields, we get the following code (displayed here cleaned & annotated — as all future examples in the series will be):
constant_pool:
    0x3d_0a_d7_a3_70_9d_45_40            ; 43.23 as a `double`

; [Verified Entry Point]
; {method} 'setPrimitiveFields' '(Lcom/overlyenthusiastic/examples/ExampleFunctions$Foo;)V'
; parm0: rsi:rsi = 'com/overlyenthusiastic/examples/ExampleFunctions$Foo'
; [sp+0x20] (sp of caller)
    mov dword ptr [rsp-0x14000], eax
    push rbp
    sub rsp, 0x10
    cmp dword ptr [r15+0x20], 0x1        ; Thread::_nmethod_disarmed_guard_value
    jne entry_barrier
method_start:
    mov dword ptr [rsi+0xc], 0x7b        ; `aFoo.theField` = 123
                                         ; implicit exception dispatches to @npe0
    vmovsd xmm0, qword ptr [rip+0xb7]    ; xmm0 = 43.23 (via rip-relative reference to @constant_pool)
    vmovsd qword ptr [rsi+0x10], xmm0    ; `aFoo.theDoubleField` = xmm0
    add rsp, 0x10
    pop rbp
safepoint0_check:
    cmp rsp, qword ptr [r15+0x448]       ; JavaThread::_poll_data::_polling_word
    ja safepoint0
    ret
npe0:
    mov esi, 0xfffffff6
    call Stub::UncommonTrapBlob
safepoint0:
    movabs r10, safepoint0_check
    mov qword ptr [r15+0x460], r10       ; JavaThread::_saved_exception_pc
    jmp Stub::SafepointBlob
entry_barrier:
    call Stub::nmethod_entry_barrier
    jmp method_start
; [Exception Handler]
    jmp Stub::ExceptionHandler
; [Deopt Handler Code]
    call deopt
deopt:
    sub qword ptr [rsp], 0x5
    jmp Stub::DeoptimizationBlob
As before, we have a stack-check, a prologue, an entry barrier, an implicit null check, an epilogue, and a safepoint. The "active" portion of the function is simply:
mov    dword ptr [rsi+0xc], 0x7b     ; implicit exception: dispatches to npe0
vmovsd xmm0, qword ptr [rip+0xb7]    ; rip-relative reference to constant_pool
vmovsd qword ptr [rsi+0x10], xmm0
Which contains our first reference to a constant pool created alongside the code in which our double value is held. There isn't anything terribly interesting going on here, so let's move right along to what happens when we try to assign our String field:
public static void setStringField(Foo aFoo) {
aFoo.theStringField = "foosball";
}
This produces the following assembly:
; [Verified Entry Point]
; {method} 'setStringField' '(Lcom/overlyenthusiastic/examples/ExampleFunctions$Foo;)V'
; parm0: rsi:rsi = 'com/overlyenthusiastic/examples/ExampleFunctions$Foo'
; [sp+0x20] (sp of caller)
    mov dword ptr [rsp-0x14000], eax
    push rbp
    sub rsp, 0x10
    cmp dword ptr [r15+0x20], 0x1        ; Thread::_nmethod_disarmed_guard_value
    jne entry_barrier
method_start:
    mov rbx, rsi
    test rsi, rsi
    je npe0                              ; if (aFoo == null) goto npe0
    cmp byte ptr [r15+0x38], 0x0         ; Thread::_gc_data::_satb_mark_queue::_active
    jne satb_marking                     ; if (satb_marking_active) goto satb_marking
satb_marking_finished:
    mov dword ptr [rbx+0x18], 0xe1337efb ; aFoo.theStringField = "foosball" {coop("foosball"{0x00000007099bf7d8})}
    mov r10, rbx                         ; r10 = aFoo
    movabs r11, 0x7099bf7d8              ; {oop("foosball"{0x00000007099bf7d8})}
    xor r11, r10
    shr r11, 0x15
    test r11, r11                        ; Is aFoo in the same 2MB GC region as the constant?
    je epilogue
    shr r10, 0x9                         ; r10 = aFoo / 512
    movabs rdi, 0x7fefaa944000
    add rdi, r10
    cmp byte ptr [rdi], 0x2
    jne card_marking                     ; if (cardTable[aFoo / 512] != 2) goto card_marking
epilogue:
    add rsp, 0x10
    pop rbp
safepoint0_check:
    cmp rsp, qword ptr [r15+0x448]       ; JavaThread::_poll_data::_polling_word
    ja safepoint0
    ret
satb_marking:
    mov r11d, dword ptr [rsi+0x18]       ; r11 = aFoo.theStringField
    test r11d, r11d
    je satb_marking_finished             ; If aFoo.theStringField == null, do nothing
    mov r10, qword ptr [r15+0x28]        ; Thread::_gc_data::_satb_mark_queue::_index
    mov rdi, r11                         ; rdi = aFoo.theStringField (coop)
    shl rdi, 0x3                         ; rdi = aFoo.theStringField (oop)
    test r10, r10
    je satb_marking_refill
    mov r11, qword ptr [r15+0x30]        ; Thread::_gc_data::_satb_mark_queue::_buf
    mov qword ptr [r11+r10*1-0x8], rdi   ; add old value to mark queue
    add r10, 0xfffffffffffffff8          ; r10 -= 8
    mov qword ptr [r15+0x28], r10        ; Thread::_gc_data::_satb_mark_queue::_index
    jmp satb_marking_finished
card_marking:
    mov r10, qword ptr [r15+0x40]        ; Thread::_gc_data::_dirty_card_queue::_index
    mov r11, qword ptr [r15+0x48]        ; Thread::_gc_data::_dirty_card_queue::_buf
    lock add dword ptr [rsp-0x40], 0x0
    cmp byte ptr [rdi], 0x0
    je epilogue
    mov byte ptr [rdi], r12b
    test r10, r10
    jne card_marking_no_refill
    mov rsi, r15
    movabs r10, qword G1BarrierSetRuntime::write_ref_field_post_entry
    call r10
    jmp epilogue
card_marking_no_refill:
    mov qword ptr [r11+r10*1-0x8], rdi
    add r10, 0xfffffffffffffff8
    mov qword ptr [r15+0x40], r10        ; Thread::_gc_data::_dirty_card_queue::_index
    jmp epilogue
npe0:
    mov esi, 0xfffffff6
    call Stub::UncommonTrapBlob
satb_marking_refill:
    mov rsi, r15
    movabs r10, qword G1BarrierSetRuntime::write_ref_field_pre_entry
    call r10
    jmp satb_marking_finished
safepoint0:
    movabs r10, qword safepoint0_check
    mov qword ptr [r15+0x460], r10       ; JavaThread::_saved_exception_pc
    jmp Stub::SafepointBlob
entry_barrier:
    call Stub::nmethod_entry_barrier
    jmp method_start
; [Exception Handler]
    jmp Stub::ExceptionHandler
; [Deopt Handler Code]
    call deopt
deopt:
    sub qword ptr [rsp], 0x5
    jmp Stub::DeoptimizationBlob
There's a surprisingly large amount of code here, and a lot of branches. Fortunately these are often fairly predictable, and much of the code won't actually run outside of specific GC phases. What we're seeing is a write barrier consisting of pre-write SATB marking and a post-write card-table update. Non-G1 collectors may have less (or no) work to do during field updates (CMS/Epsilon), or may have even slower field updates due to requiring more complex barriers (like ZGC). We'll briefly touch on these, but as we are focusing on the JIT rather than the GC we'll largely treat them as overheads and ignore them.
Before a stored reference can be overwritten, depending on the phase of the collector, we may need to save the reference to a SATB queue. SATB is short for "snapshot-at-the-beginning", and is used to ensure that application writes that occur concurrently with GC marking don't result in objects erroneously being marked as dead. The GC is trying to create an atomic "snapshot" of what was live at the start of the marking process and must therefore save overwritten references to a look-aside buffer so they are not missed.
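The logic of that pre-write barrier can be sketched at the Java level. The names and data structures below are mine, chosen for readability; the real barrier writes into per-thread native buffers and calls into the runtime (G1BarrierSetRuntime::write_ref_field_pre_entry) when a buffer needs refilling:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SatbBarrierSketch {
    // Stand-in for the per-thread SATB buffer; real HotSpot uses a native
    // buffer addressed via _satb_mark_queue::{_buf,_index}.
    final Deque<Object> satbQueue = new ArrayDeque<>();
    boolean markingActive;   // Thread::_gc_data::_satb_mark_queue::_active

    void preWriteBarrier(Object aOldValue) {
        // Only record during concurrent marking, and only non-null old values
        if (markingActive && aOldValue != null) {
            satbQueue.push(aOldValue); // preserve the snapshot-at-the-beginning
        }
    }

    public static void main(String[] aArgs) {
        SatbBarrierSketch barrier = new SatbBarrierSketch();
        barrier.preWriteBarrier("ignored");           // marking inactive: no-op
        barrier.markingActive = true;
        barrier.preWriteBarrier("overwritten value"); // recorded for the marker
        System.out.println(barrier.satbQueue.size()); // → 1
    }
}
```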
You can check out the source that emits this barrier on GitHub, but we won't be diving deeply into the assembly here. This series focuses on the C2 JIT rather than the GC, so we will mostly treat barriers as a price we pay for the conveniences Java provides, except where the JIT has an opportunity to omit them — then we'll bring the barriers up again. When measuring performance impact we will assume the barrier is always the cheapest kind it can be.
In addition to SATB marking, "dirty cards" are written after a reference field update to indicate which (potentially old) regions may reference young regions. As before, the source can be found on GitHub, and this series will primarily ignore the contents of these barriers, commenting only when a barrier could have been omitted but wasn't. Again, when measuring performance we will assume the fast path is always taken.
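The post-write filtering can also be sketched in Java. The shifts mirror the assembly above (512-byte cards via shr 0x9, 2 MB regions via shr 0x15); the card values and names are my readable stand-ins rather than HotSpot's actual encoding, and the real barrier additionally enqueues the card for concurrent refinement:

```java
public class CardTableSketch {
    static final int CARD_SHIFT = 9;     // 512-byte cards (shr r10, 0x9)
    static final int REGION_SHIFT = 21;  // 2 MB regions  (shr r11, 0x15)
    static final byte YOUNG = 2;         // young cards never need dirtying
    static final byte DIRTY = 0;         // the value actually stored (r12b == 0)
    static final byte CLEAN = 1;         // placeholder "clean" value for this sketch

    static void postWriteBarrier(long aFieldAddr, long aNewValueAddr, byte[] aCardTable) {
        if (((aFieldAddr ^ aNewValueAddr) >>> REGION_SHIFT) == 0) {
            return;                      // same region: no cross-region reference created
        }
        int card = (int) (aFieldAddr >>> CARD_SHIFT);
        byte current = aCardTable[card];
        if (current != YOUNG && current != DIRTY) {
            aCardTable[card] = DIRTY;    // real G1 also queues the card for refinement
        }
    }

    public static void main(String[] aArgs) {
        byte[] cardTable = new byte[16];
        java.util.Arrays.fill(cardTable, CLEAN);
        postWriteBarrier(0x200L, 0x200000L, cardTable); // cross-region store
        System.out.println(cardTable[1]);               // → 0 (DIRTY)
    }
}
```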
To finish up, let's briefly look at object allocation in HotSpot. We'll use the simplest example I can think of: auto-boxing a double (auto-boxing an int would instead hit the Integer cache for small values; the same goes for long).
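The cache difference is easy to observe directly. The Integer cache for values in [-128, 127] is guaranteed by the JLS; the absence of a Double cache is OpenJDK behaviour rather than a spec guarantee, which is exactly why boxing a double needs a fresh allocation:

```java
public class BoxingCache {
    public static void main(String[] aArgs) {
        // JLS-guaranteed: small ints box to cached, identical instances
        Integer a = 3, b = 3;
        System.out.println(a == b); // → true (same cached instance)

        // Double.valueOf allocates on every call in OpenJDK, so each boxing
        // produces a distinct object (reference comparison is false there)
        Double x = 3.0, y = 3.0;
        System.out.println(x == y);
    }
}
```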
public static Double createDouble(double aValue) {
return aValue;
}
The assembly for this looks as follows:
; [Verified Entry Point]
; {method} 'createDouble' '(D)Ljava/lang/Double;' in 'com/overlyenthusiastic/examples/ExampleFunctions'
; parm0: xmm0:xmm0 = double
; [sp+0x20] (sp of caller)
    mov dword ptr [rsp-0x14000], eax
    push rbp
    sub rsp, 0x10
    cmp dword ptr [r15+0x20], 0x1        ; Thread::_nmethod_disarmed_guard_value
    jne entry_barrier
method_start:
    vmovq rbp, xmm0                      ; stash `aValue` in callee saved register
    mov rax, qword ptr [r15+0x1b8]       ; Thread::_tlab::_top
    mov r10, rax
    add r10, 0x18
    cmp r10, qword ptr [r15+0x1c8]       ; Thread::_tlab::_end
    jae tlab_allocation_failure          ; If this happens, `aValue` will be preserved in rbp
    mov qword ptr [r15+0x1b8], r10       ; Thread::_tlab::_top
    prefetchw byte ptr [r10+0x100]
    mov qword ptr [rax], 0x1             ; mark word
    mov dword ptr [rax+0x8], 0x18cd48    ; compressed class ptr {metadata('java/lang/Double')}
    mov dword ptr [rax+0xc], r12d        ; zero out the 4 bytes of padding
allocation_success:
    vmovq xmm0, rbp
    vmovsd qword ptr [rax+0x10], xmm0    ; Set the boxed value
    add rsp, 0x10
    pop rbp
safepoint0_check:
    cmp rsp, qword ptr [r15+0x448]       ; JavaThread::_poll_data::_polling_word
    ja safepoint0
    ret
tlab_allocation_failure:
    movabs rsi, 0x7fd6f618cd48           ; {metadata('java/lang/Double')}
    call Runtime::newInstance            ; ImmutableOopMap {}
                                         ;*new {reexecute=0 rethrow=0 return_oop=1}
                                         ; - java.lang.Double::valueOf@0 (line 924)
                                         ; {runtime_call _new_instance_Java}
    jmp allocation_success
allocation_failure:
    mov rsi, rax
    add rsp, 0x10
    pop rbp
    jmp Stub::RethrowJava
safepoint0:
    movabs r10, qword safepoint0_check
    mov qword ptr [r15+0x460], r10       ; JavaThread::_saved_exception_pc
    jmp Stub::SafepointBlob
entry_barrier:
    call Stub::nmethod_entry_barrier
    jmp method_start
; [Exception Handler]
    jmp Stub::ExceptionHandler
; [Deopt Handler Code]
    call deopt
deopt:
    sub qword ptr [rsp], 0x5
    jmp Stub::DeoptimizationBlob
The TLAB, or Thread-Local Allocation-Buffer, is a thread-local contiguous buffer of free memory that the current thread can use for allocations by simply doing a pointer bump. I won't cover TLAB allocations in great detail, as others have already covered them before. The only time we'll talk about them is when we could coalesce multiple allocations into a single TLAB allocation, or avoid zeroing parts of an allocation where a write is guaranteed to follow. Note that there is no communication with the runtime (beyond modification of the TLAB pointers) during a TLAB allocation.
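The fast path above boils down to a bump-pointer allocator, which can be sketched in a few lines of Java (the names are mine; a real TLAB failure falls back to the runtime, which refills the TLAB or triggers a GC rather than returning a sentinel):

```java
public class TlabSketch {
    // Mirrors Thread::_tlab::_top and _end from the dump
    long top;
    final long end;

    TlabSketch(long aStart, long aEnd) {
        top = aStart;
        end = aEnd;
    }

    /** Returns the address of the new object, or -1 to signal the slow path. */
    long allocate(int aSizeBytes) {
        long obj = top;
        long newTop = obj + aSizeBytes;              // add r10, 0x18 for a Double
        if (Long.compareUnsigned(newTop, end) >= 0) {
            return -1;                               // jae tlab_allocation_failure
        }
        top = newTop;                                // mov [r15+0x1b8], r10
        return obj;
    }

    public static void main(String[] aArgs) {
        TlabSketch tlab = new TlabSketch(0x1000L, 0x1030L);
        System.out.printf("0x%x%n", tlab.allocate(0x18)); // → 0x1000
        System.out.println(tlab.allocate(0x18));          // → -1 (slow path)
    }
}
```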
We can see 3 main parts of an allocation: the TLAB bump itself (compare the new top against the TLAB end, then advance the top pointer), initialisation of the object header (mark word, compressed class pointer, and zeroed padding), and finally the write of the boxed value into the new object.
GC-related allocation stalls (e.g. a GC triggered by memory pressure) will typically only occur after a failed TLAB allocation. For most allocations, especially small ones, a TLAB allocation can be expected to succeed.
For the rest of this series we'll only look at the "user" portion of functions, fed through a clean-up script and hand-annotated to make them easier to follow. It's recommended to have reference material such as Intel's manuals handy (or access to other sources that mirror that information) if you want to follow along with the AVX sections in future articles, unless you've got a far better memory than me. An online assembler/disassembler can also be useful, and there are multiple sources for estimating the throughput and latency of non-memory-referencing instructions on various architectures, alongside tools like uiCA that can simulate the microarchitecture of specific processors. Libraries like JOL (or the JetBrains plugin for the same) can help determine which fields in memory are being referenced by a given instruction.
In part 2 we'll discuss how to estimate the performance of a given assembly snippet, with part 3 leaning on benchmarking to confirm the theory from part 2. Part 4 of this series is where we'll finally start discussing some potential alternatives to C2's code.
© 2024-2025 James Venning, All Rights Reserved
Any trademarks are properties of their respective owners. All content and any views or opinions expressed are my own and not associated with my employer. This site is not affiliated with Oracle®.