Overly Enthusiastic

I like learning how stuff works.

Benchmarking Alternative Assemblies

In the previous instalment of this series, we discussed why benchmarks can be limiting, especially when evaluating potential changes to the JIT itself. In this article, we will backtrack a bit and discuss how custom assembly snippets can be benchmarked in Java using the JVMCI. This article is heavy with code snippets for anyone wanting to follow along at home with future articles.

Delivering custom code for a function

Let's say we have the below program and we run it.

package com.overlyenthusiastic.examples;

public class Substitute {
    public static void main(String[] aArgs) throws InterruptedException {
        System.out.println(getInteger());
        Thread.sleep(1_000);
        System.out.println(getInteger());
    }

    private static int getInteger() {
        return 7;
    }
}

What would you expect this to output? Naively, you might assume the answer is that it would print two lines, both containing 7. Such assumptions can sometimes be safe when dealing with sane co-workers, but, to me at least, this is far too pedestrian and I would like this to print 7 on the first line and 8 on the second.

In order to achieve this, we're going to use the JVMCI (the same interface used by Graal) to intercept the compilation request and deliver a custom machine code blob. We'll first need to use a compiler directives file and some command line arguments to force compilation of this function and dump the assembly for us to modify:

-XX:+UnlockDiagnosticVMOptions
-XX:-TieredCompilation
-XX:CompilerDirectivesFile=./samples/src/main/resources/substitute_compiler_directives
-XX:PrintAssemblyOptions=intel
-XX:+UseG1GC
-Xcomp

The compiler directives file (substitute_compiler_directives):

[
  {
    match: "com.overlyenthusiastic.examples.Substitute::main",
    inline: [
      "-*::*"
    ]
  },
  {
    match: "com.overlyenthusiastic.examples.Substitute::getInteger",
    Exclude: false,
    c2: {
      PrintNMethods: true
    }
  },
  {
    match: "*::*",
    Exclude: true
  }
]

Using this, I get this output for getInteger:

    sub    rsp, 0x18
    mov    qword ptr [rsp+0x10], rbp
    cmp    dword ptr [r15+0x20], dword 0   ; Thread::_nmethod_disarmed_guard_value
    jne    entry_barrier
method_start:
    mov    eax, 0x7
    add    rsp, 0x10
    pop    rbp
safepoint0_check:
    cmp    rsp, qword ptr [r15+0x448]     ; JavaThread::_poll_data::_polling_word
    ja     safepoint0
    ret
safepoint0:
    movabs r10, qword safepoint0_check
    mov    qword ptr [r15+0x460], r10     ; JavaThread::_saved_exception_pc
    jmp    Runtime::Safepoint
entry_barrier:
    call   Runtime::EntryBarrier
    jmp    method_start
; [Exception Handler]
    jmp    Runtime::ExceptionHandler
; [Deopt Handler Code]
    call   deopt
deopt:
    sub    qword ptr [rsp], 0x5
    jmp    Runtime::Deoptimise

That's far more ceremony than I'd like. Let's change the 7 to an 8 and trim it down to:

    cmp    dword ptr [r15+0x20], dword 0
    mov    eax, 0x8
    ret

The cmp is to trick the JVM into thinking we've got an entry barrier, which is required for methods emitted by the JVMCI. This assembly assembles to 41817F2000000000B808000000C3, which we can feed into the VM using a dummy compiler like this:

package com.overlyenthusiastic;

import jdk.vm.ci.code.*;
import jdk.vm.ci.code.site.*;
import jdk.vm.ci.hotspot.*;
import jdk.vm.ci.meta.*;
import jdk.vm.ci.runtime.*;

import java.util.HexFormat;

public class DummyCompiler implements JVMCICompiler {
    @Override
    public CompilationRequestResult compileMethod(CompilationRequest aRequest) {
        if (aRequest.getEntryBCI() != -1) {
            return HotSpotCompilationRequestResult.failure("OSR is not supported.", false);
        }

        final byte[] myMachineCode = HexFormat.of().parseHex("41817F2000000000B808000000C3");
        final int myEntryBarrierPatchId = (int) (long) ((HotSpotVMConfigAccess) ((HotSpotJVMCIRuntime) JVMCI.getRuntime()).getConfig())
                .getConstant("CodeInstaller::ENTRY_BARRIER_PATCH", Long.class);

        final HotSpotCompilationRequest myRequest = (HotSpotCompilationRequest) aRequest;
        JVMCI.getRuntime().getHostJVMCIBackend().getCodeCache().installCode(aRequest.getMethod(),
                new HotSpotCompiledNmethod("dummy", myMachineCode, myMachineCode.length,
                        new Site[] { new Mark(0, myEntryBarrierPatchId) },
                        new Assumptions.Assumption[0], new ResolvedJavaMethod[0], new HotSpotCompiledCode.Comment[0],
                        new byte[0], 8,
                        new DataPatch[0], false, 16,
                        StackSlot.get(ValueKind.Illegal, 0, false),
                        (HotSpotResolvedJavaMethod) aRequest.getMethod(), aRequest.getEntryBCI(),
                        myRequest.getId(), myRequest.getJvmciEnv(), false),
                null,
                new HotSpotSpeculationLog(),
                true);

        return HotSpotCompilationRequestResult.success(0);
    }

    @Override
    public boolean isGCSupported(int aGcIdentifier) {
        return true; // Not dealing with anything needing GC barriers.
    }
}
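
As a quick sanity check, the hex string handed to HexFormat above can be split back into its three instructions in plain Java, with no JVMCI involved. This is only an illustrative sketch: the class and method names (MachineCodeCheck, split, movImmediate) are hypothetical, and the instruction boundaries (8, 5 and 1 bytes) are assumed from the usual x86-64 encodings of the cmp, mov and ret in the trimmed listing.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HexFormat;

public class MachineCodeCheck {
    // The blob from the trimmed listing: cmp / mov / ret.
    static final String BLOB = "41817F2000000000B808000000C3";

    /** Splits the blob at the assumed instruction boundaries (8, 5 and 1 bytes). */
    static String[] split(String aHex) {
        final byte[] myCode = HexFormat.of().parseHex(aHex);
        final HexFormat myFormat = HexFormat.of().withUpperCase();
        return new String[] {
                myFormat.formatHex(myCode, 0, 8),   // cmp dword ptr [r15+0x20], dword 0
                myFormat.formatHex(myCode, 8, 13),  // mov eax, imm32
                myFormat.formatHex(myCode, 13, 14)  // ret
        };
    }

    /** Reads the little-endian imm32 operand that follows the B8 opcode of the mov. */
    static int movImmediate(String aHex) {
        final byte[] myCode = HexFormat.of().parseHex(aHex);
        return ByteBuffer.wrap(myCode, 9, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] aArgs) {
        for (String myInstruction : split(BLOB)) {
            System.out.println(myInstruction);
        }
        System.out.println("mov immediate = " + movImmediate(BLOB));
    }
}
```

Changing the returned constant then only means editing the four immediate bytes after B8, which is why swapping the 7 for an 8 leaves the blob length unchanged.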

We'll also need a JVMCIServiceLocator implementation:

package com.overlyenthusiastic;

import jdk.vm.ci.runtime.*;
import jdk.vm.ci.services.*;

import javax.annotation.Nullable;

public class DummyServiceLocator extends JVMCIServiceLocator {
    @Nullable
    @Override
    public <S> S getProvider(Class<S> aService) {
        if (aService == JVMCICompilerFactory.class) {
            //noinspection unchecked
            return (S) new JVMCICompilerFactory() {
                @Override
                public String getCompilerName() {
                    return "Dummy";
                }

                @Override
                public JVMCICompiler createCompiler(JVMCIRuntime aRuntime) {
                    return new DummyCompiler();
                }
            };
        }

        return null;
    }
}

And a META-INF/services/jdk.vm.ci.services.JVMCIServiceLocator file in our resources:

com.overlyenthusiastic.DummyServiceLocator

And finally some command line arguments to go along with this:

-XX:+UnlockDiagnosticVMOptions
-XX:-TieredCompilation
-XX:CompileThreshold=1
-XX:CompilerDirectivesFile=./samples/src/main/resources/substitute_compiler_directives
-XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions
-XX:+EnableJVMCI
-XX:+UseJVMCICompiler
-Djvmci.Compiler=Dummy
-Xbootclasspath/a:./samples/target/classes
--add-modules jdk.internal.vm.ci
--add-exports jdk.internal.vm.ci/jdk.vm.ci.code=ALL-UNNAMED
--add-exports jdk.internal.vm.ci/jdk.vm.ci.hotspot=ALL-UNNAMED
--add-exports jdk.internal.vm.ci/jdk.vm.ci.meta=ALL-UNNAMED
--add-exports jdk.internal.vm.ci/jdk.vm.ci.runtime=ALL-UNNAMED
--add-exports jdk.internal.vm.ci/jdk.vm.ci.services=ALL-UNNAMED

Phew. Ok, that was a lot. Let's run our application now.

7
8

Ok! So what's happening here is that the first call to getInteger goes through the interpreter, while -XX:CompileThreshold=1 causes compilation to begin in the background immediately. Compilation completes while the main thread sleeps, so by the time we call getInteger a second time, our new and improved code gives the caller a bigger (and therefore better) int back.
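
The timing can be mimicked in plain Java as a loose analogy. This is not what HotSpot does internally; it only models the observable behaviour. The names SwapAnalogy and sImplementation are made up for illustration, and a join stands in for the one-second sleep so that the outcome is deterministic.

```java
import java.util.function.IntSupplier;

public class SwapAnalogy {
    // Stands in for the method's current entry point; starts as the "interpreted" version.
    private static volatile IntSupplier sImplementation = () -> 7;

    static int getInteger() {
        return sImplementation.getAsInt();
    }

    /** Returns the values seen before and after the background swap completes. */
    static int[] run() throws InterruptedException {
        final int myFirst = getInteger();
        // Plays the role of the background compiler thread installing new code.
        final Thread myCompiler = new Thread(() -> {
            sImplementation = () -> 8;
        });
        myCompiler.start();
        myCompiler.join(); // the article sleeps instead; join makes this sketch deterministic
        final int mySecond = getInteger();
        return new int[] { myFirst, mySecond };
    }

    public static void main(String[] aArgs) throws InterruptedException {
        final int[] myResults = run();
        System.out.println(myResults[0]);
        System.out.println(myResults[1]);
    }
}
```

The caller never changes; only the code behind the entry point does, which is the property the real example relies on.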

On-demand code

For isolated examples like ours, we don't actually want a full-blown JVMCI compiler. Instead, we can compile single methods on demand by adding this to our toy compiler:

public CompilationRequestResult compileMethod(Executable aMethod) {
    final ResolvedJavaMethod myMethod = HotSpotJVMCIRuntime.runtime().getHostJVMCIBackend().getMetaAccess().lookupJavaMethod(aMethod);
    return compileMethod(new HotSpotCompilationRequest((HotSpotResolvedJavaMethod) myMethod, -1, 0, 0));
}

This can be manually invoked using:

final Method myMethod = SomeClass.class.getMethod("someMethod");
new DummyCompiler().compileMethod(myMethod); // TODO: Check the result object for an error

This allows us to omit -XX:+UseJVMCICompiler -Djvmci.Compiler=Dummy and -Xbootclasspath/a:... from our command line (but we still require -XX:+EnableJVMCI) and is suitable for use in larger examples (for example, replacing a single function in a testing or staging environment, if ever desired). Installing such a call at the top of our main method results in the following output:

8
8
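
One detail to watch when wiring this into the Substitute example: getInteger is private there, and Class.getMethod only finds public methods, so the lookup presumably has to go through Class.getDeclaredMethod. The sketch below shows just the reflection part in isolation. LookupSketch is a hypothetical stand-in for Substitute, and the DummyCompiler call is left as a comment because it needs the JVMCI flags to run.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class LookupSketch {
    private static int getInteger() {
        return 7;
    }

    /** Finds getInteger even though it is private. */
    static Method lookupDeclared() throws NoSuchMethodException {
        return LookupSketch.class.getDeclaredMethod("getInteger");
    }

    /** Returns true if the public-only lookup fails, as expected for a private method. */
    static boolean publicLookupFails() {
        try {
            LookupSketch.class.getMethod("getInteger");
            return false;
        } catch (NoSuchMethodException aException) {
            return true;
        }
    }

    public static void main(String[] aArgs) throws Exception {
        final Method myMethod = lookupDeclared();
        System.out.println(myMethod.getName()
                + " private=" + Modifier.isPrivate(myMethod.getModifiers()));
        System.out.println("getMethod fails: " + publicLookupFails());
        // In the article's setup, the on-demand compile would go here:
        // new DummyCompiler().compileMethod(myMethod);
    }
}
```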

Benchmarking

Using JVMCI to deliver non-trivial code chunks by pasting in raw machine code (including snippets containing things like embedded object literals, raw data constants, implicit and explicit exceptions, working safepoints, etc.) is possible, though frustrating. Instead, a custom assembler (the source of which is too large to fit in the margin of this blog) will be used to parse snippets (as shown below) that carry additional mark-up for communicating metadata to the runtime, such as the location of the nmethod_entry_barrier from above, calls that need to be relocated or fixed up, exception information, etc. Future examples in this series will show the assembly to be benchmarked; however, this is the last (and I suppose first) time I'll provide the accompanying machine code and site/mark information, due to its verbosity, the likelihood of JVM changes making such information useless, the effort required to produce it, and the low chance of a reader actually using it.

Here is the getInteger example with annotated site/mark information:

    sub    rsp, dword 0x18                 ;!mark verified_entry
    mov    qword ptr [rsp+0x10], rbp
    cmp    dword ptr [r15+0x20], dword 0   ;!mark nmethod_entry_barrier
    jne    entry_barrier
method_start:
    mov    eax, 0x8
    add    rsp, 0x10
    pop    rbp
safepoint0_check:
    cmp    rsp, qword ptr [r15+0x448]     ;!site safepoint ; JavaThread::_poll_data::_polling_word
    ja     safepoint0
    ret
safepoint0:
    lea    r10, [rip + rel safepoint0_check]
    mov    qword ptr [r15+0x460], r10     ; JavaThread::_saved_exception_pc
    jmp    Runtime::Safepoint             ;!site internal_call
entry_barrier:
    call   Runtime::EntryBarrier          ;!site internal_call
    jmp    method_start
    jmp    Runtime::ExceptionHandler      ;!mark exception_handler_entry !site call
    call   deopt                          ;!mark deopt_handler_entry
deopt:
    sub    qword ptr [rsp], 0x5
    jmp    Runtime::Deoptimise            ;!site internal_call

This assembles to: 4881EC1800000048896C241041817F20000000007527B8080000004883C4105D493BA7480400007701C34C8D15000000004D899760040000E900000000E800000000EBD2E900000000E80000000048832C2405E900000000.
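
The assembler itself is not shown, so the following is only a guess at how the ;!mark and ;!site annotations in the listing above could be pulled out of a single line. The class name AnnotationSketch, the regular expression, and the "kind:name" output format are all assumptions made for illustration, not the real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotationSketch {
    // Matches "!mark name" or "!site name"; only searched for inside the comment part.
    private static final Pattern DIRECTIVE = Pattern.compile("!(mark|site)\\s+(\\w+)");

    /** Returns the annotations on one assembly line as "kind:name" strings, in order. */
    static List<String> parse(String aLine) {
        final List<String> myAnnotations = new ArrayList<>();
        final int myCommentStart = aLine.indexOf(';');
        if (myCommentStart < 0) {
            return myAnnotations; // no comment, so no annotations
        }
        final Matcher myMatcher = DIRECTIVE.matcher(aLine.substring(myCommentStart));
        while (myMatcher.find()) {
            myAnnotations.add(myMatcher.group(1) + ":" + myMatcher.group(2));
        }
        return myAnnotations;
    }

    public static void main(String[] aArgs) {
        System.out.println(parse(
                "    jmp    Runtime::ExceptionHandler      ;!mark exception_handler_entry !site call"));
        System.out.println(parse("    ret"));
    }
}
```

A real assembler would also need to record the byte offset of each annotated instruction, since the marks and sites are expressed as offsets into the emitted machine code.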

The sites are as follows:

// ...

final long myDeoptBlobAddress = getAddress("CompilerToVM::Data::SharedRuntime_deopt_blob_unpack");
final long myEntryBarrierAddress = getAddress("CompilerToVM::Data::nmethod_entry_barrier");
final long mySafepointBlobAddress = getAddress("CompilerToVM::Data::SharedRuntime_polling_page_return_handler");
final long myExceptionHandlerAddress = getAddress("CompilerToVM::Data::SharedRuntime_polling_page_return_handler");
final int myVerifiedEntryId = (int) getConstant("CodeInstaller::VERIFIED_ENTRY");
final int myEntryBarrierPatchId = (int) getConstant("CodeInstaller::ENTRY_BARRIER_PATCH");
final int myExceptionHandlerEntryId = (int) getConstant("CodeInstaller::EXCEPTION_HANDLER_ENTRY");
final int myDeoptHandlerEntryId = (int) getConstant("CodeInstaller::DEOPT_HANDLER_ENTRY");

final DebugInfo myDebugInfo = new DebugInfo(new BytecodeFrame(null, aRequest.getMethod(), 0,
        true, false, new JavaValue[0], new JavaKind[0], 0, 0, 0),
        new VirtualObject[0]);
myDebugInfo.setReferenceMap(new HotSpotReferenceMap(new Location[0], new Location[0], new int[0], 0));
final Site[] mySites = new Site[] {
        new Mark(0x00, myVerifiedEntryId),
        new Mark(0x0C, myEntryBarrierPatchId),
        new Infopoint(0x20, myDebugInfo, InfopointReason.SAFEPOINT),
        new Call(new HotSpotForeignCallTarget(mySafepointBlobAddress), 0x38, 5, true, myDebugInfo),
        new Call(new HotSpotForeignCallTarget(myEntryBarrierAddress), 0x3D, 5, true, myDebugInfo),
        new Mark(0x44, myExceptionHandlerEntryId),
        new Call(new HotSpotForeignCallTarget(myExceptionHandlerAddress), 0x44, 5, true, myDebugInfo),
        new Mark(0x49, myDeoptHandlerEntryId),
        new Call(new HotSpotForeignCallTarget(myDeoptBlobAddress), 0x53, 5, true, myDebugInfo),
};

// ...

private static long getConstant(String aName) {
    return ((HotSpotVMConfigAccess) ((HotSpotJVMCIRuntime) JVMCI.getRuntime()).getConfig())
            .getConstant(aName, Long.class);
}

private static long getAddress(String aName) {
    return ((HotSpotVMConfigAccess) ((HotSpotJVMCIRuntime) JVMCI.getRuntime()).getConfig())
            .getFieldValue(aName, Long.class, "address");
}
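
The offsets passed to Mark, Infopoint and Call above can be cross-checked by summing instruction lengths. The sketch below assumes the hex string splits into the per-instruction chunks shown; the split was done by hand against the annotated listing and is not output from the author's assembler, and SiteOffsets is a hypothetical name.

```java
import java.util.HexFormat;

public class SiteOffsets {
    // One hex string per instruction, in listing order (hand-split, an assumption).
    static final String[] INSTRUCTIONS = {
            "4881EC18000000",   //  0: sub rsp, 0x18            (verified_entry)
            "48896C2410",       //  1: mov [rsp+0x10], rbp
            "41817F2000000000", //  2: cmp [r15+0x20], 0        (nmethod_entry_barrier)
            "7527",             //  3: jne entry_barrier
            "B808000000",       //  4: mov eax, 0x8
            "4883C410",         //  5: add rsp, 0x10
            "5D",               //  6: pop rbp
            "493BA748040000",   //  7: cmp rsp, [r15+0x448]     (safepoint)
            "7701",             //  8: ja safepoint0
            "C3",               //  9: ret
            "4C8D1500000000",   // 10: lea r10, [rip + ...]
            "4D899760040000",   // 11: mov [r15+0x460], r10
            "E900000000",       // 12: jmp Runtime::Safepoint
            "E800000000",       // 13: call Runtime::EntryBarrier
            "EBD2",             // 14: jmp method_start
            "E900000000",       // 15: jmp Runtime::ExceptionHandler
            "E800000000",       // 16: call deopt
            "48832C2405",       // 17: sub qword ptr [rsp], 0x5
            "E900000000"        // 18: jmp Runtime::Deoptimise
    };

    /** Returns the start offset of each instruction; the final entry is the total length. */
    static int[] offsets(String[] aInstructions) {
        final int[] myOffsets = new int[aInstructions.length + 1];
        for (int i = 0; i < aInstructions.length; i++) {
            myOffsets[i + 1] = myOffsets[i] + HexFormat.of().parseHex(aInstructions[i]).length;
        }
        return myOffsets;
    }

    public static void main(String[] aArgs) {
        final int[] myOffsets = offsets(INSTRUCTIONS);
        for (int i = 0; i < INSTRUCTIONS.length; i++) {
            System.out.printf("0x%02X  %s%n", myOffsets[i], INSTRUCTIONS[i]);
        }
    }
}
```

Under this split, the annotated instructions start at 0x00, 0x0C, 0x20, 0x38, 0x3D, 0x44, 0x49 and 0x53, matching the values in the site array.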

To benchmark this, we'll be using JMH.

@State(Scope.Benchmark)
public class Article3 {
    @Setup
    public void setup() throws Exception {
        final Method myMethod = Article3.class.getMethod("getIntegerJvmci");
        @Nullable
        final Object myFailure = new DummyCompiler().compileMethod(myMethod).getFailure();
        if (myFailure != null) {
            throw new IllegalStateException("Failed to compile: " + myFailure);
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int getIntegerJvmci() {
        throw new IllegalStateException("Unreachable");
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int getIntegerStock() {
        return 8;
    }

    public static void main(String[] aArgs) throws Exception {
        new Runner(new OptionsBuilder()
                .mode(Mode.AverageTime)
                .timeUnit(TimeUnit.NANOSECONDS)
                .warmupIterations(3)
                .measurementIterations(5)
                .shouldDoGC(true)
                .measurementTime(TimeValue.seconds(5))
                .warmupTime(TimeValue.seconds(3))
                .forks(3)
                .threads(1)
                .include(Article3.class.getName())
                .jvmArgs("-Xmx8G", "-XX:+UseG1GC",
                        "-XX:+UnlockExperimentalVMOptions", "-XX:+UnlockDiagnosticVMOptions",
                        "-XX:+EnableJVMCI", "-XX:-TieredCompilation",
                        "-XX:CompileCommand=print,com/overlyenthusiastic/benchmarks/Article3.*",
                        "-XX:PrintAssemblyOptions=intel",
                        "--add-modules", "jdk.internal.vm.ci",
                        "--add-exports", "jdk.internal.vm.ci/jdk.vm.ci.code=ALL-UNNAMED",
                        "--add-exports", "jdk.internal.vm.ci/jdk.vm.ci.hotspot=ALL-UNNAMED",
                        "--add-exports", "jdk.internal.vm.ci/jdk.vm.ci.meta=ALL-UNNAMED",
                        "--add-exports", "jdk.internal.vm.ci/jdk.vm.ci.runtime=ALL-UNNAMED",
                        "--add-exports", "jdk.internal.vm.ci/jdk.vm.ci.services=ALL-UNNAMED")
                .build()).run();
    }
}

Unsurprisingly, the result is rather boring:

Benchmark                 Mode  Cnt  Score   Error  Units
Article3.getIntegerJvmci  avgt    5  0.771 ± 0.026  ns/op
Article3.getIntegerStock  avgt    5  0.772 ± 0.017  ns/op

It turns out that identical code has identical performance. Let's hope that in part 4, where we actually start changing the assembly we use, we can observe some differences.
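
One rough way to read "identical" from the table is to check whether the two score ranges overlap. The sketch below assumes the Error column is the half-width of JMH's confidence interval around the Score; overlapping intervals are only a heuristic, not a formal significance test, and IntervalOverlap is a hypothetical helper.

```java
public class IntervalOverlap {
    /** True if [score1 - error1, score1 + error1] and [score2 - error2, score2 + error2] intersect. */
    static boolean overlaps(double aScore1, double aError1, double aScore2, double aError2) {
        final double myLow1 = aScore1 - aError1;
        final double myHigh1 = aScore1 + aError1;
        final double myLow2 = aScore2 - aError2;
        final double myHigh2 = aScore2 + aError2;
        return myLow1 <= myHigh2 && myLow2 <= myHigh1;
    }

    public static void main(String[] aArgs) {
        // getIntegerJvmci: 0.771 +/- 0.026 ns/op, getIntegerStock: 0.772 +/- 0.017 ns/op
        System.out.println(overlaps(0.771, 0.026, 0.772, 0.017));
    }
}
```

Here the ranges are roughly 0.745 to 0.797 and 0.755 to 0.789 ns/op, so one sits entirely inside the other.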

© 2024-2025 James Venning, All Rights Reserved

Any trademarks are properties of their respective owners. All content and any views or opinions expressed are my own and not associated with my employer. This site is not affiliated with Oracle®.