I was curious what calls to Math.fma() would look like in GraalVM-generated native code on my Mac. Turns out it’s easy to check.
Writing the test case
First, write a simple Java test class. As we’ll see later, it’s a good idea to use distinctive names (unlike, say, “Test”). I’ll go for “BlueStilton”. Most symbol tables are uncontaminated by cheese.
public class BlueStilton {
public static void main(String[] args) {
double a = Math.random();
double b = Math.random();
double c = Math.random();
double result = Math.fma(a, b, c);
// print the result, mostly to ensure it doesn't get optimized away
System.out.println("Math.fma(" + a + ", " + b + ", " + c + ") = " + result);
}
}
Compile this to a class file as usual.
Creating the native image
Now, from the .class file, generate a native image using GraalVM, with as much debug info as possible. I’m using GraalVM 24.0.1 here.
native-image \
-g -O0 \
-H:+SourceLevelDebug \
-H:-DeleteLocalSymbols \
--no-fallback \
-H:GenerateDebugInfo=1 \
-H:+PreserveFramePointer \
BlueStilton
Disassembling with LLDB
Let’s use lldb to check the generated code. (lldb is part of the LLVM toolchain, the default on macOS; on Linux and Windows, where GCC and MSVC are the usual toolchains, you’d typically reach for gdb or the Visual Studio debugger instead.)
lldb bluestilton
It can be somewhat challenging to actually find your code. Your main method isn’t the program’s entry point: there’s a lot of stuff that needs to happen before it’s called. After all, there is still a VM that has to get initialised. And even with optimisations turned off using -O0, native-image will inline many methods, making them disappear from the compiled code. On top of that, Java class and method names are mangled, similarly to what C++ does. Let’s see if we can locate our cheese anyway.
(lldb) image lookup -rn BlueStilton
1 match found in /.../bluestilton:
Address: bluestilton[0x0000000100004000] (bluestilton.__TEXT.__text + 0)
Summary: bluestilton`BlueStilton_main_8oe5bpk7YO0nSUOK71Ax8D
(lldb)
The main function is here. Let’s go ahead and disassemble it.
(lldb) disassemble -n BlueStilton_main_8oe5bpk7YO0nSUOK71Ax8D
bluestilton`BlueStilton_main_8oe5bpk7YO0nSUOK71Ax8D:
...(some lines omitted) ...
bluestilton[0x100004020] <+32>: nop
bluestilton[0x100004024] <+36>: str x0, [sp, #0x28]
bluestilton[0x100004028] <+40>: bl 0x100152550 ; Math_random_cQK0nQncAs7ADbsad25IVC
bluestilton[0x10000402c] <+44>: nop
bluestilton[0x100004030] <+48>: str d0, [sp, #0x48]
bluestilton[0x100004034] <+52>: nop
bluestilton[0x100004038] <+56>: bl 0x100152550 ; Math_random_cQK0nQncAs7ADbsad25IVC
bluestilton[0x10000403c] <+60>: nop
bluestilton[0x100004040] <+64>: str d0, [sp, #0x40]
bluestilton[0x100004044] <+68>: nop
bluestilton[0x100004048] <+72>: bl 0x100152550 ; Math_random_cQK0nQncAs7ADbsad25IVC
bluestilton[0x10000404c] <+76>: nop
bluestilton[0x100004050] <+80>: nop
bluestilton[0x100004054] <+84>: ldp d4, d5, [sp, #0x40]
bluestilton[0x100004058] <+88>: fmadd d6, d5, d4, d0
bluestilton[0x10000405c] <+92>: nop
...(some lines omitted) ...
(lldb)
As we can see, what looks like a static method call in the Java code is translated directly into an inline fmadd, the native AArch64 fused multiply-add instruction. No method call in sight.
That’s good news, because without an intrinsic FMA instruction on the target hardware, the OpenJDK implementation of Math.fma() would fall back to a software emulation of FMA using BigDecimal. A reasonably graceful degradation to ensure correct semantics – but of course also awfully slow.
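What makes FMA special is that it rounds only once, after the exact product and sum, which is also what any BigDecimal-based fallback has to preserve. Here’s a small demo of the idea (the class and method names are mine, and the emulation is a sketch of the principle, not the actual OpenJDK fallback code, which also handles NaN and infinities specially):

```java
import java.math.BigDecimal;

public class FmaDemo {
    // Sketch of an FMA emulation via BigDecimal (hypothetical, not the
    // real OpenJDK code). new BigDecimal(double) is exact, so the
    // multiply and add below are exact too; doubleValue() then performs
    // the single final rounding that defines fused multiply-add.
    static double fmaViaBigDecimal(double a, double b, double c) {
        return new BigDecimal(a).multiply(new BigDecimal(b))
                                .add(new BigDecimal(c))
                                .doubleValue();
    }

    public static void main(String[] args) {
        double a = 0.1, b = 10.0, c = -1.0;
        // Two-step version: a * b rounds to exactly 1.0, discarding the
        // residual error of representing 0.1 in binary, so the sum is 0.0.
        System.out.println(a * b + c);
        // FMA keeps the exact product and rounds only once, so the tiny
        // residual (2^-54) survives.
        System.out.println(Math.fma(a, b, c));
        System.out.println(fmaViaBigDecimal(a, b, c));
    }
}
```

With these inputs, the two-step expression prints 0.0, while both the hardware FMA and the BigDecimal sketch keep the residual of representing 0.1 and print 2^-54.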
What about auto-vectorisation?
FMAs are often found in loops over arrays, making them a potential target for vectorisation by the compiler. I’d like to see that in action. Here’s another quick test program:
import java.util.Arrays;
public class RedLeicester {
static final int LEN = 1024;
public static void main(String[] args) {
var a = new double[LEN];
var b = new double[LEN];
var c = Math.random();
for (int i = 0; i < LEN; i++) {
a[i] = Math.random();
b[i] = Math.random();
}
System.out.println("initialized data");
for (int i = 0; i < LEN; i++) {
a[i] = Math.fma(c, b[i], a[i]);
// a[i] += c * b[i];
}
System.out.println(Arrays.toString(a));
}
}
At least on my M2 Mac and using Oracle GraalVM 24.0.1 (not CE!), no vectorisation seems to happen. No NEON instructions. Even using -O2 -march=native, it’s just a plain loop over the array’s elements, one by one.
redleicester[0x1000042dc] <+732>: mov w0, wzr
redleicester[0x1000042e0] <+736>: ldr d0, [sp, #0x48]
redleicester[0x1000042e4] <+740>: ldp x3, x2, [sp, #0x28]
redleicester[0x1000042e8] <+744>: b 0x100004320 ; <+800>
redleicester[0x1000042ec] <+748>: nop
redleicester[0x1000042f0] <+752>: nop
redleicester[0x1000042f4] <+756>: nop
redleicester[0x1000042f8] <+760>: nop
redleicester[0x1000042fc] <+764>: nop
redleicester[0x100004300] <+768>: nop
redleicester[0x100004304] <+772>: orr x1, xzr, #0x8
redleicester[0x100004308] <+776>: add x1, x1, w0, uxtw #3
redleicester[0x10000430c] <+780>: ldr d1, [x3, x1]
redleicester[0x100004310] <+784>: ldr d2, [x2, x1]
redleicester[0x100004314] <+788>: fmadd d1, d0, d2, d1
redleicester[0x100004318] <+792>: str d1, [x3, x1]
redleicester[0x10000431c] <+796>: add w0, w0, #0x1
redleicester[0x100004320] <+800>: cmp w0, #0x400
redleicester[0x100004324] <+804>: b.lo 0x100004300 ; <+768>
Switching from Math.fma() to a plain multiply-and-add, using floats instead of doubles, or changing the array size doesn’t help either. With -O3, it seems that you get more loop unrolling, but still no vector-based instructions.
Perhaps it’s a limitation of the compiler, perhaps it’s intentional? A Medium post by Andrew Craik from 2021 states that “only native images built with profile-guided optimization (PGO) will see the full benefit of GraalVM’s vectorization capabilities” – alas, trying it with PGO didn’t change anything for me.
A quick test on x64 Linux with identical Java code shows essentially the same result, so this does not appear to be specific to AArch64 or macOS.
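If you want SIMD for this kind of loop without relying on auto-vectorisation, one option is the (still incubating) Vector API, which lets you spell out the FMA lanes explicitly. A sketch, assuming JDK 16+ and compiling and running with --add-modules jdk.incubator.vector (the class and method names are mine):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorFma {
    // Widest vector shape the platform supports for doubles.
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // a[i] = Math.fma(c, b[i], a[i]) over the whole array, written with
    // explicit vector lanes instead of hoping for auto-vectorisation.
    static void fmaLoop(double[] a, double[] b, double c) {
        DoubleVector vc = DoubleVector.broadcast(SPECIES, c);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            vc.fma(vb, va).intoArray(a, i);  // lanewise c * b[i] + a[i]
        }
        for (; i < a.length; i++) {          // scalar tail
            a[i] = Math.fma(c, b[i], a[i]);
        }
    }
}
```

Whether native-image turns this into NEON instructions is a separate question I haven’t verified here; on HotSpot the Vector API maps to SIMD instructions where the hardware supports them.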