I was curious what calls to Math.fma() would look like in GraalVM-generated native code on my Mac. Turns out it’s easy to check.
First, write a simple Java test class. As we’ll see later, it’s a good idea to use distinctive names (unlike, say, “Test”). I’ll go for “BlueStilton”. Most symbol tables are uncontaminated by cheese.
public class BlueStilton {
    public static void main(String[] args) {
        // get 3 random double numbers as separate vars
        double a = Math.random();
        double b = Math.random();
        double c = Math.random();
        // do Math.fma() on them
        double result = Math.fma(a, b, c);
        // print the result, mostly to ensure it doesn't get optimized away
        System.out.println("Math.fma(" + a + ", " + b + ", " + c + ") = " + result);
    }
}
Compile this to a class file as usual.
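On the command line, that’s simply:
javac BlueStilton.java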
Now, from the .class file, generate a native image using GraalVM, making sure to get as much debug info as possible – that’s what the -g, -H:GenerateDebugInfo, -H:+SourceLevelDebug, and -H:-DeleteLocalSymbols flags below are about. I’m using GraalVM 24.0.1 here.
native-image \
-g -O0 \
-H:+SourceLevelDebug \
-H:-DeleteLocalSymbols \
--no-fallback \
-H:GenerateDebugInfo=1 \
-H:+PreserveFramePointer \
BlueStilton
If that works, let’s use lldb to check the generated code. (Note that native-image lowercases the image name, so the binary is called bluestilton. The LLVM toolchain, and with it lldb, is the default on macOS; on Linux and Windows, where the GNU toolchain and MSVC dominate, you’d typically reach for gdb or the Visual Studio debugger instead.)
lldb bluestilton
It can be somewhat challenging to actually find your code. Even with optimization turned off using -O0, native-image will still inline many methods, making them disappear from the compiled code. And regardless of inlining, Java class and method names get mangled. Let’s see if we can locate our cheese anyway.
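Since we told native-image to keep local symbols, good old nm should be able to find it, too – a quick sanity check:
nm bluestilton | grep -i stilton
In lldb proper: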
(lldb) image lookup -rn BlueStilton
1 match found in /.../bluestilton:
Address: bluestilton[0x0000000100004000] (bluestilton.__TEXT.__text + 0)
Summary: bluestilton`BlueStilton_main_8oe5bpk7YO0nSUOK71Ax8D
(lldb)
The main function is here. Let’s go ahead and disassemble it.
(lldb) disassemble -n BlueStilton_main_8oe5bpk7YO0nSUOK71Ax8D
bluestilton`BlueStilton_main_8oe5bpk7YO0nSUOK71Ax8D:
...(some lines omitted) ...
bluestilton[0x100004020] <+32>: nop
bluestilton[0x100004024] <+36>: str x0, [sp, #0x28]
bluestilton[0x100004028] <+40>: bl 0x100152550 ; Math_random_cQK0nQncAs7ADbsad25IVC
bluestilton[0x10000402c] <+44>: nop
bluestilton[0x100004030] <+48>: str d0, [sp, #0x48]
bluestilton[0x100004034] <+52>: nop
bluestilton[0x100004038] <+56>: bl 0x100152550 ; Math_random_cQK0nQncAs7ADbsad25IVC
bluestilton[0x10000403c] <+60>: nop
bluestilton[0x100004040] <+64>: str d0, [sp, #0x40]
bluestilton[0x100004044] <+68>: nop
bluestilton[0x100004048] <+72>: bl 0x100152550 ; Math_random_cQK0nQncAs7ADbsad25IVC
bluestilton[0x10000404c] <+76>: nop
bluestilton[0x100004050] <+80>: nop
bluestilton[0x100004054] <+84>: ldp d4, d5, [sp, #0x40]
bluestilton[0x100004058] <+88>: fmadd d6, d5, d4, d0
bluestilton[0x10000405c] <+92>: nop
...(some lines omitted) ...
(lldb)
As we can see, what looks like a static method call in the Java code gets translated directly to an inline fmadd – the native aarch64 fused multiply-add instruction. Reading the operands: d5 and d4 hold a and b (loaded pairwise from the stack), d0 holds c, and fmadd d6, d5, d4, d0 computes d6 = d5 * d4 + d0 in a single instruction, with a single rounding step. No method calls.
And that’s good news, because without an intrinsic FMA instruction on the target hardware, the OpenJDK implementation of Math.fma() would fall back to a software emulation of FMA using BigDecimal. A reasonably graceful degradation to ensure correct semantics, but of course also awfully slow.
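To see why that would hurt, here’s a simplified sketch of the fallback idea – class and method names here are mine, and the real implementation in java.lang.Math special-cases NaNs, infinities, and signed zeros before reaching this path:
import java.math.BigDecimal;

public class FmaFallbackSketch {
    // Sketch of the BigDecimal fallback idea. new BigDecimal(double) is
    // exact, so the product and sum below involve no intermediate rounding;
    // doubleValue() then rounds exactly once, which is the FMA contract.
    static double fmaSoftware(double a, double b, double c) {
        return new BigDecimal(a)
                .multiply(new BigDecimal(b))
                .add(new BigDecimal(c))
                .doubleValue();
    }
    public static void main(String[] args) {
        double a = Math.random(), b = Math.random(), c = Math.random();
        // for ordinary finite inputs, this agrees with the intrinsic version
        System.out.println(fmaSoftware(a, b, c) == Math.fma(a, b, c));
    }
}
Allocating several BigDecimal objects per FMA is obviously no match for a single hardware instruction.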
What about auto-vectorization?
FMAs are often found in loops over arrays, making them a potential target for vectorization by the compiler. I’d like to see that in action. Here’s another quick test program:
import java.util.Arrays;

public class RedLeicester {
    static final int LEN = 1024;
    public static void main(String[] args) {
        var a = new double[LEN];
        var b = new double[LEN];
        var c = Math.random();
        for (int i = 0; i < LEN; i++) {
            a[i] = Math.random();
            b[i] = Math.random();
        }
        System.out.println("initialized data");
        for (int i = 0; i < LEN; i++) {
            a[i] = Math.fma(c, b[i], a[i]);
            // a[i] += c * b[i];
        }
        System.out.println(Arrays.toString(a));
    }
}
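Compile and build it like before, this time with optimizations enabled – something along these lines (same debug flags as for BlueStilton):
javac RedLeicester.java
native-image -g -O2 -march=native -H:+SourceLevelDebug -H:-DeleteLocalSymbols --no-fallback RedLeicester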
At least on my M2 Mac and using Oracle GraalVM 24.0.1 (not CE!), no vectorization seems to happen – no NEON instructions anywhere. Even using -O2 -march=native, it’s just a plain loop over the array’s elements, one by one:
redleicester[0x1000042dc] <+732>: mov w0, wzr
redleicester[0x1000042e0] <+736>: ldr d0, [sp, #0x48]
redleicester[0x1000042e4] <+740>: ldp x3, x2, [sp, #0x28]
redleicester[0x1000042e8] <+744>: b 0x100004320 ; <+800>
redleicester[0x1000042ec] <+748>: nop
redleicester[0x1000042f0] <+752>: nop
redleicester[0x1000042f4] <+756>: nop
redleicester[0x1000042f8] <+760>: nop
redleicester[0x1000042fc] <+764>: nop
redleicester[0x100004300] <+768>: nop
redleicester[0x100004304] <+772>: orr x1, xzr, #0x8
redleicester[0x100004308] <+776>: add x1, x1, w0, uxtw #3
redleicester[0x10000430c] <+780>: ldr d1, [x3, x1]
redleicester[0x100004310] <+784>: ldr d2, [x2, x1]
redleicester[0x100004314] <+788>: fmadd d1, d0, d2, d1
redleicester[0x100004318] <+792>: str d1, [x3, x1]
redleicester[0x10000431c] <+796>: add w0, w0, #0x1
redleicester[0x100004320] <+800>: cmp w0, #0x400
redleicester[0x100004324] <+804>: b.lo 0x100004300 ; <+768>
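For reading along: the orr/add pair computes the byte offset 8 + 8*i into each array (8 being the offset of the first element past the array header in this image), the two ldr instructions fetch a[i] and b[i], fmadd combines them with c in d0, the result is stored back into a, and the counter is compared against 0x400 – our LEN of 1024.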
Switching from Math.fma() to a plain multiply-and-add, using floats instead of doubles, or changing the array size doesn’t help either. With -O3, you seem to get more loop unrolling, but still no vector instructions. Perhaps it’s a limitation of the compiler, perhaps it’s intentional? A Medium post by Andrew Craik from 2021 states that “only native images built with profile-guided optimization (PGO) will see the full benefit of GraalVM’s vectorization capabilities” – alas, trying it with PGO didn’t change anything for me.
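For reference, “trying it with PGO” amounts to roughly this on Oracle GraalVM: build an instrumented image, run it to collect a profile, and rebuild with the profile fed back in:
native-image --pgo-instrument RedLeicester
./redleicester
native-image --pgo=default.iprof -O2 -march=native RedLeicester
(The instrumented run writes default.iprof to the current directory.)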
A quick test on x64 Linux with the identical Java code shows basically the same result, so this does not appear to be exclusive to aarch64 or macOS.