GraalVM optimization levels matter (and the JVM is amazing)

Author

Klaus Brunner

Published

2025-09-19

Benchmarking some changes to my little solarpos command-line tool, I noticed something odd: the Homebrew-installed version (built with GraalVM native image using “optimise for size”) was dramatically slower than running the same code on the JVM. This sent me down something of a rabbit hole investigating GraalVM’s optimisation levels, with interesting results.

Setup

The solarpos tool calculates solar positions and sunrise/sunset times for given coordinates and dates. In the test configuration running combined coordinate sweeps and time series, it’s essentially a tight loop doing mathematical calculations – date/time arithmetic, trigonometry, polynomial evaluations, and astronomical algorithms. There is some I/O involved (sending CSV records to stdout, which I pipe straight to /dev/null in the test runs), but this doesn’t dominate the workload according to some quick profiling. Perfect for performance testing.

I tested all five GraalVM native-image optimisation levels (-O0, -O1, -O2, -Os, -O3) using GraalVM 25, measuring performance with a workload of 7.6 million solar position calculations. I also included the JVM performance as a baseline comparison.
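The optimisation level is passed straight through to native-image. A direct invocation looks roughly like this (a sketch: the jar path and output names are placeholders, not the project's actual Maven-based build setup):

```shell
# Build one native binary per optimisation level for comparison.
# Jar path and output names are placeholders.
for level in O0 O1 O2 Os O3; do
  native-image "-${level}" -jar target/solarpos.jar -o "solarpos-${level}"
done
```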

Results

Code
library(tidyverse)
library(scales)

# Performance data from the benchmark (verified with 7.6M operations)
perf_data <- tribble(
  ~opt_level, ~performance_ops_s, ~build_time_s, ~binary_size_mb, ~size_label,
  "JVM",      270280,            NA,             NA,              "JAR",
  "-O0",      73422,             41,             15,              "15M",
  "-O1",      128792,            39,             13,              "13M",
  "-O2",      134715,            51,             17,              "17M",
  "-Os",      29045,             37,             12,              "12M",
  "-O3",      146502,            67,             23,              "23M"
) |>
  mutate(
    opt_level = factor(opt_level, levels = c("JVM", "-O0", "-O1", "-O2", "-Os", "-O3")),
    performance_relative = performance_ops_s / performance_ops_s[opt_level == "-Os"]
  )

# Performance comparison chart
perf_data |>
  ggplot(aes(x = opt_level, y = performance_ops_s, fill = opt_level)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = paste0(round(performance_ops_s/1000, 0), "k ops/s")),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k")) +
  scale_fill_viridis_d(name = "Optimisation", direction = -1) +
  labs(
    title = "GraalVM native image optimisation level vs. performance",
    subtitle = "Solar position calculations (7.6M operations)",
    x = "Optimisation level",
    y = "Operations per second"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The most striking finding: -Os is not merely slower than -O3, it's by far the slowest of all optimisation levels. This is counterintuitive, since -Os is supposed to optimise for size while maintaining reasonable performance. Also noteworthy: the JVM (270k ops/s) outperforms even the best native-image option (-O3 at 147k ops/s) by 84%.

Build time vs performance trade-offs

Code
# Create a combined plot showing the three metrics
p1 <- perf_data |>
  ggplot(aes(x = opt_level, y = performance_ops_s, fill = opt_level)) +
  geom_col(alpha = 0.8) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"),
                     expand = expansion(mult = c(0, 0.1))) +
  scale_fill_viridis_d(direction = -1) +
  labs(title = "Performance", y = "Ops/second") +
  theme_minimal() +
  theme(legend.position = "none", axis.title.x = element_blank())

p2 <- perf_data |>
  mutate(build_time_plot = ifelse(is.na(build_time_s), 0, build_time_s)) |>
  ggplot(aes(x = opt_level, y = build_time_plot, fill = opt_level)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = ifelse(is.na(build_time_s), "N/A", paste0(build_time_s, "s"))),
            vjust = ifelse(is.na(perf_data$build_time_s), 0.5, -0.5), size = 3) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  scale_fill_viridis_d(direction = -1, drop = FALSE) +
  labs(title = "Build time (native image only)", y = "Seconds") +
  theme_minimal() +
  theme(legend.position = "none", axis.title.x = element_blank())

p3 <- perf_data |>
  mutate(binary_size_plot = ifelse(is.na(binary_size_mb), 0, binary_size_mb)) |>
  ggplot(aes(x = opt_level, y = binary_size_plot, fill = opt_level)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = ifelse(is.na(size_label), "N/A", size_label)),
            vjust = ifelse(is.na(perf_data$binary_size_mb), 0.5, -0.5), size = 3) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  scale_fill_viridis_d(direction = -1, drop = FALSE) +
  labs(title = "Binary size (native image only)", y = "Megabytes", x = "Optimisation level") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
library(patchwork)
p1 / p2 / p3

The numbers

Here’s the complete breakdown:

Code
perf_data |>
  select(
    `Level` = opt_level,
    `Performance` = performance_ops_s,
    `vs -Os` = performance_relative,
    `Build time` = build_time_s,
    `Binary size` = size_label
  ) |>
  mutate(
    Performance = format(Performance, big.mark = ","),
    `vs -Os` = paste0(round(`vs -Os`, 1), "×"),
    `Build time` = case_when(
      is.na(`Build time`) ~ "—",
      TRUE ~ paste0(`Build time`, "s")
    ),
    `Binary size` = case_when(
      is.na(`Binary size`) ~ "—",
      TRUE ~ `Binary size`
    )
  ) |>
  knitr::kable(align = c("l", "r", "r", "r", "r"))
Level   Performance   vs -Os   Build time   Binary size
JVM         270,280     9.3×            —           JAR
-O0          73,422     2.5×          41s           15M
-O1         128,792     4.4×          39s           13M
-O2         134,715     4.6×          51s           17M
-Os          29,045       1×          37s           12M
-O3         146,502       5×          67s           23M

Practical implications

For mathematical or compute-intensive workloads, at least on Oracle GraalVM 25:

  1. Don’t assume -Os is a reasonable default – it can be dramatically slower than other options.
  2. -O3 provides substantial gains at the cost of longer build times and larger binaries.
  3. -O2 offers a middle ground with good performance and reasonable build times (it’s also the default).
  4. Native Image may not be the best choice for raw throughput. For a CLI tool, the main reasons to use native-image are fast startup and a completely self-contained single executable of moderate size on disk and in memory. But when it comes to sheer throughput in long-running calculation tasks, the JVM seems hard to beat.

More things to try

After a very quick test with profile-guided optimisation (PGO) failed to show any improvements, I didn’t investigate this any further. Perhaps it’s worth repeating with a bit more care and better training data.
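In case anyone wants to repeat that experiment with better training data, the standard Oracle GraalVM PGO workflow looks roughly like this (a sketch, not the exact commands I used; jar path and output names are placeholders):

```shell
# 1. Build an instrumented image.
native-image --pgo-instrument -jar target/solarpos.jar -o solarpos-instr

# 2. Run a representative training workload; the instrumented binary
#    writes default.iprof into the working directory on exit.
./solarpos-instr --perf --format=CSV --no-headers 50:55:0.1 10:15:0.1 2024 position --step=3h > /dev/null

# 3. Rebuild, feeding the collected profile back in.
native-image --pgo=default.iprof -jar target/solarpos.jar -o solarpos-pgo
```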

I also didn’t test any other optimisation flags or tweaks. Java 25 ships a couple of interesting performance-related options to try out, such as compact object headers and a choice of several GC algorithms. With native-image, however, GC selection is very limited: the default (and the only reasonable choice that works across all platforms) is the Serial collector.
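On the JVM side, compact object headers became a product option in JDK 25 (JEP 519), so such an experiment wouldn't even need experimental-flag unlocking. A sketch of what it would look like, untested here (jar name is a placeholder):

```shell
# Enable compact object headers (a product option as of JDK 25) and
# rerun the same benchmark workload on the JVM.
java -XX:+UseCompactObjectHeaders -jar solarpos.jar \
  --perf --format=CSV --no-headers 50:55:0.1 10:15:0.1 2024 position --step=3h > /dev/null
```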

For this test, I simply used defaults for everything except, of course, the -O flag of native-image.

Methodology notes

All tests were conducted on an Apple Silicon (M2) MacBook Air using Oracle GraalVM 25+37.1. The benchmark workload consisted of 7,615,728 solar position calculations across coordinate ranges and time series – a realistic computational load for this type of application. Results were verified with multiple runs to ensure stability.

The specific command used for benchmarking was:

  solarpos --perf --format=CSV --no-headers 50:55:0.1 10:15:0.1 2024 position --step=3h > /dev/null

This calculates solar positions for a 5°×5° grid (latitude 50-55°N, longitude 10-15°E) with 0.1° resolution across the entire year 2024 with 3-hour time steps.
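As a sanity check, that operation count follows directly from the sweep dimensions, assuming both grid endpoints are included: 51 latitude points × 51 longitude points × 366 days (2024 is a leap year) × 8 three-hour steps per day.

```shell
# 51 × 51 grid points × 366 days × 8 steps/day
echo $((51 * 51 * 366 * 8))   # → 7615728
```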

Build times include both Maven compilation and GraalVM native image generation. Performance measurements used the application’s built-in --perf flag, which tracks operations per second during execution.

The results should be broadly applicable to similar workloads, but as always, benchmark your own code – YMMV.