Julia HPC BLAS/MKL Diagnostic & Setup Guide

When Julia linear algebra or sparse solvers are much slower on the HPC than on a laptop, the cause is often that Julia is running a generic (unoptimized) BLAS kernel instead of a CPU-tuned OpenBLAS or Intel MKL. This guide shows how to diagnose and fix the problem.

1. Check Julia build and CPU

In Julia:

versioninfo(verbose = true)

Look at:

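CPU: → the processor model string (for example an Intel Xeon name)
LLVM: → ends with something like (ORCJIT, broadwell); the second entry should be your CPU's microarchitecture and should not say generic
BLAS: → recent Julia versions do not print this line here; use LinearAlgebra.versioninfo() or the checks in step 2 to see which library Julia is using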

!!! tip If you see generic, Julia is not using CPU-specific optimizations (AVX2/FMA).
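
The same check can be scripted, for instance at the top of a batch job. The sketch below relies only on Sys.CPU_NAME, which reports the CPU target that Julia detected; the helper name cpu_target_is_generic is made up for this example.

```julia
# Hypothetical helper: true when Julia fell back to a generic CPU target.
cpu_target_is_generic(name::AbstractString = Sys.CPU_NAME) =
    occursin("generic", lowercase(name))

if cpu_target_is_generic()
    @warn "Julia reports a generic CPU target; AVX2/FMA code paths are unlikely to be used" cpu = Sys.CPU_NAME
else
    println("CPU target: ", Sys.CPU_NAME)
end
```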

2. Inspect BLAS setup

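using LinearAlgebra
BLAS.get_config()         # Julia 1.7+: libraries loaded through libblastrampoline
BLAS.get_num_threads()
BLAS.vendor()             # Julia 1.6 and earlier; deprecated on 1.7+, where it returns :lbt

libopenblas64_ in get_config() (or :openblas64 from vendor()) → Julia's bundled OpenBLAS; if it runs the Generic kernel, it is unoptimized
libmkl_rt in get_config() (or :mkl from vendor()) → Intel MKL is active

To see which kernel OpenBLAS selected, start Julia with the environment variable OPENBLAS_VERBOSE=2 set; a dynamic-architecture OpenBLAS build then prints a line such as Core: Haswell.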
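
For logging in job scripts, the library list can be reduced to one line per library. This is a minimal sketch assuming Julia 1.7 or newer, where BLAS.get_config() returns an object with a loaded_libs field; the helper name loaded_blas_libs is hypothetical, and the fallback branch for older versions only reports BLAS.vendor().

```julia
using LinearAlgebra

# Hypothetical helper: file names and integer interfaces of the BLAS/LAPACK
# libraries currently loaded. Assumes Julia 1.7+ (libblastrampoline) and
# falls back to BLAS.vendor() on older versions.
function loaded_blas_libs()
    if isdefined(BLAS, :get_config)
        return [(basename(lib.libname), lib.interface)
                for lib in BLAS.get_config().loaded_libs]
    else
        return [(string(BLAS.vendor()), :unknown)]
    end
end

# One line per library: file name, then :lp64 or :ilp64.
for (name, interface) in loaded_blas_libs()
    println(rpad(name, 30), interface)
end
```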

3. Run diagnostic benchmark

Save the following as diag.jl (it needs the BenchmarkTools package, installed once with Pkg.add("BenchmarkTools")):

using LinearAlgebra
using BenchmarkTools
using SparseArrays

println("=== Julia Diagnostic ===")
println("Julia version: ", VERSION)
println("Architecture:  ", Sys.ARCH)
println("CPU name:      ", Sys.CPU_NAME)
println("Threads:       ", Threads.nthreads())

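println("\n--- BLAS Info ---")
# BLAS.vendor() is deprecated on Julia 1.7+ (it returns :lbt there),
# so the loaded libraries are read from BLAS.get_config() instead.
println("Config:        ", BLAS.get_config())
println("BLAS threads:  ", BLAS.get_num_threads())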

println("\n--- Simple BLAS Benchmark (1000x1000 GEMM) ---")
A = rand(Float64, 1000, 1000)
B = rand(Float64, 1000, 1000)
@btime $A * $B;

println("\n--- Sparse Solver Test ---")
n = 5000
M = sprandn(n, n, 0.0005) + I
b = randn(n)
@btime $M \ $b;

println("\n=== Done ===")

Run it:

julia diag.jl

Compare results before and after loading optimized BLAS/MKL.
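
Raw timings are easier to compare across machines when converted to throughput. The sketch below is an illustration rather than part of diag.jl: it uses only the standard library (@elapsed instead of BenchmarkTools) and the usual estimate of 2n^3 floating-point operations for an n×n matrix product. The function name gemm_gflops is an assumption of this example.

```julia
using LinearAlgebra

# Estimate dense matrix-multiply throughput in GFLOP/s.
# An n×n GEMM performs about 2n^3 floating-point operations.
function gemm_gflops(n::Integer = 1000; reps::Integer = 3)
    A = rand(Float64, n, n)
    B = rand(Float64, n, n)
    C = similar(A)
    mul!(C, A, B)                      # warm-up run, keeps compilation out of the timing
    best = minimum(@elapsed(mul!(C, A, B)) for _ in 1:reps)
    return 2 * n^3 / best / 1e9
end

println("GEMM throughput: ", round(gemm_gflops(), digits = 1), " GFLOP/s")
```

For scale, a 1000×1000 product that takes 20 ms corresponds to about 100 GFLOP/s, while one that takes 2 s corresponds to about 1 GFLOP/s.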

4. Use HPC BLAS modules

Check available modules:

module avail openblas
module avail intel-mkl

Load one (module names and versions differ between sites):

module load openblas/0.3.26
# or
module load intel-mkl/2024

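Restart Julia from the same shell so that it inherits the module environment. The official Julia binaries bundle their own OpenBLAS, so a loaded module only takes effect if your Julia was built against the system BLAS or the library is loaded explicitly. Check again with: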

using LinearAlgebra
BLAS.get_config()

5. Use MKL.jl (easy method)

From inside Julia:

using Pkg
Pkg.add("MKL")   # only once per environment
using MKL        # switches BLAS/LAPACK to MKL for the current session; repeat in every session

Check:

using LinearAlgebra
BLAS.get_config()

The output should list libmkl_rt. On Julia 1.6 and earlier, BLAS.vendor() returns :mkl instead; on 1.7+ it is deprecated and returns :lbt.
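
Because MKL has to be loaded in each session, it can be convenient to load it from ~/.julia/config/startup.jl. The following is a sketch under assumptions: it uses Base.find_package (an unexported Base function) to skip the load when MKL is not installed in the active environment, and it assumes Julia 1.7+ for BLAS.get_config(). The variable names are illustrative.

```julia
using LinearAlgebra

# Load MKL only if it is installed in the active environment.
# Base.find_package returns `nothing` when the package cannot be found.
mkl_installed = Base.find_package("MKL") !== nothing
if mkl_installed
    @eval using MKL
end

# Confirm the switch by looking for an MKL library among the loaded ones.
mkl_active = any(lib -> occursin("mkl", lowercase(basename(lib.libname))),
                 BLAS.get_config().loaded_libs)
println("MKL installed: ", mkl_installed, ", active: ", mkl_active)
```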

6. Control threading

Sometimes BLAS defaults to 1 thread on clusters. In Julia:

using LinearAlgebra
BLAS.set_num_threads(32)    # match number of cores requested
BLAS.get_num_threads()
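
Rather than hard-coding 32, the thread count can follow the job's allocation. The sketch below assumes a Slurm cluster, where the variable SLURM_CPUS_PER_TASK is set when --cpus-per-task is requested; other schedulers use different variables. The helper name requested_cores is hypothetical, and it falls back to Sys.CPU_THREADS when the variable is absent or invalid.

```julia
using LinearAlgebra

# Hypothetical helper: number of cores granted to this task.
# `env` can be ENV or any dictionary of strings (useful for testing).
function requested_cores(env = ENV)
    n = tryparse(Int, get(env, "SLURM_CPUS_PER_TASK", ""))
    return (n === nothing || n < 1) ? Sys.CPU_THREADS : n
end

BLAS.set_num_threads(requested_cores())
println("BLAS threads: ", BLAS.get_num_threads())
```

Keeping the BLAS thread count at or below the allocated cores avoids oversubscribing a shared node.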

7. Troubleshooting

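If Julia still uses generic OpenBLAS:
Ensure LD_LIBRARY_PATH includes the directory that contains the optimized BLAS.

Try forcing with LD_PRELOAD. This only helps when Julia was built against the system BLAS; the official binaries call a 64-bit-integer OpenBLAS with suffixed symbol names (libopenblas64_), which a standard libopenblas.so does not override: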

export LD_PRELOAD=/path/to/openblas/lib/libopenblas.so
julia
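
To confirm what LD_LIBRARY_PATH actually exposes after a module load, the directories can be scanned from Julia. This is a minimal sketch: the helper name find_blas_candidates and the two file-name prefixes it looks for (libopenblas, libmkl_rt) are assumptions of this example, and finding a file does not prove that Julia loads it.

```julia
# Hypothetical helper: list OpenBLAS/MKL shared libraries found in the
# directories of a colon-separated search path (LD_LIBRARY_PATH by default).
function find_blas_candidates(ld_path::AbstractString = get(ENV, "LD_LIBRARY_PATH", ""))
    hits = String[]
    for dir in split(ld_path, ':'; keepempty = false)
        isdir(dir) || continue
        # Skip directories that cannot be read instead of aborting the scan.
        files = try
            readdir(dir)
        catch
            String[]
        end
        for file in files
            if startswith(file, "libopenblas") || startswith(file, "libmkl_rt")
                push!(hits, joinpath(dir, file))
            end
        end
    end
    return hits
end

foreach(println, find_blas_candidates())
```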

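As a last resort:
Rebuild Julia from source on a node with the same CPU architecture as the compute nodes, limiting the number of parallel jobs to the available cores:
```
make -j"$(nproc)" MARCH=native
```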

8. Key checks

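BLAS.get_config() → should list libmkl_rt or an optimized OpenBLAS (on Julia 1.6 and earlier, BLAS.vendor() should return :mkl or :openblas64)
OpenBLAS kernel → should be Haswell/Broadwell (or your CPU's microarchitecture), not Generic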
A * B (1000×1000) should run in tens of ms, not seconds
Sparse solve should drop from multi-seconds to sub-second