Julia HPC BLAS/MKL Diagnostic & Setup Guide

When Julia linear algebra or sparse solvers are much slower on the HPC than on a laptop, the cause is often that Julia is using a generic BLAS instead of optimized OpenBLAS or Intel MKL. This guide shows how to diagnose and fix the problem.


1. Check Julia build and CPU

In Julia:

versioninfo(verbose = true)

Look at:

  • CPU: → should say broadwell (or your CPU name)
  • LLVM ORCJIT: → should not say generic
  • BLAS: → check which library Julia is linked against

!!! tip
    If you see generic, Julia is not using CPU-specific optimizations (AVX2/FMA).
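The same information is available programmatically, which avoids parsing versioninfo output by eye. A minimal sketch using standard Base.Sys fields:

```julia
# Print what Julia detected about the host CPU.
println("CPU name: ", Sys.CPU_NAME)    # e.g. "broadwell"; "generic" means no CPU-specific codegen
println("Arch:     ", Sys.ARCH)
println("Cores:    ", Sys.CPU_THREADS)
```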


2. Inspect BLAS setup

using LinearAlgebra
BLAS.vendor()            # deprecated on Julia ≥ 1.7, where it returns :lbt
BLAS.get_config()
BLAS.get_num_threads()

  • :openblas64 with Generic in the output → unoptimized reference kernels
  • :mkl, or libmkl_rt listed by get_config() → Intel MKL is active
  • :lbt → Julia 1.7+ routes BLAS calls through libblastrampoline; inspect BLAS.get_config() to see the real backend
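On Julia 1.7 and later, the config object returned by BLAS.get_config() can be inspected directly. A sketch using its documented loaded_libs field:

```julia
using LinearAlgebra

cfg = BLAS.get_config()
# Each entry is an LBTLibraryInfo; libname is the full path to the loaded library.
for lib in cfg.loaded_libs
    println(lib.libname)
end
# A path containing "openblas64_" is the stock build; "mkl_rt" means MKL is active.
```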

3. Run diagnostic benchmark

Save the following as diag.jl and run with julia diag.jl:

using LinearAlgebra
using BenchmarkTools   # Pkg.add("BenchmarkTools") once, if not installed
using SparseArrays

println("=== Julia Diagnostic ===")
println("Julia version: ", VERSION)
println("Architecture:  ", Sys.ARCH)
println("CPU name:      ", Sys.CPU_NAME)
println("Threads:       ", Threads.nthreads())

println("\n--- BLAS Info ---")
println("Vendor:        ", BLAS.vendor())   # deprecated on Julia ≥ 1.7 (returns :lbt)
println("Config:        ", BLAS.get_config())
println("Threads (BLAS):", BLAS.get_num_threads())

println("\n--- Simple BLAS Benchmark (1000x1000 GEMM) ---")
A = rand(Float64, 1000, 1000)
B = rand(Float64, 1000, 1000)
@btime $A * $B;

println("\n--- Sparse Solver Test ---")
n = 5000
M = sprandn(n, n, 0.0005) + I
b = randn(n)
@btime $M \ $b;

println("\n=== Done ===")

Run it:

julia diag.jl

Compare results before and after loading optimized BLAS/MKL.


4. Use HPC BLAS modules

Check available modules:

module avail openblas
module avail intel-mkl

Load one:

module load openblas/0.3.26
# or
module load intel-mkl/2024

Restart Julia, then check again with:

using LinearAlgebra
BLAS.get_config()
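Putting the pieces together, a batch job might look like the following sketch. The module name/version and the Slurm options are placeholders; substitute your site's:

```
#!/bin/bash
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00

# Load an optimized BLAS before starting Julia (module name is site-specific).
module load intel-mkl/2024

# Let BLAS threading match the allocation.
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK

julia diag.jl
```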

5. Use MKL.jl (easy method)

From inside Julia:

using Pkg
Pkg.add("MKL")   # only once
using MKL        # switches Julia to MKL

Check:

using LinearAlgebra
BLAS.vendor()
BLAS.get_config()

BLAS.get_config() should now list libmkl_rt.so among the loaded libraries. Note that on Julia 1.7+, BLAS.vendor() returns :lbt rather than :mkl.
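To make the switch permanent for every session, `using MKL` can be placed in the per-user startup file, which Julia runs at startup by default:

```julia
# ~/.julia/config/startup.jl
# Loading MKL here switches the BLAS backend for every Julia session.
using MKL
```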


6. Control threading

Sometimes BLAS defaults to 1 thread on clusters. In Julia:

using LinearAlgebra
BLAS.set_num_threads(32)    # match number of cores requested
BLAS.get_num_threads()
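Rather than hard-coding 32, the thread count can be taken from the scheduler's allocation. A sketch assuming a Slurm cluster, where SLURM_CPUS_PER_TASK is set inside jobs:

```julia
using LinearAlgebra

# Fall back to the physical core count when not running under Slurm.
nt = parse(Int, get(ENV, "SLURM_CPUS_PER_TASK", string(Sys.CPU_THREADS)))
BLAS.set_num_threads(nt)
@show BLAS.get_num_threads()
```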

7. Troubleshooting

  • If Julia still uses generic OpenBLAS:
      • Ensure LD_LIBRARY_PATH includes the optimized BLAS library directory.
      • Try forcing it with LD_PRELOAD:

        export LD_PRELOAD=/path/to/openblas/lib/libopenblas.so
        julia

  • For maximum performance, rebuild Julia from source with:

        make -j MARCH=native

8. Key checks

  • BLAS.vendor() → should be :mkl or an optimized :openblas64 (on Julia ≥ 1.7 it returns :lbt, so rely on get_config() instead)
  • BLAS.get_config() → should report a CPU target such as Haswell or Broadwell, not Generic
  • A * B (1000×1000) should run in tens of ms, not seconds
  • Sparse solve should drop from multi-seconds to sub-second
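The library check can also be done from inside a running session: a sketch that lists every BLAS-like shared library currently mapped into the Julia process, using the standard-library Libdl module:

```julia
using Libdl

# Libdl.dllist() returns the paths of all shared libraries loaded by this process.
for lib in filter(p -> occursin(r"blas|mkl"i, basename(p)), Libdl.dllist())
    println(lib)
end
```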