Julia HPC BLAS/MKL Diagnostic & Setup Guide
When Julia linear algebra or sparse solvers run much slower on an HPC cluster than on a laptop, the cause is often that Julia is using a generic BLAS build instead of an optimized OpenBLAS or Intel MKL. This guide shows how to diagnose and fix the problem.
1. Check Julia build and CPU
In Julia, run `versioninfo()` and look at these fields of the output:
- `CPU:` → should say `broadwell` (or your CPU name)
- `LLVM ORCJIT:` → should not say `generic`
- `BLAS:` → check which library Julia is linked against
!!! tip
    If you see `generic`, Julia is not using CPU-specific optimizations (AVX2/FMA).
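The same check can be scripted, which helps when comparing a login node with a compute node. The sketch below assumes nothing beyond the standard library; `is_generic_cpu` is a hypothetical helper name introduced here for illustration. Depending on the Julia version, the `versioninfo()` output may not contain every field listed above.

```julia
using InteractiveUtils  # provides versioninfo(); loaded automatically in the REPL

# Hypothetical helper: true when the reported CPU name is the generic fallback.
is_generic_cpu(name::AbstractString) = lowercase(name) == "generic"

versioninfo()

# Sys.CPU_NAME is the CPU name that Julia's LLVM backend detected on this machine.
if is_generic_cpu(Sys.CPU_NAME)
    println("Warning: CPU reported as generic; CPU-specific optimizations are unlikely.")
else
    println("Detected CPU: ", Sys.CPU_NAME)
end
```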
2. Inspect BLAS setup
In Julia, inspect `BLAS.vendor()` and `BLAS.get_config()`. Typical results:
- `:openblas64` with `Generic` → unoptimized
- `:mkl` → Intel MKL is active
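As a minimal sketch of this inspection, assuming Julia 1.7 or later (where `BLAS.get_config()` is available), the loaded backend libraries can be listed one per line instead of reading the raw config dump:

```julia
using LinearAlgebra

# BLAS.get_config() reports the libraries currently providing BLAS/LAPACK.
config = BLAS.get_config()

for lib in config.loaded_libs
    # libname is the path of the shared library; interface is :lp64 or :ilp64.
    println(lib.libname, "  (interface: ", lib.interface, ")")
end
```

A path containing `mkl` points to Intel MKL, while a path containing `openblas` points to OpenBLAS. The library path alone does not say which CPU kernels OpenBLAS selected.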
3. Run diagnostic benchmark
Save the following as diag.jl and run with julia diag.jl:
```julia
using LinearAlgebra
using SparseArrays
using BenchmarkTools  # third-party package; install with Pkg.add("BenchmarkTools")

println("=== Julia Diagnostic ===")
println("Julia version: ", VERSION)
println("Architecture: ", Sys.ARCH)
println("CPU name: ", Sys.CPU_NAME)
println("Threads: ", Threads.nthreads())

println("\n--- BLAS Info ---")
println("Vendor: ", BLAS.vendor())      # deprecated since Julia 1.7 in favor of BLAS.get_config()
println("Config: ", BLAS.get_config())  # requires Julia 1.7 or later
println("Threads (BLAS): ", BLAS.get_num_threads())

println("\n--- Simple BLAS Benchmark (1000x1000 GEMM) ---")
A = rand(Float64, 1000, 1000)
B = rand(Float64, 1000, 1000)
@btime $A * $B;

println("\n--- Sparse Solver Test ---")
n = 5000
M = sprandn(n, n, 0.0005) + I  # sparse random matrix plus the identity
b = randn(n)
@btime $M \ $b;

println("\n=== Done ===")
```
Run the script once before and once after loading an optimized BLAS/MKL, then compare the timings.
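To make the before/after comparison easier to read, the GEMM timing can be converted into a throughput figure. The sketch below uses the standard estimate of about 2n³ floating-point operations for an n×n matrix product; `gemm_gflops` is a hypothetical helper name, and `@elapsed` from Base is used here instead of BenchmarkTools to keep the example dependency-free.

```julia
using LinearAlgebra

# A dense n×n times n×n product costs roughly 2n^3 floating-point operations.
gemm_gflops(n, seconds) = 2 * n^3 / seconds / 1e9

n = 1000
A = rand(n, n)
B = rand(n, n)

A * B  # warm-up run so that compilation is not included in the timing

# Take the best of a few runs to reduce noise from other jobs on the node.
t = minimum(@elapsed(A * B) for _ in 1:5)

println("GEMM time: ", round(t * 1e3; digits = 2), " ms")
println("Throughput: ", round(gemm_gflops(n, t); digits = 1), " GFLOPS")
```

For example, a 1000×1000 product that takes 20 ms corresponds to about 100 GFLOPS, while one that takes 2 s corresponds to about 1 GFLOPS.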
4. Use HPC BLAS modules
Check available modules with `module avail`.
Load one with `module load` followed by the module name.
Restart Julia, then check `BLAS.get_config()` again.
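One way to confirm what the restarted Julia process has actually loaded is to list its shared libraries and keep the BLAS-related ones. This is a sketch using the standard `Libdl.dllist()` call; `blas_like` is a hypothetical helper name, and the name-based filter is a heuristic rather than a definitive test.

```julia
using Libdl
using LinearAlgebra  # makes sure the BLAS backend is loaded into the process

# Hypothetical helper: keep libraries whose file name mentions "blas" or "mkl".
# Note that Julia's own libblastrampoline also matches this pattern.
blas_like(paths) = filter(p -> occursin(r"blas|mkl"i, basename(p)), paths)

foreach(println, blas_like(Libdl.dllist()))
```

If the module's library directory does not appear in any of the printed paths, Julia is most likely still using its bundled BLAS.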
5. Use MKL.jl (easy method)
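From inside Julia, install MKL.jl with `using Pkg; Pkg.add("MKL")`, then load it with `using MKL`.
Check with `BLAS.vendor()` and `BLAS.get_config()`.
The output should show `:mkl` and list `libmkl_rt.so`. On Julia 1.7 and later, `BLAS.vendor()` is deprecated, so rely on the `BLAS.get_config()` output there.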
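The MKL check can also be written as a small predicate, which is convenient at the top of a job script. The sketch assumes Julia 1.7 or later and that MKL.jl is already installed when the `using MKL` line is uncommented; `mkl_active` is a hypothetical helper name introduced for illustration.

```julia
using LinearAlgebra

# Hypothetical helper: true if any loaded BLAS library file name mentions "mkl".
mkl_active(libnames) = any(occursin("mkl", lowercase(basename(p))) for p in libnames)

# In a session where MKL.jl is installed, load it before running the check:
#     using MKL

libnames = [lib.libname for lib in BLAS.get_config().loaded_libs]
println("Loaded BLAS libraries: ", libnames)
println("MKL active: ", mkl_active(libnames))
```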
6. Control threading
Sometimes BLAS defaults to 1 thread on clusters. In Julia:
```julia
using LinearAlgebra

BLAS.set_num_threads(32)  # match the number of cores requested
BLAS.get_num_threads()    # confirm the setting took effect
```
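Instead of hard-coding the thread count, it can be read from the scheduler's environment. The sketch below assumes a Slurm cluster that exports `SLURM_CPUS_PER_TASK`; that variable name is an assumption about your site, and `requested_cores` is a hypothetical helper. Other schedulers use different variables.

```julia
using LinearAlgebra

# Hypothetical helper: read the core count from the environment (assumed Slurm
# variable), falling back to `default` when it is missing or not a positive integer.
function requested_cores(env = ENV; default = Sys.CPU_THREADS)
    n = tryparse(Int, get(env, "SLURM_CPUS_PER_TASK", ""))
    return (n === nothing || n < 1) ? default : n
end

BLAS.set_num_threads(requested_cores())
println("BLAS threads: ", BLAS.get_num_threads())
```

Falling back to `Sys.CPU_THREADS` is a design choice for interactive use; on shared nodes a smaller default may be more appropriate.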
7. Troubleshooting
- If Julia still uses generic OpenBLAS:
  - Ensure `LD_LIBRARY_PATH` includes the optimized BLAS.
  - Try forcing it with `LD_PRELOAD`.
- For maximum performance:
  - Rebuild Julia from source.
8. Key checks
- `BLAS.vendor()` → should be `mkl` or optimized `openblas`
- `BLAS.get_config()` → should list `Haswell`/`Broadwell`, not `Generic`
- `A * B` (1000×1000) should run in tens of ms, not seconds
- Sparse solve should drop from multi-seconds to sub-second