runner: Remove CGO engines, use llama-server exclusively for GGML models (#16031)

* broad lint fixes to sidestep CI scope glitch

* runner: Remove CGO engines, use llama-server exclusively for GGML models

Remove the vendored GGML and llama.cpp backend, CGO runner, Go model
implementations, and sample.  llama-server (built from upstream llama.cpp via
FetchContent) is now the sole inference engine for GGUF-based models.
(Safetensors-based models continue to run on the new MLX engine.)  This allows
us to more rapidly pick up new capabilities and fixes from llama.cpp as they
come out.

On Windows this now requires recent AMD driver versions to support ROCm v7 as
llama.cpp currently does not support building against v6.

* llama/compat: load Ollama-format GGUFs in llama-server

Squashed from upstream/jmorganca/llama-compat on 2026-04-29.
Source tip: 0c33775d37.

Original source commits:
- 25223160d llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs
- 7449b539a llm,server: route Ollama-format gemma3 blobs through llama/compat
- 436f2e2b1 llama/compat: make patch-apply idempotent
- 8c2c9d4c8 llama/compat: extend gemma3 handler to cover 1B and 270M blobs
- 021389f7b llama/compat: shrink clip.cpp injection from 18 lines to 1
- 61b367ec2 llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines)
- 36049361c llama/compat: simplify shim (gemma3-tested)
- 8fa664865 llama/compat: add qwen35moe text handler
- db0c74530 llama/compat: add qwen35moe vision (clip) support
- 2a388da77 llama/compat: split shared infra into a util TU
- 9a69a17dc llama/compat: document non-public API dependencies
- d0f38a915 llama/compat: add gpt-oss and lfm2 handlers
- 086071822 llama/compat: add mistral3 text handler (vision TODO)
- 63bde9ff7 llama/compat: add mistral3 vision (clip) support
- 3a57b89d5 llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K
- 99cb87439 llama/compat: add qwen35, gemma4, deepseek-ocr handlers
- 2c7850dba llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip)
- 9e3b54225 llama/compat: add llama4 text + clip handlers
- 034fee349 llama/compat: add gemma4 clip handler (gemma4v projector)
- 9945c5a93 server: remove dhiltgen/* compat redirect table
- 5d4539101 llama/compat: rewrite gemma4 tokenizer model to BPE
- 7e0765327 llama/compat: add glm-ocr text handler + text-loader load-op hook
- f1bd1a25a llama/compat: add glm-ocr clip handler (glm4v projector)
- 4b5cf3420 llama/compat: collapse text-loader hook back to one new patch line
- eb4ecf4fc llama/compat: extend gemma4 clip handler to gemma4a (audio)
- a23a5e76f llama/compat: fix gemma4a per-block norm tensor mapping
- cd2dcaff4 llama/compat: add embeddinggemma handler
- 1ce8a6b26 llama/compat: add qwen3-vl + qwen2.5-vl handlers
- fd98ffa1e llama/compat: add gemma3n + glm4moelite handlers
- cc7bdf0bc llama/compat: handle null buft in maybe_load_tensor
- 0c33775d3 llama/compat: disable mmap when load_op transforms text-side tensors

* refine implementation

* ci: fix windows MLX build

* ci: fix windows llama-server build

* ci: fix windows rocm build

* ci: windows mlx tuning

Shorten the long tail of the build, and get OllamaSetup.exe back under the 2 GB limit

* ci: fix windows dependencies

* win: fix dependency gathering

* disable openmp

* win: arm64 cross-compile build

also DRY out CI steps

* scheduler improvements

* ci: improvements from #15982

* win: favor ninja for faster developer builds

* win: fix build

* win: fix arm64 cross-compile

* win: avoid spaces in compiler path

* misc discovery fixes and BOS handling

* lint fixes

* win: fix arm cross-compile build/CI bugs

* llama.cpp update

* win: handle multiple CRT dirs

* vulkan: add windows iGPU detection

* fix creation bugs for patched models, other refactoring work

* tune batch size for better performance

* ci and lint fixes

* fix repeat_last_n bug

* build: revamp build for better developer UX

* amd, sampler, qwen3next fixes

* version bump

* fix mlx build

* revamp GPU discovery

Scanning the output of llama-server has proven too error-prone across
llama.cpp updates, so this switches to a thin dynamic library load against the
bundled GGML libraries, which lets more details be gathered through the API.

* version bump

* missing file

* ci: fix cache miss on rocm build

* refine vulkan dep handling

* fix ps reporting bug on full GPU load

* improve cmake wiring for customized local builds

* version bump

* docker build arg cleanup

* improve windows exit error logs

* fix community gemma4 support and ci flakes

* fix mlx unit test

* tighten up ps logic to avoid double counting fit log lines

* version bump

* fix ps view for full gpu layer offload

* add MTP wiring for llama-server and create with GGUFs

* pick best template by capabilities

* version bump

* ci: harden apt repos

* remove unused cpu core discovery

* adjust batch default logic to reduce OOMs

* support larger tool calls

* fix audio support, template show

* qwen35 mtp patch support

* flesh out dtypes

* rocm deps

* version bump

* lint fix

* block broken gfx1150 on windows

* fix qwen3.5 moe mtp tensors in patch

* mmproj oom fallback and vulkan on by default

* qwen MTP compat fix

* version bump

* ci: fix WoA cross-compile

* ci: workaround ui tool in cross-compile

* version bump

* win: enable OpenMP for CPU builds

* build: improve developer UX

* ci: windows path workaround for CPU build

* win: fix WoA dependencies

* win: fix large offset reads for mmproj patched loads

* version bump

* fix vulkan dup detection

* add OLLAMA_IGPU_ENABLE and largely disable iGPUs by default

* opt-in MTP, win large offset, integration fixes

* fix unit test scheduler interaction hang

* fix multi-gpu filtering

* version bump

* review comments

* fix thinking level

* fix linux rocm ordering and granite 3.3 template

* version bump

* ci fix - non-shallow MLX checkout

* bypass linux sysfs unit test on windows

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Daniel Hiltgen 2026-05-29 13:35:47 -07:00 committed by GitHub
parent f63eea3d27
commit 9db4bdbad6
1100 changed files with 28510 additions and 430069 deletions

@ -16,7 +16,7 @@ jobs:
outputs:
GOFLAGS: ${{ steps.goflags.outputs.GOFLAGS }}
VERSION: ${{ steps.goflags.outputs.VERSION }}
vendorsha: ${{ steps.changes.outputs.vendorsha }}
vendorsha: ${{ steps.goflags.outputs.vendorsha }}
steps:
- uses: actions/checkout@v4
- name: Set environment
@ -24,7 +24,7 @@ jobs:
run: |
echo GOFLAGS="'-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=${GITHUB_REF_NAME#v}\" \"-X=github.com/ollama/ollama/server.mode=release\"'" | tee -a $GITHUB_OUTPUT
echo VERSION="${GITHUB_REF_NAME#v}" | tee -a $GITHUB_OUTPUT
echo vendorsha=$(make -f Makefile.sync print-base) | tee -a $GITHUB_OUTPUT
echo vendorsha=$(cat LLAMA_CPP_VERSION)-$(cat MLX_VERSION)-$(cat MLX_C_VERSION) | tee -a $GITHUB_OUTPUT
darwin-build:
runs-on: macos-26-xlarge
@ -57,7 +57,9 @@ jobs:
go-version-file: go.mod
cache-dependency-path: |
go.sum
Makefile.sync
LLAMA_CPP_VERSION
MLX_VERSION
MLX_C_VERSION
- run: |
./scripts/build_darwin.sh
- name: Log build results
@ -73,15 +75,18 @@ jobs:
dist/*.dmg
windows-depends:
needs: setup-environment
strategy:
matrix:
os: [windows]
arch: [amd64]
preset: ['CPU']
build-steps: ['cpu cpuArm64']
include:
- os: windows
arch: amd64
preset: 'CUDA 12'
build-steps: cuda12
install: https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_571.96_windows.exe
cuda-components:
- '"cudart"'
@ -89,10 +94,10 @@ jobs:
- '"cublas"'
- '"cublas_dev"'
cuda-version: '12.8'
flags: ''
- os: windows
arch: amd64
preset: 'CUDA 13'
build-steps: cuda13
install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
cuda-components:
- '"cudart"'
@ -103,23 +108,23 @@ jobs:
- '"nvvm"'
- '"nvptxcompiler"'
cuda-version: '13.0'
flags: ''
- os: windows
arch: amd64
preset: 'ROCm 6'
install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-WinSvr2022-For-HIP.exe
rocm-version: '6.2'
flags: '-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma" -DCMAKE_CXX_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma"'
runner_dir: 'rocm'
preset: 'ROCm 7'
build-steps: rocm7
install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-26.Q1-Win11-For-HIP.exe
rocm-version: '7.1'
- os: windows
arch: amd64
preset: Vulkan
build-steps: vulkan
install: https://sdk.lunarg.com/sdk/download/1.4.321.1/windows/vulkansdk-windows-X64-1.4.321.1.exe
flags: ''
runner_dir: 'vulkan'
- os: windows
arch: amd64
preset: 'MLX CUDA 13'
build-steps: mlxCuda13
build-parallel: '16'
cmake-cuda-flags: '-t 6'
install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
cudnn-install: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.18.1.3_cuda13-archive.zip
cuda-components:
@ -135,13 +140,12 @@ jobs:
- '"nvvm"'
- '"nvptxcompiler"'
cuda-version: '13.0'
flags: ''
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
environment: release
env:
GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
VERSION: ${{ needs.setup-environment.outputs.VERSION }}
steps:
# Increase pagefile to handle momentary spikes in RAM from NVCC compiles
- if: startsWith(matrix.preset, 'MLX ')
name: Increase pagefile to 200 GB
uses: al-cheb/configure-pagefile-action@v1.5
@ -155,6 +159,15 @@ jobs:
if (Get-Command ccache -ErrorAction SilentlyContinue) {
ccache -o cache_dir=${{ github.workspace }}\.ccache
}
- if: matrix.preset == 'CPU'
name: Install Windows ARM64 cross compiler
run: |
Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-x86_64.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
Expand-Archive -Path ${{ runner.temp }}\llvm-mingw-ucrt.zip -DestinationPath "C:\Program Files\"
$installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt-x86_64").path
if (!(Test-Path "$installPath\bin\aarch64-w64-mingw32-gcc.exe")) {
throw "llvm-mingw x86_64 package is missing the aarch64 cross compiler"
}
- if: startsWith(matrix.preset, 'CUDA ') || startsWith(matrix.preset, 'ROCm ') || startsWith(matrix.preset, 'Vulkan') || startsWith(matrix.preset, 'MLX ')
id: cache-install
uses: actions/cache/restore@v4
@ -203,12 +216,12 @@ jobs:
}
$vulkanPath = (Resolve-Path "C:\VulkanSDK\*").path
$vulkanRuntime = Join-Path $vulkanPath "Helpers\VulkanRT.exe"
if (Test-Path $vulkanRuntime) {
Start-Process -FilePath $vulkanRuntime -ArgumentList "/s" -NoNewWindow -Wait
}
echo "$vulkanPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "VULKAN_SDK=$vulkanPath" >> $env:GITHUB_ENV
- if: matrix.preset == 'CPU'
run: |
echo "CC=clang.exe" | Out-File -FilePath $env:GITHUB_ENV -Append
echo "CXX=clang++.exe" | Out-File -FilePath $env:GITHUB_ENV -Append
- if: startsWith(matrix.preset, 'MLX ')
name: Install cuDNN for MLX
run: |
@ -240,73 +253,63 @@ jobs:
with:
path: ${{ github.workspace }}\.ccache
key: ccache-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.preset }}-${{ needs.setup-environment.outputs.vendorsha }}
- name: Build target "${{ matrix.preset }}"
- name: Build Windows dependencies
run: |
Import-Module 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\Tools\Microsoft.VisualStudio.DevShell.dll'
Enter-VsDevShell -VsInstallPath 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise' -SkipAutomaticLocation -DevCmdArguments '-arch=x64 -no_logo'
cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }} --install-prefix "$((pwd).Path)\dist\${{ matrix.os }}-${{ matrix.arch }}"
cmake --build --preset "${{ matrix.preset }}" -- -l $([Environment]::ProcessorCount)
cmake --install build --component "${{ startsWith(matrix.preset, 'MLX ') && 'MLX' || startsWith(matrix.preset, 'CUDA ') && 'CUDA' || startsWith(matrix.preset, 'ROCm ') && 'HIP' || startsWith(matrix.preset, 'Vulkan') && 'Vulkan' || 'CPU' }}" --strip
if ('${{ matrix.preset }}'.StartsWith('MLX ')) { cmake --install build --component MLX_VENDOR }
Remove-Item -Path dist\lib\ollama\rocm\rocblas\library\*gfx906* -ErrorAction SilentlyContinue
$steps = "${{ matrix.build-steps }}".Split(' ', [System.StringSplitOptions]::RemoveEmptyEntries)
./scripts/build_windows.ps1 @steps
env:
CMAKE_GENERATOR: Ninja
OLLAMA_BUILD_PARALLEL: ${{ matrix.build-parallel || '' }}
OLLAMA_CMAKE_CUDA_FLAGS: ${{ matrix.cmake-cuda-flags || '' }}
- name: Log build results
run: |
gci -path .\dist -Recurse -File | ForEach-Object { get-filehash -path $_.FullName -Algorithm SHA256 } | format-list
- if: matrix.preset == 'CPU'
name: Verify Windows CPU payloads
shell: bash
run: |
set -euo pipefail
for payload in \
dist/windows-amd64/lib/ollama/llama-server.exe \
dist/windows-arm64/lib/ollama/llama-server.exe
do
[ -f "$payload" ] || { echo "missing $payload"; exit 1; }
done
- uses: actions/upload-artifact@v4
with:
name: depends-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.preset }}
path: dist\*
windows-build:
strategy:
matrix:
os: [windows]
arch: [amd64, arm64]
include:
- os: windows
arch: amd64
llvmarch: x86_64
- os: windows
arch: arm64
llvmarch: aarch64
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
runs-on: windows
environment: release
needs: [setup-environment]
env:
GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
VERSION: ${{ needs.setup-environment.outputs.VERSION }}
steps:
- name: Install ARM64 system dependencies
if: matrix.arch == 'arm64'
run: |
$ErrorActionPreference = "Stop"
Set-ExecutionPolicy Bypass -Scope Process -Force
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
echo "C:\ProgramData\chocolatey\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
Invoke-WebRequest -Uri https://aka.ms/vs/17/release/vc_redist.arm64.exe -OutFile "${{ runner.temp }}\vc_redist.arm64.exe"
Start-Process -FilePath "${{ runner.temp }}\vc_redist.arm64.exe" -ArgumentList @("/install", "/quiet", "/norestart") -NoNewWindow -Wait
choco install -y --no-progress git gzip
echo "C:\Program Files\Git\cmd" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
- name: Install clang and gcc-compat
run: |
$ErrorActionPreference = "Stop"
Set-ExecutionPolicy Bypass -Scope Process -Force
Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-${{ matrix.llvmarch }}.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-x86_64.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
Expand-Archive -Path ${{ runner.temp }}\llvm-mingw-ucrt.zip -DestinationPath "C:\Program Files\"
$installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt*").path
$installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt-x86_64").path
echo "$installPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
if (!(Test-Path "$installPath\bin\aarch64-w64-mingw32-gcc.exe")) {
throw "llvm-mingw x86_64 package is missing the aarch64 cross compiler"
}
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version-file: go.mod
cache-dependency-path: |
go.sum
Makefile.sync
LLAMA_CPP_VERSION
MLX_VERSION
MLX_C_VERSION
- name: Verify gcc is actually clang
run: |
$ErrorActionPreference='Continue'
@ -323,20 +326,30 @@ jobs:
with:
node-version: "20"
- run: |
./scripts/build_windows ollama app
./scripts/build_windows ollama ollamaArm64 app appArm64
- name: Verify Windows build payloads
shell: bash
run: |
set -euo pipefail
for payload in \
dist/windows-amd64/ollama.exe \
dist/windows-arm64/ollama.exe
do
[ -f "$payload" ] || { echo "missing $payload"; exit 1; }
done
- name: Log build results
run: |
gci -path .\dist -Recurse -File | ForEach-Object { get-filehash -path $_.FullName -Algorithm SHA256 } | format-list
- uses: actions/upload-artifact@v4
with:
name: build-${{ matrix.os }}-${{ matrix.arch }}
name: build-windows-amd64
path: |
dist\*
windows-app:
runs-on: windows
environment: release
needs: [windows-build, windows-depends]
needs: [setup-environment, windows-build, windows-depends]
env:
GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
VERSION: ${{ needs.setup-environment.outputs.VERSION }}
@ -362,7 +375,9 @@ jobs:
go-version-file: go.mod
cache-dependency-path: |
go.sum
Makefile.sync
LLAMA_CPP_VERSION
MLX_VERSION
MLX_C_VERSION
- uses: actions/download-artifact@v4
with:
pattern: depends-windows*
@ -376,6 +391,18 @@ jobs:
- name: Log dist contents after download
run: |
gci -path .\dist -recurse
- name: Verify Windows package inputs
shell: bash
run: |
set -euo pipefail
for payload in \
dist/windows-amd64/ollama.exe \
dist/windows-amd64/lib/ollama/llama-server.exe \
dist/windows-arm64/ollama.exe \
dist/windows-arm64/lib/ollama/llama-server.exe
do
[ -f "$payload" ] || { echo "missing $payload"; exit 1; }
done
- run: |
./scripts/build_windows.ps1 deps sign installer zip
- name: Log contents after build
@ -389,31 +416,28 @@ jobs:
dist/*.ps1
dist/OllamaSetup.exe
# Pre-build each Dockerfile stage on its own runner in parallel and push the
# resulting layers to a per-stage registry cache. The downstream
# docker-build-push job then assembles cache-hit-only.
linux-depends:
strategy:
matrix:
include:
- arch: amd64
target: cpu
target: llama-server-cpu
- arch: amd64
target: cuda-12
target: llama-server-cuda_v12
- arch: amd64
target: cuda-13
target: llama-server-cuda_v13
- arch: amd64
target: mlx
- arch: amd64
target: rocm-7
target: llama-server-rocm_v7_2
- arch: amd64
target: vulkan
target: llama-server-vulkan
- arch: arm64
target: cpu
target: llama-server-cpu
- arch: arm64
target: cuda-12
target: llama-server-cuda_v12
- arch: arm64
target: cuda-13
target: llama-server-cuda_v13
- arch: arm64
target: jetpack-5
- arch: arm64
@ -430,7 +454,6 @@ jobs:
with:
username: ${{ vars.DOCKER_USER }}
password: ${{ secrets.DOCKER_ACCESS_TOKEN }}
# Increase swap to handle momentary spikes in RAM from NVCC compiles
- if: matrix.target == 'mlx'
name: Increase Linux swap to 200 GB
shell: bash
@ -459,12 +482,13 @@ jobs:
provenance: false
sbom: false
build-args: |
GOFLAGS=${{ env.GOFLAGS }}
CGO_CFLAGS=${{ env.CGO_CFLAGS }}
CGO_CXXFLAGS=${{ env.CGO_CXXFLAGS }}
GOFLAGS=${{ env.GOFLAGS }}
APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
OLLAMA_MLX_BUILD_JOBS=16
OLLAMA_MLX_NVCC_THREADS=6
APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
cache-from: |
type=registry,ref=ollama/release:cache-${{ matrix.arch }}-${{ matrix.target }}
type=registry,ref=${{ vars.DOCKER_REPO }}:latest
@ -472,58 +496,65 @@ jobs:
# Build each Docker variant (OS, arch, and flavor) separately. Using QEMU is unreliable and slower.
# Heavy stages were pre-built by linux-depends; this job is cache-hit-only for those layers
# and just assembles, runs the Go build, and pushes the final image.
# and just assembles, runs the Go build, pushes the final image, and extracts release bundles.
docker-build-push:
strategy:
matrix:
include:
- os: linux
arch: arm64
archive-target: archive
build-args: |
CGO_CFLAGS
CGO_CXXFLAGS
GOFLAGS
APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
OLLAMA_MLX_BUILD_JOBS=16
OLLAMA_MLX_NVCC_THREADS=6
cache-from: |
type=registry,ref=${{ vars.DOCKER_REPO }}:latest
type=registry,ref=ollama/release:cache-arm64-cpu
type=registry,ref=ollama/release:cache-arm64-cuda-12
type=registry,ref=ollama/release:cache-arm64-cuda-13
type=registry,ref=ollama/release:cache-arm64-llama-server-cpu
type=registry,ref=ollama/release:cache-arm64-llama-server-cuda_v12
type=registry,ref=ollama/release:cache-arm64-llama-server-cuda_v13
type=registry,ref=ollama/release:cache-arm64-jetpack-5
type=registry,ref=ollama/release:cache-arm64-jetpack-6
type=registry,ref=${{ vars.DOCKER_REPO }}:latest
- os: linux
arch: amd64
archive-target: archive
build-args: |
CGO_CFLAGS
CGO_CXXFLAGS
GOFLAGS
APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
OLLAMA_MLX_BUILD_JOBS=16
OLLAMA_MLX_NVCC_THREADS=6
cache-from: |
type=registry,ref=${{ vars.DOCKER_REPO }}:latest
type=registry,ref=ollama/release:cache-amd64-cpu
type=registry,ref=ollama/release:cache-amd64-cuda-12
type=registry,ref=ollama/release:cache-amd64-cuda-13
type=registry,ref=ollama/release:cache-amd64-llama-server-cpu
type=registry,ref=ollama/release:cache-amd64-llama-server-cuda_v12
type=registry,ref=ollama/release:cache-amd64-llama-server-cuda_v13
type=registry,ref=ollama/release:cache-amd64-mlx
type=registry,ref=ollama/release:cache-amd64-vulkan
type=registry,ref=ollama/release:cache-amd64-llama-server-rocm_v7_2
type=registry,ref=ollama/release:cache-amd64-llama-server-vulkan
type=registry,ref=${{ vars.DOCKER_REPO }}:latest
- os: linux
arch: amd64
suffix: '-rocm'
archive-target: image-archive
build-args: |
CGO_CFLAGS
CGO_CXXFLAGS
GOFLAGS
FLAVOR=rocm
APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
OLLAMA_MLX_BUILD_JOBS=16
OLLAMA_MLX_NVCC_THREADS=6
cache-from: |
type=registry,ref=ollama/release:cache-amd64-llama-server-cpu
type=registry,ref=ollama/release:cache-amd64-llama-server-rocm_v7_2
type=registry,ref=${{ vars.DOCKER_REPO }}:latest
type=registry,ref=ollama/release:cache-amd64-cpu
type=registry,ref=ollama/release:cache-amd64-rocm-7
runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
environment: release
needs: [setup-environment, linux-depends]
@ -556,14 +587,11 @@ jobs:
name: digest-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.suffix }}
path: |
${{ runner.temp }}/${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.suffix }}.txt
# Re-run buildx with --target archive against buildkit's local cache to
# extract the release directory layout. All upstream stages were just
# built above, so this is a cache-hit-only pass that just writes files.
- uses: docker/build-push-action@v6
with:
context: .
platforms: ${{ matrix.os }}/${{ matrix.arch }}
target: archive
target: ${{ matrix.archive-target }}
provenance: false
sbom: false
build-args: ${{ matrix.build-args }}
@ -572,24 +600,33 @@ jobs:
- name: Deduplicate CUDA libraries
run: |
./scripts/deduplicate_cuda_libs.sh dist/${{ matrix.os }}-${{ matrix.arch }}
- name: Verify Linux build payloads
shell: bash
run: |
set -euo pipefail
base="dist/${{ matrix.os }}-${{ matrix.arch }}"
for payload in \
"$base/bin/ollama" \
"$base/lib/ollama/llama-server"
do
[ -f "$payload" ] || { echo "missing $payload"; exit 1; }
done
- run: |
for COMPONENT in bin/* lib/ollama/*; do
case "$COMPONENT" in
bin/ollama*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/*.so*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/llama-server*|lib/ollama/llama-quantize*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/cuda_v*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/vulkan*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/mlx*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-mlx.tar.in ;;
lib/ollama/include*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
lib/ollama/include*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-mlx.tar.in ;;
lib/ollama/cuda_jetpack5) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-jetpack5.tar.in ;;
lib/ollama/cuda_jetpack6) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-jetpack6.tar.in ;;
lib/ollama/rocm) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-rocm.tar.in ;;
lib/ollama/rocm_v*) echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-rocm.tar.in ;;
esac
done
working-directory: dist/${{ matrix.os }}-${{ matrix.arch }}
# rocm builds cpu + rocm libs for the container image, which
# creates a CPU-only amd64 tarball that would collide with the full
# bundle when the release job merges artifacts.
- if: matrix.suffix == '-rocm'
run: rm -f dist/${{ matrix.os }}-${{ matrix.arch }}/ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in
- run: |
@ -665,6 +702,21 @@ jobs:
- name: Copy install scripts to dist
run: |
cp scripts/install.sh dist/install.sh
- name: Verify release artifacts
run: |
required=(
dist/OllamaSetup.exe
dist/install.ps1
dist/install.sh
dist/ollama-windows-amd64.zip
dist/ollama-windows-arm64.zip
)
for payload in "${required[@]}"; do
if [ ! -f "$payload" ]; then
echo "::error::Missing expected release artifact: $payload"
exit 1
fi
done
- name: Generate checksum file
run: find . -type f -not -name 'sha256sum.txt' | xargs sha256sum | tee sha256sum.txt
working-directory: dist
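
The packaging step in the last Linux hunk above routes each top-level entry of the dist tree into a per-bundle `.tar.in` list. Because this view drops the +/- markers, old and new `case` arms appear side by side. The sketch below is one reading of the post-change routing, assuming (from the hunk's line counts) that `include*` now goes to the `-mlx` bundle and that `rocm_v*` replaces the bare `rocm` arm. `bundle_suffix` is a hypothetical helper name used only for illustration; the workflow itself appends directly to the `.tar.in` files.

```shell
#!/usr/bin/env bash
set -eu

# bundle_suffix COMPONENT
# Print the tarball suffix a dist entry is routed to ("" means the main
# bundle). Return 1 when the entry matches no arm and is left unpackaged.
# Hypothetical helper: the arms mirror one reading of the workflow's case
# statement after this commit.
bundle_suffix() {
  case "$1" in
    bin/ollama*)                                         printf '%s\n' "" ;;
    lib/ollama/*.so*)                                    printf '%s\n' "" ;;
    lib/ollama/llama-server*|lib/ollama/llama-quantize*) printf '%s\n' "" ;;
    lib/ollama/cuda_v*)                                  printf '%s\n' "" ;;
    lib/ollama/vulkan*)                                  printf '%s\n' "" ;;
    lib/ollama/mlx*)                                     printf '%s\n' "-mlx" ;;
    lib/ollama/include*)                                 printf '%s\n' "-mlx" ;;
    lib/ollama/cuda_jetpack5)                            printf '%s\n' "-jetpack5" ;;
    lib/ollama/cuda_jetpack6)                            printf '%s\n' "-jetpack6" ;;
    lib/ollama/rocm_v*)                                  printf '%s\n' "-rocm" ;;
    *)                                                   return 1 ;;
  esac
}

# Show where a few representative entries would land for linux/amd64.
for component in bin/ollama lib/ollama/llama-server \
                 lib/ollama/mlx_cuda_v13 lib/ollama/rocm_v7_2; do
  printf '%s -> ollama-linux-amd64%s.tar.in\n' \
    "$component" "$(bundle_suffix "$component")"
done
```

Under this reading, a directory named plain `lib/ollama/rocm` would no longer be picked up by any arm, which is consistent with the versioned `rocm_v7_2` payload path checked elsewhere in the workflow.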

@ -23,7 +23,7 @@ jobs:
outputs:
changed: ${{ steps.changes.outputs.changed }}
app_changed: ${{ steps.changes.outputs.app_changed }}
vendorsha: ${{ steps.changes.outputs.vendorsha }}
enginehash: ${{ steps.changes.outputs.enginehash }}
steps:
- uses: actions/checkout@v4
with:
@ -38,9 +38,42 @@ jobs:
| xargs python3 -c "import sys; from pathlib import Path; print(any(Path(x).match(glob) for x in sys.argv[1:] for glob in '$*'.split(' ')))"
}
echo changed=$(changed 'llama/llama.cpp/**/*' 'ml/backend/ggml/ggml/**/*' '.github/**/*') | tee -a $GITHUB_OUTPUT
echo changed=$(changed \
'CMakeLists.txt' \
'CMakePresets.json' \
'cmake/**' \
'cmake/**/*' \
'llama/server/**/*' \
'llama/compat/**/*' \
'LLAMA_CPP_VERSION' \
'MLX_VERSION' \
'MLX_C_VERSION' \
'llama/llama.cpp/**/*' \
'ml/backend/ggml/ggml/**/*' \
'x/imagegen/mlx/**' \
'x/imagegen/mlx/**/*' \
'.github/**/*') | tee -a $GITHUB_OUTPUT
echo app_changed=$(changed 'app/**' 'app/**/*') | tee -a $GITHUB_OUTPUT
echo vendorsha=$(make -f Makefile.sync print-base) | tee -a $GITHUB_OUTPUT
echo enginehash=$(cat LLAMA_CPP_VERSION)-$(cat MLX_VERSION)-$(cat MLX_C_VERSION) | tee -a $GITHUB_OUTPUT
patches:
strategy:
matrix:
os: [ubuntu-latest, windows-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Verify patches apply cleanly
shell: bash
run: |
cmake -S llama/server -B "$RUNNER_TEMP/llama-server-patch-check" \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DGGML_BACKEND_DL=ON \
-DGGML_NATIVE=OFF \
-DGGML_OPENMP=OFF \
-DGGML_CPU_ALL_VARIANTS=ON \
-DOLLAMA_RUNNER_DIR=
linux:
needs: [changes]
@ -49,23 +82,41 @@ jobs:
matrix:
include:
- preset: CPU
superbuild_target: ollama-local
superbuild_dir: build/local-superbuild
superbuild_args: ''
expected_payload: lib/ollama/llama-server
install-go: true
- preset: CUDA
container: nvidia/cuda:13.0.0-devel-ubuntu22.04
flags: '-DCMAKE_CUDA_ARCHITECTURES=87'
superbuild_target: ollama-llama-server-cuda_v13
superbuild_dir: build/local-superbuild-cuda_v13
superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=87'
expected_payload: lib/ollama/cuda_v13/libggml-cuda.so
- preset: ROCm
container: rocm/dev-ubuntu-22.04:7.2.1
extra-packages: rocm-libs
flags: '-DAMDGPU_TARGETS=gfx1010 -DCMAKE_PREFIX_PATH=/opt/rocm'
superbuild_target: ollama-llama-server-rocm_v7_2
superbuild_dir: build/local-superbuild-rocm_v7_2
superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=rocm_v7_2 -DAMDGPU_TARGETS=gfx1010 -DCMAKE_PREFIX_PATH=/opt/rocm'
expected_payload: lib/ollama/rocm_v7_2/libggml-hip.so
- preset: Vulkan
container: ubuntu:22.04
extra-packages: >
mesa-vulkan-drivers vulkan-tools
libvulkan1 libvulkan-dev
vulkan-sdk cmake ccache g++ make
vulkan-sdk spirv-headers cmake ccache g++ make
superbuild_target: ollama-llama-server-vulkan
superbuild_dir: build/local-superbuild-vulkan
superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=vulkan'
expected_payload: lib/ollama/vulkan/libggml-vulkan.so
- preset: 'MLX CUDA 13'
container: nvidia/cuda:13.0.0-devel-ubuntu22.04
extra-packages: libcudnn9-dev-cuda-13 libopenblas-dev liblapack-dev liblapacke-dev git curl
flags: '-DCMAKE_CUDA_ARCHITECTURES=87 -DMLX_CUDA_ARCHITECTURES=80-virtual -DBLAS_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu -DLAPACK_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu'
superbuild_target: ollama-mlx-cuda_v13
superbuild_dir: build/local-superbuild-mlx-cuda_v13
superbuild_args: '-DOLLAMA_MLX_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=87 -DMLX_CUDA_ARCHITECTURES=80-virtual -DBLAS_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu -DLAPACK_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu'
expected_payload: lib/ollama/mlx_cuda_v13/libmlx.so
install-go: true
runs-on: linux
container: ${{ matrix.container }}
@ -82,11 +133,9 @@ jobs:
echo "deb [signed-by=/usr/share/keyrings/lunarg-archive-keyring.gpg] https://packages.lunarg.com/vulkan/1.4.313 jammy main" | $sudo tee /etc/apt/sources.list.d/lunarg-vulkan-1.4.313-jammy.list > /dev/null
$sudo apt-get update
fi
$sudo apt-get install -y cmake ccache ${{ matrix.extra-packages }}
# MLX requires CMake 3.25+, install from official releases
if [ "${{ matrix.preset }}" = "MLX CUDA 13" ]; then
curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.31.2/cmake-3.31.2-linux-$(uname -m).tar.gz | $sudo tar xz -C /usr/local --strip-components 1
fi
$sudo apt-get install -y cmake ccache curl git ${{ matrix.extra-packages }}
# Use a current CMake for upstream llama.cpp and Vulkan dependency discovery.
curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.31.2/cmake-3.31.2-linux-$(uname -m).tar.gz | $sudo tar xz -C /usr/local --strip-components 1
# Export VULKAN_SDK if provided by LunarG package (defensive)
if [ -d "/usr/lib/x86_64-linux-gnu/vulkan" ] && [ "${{ matrix.preset }}" = "Vulkan" ]; then
echo "VULKAN_SDK=/usr" >> $GITHUB_ENV
@ -96,17 +145,30 @@ jobs:
- if: matrix.install-go
name: Install Go
run: |
[ -n "${{ matrix.container }}" ] || sudo=sudo
GO_VERSION=$(awk '/^go / { print $2 }' go.mod)
curl -fsSL "https://golang.org/dl/go${GO_VERSION}.linux-$(dpkg --print-architecture).tar.gz" | tar xz -C /usr/local
curl -fsSL "https://golang.org/dl/go${GO_VERSION}.linux-$(dpkg --print-architecture).tar.gz" | $sudo tar xz -C /usr/local
echo "/usr/local/go/bin" >> $GITHUB_PATH
- uses: actions/cache@v4
with:
path: /github/home/.cache/ccache
key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.vendorsha }}
- run: |
cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }}
cmake --build --preset "${{ matrix.preset }}" -- -l $(nproc)
key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.enginehash }}
- name: Build native superbuild
if: matrix.superbuild_target
run: |
cmake -S . -B "${{ matrix.superbuild_dir }}" ${{ matrix.superbuild_args }}
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) cmake --build "${{ matrix.superbuild_dir }}" --target "${{ matrix.superbuild_target }}" -- -l $(nproc)
test -e "${{ matrix.superbuild_dir }}/${{ matrix.expected_payload }}"
- name: Verify local superbuild install
if: matrix.superbuild_target == 'ollama-local'
run: |
./ollama --version
"${{ matrix.superbuild_dir }}/lib/ollama/llama-server" --version
test -x "${{ matrix.superbuild_dir }}/lib/ollama/llama-quantize"
cmake --install "${{ matrix.superbuild_dir }}" --component ollama-local --prefix "$RUNNER_TEMP/ollama-local"
"$RUNNER_TEMP/ollama-local/bin/ollama" --version
"$RUNNER_TEMP/ollama-local/lib/ollama/llama-server" --version
test -x "$RUNNER_TEMP/ollama-local/lib/ollama/llama-quantize"
windows:
needs: [changes]
if: needs.changes.outputs.changed == 'True'
@ -114,9 +176,16 @@ jobs:
matrix:
include:
- preset: CPU
superbuild_target: ollama-local
superbuild_dir: build\local-superbuild
superbuild_args: ''
expected_payload: lib\ollama\llama-server.exe
- preset: CUDA
install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
flags: '-DCMAKE_CUDA_ARCHITECTURES=80'
superbuild_target: ollama-llama-server-cuda_v13
superbuild_dir: build\local-superbuild-cuda_v13
superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=80'
expected_payload: lib\ollama\cuda_v13\ggml-cuda.dll
cuda-components:
- '"cudart"'
- '"nvcc"'
@@ -127,14 +196,26 @@ jobs:
- '"nvptxcompiler"'
cuda-version: '13.0'
- preset: ROCm
install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-WinSvr2022-For-HIP.exe
flags: '-DAMDGPU_TARGETS=gfx1010 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma" -DCMAKE_CXX_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma"'
install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-26.Q1-Win11-For-HIP.exe
rocm-version: '7.1'
superbuild_target: ollama-llama-server-rocm_v7_1
superbuild_dir: build\local-superbuild-rocm_v7_1
superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=rocm_v7_1 -DAMDGPU_TARGETS=gfx1010'
expected_payload: lib\ollama\rocm_v7_1\ggml-hip.dll
- preset: Vulkan
install: https://sdk.lunarg.com/sdk/download/1.4.321.1/windows/vulkansdk-windows-X64-1.4.321.1.exe
superbuild_target: ollama-llama-server-vulkan
superbuild_dir: build\local-superbuild-vulkan
superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=vulkan'
expected_payload: lib\ollama\vulkan\ggml-vulkan.dll
- preset: 'MLX CUDA 13'
install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
cudnn-install: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.18.1.3_cuda13-archive.zip
flags: '-DCMAKE_CUDA_ARCHITECTURES=80 -DMLX_CUDA_ARCHITECTURES=80-virtual'
superbuild_target: ollama-mlx-cuda_v13
superbuild_dir: build\local-superbuild-mlx-cuda_v13
superbuild_args: '-DOLLAMA_MLX_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=80 -DMLX_CUDA_ARCHITECTURES=80-virtual'
expected_payload: lib\ollama\mlx_cuda_v13\mlx.dll
install-go: true
cuda-components:
- '"cudart"'
- '"nvcc"'
@@ -203,6 +284,10 @@ jobs:
}
$vulkanPath = (Resolve-Path "C:\VulkanSDK\*").path
$vulkanRuntime = Join-Path $vulkanPath "Helpers\VulkanRT.exe"
if (Test-Path $vulkanRuntime) {
Start-Process -FilePath $vulkanRuntime -ArgumentList "/s" -NoNewWindow -Wait
}
echo "$vulkanPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "VULKAN_SDK=$vulkanPath" >> $env:GITHUB_ENV
- if: matrix.preset == 'MLX CUDA 13'
@@ -232,18 +317,44 @@ jobs:
C:\Program Files\NVIDIA\CUDNN
key: ${{ matrix.install }}-${{ matrix.cudnn-install }}
- uses: actions/checkout@v4
- if: matrix.superbuild_target == 'ollama-local' || matrix.install-go
uses: actions/setup-go@v5
with:
go-version-file: 'go.mod'
- uses: actions/cache@v4
with:
path: ${{ github.workspace }}\.ccache
key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.vendorsha }}
- run: |
key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.enginehash }}
- name: Build native superbuild
if: matrix.superbuild_target
run: |
$ErrorActionPreference = "Stop"
Import-Module 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\Tools\Microsoft.VisualStudio.DevShell.dll'
Enter-VsDevShell -VsInstallPath 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise' -SkipAutomaticLocation -DevCmdArguments '-arch=x64 -no_logo'
cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }}
cmake --build --preset "${{ matrix.preset }}" -- -l $([Environment]::ProcessorCount)
cmake -S . -B "${{ matrix.superbuild_dir }}" ${{ matrix.superbuild_args }}
$env:CMAKE_BUILD_PARALLEL_LEVEL = [Environment]::ProcessorCount
cmake --build "${{ matrix.superbuild_dir }}" --target "${{ matrix.superbuild_target }}" -- -l $([Environment]::ProcessorCount)
if (!(Test-Path "${{ matrix.superbuild_dir }}\${{ matrix.expected_payload }}")) {
throw "missing ${{ matrix.expected_payload }}"
}
env:
CMAKE_GENERATOR: Ninja
- name: Verify local superbuild install
if: matrix.superbuild_target == 'ollama-local'
run: |
$ErrorActionPreference = "Stop"
& ".\ollama.exe" --version
& "${{ matrix.superbuild_dir }}\lib\ollama\llama-server.exe" --version
if (!(Test-Path "${{ matrix.superbuild_dir }}\lib\ollama\llama-quantize.exe")) {
throw "missing llama-quantize.exe"
}
$installPrefix = Join-Path $env:RUNNER_TEMP "ollama-local"
cmake --install "${{ matrix.superbuild_dir }}" --component ollama-local --prefix "$installPrefix"
& "$installPrefix\bin\ollama.exe" --version
& "$installPrefix\lib\ollama\llama-server.exe" --version
if (!(Test-Path "$installPrefix\lib\ollama\llama-quantize.exe")) {
throw "missing installed llama-quantize.exe"
}
go_mod_tidy:
runs-on: ubuntu-latest
steps:
@@ -266,7 +377,9 @@ jobs:
go-version-file: 'go.mod'
cache-dependency-path: |
go.sum
Makefile.sync
LLAMA_CPP_VERSION
MLX_VERSION
MLX_C_VERSION
- uses: actions/setup-node@v4
with:
node-version: '20'
@@ -280,6 +393,17 @@ jobs:
if: ${{ startsWith(matrix.os, 'ubuntu') }}
working-directory: ./app/ui/app
run: npm test
- name: Verify MLX generated files are current
if: ${{ startsWith(matrix.os, 'ubuntu') }}
run: |
cmake -S . -B build/mlx-generate -DOLLAMA_MLX_BACKENDS=cuda_v13
cmake --build build/mlx-generate --target ollama-mlx-generate-wrappers
git diff --exit-code -- \
x/imagegen/mlx/mlx.h \
x/imagegen/mlx/mlx.c \
x/mlxrunner/mlx/generated.h \
x/mlxrunner/mlx/generated.c \
x/mlxrunner/mlx/include/mlx/c
- name: Run go generate
run: go generate ./...
@@ -294,12 +418,3 @@ jobs:
- uses: golangci/golangci-lint-action@v9
with:
only-new-issues: true
patches:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Verify patches apply cleanly and do not change files
run: |
make -f Makefile.sync clean checkout apply-patches sync
git diff --compact-summary --exit-code

@@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 3.21)
cmake_minimum_required(VERSION 3.24)
project(Ollama C CXX)
@@ -23,30 +23,23 @@ include(GNUInstallDirs)
find_package(Threads REQUIRED)
set(CMAKE_BUILD_TYPE Release)
set(BUILD_SHARED_LIBS ON)
if(NOT CMAKE_CONFIGURATION_TYPES AND NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
endif()
# These defaults can be overridden by presets (e.g., for static macOS llama-server builds)
if(NOT DEFINED BUILD_SHARED_LIBS)
set(BUILD_SHARED_LIBS ON)
endif()
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS ON) # Recent versions of MLX Requires gnu++17 extensions to compile properly
set(CMAKE_CXX_EXTENSIONS ON) # Recent versions of MLX require gnu++17 extensions to compile properly
set(GGML_BUILD ON)
set(GGML_SHARED ON)
set(GGML_CCACHE ON)
set(GGML_BACKEND_DL ON)
set(GGML_BACKEND_SHARED ON)
set(GGML_SCHED_MAX_COPIES 4)
set(GGML_LLAMAFILE ON)
set(GGML_CUDA_PEER_MAX_BATCH_SIZE 128)
set(GGML_CUDA_GRAPHS ON)
set(GGML_CUDA_FA ON)
set(GGML_CUDA_COMPRESSION_MODE default)
if((CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
OR (NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_SYSTEM_PROCESSOR MATCHES "arm|aarch64|ARM64|ARMv[0-9]+"))
set(GGML_CPU_ALL_VARIANTS ON)
endif()
# GGML backend for inference is provided by llama-server (built separately via
# llama/server/CMakeLists.txt using FetchContent from the pinned llama.cpp source).
# The root CMake project is the orchestration entrypoint; backend-specific
# build rules live in subprojects under cmake/.
if(APPLE)
set(CMAKE_BUILD_RPATH "@loader_path")
@@ -55,7 +48,8 @@ if(APPLE)
endif()
set(OLLAMA_BUILD_DIR ${CMAKE_BINARY_DIR}/lib/ollama)
set(OLLAMA_INSTALL_DIR ${CMAKE_INSTALL_PREFIX}/lib/ollama/${OLLAMA_RUNNER_DIR})
set(OLLAMA_LIB_DIR "lib/ollama" CACHE STRING "Install destination for Ollama runtime payloads")
set(OLLAMA_INSTALL_DIR ${OLLAMA_LIB_DIR}/${OLLAMA_RUNNER_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${OLLAMA_BUILD_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_DEBUG ${OLLAMA_BUILD_DIR})
@@ -64,314 +58,9 @@ set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${OLLAMA_BUILD_DIR})
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_DEBUG ${OLLAMA_BUILD_DIR})
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE ${OLLAMA_BUILD_DIR})
# Store ggml include paths for use with target_include_directories later.
# We avoid global include_directories() to prevent polluting the include path
# for other projects like MLX (whose openblas dependency has its own common.h).
set(GGML_INCLUDE_DIRS
${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src
${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/include
${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cpu
${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cpu/amx
)
add_compile_definitions(NDEBUG GGML_VERSION=0x0 GGML_COMMIT=0x0)
# Define GGML version variables for shared library SOVERSION
# These are required by ggml/src/CMakeLists.txt for proper library versioning
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 0)
set(GGML_VERSION_PATCH 0)
set(GGML_VERSION "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
set(GGML_CPU ON)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src)
set_property(TARGET ggml PROPERTY EXCLUDE_FROM_ALL TRUE)
get_target_property(CPU_VARIANTS ggml-cpu MANUALLY_ADDED_DEPENDENCIES)
if(NOT CPU_VARIANTS)
set(CPU_VARIANTS "ggml-cpu")
endif()
# Apply ggml include directories to ggml targets only (not globally)
target_include_directories(ggml-base PRIVATE ${GGML_INCLUDE_DIRS})
foreach(variant ${CPU_VARIANTS})
if(TARGET ${variant})
target_include_directories(${variant} PRIVATE ${GGML_INCLUDE_DIRS})
endif()
endforeach()
install(TARGETS ggml-base ${CPU_VARIANTS}
RUNTIME_DEPENDENCIES
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CPU
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CPU
FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CPU
)
check_language(CUDA)
if(CMAKE_CUDA_COMPILER)
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.24" AND NOT CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES "native")
endif()
find_package(CUDAToolkit)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cuda)
target_include_directories(ggml-cuda PRIVATE ${GGML_INCLUDE_DIRS})
install(TARGETS ggml-cuda
RUNTIME_DEPENDENCIES
DIRECTORIES ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR}
PRE_INCLUDE_REGEXES cublas cublasLt cudart
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CUDA
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CUDA
)
endif()
set(WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX "^gfx(908|90a|1200|1201):xnack[+-]$"
CACHE STRING
"Regular expression describing AMDGPU_TARGETS not supported on Windows. Override to force building these targets. Default \"^gfx(908|90a|1200|1201):xnack[+-]$\"."
)
check_language(HIP)
if(CMAKE_HIP_COMPILER)
set(HIP_PLATFORM "amd")
if(NOT AMDGPU_TARGETS)
find_package(hip REQUIRED)
list(FILTER AMDGPU_TARGETS INCLUDE REGEX "^gfx(94[012]|101[02]|1030|110[012]|120[01])$")
endif()
if(WIN32 AND WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX)
list(FILTER AMDGPU_TARGETS EXCLUDE REGEX ${WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX})
endif()
if(AMDGPU_TARGETS)
find_package(hip REQUIRED)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-hip)
target_include_directories(ggml-hip PRIVATE ${GGML_INCLUDE_DIRS})
if (WIN32)
target_compile_definitions(ggml-hip PRIVATE GGML_CUDA_NO_PEER_COPY)
endif()
target_compile_definitions(ggml-hip PRIVATE GGML_HIP_NO_VMM)
install(TARGETS ggml-hip
RUNTIME_DEPENDENCY_SET rocm
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
)
install(RUNTIME_DEPENDENCY_SET rocm
DIRECTORIES ${HIP_BIN_INSTALL_DIR} ${HIP_LIB_INSTALL_DIR}
PRE_INCLUDE_REGEXES hipblas rocblas amdhip64 rocsolver amd_comgr hsa-runtime64 rocsparse tinfo rocprofiler-register roctx64 rocroller drm drm_amdgpu numa elf
PRE_EXCLUDE_REGEXES ".*"
POST_EXCLUDE_REGEXES "system32"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
)
foreach(HIP_LIB_BIN_INSTALL_DIR IN ITEMS ${HIP_BIN_INSTALL_DIR} ${HIP_LIB_INSTALL_DIR})
if(EXISTS ${HIP_LIB_BIN_INSTALL_DIR}/rocblas)
install(DIRECTORY ${HIP_LIB_BIN_INSTALL_DIR}/rocblas DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP)
break()
endif()
endforeach()
endif()
endif()
if(NOT APPLE)
find_package(Vulkan)
if(Vulkan_FOUND)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-vulkan)
target_include_directories(ggml-vulkan PRIVATE ${GGML_INCLUDE_DIRS})
install(TARGETS ggml-vulkan
RUNTIME_DEPENDENCIES
PRE_INCLUDE_REGEXES vulkan
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT Vulkan
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT Vulkan
)
endif()
endif()
option(MLX_ENGINE "Enable MLX backend" OFF)
if(MLX_ENGINE)
message(STATUS "Setting up MLX (this takes a while...)")
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/x/imagegen/mlx)
# Find CUDA toolkit if MLX is built with CUDA support
find_package(CUDAToolkit)
# Build list of directories for runtime dependency resolution
set(MLX_RUNTIME_DIRS ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR})
# Add cuDNN bin paths for DLLs (Windows MLX CUDA builds)
# CUDNN_ROOT_DIR is the standard CMake variable for cuDNN location
if(DEFINED ENV{CUDNN_ROOT_DIR})
# cuDNN 9.x has versioned subdirectories under bin/ (e.g., bin/13.0/)
file(GLOB CUDNN_BIN_SUBDIRS "$ENV{CUDNN_ROOT_DIR}/bin/*")
list(APPEND MLX_RUNTIME_DIRS ${CUDNN_BIN_SUBDIRS})
endif()
# Add build output directory and MLX dependency build directories
list(APPEND MLX_RUNTIME_DIRS ${OLLAMA_BUILD_DIR})
# OpenBLAS DLL location (pre-built zip extracts into openblas-src/bin/)
list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/openblas-src/bin)
# NCCL: on Linux, if real NCCL is found, cmake bundles libnccl.so via the
# regex below. If NCCL is not found, MLX links a static stub (OBJECT lib)
# so there is no runtime dependency. This path covers the stub build dir
# for windows so we include the DLL in our dependencies.
list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/distributed/nccl/nccl_stub-prefix/src/nccl_stub-build/Release)
# Base regexes for runtime dependencies (cross-platform)
set(MLX_INCLUDE_REGEXES cublas cublasLt cudart cufft nvrtc nvrtc-builtins cudnn nccl openblas gfortran)
# On Windows, also include dl.dll (dlfcn-win32 POSIX emulation layer)
if(WIN32)
list(APPEND MLX_INCLUDE_REGEXES "^dl\\.dll$")
endif()
# Split mlx/mlxc libraries from runtime deps to avoid stripping deps
install(TARGETS mlx mlxc
RUNTIME_DEPENDENCY_SET mlx_runtime_deps
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
)
install(RUNTIME_DEPENDENCY_SET mlx_runtime_deps
DIRECTORIES ${MLX_RUNTIME_DIRS}
PRE_INCLUDE_REGEXES ${MLX_INCLUDE_REGEXES}
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
)
if(TARGET jaccl)
install(TARGETS jaccl
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
)
endif()
# Install the Metal library for macOS arm64 (must be colocated with the binary)
# Metal backend is only built for arm64, not x86_64
if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
install(FILES ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/backend/metal/kernels/mlx.metallib
DESTINATION ${OLLAMA_INSTALL_DIR}
COMPONENT MLX)
endif()
# Install headers for NVRTC JIT compilation at runtime.
# MLX's own install rules use the default component so they get skipped by
# --component MLX. Headers are installed alongside libmlx in OLLAMA_INSTALL_DIR.
#
# Layout:
# ${OLLAMA_INSTALL_DIR}/include/cccl/{cuda,nv}/ — CCCL headers
# ${OLLAMA_INSTALL_DIR}/include/*.h — CUDA toolkit headers
#
# MLX's jit_module.cpp resolves CCCL via
# current_binary_dir()[.parent_path()] / "include" / "cccl"
# On Linux, MLX's jit_module.cpp resolves CCCL via
# current_binary_dir().parent_path() / "include" / "cccl", so we create a
# symlink from lib/ollama/include -> ${OLLAMA_RUNNER_DIR}/include
# This will need refinement if we add multiple CUDA versions for MLX in the future.
# CUDA runtime headers are found via CUDA_PATH env var (set by mlxrunner).
if(EXISTS ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda)
install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda
DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
COMPONENT MLX)
install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/nv
DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
COMPONENT MLX)
if(NOT WIN32 AND NOT APPLE)
install(CODE "
set(_link \"${CMAKE_INSTALL_PREFIX}/lib/ollama/include\")
set(_target \"${OLLAMA_RUNNER_DIR}/include\")
if(NOT EXISTS \${_link})
execute_process(COMMAND \${CMAKE_COMMAND} -E create_symlink \${_target} \${_link})
endif()
" COMPONENT MLX)
endif()
endif()
# Install minimal CUDA toolkit headers needed by MLX JIT kernels.
# These are the transitive closure of includes from mlx/backend/cuda/device/*.cuh.
# The Go mlxrunner sets CUDA_PATH to OLLAMA_INSTALL_DIR so MLX finds them at
# $CUDA_PATH/include/*.h via NVRTC --include-path.
if(CUDAToolkit_FOUND)
# CUDAToolkit_INCLUDE_DIRS may be a semicolon-separated list
# (e.g. ".../include;.../include/cccl"). Find the entry that
# contains the CUDA runtime headers we need.
set(_cuda_inc "")
foreach(_dir ${CUDAToolkit_INCLUDE_DIRS})
if(EXISTS "${_dir}/cuda_runtime_api.h")
set(_cuda_inc "${_dir}")
break()
endif()
endforeach()
if(NOT _cuda_inc)
message(WARNING "Could not find cuda_runtime_api.h in CUDAToolkit_INCLUDE_DIRS: ${CUDAToolkit_INCLUDE_DIRS}")
else()
set(_dst "${OLLAMA_INSTALL_DIR}/include")
set(_MLX_JIT_CUDA_HEADERS
builtin_types.h
cooperative_groups.h
cuda_bf16.h
cuda_bf16.hpp
cuda_device_runtime_api.h
cuda_fp16.h
cuda_fp16.hpp
cuda_fp8.h
cuda_fp8.hpp
cuda_runtime_api.h
device_types.h
driver_types.h
math_constants.h
surface_types.h
texture_types.h
vector_functions.h
vector_functions.hpp
vector_types.h
)
foreach(_hdr ${_MLX_JIT_CUDA_HEADERS})
install(FILES "${_cuda_inc}/${_hdr}"
DESTINATION ${_dst}
COMPONENT MLX)
endforeach()
# Subdirectory headers
install(DIRECTORY "${_cuda_inc}/cooperative_groups"
DESTINATION ${_dst}
COMPONENT MLX
FILES_MATCHING PATTERN "*.h")
install(FILES "${_cuda_inc}/crt/host_defines.h"
DESTINATION "${_dst}/crt"
COMPONENT MLX)
endif()
endif()
# On Windows, explicitly install dl.dll (dlfcn-win32 POSIX dlopen emulation)
# RUNTIME_DEPENDENCIES auto-excludes it via POST_EXCLUDE_FILES_STRICT because
# dlfcn-win32 is a known CMake target with its own install rules (which install
# to the wrong destination). We must install it explicitly here.
if(WIN32)
install(FILES ${OLLAMA_BUILD_DIR}/dl.dll
DESTINATION ${OLLAMA_INSTALL_DIR}
COMPONENT MLX)
endif()
# Manually install CUDA runtime libraries that MLX loads via dlopen
# (not detected by RUNTIME_DEPENDENCIES since they aren't link-time deps)
if(CUDAToolkit_FOUND)
file(GLOB MLX_CUDA_LIBS
"${CUDAToolkit_LIBRARY_DIR}/libcudart.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcublas.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcublasLt.so*"
"${CUDAToolkit_LIBRARY_DIR}/libnvrtc.so*"
"${CUDAToolkit_LIBRARY_DIR}/libnvrtc-builtins.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcufft.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcudnn.so*")
if(MLX_CUDA_LIBS)
install(FILES ${MLX_CUDA_LIBS}
DESTINATION ${OLLAMA_INSTALL_DIR}
COMPONENT MLX_VENDOR)
endif()
endif()
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/llama/server/CMakeLists.txt")
set(OLLAMA_HAVE_LLAMA_SERVER TRUE)
else()
set(OLLAMA_HAVE_LLAMA_SERVER FALSE)
endif()
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/local.cmake)

@@ -11,109 +11,10 @@
}
},
{
"name": "CPU",
"inherits": [ "Default" ]
},
{
"name": "CUDA",
"inherits": [ "Default" ]
},
{
"name": "CUDA 11",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "50-virtual;60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-virtual;87-virtual;89-virtual;90-virtual",
"CMAKE_CUDA_FLAGS": "-Wno-deprecated-gpu-targets -t 2",
"OLLAMA_RUNNER_DIR": "cuda_v11"
}
},
{
"name": "CUDA 12",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "50;52;60;61;70;75;80;86;89;90;90a;120",
"CMAKE_CUDA_FLAGS": "-Wno-deprecated-gpu-targets -t 2",
"OLLAMA_RUNNER_DIR": "cuda_v12"
}
},
{
"name": "CUDA 13",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;87-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual",
"CMAKE_CUDA_FLAGS": "-t 2",
"OLLAMA_RUNNER_DIR": "cuda_v13"
}
},
{
"name": "JetPack 5",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "72;87",
"OLLAMA_RUNNER_DIR": "cuda_jetpack5"
}
},
{
"name": "JetPack 6",
"inherits": [ "CUDA" ],
"cacheVariables": {
"CMAKE_CUDA_ARCHITECTURES": "87",
"OLLAMA_RUNNER_DIR": "cuda_jetpack6"
}
},
{
"name": "ROCm",
"name": "MLX Metal",
"inherits": [ "Default" ],
"cacheVariables": {
"CMAKE_HIP_PLATFORM": "amd"
}
},
{
"name": "ROCm 6",
"inherits": [ "ROCm" ],
"cacheVariables": {
"CMAKE_HIP_FLAGS": "-parallel-jobs=4",
"AMDGPU_TARGETS": "gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-",
"OLLAMA_RUNNER_DIR": "rocm"
}
},
{
"name": "ROCm 7",
"inherits": [ "ROCm" ],
"cacheVariables": {
"CMAKE_HIP_FLAGS": "-parallel-jobs=4",
"AMDGPU_TARGETS": "gfx942;gfx950;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1103;gfx1150;gfx1151;gfx1200;gfx1201;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-",
"OLLAMA_RUNNER_DIR": "rocm"
}
},
{
"name": "Vulkan",
"inherits": [ "Default" ],
"cacheVariables": {
"OLLAMA_RUNNER_DIR": "vulkan"
}
},
{
"name": "MLX",
"inherits": [ "Default" ],
"cacheVariables": {
"MLX_ENGINE": "ON",
"OLLAMA_RUNNER_DIR": "mlx"
}
},
{
"name": "MLX CUDA 12",
"inherits": [ "MLX", "CUDA 12" ],
"cacheVariables": {
"OLLAMA_RUNNER_DIR": "mlx_cuda_v12"
}
},
{
"name": "MLX CUDA 13",
"inherits": [ "MLX", "CUDA 13" ],
"cacheVariables": {
"MLX_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual",
"OLLAMA_RUNNER_DIR": "mlx_cuda_v13"
"OLLAMA_MLX_BACKENDS": "metal_v3;metal_v4"
}
}
],
@@ -124,74 +25,9 @@
"configuration": "Release"
},
{
"name": "CPU",
"configurePreset": "Default",
"targets": [ "ggml-cpu" ]
},
{
"name": "CUDA",
"configurePreset": "CUDA",
"targets": [ "ggml-cuda" ]
},
{
"name": "CUDA 11",
"inherits": [ "CUDA" ],
"configurePreset": "CUDA 11"
},
{
"name": "CUDA 12",
"inherits": [ "CUDA" ],
"configurePreset": "CUDA 12"
},
{
"name": "CUDA 13",
"inherits": [ "CUDA" ],
"configurePreset": "CUDA 13"
},
{
"name": "JetPack 5",
"inherits": [ "CUDA" ],
"configurePreset": "JetPack 5"
},
{
"name": "JetPack 6",
"inherits": [ "CUDA" ],
"configurePreset": "JetPack 6"
},
{
"name": "ROCm",
"configurePreset": "ROCm",
"targets": [ "ggml-hip" ]
},
{
"name": "ROCm 6",
"inherits": [ "ROCm" ],
"configurePreset": "ROCm 6"
},
{
"name": "ROCm 7",
"inherits": [ "ROCm" ],
"configurePreset": "ROCm 7"
},
{
"name": "Vulkan",
"targets": [ "ggml-vulkan" ],
"configurePreset": "Vulkan"
},
{
"name": "MLX",
"targets": [ "mlx", "mlxc" ],
"configurePreset": "MLX"
},
{
"name": "MLX CUDA 12",
"targets": [ "mlx", "mlxc" ],
"configurePreset": "MLX CUDA 12"
},
{
"name": "MLX CUDA 13",
"targets": [ "mlx", "mlxc" ],
"configurePreset": "MLX CUDA 13"
"name": "MLX Metal",
"targets": [ "ollama-mlx-backends" ],
"configurePreset": "MLX Metal"
}
]
}

@@ -37,116 +37,150 @@ RUN dnf install -y unzip \
ENV CMAKE_GENERATOR=Ninja
ENV LDFLAGS=-s
FROM base AS cpu
#
# GPU toolchain stages — provide compilers for llama-server GPU builds
#
FROM base AS cpu-deps
RUN dnf install -y gcc-toolset-11-gcc gcc-toolset-11-gcc-c++
ENV PATH=/opt/rh/gcc-toolset-11/root/usr/bin:$PATH
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CPU' \
&& cmake --build --preset 'CPU' -- -l $(nproc) \
&& cmake --install build --component CPU --strip
FROM base AS cuda-11
ARG CUDA11VERSION=11.8
RUN dnf install -y cuda-toolkit-${CUDA11VERSION//./-}
ENV PATH=/usr/local/cuda-11/bin:$PATH
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CUDA 11' \
&& cmake --build --preset 'CUDA 11' -- -l $(nproc) \
&& cmake --install build --component CUDA --strip
FROM base AS cuda-12
FROM base AS cuda-12-deps
ARG CUDA12VERSION=12.8
RUN dnf install -y cuda-toolkit-${CUDA12VERSION//./-}
ENV PATH=/usr/local/cuda-12/bin:$PATH
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CUDA 12' \
&& cmake --build --preset 'CUDA 12' -- -l $(nproc) \
&& cmake --install build --component CUDA --strip
FROM base AS cuda-13
FROM base AS cuda-13-deps
ARG CUDA13VERSION=13.0
RUN dnf install -y cuda-toolkit-${CUDA13VERSION//./-}
ENV PATH=/usr/local/cuda-13/bin:$PATH
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'CUDA 13' \
&& cmake --build --preset 'CUDA 13' -- -l $(nproc) \
&& cmake --install build --component CUDA --strip
FROM base AS rocm-7-deps
ENV PATH=/opt/rocm/llvm/bin:/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/opt/rocm/bin:$PATH
FROM base AS rocm-7
ENV PATH=/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/opt/rocm/bin:/opt/rocm/hcc/bin:$PATH
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'ROCm 7' \
&& cmake --build --preset 'ROCm 7' -- -l $(nproc) \
&& cmake --install build --component HIP --strip
RUN rm -f dist/lib/ollama/rocm/rocblas/library/*gfx90[06]*
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK5VERSION} AS jetpack-5
ARG CMAKEVERSION
ARG NINJAVERSION
RUN apt-get update && apt-get install -y curl ccache unzip \
&& curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
&& curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
&& unzip /tmp/ninja.zip -d /usr/local/bin \
&& rm /tmp/ninja.zip
ENV CMAKE_GENERATOR=Ninja
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'JetPack 5' \
&& cmake --build --preset 'JetPack 5' -- -l $(nproc) \
&& cmake --install build --component CUDA --strip
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK6VERSION} AS jetpack-6
ARG CMAKEVERSION
ARG NINJAVERSION
RUN apt-get update && apt-get install -y curl ccache unzip \
&& curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
&& curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
&& unzip /tmp/ninja.zip -d /usr/local/bin \
&& rm /tmp/ninja.zip
ENV CMAKE_GENERATOR=Ninja
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'JetPack 6' \
&& cmake --build --preset 'JetPack 6' -- -l $(nproc) \
&& cmake --install build --component CUDA --strip
FROM base AS vulkan
FROM base AS vulkan-deps
ARG VULKANVERSION
RUN ln -s /usr/bin/python3 /usr/bin/python \
&& wget https://sdk.lunarg.com/sdk/download/${VULKANVERSION}/linux/vulkansdk-linux-x86_64-${VULKANVERSION}.tar.xz -O /tmp/vulkansdk.tar.xz \
&& tar xvf /tmp/vulkansdk.tar.xz -C /tmp \
&& /tmp/${VULKANVERSION}/vulkansdk -j 8 vulkan-headers \
&& /tmp/${VULKANVERSION}/vulkansdk -j 8 spirv-headers \
&& /tmp/${VULKANVERSION}/vulkansdk -j 8 shaderc \
&& cp -r /tmp/${VULKANVERSION}/x86_64/include/* /usr/local/include/ \
&& cp -r /tmp/${VULKANVERSION}/x86_64/lib/* /usr/local/lib \
&& cp -r /tmp/${VULKANVERSION}/x86_64/share/* /usr/local/share/ \
&& cp -r /tmp/${VULKANVERSION}/x86_64/bin/* /usr/local/bin/ \
&& rm -rf /tmp/${VULKANVERSION} /tmp/vulkansdk.tar.xz
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
ENV VULKAN_SDK=/usr/local
#
# llama-server stages — rebuild when LLAMA_CPP_VERSION, llama/server/, or llama/compat/ changes.
#
# CPU stage: llama-server + ggml-base + ggml-cpu variants → lib/ollama/
# GPU stages: GPU backend .so only → lib/ollama/<variant>/
#
FROM cpu-deps AS llama-server-cpu
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake --preset 'Vulkan' \
&& cmake --build --preset 'Vulkan' -- -l $(nproc) \
&& cmake --install build --component Vulkan --strip
cmake -S llama/server --preset cpu \
&& cmake --build build/llama-server-cpu -- -l $(nproc) \
&& cmake --install build/llama-server-cpu --component llama-server --strip \
&& for lib in \
/usr/lib64/libgomp.so* \
/usr/lib64/libomp.so* \
/opt/rh/gcc-toolset-11/root/usr/lib64/libgomp.so* \
/opt/rh/gcc-toolset-11/root/usr/lib64/libomp.so*; do \
[ -e "$lib" ] && cp -a "$lib" dist/lib/ollama/ || true; \
done
FROM cuda-12-deps AS llama-server-cuda_v12
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake -S llama/server --preset llama_cuda_v12_linux \
&& cmake --build build/llama-server-cuda_v12 -- -l $(nproc) \
&& cmake --install build/llama-server-cuda_v12 --component llama-server --strip
FROM cuda-13-deps AS llama-server-cuda_v13
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake -S llama/server --preset llama_cuda_v13_linux \
&& cmake --build build/llama-server-cuda_v13 -- -l $(nproc) \
&& cmake --install build/llama-server-cuda_v13 --component llama-server --strip
FROM rocm-7-deps AS llama-server-rocm_v7_2
ENV CC=clang CXX=clang++
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake -S llama/server --preset rocm_v7_2_linux \
&& cmake --build build/llama-server-rocm_v7_2 -- -l $(nproc) \
&& cmake --install build/llama-server-rocm_v7_2 --component llama-server --strip
RUN rm -f dist/lib/ollama/rocm_v7_2/rocblas/library/*gfx90[06]*
FROM vulkan-deps AS llama-server-vulkan
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake -S llama/server --preset vulkan \
&& cmake --build build/llama-server-vulkan -- -l $(nproc) \
&& cmake --install build/llama-server-vulkan --component llama-server --strip
#
# JetPack stages — self-contained with their own base images
#
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK5VERSION} AS jetpack-5
ARG CMAKEVERSION
ARG NINJAVERSION
RUN apt-get update && apt-get install -y curl ccache git unzip \
&& curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
&& curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
&& unzip /tmp/ninja.zip -d /usr/local/bin \
&& rm /tmp/ninja.zip
ENV CMAKE_GENERATOR=Ninja
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake -S llama/server --preset llama_cuda_jetpack5 \
&& cmake --build build/llama-server-cuda_jetpack5 -- -l $(nproc) \
&& cmake --install build/llama-server-cuda_jetpack5 --component llama-server --strip
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK6VERSION} AS jetpack-6
ARG CMAKEVERSION
ARG NINJAVERSION
RUN apt-get update && apt-get install -y curl ccache git unzip \
&& curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
&& curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
&& unzip /tmp/ninja.zip -d /usr/local/bin \
&& rm /tmp/ninja.zip
ENV CMAKE_GENERATOR=Ninja
COPY LLAMA_CPP_VERSION .
COPY llama/server llama/server
COPY llama/compat llama/compat
RUN --mount=type=cache,target=/root/.ccache \
cmake -S llama/server --preset llama_cuda_jetpack6 \
&& cmake --build build/llama-server-cuda_jetpack6 -- -l $(nproc) \
&& cmake --install build/llama-server-cuda_jetpack6 --component llama-server --strip
#
# MLX stage
#
FROM base AS mlx
ARG CUDA13VERSION=13.0
# OLLAMA_MLX_BUILD_JOBS empty -> ninja gates by load average (-l $(nproc))
ARG OLLAMA_MLX_BUILD_JOBS=
ARG OLLAMA_MLX_NVCC_THREADS=2
ARG MLX_CUDA_RAM_MB=
RUN dnf install -y cuda-toolkit-${CUDA13VERSION//./-} \
&& dnf install -y openblas-devel lapack-devel \
&& dnf install -y libcudnn9-cuda-13 libcudnn9-devel-cuda-13 \
@@ -157,7 +191,7 @@ ENV LAPACK_INCLUDE_DIRS=/usr/include/openblas
ENV CGO_LDFLAGS="-L/usr/local/cuda-13/lib64 -L/usr/local/cuda-13/targets/x86_64-linux/lib/stubs"
WORKDIR /go/src/github.com/ollama/ollama
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
COPY cmake cmake
COPY x/imagegen/mlx x/imagegen/mlx
COPY go.mod go.sum .
COPY MLX_VERSION MLX_C_VERSION .
@@ -173,10 +207,12 @@ RUN --mount=type=cache,target=/root/.ccache \
&& if [ -f /tmp/local-mlx-c/CMakeLists.txt ]; then \
export OLLAMA_MLX_C_SOURCE=/tmp/local-mlx-c; \
fi \
&& cmake --preset 'MLX CUDA 13' -DBLAS_INCLUDE_DIRS=/usr/include/openblas -DLAPACK_INCLUDE_DIRS=/usr/include/openblas -DCMAKE_CUDA_FLAGS="-t ${OLLAMA_MLX_NVCC_THREADS}" \
&& cmake --build --preset 'MLX CUDA 13' -- -l $(nproc) ${OLLAMA_MLX_BUILD_JOBS:+-j ${OLLAMA_MLX_BUILD_JOBS}} \
&& cmake --install build --component MLX --strip \
&& cmake --install build --component MLX_VENDOR
&& cmake -S . -B build/mlx_cuda_v13 -DOLLAMA_MLX_BACKENDS=cuda_v13 -DBLAS_INCLUDE_DIRS=/usr/include/openblas -DLAPACK_INCLUDE_DIRS=/usr/include/openblas -DCMAKE_CUDA_FLAGS="-t ${OLLAMA_MLX_NVCC_THREADS}" ${MLX_CUDA_RAM_MB:+-DMLX_CUDA_RAM_MB=${MLX_CUDA_RAM_MB}} -DOLLAMA_PAYLOAD_INSTALL_PREFIX=/go/src/github.com/ollama/ollama/dist \
&& cmake --build build/mlx_cuda_v13 --target ollama-mlx-cuda_v13 -- -l $(nproc) ${OLLAMA_MLX_BUILD_JOBS:+-j ${OLLAMA_MLX_BUILD_JOBS}}
#
# Go build
#
FROM base AS build
WORKDIR /go/src/github.com/ollama/ollama
@@ -194,38 +230,59 @@ ENV CGO_CXXFLAGS="${CGO_CXXFLAGS}"
RUN --mount=type=cache,target=/root/.cache/go-build \
go build -trimpath -buildmode=pie -o /bin/ollama .
#
# Assembly stages — combine llama-server variants + GPU runtime libs
#
FROM --platform=linux/amd64 scratch AS amd64
# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
COPY --from=cuda-13 dist/lib/ollama /lib/ollama/
COPY --from=vulkan dist/lib/ollama /lib/ollama/
COPY --from=llama-server-cpu dist/lib/ollama /lib/ollama/
COPY --from=llama-server-cuda_v12 dist/lib/ollama /lib/ollama/
COPY --from=llama-server-cuda_v13 dist/lib/ollama /lib/ollama/
COPY --from=llama-server-vulkan dist/lib/ollama /lib/ollama/
COPY --from=mlx /go/src/github.com/ollama/ollama/dist/lib/ollama /lib/ollama/
FROM --platform=linux/arm64 scratch AS arm64
# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
COPY --from=cuda-13 dist/lib/ollama/ /lib/ollama/
COPY --from=llama-server-cpu dist/lib/ollama /lib/ollama/
COPY --from=llama-server-cuda_v12 dist/lib/ollama /lib/ollama/
COPY --from=llama-server-cuda_v13 dist/lib/ollama /lib/ollama/
COPY --from=jetpack-5 dist/lib/ollama/ /lib/ollama/
COPY --from=jetpack-6 dist/lib/ollama/ /lib/ollama/
FROM scratch AS rocm
COPY --from=rocm-7 dist/lib/ollama /lib/ollama
COPY --from=llama-server-cpu dist/lib/ollama /lib/ollama
COPY --from=llama-server-rocm_v7_2 dist/lib/ollama /lib/ollama
FROM ${FLAVOR} AS archive
COPY --from=cpu dist/lib/ollama /lib/ollama
FROM --platform=linux/amd64 scratch AS amd64-archive
COPY --from=amd64 /lib/ollama /lib/ollama/
COPY --from=llama-server-rocm_v7_2 dist/lib/ollama /lib/ollama/
FROM --platform=linux/arm64 scratch AS arm64-archive
COPY --from=arm64 /lib/ollama /lib/ollama/
FROM ${TARGETARCH}-archive AS archive
COPY --from=build /bin/ollama /bin/ollama
FROM ${FLAVOR} AS image-archive
COPY --from=build /bin/ollama /bin/ollama
FROM ubuntu:24.04
ARG APT_MIRROR=http://archive.ubuntu.com/ubuntu
RUN sed -i "s|http://archive.ubuntu.com/ubuntu|$APT_MIRROR|g" /etc/apt/sources.list.d/ubuntu.sources \
ARG APT_PORTS_MIRROR=http://ports.ubuntu.com/ubuntu-ports
RUN sed -i \
-e "s|http://archive.ubuntu.com/ubuntu|$APT_MIRROR|g" \
-e "s|http://ports.ubuntu.com/ubuntu-ports|$APT_PORTS_MIRROR|g" \
/etc/apt/sources.list.d/ubuntu.sources \
&& apt-get update \
&& apt-get install -y ca-certificates libvulkan1 libopenblas0 \
&& sed -i "s|$APT_MIRROR|http://archive.ubuntu.com/ubuntu|g" /etc/apt/sources.list.d/ubuntu.sources \
&& sed -i \
-e "s|$APT_MIRROR|http://archive.ubuntu.com/ubuntu|g" \
-e "s|$APT_PORTS_MIRROR|http://ports.ubuntu.com/ubuntu-ports|g" \
/etc/apt/sources.list.d/ubuntu.sources \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
COPY --from=archive /bin /usr/bin
COPY --from=image-archive /bin /usr/bin
ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
COPY --from=archive /lib/ollama /usr/lib/ollama
COPY --from=image-archive /lib/ollama /usr/lib/ollama
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_VISIBLE_DEVICES=all

LLAMA_CPP_VERSION (new file, 1 line)

@@ -0,0 +1 @@
b9409

@@ -1 +1 @@
e8ebdebeeb655feaa85a51f6b24ece5b6d5518d1
2165dc08d7b33258260aa849d39f087d50e62962

@@ -1,76 +0,0 @@
UPSTREAM=https://github.com/ggml-org/llama.cpp.git
WORKDIR=llama/vendor
FETCH_HEAD=ec98e2002
.PHONY: help
help:
@echo "Available targets:"
@echo " sync Sync with upstream repositories"
@echo " checkout Checkout upstream repository"
@echo " apply-patches Apply patches to local repository"
@echo " format-patches Format patches from local repository"
@echo " clean Clean local repository"
@echo
@echo "Example:"
@echo " make -f $(lastword $(MAKEFILE_LIST)) clean apply-patches sync"
.PHONY: sync
sync: llama/build-info.cpp ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal
llama/build-info.cpp: llama/build-info.cpp.in llama/llama.cpp
sed -e 's|@FETCH_HEAD@|$(FETCH_HEAD)|' <$< >$@
ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal: ml/backend/ggml/ggml
go generate ./$(@D)
.PHONY: llama/llama.cpp
llama/llama.cpp: llama/vendor
rsync -arvzc --delete -f "include LICENSE" -f "merge $@/.rsync-filter" $(addprefix $<,/LICENSE /) $@
.PHONY: ml/backend/ggml/ggml
ml/backend/ggml/ggml: llama/vendor
rsync -arvzc --delete -f "include LICENSE" -f "merge $@/.rsync-filter" $(addprefix $<,/LICENSE /ggml/) $@
PATCHES=$(wildcard llama/patches/*.patch)
PATCHED=$(join $(dir $(PATCHES)), $(addsuffix ed, $(addprefix ., $(notdir $(PATCHES)))))
.PHONY: apply-patches
.NOTPARALLEL:
apply-patches: $(PATCHED)
llama/patches/.%.patched: llama/patches/%.patch
@if git -c user.name=nobody -c 'user.email=<>' -C $(WORKDIR) am -3 $(realpath $<); then \
touch $@; \
else \
echo "Patch failed. Resolve any conflicts then continue."; \
echo "1. Run 'git -C $(WORKDIR) am --continue'"; \
echo "2. Run 'make -f $(lastword $(MAKEFILE_LIST)) format-patches'"; \
echo "3. Run 'make -f $(lastword $(MAKEFILE_LIST)) clean apply-patches'"; \
exit 1; \
fi
.PHONY: checkout
checkout: $(WORKDIR)
git -C $(WORKDIR) fetch
git -C $(WORKDIR) checkout -f $(FETCH_HEAD)
$(WORKDIR):
git clone $(UPSTREAM) $(WORKDIR)
.PHONY: format-patches
format-patches: llama/patches
git -C $(WORKDIR) format-patch \
--no-signature \
--no-numbered \
--zero-commit \
-o $(realpath $<) \
$(FETCH_HEAD)
.PHONY: clean
clean: checkout
@git -C $(WORKDIR) am --abort || true
$(RM) llama/patches/.*.patched
.PHONY: print-base
print-base:
@echo $(FETCH_HEAD)

@@ -259,6 +259,10 @@ func (c *Client) stream(ctx context.Context, method, path string, data any, fn f
}
}
if err := scanner.Err(); err != nil {
return err
}
return nil
}

@@ -3,6 +3,7 @@ package api
import (
"encoding/json"
"fmt"
"io"
"net/http"
"net/http/httptest"
"net/url"
@@ -192,6 +193,35 @@ func TestClientStream(t *testing.T) {
}
}
func TestClientStreamReportsReadErrors(t *testing.T) {
client := NewClient(
&url.URL{Scheme: "http", Host: "example.com"},
&http.Client{Transport: roundTripFunc(func(*http.Request) (*http.Response, error) {
body := failingReader{
data: []byte(`{"message":{"content":"partial"}}` + "\n"),
err: io.ErrUnexpectedEOF,
}
return &http.Response{
StatusCode: http.StatusOK,
Status: "200 OK",
Body: io.NopCloser(&body),
Header: make(http.Header),
}, nil
})},
)
err := client.stream(t.Context(), http.MethodPost, "/api/chat", nil, func([]byte) error {
return nil
})
if err == nil {
t.Fatal("expected stream read error")
}
if !strings.Contains(err.Error(), io.ErrUnexpectedEOF.Error()) {
t.Fatalf("expected unexpected EOF, got %v", err)
}
}
func TestClientDo(t *testing.T) {
testCases := []struct {
name string
@@ -320,3 +350,23 @@ func TestClientDo(t *testing.T) {
})
}
}
type roundTripFunc func(*http.Request) (*http.Response, error)
func (f roundTripFunc) RoundTrip(req *http.Request) (*http.Response, error) {
return f(req)
}
type failingReader struct {
data []byte
err error
}
func (r *failingReader) Read(p []byte) (int, error) {
if len(r.data) > 0 {
n := copy(p, r.data)
r.data = r.data[n:]
return n, nil
}
return 0, r.err
}

@@ -600,12 +600,13 @@ type Options struct {
// Runner options which must be set when the model is loaded into memory
type Runner struct {
NumCtx int `json:"num_ctx,omitempty"`
NumBatch int `json:"num_batch,omitempty"`
NumGPU int `json:"num_gpu,omitempty"`
MainGPU int `json:"main_gpu,omitempty"`
UseMMap *bool `json:"use_mmap,omitempty"`
NumThread int `json:"num_thread,omitempty"`
NumCtx int `json:"num_ctx,omitempty"`
NumBatch int `json:"num_batch,omitempty"`
NumGPU int `json:"num_gpu,omitempty"`
MainGPU *int `json:"main_gpu,omitempty"`
UseMMap *bool `json:"use_mmap,omitempty"`
NumThread int `json:"num_thread,omitempty"`
DraftNumPredict int `json:"draft_num_predict,omitempty"`
}
// EmbedRequest is the request passed to [Client.Embed].
@@ -672,6 +673,9 @@ type CreateRequest struct {
// Quantize is the quantization format for the model; leave blank to not change the quantization level.
Quantize string `json:"quantize,omitempty"`
// DraftQuantize is the quantization format for the draft model.
DraftQuantize string `json:"draft_quantize,omitempty"`
// From is the name of the model or file to use as the source.
From string `json:"from,omitempty"`
@@ -681,6 +685,9 @@ type CreateRequest struct {
// Files is a map of files include when creating the model.
Files map[string]string `json:"files,omitempty"`
// DraftFiles is a map of draft model files to include when creating the model.
DraftFiles map[string]string `json:"draft_files,omitempty"`
// Adapters is a map of LoRA adapters to include when creating the model.
Adapters map[string]string `json:"adapters,omitempty"`
@@ -1049,14 +1056,25 @@ func (opts *Options) FromMap(m map[string]any) error {
}
field.Set(reflect.ValueOf(slice))
case reflect.Pointer:
var b bool
if field.Type() == reflect.TypeOf(&b) {
switch field.Type().Elem().Kind() {
case reflect.Bool:
val, ok := val.(bool)
if !ok {
return fmt.Errorf("option %q must be of type boolean", key)
}
field.Set(reflect.ValueOf(&val))
} else {
case reflect.Int:
var i int
switch t := val.(type) {
case int64:
i = int(t)
case float64:
i = int(t)
default:
return fmt.Errorf("option %q must be of type integer", key)
}
field.Set(reflect.ValueOf(&i))
default:
return fmt.Errorf("unknown type loading config params: %v %v", field.Kind(), field.Type())
}
default:
@@ -1089,11 +1107,12 @@ func DefaultOptions() Options {
Runner: Runner{
// options set when the model is loaded
NumCtx: int(envconfig.ContextLength()),
NumBatch: 512,
NumGPU: -1, // -1 here indicates that NumGPU should be set dynamically
NumThread: 0, // let the runtime decide
UseMMap: nil,
NumCtx: int(envconfig.ContextLength()),
NumBatch: 512,
NumGPU: -1, // -1 here indicates that NumGPU should be set dynamically
NumThread: 0, // let the runtime decide
DraftNumPredict: 4,
UseMMap: nil,
},
}
}
@@ -1297,14 +1316,20 @@ func FormatParams(params map[string][]string) (map[string]any, error) {
// TODO: only string slices are supported right now
out[key] = vals
case reflect.Pointer:
var b bool
if field.Type() == reflect.TypeOf(&b) {
switch field.Type().Elem().Kind() {
case reflect.Bool:
boolVal, err := strconv.ParseBool(vals[0])
if err != nil {
return nil, fmt.Errorf("invalid bool value %s", vals)
}
out[key] = &boolVal
} else {
case reflect.Int:
intVal, err := strconv.ParseInt(vals[0], 10, 64)
if err != nil {
return nil, fmt.Errorf("invalid int value %s", vals)
}
out[key] = intVal
default:
return nil, fmt.Errorf("unknown type %s for %s", field.Kind(), key)
}
default:
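
Changing `MainGPU` from `int` to `*int` lets callers tell "not provided" apart from an explicit `0`. The sketch below illustrates that idea under simplifying assumptions: `parseMainGPU` is a hypothetical helper, not part of the package, and it handles only the `float64` case that `encoding/json` produces when decoding numbers into `any`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseMainGPU is a hypothetical helper: it decodes a request body into a
// generic map and converts "main_gpu" to *int, returning nil when the key
// is absent so that an explicit 0 stays distinguishable from "unset".
func parseMainGPU(body string) (*int, error) {
	var m map[string]any
	if err := json.Unmarshal([]byte(body), &m); err != nil {
		return nil, err
	}
	val, ok := m["main_gpu"]
	if !ok {
		return nil, nil
	}
	// encoding/json decodes JSON numbers into float64 when the target is any.
	f, ok := val.(float64)
	if !ok {
		return nil, fmt.Errorf("option %q must be of type integer", "main_gpu")
	}
	i := int(f)
	return &i, nil
}

func main() {
	for _, body := range []string{`{}`, `{"main_gpu": 0}`, `{"main_gpu": 1}`} {
		gpu, err := parseMainGPU(body)
		switch {
		case err != nil:
			fmt.Println(body, "-> error:", err)
		case gpu == nil:
			fmt.Println(body, "-> unset")
		default:
			fmt.Println(body, "->", *gpu)
		}
	}
}
```

With a plain `int` field, the first two inputs would both end up as `0`, so the runner could not tell whether the user asked for GPU 0 or left the choice to the scheduler.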

@@ -20,6 +20,10 @@ func testPropsMap(m map[string]ToolProperty) *ToolPropertiesMap {
return props
}
func testIntPtr(v int) *int {
return &v
}
// testArgs creates ToolCallFunctionArguments from a map (convenience function for tests, order not preserved)
func testArgs(m map[string]any) ToolCallFunctionArguments {
args := NewToolCallFunctionArguments()
@@ -168,6 +172,47 @@ func TestUseMmapParsingFromJSON(t *testing.T) {
}
}
func TestMainGPUParsingFromJSON(t *testing.T) {
tests := []struct {
name string
req string
wantGPU *int
}{
{
name: "Undefined",
req: `{}`,
},
{
name: "Zero",
req: `{ "main_gpu": 0 }`,
wantGPU: testIntPtr(0),
},
{
name: "Nonzero",
req: `{ "main_gpu": 1 }`,
wantGPU: testIntPtr(1),
},
}
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
var oMap map[string]any
err := json.Unmarshal([]byte(test.req), &oMap)
require.NoError(t, err)
opts := DefaultOptions()
err = opts.FromMap(oMap)
require.NoError(t, err)
if test.wantGPU == nil {
assert.Nil(t, opts.MainGPU)
} else if assert.NotNil(t, opts.MainGPU) {
assert.Equal(t, *test.wantGPU, *opts.MainGPU)
}
})
}
}
func TestUseMmapFormatParams(t *testing.T) {
tr := true
fa := false
@@ -232,6 +277,12 @@ func TestUseMmapFormatParams(t *testing.T) {
}
}
func TestMainGPUFormatParams(t *testing.T) {
resp, err := FormatParams(map[string][]string{"main_gpu": {"0"}})
require.NoError(t, err)
assert.Equal(t, int64(0), resp["main_gpu"])
}
func TestMessage_UnmarshalJSON(t *testing.T) {
tests := []struct {
input string

@@ -90,9 +90,8 @@ DialogFontSize=12
[Files]
#if FileExists("..\dist\windows-ollama-app-amd64.exe")
Source: "..\dist\windows-ollama-app-amd64.exe"; DestDir: "{app}"; DestName: "{#MyAppExeName}" ;Check: not IsArm64(); Flags: ignoreversion 64bit; BeforeInstall: TaskKill('{#MyAppExeName}')
Source: "..\dist\windows-amd64\vc_redist.x64.exe"; DestDir: "{tmp}"; Check: not IsArm64() and vc_redist_needed(); Flags: deleteafterinstall
Source: "..\dist\windows-amd64\ollama.exe"; DestDir: "{app}"; Check: not IsArm64(); Flags: ignoreversion 64bit; BeforeInstall: TaskKill('ollama.exe')
Source: "..\dist\windows-amd64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
Source: "..\dist\windows-amd64\lib\ollama\*"; Excludes: "\mlx_*\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
#endif
; For local development, rely on binary compatibility at runtime since we can't cross compile
@@ -103,9 +102,11 @@ Source: "..\dist\windows-ollama-app-amd64.exe"; DestDir: "{app}"; DestName: "{#M
#endif
#if FileExists("..\dist\windows-arm64\ollama.exe")
Source: "..\dist\windows-arm64\vc_redist.arm64.exe"; DestDir: "{tmp}"; Check: IsArm64() and vc_redist_needed(); Flags: deleteafterinstall
Source: "..\dist\windows-arm64\ollama.exe"; DestDir: "{app}"; Check: IsArm64(); Flags: ignoreversion 64bit; BeforeInstall: TaskKill('ollama.exe')
#endif
#if DirExists("..\dist\windows-arm64\lib\ollama")
Source: "..\dist\windows-arm64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: IsArm64(); Flags: ignoreversion 64bit recursesubdirs
#endif
Source: ".\assets\app.ico"; DestDir: "{app}"; Flags: ignoreversion
@@ -118,12 +119,6 @@ Name: "{userprograms}\{#MyAppName}"; Filename: "{app}\{#MyAppExeName}"; IconFile
Type: files; Name: "{%LOCALAPPDATA}\Ollama\updates"
[Run]
#if DirExists("..\dist\windows-arm64")
Filename: "{tmp}\vc_redist.arm64.exe"; Parameters: "/install /passive /norestart"; Check: IsArm64() and vc_redist_needed(); StatusMsg: "Installing VC++ Redistributables..."; Flags: waituntilterminated
#endif
#if DirExists("..\dist\windows-amd64")
Filename: "{tmp}\vc_redist.x64.exe"; Parameters: "/install /passive /norestart"; Check: not IsArm64() and vc_redist_needed(); StatusMsg: "Installing VC++ Redistributables..."; Flags: waituntilterminated
#endif
Filename: "{cmd}"; Parameters: "/C set PATH={app};%PATH% & ""{app}\{#MyAppExeName}"""; Flags: postinstall nowait runhidden
[UninstallRun]
@@ -184,46 +179,6 @@ begin
Result := Pos(';' + ExpandConstant(Param) + ';', ';' + OrigPath + ';') = 0;
end;
{ --- VC Runtime libraries discovery code - Only install vc_redist if it isn't already installed ----- }
const VCRTL_MIN_V1 = 14;
const VCRTL_MIN_V2 = 40;
const VCRTL_MIN_V3 = 33807;
const VCRTL_MIN_V4 = 0;
// check if the minimum required vc redist is installed (by looking the registry)
function vc_redist_needed (): Boolean;
var
sRegKey: string;
v1: Cardinal;
v2: Cardinal;
v3: Cardinal;
v4: Cardinal;
begin
if (IsArm64()) then begin
sRegKey := 'SOFTWARE\WOW6432Node\Microsoft\VisualStudio\14.0\VC\Runtimes\arm64';
end else begin
sRegKey := 'SOFTWARE\Microsoft\VisualStudio\14.0\VC\Runtimes\x64';
end;
if (RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Major', v1) and
RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Minor', v2) and
RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Bld', v3) and
RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'RBld', v4)) then
begin
Log ('VC Redist version: ' + IntToStr (v1) +
'.' + IntToStr (v2) + '.' + IntToStr (v3) +
'.' + IntToStr (v4));
{ Version info was found. Return true if later or equal to our
minimal required version RTL_MIN_Vx }
Result := not (
(v1 > VCRTL_MIN_V1) or ((v1 = VCRTL_MIN_V1) and
((v2 > VCRTL_MIN_V2) or ((v2 = VCRTL_MIN_V2) and
((v3 > VCRTL_MIN_V3) or ((v3 = VCRTL_MIN_V3) and
(v4 >= VCRTL_MIN_V4)))))));
end
else
Result := TRUE;
end;
function GetDirSize(Path: String): Int64;
var
FindRec: TFindRec;

cmake/local.cmake (new file, 691 lines)

@@ -0,0 +1,691 @@
# Local Ollama superbuild targets.
#
# This file keeps the repository-root CMake project focused on orchestration:
# it builds a runnable local Ollama payload by delegating llama.cpp work to the
# llama/server CMake project and building the Go binary into a matching layout.
include(ExternalProject)
set(OLLAMA_LLAMA_BACKENDS "" CACHE STRING
"Semicolon-separated llama-server GPU backends to build: cuda_v12;cuda_v13;rocm_v7_1;rocm_v7_2;vulkan;cuda_jetpack5;cuda_jetpack6")
set(_ollama_mlx_backends_doc "Semicolon-separated MLX backends to build: cuda_v13;metal_v3;metal_v4")
set(OLLAMA_VERSION "0.0.0" CACHE STRING "Ollama version embedded in the local Go binary")
set(OLLAMA_PAYLOAD_INSTALL_PREFIX "${CMAKE_BINARY_DIR}" CACHE PATH
"Build-time staging prefix for nested Ollama native payloads")
string(REGEX REPLACE "^v" "" OLLAMA_VERSION "${OLLAMA_VERSION}")
set(OLLAMA_NATIVE_CONFIG_ARG)
if(CMAKE_CONFIGURATION_TYPES)
set(OLLAMA_NATIVE_CONFIG_ARG --config Release)
endif()
set(OLLAMA_NATIVE_EXTERNAL_OPTIONS)
if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.28)
list(APPEND OLLAMA_NATIVE_EXTERNAL_OPTIONS BUILD_JOB_SERVER_AWARE TRUE)
endif()
function(ollama_check_metal_toolchain output_version)
find_program(_ollama_xcrun xcrun)
if(NOT _ollama_xcrun)
message(FATAL_ERROR
"MLX Metal requires Xcode command line tools. Install Xcode, run "
"`sudo xcode-select -s /Applications/Xcode.app/Contents/Developer`, "
"then install the Metal toolchain with "
"`xcodebuild -downloadComponent MetalToolchain`.")
endif()
execute_process(
COMMAND zsh "-c"
"echo \"__METAL_VERSION__\" | \"${_ollama_xcrun}\" -sdk macosx metal -E -x metal -P - 2>/dev/null | tail -1 | tr -d '\n'"
OUTPUT_VARIABLE _metal_version
RESULT_VARIABLE _metal_result)
if(NOT _metal_result EQUAL 0 OR NOT _metal_version MATCHES "^[0-9]+$")
message(FATAL_ERROR
"MLX Metal requires Xcode's Metal toolchain. Install Xcode, run "
"`sudo xcode-select -s /Applications/Xcode.app/Contents/Developer`, "
"then install the Metal toolchain with "
"`xcodebuild -downloadComponent MetalToolchain`.")
endif()
set(${output_version} "${_metal_version}" PARENT_SCOPE)
endfunction()
function(ollama_macos_major_version output)
execute_process(
COMMAND sw_vers -productVersion
OUTPUT_VARIABLE _macos_version
OUTPUT_STRIP_TRAILING_WHITESPACE
RESULT_VARIABLE _macos_result
ERROR_QUIET)
if(_macos_result EQUAL 0)
string(REGEX MATCH "^[0-9]+" _macos_major "${_macos_version}")
endif()
set(${output} "${_macos_major}" PARENT_SCOPE)
endfunction()
function(ollama_macos_sdk_major_version output)
execute_process(
COMMAND xcrun --sdk macosx --show-sdk-version
OUTPUT_VARIABLE _sdk_version
OUTPUT_STRIP_TRAILING_WHITESPACE
RESULT_VARIABLE _sdk_result
ERROR_QUIET)
if(_sdk_result EQUAL 0)
string(REGEX MATCH "^[0-9]+" _sdk_major "${_sdk_version}")
endif()
set(${output} "${_sdk_major}" PARENT_SCOPE)
endfunction()
function(ollama_default_mlx_backends output)
set(_backends "")
if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
ollama_check_metal_toolchain(_metal_version)
ollama_macos_major_version(_macos_major)
ollama_macos_sdk_major_version(_sdk_major)
if(_macos_major AND _sdk_major AND _macos_major GREATER_EQUAL 26 AND _sdk_major GREATER_EQUAL 26)
set(_backends "metal_v4")
else()
set(_backends "metal_v3")
endif()
message(STATUS "Defaulting OLLAMA_MLX_BACKENDS=${_backends} for macOS arm64")
endif()
set(${output} "${_backends}" PARENT_SCOPE)
endfunction()
if(NOT DEFINED OLLAMA_MLX_BACKENDS)
ollama_default_mlx_backends(_ollama_default_mlx_backends)
set(OLLAMA_MLX_BACKENDS "${_ollama_default_mlx_backends}" CACHE STRING "${_ollama_mlx_backends_doc}")
else()
set(OLLAMA_MLX_BACKENDS "${OLLAMA_MLX_BACKENDS}" CACHE STRING "${_ollama_mlx_backends_doc}")
endif()
if(NOT OLLAMA_HAVE_LLAMA_SERVER)
if(OLLAMA_LLAMA_BACKENDS)
message(FATAL_ERROR "llama/server is required when OLLAMA_LLAMA_BACKENDS is set")
endif()
if(NOT OLLAMA_MLX_BACKENDS)
message(FATAL_ERROR "llama/server is required for local Ollama builds")
endif()
else()
file(READ "${CMAKE_SOURCE_DIR}/LLAMA_CPP_VERSION" OLLAMA_LLAMA_CPP_GIT_TAG)
string(STRIP "${OLLAMA_LLAMA_CPP_GIT_TAG}" OLLAMA_LLAMA_CPP_GIT_TAG)
include(${CMAKE_SOURCE_DIR}/llama/compat/compat.cmake)
if(DEFINED FETCHCONTENT_SOURCE_DIR_LLAMA_CPP AND NOT "${FETCHCONTENT_SOURCE_DIR_LLAMA_CPP}" STREQUAL "")
get_filename_component(OLLAMA_LLAMA_CPP_SOURCE_DIR
"${FETCHCONTENT_SOURCE_DIR_LLAMA_CPP}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
message(STATUS "Using llama.cpp source override: ${OLLAMA_LLAMA_CPP_SOURCE_DIR}")
add_custom_target(ollama-llama-cpp-source)
elseif(DEFINED ENV{OLLAMA_LLAMA_CPP_SOURCE})
get_filename_component(OLLAMA_LLAMA_CPP_SOURCE_DIR
"$ENV{OLLAMA_LLAMA_CPP_SOURCE}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
message(STATUS "Using local llama.cpp source: ${OLLAMA_LLAMA_CPP_SOURCE_DIR}")
add_custom_target(ollama-llama-cpp-source)
else()
set(OLLAMA_LLAMA_CPP_SOURCE_DIR "${CMAKE_BINARY_DIR}/_deps/llama_cpp-src")
ExternalProject_Add(ollama-llama-cpp-source
GIT_REPOSITORY "https://github.com/ggml-org/llama.cpp.git"
GIT_TAG ${OLLAMA_LLAMA_CPP_GIT_TAG}
GIT_SHALLOW TRUE
SOURCE_DIR ${OLLAMA_LLAMA_CPP_SOURCE_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
PATCH_COMMAND ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND}
USES_TERMINAL_DOWNLOAD TRUE
USES_TERMINAL_PATCH TRUE)
endif()
endif()
set(_mlx_source_targets)
if(OLLAMA_MLX_BACKENDS)
file(READ "${CMAKE_SOURCE_DIR}/MLX_VERSION" OLLAMA_MLX_GIT_TAG)
string(STRIP "${OLLAMA_MLX_GIT_TAG}" OLLAMA_MLX_GIT_TAG)
file(READ "${CMAKE_SOURCE_DIR}/MLX_C_VERSION" OLLAMA_MLX_C_GIT_TAG)
string(STRIP "${OLLAMA_MLX_C_GIT_TAG}" OLLAMA_MLX_C_GIT_TAG)
if(DEFINED FETCHCONTENT_SOURCE_DIR_MLX AND NOT "${FETCHCONTENT_SOURCE_DIR_MLX}" STREQUAL "")
get_filename_component(OLLAMA_MLX_SOURCE_DIR
"${FETCHCONTENT_SOURCE_DIR_MLX}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
message(STATUS "Using MLX source override: ${OLLAMA_MLX_SOURCE_DIR}")
elseif(DEFINED ENV{OLLAMA_MLX_SOURCE})
get_filename_component(OLLAMA_MLX_SOURCE_DIR
"$ENV{OLLAMA_MLX_SOURCE}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
message(STATUS "Using local MLX source: ${OLLAMA_MLX_SOURCE_DIR}")
else()
set(OLLAMA_MLX_SOURCE_DIR "${CMAKE_BINARY_DIR}/_deps/mlx-src")
ExternalProject_Add(ollama-mlx-source
GIT_REPOSITORY "https://github.com/ml-explore/mlx.git"
GIT_TAG ${OLLAMA_MLX_GIT_TAG}
# MLX uses commit hashes while we track closely; switch to shallow when MLX pins move to tags.
GIT_SHALLOW FALSE
SOURCE_DIR ${OLLAMA_MLX_SOURCE_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
USES_TERMINAL_DOWNLOAD TRUE)
list(APPEND _mlx_source_targets ollama-mlx-source)
endif()
if(DEFINED "FETCHCONTENT_SOURCE_DIR_MLX-C" AND NOT "${FETCHCONTENT_SOURCE_DIR_MLX-C}" STREQUAL "")
get_filename_component(OLLAMA_MLX_C_SOURCE_DIR
"${FETCHCONTENT_SOURCE_DIR_MLX-C}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
message(STATUS "Using MLX-C source override: ${OLLAMA_MLX_C_SOURCE_DIR}")
elseif(DEFINED ENV{OLLAMA_MLX_C_SOURCE})
get_filename_component(OLLAMA_MLX_C_SOURCE_DIR
"$ENV{OLLAMA_MLX_C_SOURCE}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
message(STATUS "Using local MLX-C source: ${OLLAMA_MLX_C_SOURCE_DIR}")
else()
set(OLLAMA_MLX_C_SOURCE_DIR "${CMAKE_BINARY_DIR}/_deps/mlx-c-src")
ExternalProject_Add(ollama-mlx-c-source
GIT_REPOSITORY "https://github.com/ml-explore/mlx-c.git"
GIT_TAG ${OLLAMA_MLX_C_GIT_TAG}
# MLX-C uses commit hashes while we track closely; switch to shallow when MLX-C pins move to tags.
GIT_SHALLOW FALSE
SOURCE_DIR ${OLLAMA_MLX_C_SOURCE_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
USES_TERMINAL_DOWNLOAD TRUE)
list(APPEND _mlx_source_targets ollama-mlx-c-source)
endif()
add_custom_target(ollama-mlx-sources DEPENDS ${_mlx_source_targets})
endif()
set(OLLAMA_NATIVE_BUILD_TOOL_COMMAND
${CMAKE_COMMAND} --build <BINARY_DIR>)
set(OLLAMA_NATIVE_BUILD_TARGET_ARG --target)
if(CMAKE_GENERATOR MATCHES "Makefiles")
set(OLLAMA_NATIVE_BUILD_TOOL_COMMAND
"$(MAKE)" -C <BINARY_DIR>)
set(OLLAMA_NATIVE_BUILD_TARGET_ARG)
endif()
function(ollama_escape_cmake_list input output)
string(REPLACE ";" "|" _escaped "${input}")
set(${output} "${_escaped}" PARENT_SCOPE)
endfunction()
function(ollama_collect_cache_args_with_prefix prefix output)
get_cmake_property(_cache_variables CACHE_VARIABLES)
list(SORT _cache_variables)
set(_args)
foreach(_var IN LISTS _cache_variables)
if(_var MATCHES "^${prefix}")
ollama_escape_cmake_list("${${_var}}" _value)
list(APPEND _args "-D${_var}=${_value}")
endif()
endforeach()
set(${output} "${_args}" PARENT_SCOPE)
endfunction()
function(ollama_append_cache_arg_if_set output name)
if(DEFINED ${name} AND NOT "${${name}}" STREQUAL "")
ollama_escape_cmake_list("${${name}}" _value)
set(${output} ${${output}} "-D${name}=${_value}" PARENT_SCOPE)
endif()
endfunction()
function(ollama_cache_arg_is_set name output)
if(DEFINED ${name} AND NOT "${${name}}" STREQUAL "")
set(${output} TRUE PARENT_SCOPE)
else()
set(${output} FALSE PARENT_SCOPE)
endif()
endfunction()
function(ollama_llama_cuda_preset backend output)
ollama_cache_arg_is_set(CMAKE_CUDA_ARCHITECTURES _has_cuda_arch)
if(_has_cuda_arch)
set(_preset "llama_${backend}_user_arch")
elseif(WIN32)
set(_preset "llama_${backend}_windows")
else()
set(_preset "llama_${backend}_linux")
endif()
set(${output} "${_preset}" PARENT_SCOPE)
endfunction()
function(ollama_mlx_cuda_preset output)
ollama_cache_arg_is_set(MLX_CUDA_ARCHITECTURES _has_mlx_arch)
ollama_cache_arg_is_set(CMAKE_CUDA_ARCHITECTURES _has_cuda_arch)
if(_has_mlx_arch OR _has_cuda_arch)
set(_preset "mlx_cuda_v13_user_arch")
elseif(WIN32)
set(_preset "mlx_cuda_v13_windows")
else()
set(_preset "mlx_cuda_v13_linux")
endif()
set(${output} "${_preset}" PARENT_SCOPE)
endfunction()
function(ollama_rocm_preset backend output)
ollama_cache_arg_is_set(AMDGPU_TARGETS _has_amdgpu_targets)
ollama_cache_arg_is_set(CMAKE_HIP_ARCHITECTURES _has_hip_arch)
if(_has_amdgpu_targets OR _has_hip_arch)
if(backend STREQUAL "rocm_v7_1" AND NOT WIN32)
message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_1 is only supported for Windows ROCm builds")
elseif(backend STREQUAL "rocm_v7_2" AND WIN32)
message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_2 is only supported for Linux ROCm builds")
endif()
elseif(backend STREQUAL "rocm_v7_1")
if(NOT WIN32)
message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_1 is only supported for Windows ROCm builds")
endif()
set(_preset "${backend}_windows")
elseif(backend STREQUAL "rocm_v7_2")
if(WIN32)
message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_2 is only supported for Linux ROCm builds")
endif()
set(_preset "${backend}_linux")
else()
message(FATAL_ERROR "Unknown ROCm backend '${backend}'")
endif()
if(_has_amdgpu_targets OR _has_hip_arch)
set(_preset "${backend}_user_arch")
endif()
set(${output} "${_preset}" PARENT_SCOPE)
endfunction()
function(ollama_add_llama_server_build name)
cmake_parse_arguments(ARG "" "PRESET;RUNNER_DIR" "TARGETS;CMAKE_ARGS" ${ARGN})
if(NOT ARG_TARGETS)
message(FATAL_ERROR "ollama_add_llama_server_build(${name}) requires TARGETS")
endif()
if(WIN32 AND name STREQUAL "vulkan")
# The Vulkan shader generator nests deeply enough to hit Windows MAX_PATH.
set(_build_dir ${CMAKE_BINARY_DIR}/ls-vk)
else()
set(_build_dir ${CMAKE_BINARY_DIR}/llama-server-${name})
endif()
ollama_collect_cache_args_with_prefix("GGML_" _ggml_cache_args)
ollama_collect_cache_args_with_prefix("LLAMA_" _llama_cache_args)
set(_cmake_args
-DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
-DCMAKE_INSTALL_PREFIX=${OLLAMA_PAYLOAD_INSTALL_PREFIX}
-DOLLAMA_LIB_DIR:STRING=${OLLAMA_LIB_DIR}
-DOLLAMA_RUNNER_DIR=${ARG_RUNNER_DIR}
-DFETCHCONTENT_SOURCE_DIR_LLAMA_CPP=${OLLAMA_LLAMA_CPP_SOURCE_DIR}
-DOLLAMA_LLAMA_CPP_SKIP_COMPAT_PATCH=ON
-DGGML_NATIVE=OFF
-DGGML_OPENMP=OFF
${ARG_CMAKE_ARGS}
${_ggml_cache_args}
${_llama_cache_args}
)
if(APPLE)
if(CMAKE_OSX_ARCHITECTURES)
list(APPEND _cmake_args
-DCMAKE_OSX_ARCHITECTURES=${CMAKE_OSX_ARCHITECTURES})
endif()
if(CMAKE_OSX_DEPLOYMENT_TARGET)
list(APPEND _cmake_args
-DCMAKE_OSX_DEPLOYMENT_TARGET=${CMAKE_OSX_DEPLOYMENT_TARGET})
endif()
endif()
set(_configure_command ${CMAKE_COMMAND}
-S ${CMAKE_SOURCE_DIR}/llama/server
-B <BINARY_DIR>
${_cmake_args})
if(ARG_PRESET)
set(_configure_command ${CMAKE_COMMAND}
-S ${CMAKE_SOURCE_DIR}/llama/server
--preset ${ARG_PRESET}
-B <BINARY_DIR>
${_cmake_args})
endif()
ExternalProject_Add(ollama-llama-server-${name}
SOURCE_DIR ${CMAKE_SOURCE_DIR}/llama/server
BINARY_DIR ${_build_dir}
CONFIGURE_COMMAND ${_configure_command}
BUILD_COMMAND ${OLLAMA_NATIVE_BUILD_TOOL_COMMAND}
${OLLAMA_NATIVE_CONFIG_ARG}
${OLLAMA_NATIVE_BUILD_TARGET_ARG} ${ARG_TARGETS}
INSTALL_COMMAND ${CMAKE_COMMAND} --install <BINARY_DIR>
${OLLAMA_NATIVE_CONFIG_ARG}
--component llama-server
DEPENDS ollama-llama-cpp-source
LIST_SEPARATOR |
# ExternalProject cannot reliably infer when nested FetchContent
# sources, compat patches, or forwarded GGML/LLAMA cache settings need
# a rebuild. Always entering the sub-build keeps direct `cmake --build`
# iteration correct; the nested generator still performs incremental
# compilation.
BUILD_ALWAYS TRUE
${OLLAMA_NATIVE_EXTERNAL_OPTIONS}
USES_TERMINAL_CONFIGURE TRUE
USES_TERMINAL_BUILD TRUE
USES_TERMINAL_INSTALL TRUE)
endfunction()
function(ollama_add_mlx_build name)
cmake_parse_arguments(ARG "" "PRESET;RUNNER_DIR" "CMAKE_ARGS" ${ARGN})
if(NOT ARG_RUNNER_DIR)
message(FATAL_ERROR "ollama_add_mlx_build(${name}) requires RUNNER_DIR")
endif()
set(_build_dir ${CMAKE_BINARY_DIR}/mlx-${name})
ollama_collect_cache_args_with_prefix("MLX_" _mlx_cache_args)
set(_cmake_args
-DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
-DCMAKE_INSTALL_PREFIX=${OLLAMA_PAYLOAD_INSTALL_PREFIX}
-DOLLAMA_LIB_DIR:STRING=${OLLAMA_LIB_DIR}
-DOLLAMA_RUNNER_DIR=${ARG_RUNNER_DIR}
-DOLLAMA_SOURCE_DIR=${CMAKE_SOURCE_DIR}
-DFETCHCONTENT_SOURCE_DIR_MLX=${OLLAMA_MLX_SOURCE_DIR}
-DFETCHCONTENT_SOURCE_DIR_MLX-C=${OLLAMA_MLX_C_SOURCE_DIR}
-DOLLAMA_MLX_GENERATE_WRAPPERS=OFF
${ARG_CMAKE_ARGS}
${_mlx_cache_args}
)
foreach(_arg IN ITEMS
BLAS_INCLUDE_DIRS
LAPACK_INCLUDE_DIRS
CUDAToolkit_ROOT
CUDNN_ROOT_DIR
CUDNN_INCLUDE_PATH
CUDNN_LIBRARY_PATH
CMAKE_CUDA_COMPILER
CMAKE_CUDA_HOST_COMPILER
CMAKE_INCLUDE_PATH
CMAKE_LIBRARY_PATH
CMAKE_PREFIX_PATH)
ollama_append_cache_arg_if_set(_cmake_args ${_arg})
endforeach()
if(APPLE)
if(CMAKE_OSX_ARCHITECTURES)
list(APPEND _cmake_args
-DCMAKE_OSX_ARCHITECTURES=${CMAKE_OSX_ARCHITECTURES})
endif()
endif()
set(_configure_command ${CMAKE_COMMAND}
-S ${CMAKE_SOURCE_DIR}/cmake/mlx
-B <BINARY_DIR>
${_cmake_args})
if(ARG_PRESET)
set(_configure_command ${CMAKE_COMMAND}
-S ${CMAKE_SOURCE_DIR}/cmake/mlx
--preset ${ARG_PRESET}
-B <BINARY_DIR>
${_cmake_args})
endif()
ExternalProject_Add(ollama-mlx-${name}
SOURCE_DIR ${CMAKE_SOURCE_DIR}/cmake/mlx
BINARY_DIR ${_build_dir}
CONFIGURE_COMMAND ${_configure_command}
BUILD_COMMAND ${OLLAMA_NATIVE_BUILD_TOOL_COMMAND}
${OLLAMA_NATIVE_CONFIG_ARG}
${OLLAMA_NATIVE_BUILD_TARGET_ARG} mlx
${OLLAMA_NATIVE_BUILD_TARGET_ARG} mlxc
INSTALL_COMMAND ${CMAKE_COMMAND} --install <BINARY_DIR>
${OLLAMA_NATIVE_CONFIG_ARG}
--component MLX
COMMAND ${CMAKE_COMMAND} --install <BINARY_DIR>
${OLLAMA_NATIVE_CONFIG_ARG}
--component MLX_VENDOR
DEPENDS ollama-mlx-sources
LIST_SEPARATOR |
BUILD_ALWAYS TRUE
${OLLAMA_NATIVE_EXTERNAL_OPTIONS}
USES_TERMINAL_CONFIGURE TRUE
USES_TERMINAL_BUILD TRUE
USES_TERMINAL_INSTALL TRUE)
endfunction()
find_program(GO_EXECUTABLE go)
if(OLLAMA_MLX_BACKENDS)
set(_mlx_c_headers_dir "${OLLAMA_MLX_C_SOURCE_DIR}/mlx/c")
set(_mlx_c_headers_dest "${CMAKE_SOURCE_DIR}/x/mlxrunner/mlx/include/mlx/c")
if(GO_EXECUTABLE AND (NOT APPLE OR CMAKE_SYSTEM_PROCESSOR STREQUAL CMAKE_HOST_SYSTEM_PROCESSOR))
add_custom_target(ollama-mlx-generate-wrappers
COMMAND ${CMAKE_COMMAND}
-DMLX_C_HEADERS_DIR=${_mlx_c_headers_dir}
-DMLX_C_HEADERS_DEST=${_mlx_c_headers_dest}
-P "${CMAKE_SOURCE_DIR}/cmake/vendor-mlx-c-headers.cmake"
COMMAND ${CMAKE_COMMAND} -E env
CC= CGO_CFLAGS= CGO_CXXFLAGS=
${GO_EXECUTABLE} generate ./x/...
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
DEPENDS ollama-mlx-sources
COMMENT "Regenerating MLX Go wrappers"
VERBATIM)
else()
add_custom_target(ollama-mlx-generate-wrappers
COMMAND ${CMAKE_COMMAND} -E echo
"Cannot regenerate MLX wrappers while Go is unavailable or while cross-compiling"
COMMAND ${CMAKE_COMMAND} -E false
DEPENDS ollama-mlx-sources
VERBATIM)
endif()
endif()
if(OLLAMA_HAVE_LLAMA_SERVER)
if(NOT OLLAMA_GO_OUTPUT)
if(WIN32)
set(OLLAMA_GO_OUTPUT ${CMAKE_SOURCE_DIR}/ollama.exe)
else()
set(OLLAMA_GO_OUTPUT ${CMAKE_SOURCE_DIR}/ollama)
endif()
endif()
if(NOT IS_ABSOLUTE "${OLLAMA_GO_OUTPUT}")
set(OLLAMA_GO_OUTPUT "${CMAKE_SOURCE_DIR}/${OLLAMA_GO_OUTPUT}")
endif()
get_filename_component(OLLAMA_GO_OUTPUT "${OLLAMA_GO_OUTPUT}" ABSOLUTE)
set(OLLAMA_GO_OUTPUT "${OLLAMA_GO_OUTPUT}" CACHE FILEPATH "Output path for the local Ollama Go binary")
get_filename_component(OLLAMA_GO_OUTPUT_DIR "${OLLAMA_GO_OUTPUT}" DIRECTORY)
set(OLLAMA_GO_LDFLAGS
"-s -w -X=github.com/ollama/ollama/version.Version=${OLLAMA_VERSION} -X=github.com/ollama/ollama/server.mode=release")
if(GO_EXECUTABLE)
add_custom_target(ollama-go ALL
COMMAND ${CMAKE_COMMAND} -E make_directory "${OLLAMA_GO_OUTPUT_DIR}"
COMMAND ${CMAKE_COMMAND} -E env CGO_ENABLED=1
${GO_EXECUTABLE} build -trimpath -ldflags "${OLLAMA_GO_LDFLAGS}" -o "${OLLAMA_GO_OUTPUT}" .
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
BYPRODUCTS ${OLLAMA_GO_OUTPUT}
COMMENT "Building Ollama Go binary"
VERBATIM)
else()
add_custom_target(ollama-go ALL
COMMAND ${CMAKE_COMMAND} -E echo
"Go executable not found. Install Go or set GO_EXECUTABLE to build the local Ollama binary."
COMMAND ${CMAKE_COMMAND} -E false
COMMENT "Building Ollama Go binary"
VERBATIM)
endif()
set(_cpu_args)
if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
list(APPEND _cpu_args
-DBUILD_SHARED_LIBS=OFF
-DGGML_BACKEND_DL=OFF
-DGGML_METAL=ON
-DGGML_METAL_EMBED_LIBRARY=ON)
else()
list(APPEND _cpu_args
-DBUILD_SHARED_LIBS=ON
-DGGML_BACKEND_DL=ON
-DGGML_CPU_ALL_VARIANTS=ON)
if(WIN32)
list(APPEND _cpu_args -DGGML_OPENMP=ON)
endif()
if(APPLE)
list(APPEND _cpu_args -DGGML_METAL=OFF)
endif()
endif()
ollama_add_llama_server_build(local
RUNNER_DIR ""
TARGETS llama-server llama-quantize
CMAKE_ARGS ${_cpu_args})
add_custom_target(ollama-local ALL
DEPENDS ollama-go ollama-llama-server-local
COMMENT "Building local Ollama payload")
install(PROGRAMS "${OLLAMA_GO_OUTPUT}"
DESTINATION "${CMAKE_INSTALL_BINDIR}"
COMPONENT ollama-local)
endif()
set(_backend_targets)
if(OLLAMA_HAVE_LLAMA_SERVER)
foreach(_backend IN LISTS OLLAMA_LLAMA_BACKENDS)
if(_backend STREQUAL "cuda_v12")
ollama_llama_cuda_preset(${_backend} _cuda_preset)
set(_cuda_args)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
ollama_add_llama_server_build(${_backend}
PRESET ${_cuda_preset}
RUNNER_DIR ${_backend}
TARGETS ggml-cuda
CMAKE_ARGS ${_cuda_args})
list(APPEND _backend_targets ollama-llama-server-${_backend})
elseif(_backend STREQUAL "cuda_v13")
ollama_llama_cuda_preset(${_backend} _cuda_preset)
set(_cuda_args)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
ollama_add_llama_server_build(${_backend}
PRESET ${_cuda_preset}
RUNNER_DIR ${_backend}
TARGETS ggml-cuda
CMAKE_ARGS ${_cuda_args})
list(APPEND _backend_targets ollama-llama-server-${_backend})
elseif(_backend STREQUAL "rocm_v7_1" OR _backend STREQUAL "rocm_v7_2")
# ROCm 7.1 and 7.2 currently share build settings. Keep the backend
# names versioned so future packaging can install side-by-side ROCm
# payloads without changing the superbuild interface.
ollama_rocm_preset(${_backend} _rocm_preset)
set(_rocm_args
-DBUILD_SHARED_LIBS=ON
-DGGML_BACKEND_DL=ON
-DGGML_HIP=ON
-DCMAKE_HIP_PLATFORM=amd
-DOLLAMA_GPU_BACKEND=hip)
ollama_append_cache_arg_if_set(_rocm_args AMDGPU_TARGETS)
ollama_append_cache_arg_if_set(_rocm_args CMAKE_HIP_ARCHITECTURES)
ollama_append_cache_arg_if_set(_rocm_args CMAKE_HIP_FLAGS)
ollama_append_cache_arg_if_set(_rocm_args CMAKE_PREFIX_PATH)
ollama_add_llama_server_build(${_backend}
PRESET ${_rocm_preset}
RUNNER_DIR ${_backend}
TARGETS ggml-hip
CMAKE_ARGS ${_rocm_args})
list(APPEND _backend_targets ollama-llama-server-${_backend})
elseif(_backend STREQUAL "vulkan")
ollama_add_llama_server_build(vulkan
RUNNER_DIR vulkan
TARGETS ggml-vulkan
CMAKE_ARGS
-DBUILD_SHARED_LIBS=ON
-DGGML_BACKEND_DL=ON
-DGGML_VULKAN=ON
-DOLLAMA_GPU_BACKEND=vulkan)
list(APPEND _backend_targets ollama-llama-server-vulkan)
elseif(_backend STREQUAL "cuda_jetpack5")
if(CMAKE_CUDA_ARCHITECTURES)
set(_cuda_preset llama_cuda_jetpack5_user_arch)
else()
set(_cuda_preset llama_cuda_jetpack5)
endif()
set(_cuda_args)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
ollama_add_llama_server_build(${_backend}
PRESET ${_cuda_preset}
RUNNER_DIR ${_backend}
TARGETS ggml-cuda
CMAKE_ARGS ${_cuda_args})
list(APPEND _backend_targets ollama-llama-server-${_backend})
elseif(_backend STREQUAL "cuda_jetpack6")
if(CMAKE_CUDA_ARCHITECTURES)
set(_cuda_preset llama_cuda_jetpack6_user_arch)
else()
set(_cuda_preset llama_cuda_jetpack6)
endif()
set(_cuda_args)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
ollama_add_llama_server_build(${_backend}
PRESET ${_cuda_preset}
RUNNER_DIR ${_backend}
TARGETS ggml-cuda
CMAKE_ARGS ${_cuda_args})
list(APPEND _backend_targets ollama-llama-server-${_backend})
else()
message(FATAL_ERROR
"Unknown OLLAMA_LLAMA_BACKENDS entry '${_backend}'")
endif()
endforeach()
endif()
if(_backend_targets)
add_custom_target(ollama-llama-server-backends ALL
DEPENDS ${_backend_targets}
COMMENT "Building llama-server GPU backends")
endif()
set(_mlx_targets)
foreach(_backend IN LISTS OLLAMA_MLX_BACKENDS)
if(_backend STREQUAL "cuda_v13")
ollama_mlx_cuda_preset(_mlx_cuda_preset)
set(_mlx_cuda_args)
ollama_append_cache_arg_if_set(_mlx_cuda_args CMAKE_CUDA_ARCHITECTURES)
ollama_append_cache_arg_if_set(_mlx_cuda_args MLX_CUDA_ARCHITECTURES)
ollama_append_cache_arg_if_set(_mlx_cuda_args CMAKE_CUDA_FLAGS)
ollama_add_mlx_build(cuda_v13
PRESET ${_mlx_cuda_preset}
RUNNER_DIR mlx_cuda_v13
CMAKE_ARGS ${_mlx_cuda_args})
list(APPEND _mlx_targets ollama-mlx-cuda_v13)
elseif(_backend STREQUAL "metal_v3")
if(NOT APPLE)
message(FATAL_ERROR "OLLAMA_MLX_BACKENDS=metal_v3 is only supported on macOS")
endif()
ollama_check_metal_toolchain(_metal_version)
ollama_add_mlx_build(metal_v3
PRESET mlx_metal_v3
RUNNER_DIR mlx_metal_v3)
list(APPEND _mlx_targets ollama-mlx-metal_v3)
elseif(_backend STREQUAL "metal_v4")
if(NOT APPLE)
message(FATAL_ERROR "OLLAMA_MLX_BACKENDS=metal_v4 is only supported on macOS")
endif()
ollama_check_metal_toolchain(_metal_version)
ollama_macos_sdk_major_version(_ollama_mlx_sdk_major)
if(_ollama_mlx_sdk_major AND _ollama_mlx_sdk_major GREATER_EQUAL 26)
ollama_add_mlx_build(metal_v4
PRESET mlx_metal_v4
RUNNER_DIR mlx_metal_v4)
list(APPEND _mlx_targets ollama-mlx-metal_v4)
else()
message(FATAL_ERROR
"OLLAMA_MLX_BACKENDS=metal_v4 requires the macOS 26 SDK. "
"Install a newer Xcode or use OLLAMA_MLX_BACKENDS=metal_v3.")
endif()
else()
message(FATAL_ERROR
"Unknown OLLAMA_MLX_BACKENDS entry '${_backend}'")
endif()
endforeach()
if(_mlx_targets)
add_custom_target(ollama-mlx-backends ALL
DEPENDS ${_mlx_targets}
COMMENT "Building MLX backends")
endif()
install(DIRECTORY "${OLLAMA_PAYLOAD_INSTALL_PREFIX}/${OLLAMA_LIB_DIR}/"
DESTINATION "${OLLAMA_LIB_DIR}"
COMPONENT ollama-local
USE_SOURCE_PERMISSIONS)

cmake/mlx/CMakeLists.txt Normal file

@ -0,0 +1,235 @@
cmake_minimum_required(VERSION 3.24)
project(OllamaMLX C CXX)
include(CheckLanguage)
include(GNUInstallDirs)
find_package(Threads REQUIRED)
if(NOT CMAKE_CONFIGURATION_TYPES AND NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
endif()
if(NOT DEFINED BUILD_SHARED_LIBS)
set(BUILD_SHARED_LIBS ON)
endif()
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS ON)
if(APPLE)
set(CMAKE_BUILD_RPATH "@loader_path")
set(CMAKE_INSTALL_RPATH "@loader_path")
set(CMAKE_BUILD_WITH_INSTALL_RPATH ON)
endif()
if(NOT DEFINED OLLAMA_SOURCE_DIR OR "${OLLAMA_SOURCE_DIR}" STREQUAL "")
get_filename_component(OLLAMA_SOURCE_DIR "${CMAKE_CURRENT_LIST_DIR}/../.." ABSOLUTE)
endif()
get_filename_component(OLLAMA_SOURCE_DIR "${OLLAMA_SOURCE_DIR}" ABSOLUTE BASE_DIR "${CMAKE_CURRENT_LIST_DIR}")
set(OLLAMA_SOURCE_DIR "${OLLAMA_SOURCE_DIR}" CACHE PATH "Ollama repository root")
set(OLLAMA_LIB_DIR "lib/ollama" CACHE STRING "Install destination for Ollama runtime payloads")
set(OLLAMA_RUNNER_DIR "" CACHE STRING "Ollama runtime payload subdirectory")
set(OLLAMA_BUILD_DIR ${CMAKE_BINARY_DIR}/lib/ollama)
set(OLLAMA_INSTALL_DIR ${OLLAMA_LIB_DIR}/${OLLAMA_RUNNER_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${OLLAMA_BUILD_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_DEBUG ${OLLAMA_BUILD_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_RELEASE ${OLLAMA_BUILD_DIR})
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${OLLAMA_BUILD_DIR})
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_DEBUG ${OLLAMA_BUILD_DIR})
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE ${OLLAMA_BUILD_DIR})
if(MLX_CUDA_ARCHITECTURES OR CMAKE_CUDA_ARCHITECTURES)
check_language(CUDA)
endif()
option(OLLAMA_MLX_GENERATE_WRAPPERS "Regenerate MLX Go wrappers" OFF)
message(STATUS "Setting up MLX (this takes a while...)")
add_subdirectory(${OLLAMA_SOURCE_DIR}/x/imagegen/mlx ${CMAKE_BINARY_DIR}/x/imagegen/mlx)
# Find CUDA toolkit if MLX is built with CUDA support.
find_package(CUDAToolkit)
# Build list of directories for runtime dependency resolution.
set(MLX_RUNTIME_DIRS ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR})
# Add cuDNN bin paths for DLLs (Windows MLX CUDA builds).
# CUDNN_ROOT_DIR is the conventional variable (CMake cache or environment) for the cuDNN location.
if(CUDNN_ROOT_DIR)
set(_cudnn_root "${CUDNN_ROOT_DIR}")
elseif(DEFINED ENV{CUDNN_ROOT_DIR})
set(_cudnn_root "$ENV{CUDNN_ROOT_DIR}")
endif()
if(_cudnn_root)
# cuDNN 9.x has versioned subdirectories under bin/ (e.g., bin/13.0/).
file(GLOB CUDNN_BIN_SUBDIRS "${_cudnn_root}/bin/*")
list(APPEND MLX_RUNTIME_DIRS ${CUDNN_BIN_SUBDIRS})
endif()
# Add build output directory and MLX dependency build directories.
list(APPEND MLX_RUNTIME_DIRS ${OLLAMA_BUILD_DIR})
# OpenBLAS DLL location (pre-built zip extracts into openblas-src/bin/).
list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/openblas-src/bin)
# NCCL: on Linux, if real NCCL is found, CMake bundles libnccl.so via the
# regex below. If NCCL is not found, MLX links a static stub (OBJECT lib),
# so there is no runtime dependency. The path below covers the stub build
# directory on Windows so that its DLL is included in our dependencies.
list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/distributed/nccl/nccl_stub-prefix/src/nccl_stub-build/Release)
# Base regexes for runtime dependencies (cross-platform).
set(MLX_INCLUDE_REGEXES cublas cublasLt cudart cufft nvrtc nvrtc-builtins cudnn nccl openblas gfortran)
# On Windows, also include dl.dll (dlfcn-win32 POSIX emulation layer).
if(WIN32)
list(APPEND MLX_INCLUDE_REGEXES "^dl\\.dll$")
endif()
# Keep mlx/mlxc targets separate from runtime dependencies so --strip only
# applies to the binaries we build, not vendor DLLs/libs.
install(TARGETS mlx mlxc
RUNTIME_DEPENDENCY_SET mlx_runtime_deps
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
)
install(RUNTIME_DEPENDENCY_SET mlx_runtime_deps
DIRECTORIES ${MLX_RUNTIME_DIRS}
PRE_INCLUDE_REGEXES ${MLX_INCLUDE_REGEXES}
PRE_EXCLUDE_REGEXES ".*"
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
)
if(TARGET jaccl)
install(TARGETS jaccl
RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
)
endif()
# Install the Metal library for macOS arm64 (must be colocated with the binary).
# Metal backend is only built for arm64, not x86_64.
if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
install(FILES ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/backend/metal/kernels/mlx.metallib
DESTINATION ${OLLAMA_INSTALL_DIR}
COMPONENT MLX)
endif()
# Install headers for NVRTC JIT compilation at runtime.
# MLX's own install rules use the default component so they get skipped by
# --component MLX. Headers are installed alongside libmlx in OLLAMA_INSTALL_DIR.
#
# Layout:
# ${OLLAMA_INSTALL_DIR}/include/cccl/{cuda,nv}/ - CCCL headers
# ${OLLAMA_INSTALL_DIR}/include/*.h - CUDA toolkit headers
#
# MLX's jit_module.cpp resolves CCCL via
# current_binary_dir()[.parent_path()] / "include" / "cccl"
# On Linux it uses current_binary_dir().parent_path() / "include" / "cccl",
# so we create a symlink from lib/ollama/include -> ${OLLAMA_RUNNER_DIR}/include.
# This will need refinement if we add multiple CUDA versions for MLX in the future.
# CUDA runtime headers are found via the CUDA_PATH env var (set by mlxrunner).
if(EXISTS ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda)
install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda
DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
COMPONENT MLX)
install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/nv
DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
COMPONENT MLX)
endif()
# Install minimal CUDA toolkit headers needed by MLX JIT kernels.
# These are the transitive closure of includes from mlx/backend/cuda/device/*.cuh.
# The Go mlxrunner sets CUDA_PATH to OLLAMA_INSTALL_DIR so MLX finds them at
# $CUDA_PATH/include/*.h via NVRTC --include-path.
if(CUDAToolkit_FOUND)
# CUDAToolkit_INCLUDE_DIRS may be a semicolon-separated list
# (e.g. ".../include;.../include/cccl"). Find the entry that
# contains the CUDA runtime headers we need.
set(_cuda_inc "")
foreach(_dir ${CUDAToolkit_INCLUDE_DIRS})
if(EXISTS "${_dir}/cuda_runtime_api.h")
set(_cuda_inc "${_dir}")
break()
endif()
endforeach()
if(NOT _cuda_inc)
message(WARNING "Could not find cuda_runtime_api.h in CUDAToolkit_INCLUDE_DIRS: ${CUDAToolkit_INCLUDE_DIRS}")
else()
set(_dst "${OLLAMA_INSTALL_DIR}/include")
set(_MLX_JIT_CUDA_HEADERS
builtin_types.h
cooperative_groups.h
cuda_bf16.h
cuda_bf16.hpp
cuda_device_runtime_api.h
cuda_fp16.h
cuda_fp16.hpp
cuda_fp8.h
cuda_fp8.hpp
cuda_runtime_api.h
device_types.h
driver_types.h
math_constants.h
surface_types.h
texture_types.h
vector_functions.h
vector_functions.hpp
vector_types.h
)
foreach(_hdr ${_MLX_JIT_CUDA_HEADERS})
install(FILES "${_cuda_inc}/${_hdr}"
DESTINATION ${_dst}
COMPONENT MLX)
endforeach()
# Subdirectory headers.
install(DIRECTORY "${_cuda_inc}/cooperative_groups"
DESTINATION ${_dst}
COMPONENT MLX
FILES_MATCHING PATTERN "*.h")
install(FILES "${_cuda_inc}/crt/host_defines.h"
DESTINATION "${_dst}/crt"
COMPONENT MLX)
if(NOT WIN32 AND NOT APPLE)
install(CODE "
set(_link \"${CMAKE_INSTALL_PREFIX}/${OLLAMA_LIB_DIR}/include\")
set(_target \"${OLLAMA_RUNNER_DIR}/include\")
if(NOT EXISTS \${_link})
execute_process(COMMAND \${CMAKE_COMMAND} -E create_symlink \${_target} \${_link})
endif()
" COMPONENT MLX)
endif()
endif()
endif()
# On Windows, explicitly install dl.dll (dlfcn-win32 POSIX dlopen emulation).
# RUNTIME_DEPENDENCIES auto-excludes it via POST_EXCLUDE_FILES_STRICT because
# dlfcn-win32 is a known CMake target with its own install rules (which install
# to the wrong destination). We must install it explicitly here.
if(WIN32)
install(FILES ${OLLAMA_BUILD_DIR}/dl.dll
DESTINATION ${OLLAMA_INSTALL_DIR}
COMPONENT MLX)
endif()
# Manually install CUDA runtime libraries that MLX loads via dlopen
# (not detected by RUNTIME_DEPENDENCIES since they aren't link-time deps).
if(CUDAToolkit_FOUND)
file(GLOB MLX_CUDA_LIBS
"${CUDAToolkit_LIBRARY_DIR}/libcudart.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcublas.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcublasLt.so*"
"${CUDAToolkit_LIBRARY_DIR}/libnvrtc.so*"
"${CUDAToolkit_LIBRARY_DIR}/libnvrtc-builtins.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcufft.so*"
"${CUDAToolkit_LIBRARY_DIR}/libcudnn.so*")
if(MLX_CUDA_LIBS)
install(FILES ${MLX_CUDA_LIBS}
DESTINATION ${OLLAMA_INSTALL_DIR}
COMPONENT MLX_VENDOR)
endif()
endif()


@ -0,0 +1,90 @@
{
"version": 3,
"configurePresets": [
{
"name": "default",
"binaryDir": "${sourceDir}/../../build/mlx",
"installDir": "${sourceDir}/../../dist",
"cacheVariables": {
"CMAKE_BUILD_TYPE": "Release",
"CMAKE_MSVC_RUNTIME_LIBRARY": "MultiThreaded",
"OLLAMA_SOURCE_DIR": "${sourceDir}/../.."
}
},
{
"name": "mlx_cuda_v13_base",
"hidden": true,
"inherits": [ "default" ],
"cacheVariables": {
"CMAKE_CUDA_FLAGS": "-t 2",
"OLLAMA_RUNNER_DIR": "mlx_cuda_v13"
}
},
{
"name": "mlx_cuda_v13_linux",
"inherits": [ "mlx_cuda_v13_base" ],
"binaryDir": "${sourceDir}/../../build/mlx_cuda_v13",
"cacheVariables": {
"MLX_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual"
}
},
{
"name": "mlx_cuda_v13_windows",
"inherits": [ "mlx_cuda_v13_base" ],
"binaryDir": "${sourceDir}/../../build/mlx_cuda_v13",
"cacheVariables": {
"MLX_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual"
}
},
{
"name": "mlx_cuda_v13_user_arch",
"inherits": [ "mlx_cuda_v13_base" ],
"binaryDir": "${sourceDir}/../../build/mlx_cuda_v13"
},
{
"name": "mlx_metal_v3",
"inherits": [ "default" ],
"binaryDir": "${sourceDir}/../../build/metal-v3",
"cacheVariables": {
"CMAKE_OSX_DEPLOYMENT_TARGET": "14.0",
"OLLAMA_RUNNER_DIR": "mlx_metal_v3"
}
},
{
"name": "mlx_metal_v4",
"inherits": [ "default" ],
"binaryDir": "${sourceDir}/../../build/metal-v4",
"cacheVariables": {
"CMAKE_OSX_DEPLOYMENT_TARGET": "26.0",
"OLLAMA_RUNNER_DIR": "mlx_metal_v4"
}
}
],
"buildPresets": [
{
"name": "mlx_cuda_v13_linux",
"configurePreset": "mlx_cuda_v13_linux",
"targets": [ "mlx", "mlxc" ]
},
{
"name": "mlx_cuda_v13_windows",
"configurePreset": "mlx_cuda_v13_windows",
"targets": [ "mlx", "mlxc" ]
},
{
"name": "mlx_cuda_v13_user_arch",
"configurePreset": "mlx_cuda_v13_user_arch",
"targets": [ "mlx", "mlxc" ]
},
{
"name": "mlx_metal_v3",
"configurePreset": "mlx_metal_v3",
"targets": [ "mlx", "mlxc" ]
},
{
"name": "mlx_metal_v4",
"configurePreset": "mlx_metal_v4",
"targets": [ "mlx", "mlxc" ]
}
]
}


@ -0,0 +1,14 @@
if(NOT DEFINED MLX_C_HEADERS_DIR OR NOT IS_DIRECTORY "${MLX_C_HEADERS_DIR}")
message(FATAL_ERROR "MLX_C_HEADERS_DIR does not exist: ${MLX_C_HEADERS_DIR}")
endif()
if(NOT DEFINED MLX_C_HEADERS_DEST OR "${MLX_C_HEADERS_DEST}" STREQUAL "")
message(FATAL_ERROR "MLX_C_HEADERS_DEST is required")
endif()
file(GLOB _mlx_c_headers LIST_DIRECTORIES false "${MLX_C_HEADERS_DIR}/*.h")
if(NOT _mlx_c_headers)
message(FATAL_ERROR "No MLX-C headers found in ${MLX_C_HEADERS_DIR}")
endif()
file(MAKE_DIRECTORY "${MLX_C_HEADERS_DEST}")
file(COPY ${_mlx_c_headers} DESTINATION "${MLX_C_HEADERS_DEST}")


@ -0,0 +1,69 @@
set(CMAKE_SYSTEM_NAME Windows)
set(CMAKE_SYSTEM_PROCESSOR ARM64)
set(_ollama_llvm_mingw_hints)
if(DEFINED ENV{ProgramFiles})
file(GLOB _ollama_program_files_llvm_mingw_bins
LIST_DIRECTORIES true
"$ENV{ProgramFiles}/llvm-mingw-*-x86_64*/bin")
list(SORT _ollama_program_files_llvm_mingw_bins COMPARE NATURAL ORDER DESCENDING)
list(APPEND _ollama_llvm_mingw_hints ${_ollama_program_files_llvm_mingw_bins})
endif()
if(DEFINED ENV{LOCALAPPDATA})
file(GLOB _ollama_winget_llvm_mingw_bins
LIST_DIRECTORIES true
"$ENV{LOCALAPPDATA}/Microsoft/WinGet/Packages/MartinStorsjo.LLVM-MinGW*/llvm-mingw-*-x86_64*/bin")
list(SORT _ollama_winget_llvm_mingw_bins COMPARE NATURAL ORDER DESCENDING)
list(APPEND _ollama_llvm_mingw_hints ${_ollama_winget_llvm_mingw_bins})
endif()
if(NOT CMAKE_C_COMPILER)
find_program(CMAKE_C_COMPILER
NAMES aarch64-w64-mingw32-gcc
HINTS ${_ollama_llvm_mingw_hints}
REQUIRED)
endif()
if(NOT CMAKE_CXX_COMPILER)
find_program(CMAKE_CXX_COMPILER
NAMES aarch64-w64-mingw32-g++
HINTS ${_ollama_llvm_mingw_hints}
REQUIRED)
endif()
get_filename_component(_ollama_llvm_mingw_bin_dir "${CMAKE_CXX_COMPILER}" DIRECTORY)
if(NOT HOST_CXX_COMPILER)
find_program(_ollama_path_host_cxx
NAMES clang++ g++
NO_CMAKE_FIND_ROOT_PATH)
if(_ollama_path_host_cxx)
set(HOST_CXX_COMPILER "${_ollama_path_host_cxx}")
endif()
endif()
if(NOT HOST_CXX_COMPILER)
find_program(_ollama_mingw_host_cxx
NAMES x86_64-w64-mingw32-g++
HINTS "${_ollama_llvm_mingw_bin_dir}"
REQUIRED)
if(CMAKE_HOST_WIN32)
# llama.cpp builds a small host-only UI embedding tool during
# cross-compiles, but currently models HOST_CXX_COMPILER as only an
# executable path and has no companion host flags hook. When the host
# compiler is llvm-mingw, the generated host tool otherwise depends on
# llvm-mingw runtime DLLs being on PATH. Keep that workaround local and
# explicit: wrap the compiler only to add -static for this host tool.
set(_ollama_host_cxx_wrapper "${CMAKE_BINARY_DIR}/ollama-host-cxx.cmd")
file(TO_NATIVE_PATH "${_ollama_mingw_host_cxx}" _ollama_mingw_host_cxx_native)
file(WRITE "${_ollama_host_cxx_wrapper}"
"@echo off\r\n"
"\"${_ollama_mingw_host_cxx_native}\" -static %*\r\n")
set(HOST_CXX_COMPILER "${_ollama_host_cxx_wrapper}")
else()
set(HOST_CXX_COMPILER "${_ollama_mingw_host_cxx}")
endif()
endif()
set(HOST_CXX_COMPILER "${HOST_CXX_COMPILER}" CACHE FILEPATH "Host C++ compiler for build-time tools" FORCE)
string(PREPEND CMAKE_C_FLAGS_INIT "-D_WIN32_WINNT=0x0A00 ")
string(PREPEND CMAKE_CXX_FLAGS_INIT "-D_WIN32_WINNT=0x0A00 ")


@ -18,6 +18,7 @@ import (
"os"
"os/exec"
"os/signal"
"path"
"path/filepath"
"runtime"
"slices"
@ -41,6 +42,7 @@ import (
"github.com/ollama/ollama/cmd/config"
"github.com/ollama/ollama/cmd/launch"
"github.com/ollama/ollama/cmd/tui"
"github.com/ollama/ollama/discover"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/format"
"github.com/ollama/ollama/internal/modelref"
@ -232,9 +234,6 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
// This gates both safetensors LLM and imagegen model creation
experimental, _ := cmd.Flags().GetBool("experimental")
draftQuantize, _ := cmd.Flags().GetString("draft-quantize")
if draftQuantize != "" && !experimental {
return errors.New("--draft-quantize requires --experimental")
}
if experimental {
if !isLocalhost() {
return errors.New("remote safetensor model creation not yet supported")
@ -329,6 +328,12 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
if quantize != "" {
req.Quantize = quantize
}
if draftQuantize != "" {
if len(req.DraftFiles) == 0 {
return errors.New("--draft-quantize requires a DRAFT model")
}
req.DraftQuantize = draftQuantize
}
client, err := api.ClientFromEnvironment()
if err != nil {
@ -339,29 +344,40 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
g.SetLimit(max(runtime.GOMAXPROCS(0)-1, 1))
files := syncmap.NewSyncMap[string, string]()
fileNames := createRequestFileNames(req.Files)
for f, digest := range req.Files {
g.Go(func() error {
if _, err := createBlob(cmd, client, f, digest, p); err != nil {
return err
}
// TODO: this is incorrect since the file might be in a subdirectory
// instead this should take the path relative to the model directory
// but the current implementation does not allow this
files.Store(filepath.Base(f), digest)
files.Store(fileNames[f], digest)
return nil
})
}
adapters := syncmap.NewSyncMap[string, string]()
adapterNames := createRequestFileNames(req.Adapters)
for f, digest := range req.Adapters {
g.Go(func() error {
if _, err := createBlob(cmd, client, f, digest, p); err != nil {
return err
}
// TODO: same here
adapters.Store(filepath.Base(f), digest)
adapters.Store(adapterNames[f], digest)
return nil
})
}
draftFiles := syncmap.NewSyncMap[string, string]()
draftFileNames := createRequestFileNames(req.DraftFiles)
for f, digest := range req.DraftFiles {
g.Go(func() error {
if _, err := createBlob(cmd, client, f, digest, p); err != nil {
return err
}
draftFiles.Store(draftFileNames[f], digest)
return nil
})
}
@ -372,6 +388,7 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
req.Files = files.Items()
req.Adapters = adapters.Items()
req.DraftFiles = draftFiles.Items()
bars := make(map[string]*progress.Bar)
fn := func(resp api.ProgressResponse) error {
@ -409,6 +426,65 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
return nil
}
func createRequestFileNames(files map[string]string) map[string]string {
names := make(map[string]string, len(files))
root, ok := commonFileRoot(files)
for f := range files {
name := filepath.Base(f)
if ok {
abs, err := filepath.Abs(f)
if err == nil {
if rel, err := filepath.Rel(root, abs); err == nil && rel != "." && rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
name = rel
}
}
}
names[f] = path.Clean(filepath.ToSlash(name))
}
return names
}
func commonFileRoot(files map[string]string) (string, bool) {
if len(files) < 2 {
return "", false
}
var root string
var volume string
for f := range files {
abs, err := filepath.Abs(f)
if err != nil {
return "", false
}
if nextVolume := filepath.VolumeName(abs); volume == "" {
volume = nextVolume
} else if !strings.EqualFold(volume, nextVolume) {
return "", false
}
dir := filepath.Dir(abs)
if root == "" {
root = dir
continue
}
for {
rel, err := filepath.Rel(root, dir)
if err == nil && (rel == "." || (rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)))) {
break
}
parent := filepath.Dir(root)
if parent == root {
return "", false
}
root = parent
}
}
return root, root != ""
}
func createBlob(cmd *cobra.Command, client *api.Client, path string, digest string, p *progress.Progress) (string, error) {
realPath, err := filepath.EvalSymlinks(path)
if err != nil {
@ -1277,11 +1353,28 @@ func showInfo(resp *api.ShowResponse, verbose bool, w io.Writer) error {
if resp.ProjectorInfo != nil {
tableRender("Projector", func() (rows [][]string) {
arch := resp.ProjectorInfo["general.architecture"].(string)
rows = append(rows, []string{"", "architecture", arch})
rows = append(rows, []string{"", "parameters", format.HumanNumber(uint64(resp.ProjectorInfo["general.parameter_count"].(float64)))})
rows = append(rows, []string{"", "embedding length", strconv.FormatFloat(resp.ProjectorInfo[fmt.Sprintf("%s.vision.embedding_length", arch)].(float64), 'f', -1, 64)})
rows = append(rows, []string{"", "dimensions", strconv.FormatFloat(resp.ProjectorInfo[fmt.Sprintf("%s.vision.projection_dim", arch)].(float64), 'f', -1, 64)})
arch, _ := resp.ProjectorInfo["general.architecture"].(string)
if arch != "" {
rows = append(rows, []string{"", "architecture", arch})
}
if v, ok := resp.ProjectorInfo["general.parameter_count"].(float64); ok {
rows = append(rows, []string{"", "parameters", format.HumanNumber(uint64(v))})
}
projectorValue := func(suffix string) (float64, bool) {
for _, modality := range []string{"vision", "audio"} {
if v, ok := resp.ProjectorInfo[fmt.Sprintf("%s.%s.%s", arch, modality, suffix)].(float64); ok {
return v, true
}
}
return 0, false
}
if v, ok := projectorValue("embedding_length"); ok {
rows = append(rows, []string{"", "embedding length", strconv.FormatFloat(v, 'f', -1, 64)})
}
if v, ok := projectorValue("projection_dim"); ok {
rows = append(rows, []string{"", "dimensions", strconv.FormatFloat(v, 'f', -1, 64)})
}
return
})
}
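Note on the showInfo hunk above: the new code replaces bare type assertions, which panic when a key is missing or holds a different type, with the comma-ok form, and it tries both a vision and an audio key before giving up. The following is a minimal standalone sketch of that lookup, not the code from cmd.go. The free function `projectorValue` taking the map as a parameter, and the sample metadata keys, are assumptions made for illustration.

```go
package main

import "fmt"

// projectorValue looks up "<arch>.<modality>.<suffix>" for each modality in
// turn and returns the first value that is stored as a float64. A missing key
// or a value of another type simply fails the comma-ok assertion; it never panics.
func projectorValue(info map[string]any, arch, suffix string) (float64, bool) {
	for _, modality := range []string{"vision", "audio"} {
		key := fmt.Sprintf("%s.%s.%s", arch, modality, suffix)
		if v, ok := info[key].(float64); ok {
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Hypothetical metadata for an audio-only projector (illustrative keys).
	info := map[string]any{
		"general.architecture":        "clip",
		"clip.audio.embedding_length": float64(1280),
	}

	// The comma-ok assertion yields "" instead of panicking if the key is absent.
	arch, _ := info["general.architecture"].(string)

	if v, ok := projectorValue(info, arch, "embedding_length"); ok {
		fmt.Println("embedding length:", v)
	}
	if _, ok := projectorValue(info, arch, "projection_dim"); !ok {
		fmt.Println("projection_dim not present, row skipped")
	}
}
```

With this shape, a projector that only reports audio metadata still renders the rows it has, and rows without data are omitted rather than crashing the command.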
@ -2277,9 +2370,6 @@ func NewCLI() *cobra.Command {
if experimental, _ := cmd.Flags().GetBool("experimental"); experimental {
return nil
}
if draftQuantize, _ := cmd.Flags().GetString("draft-quantize"); draftQuantize != "" {
return errors.New("--draft-quantize requires --experimental")
}
return checkServerHeartbeat(cmd, args)
},
RunE: CreateHandler,
@ -2445,6 +2535,16 @@ func NewCLI() *cobra.Command {
_ = runner.Execute(args[1:])
})
var gpuDiscoverLibDirs []string
gpuDiscoverCmd := &cobra.Command{
Use: "gpu-discover",
Hidden: true,
RunE: func(cmd *cobra.Command, _ []string) error {
return discover.RunNativeProbeCommand(cmd.Context(), gpuDiscoverLibDirs, os.Stdout)
},
}
gpuDiscoverCmd.Flags().StringArrayVar(&gpuDiscoverLibDirs, "lib-dir", nil, "Ollama runtime library directory")
envVars := envconfig.AsMap()
envs := []envconfig.EnvVar{envVars["OLLAMA_HOST"]}
@ -2485,6 +2585,9 @@ func NewCLI() *cobra.Command {
envVars["OLLAMA_KV_CACHE_TYPE"],
envVars["OLLAMA_LLM_LIBRARY"],
envVars["OLLAMA_GPU_OVERHEAD"],
envVars["OLLAMA_IGPU_ENABLE"],
envVars["LLAMA_ARG_FIT"],
envVars["LLAMA_ARG_FIT_TARGET"],
envVars["OLLAMA_LOAD_TIMEOUT"],
})
default:
@ -2509,6 +2612,7 @@ func NewCLI() *cobra.Command {
copyCmd,
deleteCmd,
runnerCmd,
gpuDiscoverCmd,
launch.LaunchCmd(checkServerHeartbeat, runInteractiveTUI),
)
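Note on createRequestFileNames and commonFileRoot in this file: together they name each uploaded file by its path relative to the deepest directory shared by all files, so that `2_Dense/config.json` no longer collides with a top-level `config.json`. The sketch below illustrates that idea in isolation and is not the code from the diff. The names `within`, `commonRoot`, and `requestNames` are hypothetical, and the sketch assumes the inputs are a slice of paths that are either all absolute or all relative, so it skips the `filepath.Abs` call and the Windows volume comparison that the real functions perform.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// within reports whether target is root itself or lies beneath it.
func within(root, target string) bool {
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return false
	}
	return rel == "." || (rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)))
}

// commonRoot starts at the first file's directory and walks upward until the
// candidate contains every other file's directory. As in the diff, fewer than
// two files means there is no meaningful common root.
func commonRoot(files []string) (string, bool) {
	if len(files) < 2 {
		return "", false
	}
	root := filepath.Dir(files[0])
	for _, f := range files[1:] {
		dir := filepath.Dir(f)
		for !within(root, dir) {
			parent := filepath.Dir(root)
			if parent == root {
				return "", false // reached the top without finding a shared ancestor
			}
			root = parent
		}
	}
	return root, true
}

// requestNames maps each path to a slash-separated name relative to the common
// root, falling back to the base name when no root exists.
func requestNames(files []string) map[string]string {
	names := make(map[string]string, len(files))
	root, ok := commonRoot(files)
	for _, f := range files {
		name := filepath.Base(f)
		if ok {
			if rel, err := filepath.Rel(root, f); err == nil && rel != "." && within(root, f) {
				name = rel
			}
		}
		names[f] = path.Clean(filepath.ToSlash(name))
	}
	return names
}

func main() {
	files := []string{
		"/models/m/2_Dense/config.json",
		"/models/m/config.json",
		"/models/m/model.safetensors",
	}
	names := requestNames(files)
	for _, f := range files {
		fmt.Printf("%s -> %s\n", f, names[f])
	}
}
```

Normalizing the result with `filepath.ToSlash` and `path.Clean` keeps the names identical across platforms, which is what the new tests in this PR compare against.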


@ -1525,34 +1525,65 @@ func TestCreateHandler(t *testing.T) {
}
}
func TestCreateHandlerDraftQuantizeRequiresExperimental(t *testing.T) {
cmd := &cobra.Command{}
cmd.Flags().Bool("experimental", false, "")
cmd.Flags().String("draft-quantize", "mxfp8", "")
cmd.SetContext(t.Context())
func TestCreateRequestFileNamesPreservesModelDirectoryLayout(t *testing.T) {
root := t.TempDir()
files := map[string]string{
filepath.Join(root, "model.safetensors"): "sha256:model",
filepath.Join(root, "config.json"): "sha256:config",
filepath.Join(root, "2_Dense", "config.json"): "sha256:dense-config",
filepath.Join(root, "2_Dense", "model.safetensors"): "sha256:dense-model",
}
err := CreateHandler(cmd, []string{"test-model"})
if err == nil || !strings.Contains(err.Error(), "--draft-quantize requires --experimental") {
t.Fatalf("error = %v, want draft-quantize requires experimental", err)
got := createRequestFileNames(files)
want := map[string]string{
filepath.Join(root, "model.safetensors"): "model.safetensors",
filepath.Join(root, "config.json"): "config.json",
filepath.Join(root, "2_Dense", "config.json"): "2_Dense/config.json",
filepath.Join(root, "2_Dense", "model.safetensors"): "2_Dense/model.safetensors",
}
if diff := cmp.Diff(want, got); diff != "" {
t.Fatalf("mismatch (-want +got):\n%s", diff)
}
}
func TestCreateHandlerDraftRequiresExperimental(t *testing.T) {
func TestCreateRequestFileNamesPreservesRelativeModelDirectoryLayout(t *testing.T) {
root := t.TempDir()
t.Chdir(root)
files := map[string]string{
"model.safetensors": "sha256:model",
"config.json": "sha256:config",
"2_Dense/config.json": "sha256:dense-config",
"2_Dense/model.safetensors": "sha256:dense-model",
"3_Dense/config.json": "sha256:dense-config",
"3_Dense/model.safetensors": "sha256:dense-model",
}
got := createRequestFileNames(files)
for file := range files {
if got[file] != filepath.ToSlash(file) {
t.Fatalf("%s = %q, want %q", file, got[file], filepath.ToSlash(file))
}
}
}
func TestCreateHandlerDraftQuantizeRequiresDraft(t *testing.T) {
dir := t.TempDir()
modelfile := filepath.Join(dir, "Modelfile")
if err := os.WriteFile(modelfile, []byte("FROM base\nDRAFT ./assistant\n"), 0o644); err != nil {
if err := os.WriteFile(modelfile, []byte("FROM base\n"), 0o644); err != nil {
t.Fatal(err)
}
cmd := &cobra.Command{}
cmd.Flags().Bool("experimental", false, "")
cmd.Flags().String("draft-quantize", "", "")
cmd.Flags().String("file", modelfile, "")
cmd.Flags().String("draft-quantize", "mxfp8", "")
cmd.SetContext(t.Context())
err := CreateHandler(cmd, []string{"test-model"})
if err == nil || !strings.Contains(err.Error(), "DRAFT requires --experimental") {
t.Fatalf("error = %v, want DRAFT requires --experimental", err)
if err == nil || !strings.Contains(err.Error(), "--draft-quantize requires a DRAFT model") {
t.Fatalf("error = %v, want draft-quantize requires DRAFT", err)
}
}


@@ -496,17 +496,6 @@ func isCloudModelName(name string) bool {
return modelref.HasExplicitCloudSource(name)
}
// filterCloudModels drops remote-only models from the given inventory.
func filterCloudModels(existing []modelInfo) []modelInfo {
filtered := existing[:0]
for _, m := range existing {
if !m.Remote {
filtered = append(filtered, m)
}
}
return filtered
}
// filterCloudItems removes cloud models from selection items.
func filterCloudItems(items []ModelItem) []ModelItem {
filtered := items[:0]


@@ -147,7 +147,9 @@ func (ModelParameters) KV(t *Tokenizer) KV {
}
for _, sv := range t.SpecialVocabulary {
kv[fmt.Sprintf("tokenizer.ggml.add_%s_token", sv.Key())] = sv.AddToken
if sv.AddTokenSet {
kv[fmt.Sprintf("tokenizer.ggml.add_%s_token", sv.Key())] = sv.AddToken
}
kv[fmt.Sprintf("tokenizer.ggml.%s_token_id", sv.Key())] = uint32(sv.ID)
if len(sv.IDs) > 0 {
kv[fmt.Sprintf("tokenizer.ggml.%s_token_ids", sv.Key())] = sv.IDs
@@ -200,10 +202,32 @@ type ModelConverter interface {
specialTokenTypes() []string
}
// MultimodalConverter splits checkpoints with embedded vision/projector
// weights into a text model GGUF and a separate projector GGUF.
type MultimodalConverter interface {
ModelConverter
TextKV(*Tokenizer) KV
TextTensors([]Tensor, *Tokenizer) []*ggml.Tensor
ProjectorKV(*Tokenizer) KV
ProjectorTensors([]Tensor) []*ggml.Tensor
}
type moreParser interface {
parseMore(fs.FS) error
}
type extraTensorParser interface {
extraTensors(fs.FS) ([]Tensor, error)
}
type tokenizerAdjuster interface {
adjustTokenizer(*Tokenizer)
}
type tokenizerAwareTensorConverter interface {
TensorsWithTokenizer([]Tensor, *Tokenizer) []*ggml.Tensor
}
type AdapterConverter interface {
// KV maps parameters to LLM key-values
KV(ofs.Config) KV
@@ -288,6 +312,8 @@ func LoadModelMetadata(fsys fs.FS) (ModelKV, *Tokenizer, error) {
conv = &gemma2Model{}
case "Gemma3ForCausalLM", "Gemma3ForConditionalGeneration":
conv = &gemma3Model{Architecture: p.Architectures[0]}
case "Gemma3TextModel":
conv = &embeddingGemmaModel{}
case "Gemma3nForConditionalGeneration":
conv = &gemma3nModel{}
case "Gemma4ForCausalLM", "Gemma4ForConditionalGeneration":
@@ -348,6 +374,9 @@ func LoadModelMetadata(fsys fs.FS) (ModelKV, *Tokenizer, error) {
if err != nil {
return nil, nil, err
}
if ta, ok := conv.(tokenizerAdjuster); ok {
ta.adjustTokenizer(t)
}
vocabSize := int(cmp.Or(p.VocabSize, p.TextModel.VocabSize))
@@ -375,7 +404,7 @@ func LoadModelMetadata(fsys fs.FS) (ModelKV, *Tokenizer, error) {
// and files it finds in the input path.
// Supported input model formats include safetensors.
// Supported input tokenizer files include tokenizer.json (preferred) and tokenizer.model.
func ConvertModel(fsys fs.FS, f *os.File) error {
func ConvertModel(fsys fs.FS, f *os.File, projectorFiles ...*os.File) error {
kv, t, err := LoadModelMetadata(fsys)
if err != nil {
return err
@@ -387,7 +416,47 @@ func ConvertModel(fsys fs.FS, f *os.File) error {
return err
}
return writeFile(f, conv.KV(t), conv.Tensors(ts))
if tp, ok := conv.(extraTensorParser); ok {
extra, err := tp.extraTensors(fsys)
if err != nil {
return err
}
ts = append(ts, extra...)
}
if err := ensureUniqueTensorNames(ts); err != nil {
return err
}
if mc, ok := conv.(MultimodalConverter); ok && len(projectorFiles) > 0 && projectorFiles[0] != nil {
projectorTensors := mc.ProjectorTensors(ts)
if len(projectorTensors) > 0 {
if err := writeFile(f, mc.TextKV(t), mc.TextTensors(ts, t)); err != nil {
return err
}
return writeFile(projectorFiles[0], mc.ProjectorKV(t), projectorTensors)
}
}
var tensors []*ggml.Tensor
if tc, ok := conv.(tokenizerAwareTensorConverter); ok {
tensors = tc.TensorsWithTokenizer(ts, t)
} else {
tensors = conv.Tensors(ts)
}
return writeFile(f, conv.KV(t), tensors)
}
func ensureUniqueTensorNames(ts []Tensor) error {
names := make(map[string]struct{}, len(ts))
for _, t := range ts {
if _, ok := names[t.Name()]; ok {
return fmt.Errorf("duplicate tensor name '%s' was found for this model", t.Name())
}
names[t.Name()] = struct{}{}
}
return nil
}
func writeFile(f *os.File, kv KV, ts []*ggml.Tensor) error {
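The `ConvertModel` changes above probe the converter for optional interfaces (`extraTensorParser`, `tokenizerAdjuster`, `tokenizerAwareTensorConverter`) with type assertions instead of widening `ModelConverter`. The sketch below shows that pattern in isolation. All type and function names in it (`converter`, `adjuster`, `plain`, `trimming`, `prepare`) are stand-ins invented for illustration and do not exist in the repository.

```go
package main

import "fmt"

// converter stands in for the required interface every model implements.
type converter interface {
	Name() string
}

// adjuster stands in for an optional capability that only some models have.
type adjuster interface {
	Adjust(tokens []string) []string
}

// plain implements only the required interface.
type plain struct{}

func (plain) Name() string { return "plain" }

// trimming also implements the optional capability.
type trimming struct{ limit int }

func (trimming) Name() string { return "trimming" }

func (t trimming) Adjust(tokens []string) []string {
	if len(tokens) > t.limit {
		return tokens[:t.limit]
	}
	return tokens
}

// prepare applies the optional step only when the converter supports it.
func prepare(c converter, tokens []string) []string {
	if a, ok := c.(adjuster); ok {
		return a.Adjust(tokens)
	}
	return tokens
}

func main() {
	tokens := []string{"a", "b", "c"}
	fmt.Println(len(prepare(plain{}, tokens)))
	fmt.Println(len(prepare(trimming{limit: 2}, tokens)))
}
```

The benefit of this shape is that existing converters compile unchanged; only models that need the extra hook implement it.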


@@ -0,0 +1,280 @@
package convert
import (
"cmp"
"encoding/json"
"errors"
"fmt"
"io/fs"
"path"
"slices"
"strings"
"github.com/ollama/ollama/fs/ggml"
)
type embeddingGemmaModel struct {
gemmaModel
RopeLocalTheta float32 `json:"rope_local_base_freq"`
RopeTheta float32 `json:"rope_theta"`
SlidingWindow uint32 `json:"sliding_window"`
poolingType uint32
denseModules []embeddingGemmaDenseModule
}
type embeddingGemmaDenseModule struct {
path string
tensorName string
in, out uint32
}
var (
_ ModelConverter = (*embeddingGemmaModel)(nil)
_ moreParser = (*embeddingGemmaModel)(nil)
_ extraTensorParser = (*embeddingGemmaModel)(nil)
_ tokenizerAdjuster = (*embeddingGemmaModel)(nil)
)
func (m *embeddingGemmaModel) KV(t *Tokenizer) KV {
kv := m.ModelParameters.KV(t)
kv["general.architecture"] = "gemma-embedding"
kv["gemma-embedding.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(2048))
kv["gemma-embedding.embedding_length"] = m.HiddenSize
kv["gemma-embedding.block_count"] = m.HiddenLayers
kv["gemma-embedding.feed_forward_length"] = m.IntermediateSize
kv["gemma-embedding.attention.head_count"] = m.NumAttentionHeads
kv["gemma-embedding.attention.head_count_kv"] = m.NumKeyValueHeads
kv["gemma-embedding.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEPS, float32(1e-6))
kv["gemma-embedding.attention.key_length"] = m.HeadDim
kv["gemma-embedding.attention.value_length"] = m.HeadDim
kv["gemma-embedding.attention.sliding_window"] = m.SlidingWindow
kv["gemma-embedding.rope.freq_base"] = cmp.Or(m.RopeTheta, float32(1000000.0))
kv["gemma-embedding.rope.freq_base_swa"] = cmp.Or(m.RopeLocalTheta, float32(10000.0))
kv["gemma-embedding.pooling_type"] = cmp.Or(m.poolingType, uint32(1))
for _, dense := range m.denseModules {
kv["gemma-embedding."+dense.tensorName+"_feat_in"] = dense.in
kv["gemma-embedding."+dense.tensorName+"_feat_out"] = dense.out
}
return kv
}
func (m *embeddingGemmaModel) parseMore(fsys fs.FS) error {
bts, err := fs.ReadFile(fsys, "modules.json")
if err != nil {
if errors.Is(err, fs.ErrNotExist) {
return errors.New("embeddinggemma requires sentence-transformers modules.json")
}
return err
}
var modules []struct {
Type string `json:"type"`
Path string `json:"path"`
}
if err := json.Unmarshal(bts, &modules); err != nil {
return err
}
m.poolingType = 1
m.denseModules = nil
for _, module := range modules {
switch module.Type {
case "sentence_transformers.models.Pooling":
poolingType, err := embeddingGemmaPoolingType(fsys, module.Path)
if err != nil {
return err
}
if poolingType != 0 {
m.poolingType = poolingType
}
case "sentence_transformers.models.Dense":
dense, ok, err := embeddingGemmaDenseModuleConfig(fsys, module.Path)
if err != nil {
return err
}
if ok {
m.denseModules = append(m.denseModules, dense)
}
}
}
slices.SortFunc(m.denseModules, func(a, b embeddingGemmaDenseModule) int {
return strings.Compare(a.tensorName, b.tensorName)
})
if len(m.denseModules) != 2 ||
m.denseModules[0].tensorName != "dense_2" ||
m.denseModules[1].tensorName != "dense_3" {
return errors.New("embeddinggemma requires sentence-transformers 2_Dense and 3_Dense modules")
}
return nil
}
func (m *embeddingGemmaModel) adjustTokenizer(t *Tokenizer) {
n := int(m.VocabSize)
if n == 0 || len(t.Vocabulary.Tokens) <= n {
return
}
t.Vocabulary.Tokens = t.Vocabulary.Tokens[:n]
if len(t.Vocabulary.Scores) > n {
t.Vocabulary.Scores = t.Vocabulary.Scores[:n]
}
if len(t.Vocabulary.Types) > n {
t.Vocabulary.Types = t.Vocabulary.Types[:n]
}
}
func embeddingGemmaPoolingType(fsys fs.FS, modulePath string) (uint32, error) {
if modulePath == "" {
return 0, nil
}
bts, err := fs.ReadFile(fsys, path.Join(modulePath, "config.json"))
if err != nil {
if errors.Is(err, fs.ErrNotExist) {
return 0, nil
}
return 0, err
}
var cfg struct {
PoolingModeMeanTokens bool `json:"pooling_mode_mean_tokens"`
PoolingModeCLSToken bool `json:"pooling_mode_cls_token"`
}
if err := json.Unmarshal(bts, &cfg); err != nil {
return 0, err
}
switch {
case cfg.PoolingModeMeanTokens:
return 1, nil
case cfg.PoolingModeCLSToken:
return 2, nil
default:
return 0, nil
}
}
func embeddingGemmaDenseModuleConfig(fsys fs.FS, modulePath string) (embeddingGemmaDenseModule, bool, error) {
tensorName, ok := embeddingGemmaDenseTensorName(modulePath)
if !ok {
return embeddingGemmaDenseModule{}, false, nil
}
weightsPath := path.Join(modulePath, "model.safetensors")
if _, err := fs.Stat(fsys, weightsPath); err != nil {
if errors.Is(err, fs.ErrNotExist) {
return embeddingGemmaDenseModule{}, false, nil
}
return embeddingGemmaDenseModule{}, false, err
}
bts, err := fs.ReadFile(fsys, path.Join(modulePath, "config.json"))
if err != nil {
return embeddingGemmaDenseModule{}, false, err
}
var cfg struct {
InFeatures uint32 `json:"in_features"`
OutFeatures uint32 `json:"out_features"`
Bias bool `json:"bias"`
}
if err := json.Unmarshal(bts, &cfg); err != nil {
return embeddingGemmaDenseModule{}, false, err
}
if cfg.InFeatures == 0 || cfg.OutFeatures == 0 {
return embeddingGemmaDenseModule{}, false, errors.New("embeddinggemma dense layer config missing in/out features")
}
if cfg.Bias {
return embeddingGemmaDenseModule{}, false, fmt.Errorf("embeddinggemma dense layer %s has unsupported bias", modulePath)
}
return embeddingGemmaDenseModule{
path: weightsPath,
tensorName: tensorName,
in: cfg.InFeatures,
out: cfg.OutFeatures,
}, true, nil
}
func embeddingGemmaDenseTensorName(modulePath string) (string, bool) {
switch modulePath {
case "2_Dense":
return "dense_2", true
case "3_Dense":
return "dense_3", true
default:
return "", false
}
}
func (m *embeddingGemmaModel) extraTensors(fsys fs.FS) ([]Tensor, error) {
var extra []Tensor
for _, dense := range m.denseModules {
ts, err := parseSafetensors(fsys, strings.NewReplacer("linear.", dense.tensorName+"."), dense.path)
if err != nil {
return nil, err
}
foundWeight := false
for _, t := range ts {
if t.Name() == dense.tensorName+".weight" {
extra = append(extra, t)
foundWeight = true
}
}
if !foundWeight {
return nil, fmt.Errorf("embeddinggemma dense module %s missing linear.weight", dense.path)
}
}
return extra, nil
}
func (m *embeddingGemmaModel) Tensors(ts []Tensor) []*ggml.Tensor {
out := make([]*ggml.Tensor, 0, len(ts))
for _, t := range ts {
name := t.Name()
if name == "norm.weight" {
name = "output_norm.weight"
}
if strings.HasSuffix(name, "_norm.weight") {
t.SetRepacker(m.addOne)
}
out = append(out, &ggml.Tensor{
Name: name,
Kind: t.Kind(),
Shape: t.Shape(),
WriterTo: t,
})
}
return out
}
func (m *embeddingGemmaModel) Replacements() []string {
return []string{
"embed_tokens.", "token_embd.",
"layers.", "blk.",
"input_layernorm", "attn_norm",
"self_attn.q_proj", "attn_q",
"self_attn.q_norm", "attn_q_norm",
"self_attn.k_proj", "attn_k",
"self_attn.k_norm", "attn_k_norm",
"self_attn.v_proj", "attn_v",
"self_attn.o_proj", "attn_output",
"mlp.gate_proj", "ffn_gate",
"mlp.down_proj", "ffn_down",
"mlp.up_proj", "ffn_up",
"post_attention_layernorm", "post_attention_norm",
"pre_feedforward_layernorm", "ffn_norm",
"post_feedforward_layernorm", "post_ffw_norm",
}
}
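`embeddingGemmaPoolingType` above turns the boolean flags of a sentence-transformers `1_Pooling/config.json` into a numeric pooling type, returning 0 when neither flag is set so the caller keeps its default. A self-contained sketch of that mapping follows. The constant names and the `poolingTypeFromConfig` helper are assumptions made for this example; only the values 1 (mean) and 2 (CLS) come from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Constant names are illustrative; the values mirror the diff.
const (
	poolingUnset uint32 = 0 // neither flag set, caller keeps its default
	poolingMean  uint32 = 1
	poolingCLS   uint32 = 2
)

// poolingTypeFromConfig is a hypothetical helper that decodes the two
// flags read by the converter and picks the first one that is enabled.
func poolingTypeFromConfig(data []byte) (uint32, error) {
	var cfg struct {
		Mean bool `json:"pooling_mode_mean_tokens"`
		CLS  bool `json:"pooling_mode_cls_token"`
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return poolingUnset, err
	}
	switch {
	case cfg.Mean:
		return poolingMean, nil
	case cfg.CLS:
		return poolingCLS, nil
	default:
		return poolingUnset, nil
	}
}

func main() {
	for _, input := range []string{
		`{"pooling_mode_mean_tokens": true}`,
		`{"pooling_mode_cls_token": true}`,
		`{}`,
	} {
		pt, err := poolingTypeFromConfig([]byte(input))
		fmt.Println(input, pt, err)
	}
}
```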


@@ -0,0 +1,229 @@
package convert
import (
"bytes"
"encoding/binary"
"encoding/json"
"io"
"math"
"os"
"path/filepath"
"slices"
"testing"
"github.com/ollama/ollama/fs/ggml"
)
func TestConvertEmbeddingGemmaSentenceTransformers(t *testing.T) {
tempDir := t.TempDir()
writeJSONFile(t, filepath.Join(tempDir, "config.json"), map[string]any{
"architectures": []string{"Gemma3TextModel"},
"vocab_size": uint32(4),
"max_position_embeddings": uint32(2048),
"hidden_size": uint32(8),
"num_hidden_layers": uint32(1),
"intermediate_size": uint32(12),
"num_attention_heads": uint32(1),
"num_key_value_heads": uint32(1),
"head_dim": uint32(8),
"rms_norm_eps": float32(1e-6),
"rope_theta": float32(1000000),
"rope_local_base_freq": float32(10000),
"sliding_window": uint32(512),
"use_bidirectional_attention": true,
})
writeJSONFile(t, filepath.Join(tempDir, "tokenizer.json"), map[string]any{
"model": map[string]any{
"vocab": map[string]int{
"<pad>": 0,
"<eos>": 1,
"<bos>": 2,
"<unk>": 3,
},
},
"added_tokens": []map[string]any{
{"id": 4, "content": "<image_soft_token>", "special": true},
},
})
writeJSONFile(t, filepath.Join(tempDir, "modules.json"), []map[string]string{
{"type": "sentence_transformers.models.Transformer", "path": ""},
{"type": "sentence_transformers.models.Pooling", "path": "1_Pooling"},
{"type": "sentence_transformers.models.Dense", "path": "2_Dense"},
{"type": "sentence_transformers.models.Dense", "path": "3_Dense"},
{"type": "sentence_transformers.models.Normalize", "path": "4_Normalize"},
})
writeJSONFile(t, filepath.Join(tempDir, "1_Pooling", "config.json"), map[string]any{
"pooling_mode_mean_tokens": true,
})
writeJSONFile(t, filepath.Join(tempDir, "2_Dense", "config.json"), map[string]any{
"in_features": uint32(8),
"out_features": uint32(16),
"bias": false,
})
writeJSONFile(t, filepath.Join(tempDir, "3_Dense", "config.json"), map[string]any{
"in_features": uint32(16),
"out_features": uint32(8),
"bias": false,
})
writeSafetensorsFile(t, filepath.Join(tempDir, "model.safetensors"), []safetensorFixtureTensor{
{name: "embed_tokens.weight", shape: []int{4, 8}},
{name: "norm.weight", shape: []int{8}},
{name: "layers.0.input_layernorm.weight", shape: []int{8}},
{name: "layers.0.self_attn.q_proj.weight", shape: []int{8, 8}},
})
writeSafetensorsFile(t, filepath.Join(tempDir, "2_Dense", "model.safetensors"), []safetensorFixtureTensor{
{name: "linear.weight", shape: []int{16, 8}},
})
writeSafetensorsFile(t, filepath.Join(tempDir, "3_Dense", "model.safetensors"), []safetensorFixtureTensor{
{name: "linear.weight", shape: []int{8, 16}},
})
f, kv, tensors := convertFull(t, os.DirFS(tempDir))
defer f.Close()
if got := kv.Architecture(); got != "gemma-embedding" {
t.Fatalf("architecture = %q, want gemma-embedding", got)
}
for key, want := range map[string]uint32{
"dense_2_feat_in": 8,
"dense_2_feat_out": 16,
"dense_3_feat_in": 16,
"dense_3_feat_out": 8,
"pooling_type": 1,
"attention.sliding_window": 512,
} {
if got := kv.Uint(key); got != want {
t.Errorf("%s = %d, want %d", key, got, want)
}
}
if got := kv.Float("rope.freq_base_swa"); got != 10000 {
t.Errorf("rope.freq_base_swa = %v, want 10000", got)
}
if got := kv.Strings("tokenizer.ggml.tokens"); len(got) != 4 {
t.Errorf("token count = %d, want 4", len(got))
}
names := tensorNames(tensors)
for _, name := range []string{
"token_embd.weight",
"output_norm.weight",
"blk.0.attn_norm.weight",
"blk.0.attn_q.weight",
"dense_2.weight",
"dense_3.weight",
} {
if !slices.Contains(names, name) {
t.Errorf("missing tensor %s", name)
}
}
assertF32TensorValues(t, f, tensors, "output_norm.weight", 1)
assertF32TensorValues(t, f, tensors, "blk.0.attn_norm.weight", 1)
}
type safetensorFixtureTensor struct {
name string
shape []int
}
func writeJSONFile(t *testing.T, path string, value any) {
t.Helper()
if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
t.Fatal(err)
}
bts, err := json.Marshal(value)
if err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, bts, 0o644); err != nil {
t.Fatal(err)
}
}
func writeSafetensorsFile(t *testing.T, path string, tensors []safetensorFixtureTensor) {
t.Helper()
if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
t.Fatal(err)
}
offset := 0
metadata := map[string]*tensorData{}
for _, tensor := range tensors {
size := 4
for _, dim := range tensor.shape {
size *= dim
}
metadata[tensor.name] = &tensorData{
Offsets: []int{offset, offset + size},
Type: "F32",
Shape: tensor.shape,
}
offset += size
}
header, err := json.Marshal(metadata)
if err != nil {
t.Fatal(err)
}
var buf bytes.Buffer
if err := binary.Write(&buf, binary.LittleEndian, int64(len(header))); err != nil {
t.Fatal(err)
}
if _, err := buf.Write(header); err != nil {
t.Fatal(err)
}
if _, err := buf.Write(make([]byte, offset)); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, buf.Bytes(), 0o644); err != nil {
t.Fatal(err)
}
}
func tensorNames(tensors ggml.Tensors) []string {
names := make([]string, 0, len(tensors.Items()))
for _, tensor := range tensors.Items() {
names = append(names, tensor.Name)
}
return names
}
func assertF32TensorValues(t *testing.T, f *os.File, tensors ggml.Tensors, name string, want float32) {
t.Helper()
var tensor *ggml.Tensor
for _, item := range tensors.Items() {
if item.Name == name {
tensor = item
break
}
}
if tensor == nil {
t.Fatalf("missing tensor %s", name)
}
if tensor.Kind != uint32(ggml.TensorTypeF32) {
t.Fatalf("%s kind = %d, want F32", name, tensor.Kind)
}
bts := make([]byte, tensor.Size())
reader := io.NewSectionReader(f, int64(tensors.Offset+tensor.Offset), int64(tensor.Size()))
if _, err := io.ReadFull(reader, bts); err != nil {
t.Fatal(err)
}
for i := 0; i < len(bts); i += 4 {
if got := math.Float32frombits(binary.LittleEndian.Uint32(bts[i:])); got != want {
t.Fatalf("%s[%d] = %v, want %v", name, i/4, got, want)
}
}
}


@@ -2,7 +2,11 @@ package convert
import (
"cmp"
"fmt"
"slices"
"strings"
"github.com/ollama/ollama/fs/ggml"
)
type gemma3Model struct {
@@ -178,3 +182,42 @@ func (p *gemma3Model) Replacements() []string {
"multi_modal_projector", "mm",
}
}
func (p *gemma3Model) TensorsWithTokenizer(ts []Tensor, t *Tokenizer) []*ggml.Tensor {
vocabSize := uint64(0)
if t != nil && t.Vocabulary != nil {
vocabSize = uint64(len(t.Vocabulary.Tokens))
}
var out []*ggml.Tensor
for _, tensor := range ts {
name := tensor.Name()
gt := &ggml.Tensor{
Name: name,
Kind: tensor.Kind(),
Shape: tensor.Shape(),
WriterTo: tensor,
}
if !strings.HasPrefix(name, "v.") && strings.HasSuffix(name, "_norm.weight") {
tensor.SetRepacker(p.addOne)
}
if vocabSize > 0 && name == "token_embd.weight" && len(gt.Shape) >= 2 && gt.Shape[0] > vocabSize {
gt.Shape = slices.Clone(gt.Shape)
embdDim := gt.Shape[1]
gt.Shape[0] = vocabSize
tensor.SetRepacker(func(_ string, data []float32, _ []uint64) ([]float32, error) {
n := vocabSize * embdDim
if uint64(len(data)) < n {
return nil, fmt.Errorf("gemma3 token_embd.weight has %d values, need %d", len(data), n)
}
return data[:n], nil
})
}
out = append(out, gt)
}
return out
}


@@ -0,0 +1,34 @@
package convert
import (
"slices"
"testing"
)
func TestGemma3TensorsWithTokenizerTruncatesPaddedEmbedding(t *testing.T) {
p := gemma3Model{}
embedding := &fakeTensor{
name: "token_embd.weight",
shape: []uint64{5, 2},
data: []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}
out := p.TensorsWithTokenizer([]Tensor{embedding}, &Tokenizer{
Vocabulary: &Vocabulary{Tokens: []string{"a", "b", "<image>"}},
})
if len(out) != 1 {
t.Fatalf("expected 1 tensor, got %d", len(out))
}
if got, want := out[0].Shape, []uint64{3, 2}; !slices.Equal(got, want) {
t.Fatalf("token_embd.weight shape = %v, want %v", got, want)
}
got, err := embedding.repacker(embedding.name, embedding.data, embedding.shape)
if err != nil {
t.Fatalf("unexpected repacker error: %v", err)
}
if want := embedding.data[:6]; !slices.Equal(got, want) {
t.Fatalf("truncated embedding = %v, want %v", got, want)
}
}
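The test above checks that a padded `token_embd.weight` of shape `[5, 2]` is cut to `[3, 2]` when the tokenizer has three tokens. The core of that step is a prefix slice over the flat data, which can be sketched as below. `truncateRows` is a hypothetical helper, and the sketch assumes the flat data is laid out row by row with the vocabulary as the leading dimension, as the test data suggests.

```go
package main

import "fmt"

// truncateRows is a hypothetical helper that keeps the first `keep` rows
// of a row-major matrix stored as a flat slice with `cols` values per row.
func truncateRows(data []float32, rows, cols, keep uint64) ([]float32, error) {
	if keep > rows {
		return nil, fmt.Errorf("cannot keep %d rows of %d", keep, rows)
	}
	n := keep * cols
	if uint64(len(data)) < n {
		return nil, fmt.Errorf("have %d values, need %d", len(data), n)
	}
	// The kept rows are a prefix of the flat data, so no copy is needed.
	return data[:n], nil
}

func main() {
	data := []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	out, err := truncateRows(data, 5, 2, 3)
	fmt.Println(out, err)
}
```

Because the result aliases the input slice, a caller that later mutates the original data would also change the truncated view.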


@@ -1,6 +1,8 @@
package convert
import (
"encoding/json"
"fmt"
"slices"
"strings"
@@ -14,30 +16,58 @@ type gemma3nModel struct {
ModelParameters
TextModel struct {
ActivationSparsityPattern []float32 `json:"activation_sparsity_pattern"`
AltupActiveIdx uint32 `json:"altup_active_idx"`
AltupCoefClip float32 `json:"altup_coef_clip"`
AltupCorrectScale bool `json:"altup_correct_scale"`
AltupLRMultiplier float32 `json:"altup_lr_multiplier"`
AltupNumInputs uint32 `json:"altup_num_inputs"`
HeadDim uint32 `json:"head_dim"`
HiddenSize uint32 `json:"hidden_size"`
HiddenSizePerLayerInput uint32 `json:"hidden_size_per_layer_input"`
IntermediateSize uint32 `json:"intermediate_size"`
MaxPositionEmbeddings uint32 `json:"max_position_embeddings"`
NumAttentionHeads uint32 `json:"num_attention_heads"`
NumHiddenLayers uint32 `json:"num_hidden_layers"`
NumKeyValueHeads uint32 `json:"num_key_value_heads"`
NumKVSharedLayers uint32 `json:"num_kv_shared_layers"`
RMSNormEPS float32 `json:"rms_norm_eps"`
RopeLocalBaseFreq float32 `json:"rope_local_base_freq"`
RopeTheta float32 `json:"rope_theta"`
SlidingWindow uint32 `json:"sliding_window"`
LayerTypes []string `json:"layer_types"`
ActivationSparsityPattern []float32 `json:"activation_sparsity_pattern"`
AltupActiveIdx uint32 `json:"altup_active_idx"`
AltupCoefClip float32 `json:"altup_coef_clip"`
AltupCorrectScale bool `json:"altup_correct_scale"`
AltupLRMultiplier float32 `json:"altup_lr_multiplier"`
AltupNumInputs uint32 `json:"altup_num_inputs"`
HeadDim uint32 `json:"head_dim"`
HiddenSize uint32 `json:"hidden_size"`
HiddenSizePerLayerInput uint32 `json:"hidden_size_per_layer_input"`
IntermediateSize gemma3nIntermediateSize `json:"intermediate_size"`
MaxPositionEmbeddings uint32 `json:"max_position_embeddings"`
NumAttentionHeads uint32 `json:"num_attention_heads"`
NumHiddenLayers uint32 `json:"num_hidden_layers"`
NumKeyValueHeads uint32 `json:"num_key_value_heads"`
NumKVSharedLayers uint32 `json:"num_kv_shared_layers"`
RMSNormEPS float32 `json:"rms_norm_eps"`
RopeLocalBaseFreq float32 `json:"rope_local_base_freq"`
RopeTheta float32 `json:"rope_theta"`
SlidingWindow uint32 `json:"sliding_window"`
LayerTypes []string `json:"layer_types"`
} `json:"text_config"`
VisionModel struct{} `json:"vision_config"`
}
type gemma3nIntermediateSize uint32
func (s *gemma3nIntermediateSize) UnmarshalJSON(data []byte) error {
var scalar uint32
if err := json.Unmarshal(data, &scalar); err == nil {
*s = gemma3nIntermediateSize(scalar)
return nil
}
var values []uint32
if err := json.Unmarshal(data, &values); err != nil {
return err
}
if len(values) == 0 {
return fmt.Errorf("intermediate_size must not be empty")
}
first := values[0]
for _, v := range values[1:] {
if v != first {
return fmt.Errorf("intermediate_size values must match")
}
}
*s = gemma3nIntermediateSize(first)
return nil
}
func (m *gemma3nModel) KV(t *Tokenizer) KV {
kv := m.ModelParameters.KV(t)
kv["general.architecture"] = "gemma3n"
@@ -69,7 +99,7 @@ func (m *gemma3nModel) KV(t *Tokenizer) KV {
kv["gemma3n.context_length"] = m.TextModel.MaxPositionEmbeddings
kv["gemma3n.embedding_length_per_layer_input"] = m.TextModel.HiddenSizePerLayerInput
kv["gemma3n.embedding_length"] = m.TextModel.HiddenSize
kv["gemma3n.feed_forward_length"] = m.TextModel.IntermediateSize
kv["gemma3n.feed_forward_length"] = uint32(m.TextModel.IntermediateSize)
kv["gemma3n.head_dim"] = m.TextModel.HeadDim
kv["gemma3n.rope.freq_base_local"] = m.TextModel.RopeLocalBaseFreq
kv["gemma3n.rope.freq_base"] = m.TextModel.RopeTheta


@@ -0,0 +1,55 @@
package convert
import (
"encoding/json"
"testing"
)
func TestGemma3nIntermediateSize(t *testing.T) {
tests := []struct {
name string
json string
want gemma3nIntermediateSize
wantErr bool
}{
{
name: "scalar",
json: `8192`,
want: 8192,
},
{
name: "uniform array",
json: `[8192,8192,8192]`,
want: 8192,
},
{
name: "mixed array",
json: `[8192,4096]`,
wantErr: true,
},
{
name: "empty array",
json: `[]`,
wantErr: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
var got gemma3nIntermediateSize
err := json.Unmarshal([]byte(tt.json), &got)
if tt.wantErr {
if err == nil {
t.Fatal("expected error")
}
return
}
if err != nil {
t.Fatal(err)
}
if got != tt.want {
t.Fatalf("got %d, want %d", got, tt.want)
}
})
}
}
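The table test above covers a field that may arrive as either a scalar or a per-layer array whose entries must all agree. The sketch below shows how such a type behaves when it is embedded in a config struct, which is how the converter uses it. The names `uniformUint32`, `textConfig`, and `parseTextConfig` are invented for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// uniformUint32 is an illustrative type that accepts a JSON number or a
// non-empty JSON array whose entries are all equal.
type uniformUint32 uint32

func (u *uniformUint32) UnmarshalJSON(data []byte) error {
	// Try the scalar form first.
	var scalar uint32
	if err := json.Unmarshal(data, &scalar); err == nil {
		*u = uniformUint32(scalar)
		return nil
	}
	// Fall back to the array form.
	var values []uint32
	if err := json.Unmarshal(data, &values); err != nil {
		return err
	}
	if len(values) == 0 {
		return errors.New("value list must not be empty")
	}
	for _, v := range values[1:] {
		if v != values[0] {
			return errors.New("value list entries must match")
		}
	}
	*u = uniformUint32(values[0])
	return nil
}

// textConfig is a hypothetical config struct using the flexible field.
type textConfig struct {
	IntermediateSize uniformUint32 `json:"intermediate_size"`
}

func parseTextConfig(data string) (textConfig, error) {
	var cfg textConfig
	err := json.Unmarshal([]byte(data), &cfg)
	return cfg, err
}

func main() {
	for _, input := range []string{
		`{"intermediate_size": 8192}`,
		`{"intermediate_size": [8192, 8192]}`,
		`{"intermediate_size": [8192, 4096]}`,
	} {
		cfg, err := parseTextConfig(input)
		fmt.Println(cfg.IntermediateSize, err)
	}
}
```

Rejecting mixed arrays, rather than silently taking the first entry, keeps a single `feed_forward_length` value from misdescribing a model with per-layer sizes.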


@@ -39,48 +39,72 @@ type glm4MoeLiteModel struct {
ExpertWeightsScale float32 `json:"routed_scaling_factor"`
LeadingDenseBlockCount uint32 `json:"first_k_dense_replace"`
ExpertGroupCount uint32 `json:"n_group"`
ExpertGroupUsedCount uint32 `json:"topk_group"`
}
func (p *glm4MoeLiteModel) KV(t *Tokenizer) KV {
kv := p.ModelParameters.KV(t)
kv["general.architecture"] = "glm4moelite"
kv["general.architecture"] = "deepseek2"
kv["general.type"] = "model"
kv["glm4moelite.block_count"] = p.HiddenLayers
kv["deepseek2.block_count"] = p.HiddenLayers
numHeads := p.NumAttentionHeads
numKVHeads := p.NumKeyValueHeads
kv["glm4moelite.attention.head_count"] = numHeads
kv["glm4moelite.attention.head_count_kv"] = numKVHeads
kv["glm4moelite.attention.key_length"] = p.QKNopeHeadDim + p.QKRopeHeadDim
kv["glm4moelite.attention.kv_lora_rank"] = p.KVLoraRank
kv["glm4moelite.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
kv["glm4moelite.attention.q_lora_rank"] = p.QLoraRank
kv["glm4moelite.attention.value_length"] = p.VHeadDim
kv["glm4moelite.context_length"] = p.MaxPositionEmbeddings
kv["glm4moelite.embedding_length"] = p.HiddenSize
kv["glm4moelite.expert_count"] = p.ExpertCount
kv["glm4moelite.expert_feed_forward_length"] = p.ExpertIntermediateSize
kv["glm4moelite.expert_shared_count"] = p.ExpertSharedCount
kv["deepseek2.attention.head_count"] = numHeads
kv["deepseek2.attention.head_count_kv"] = uint32(1)
kv["deepseek2.attention.key_length"] = p.KVLoraRank + p.QKRopeHeadDim
kv["deepseek2.attention.kv_lora_rank"] = p.KVLoraRank
kv["deepseek2.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
kv["deepseek2.attention.q_lora_rank"] = p.QLoraRank
kv["deepseek2.attention.value_length"] = p.KVLoraRank
kv["deepseek2.context_length"] = p.MaxPositionEmbeddings
kv["deepseek2.embedding_length"] = p.HiddenSize
kv["deepseek2.expert_count"] = p.ExpertCount
kv["deepseek2.expert_feed_forward_length"] = p.ExpertIntermediateSize
kv["deepseek2.expert_shared_count"] = p.ExpertSharedCount
kv["glm4moelite.expert_gating_func"] = uint32(2)
kv["glm4moelite.expert_used_count"] = p.ExpertUsedCount
kv["glm4moelite.expert_weights_norm"] = p.ExpertWeightsNorm
kv["glm4moelite.expert_weights_scale"] = p.ExpertWeightsScale
kv["glm4moelite.feed_forward_length"] = p.IntermediateSize
kv["glm4moelite.leading_dense_block_count"] = p.LeadingDenseBlockCount
kv["deepseek2.expert_gating_func"] = uint32(2)
kv["deepseek2.expert_group_count"] = cmp.Or(p.ExpertGroupCount, uint32(1))
kv["deepseek2.expert_group_used_count"] = cmp.Or(p.ExpertGroupUsedCount, uint32(1))
kv["deepseek2.expert_used_count"] = p.ExpertUsedCount
kv["deepseek2.expert_weights_norm"] = p.ExpertWeightsNorm
kv["deepseek2.expert_weights_scale"] = p.ExpertWeightsScale
kv["deepseek2.feed_forward_length"] = p.IntermediateSize
kv["deepseek2.leading_dense_block_count"] = p.LeadingDenseBlockCount
kv["glm4moelite.rope.dimension_count"] = p.QKRopeHeadDim
kv["glm4moelite.rope.freq_base"] = cmp.Or(p.RopeTheta, float32(1000000.0))
kv["deepseek2.rope.dimension_count"] = p.QKRopeHeadDim
kv["deepseek2.rope.freq_base"] = cmp.Or(p.RopeTheta, float32(1000000.0))
kv["glm4moelite.attention.key_length_mla"] = p.KVLoraRank + p.QKRopeHeadDim
kv["glm4moelite.attention.value_length_mla"] = p.KVLoraRank
kv["deepseek2.attention.key_length_mla"] = p.QKNopeHeadDim + p.QKRopeHeadDim
kv["deepseek2.attention.value_length_mla"] = p.VHeadDim
kv["tokenizer.ggml.pre"] = "glm4"
setGLM4MoeLiteExtraEOGFromEOSIDs(kv)
return kv
}
func setGLM4MoeLiteExtraEOGFromEOSIDs(kv KV) {
switch ids := kv["tokenizer.ggml.eos_token_ids"].(type) {
case []int32:
if len(ids) >= 2 && ids[1] >= 0 {
kv["tokenizer.ggml.eot_token_id"] = uint32(ids[1])
}
if len(ids) >= 3 && ids[2] >= 0 {
kv["tokenizer.ggml.eom_token_id"] = uint32(ids[2])
}
case []uint32:
if len(ids) >= 2 {
kv["tokenizer.ggml.eot_token_id"] = ids[1]
}
if len(ids) >= 3 {
kv["tokenizer.ggml.eom_token_id"] = ids[2]
}
}
}
func (p *glm4MoeLiteModel) Replacements() []string {
return []string{
"lm_head", "output",


@@ -0,0 +1,68 @@
package convert
import "testing"
func TestGLM4MoeLiteKVUsesLlamaCppMetadata(t *testing.T) {
p := glm4MoeLiteModel{
ModelParameters: ModelParameters{VocabSize: 151552},
MaxPositionEmbeddings: 202752,
HiddenSize: 2048,
HiddenLayers: 47,
IntermediateSize: 10240,
NumAttentionHeads: 20,
NumKeyValueHeads: 20,
RMSNormEPS: 1e-5,
RopeTheta: 1000000,
QKNopeHeadDim: 128,
QKRopeHeadDim: 64,
KVLoraRank: 512,
QLoraRank: 768,
VHeadDim: 128,
ExpertCount: 64,
ExpertSharedCount: 1,
ExpertUsedCount: 4,
ExpertWeightsNorm: true,
ExpertWeightsScale: 1.8,
}
kv := p.KV(&Tokenizer{Vocabulary: &Vocabulary{Model: "gpt2", Tokens: []string{"a"}}})
if got := kv.Architecture(); got != "deepseek2" {
t.Fatalf("architecture = %q, want deepseek2", got)
}
for key, want := range map[string]uint32{
"attention.head_count": 20,
"attention.head_count_kv": 1,
"attention.key_length": 576,
"attention.value_length": 512,
"attention.key_length_mla": 192,
"attention.value_length_mla": 128,
"expert_group_count": 1,
"expert_group_used_count": 1,
"expert_gating_func": 2,
"rope.dimension_count": 64,
} {
if got := kv.Uint(key); got != want {
t.Errorf("%s = %d, want %d", key, got, want)
}
}
if got := kv.String("tokenizer.ggml.pre"); got != "glm4" {
t.Errorf("tokenizer.ggml.pre = %q, want glm4", got)
}
}
func TestGLM4MoeLiteKVPromotesExtraEOSIDs(t *testing.T) {
kv := KV{
"general.architecture": "deepseek2",
"tokenizer.ggml.eos_token_ids": []int32{151329, 151330, 151336},
}
setGLM4MoeLiteExtraEOGFromEOSIDs(kv)
if got := kv.Uint("tokenizer.ggml.eot_token_id"); got != 151330 {
t.Errorf("eot token = %d, want 151330", got)
}
if got := kv.Uint("tokenizer.ggml.eom_token_id"); got != 151336 {
t.Errorf("eom token = %d, want 151336", got)
}
}
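The expected values in `TestGLM4MoeLiteKVUsesLlamaCppMetadata` follow from simple sums over the attention dimensions: after the switch to `deepseek2` metadata, the plain key and value lengths describe the compressed latent, while the `_mla` keys carry the per-head sizes. The sketch below reproduces that arithmetic with the test's inputs. The `mlaDims` type and its method names are invented for illustration; only the formulas and numbers come from the diff.

```go
package main

import "fmt"

// mlaDims is an illustrative container for the four config values the
// converter combines. Field names are assumptions for this sketch.
type mlaDims struct {
	QKNope     uint32 // qk_nope_head_dim
	QKRope     uint32 // qk_rope_head_dim
	KVLoraRank uint32 // kv_lora_rank
	VHead      uint32 // v_head_dim
}

// keyLength is written to attention.key_length.
func (d mlaDims) keyLength() uint32 { return d.KVLoraRank + d.QKRope }

// valueLength is written to attention.value_length.
func (d mlaDims) valueLength() uint32 { return d.KVLoraRank }

// keyLengthMLA is written to attention.key_length_mla.
func (d mlaDims) keyLengthMLA() uint32 { return d.QKNope + d.QKRope }

// valueLengthMLA is written to attention.value_length_mla.
func (d mlaDims) valueLengthMLA() uint32 { return d.VHead }

func main() {
	// Same inputs as the test above.
	d := mlaDims{QKNope: 128, QKRope: 64, KVLoraRank: 512, VHead: 128}
	fmt.Println(d.keyLength(), d.valueLength(), d.keyLengthMLA(), d.valueLengthMLA())
}
```

With these inputs the four values are 512 + 64 = 576, 512, 128 + 64 = 192, and 128, matching the test's expectations.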


@@ -83,6 +83,7 @@ type glmOcrModel struct {
HiddenSize uint32 `json:"hidden_size"`
IntermediateSize uint32 `json:"intermediate_size"`
NumHiddenLayers uint32 `json:"num_hidden_layers"`
NumNextNPredict uint32 `json:"num_nextn_predict_layers"`
NumAttentionHeads uint32 `json:"num_attention_heads"`
NumKeyValueHeads uint32 `json:"num_key_value_heads"`
HeadDim uint32 `json:"head_dim"`
@@ -131,7 +132,7 @@ type glmOcrModel struct {
} `json:"-"`
}
var _ ModelConverter = (*glmOcrModel)(nil)
var _ MultimodalConverter = (*glmOcrModel)(nil)
func (m *glmOcrModel) parseMore(fsys fs.FS) error {
bts, err := fs.ReadFile(fsys, "preprocessor_config.json")
@ -145,9 +146,14 @@ func (m *glmOcrModel) parseMore(fsys fs.FS) error {
func (m *glmOcrModel) KV(t *Tokenizer) KV {
kv := m.ModelParameters.KV(t)
kv["general.architecture"] = "glmocr"
applyGlmOcrTokenizerKV(kv, t)
// Text model parameters
kv["glmocr.block_count"] = cmp.Or(m.TextConfig.NumHiddenLayers, 16)
numHiddenLayers := cmp.Or(m.TextConfig.NumHiddenLayers, 16)
kv["glmocr.block_count"] = numHiddenLayers + m.TextConfig.NumNextNPredict
if m.TextConfig.NumNextNPredict > 0 {
kv["glmocr.nextn_predict_layers"] = m.TextConfig.NumNextNPredict
}
kv["glmocr.embedding_length"] = cmp.Or(m.TextConfig.HiddenSize, 1536)
kv["glmocr.attention.head_count"] = cmp.Or(m.TextConfig.NumAttentionHeads, 16)
kv["glmocr.attention.head_count_kv"] = cmp.Or(m.TextConfig.NumKeyValueHeads, 8)
@ -175,8 +181,6 @@ func (m *glmOcrModel) KV(t *Tokenizer) KV {
kv["glmocr.vision.intermediate_size"] = cmp.Or(m.VisionConfig.IntermediateSize, 4096)
kv["glmocr.vision.attention.layer_norm_rms_epsilon"] = cmp.Or(m.VisionConfig.RMSNormEps, 1e-5)
// Preprocessor-derived image settings (min/max pixels and normalization)
// Note: fs.Config.keyValue() auto-prepends architecture prefix, so use full key
if m.Preprocessor.Size.ShortestEdge > 0 {
kv["glmocr.vision.min_pixels"] = m.Preprocessor.Size.ShortestEdge
}
@ -190,7 +194,6 @@ func (m *glmOcrModel) KV(t *Tokenizer) KV {
kv["glmocr.vision.image_std"] = m.Preprocessor.ImageStd
}
// Special tokens
kv["glmocr.image_token_id"] = m.ImageTokenID
kv["glmocr.image_start_token_id"] = m.ImageStartTokenID
kv["glmocr.image_end_token_id"] = m.ImageEndTokenID
@ -201,32 +204,249 @@ func (m *glmOcrModel) KV(t *Tokenizer) KV {
return kv
}
func applyGlmOcrTokenizerKV(kv KV, t *Tokenizer) {
kv["tokenizer.ggml.pre"] = "chatglm-bpe"
if id, ok := glmOcrTokenID(t, "<|endoftext|>"); ok {
kv["tokenizer.ggml.bos_token_id"] = uint32(id)
kv["tokenizer.ggml.unknown_token_id"] = uint32(id)
}
if id, ok := glmOcrTokenID(t, "<|user|>"); ok {
kv["tokenizer.ggml.eot_token_id"] = uint32(id)
}
}
func (m *glmOcrModel) TextKV(t *Tokenizer) KV {
kv := m.ModelParameters.KV(t)
kv["general.architecture"] = "glm4"
applyGlmOcrTokenizerKV(kv, t)
numHiddenLayers := cmp.Or(m.TextConfig.NumHiddenLayers, 16)
kv["block_count"] = numHiddenLayers + m.TextConfig.NumNextNPredict
if m.TextConfig.NumNextNPredict > 0 {
kv["nextn_predict_layers"] = m.TextConfig.NumNextNPredict
}
kv["embedding_length"] = cmp.Or(m.TextConfig.HiddenSize, 1536)
kv["attention.head_count"] = cmp.Or(m.TextConfig.NumAttentionHeads, 16)
kv["attention.head_count_kv"] = cmp.Or(m.TextConfig.NumKeyValueHeads, 8)
headDim := cmp.Or(m.TextConfig.HeadDim, m.TextConfig.HiddenSize/m.TextConfig.NumAttentionHeads)
kv["attention.key_length"] = headDim
kv["attention.value_length"] = headDim
kv["feed_forward_length"] = cmp.Or(m.TextConfig.IntermediateSize, 4608)
kv["attention.layer_norm_rms_epsilon"] = cmp.Or(m.TextConfig.RMSNormEps, 1e-5)
kv["context_length"] = cmp.Or(m.TextConfig.MaxPositionEmbed, 131072)
kv["rope.freq_base"] = cmp.Or(m.TextConfig.RopeParameters.RopeTheta, float32(10000))
partialRotaryFactor := cmp.Or(m.TextConfig.RopeParameters.PartialRotaryFactor, m.TextConfig.PartialRotaryFactor, float32(1.0))
kv["rope.dimension_count"] = uint32(float32(headDim) * partialRotaryFactor)
if len(m.TextConfig.RopeParameters.MRopeSection) > 0 {
sections := append([]int32(nil), m.TextConfig.RopeParameters.MRopeSection...)
for len(sections) < 4 {
sections = append(sections, 0)
}
kv["rope.dimension_sections"] = sections
}
return kv
}
func (m *glmOcrModel) ProjectorKV(*Tokenizer) KV {
kv := KV{
"general.architecture": "clip",
"general.type": "mmproj",
"general.file_type": uint32(1),
"general.quantization_version": uint32(2),
"clip.has_vision_encoder": true,
"clip.projector_type": "glm4v",
"clip.use_silu": true,
"clip.vision.block_count": cmp.Or(m.VisionConfig.Depth, 24),
"clip.vision.embedding_length": cmp.Or(m.VisionConfig.HiddenSize, 1024),
"clip.vision.attention.head_count": cmp.Or(m.VisionConfig.NumHeads, 16),
"clip.vision.image_size": cmp.Or(m.VisionConfig.ImageSize, 336),
"clip.vision.patch_size": cmp.Or(m.VisionConfig.PatchSize, m.Preprocessor.PatchSize, 14),
"clip.vision.spatial_merge_size": cmp.Or(m.VisionConfig.SpatialMergeSize, m.Preprocessor.MergeSize, 2),
"clip.vision.temporal_patch_size": cmp.Or(m.VisionConfig.TemporalPatchSize, m.Preprocessor.TemporalPatchSize, 2),
"clip.vision.projection_dim": cmp.Or(m.VisionConfig.OutHiddenSize, 1536),
"clip.vision.out_hidden_size": cmp.Or(m.VisionConfig.OutHiddenSize, 1536),
"clip.vision.feed_forward_length": cmp.Or(m.VisionConfig.IntermediateSize, 4096),
"clip.vision.intermediate_size": cmp.Or(m.VisionConfig.IntermediateSize, 4096),
"clip.vision.attention.layer_norm_epsilon": cmp.Or(m.VisionConfig.RMSNormEps, 1e-5),
"clip.vision.image_token_id": m.ImageTokenID,
"clip.vision.image_start_token_id": m.ImageStartTokenID,
"clip.vision.image_end_token_id": m.ImageEndTokenID,
}
if m.Preprocessor.Size.ShortestEdge > 0 {
kv["clip.vision.min_pixels"] = m.Preprocessor.Size.ShortestEdge
}
if m.Preprocessor.Size.LongestEdge > 0 {
kv["clip.vision.max_pixels"] = m.Preprocessor.Size.LongestEdge
}
if len(m.Preprocessor.ImageMean) == 3 {
kv["clip.vision.image_mean"] = m.Preprocessor.ImageMean
}
if len(m.Preprocessor.ImageStd) == 3 {
kv["clip.vision.image_std"] = m.Preprocessor.ImageStd
}
return kv
}
func glmOcrTokenID(t *Tokenizer, token string) (int, bool) {
if t == nil || t.Vocabulary == nil {
return 0, false
}
for i, candidate := range t.Vocabulary.Tokens {
if candidate == token {
return i, true
}
}
return 0, false
}
func isGlmOcrVisionTensor(name string) bool {
return strings.HasPrefix(name, "v.") || strings.HasPrefix(name, "mm.")
}
func (m *glmOcrModel) TextTensors(ts []Tensor, t *Tokenizer) []*ggml.Tensor {
textOnly := make([]Tensor, 0, len(ts))
for _, tensor := range ts {
if !isGlmOcrVisionTensor(tensor.Name()) {
textOnly = append(textOnly, tensor)
}
}
return m.Tensors(textOnly)
}
func (m *glmOcrModel) ProjectorTensors(ts []Tensor) []*ggml.Tensor {
var out []*ggml.Tensor
for _, t := range ts {
if !isGlmOcrVisionTensor(t.Name()) {
continue
}
name := t.Name()
switch {
case strings.HasSuffix(name, "patch_embd_0.weight"):
name = strings.Replace(name, "patch_embd_0.weight", "patch_embd.weight", 1)
case strings.HasSuffix(name, "patch_embd_1.weight"):
name = strings.Replace(name, "patch_embd_1.weight", "patch_embd.weight.1", 1)
case strings.HasSuffix(name, "patch_embd.weight.0"):
name = strings.Replace(name, "patch_embd.weight.0", "patch_embd.weight", 1)
}
if strings.HasSuffix(name, "patch_embd.weight") {
shape := t.Shape()
if len(shape) == 5 && shape[2] == 2 {
newShape := []uint64{shape[0], shape[1], shape[3], shape[4]}
t0 := t.Clone()
t0.SetRepacker(func(_ string, data []float32, shape []uint64) ([]float32, error) {
dims := make([]int, len(shape))
for i := range shape {
dims[i] = int(shape[i])
}
var tt tensor.Tensor = tensor.New(tensor.WithShape(dims...), tensor.WithBacking(data))
tt, err := tt.Slice(nil, nil, tensor.S(0, 1), nil, nil)
if err != nil {
return nil, err
}
tt = tensor.Materialize(tt)
newDims := []int{int(shape[0]), int(shape[1]), int(shape[3]), int(shape[4])}
if err := tt.Reshape(newDims...); err != nil {
return nil, err
}
if err := tt.Reshape(tt.Shape().TotalSize()); err != nil {
return nil, err
}
return native.VectorF32(tt.(*tensor.Dense))
})
out = append(out, &ggml.Tensor{
Name: strings.Replace(name, "patch_embd.weight", "patch_embd.weight", 1),
Kind: t.Kind(),
Shape: newShape,
WriterTo: t0,
})
t1 := t.Clone()
t1.SetRepacker(func(_ string, data []float32, shape []uint64) ([]float32, error) {
dims := make([]int, len(shape))
for i := range shape {
dims[i] = int(shape[i])
}
var tt tensor.Tensor = tensor.New(tensor.WithShape(dims...), tensor.WithBacking(data))
tt, err := tt.Slice(nil, nil, tensor.S(1, 2), nil, nil)
if err != nil {
return nil, err
}
tt = tensor.Materialize(tt)
newDims := []int{int(shape[0]), int(shape[1]), int(shape[3]), int(shape[4])}
if err := tt.Reshape(newDims...); err != nil {
return nil, err
}
if err := tt.Reshape(tt.Shape().TotalSize()); err != nil {
return nil, err
}
return native.VectorF32(tt.(*tensor.Dense))
})
out = append(out, &ggml.Tensor{
Name: strings.Replace(name, "patch_embd.weight", "patch_embd.weight.1", 1),
Kind: t.Kind(),
Shape: newShape,
WriterTo: t1,
})
continue
}
}
out = append(out, &ggml.Tensor{
Name: name,
Kind: t.Kind(),
Shape: t.Shape(),
WriterTo: t,
})
}
return out
}
func (m *glmOcrModel) Tensors(ts []Tensor) []*ggml.Tensor {
var out []*ggml.Tensor
// Skip layers >= num_hidden_layers (Multi-Token Prediction layers not needed for basic inference)
numLayers := int(cmp.Or(m.TextConfig.NumHiddenLayers, 16))
skipLayer := func(name string) bool {
// Tensor names are already replaced to "blk.N.xxx" format
re := regexp.MustCompile(`^blk\.(\d+)`)
matches := re.FindStringSubmatch(name)
maxLayers := numLayers + int(m.TextConfig.NumNextNPredict)
layerRe := regexp.MustCompile(`^blk\.(\d+)`)
layerIndex := func(name string) (int, bool) {
matches := layerRe.FindStringSubmatch(name)
if matches == nil {
return false
return 0, false
}
blkNum, err := strconv.Atoi(matches[1])
if err != nil {
return false
return 0, false
}
return blkNum >= numLayers
return blkNum, true
}
for _, t := range ts {
name := t.Name()
// Skip next-n prediction layers (layers >= num_hidden_layers)
if skipLayer(name) {
blkNum, hasLayer := layerIndex(name)
if hasLayer && blkNum >= maxLayers {
continue
}
if hasLayer && blkNum >= numLayers {
switch {
case strings.HasSuffix(name, ".embed_tokens.weight"):
name = strings.Replace(name, ".embed_tokens.weight", ".nextn.embed_tokens.weight", 1)
case strings.HasSuffix(name, ".eh_proj.weight"):
name = strings.Replace(name, ".eh_proj.weight", ".nextn.eh_proj.weight", 1)
case strings.HasSuffix(name, ".enorm.weight"):
name = strings.Replace(name, ".enorm.weight", ".nextn.enorm.weight", 1)
case strings.HasSuffix(name, ".hnorm.weight"):
name = strings.Replace(name, ".hnorm.weight", ".nextn.hnorm.weight", 1)
case strings.HasSuffix(name, ".shared_head.head.weight"):
name = strings.Replace(name, ".shared_head.head.weight", ".nextn.shared_head_head.weight", 1)
case strings.HasSuffix(name, ".shared_head.norm.weight"):
name = strings.Replace(name, ".shared_head.norm.weight", ".nextn.shared_head_norm.weight", 1)
}
}
// Split ffn_gate_up into separate gate and up projections
if strings.Contains(name, "ffn_gate_up") {
@ -440,16 +660,16 @@ func (m *glmOcrModel) Replacements() []string {
"self_attn.q_proj", "attn_q",
"self_attn.k_proj", "attn_k",
"self_attn.v_proj", "attn_v",
"self_attn.o_proj", "attn_out",
"self_attn.o_proj", "attn_output",
// Language model norms
"input_layernorm", "attn_norm",
"post_attention_layernorm", "ffn_norm",
"post_self_attn_layernorm", "post_attn_norm",
"post_mlp_layernorm", "post_ffn_norm",
"post_self_attn_layernorm", "post_attention_norm",
"post_mlp_layernorm", "post_ffw_norm",
// Language model MLP (remove mlp. prefix so ffn_* names work)
"mlp.gate_up_proj", "ffn_gate_up",
// Language model MLP
"mlp.gate_up_proj", "ffn_up",
"mlp.down_proj", "ffn_down",
}
}
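The block-index handling in the `Tensors` hunk above can be sketched in isolation. This is a minimal illustration, not the converter's code: `routeTensor` is a hypothetical helper, it takes the layer counts as plain integers, and it covers only three of the renamed suffixes from the diff. Blocks at or beyond `numLayers + nextN` are dropped, and blocks in the next-n range get a `.nextn` infix.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// layerRe matches the block index at the start of an already-renamed tensor name.
var layerRe = regexp.MustCompile(`^blk\.(\d+)`)

// routeTensor is a hypothetical helper (not part of the converter). It returns
// the possibly renamed tensor name and whether the tensor should be kept.
func routeTensor(name string, numLayers, nextN int) (string, bool) {
	m := layerRe.FindStringSubmatch(name)
	if m == nil {
		return name, true // not a per-block tensor, keep unchanged
	}
	blk, err := strconv.Atoi(m[1])
	if err != nil {
		return name, true
	}
	switch {
	case blk >= numLayers+nextN:
		// Beyond the base and next-n prediction blocks: drop.
		return "", false
	case blk >= numLayers:
		// Next-n prediction block: insert ".nextn" before known suffixes.
		// Only a subset of the suffixes in the diff is shown here.
		for _, s := range []string{".eh_proj.weight", ".enorm.weight", ".hnorm.weight"} {
			if strings.HasSuffix(name, s) {
				return strings.TrimSuffix(name, s) + ".nextn" + s, true
			}
		}
	}
	return name, true
}

func main() {
	for _, n := range []string{"blk.3.attn_q.weight", "blk.16.eh_proj.weight", "blk.17.enorm.weight", "output.weight"} {
		out, keep := routeTensor(n, 16, 1)
		fmt.Printf("%s -> %q keep=%v\n", n, out, keep)
	}
}
```

With 16 base layers and one next-n layer, block 16 is the only index that receives the `.nextn` names, and block 17 onward is skipped.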


@ -30,7 +30,11 @@ type gptossModel struct {
RopeTheta float32 `json:"rope_theta"`
RopeScalingFactor float32 `json:"rope_scaling_factor"`
RopeScaling struct {
Factor float32 `json:"factor"`
Type string `json:"rope_type"`
Factor float32 `json:"factor"`
OriginalMaxPositionEmbeddings uint32 `json:"original_max_position_embeddings"`
BetaFast float32 `json:"beta_fast"`
BetaSlow float32 `json:"beta_slow"`
} `json:"rope_scaling"`
SlidingWindow uint32 `json:"sliding_window"`
}
@ -39,23 +43,32 @@ var _ ModelConverter = (*gptossModel)(nil)
func (m *gptossModel) KV(t *Tokenizer) KV {
kv := m.ModelParameters.KV(t)
kv["general.architecture"] = "gptoss"
kv["general.architecture"] = "gpt-oss"
kv["general.file_type"] = uint32(4)
kv["gptoss.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(m.RopeScalingFactor*float32(m.InitialContextLength)))
kv["gptoss.block_count"] = m.HiddenLayers
kv["gptoss.embedding_length"] = m.HiddenSize
kv["gptoss.feed_forward_length"] = m.IntermediateSize
kv["gptoss.expert_count"] = cmp.Or(m.Experts, m.LocalExperts)
kv["gptoss.expert_used_count"] = m.ExpertsPerToken
kv["gptoss.attention.head_count"] = m.AttentionHeads
kv["gptoss.attention.head_count_kv"] = m.KeyValueHeads
kv["gptoss.attention.key_length"] = m.HeadDim
kv["gptoss.attention.value_length"] = m.HeadDim
kv["gptoss.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEpsilon, 1e-5)
kv["gptoss.attention.sliding_window"] = m.SlidingWindow
kv["gptoss.rope.freq_base"] = m.RopeTheta
kv["gptoss.rope.scaling.factor"] = cmp.Or(m.RopeScalingFactor, m.RopeScaling.Factor)
kv["gptoss.rope.scaling.original_context_length"] = m.InitialContextLength
kv["gpt-oss.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(m.RopeScalingFactor*float32(m.InitialContextLength)))
kv["gpt-oss.block_count"] = m.HiddenLayers
kv["gpt-oss.embedding_length"] = m.HiddenSize
kv["gpt-oss.feed_forward_length"] = m.IntermediateSize
kv["gpt-oss.expert_feed_forward_length"] = m.IntermediateSize
kv["gpt-oss.expert_count"] = cmp.Or(m.Experts, m.LocalExperts)
kv["gpt-oss.expert_used_count"] = m.ExpertsPerToken
kv["gpt-oss.attention.head_count"] = m.AttentionHeads
kv["gpt-oss.attention.head_count_kv"] = m.KeyValueHeads
kv["gpt-oss.attention.key_length"] = m.HeadDim
kv["gpt-oss.attention.value_length"] = m.HeadDim
kv["gpt-oss.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEpsilon, 1e-5)
kv["gpt-oss.attention.sliding_window"] = m.SlidingWindow
kv["gpt-oss.rope.freq_base"] = m.RopeTheta
kv["gpt-oss.rope.scaling.type"] = cmp.Or(m.RopeScaling.Type, "yarn")
kv["gpt-oss.rope.scaling.factor"] = cmp.Or(m.RopeScalingFactor, m.RopeScaling.Factor)
kv["gpt-oss.rope.scaling.original_context_length"] = cmp.Or(m.RopeScaling.OriginalMaxPositionEmbeddings, m.InitialContextLength)
if m.RopeScaling.BetaFast != 0 {
kv["gpt-oss.rope.scaling.yarn_beta_fast"] = m.RopeScaling.BetaFast
}
if m.RopeScaling.BetaSlow != 0 {
kv["gpt-oss.rope.scaling.yarn_beta_slow"] = m.RopeScaling.BetaSlow
}
kv["tokenizer.ggml.pre"] = "gpt-4o"
kv["tokenizer.ggml.bos_token_id"] = uint32(199998) // <|startoftext|>
kv["tokenizer.ggml.add_bos_token"] = false
kv["tokenizer.ggml.eos_token_id"] = uint32(199999) // <|endoftext|>
@ -152,9 +165,9 @@ func (m *gptossModel) Replacements() []string {
"self_attn.q_proj", "attn_q",
"self_attn.k_proj", "attn_k",
"self_attn.v_proj", "attn_v",
"self_attn.o_proj", "attn_out",
"self_attn.sinks", "attn_sinks",
"post_attention_layernorm", "ffn_norm",
"self_attn.o_proj", "attn_output",
"self_attn.sinks", "attn_sinks.weight",
"post_attention_layernorm", "post_attention_norm",
"mlp.router", "ffn_gate_inp",
"mlp.experts.gate_up_proj_", "ffn_gate_up_exps.",
"mlp.experts.down_proj_", "ffn_down_exps.",
@ -169,9 +182,9 @@ func (m *gptossModel) Replacements() []string {
"block", "blk",
"attn.norm", "attn_norm",
"attn.qkv", "attn_qkv",
"attn.sinks", "attn_sinks",
"attn.out", "attn_out",
"mlp.norm", "ffn_norm",
"attn.sinks", "attn_sinks.weight",
"attn.out", "attn_output",
"mlp.norm", "post_attention_norm",
"mlp.gate", "ffn_gate_inp",
"mlp.mlp1_", "ffn_gate_up_exps.",
"mlp.mlp2_", "ffn_down_exps.",


@ -0,0 +1,73 @@
package convert
import (
"strings"
"testing"
)
func TestGptOssCreatesLlamaCppMetadataAndNames(t *testing.T) {
m := &gptossModel{
HiddenLayers: 24,
MaxPositionEmbeddings: 131072,
HiddenSize: 2880,
IntermediateSize: 2880,
AttentionHeads: 64,
KeyValueHeads: 8,
HeadDim: 64,
LocalExperts: 32,
ExpertsPerToken: 4,
RopeTheta: 150000,
InitialContextLength: 4096,
SlidingWindow: 128,
}
m.RopeScaling.Type = "yarn"
m.RopeScaling.Factor = 32
m.RopeScaling.OriginalMaxPositionEmbeddings = 4096
m.RopeScaling.BetaFast = 32
m.RopeScaling.BetaSlow = 1
kv := m.KV(&Tokenizer{Vocabulary: &Vocabulary{Model: "gpt2"}, Pre: "default"})
for k, want := range map[string]any{
"general.architecture": "gpt-oss",
"tokenizer.ggml.pre": "gpt-4o",
"gpt-oss.context_length": uint32(131072),
"gpt-oss.expert_feed_forward_length": uint32(2880),
"gpt-oss.rope.scaling.type": "yarn",
"gpt-oss.rope.scaling.factor": float32(32),
"gpt-oss.rope.scaling.original_context_length": uint32(4096),
"gpt-oss.rope.scaling.yarn_beta_fast": float32(32),
"gpt-oss.rope.scaling.yarn_beta_slow": float32(1),
} {
if got := kv[k]; got != want {
t.Fatalf("%s = %v (%T), want %v (%T)", k, got, got, want, want)
}
}
if _, ok := kv["gptoss.context_length"]; ok {
t.Fatal("unexpected Ollama-format gptoss metadata")
}
replacer := strings.NewReplacer(m.Replacements()...)
for name, want := range map[string]string{
"model.layers.0.self_attn.o_proj.weight": "blk.0.attn_output.weight",
"model.layers.0.self_attn.sinks": "blk.0.attn_sinks.weight",
"model.layers.0.post_attention_layernorm.weight": "blk.0.post_attention_norm.weight",
"model.layers.0.mlp.experts.gate_up_proj_blocks": "blk.0.ffn_gate_up_exps.blocks",
"model.layers.0.mlp.experts.down_proj_scales": "blk.0.ffn_down_exps.scales",
} {
if got := replacer.Replace(name); got != want {
t.Fatalf("Replace(%q) = %q, want %q", name, got, want)
}
}
m.MaxPositionEmbeddings = 0
replacer = strings.NewReplacer(m.Replacements()...)
for name, want := range map[string]string{
"block.0.attn.out.weight": "blk.0.attn_output.weight",
"block.0.attn.sinks": "blk.0.attn_sinks.weight",
"block.0.mlp.norm.weight": "blk.0.post_attention_norm.weight",
} {
if got := replacer.Replace(name); got != want {
t.Fatalf("Replace(%q) = %q, want %q", name, got, want)
}
}
}


@ -34,8 +34,6 @@ type llamaModel struct {
LowFrequencyFactor float32 `json:"low_freq_factor"`
HighFrequencyFactor float32 `json:"high_freq_factor"`
OriginalMaxPositionEmbeddings uint32 `json:"original_max_position_embeddings"`
factors ropeFactor
} `json:"rope_scaling"`
RMSNormEPS float32 `json:"rms_norm_eps"`
LayerNormEPS float32 `json:"layer_norm_eps"`
@ -83,27 +81,6 @@ func (p *llamaModel) KV(t *Tokenizer) KV {
if p.RopeScaling.Type == "linear" {
kv["llama.rope.scaling.type"] = p.RopeScaling.Type
kv["llama.rope.scaling.factor"] = p.RopeScaling.Factor
} else if p.RopeScaling.RopeType == "llama3" {
dim := p.HiddenSize / p.NumAttentionHeads
for i := uint32(0); i < dim; i += 2 {
factor := cmp.Or(p.RopeScaling.Factor, 8.0)
factorLow := cmp.Or(p.RopeScaling.LowFrequencyFactor, 1.0)
factorHigh := cmp.Or(p.RopeScaling.HighFrequencyFactor, 4.0)
original := cmp.Or(p.RopeScaling.OriginalMaxPositionEmbeddings, 8192)
lambdaLow := float32(original) / factorLow
lambdaHigh := float32(original) / factorHigh
lambda := 2 * math.Pi * math.Pow(float64(p.RopeTheta), float64(i)/float64(dim))
if lambda < float64(lambdaHigh) {
p.RopeScaling.factors = append(p.RopeScaling.factors, 1.0)
} else if lambda > float64(lambdaLow) {
p.RopeScaling.factors = append(p.RopeScaling.factors, factor)
} else {
smooth := (float32(original)/float32(lambda) - factorLow) / (factorHigh - factorLow)
p.RopeScaling.factors = append(p.RopeScaling.factors, 1.0/((1-smooth)/factor+smooth))
}
}
}
if p.NumKeyValueHeads > 0 {
@ -129,12 +106,12 @@ func (p *llamaModel) KV(t *Tokenizer) KV {
func (p *llamaModel) Tensors(ts []Tensor) []*ggml.Tensor {
var out []*ggml.Tensor
if p.RopeScaling.factors != nil {
if factors := p.ropeFactors(); factors != nil {
out = append(out, &ggml.Tensor{
Name: "rope_freqs.weight",
Kind: 0,
Shape: []uint64{uint64(len(p.RopeScaling.factors))},
WriterTo: p.RopeScaling.factors,
Shape: []uint64{uint64(len(factors))},
WriterTo: factors,
})
}
@ -157,6 +134,40 @@ func (p *llamaModel) Tensors(ts []Tensor) []*ggml.Tensor {
return out
}
func (p *llamaModel) ropeFactors() ropeFactor {
if p.RopeScaling.RopeType != "llama3" || p.HiddenSize == 0 || p.NumAttentionHeads == 0 || p.RopeTheta == 0 {
return nil
}
dim := p.HiddenSize / p.NumAttentionHeads
if dim == 0 {
return nil
}
factors := make(ropeFactor, 0, dim/2)
for i := uint32(0); i < dim; i += 2 {
factor := cmp.Or(p.RopeScaling.Factor, float32(8))
factorLow := cmp.Or(p.RopeScaling.LowFrequencyFactor, float32(1))
factorHigh := cmp.Or(p.RopeScaling.HighFrequencyFactor, float32(4))
original := cmp.Or(p.RopeScaling.OriginalMaxPositionEmbeddings, uint32(8192))
lambdaLow := float32(original) / factorLow
lambdaHigh := float32(original) / factorHigh
lambda := 2 * math.Pi * math.Pow(float64(p.RopeTheta), float64(i)/float64(dim))
if lambda < float64(lambdaHigh) {
factors = append(factors, 1)
} else if lambda > float64(lambdaLow) {
factors = append(factors, factor)
} else {
smooth := (float32(original)/float32(lambda) - factorLow) / (factorHigh - factorLow)
factors = append(factors, 1/((1-smooth)/factor+smooth))
}
}
return factors
}
func (p *llamaModel) Replacements() []string {
return []string{
"lm_head", "output",


@ -0,0 +1,34 @@
package convert
import "testing"
func TestLlama3RopeFactorsTensorDoesNotDependOnKVOrder(t *testing.T) {
m := &llamaModel{
HiddenSize: 2048,
NumAttentionHeads: 32,
RopeTheta: 500000,
}
m.RopeScaling.RopeType = "llama3"
m.RopeScaling.Factor = 32
m.RopeScaling.LowFrequencyFactor = 1
m.RopeScaling.HighFrequencyFactor = 4
m.RopeScaling.OriginalMaxPositionEmbeddings = 8192
tensors := m.Tensors(nil)
if len(tensors) != 1 {
t.Fatalf("expected rope tensor only, got %d tensors", len(tensors))
}
if tensors[0].Name != "rope_freqs.weight" {
t.Fatalf("expected rope_freqs.weight, got %q", tensors[0].Name)
}
if len(tensors[0].Shape) != 1 || tensors[0].Shape[0] != 32 {
t.Fatalf("expected rope tensor shape [32], got %v", tensors[0].Shape)
}
_ = m.KV(&Tokenizer{Vocabulary: &Vocabulary{}})
afterKV := m.Tensors(nil)
if len(afterKV) != 1 || afterKV[0].Name != "rope_freqs.weight" {
t.Fatalf("expected one rope tensor after KV call, got %#v", afterKV)
}
}
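The llama3 rope factor loop that this change moves into `ropeFactors()` can be reproduced as a standalone function, which makes it easy to see why the test expects a tensor of shape `[32]`. This is a sketch under assumptions: `llama3RopeFactors` is a hypothetical name, it takes the head dimension and scaling values as plain arguments rather than reading them from `llamaModel`, and it returns `[]float32` instead of the converter's `ropeFactor` type.

```go
package main

import (
	"fmt"
	"math"
)

// llama3RopeFactors is a hypothetical standalone version of the loop in the
// diff. It emits one scaling factor per pair of head dimensions.
func llama3RopeFactors(headDim uint32, theta, factor, lowFreq, highFreq float32, origCtx uint32) []float32 {
	lambdaLow := float64(float32(origCtx) / lowFreq)
	lambdaHigh := float64(float32(origCtx) / highFreq)

	out := make([]float32, 0, headDim/2)
	for i := uint32(0); i < headDim; i += 2 {
		// Wavelength of the i-th rotary frequency.
		lambda := 2 * math.Pi * math.Pow(float64(theta), float64(i)/float64(headDim))
		switch {
		case lambda < lambdaHigh:
			// Short wavelengths are left unscaled.
			out = append(out, 1)
		case lambda > lambdaLow:
			// Long wavelengths get the full scaling factor.
			out = append(out, factor)
		default:
			// In between, interpolate smoothly between the two regimes.
			smooth := (float32(origCtx)/float32(lambda) - lowFreq) / (highFreq - lowFreq)
			out = append(out, 1/((1-smooth)/factor+smooth))
		}
	}
	return out
}

func main() {
	// Same values as the test: hidden size 2048 / 32 heads gives headDim 64.
	f := llama3RopeFactors(64, 500000, 32, 1, 4, 8192)
	fmt.Println(len(f), f[0], f[len(f)-1])
}
```

With the test's values the head dimension is 64, so the loop runs 32 times. The first entry falls in the unscaled regime and the last in the fully scaled one.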


@ -79,20 +79,17 @@ func (p *mistral3Model) KV(t *Tokenizer) KV {
kv["mistral3.rope.freq_base"] = cmp.Or(p.TextModel.RopeTheta, p.TextModel.RopeParameters.RopeTheta)
kv["mistral3.rope.scaling.factor"] = p.TextModel.RopeParameters.Factor
kv["mistral3.rope.scaling.type"] = p.TextModel.RopeParameters.RopeType
kv["mistral3.rope.scaling.beta_fast"] = p.TextModel.RopeParameters.BetaFast
kv["mistral3.rope.scaling.beta_slow"] = p.TextModel.RopeParameters.BetaSlow
kv["mistral3.rope.scaling.yarn_beta_fast"] = p.TextModel.RopeParameters.BetaFast
kv["mistral3.rope.scaling.yarn_beta_slow"] = p.TextModel.RopeParameters.BetaSlow
if p.TextModel.RopeParameters.Mscale != nil {
kv["mistral3.rope.scaling.mscale"] = *p.TextModel.RopeParameters.Mscale
}
if p.TextModel.RopeParameters.MscaleAllDim != nil {
kv["mistral3.rope.scaling.mscale_all_dim"] = *p.TextModel.RopeParameters.MscaleAllDim
kv["mistral3.rope.scaling.yarn_log_multiplier"] = *p.TextModel.RopeParameters.MscaleAllDim
}
if p.TextModel.RopeParameters.OrigMaxPositionEmbeddings > 0 {
kv["mistral3.rope.scaling.original_context_length"] = p.TextModel.RopeParameters.OrigMaxPositionEmbeddings
}
if p.TextModel.RopeParameters.Llama4ScalingBeta != nil {
kv["mistral3.rope.scaling_beta"] = *p.TextModel.RopeParameters.Llama4ScalingBeta
kv["mistral3.attention.temperature_scale"] = *p.TextModel.RopeParameters.Llama4ScalingBeta
}
// Vision configuration


@ -58,24 +58,19 @@ func (p *mistral3CausalModel) KV(t *Tokenizer) KV {
kv["mistral3.rope.freq_base"] = cmp.Or(p.RopeTheta, p.RopeParameters.RopeTheta)
kv["mistral3.rope.scaling.factor"] = p.RopeParameters.Factor
kv["mistral3.rope.scaling.type"] = p.RopeParameters.RopeType
kv["mistral3.rope.scaling.beta_fast"] = p.RopeParameters.BetaFast
kv["mistral3.rope.scaling.beta_slow"] = p.RopeParameters.BetaSlow
if p.RopeParameters.Mscale != nil {
kv["mistral3.rope.scaling.mscale"] = *p.RopeParameters.Mscale
}
kv["mistral3.rope.scaling.yarn_beta_fast"] = p.RopeParameters.BetaFast
kv["mistral3.rope.scaling.yarn_beta_slow"] = p.RopeParameters.BetaSlow
if p.RopeParameters.MscaleAllDim != nil {
kv["mistral3.rope.scaling.mscale_all_dim"] = *p.RopeParameters.MscaleAllDim
kv["mistral3.rope.scaling.yarn_log_multiplier"] = *p.RopeParameters.MscaleAllDim
}
if p.RopeParameters.OrigMaxPositionEmbeddings > 0 {
kv["mistral3.rope.scaling.original_context_length"] = p.RopeParameters.OrigMaxPositionEmbeddings
kv["mistral3.rope.scaling_beta"] = *p.RopeParameters.Llama4ScalingBeta
}
if p.RopeParameters.Llama4ScalingBeta != nil {
kv["mistral3.rope.scaling_beta"] = *p.RopeParameters.Llama4ScalingBeta
kv["mistral3.attention.temperature_scale"] = *p.RopeParameters.Llama4ScalingBeta
}
return kv


@ -0,0 +1,70 @@
package convert
import "testing"
func TestMistral3KVUsesLlamaCppRopeScalingKeys(t *testing.T) {
mscale := float32(0.75)
mscaleAllDim := float32(0)
temperatureScale := float32(0.125)
multimodal := &mistral3Model{}
multimodal.TextModel.NumAttentionHeads = 1
multimodal.TextModel.HeadDim = 64
multimodal.TextModel.RopeParameters.BetaFast = 32
multimodal.TextModel.RopeParameters.BetaSlow = 1
multimodal.TextModel.RopeParameters.Mscale = &mscale
multimodal.TextModel.RopeParameters.MscaleAllDim = &mscaleAllDim
multimodal.TextModel.RopeParameters.Llama4ScalingBeta = &temperatureScale
causal := &mistral3CausalModel{NumAttentionHeads: 1, HeadDim: 64}
causal.RopeParameters.BetaFast = 32
causal.RopeParameters.BetaSlow = 1
causal.RopeParameters.Mscale = &mscale
causal.RopeParameters.MscaleAllDim = &mscaleAllDim
causal.RopeParameters.Llama4ScalingBeta = &temperatureScale
tests := []struct {
name string
kv KV
}{
{name: "multimodal", kv: multimodal.KV(mistralTestTokenizer())},
{name: "causal", kv: causal.KV(mistralTestTokenizer())},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
assertKVEquals(t, tt.kv, "mistral3.rope.scaling.yarn_beta_fast", float32(32))
assertKVEquals(t, tt.kv, "mistral3.rope.scaling.yarn_beta_slow", float32(1))
assertKVEquals(t, tt.kv, "mistral3.rope.scaling.yarn_log_multiplier", mscaleAllDim)
assertKVEquals(t, tt.kv, "mistral3.attention.temperature_scale", temperatureScale)
for _, key := range []string{
"mistral3.rope.scaling.beta_fast",
"mistral3.rope.scaling.beta_slow",
"mistral3.rope.scaling.mscale",
"mistral3.rope.scaling.mscale_all_dim",
"mistral3.rope.scaling_beta",
} {
if _, ok := tt.kv[key]; ok {
t.Fatalf("unexpected legacy key %q", key)
}
}
})
}
}
func mistralTestTokenizer() *Tokenizer {
return &Tokenizer{Vocabulary: &Vocabulary{}}
}
func assertKVEquals[T comparable](t *testing.T, kv KV, key string, want T) {
t.Helper()
got, ok := kv[key]
if !ok {
t.Fatalf("missing key %q", key)
}
if got != want {
t.Fatalf("%s = %v, want %v", key, got, want)
}
}


@ -131,8 +131,10 @@ type radioConfig struct {
} `json:"args"`
}
var _ ModelConverter = (*nemotronHModel)(nil)
var _ ModelConverter = (*nemotronHNanoVLModel)(nil)
var (
_ ModelConverter = (*nemotronHModel)(nil)
_ ModelConverter = (*nemotronHNanoVLModel)(nil)
)
func (n *nemotronHNanoVLModel) parseMore(fsys fs.FS) error {
if n.MaxSequenceLength > 0 {


@ -36,39 +36,39 @@ var _ ModelConverter = (*olmoModel)(nil)
func (p *olmoModel) KV(t *Tokenizer) KV {
kv := p.ModelParameters.KV(t)
kv["general.architecture"] = "olmo3"
kv["olmo3.block_count"] = p.NumHiddenLayers
kv["olmo3.context_length"] = p.MaxPositionEmbeddings
kv["olmo3.embedding_length"] = p.HiddenSize
kv["olmo3.feed_forward_length"] = p.IntermediateSize
kv["olmo3.attention.head_count"] = p.NumAttentionHeads
kv["olmo3.attention.head_count_kv"] = cmp.Or(p.NumKeyValueHeads, p.NumAttentionHeads)
kv["general.architecture"] = "olmo2"
kv["olmo2.block_count"] = p.NumHiddenLayers
kv["olmo2.context_length"] = p.MaxPositionEmbeddings
kv["olmo2.embedding_length"] = p.HiddenSize
kv["olmo2.feed_forward_length"] = p.IntermediateSize
kv["olmo2.attention.head_count"] = p.NumAttentionHeads
kv["olmo2.attention.head_count_kv"] = cmp.Or(p.NumKeyValueHeads, p.NumAttentionHeads)
if p.RopeTheta > 0 {
kv["olmo3.rope.freq_base"] = p.RopeTheta
kv["olmo2.rope.freq_base"] = p.RopeTheta
}
if p.RopeScaling != nil {
if p.RopeScaling.Factor > 0 {
kv["olmo3.rope.scaling.factor"] = p.RopeScaling.Factor
kv["olmo2.rope.scaling.factor"] = p.RopeScaling.Factor
}
if p.RopeScaling.OriginalMaxPositionEmbeds > 0 {
kv["olmo3.rope.scaling.original_context_length"] = p.RopeScaling.OriginalMaxPositionEmbeds
kv["olmo2.rope.scaling.original_context_length"] = p.RopeScaling.OriginalMaxPositionEmbeds
}
if p.RopeScaling.AttentionFactor > 0 {
kv["olmo3.rope.scaling.attn_factor"] = p.RopeScaling.AttentionFactor
kv["olmo2.rope.scaling.attn_factor"] = p.RopeScaling.AttentionFactor
}
if p.RopeScaling.RopeType != "" {
kv["olmo3.rope.scaling.type"] = p.RopeScaling.RopeType
kv["olmo2.rope.scaling.type"] = p.RopeScaling.RopeType
}
}
if p.RMSNormEPS > 0 {
kv["olmo3.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
kv["olmo2.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
}
if p.SlidingWindow > 0 {
kv["olmo3.attention.sliding_window"] = p.SlidingWindow
kv["olmo2.attention.sliding_window"] = p.SlidingWindow
}
if len(p.LayerTypes) > 0 {
@ -76,7 +76,7 @@ func (p *olmoModel) KV(t *Tokenizer) KV {
for i, layerType := range p.LayerTypes {
slidingPattern[i] = (layerType == "sliding_attention")
}
kv["olmo3.attention.sliding_window_pattern"] = slidingPattern
kv["olmo2.attention.sliding_window_pattern"] = slidingPattern
}
return kv


@ -1,15 +1,24 @@
package convert
import (
"bufio"
"bytes"
"encoding/binary"
"encoding/json"
"fmt"
"io"
"io/fs"
"maps"
"math"
"os"
"slices"
"strconv"
"strings"
"github.com/d4l3k/go-bfloat16"
"github.com/pdevine/tensor"
"github.com/pdevine/tensor/native"
"github.com/x448/float16"
"github.com/ollama/ollama/fs/ggml"
)
@ -32,6 +41,8 @@ type qwen3NextTextConfig struct {
MaxPositionEmbeddings uint32 `json:"max_position_embeddings"`
HiddenSize uint32 `json:"hidden_size"`
NumHiddenLayers uint32 `json:"num_hidden_layers"`
NumNextNPredictLayers uint32 `json:"num_nextn_predict_layers"`
MTPNumHiddenLayers uint32 `json:"mtp_num_hidden_layers"`
IntermediateSize uint32 `json:"intermediate_size"`
NumAttentionHeads uint32 `json:"num_attention_heads"`
NumKeyValueHeads uint32 `json:"num_key_value_heads"`
@ -66,8 +77,11 @@ type qwen3NextTextConfig struct {
type qwen3NextVisionConfig struct {
Depth uint32 `json:"depth"`
HiddenSize uint32 `json:"hidden_size"`
IntermediateSize uint32 `json:"intermediate_size"`
NumHeads uint32 `json:"num_heads"`
NumPositionEmbeddings uint32 `json:"num_position_embeddings"`
InChannels uint32 `json:"in_channels"`
OutHiddenSize uint32 `json:"out_hidden_size"`
PatchSize uint32 `json:"patch_size"`
SpatialMergeSize uint32 `json:"spatial_merge_size"`
RMSNormEps float32 `json:"layer_norm_epsilon"`
@ -96,12 +110,25 @@ type qwen3NextModel struct {
VisionEndTokenID uint32 `json:"vision_end_token_id"`
}
var _ ModelConverter = (*qwen3NextModel)(nil)
var (
_ ModelConverter = (*qwen3NextModel)(nil)
_ MultimodalConverter = (*qwen3NextModel)(nil)
)
func (q *qwen3NextModel) parseMore(fsys fs.FS) error {
if q.TextConfig != nil {
q.qwen3NextTextConfig = *q.TextConfig
}
if q.NumNextNPredictLayers == 0 {
q.NumNextNPredictLayers = q.MTPNumHiddenLayers
}
if q.NumNextNPredictLayers == 0 {
nextn, err := qwen3NextInferNextNPredictLayers(fsys)
if err != nil {
return err
}
q.NumNextNPredictLayers = nextn
}
if q.RopeTheta == 0 {
q.RopeTheta = q.RopeParameters.RopeTheta
@ -182,6 +209,150 @@ func (q *qwen3NextModel) parseMore(fsys fs.FS) error {
return nil
}
func qwen3NextInferNextNPredictLayers(fsys fs.FS) (uint32, error) {
paths, err := fs.Glob(fsys, "*.safetensors")
if err != nil {
return 0, err
}
maxLayer := -1
hasMTP := false
for _, p := range paths {
f, err := fsys.Open(p)
if err != nil {
return 0, err
}
var n int64
if err := binary.Read(f, binary.LittleEndian, &n); err != nil {
f.Close()
return 0, err
}
b := bytes.NewBuffer(make([]byte, 0, n))
if _, err = io.CopyN(b, f, n); err != nil {
f.Close()
return 0, err
}
f.Close()
var headers map[string]safetensorMetadata
if err := json.NewDecoder(b).Decode(&headers); err != nil {
return 0, err
}
for name, value := range headers {
if value.Type == "" || !strings.HasPrefix(name, "mtp.") {
continue
}
hasMTP = true
rest := strings.TrimPrefix(name, "mtp.layers.")
layer, suffix, ok := strings.Cut(rest, ".")
if !ok {
continue
}
n, err := strconv.Atoi(layer)
if err == nil && n > maxLayer && suffix != "" {
maxLayer = n
}
}
}
if maxLayer >= 0 {
return uint32(maxLayer + 1), nil
}
if hasMTP {
return 1, nil
}
return 0, nil
}
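The counting rule in qwen3NextInferNextNPredictLayers can be sketched in isolation. The version below is a minimal illustration only: it takes a plain list of tensor names instead of reading safetensors headers, the helper name inferNextN is hypothetical, and the tensor names in main are made up for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// inferNextN is a hypothetical stand-in for the header scan in the diff.
// It returns the highest "mtp.layers.<i>." index plus one, 1 if only
// shared "mtp." tensors are present, and 0 if there are no MTP tensors.
func inferNextN(names []string) uint32 {
	maxLayer := -1
	hasMTP := false
	for _, name := range names {
		if !strings.HasPrefix(name, "mtp.") {
			continue
		}
		hasMTP = true
		rest, found := strings.CutPrefix(name, "mtp.layers.")
		if !found {
			continue // shared MTP tensor such as mtp.fc.weight
		}
		layer, suffix, ok := strings.Cut(rest, ".")
		if !ok || suffix == "" {
			continue
		}
		if n, err := strconv.Atoi(layer); err == nil && n > maxLayer {
			maxLayer = n
		}
	}
	if maxLayer >= 0 {
		return uint32(maxLayer + 1)
	}
	if hasMTP {
		return 1
	}
	return 0
}

func main() {
	// Illustrative names only; real checkpoints may differ.
	names := []string{
		"mtp.fc.weight",
		"mtp.layers.0.self_attn.q_proj.weight",
		"mtp.layers.1.mlp.up_proj.weight",
		"model.embed_tokens.weight",
	}
	fmt.Println(inferNextN(names))
}
```

The fallback to 1 covers a checkpoint that ships only the shared MTP tensors without any numbered MTP layers.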
func ConvertQwen35MTPDraft(fsys fs.FS, f *os.File, baseKV ggml.KV, baseTensors []*ggml.Tensor) error {
arch := baseKV.Architecture()
if arch != "qwen35" && arch != "qwen35moe" {
return fmt.Errorf("MTP draft safetensors require a qwen3.5 base model, got %q", arch)
}
baseBlocks := baseKV.Uint("block_count")
if baseBlocks == 0 {
return fmt.Errorf("MTP draft safetensors require a base model with block_count")
}
if baseKV.Uint("nextn_predict_layers") > 0 {
return fmt.Errorf("MTP draft safetensors require a base model without embedded MTP layers")
}
nextn, err := qwen3NextInferNextNPredictLayers(fsys)
if err != nil {
return err
}
if nextn == 0 {
return fmt.Errorf("MTP draft safetensors did not contain mtp tensors")
}
q := &qwen3NextModel{
qwen3NextTextConfig: qwen3NextTextConfig{
NumHiddenLayers: baseBlocks,
NumNextNPredictLayers: nextn,
},
}
ts, err := parseTensors(fsys, strings.NewReplacer(q.Replacements()...))
if err != nil {
return err
}
if err := ensureUniqueTensorNames(ts); err != nil {
return err
}
mtpTensors := q.Tensors(ts)
if len(mtpTensors) == 0 {
return fmt.Errorf("MTP draft safetensors did not produce GGUF tensors")
}
for _, tensor := range mtpTensors {
if !qwen35MTPDraftTensorName(tensor.Name, baseBlocks, nextn) {
return fmt.Errorf("MTP draft safetensors produced unexpected tensor %q", tensor.Name)
}
tensor.Shape = slices.Clone(tensor.Shape)
slices.Reverse(tensor.Shape)
}
kv := maps.Clone(baseKV)
qwen35RemoveSplitMetadata(kv, arch)
kv[arch+".block_count"] = baseBlocks + nextn
kv[arch+".nextn_predict_layers"] = nextn
tensors := make([]*ggml.Tensor, 0, len(baseTensors)+len(mtpTensors))
tensors = append(tensors, baseTensors...)
tensors = append(tensors, mtpTensors...)
var parameters uint64
for _, tensor := range tensors {
parameters += tensor.Elements()
}
kv["general.parameter_count"] = parameters
return ggml.WriteGGUF(f, kv, tensors)
}
func qwen35RemoveSplitMetadata(kv ggml.KV, arch string) {
for _, key := range []string{
"split.no",
"split.count",
"split.tensors.count",
} {
delete(kv, key)
delete(kv, arch+"."+key)
}
}
func qwen35MTPDraftTensorName(name string, base, nextn uint32) bool {
for i := range nextn {
if strings.HasPrefix(name, fmt.Sprintf("blk.%d.", base+i)) {
return true
}
}
return false
}
func (q *qwen3NextModel) kvHeadCounts() ([]uint32, error) {
if len(q.LayerTypes) > 0 {
kv := make([]uint32, q.NumHiddenLayers)
@ -259,7 +430,10 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
}
kv["general.architecture"] = arch
kv["tokenizer.ggml.pre"] = "qwen35"
kv["block_count"] = q.NumHiddenLayers
kv["block_count"] = q.NumHiddenLayers + q.NumNextNPredictLayers
if q.NumNextNPredictLayers > 0 {
kv["nextn_predict_layers"] = q.NumNextNPredictLayers
}
kv["context_length"] = q.MaxPositionEmbeddings
kv["embedding_length"] = q.HiddenSize
kv["feed_forward_length"] = q.IntermediateSize
@ -282,7 +456,11 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
if sections := q.ropeSections(); len(sections) > 0 {
kv["mrope_sections"] = sections
kv["rope.mrope_section"] = sections
kv["rope.dimension_sections"] = sections
dimensionSections := append([]int32(nil), sections...)
if len(dimensionSections) == 3 {
dimensionSections = append(dimensionSections, 0)
}
kv["rope.dimension_sections"] = dimensionSections
}
if q.RopeParameters.MRopeInterleaved {
kv["rope.mrope_interleaved"] = true
@ -321,12 +499,21 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
}
if headCounts, err := q.kvHeadCounts(); err == nil {
kv["attention.head_count_kv"] = headCounts
var maxKV uint32
for _, count := range headCounts {
if count > maxKV {
maxKV = count
}
}
kv["attention.head_count_kv"] = maxKV
}
if q.VisionModel.Depth > 0 {
kv["vision.block_count"] = q.VisionModel.Depth
kv["vision.embedding_length"] = q.VisionModel.HiddenSize
if q.VisionModel.IntermediateSize > 0 {
kv["vision.feed_forward_length"] = q.VisionModel.IntermediateSize
}
kv["vision.attention.head_count"] = q.VisionModel.NumHeads
kv["vision.num_channels"] = q.VisionModel.InChannels
if q.VisionModel.PatchSize > 0 {
@ -372,6 +559,378 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
return kv
}
func (q *qwen3NextModel) TextKV(t *Tokenizer) KV {
kv := q.KV(t)
for _, key := range []string{
"vision.block_count",
"vision.embedding_length",
"vision.feed_forward_length",
"vision.attention.head_count",
"vision.num_channels",
"vision.patch_size",
"vision.spatial_merge_size",
"vision.attention.layer_norm_epsilon",
"vision.rope.freq_base",
"vision.temporal_patch_size",
"vision.deepstack_visual_indexes",
"vision.shortest_edge",
"vision.longest_edge",
"vision.image_mean",
"vision.image_std",
"image_token_id",
"vision_start_token_id",
"vision_end_token_id",
"mrope_sections",
"rope.mrope_section",
"rope.mrope_interleaved",
"ssm.v_head_reordered",
} {
delete(kv, key)
}
return kv
}
func (q *qwen3NextModel) ProjectorKV(*Tokenizer) KV {
depth := q.VisionModel.Depth
deepstack := make([]bool, depth)
for _, idx := range q.VisionModel.DeepstackVisualIndexes {
if idx >= 0 && uint32(idx) < depth {
deepstack[idx] = true
}
}
imageSize := uint32(768)
if q.VisionModel.NumPositionEmbeddings > 0 && q.VisionModel.PatchSize > 0 {
root := uint32(math.Sqrt(float64(q.VisionModel.NumPositionEmbeddings)))
if root*root == q.VisionModel.NumPositionEmbeddings {
imageSize = root * q.VisionModel.PatchSize
}
}
projectionDim := q.VisionModel.OutHiddenSize
if projectionDim == 0 {
projectionDim = q.HiddenSize
}
layerNormEps := q.VisionModel.RMSNormEps
if layerNormEps == 0 {
layerNormEps = 1e-6
}
kv := KV{
"general.architecture": "clip",
"general.type": "mmproj",
"general.file_type": uint32(1),
"general.quantization_version": uint32(2),
"clip.has_vision_encoder": true,
"clip.projector_type": "qwen3vl_merger",
"clip.use_gelu": true,
"clip.vision.block_count": depth,
"clip.vision.embedding_length": q.VisionModel.HiddenSize,
"clip.vision.feed_forward_length": q.VisionModel.IntermediateSize,
"clip.vision.attention.head_count": q.VisionModel.NumHeads,
"clip.vision.image_size": imageSize,
"clip.vision.patch_size": q.VisionModel.PatchSize,
"clip.vision.projection_dim": projectionDim,
"clip.vision.spatial_merge_size": q.VisionModel.SpatialMergeSize,
"clip.vision.attention.layer_norm_epsilon": layerNormEps,
"clip.vision.is_deepstack_layers": deepstack,
}
if len(q.VisionModel.ImageMean) > 0 {
kv["clip.vision.image_mean"] = q.VisionModel.ImageMean
}
if len(q.VisionModel.ImageStd) > 0 {
kv["clip.vision.image_std"] = q.VisionModel.ImageStd
}
return kv
}
func (q *qwen3NextModel) TextTensors(ts []Tensor, _ *Tokenizer) []*ggml.Tensor {
var text []Tensor
for _, t := range ts {
if qwen3NextVisionTensor(t.Name()) {
continue
}
text = append(text, t)
}
return q.Tensors(text)
}
func (q *qwen3NextModel) ProjectorTensors(ts []Tensor) []*ggml.Tensor {
if q.VisionModel.Depth == 0 {
return nil
}
rename := strings.NewReplacer(
"v.pos_embed", "v.position_embd",
"v.patch_embed", "v.patch_embd",
"v.merger.norm", "v.post_ln",
"v.merger.linear_fc1", "mm.0",
"v.merger.linear_fc2", "mm.2",
".mlp.linear_fc1", ".ffn_up",
".mlp.linear_fc2", ".ffn_down",
".norm1", ".ln1",
".norm2", ".ln2",
)
var out []*ggml.Tensor
for _, t := range ts {
name := t.Name()
if !qwen3NextVisionTensor(name) {
continue
}
if name == "v.patch_embed.weight" {
out = append(out, q.qwen35PatchEmbedTensors(t)...)
continue
}
outName := rename.Replace(name)
kind := t.Kind()
writer := io.WriterTo(t)
if outName == "v.position_embd.weight" {
kind = tensorKindFP32
writer = tensorFloat32Writer{tensor: t}
} else if sourceDType(t) == "BF16" && kind == tensorKindFP16 {
kind = tensorKindBF16
writer = tensorBF16Writer{tensor: t}
}
out = append(out, &ggml.Tensor{
Name: outName,
Kind: kind,
Shape: slices.Clone(t.Shape()),
WriterTo: writer,
})
}
return out
}
func qwen3NextVisionTensor(name string) bool {
return strings.HasPrefix(name, "v.")
}
func (q *qwen3NextModel) qwen35PatchEmbedTensors(t Tensor) []*ggml.Tensor {
shape := t.Shape()
if len(shape) != 5 || shape[2] != 2 {
return nil
}
outShape := []uint64{shape[0], shape[1], shape[3], shape[4]}
return []*ggml.Tensor{
{
Name: "v.patch_embd.weight",
Kind: tensorKindFP32,
Shape: slices.Clone(outShape),
WriterTo: tensorFloat32Writer{tensor: t, repacker: q.qwen35PatchEmbedSlice(0)},
},
{
Name: "v.patch_embd.weight.1",
Kind: tensorKindFP32,
Shape: slices.Clone(outShape),
WriterTo: tensorFloat32Writer{tensor: t, repacker: q.qwen35PatchEmbedSlice(1)},
},
}
}
func (q *qwen3NextModel) qwen35PatchEmbedSlice(slice int) Repacker {
return func(_ string, data []float32, shape []uint64) ([]float32, error) {
if len(shape) != 5 || shape[2] != 2 {
return nil, fmt.Errorf("qwen3next: unexpected patch_embed shape %v", shape)
}
outChannels := int(shape[0])
inChannels := int(shape[1])
frames := int(shape[2])
height := int(shape[3])
width := int(shape[4])
if slice < 0 || slice >= frames {
return nil, fmt.Errorf("qwen3next: patch_embed slice %d out of range", slice)
}
expected := outChannels * inChannels * frames * height * width
if len(data) != expected {
return nil, fmt.Errorf("qwen3next: patch_embed data size %d, expected %d", len(data), expected)
}
out := make([]float32, outChannels*inChannels*height*width)
for oc := range outChannels {
for ic := range inChannels {
for y := range height {
for x := range width {
src := ((((oc*inChannels+ic)*frames+slice)*height + y) * width) + x
dst := (((oc*inChannels+ic)*height + y) * width) + x
out[dst] = data[src]
}
}
}
}
return out, nil
}
}
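The index arithmetic in qwen35PatchEmbedSlice extracts one temporal frame from a row-major [out, in, frames, height, width] tensor. A minimal standalone sketch follows, assuming the same layout; sliceFrame is a hypothetical helper, not part of the converter, and it copies whole height*width runs instead of single elements. The input in main matches the shape and data used by TestQwen35ProjectorTensors.

```go
package main

import "fmt"

// sliceFrame copies one temporal frame out of a row-major
// [out, in, frames, height, width] tensor. Hypothetical helper
// written for illustration only.
func sliceFrame(data []float32, shape [5]int, frame int) ([]float32, error) {
	oc, ic, frames, h, w := shape[0], shape[1], shape[2], shape[3], shape[4]
	if frame < 0 || frame >= frames {
		return nil, fmt.Errorf("frame %d out of range", frame)
	}
	if len(data) != oc*ic*frames*h*w {
		return nil, fmt.Errorf("data size %d does not match shape %v", len(data), shape)
	}
	out := make([]float32, 0, oc*ic*h*w)
	for o := 0; o < oc; o++ {
		for i := 0; i < ic; i++ {
			// Each (o, i, frame) triple owns one contiguous h*w run.
			start := ((o*ic+i)*frames + frame) * h * w
			out = append(out, data[start:start+h*w]...)
		}
	}
	return out, nil
}

func main() {
	data := make([]float32, 16)
	for i := range data {
		data[i] = float32(i)
	}
	shape := [5]int{2, 2, 2, 1, 2}
	first, _ := sliceFrame(data, shape, 0)
	second, _ := sliceFrame(data, shape, 1)
	fmt.Println(first)  // [0 1 4 5 8 9 12 13]
	fmt.Println(second) // [2 3 6 7 10 11 14 15]
}
```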
type tensorBF16Writer struct {
tensor Tensor
repacker Repacker
}
func (w tensorBF16Writer) WriteTo(dst io.Writer) (int64, error) {
data, err := tensorFloat32Data(w.tensor)
if err != nil {
return 0, err
}
if w.repacker != nil {
data, err = w.repacker(w.tensor.Name(), data, w.tensor.Shape())
if err != nil {
return 0, err
}
}
u8s := bfloat16.EncodeFloat32(data)
if _, err := dst.Write(u8s); err != nil {
return 0, err
}
return int64(len(u8s)), nil
}
type tensorFloat32Writer struct {
tensor Tensor
repacker Repacker
}
func (w tensorFloat32Writer) WriteTo(dst io.Writer) (int64, error) {
data, err := tensorFloat32Data(w.tensor)
if err != nil {
return 0, err
}
if w.repacker != nil {
data, err = w.repacker(w.tensor.Name(), data, w.tensor.Shape())
if err != nil {
return 0, err
}
}
if err := binary.Write(dst, binary.LittleEndian, data); err != nil {
return 0, err
}
return int64(len(data) * 4), nil
}
func tensorFloat32Data(t Tensor) ([]float32, error) {
if st, ok := tensorSafetensor(t); ok {
return safetensorFloat32Data(st)
}
var buf bytes.Buffer
if _, err := t.WriteTo(&buf); err != nil {
return nil, err
}
switch t.Kind() {
case tensorKindFP32:
out := make([]float32, buf.Len()/4)
if err := binary.Read(bytes.NewReader(buf.Bytes()), binary.LittleEndian, out); err != nil {
return nil, err
}
return out, nil
case tensorKindFP16:
raw := make([]uint16, buf.Len()/2)
if err := binary.Read(bytes.NewReader(buf.Bytes()), binary.LittleEndian, raw); err != nil {
return nil, err
}
out := make([]float32, len(raw))
for i, v := range raw {
out[i] = float16.Frombits(v).Float32()
}
return out, nil
case tensorKindBF16:
return bfloat16.DecodeFloat32(buf.Bytes()), nil
default:
return nil, fmt.Errorf("unsupported tensor kind %d for F32 writer", t.Kind())
}
}
func tensorSafetensor(t Tensor) (safetensor, bool) {
switch t := t.(type) {
case safetensor:
return t, true
case *safetensor:
return *t, true
default:
return safetensor{}, false
}
}
func safetensorFloat32Data(st safetensor) ([]float32, error) {
f, err := st.fs.Open(st.path)
if err != nil {
return nil, err
}
defer f.Close()
var r io.Reader
if readerAt, ok := f.(io.ReaderAt); ok {
r = io.NewSectionReader(readerAt, st.offset, st.size)
} else if seeker, ok := f.(io.Seeker); ok {
if _, err := seeker.Seek(st.offset, io.SeekStart); err != nil {
return nil, err
}
r = f
} else {
if _, err := io.CopyN(io.Discard, f, st.offset); err != nil {
return nil, err
}
r = f
}
br := bufio.NewReaderSize(r, min(32<<10, int(st.size)))
var out []float32
switch st.dtype {
case "F32":
out = make([]float32, st.size/4)
if err := binary.Read(br, binary.LittleEndian, out); err != nil {
return nil, err
}
case "F16":
raw := make([]uint16, st.size/2)
if err := binary.Read(br, binary.LittleEndian, raw); err != nil {
return nil, err
}
out = make([]float32, len(raw))
for i, v := range raw {
out[i] = float16.Frombits(v).Float32()
}
case "BF16":
raw := make([]uint8, st.size)
if err := binary.Read(br, binary.LittleEndian, raw); err != nil {
return nil, err
}
out = bfloat16.DecodeFloat32(raw)
case "F8_E4M3":
raw := make([]uint8, st.size)
if err := binary.Read(br, binary.LittleEndian, raw); err != nil {
return nil, err
}
out, err = st.decodeFP8E4M3(raw)
if err != nil {
return nil, err
}
default:
return nil, fmt.Errorf("unsupported safetensor dtype %q", st.dtype)
}
if st.repacker != nil {
out, err = st.repacker(st.Name(), out, st.Shape())
if err != nil {
return nil, err
}
}
return out, nil
}
func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
var out []*ggml.Tensor
@ -398,6 +957,13 @@ func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
name := t.Name()
shape := t.Shape()
if names := q.mtpTensorNames(name); len(names) > 0 {
for _, name := range names {
out = q.appendDirectTensor(out, t, name)
}
continue
}
if strings.HasSuffix(name, ".ssm_in.weight") {
if qkv, gate, ok := q.splitQKVZTensor(t); ok {
out = append(out, qkv, gate)
@ -464,7 +1030,7 @@ func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
}
out = append(out, &ggml.Tensor{Name: name, Kind: t.Kind(), Shape: slices.Clone(shape), WriterTo: t})
case strings.HasSuffix(name, ".ssm_dt"):
case strings.HasSuffix(name, ".ssm_dt"), strings.HasSuffix(name, ".ssm_dt.bias"):
if q.shouldReorderVHeads() {
t.SetRepacker(q.repackReorderDim(0, 1))
}
@ -499,6 +1065,73 @@ func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
return out
}
func (q *qwen3NextModel) appendDirectTensor(out []*ggml.Tensor, t Tensor, name string) []*ggml.Tensor {
if qwen3NextShouldShiftNorm(name) {
t = t.Clone()
t.SetRepacker(q.addOne)
}
return append(out, &ggml.Tensor{Name: name, Kind: t.Kind(), Shape: slices.Clone(t.Shape()), WriterTo: t})
}
func qwen3NextShouldShiftNorm(name string) bool {
if strings.HasSuffix(name, ".ssm_norm.weight") {
return false
}
return strings.HasSuffix(name, "_norm.weight") ||
strings.HasSuffix(name, ".nextn.enorm.weight") ||
strings.HasSuffix(name, ".nextn.hnorm.weight")
}
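A small sketch of the norm-shift rule may help when reading the MTP tests. The suffix check is copied from qwen3NextShouldShiftNorm; the addOne body is an assumption inferred from TestQwen35MTPTensors, which expects {0, 1} to become {1, 2}, since the real q.addOne repacker is not shown in this hunk.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldShiftNorm mirrors the suffix rules in the diff: SSM norms are
// left alone, other *_norm weights and the nextn enorm/hnorm weights
// are shifted.
func shouldShiftNorm(name string) bool {
	if strings.HasSuffix(name, ".ssm_norm.weight") {
		return false
	}
	return strings.HasSuffix(name, "_norm.weight") ||
		strings.HasSuffix(name, ".nextn.enorm.weight") ||
		strings.HasSuffix(name, ".nextn.hnorm.weight")
}

// addOne is an assumed implementation of the shift: every weight + 1.
func addOne(data []float32) []float32 {
	out := make([]float32, len(data))
	for i, v := range data {
		out[i] = v + 1
	}
	return out
}

func main() {
	for _, name := range []string{
		"blk.32.nextn.enorm.weight",
		"blk.32.nextn.shared_head_norm.weight",
		"blk.0.ssm_norm.weight",
		"blk.32.attn_q.weight",
	} {
		fmt.Println(name, shouldShiftNorm(name))
	}
	fmt.Println(addOne([]float32{0, 1}))
}
```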
func (q *qwen3NextModel) mtpTensorNames(name string) []string {
if !strings.HasPrefix(name, "mtp.") {
return nil
}
base := q.NumHiddenLayers
nextn := q.NumNextNPredictLayers
if nextn == 0 {
nextn = 1
}
if rest := strings.TrimPrefix(name, "mtp.layers."); rest != name {
layer, suffix, ok := strings.Cut(rest, ".")
if !ok {
return nil
}
idx, err := strconv.ParseUint(layer, 10, 32)
if err != nil {
return nil
}
return []string{fmt.Sprintf("blk.%d.%s", base+uint32(idx), suffix)}
}
var suffix string
switch name {
case "mtp.fc.weight":
suffix = "nextn.eh_proj.weight"
case "mtp.pre_fc_norm_embedding.weight":
suffix = "nextn.enorm.weight"
case "mtp.pre_fc_norm_hidden.weight":
suffix = "nextn.hnorm.weight"
case "mtp.norm.weight":
suffix = "nextn.shared_head_norm.weight"
case "mtp.embed_tokens.weight":
suffix = "nextn.embed_tokens.weight"
case "mtp.shared_head.head.weight":
suffix = "nextn.shared_head_head.weight"
case "mtp.shared_head.norm.weight":
suffix = "nextn.shared_head_norm.weight"
default:
return nil
}
names := make([]string, 0, nextn)
for i := range nextn {
names = append(names, fmt.Sprintf("blk.%d.%s", base+i, suffix))
}
return names
}
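The renaming performed by mtpTensorNames can be tried outside the converter. The sketch below keeps the same mapping table but takes base and nextn as plain arguments; mtpNames and sharedSuffix are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sharedSuffix lists the shared MTP tensors and their GGUF suffixes,
// copied from the switch statement in the diff.
var sharedSuffix = map[string]string{
	"mtp.fc.weight":                    "nextn.eh_proj.weight",
	"mtp.pre_fc_norm_embedding.weight": "nextn.enorm.weight",
	"mtp.pre_fc_norm_hidden.weight":    "nextn.hnorm.weight",
	"mtp.norm.weight":                  "nextn.shared_head_norm.weight",
	"mtp.embed_tokens.weight":          "nextn.embed_tokens.weight",
	"mtp.shared_head.head.weight":      "nextn.shared_head_head.weight",
	"mtp.shared_head.norm.weight":      "nextn.shared_head_norm.weight",
}

// mtpNames maps one source tensor name to its output block names.
// Per-layer tensors land at blk.<base+idx>; shared tensors are
// repeated for every MTP block.
func mtpNames(name string, base, nextn uint32) []string {
	if !strings.HasPrefix(name, "mtp.") {
		return nil
	}
	if nextn == 0 {
		nextn = 1
	}
	if rest, found := strings.CutPrefix(name, "mtp.layers."); found {
		layer, suffix, ok := strings.Cut(rest, ".")
		if !ok {
			return nil
		}
		idx, err := strconv.ParseUint(layer, 10, 32)
		if err != nil {
			return nil
		}
		return []string{fmt.Sprintf("blk.%d.%s", base+uint32(idx), suffix)}
	}
	suffix, ok := sharedSuffix[name]
	if !ok {
		return nil
	}
	names := make([]string, 0, nextn)
	for i := uint32(0); i < nextn; i++ {
		names = append(names, fmt.Sprintf("blk.%d.%s", base+i, suffix))
	}
	return names
}

func main() {
	fmt.Println(mtpNames("mtp.fc.weight", 32, 1))              // [blk.32.nextn.eh_proj.weight]
	fmt.Println(mtpNames("mtp.layers.0.attn_q.weight", 32, 1)) // [blk.32.attn_q.weight]
	fmt.Println(mtpNames("mtp.norm.weight", 32, 2))
}
```

With a 32-layer base model the MTP block is numbered 32, which is why the KV block_count becomes NumHiddenLayers + NumNextNPredictLayers.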
func (q *qwen3NextModel) repackReorderDim(dim, headDim int) Repacker {
return func(_ string, data []float32, shape []uint64) ([]float32, error) {
if !q.shouldReorderVHeads() {
@ -925,7 +1558,7 @@ func (q *qwen3NextModel) Replacements() []string {
"linear_attn.in_proj_b", "ssm_beta",
"linear_attn.conv1d", "ssm_conv1d",
"linear_attn.dt_bias", "ssm_dt",
"linear_attn.dt_bias", "ssm_dt.bias",
"linear_attn.dt_proj", "ssm_dt",
"linear_attn.A_log", "ssm_a",
"linear_attn.norm", "ssm_norm",

@ -4,10 +4,12 @@ import (
"bytes"
"encoding/binary"
"os"
"path/filepath"
"slices"
"strings"
"testing"
"github.com/d4l3k/go-bfloat16"
"github.com/ollama/ollama/fs/ggml"
)
@ -106,11 +108,7 @@ func TestQwen3NextKVLegacyConfig(t *testing.T) {
t.Fatalf("unexpected tokenizer pre: got %v want %v", got, want)
}
headCountKV, ok := kv["attention.head_count_kv"].([]uint32)
if !ok {
t.Fatalf("attention.head_count_kv has unexpected type: %T", kv["attention.head_count_kv"])
}
if got, want := headCountKV, []uint32{0, 2, 0, 2}; !slices.Equal(got, want) {
if got, want := kv["attention.head_count_kv"], uint32(2); got != want {
t.Fatalf("unexpected attention.head_count_kv: got %v want %v", got, want)
}
@ -198,6 +196,7 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
VisionModel: qwen3NextVisionConfig{
Depth: 2,
HiddenSize: 128,
IntermediateSize: 512,
NumHeads: 4,
InChannels: 3,
PatchSize: 16,
@ -225,11 +224,7 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
t.Fatalf("unexpected architecture: got %v want %v", got, want)
}
headCountKV, ok := kv["attention.head_count_kv"].([]uint32)
if !ok {
t.Fatalf("attention.head_count_kv has unexpected type: %T", kv["attention.head_count_kv"])
}
if got, want := headCountKV, []uint32{0, 4, 0, 4}; !slices.Equal(got, want) {
if got, want := kv["attention.head_count_kv"], uint32(4); got != want {
t.Fatalf("unexpected attention.head_count_kv: got %v want %v", got, want)
}
@ -248,7 +243,7 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
if !ok {
t.Fatalf("rope.dimension_sections has unexpected type: %T", kv["rope.dimension_sections"])
}
if got, want := ropeSections, []int32{11, 11, 10}; !slices.Equal(got, want) {
if got, want := ropeSections, []int32{11, 11, 10, 0}; !slices.Equal(got, want) {
t.Fatalf("unexpected rope.dimension_sections: got %v want %v", got, want)
}
@ -259,6 +254,254 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
if got, want := kv["vision.block_count"], uint32(2); got != want {
t.Fatalf("unexpected vision.block_count: got %v want %v", got, want)
}
if got, want := kv["vision.feed_forward_length"], uint32(512); got != want {
t.Fatalf("unexpected vision.feed_forward_length: got %v want %v", got, want)
}
}
func TestQwen35MTPTensors(t *testing.T) {
m := &qwen3NextModel{
ModelParameters: ModelParameters{
ModelType: "qwen3_5",
},
qwen3NextTextConfig: qwen3NextTextConfig{
NumHiddenLayers: 32,
NumNextNPredictLayers: 1,
},
}
kv := m.KV(&Tokenizer{Vocabulary: &Vocabulary{}})
if got, want := kv["block_count"], uint32(33); got != want {
t.Fatalf("unexpected block_count: got %v want %v", got, want)
}
if got, want := kv["nextn_predict_layers"], uint32(1); got != want {
t.Fatalf("unexpected nextn_predict_layers: got %v want %v", got, want)
}
tensors := m.Tensors([]Tensor{
&fakeTensor{name: "mtp.fc.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
&fakeTensor{name: "mtp.pre_fc_norm_embedding.weight", shape: []uint64{2}, data: []float32{0, 1}},
&fakeTensor{name: "mtp.pre_fc_norm_hidden.weight", shape: []uint64{2}, data: []float32{0, 1}},
&fakeTensor{name: "mtp.norm.weight", shape: []uint64{2}, data: []float32{0, 1}},
&fakeTensor{name: "mtp.layers.0.attn_q.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
&fakeTensor{name: "mtp.layers.0.ffn_down.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
})
byName := map[string]*ggml.Tensor{}
for _, tensor := range tensors {
byName[tensor.Name] = tensor
}
for _, name := range []string{
"blk.32.nextn.eh_proj.weight",
"blk.32.nextn.enorm.weight",
"blk.32.nextn.hnorm.weight",
"blk.32.nextn.shared_head_norm.weight",
"blk.32.attn_q.weight",
"blk.32.ffn_down.weight",
} {
if _, ok := byName[name]; !ok {
t.Fatalf("missing MTP tensor %q", name)
}
}
for _, name := range []string{
"blk.32.nextn.enorm.weight",
"blk.32.nextn.hnorm.weight",
"blk.32.nextn.shared_head_norm.weight",
} {
if got, want := readTensorData(t, byName[name]), []float32{1, 2}; !slices.Equal(got, want) {
t.Fatalf("unexpected shifted norm values for %s: got %v want %v", name, got, want)
}
}
}
func TestQwen35NativeSplitKV(t *testing.T) {
m := &qwen3NextModel{
ModelParameters: ModelParameters{
ModelType: "qwen3_5",
},
TextConfig: &qwen3NextTextConfig{
MaxPositionEmbeddings: 16384,
HiddenSize: 2560,
NumHiddenLayers: 4,
IntermediateSize: 9216,
NumAttentionHeads: 16,
NumKeyValueHeads: 4,
HeadDim: 256,
RMSNormEPS: 1e-6,
FullAttentionInterval: 2,
LinearConvKernelDim: 4,
LinearKeyHeadDim: 128,
LinearNumKeyHeads: 16,
LinearNumValueHeads: 32,
LinearValueHeadDim: 128,
RopeParameters: qwen3NextRopeParams{
MRopeInterleaved: true,
MropeSection: []int32{11, 11, 10},
RopeTheta: 10_000_000,
PartialRotaryFactor: 0.25,
},
},
VisionModel: qwen3NextVisionConfig{
Depth: 24,
HiddenSize: 1024,
IntermediateSize: 4096,
NumHeads: 16,
NumPositionEmbeddings: 2304,
InChannels: 3,
OutHiddenSize: 2560,
PatchSize: 16,
SpatialMergeSize: 2,
},
ImageTokenID: 248056,
VisionStartTokenID: 248053,
VisionEndTokenID: 248054,
}
m.VisionModel.ImageMean = []float32{0.5, 0.5, 0.5}
m.VisionModel.ImageStd = []float32{0.5, 0.5, 0.5}
if err := m.parseMore(os.DirFS(t.TempDir())); err != nil {
t.Fatal(err)
}
textKV := m.TextKV(&Tokenizer{Vocabulary: &Vocabulary{}})
for _, key := range []string{
"vision.block_count",
"image_token_id",
"vision_start_token_id",
"vision_end_token_id",
"mrope_sections",
"rope.mrope_section",
"rope.mrope_interleaved",
"ssm.v_head_reordered",
} {
if _, ok := textKV[key]; ok {
t.Fatalf("TextKV retained %q", key)
}
}
if got, want := textKV["rope.dimension_sections"], []int32{11, 11, 10, 0}; !slices.Equal(got.([]int32), want) {
t.Fatalf("unexpected rope.dimension_sections: got %v want %v", got, want)
}
projectorKV := m.ProjectorKV(&Tokenizer{Vocabulary: &Vocabulary{}})
if got, want := projectorKV["general.architecture"], "clip"; got != want {
t.Fatalf("unexpected projector architecture: got %v want %v", got, want)
}
if got, want := projectorKV["clip.projector_type"], "qwen3vl_merger"; got != want {
t.Fatalf("unexpected projector type: got %v want %v", got, want)
}
if got, want := projectorKV["clip.vision.feed_forward_length"], uint32(4096); got != want {
t.Fatalf("unexpected projector feed_forward_length: got %v want %v", got, want)
}
if got, want := projectorKV["clip.vision.image_size"], uint32(768); got != want {
t.Fatalf("unexpected projector image_size: got %v want %v", got, want)
}
if got, want := projectorKV["clip.vision.projection_dim"], uint32(2560); got != want {
t.Fatalf("unexpected projector projection_dim: got %v want %v", got, want)
}
}
func TestQwen35ProjectorTensors(t *testing.T) {
m := &qwen3NextModel{
VisionModel: qwen3NextVisionConfig{Depth: 1},
}
patch := &fakeTensor{
name: "v.patch_embed.weight",
shape: []uint64{2, 2, 2, 1, 2},
data: []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
}
tensors := m.ProjectorTensors([]Tensor{
patch,
&fakeTensor{name: "v.pos_embed.weight", shape: []uint64{4, 2}, data: []float32{0, 1, 2, 3, 4, 5, 6, 7}},
&fakeTensor{name: "v.blk.0.attn_qkv.weight", shape: []uint64{6, 2}, data: make([]float32, 12), sourceDType: "BF16", kind: tensorKindFP16},
&fakeTensor{name: "v.blk.0.mlp.linear_fc1.weight", shape: []uint64{8, 2}, data: make([]float32, 16), sourceDType: "BF16", kind: tensorKindFP16},
&fakeTensor{name: "token_embd.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
&fakeTensor{name: "mtp.fc.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
})
byName := map[string]*ggml.Tensor{}
for _, tensor := range tensors {
byName[tensor.Name] = tensor
}
if _, ok := byName["token_embd.weight"]; ok {
t.Fatalf("projector tensors included text tensor")
}
if _, ok := byName["mtp.fc.weight"]; ok {
t.Fatalf("projector tensors included MTP tensor")
}
if got := byName["v.position_embd.weight"]; got == nil || got.Kind != tensorKindFP32 {
t.Fatalf("position embedding was not promoted to F32: %#v", got)
}
if got := byName["v.blk.0.attn_qkv.weight"]; got == nil {
t.Fatalf("attn_qkv tensor missing")
} else if got.Kind != tensorKindBF16 {
t.Fatalf("attn_qkv tensor was not preserved as BF16: %#v", got)
}
if got := byName["v.blk.0.ffn_up.weight"]; got == nil {
t.Fatalf("ffn_up tensor missing")
} else if got.Kind != tensorKindBF16 {
t.Fatalf("ffn_up tensor was not preserved as BF16: %#v", got)
}
first := byName["v.patch_embd.weight"]
if first == nil {
t.Fatalf("first patch embedding slice missing")
}
if got, want := first.Shape, []uint64{2, 2, 1, 2}; !slices.Equal(got, want) {
t.Fatalf("unexpected first patch shape: got %v want %v", got, want)
}
if got, want := readTensorData(t, first), []float32{0, 1, 4, 5, 8, 9, 12, 13}; !slices.Equal(got, want) {
t.Fatalf("unexpected first patch data: got %v want %v", got, want)
}
second := byName["v.patch_embd.weight.1"]
if second == nil {
t.Fatalf("second patch embedding slice missing")
}
if got, want := readTensorData(t, second), []float32{2, 3, 6, 7, 10, 11, 14, 15}; !slices.Equal(got, want) {
t.Fatalf("unexpected second patch data: got %v want %v", got, want)
}
}
func TestQwen35BF16ProjectorWriterPreservesSource(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "tensor.bin")
values := []float32{1, -2, 3.5, 4.25}
raw := bfloat16.EncodeFloat32(values)
if err := os.WriteFile(path, raw, 0o644); err != nil {
t.Fatal(err)
}
st := safetensor{
fs: os.DirFS(dir),
path: "tensor.bin",
dtype: "BF16",
offset: 0,
size: int64(len(raw)),
tensorBase: &tensorBase{
name: "v.blk.0.attn_qkv.weight",
shape: []uint64{2, 2},
},
}
tensor := &ggml.Tensor{
Name: "v.blk.0.attn_qkv.weight",
Kind: tensorKindBF16,
Shape: []uint64{2, 2},
WriterTo: tensorBF16Writer{tensor: st},
}
var got bytes.Buffer
if n, err := tensor.WriteTo(&got); err != nil {
t.Fatal(err)
} else if n != int64(len(raw)) {
t.Fatalf("unexpected byte count: got %d want %d", n, len(raw))
}
if !bytes.Equal(got.Bytes(), raw) {
t.Fatalf("BF16 writer changed source bytes: got %x want %x", got.Bytes(), raw)
}
}
func TestQwen3NextReplacements(t *testing.T) {
@ -273,6 +516,12 @@ func TestQwen3NextReplacements(t *testing.T) {
if got, want := r.Replace("model.layers.1.linear_attn.in_proj_qkvz.weight"), "blk.1.ssm_in.weight"; got != want {
t.Fatalf("unexpected legacy replacement: got %q want %q", got, want)
}
if got, want := r.Replace("model.layers.1.linear_attn.dt_bias"), "blk.1.ssm_dt.bias"; got != want {
t.Fatalf("unexpected dt bias replacement: got %q want %q", got, want)
}
if got, want := r.Replace("model.layers.1.linear_attn.dt_proj.weight"), "blk.1.ssm_dt.weight"; got != want {
t.Fatalf("unexpected dt projection replacement: got %q want %q", got, want)
}
}
func TestQwen35ReordersVHeads(t *testing.T) {
@ -399,6 +648,33 @@ func TestQwen35ReordersSsmBetaRows(t *testing.T) {
}
}
func TestQwen35ReordersSsmDtBias(t *testing.T) {
m := &qwen3NextModel{
ModelParameters: ModelParameters{
ModelType: "qwen3_5",
},
qwen3NextTextConfig: qwen3NextTextConfig{
LinearNumKeyHeads: 2,
LinearNumValueHeads: 4,
},
}
out := m.Tensors([]Tensor{
&fakeTensor{
name: "blk.0.ssm_dt.bias",
shape: []uint64{4},
data: []float32{0, 1, 2, 3},
},
})
if len(out) != 1 {
t.Fatalf("unexpected output tensor count: got %d want 1", len(out))
}
if got, want := readTensorData(t, out[0]), []float32{0, 2, 1, 3}; !slices.Equal(got, want) {
t.Fatalf("unexpected ssm_dt.bias data: got %v want %v", got, want)
}
}
func TestQwen35ReordersConv1DChannelDim(t *testing.T) {
m := &qwen3NextModel{
ModelParameters: ModelParameters{

@ -3,8 +3,13 @@ package convert
import (
"cmp"
"encoding/json"
"fmt"
"io"
"io/fs"
"math"
"regexp"
"slices"
"strconv"
"strings"
"github.com/ollama/ollama/fs/ggml"
@ -25,6 +30,9 @@ type qwen3VLModel struct {
RopeTheta float32 `json:"rope_theta"`
TemporalPatchSize uint32 `json:"temporal_patch_size"`
DeepstackVisualIndexes []int32 `json:"deepstack_visual_indexes"`
IntermediateSize uint32 `json:"intermediate_size"`
OutHiddenSize uint32 `json:"out_hidden_size"`
NumPositionEmbeddings uint32 `json:"num_position_embeddings"`
Size struct {
ShortestEdge uint32 `json:"shortest_edge"`
@ -36,6 +44,8 @@ type qwen3VLModel struct {
} `json:"vision_config"`
}
var _ MultimodalConverter = (*qwen3VLModel)(nil)
func (m *qwen3VLModel) parseMore(fsys fs.FS) error {
bts, err := fs.ReadFile(fsys, "preprocessor_config.json")
if err != nil {
@ -55,8 +65,20 @@ func (m *qwen3VLModel) KV(t *Tokenizer) KV {
// override architecture
kv["general.architecture"] = arch
if sections := m.RopeScaling.MropeSection; len(sections) > 0 {
dimensionSections := append([]int32(nil), sections...)
if len(dimensionSections) == 3 {
dimensionSections = append(dimensionSections, 0)
}
kv["rope.dimension_sections"] = dimensionSections
}
kv["n_deepstack_layers"] = uint32(len(m.VisionModel.DeepstackVisualIndexes))
kv["vision.block_count"] = cmp.Or(m.VisionModel.Depth, 32)
kv["vision.embedding_length"] = m.VisionModel.HiddenSize
if m.VisionModel.IntermediateSize > 0 {
kv["vision.feed_forward_length"] = m.VisionModel.IntermediateSize
}
kv["vision.attention.head_count"] = cmp.Or(m.VisionModel.NumHeads, 16)
kv["vision.num_channels"] = m.VisionModel.InChannels
kv["vision.patch_size"] = cmp.Or(m.VisionModel.PatchSize, 14)
@ -75,6 +97,234 @@ func (m *qwen3VLModel) KV(t *Tokenizer) KV {
return kv
}
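Both converters pad a three-entry mrope section list to four entries before writing rope.dimension_sections. A minimal sketch of that padding, using a hypothetical helper name padSections, shows that the input slice is copied rather than modified:

```go
package main

import "fmt"

// padSections copies the section list and appends a trailing zero
// when exactly three sections are given. Hypothetical helper that
// mirrors the inline code in the KV functions.
func padSections(sections []int32) []int32 {
	out := append([]int32(nil), sections...)
	if len(out) == 3 {
		out = append(out, 0)
	}
	return out
}

func main() {
	in := []int32{11, 11, 10}
	fmt.Println(padSections(in)) // [11 11 10 0]
	fmt.Println(in)              // [11 11 10]
}
```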
func (m *qwen3VLModel) TextKV(t *Tokenizer) KV {
kv := m.KV(t)
for _, key := range []string{
"vision.block_count",
"vision.embedding_length",
"vision.feed_forward_length",
"vision.attention.head_count",
"vision.num_channels",
"vision.patch_size",
"vision.spatial_merge_size",
"vision.attention.layer_norm_epsilon",
"vision.rope.freq_base",
"vision.temporal_patch_size",
"vision.deepstack_visual_indexes",
"vision.shortest_edge",
"vision.longest_edge",
"vision.image_mean",
"vision.image_std",
"rope.mrope_section",
} {
delete(kv, key)
}
return kv
}
func (m *qwen3VLModel) ProjectorKV(*Tokenizer) KV {
depth := cmp.Or(m.VisionModel.Depth, uint32(32))
deepstack := make([]bool, depth)
for _, idx := range m.VisionModel.DeepstackVisualIndexes {
if idx >= 0 && uint32(idx) < depth {
deepstack[idx] = true
}
}
projectionDim := m.VisionModel.OutHiddenSize
if projectionDim == 0 {
projectionDim = m.HiddenSize
}
layerNormEps := m.VisionModel.RMSNormEps
if layerNormEps == 0 {
layerNormEps = 1e-6
}
kv := KV{
"general.architecture": "clip",
"general.type": "mmproj",
"general.file_type": uint32(1),
"general.quantization_version": uint32(2),
"clip.has_vision_encoder": true,
"clip.projector_type": "qwen3vl_merger",
"clip.use_gelu": true,
"clip.vision.block_count": depth,
"clip.vision.embedding_length": m.VisionModel.HiddenSize,
"clip.vision.feed_forward_length": cmp.Or(m.VisionModel.IntermediateSize, m.VisionModel.HiddenSize*4),
"clip.vision.attention.head_count": cmp.Or(m.VisionModel.NumHeads, uint32(16)),
"clip.vision.attention.layer_norm_epsilon": layerNormEps,
"clip.vision.num_channels": m.VisionModel.InChannels,
"clip.vision.patch_size": cmp.Or(m.VisionModel.PatchSize, uint32(14)),
"clip.vision.spatial_merge_size": cmp.Or(m.VisionModel.SpatialMergeSize, uint32(2)),
"clip.vision.image_size": m.projectorImageSize(),
"clip.vision.projection_dim": projectionDim,
"clip.vision.temporal_patch_size": cmp.Or(m.VisionModel.TemporalPatchSize, uint32(2)),
"clip.vision.rope.freq_base": cmp.Or(m.VisionModel.RopeTheta, float32(1e4)),
"clip.vision.is_deepstack_layers": deepstack,
}
if m.VisionModel.Size.ShortestEdge > 0 {
kv["clip.vision.image_min_pixels"] = m.VisionModel.Size.ShortestEdge
}
if m.VisionModel.Size.LongestEdge > 0 {
kv["clip.vision.image_max_pixels"] = m.VisionModel.Size.LongestEdge
}
if len(m.VisionModel.ImageMean) == 3 {
kv["clip.vision.image_mean"] = m.VisionModel.ImageMean
}
if len(m.VisionModel.ImageStd) == 3 {
kv["clip.vision.image_std"] = m.VisionModel.ImageStd
}
return kv
}
func (m *qwen3VLModel) projectorImageSize() uint32 {
if m.VisionModel.NumPositionEmbeddings > 0 && m.VisionModel.PatchSize > 0 {
root := uint32(math.Sqrt(float64(m.VisionModel.NumPositionEmbeddings)))
if root*root == m.VisionModel.NumPositionEmbeddings {
return root * m.VisionModel.PatchSize
}
}
return uint32(768)
}
func qwen3VLVisionTensor(name string) bool {
return strings.HasPrefix(name, "v.") || strings.HasPrefix(name, "mm.")
}
func (m *qwen3VLModel) TextTensors(ts []Tensor, _ *Tokenizer) []*ggml.Tensor {
var textOnly []Tensor
for _, t := range ts {
if qwen3VLVisionTensor(t.Name()) {
continue
}
textOnly = append(textOnly, t)
}
return m.qwen3Model.Tensors(textOnly)
}
// qwen3VLDeepstackRegex is compiled once at package scope rather than on every
// rename call.
var qwen3VLDeepstackRegex = regexp.MustCompile(`^v\.deepstack\.(\d+)\.(.+)$`)
func (m *qwen3VLModel) qwen3VLProjectorRename(name string) string {
if strings.HasPrefix(name, "v.merger.") {
name = strings.Replace(name, "v.merger.linear_fc1", "mm.0", 1)
name = strings.Replace(name, "v.merger.linear_fc2", "mm.2", 1)
name = strings.Replace(name, "v.merger.norm", "v.post_ln", 1)
return name
}
if strings.HasPrefix(name, "v.deepstack.") {
if matches := qwen3VLDeepstackRegex.FindStringSubmatch(name); matches != nil {
seqIdx, err := strconv.Atoi(matches[1])
if err == nil && seqIdx < len(m.VisionModel.DeepstackVisualIndexes) {
blockIdx := m.VisionModel.DeepstackVisualIndexes[seqIdx]
suffix := matches[2]
suffix = strings.Replace(suffix, "linear_fc1", "fc1", 1)
suffix = strings.Replace(suffix, "linear_fc2", "fc2", 1)
return fmt.Sprintf("v.deepstack.%d.%s", blockIdx, suffix)
}
}
}
return name
}
func (m *qwen3VLModel) ProjectorTensors(ts []Tensor) []*ggml.Tensor {
var out []*ggml.Tensor
for _, t := range ts {
if !qwen3VLVisionTensor(t.Name()) {
continue
}
name := m.qwen3VLProjectorRename(t.Name())
if name == "v.patch_embd.weight" {
out = append(out, m.qwen3VLPatchEmbedTensors(t)...)
continue
}
kind := t.Kind()
var writer io.WriterTo = t
if name == "v.position_embd.weight" {
kind = tensorKindFP32
writer = tensorFloat32Writer{tensor: t}
} else if sourceDType(t) == "BF16" && kind == tensorKindFP16 {
kind = tensorKindBF16
writer = tensorBF16Writer{tensor: t}
}
out = append(out, &ggml.Tensor{
Name: name,
Kind: kind,
Shape: slices.Clone(t.Shape()),
WriterTo: writer,
})
}
return out
}
func (m *qwen3VLModel) qwen3VLPatchEmbedTensors(t Tensor) []*ggml.Tensor {
shape := t.Shape()
if len(shape) != 5 || shape[2] != 2 {
return nil
}
outShape := []uint64{shape[0], shape[1], shape[3], shape[4]}
return []*ggml.Tensor{
{
Name: "v.patch_embd.weight",
Kind: tensorKindFP32,
Shape: slices.Clone(outShape),
WriterTo: tensorFloat32Writer{tensor: t, repacker: qwenTemporalPatchEmbedSlice(0)},
},
{
Name: "v.patch_embd.weight.1",
Kind: tensorKindFP32,
Shape: slices.Clone(outShape),
WriterTo: tensorFloat32Writer{tensor: t, repacker: qwenTemporalPatchEmbedSlice(1)},
},
}
}
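// qwenTemporalPatchEmbedSlice returns a Repacker that extracts one temporal
// frame from a row-major [out, in, frames, height, width] patch embedding and
// emits it as a [out, in, height, width] tensor.
func qwenTemporalPatchEmbedSlice(slice int) Repacker {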
return func(_ string, data []float32, shape []uint64) ([]float32, error) {
if len(shape) != 5 || shape[2] != 2 {
return nil, fmt.Errorf("qwen temporal patch embedding shape %v", shape)
}
outChannels := int(shape[0])
inChannels := int(shape[1])
frames := int(shape[2])
height := int(shape[3])
width := int(shape[4])
if slice < 0 || slice >= frames {
return nil, fmt.Errorf("qwen temporal patch embedding slice %d out of range", slice)
}
expected := outChannels * inChannels * frames * height * width
if len(data) != expected {
return nil, fmt.Errorf("qwen temporal patch embedding data size %d, expected %d", len(data), expected)
}
out := make([]float32, outChannels*inChannels*height*width)
for oc := range outChannels {
for ic := range inChannels {
for y := range height {
for x := range width {
src := ((((oc*inChannels+ic)*frames+slice)*height + y) * width) + x
dst := (((oc*inChannels+ic)*height + y) * width) + x
out[dst] = data[src]
}
}
}
}
return out, nil
}
}
func (m *qwen3VLModel) Tensors(ts []Tensor) []*ggml.Tensor {
var rest []Tensor
var out []*ggml.Tensor
@ -107,10 +357,15 @@ func (m *qwen3VLModel) Replacements() []string {
m.qwen3Model.Replacements(),
"model.language_", "",
"model.visual", "v",
-"patch_embed.proj", "patch_embed",
+"patch_embed.proj", "patch_embd",
"pos_embed", "position_embd",
"blocks", "blk",
"attn.qkv", "attn_qkv",
"attn.proj", "attn_out",
-"deepstack_merger_list", "deepstack_merger",
"norm1", "ln1",
"norm2", "ln2",
"mlp.linear_fc1", "ffn_up",
"mlp.linear_fc2", "ffn_down",
+"deepstack_merger_list", "deepstack",
)
}
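
The temporal patch-embedding split above can be tried in isolation. The following is a minimal sketch, not the converter's code: `sliceTemporalFrame` is a hypothetical helper that assumes a row-major `[out, in, frames, height, width]` layout and copies one frame's `height*width` plane per `(out, in)` pair instead of looping over individual elements.

```go
package main

import "fmt"

// sliceTemporalFrame (hypothetical helper) copies one temporal frame out of a
// row-major [out, in, frames, height, width] tensor and returns it as a
// row-major [out, in, height, width] tensor.
func sliceTemporalFrame(data []float32, shape [5]int, frame int) ([]float32, error) {
	oc, ic, frames, h, w := shape[0], shape[1], shape[2], shape[3], shape[4]
	if frame < 0 || frame >= frames {
		return nil, fmt.Errorf("frame %d out of range [0, %d)", frame, frames)
	}
	if len(data) != oc*ic*frames*h*w {
		return nil, fmt.Errorf("data size %d does not match shape %v", len(data), shape)
	}
	plane := h * w
	out := make([]float32, 0, oc*ic*plane)
	// Each (out, in) pair owns `frames` consecutive planes; pick the requested one.
	for c := 0; c < oc*ic; c++ {
		start := (c*frames + frame) * plane
		out = append(out, data[start:start+plane]...)
	}
	return out, nil
}

func main() {
	data := make([]float32, 16)
	for i := range data {
		data[i] = float32(i)
	}
	shape := [5]int{2, 2, 2, 1, 2}
	for frame := 0; frame < 2; frame++ {
		out, err := sliceTemporalFrame(data, shape, frame)
		if err != nil {
			panic(err)
		}
		fmt.Println(frame, out)
	}
}
```

With the same 16-element input used in TestQwen3VLProjectorTensors, frame 0 yields `[0 1 4 5 8 9 12 13]` and frame 1 yields `[2 3 6 7 10 11 14 15]`, matching the expectations in that test.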


@ -0,0 +1,147 @@
package convert
import (
"slices"
"testing"
"github.com/ollama/ollama/fs/ggml"
)
func TestQwen3VLTextAndProjectorKV(t *testing.T) {
m := &qwen3VLModel{
qwen3Model: qwen3Model{
HiddenSize: 2048,
},
}
m.RopeScaling.Type = "mrope"
m.RopeScaling.MropeSection = []int32{24, 20, 20}
m.VisionModel.Depth = 24
m.VisionModel.HiddenSize = 1024
m.VisionModel.IntermediateSize = 4096
m.VisionModel.OutHiddenSize = 2048
m.VisionModel.NumHeads = 16
m.VisionModel.InChannels = 3
m.VisionModel.PatchSize = 16
m.VisionModel.SpatialMergeSize = 2
m.VisionModel.NumPositionEmbeddings = 2304
m.VisionModel.TemporalPatchSize = 2
m.VisionModel.RMSNormEps = 1e-6
m.VisionModel.RopeTheta = 10000
m.VisionModel.DeepstackVisualIndexes = []int32{5, 11, 17}
m.VisionModel.ImageMean = []float32{0.5, 0.5, 0.5}
m.VisionModel.ImageStd = []float32{0.5, 0.5, 0.5}
textKV := m.TextKV(&Tokenizer{Vocabulary: &Vocabulary{}})
if got, want := textKV["general.architecture"], "qwen3vl"; got != want {
t.Fatalf("unexpected text architecture: got %v want %v", got, want)
}
if got, want := textKV["rope.dimension_sections"], []int32{24, 20, 20, 0}; !slices.Equal(got.([]int32), want) {
t.Fatalf("unexpected rope.dimension_sections: got %v want %v", got, want)
}
if got, want := textKV["n_deepstack_layers"], uint32(3); got != want {
t.Fatalf("unexpected n_deepstack_layers: got %v want %v", got, want)
}
for _, key := range []string{"vision.block_count", "vision.deepstack_visual_indexes", "rope.mrope_section"} {
if _, ok := textKV[key]; ok {
t.Fatalf("TextKV retained %q", key)
}
}
projectorKV := m.ProjectorKV(&Tokenizer{Vocabulary: &Vocabulary{}})
if got, want := projectorKV["general.architecture"], "clip"; got != want {
t.Fatalf("unexpected projector architecture: got %v want %v", got, want)
}
if got, want := projectorKV["general.type"], "mmproj"; got != want {
t.Fatalf("unexpected projector type: got %v want %v", got, want)
}
if got, want := projectorKV["clip.projector_type"], "qwen3vl_merger"; got != want {
t.Fatalf("unexpected clip.projector_type: got %v want %v", got, want)
}
if got, want := projectorKV["clip.vision.feed_forward_length"], uint32(4096); got != want {
t.Fatalf("unexpected feed_forward_length: got %v want %v", got, want)
}
if got, want := projectorKV["clip.vision.image_size"], uint32(768); got != want {
t.Fatalf("unexpected image_size: got %v want %v", got, want)
}
mask, ok := projectorKV["clip.vision.is_deepstack_layers"].([]bool)
if !ok {
t.Fatalf("deepstack mask has unexpected type: %T", projectorKV["clip.vision.is_deepstack_layers"])
}
if len(mask) != 24 || !mask[5] || !mask[11] || !mask[17] {
t.Fatalf("unexpected deepstack mask: %v", mask)
}
}
func TestQwen3VLProjectorTensors(t *testing.T) {
m := &qwen3VLModel{}
m.VisionModel.DeepstackVisualIndexes = []int32{5, 11, 17}
tensors := m.ProjectorTensors([]Tensor{
&fakeTensor{
name: "v.patch_embd.weight",
shape: []uint64{2, 2, 2, 1, 2},
data: []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
},
&fakeTensor{name: "v.position_embd.weight", shape: []uint64{4, 2}, data: []float32{0, 1, 2, 3, 4, 5, 6, 7}},
&fakeTensor{name: "v.merger.linear_fc1.weight", shape: []uint64{4, 2}, data: make([]float32, 8)},
&fakeTensor{name: "v.merger.linear_fc2.bias", shape: []uint64{4}, data: make([]float32, 4)},
&fakeTensor{name: "v.merger.norm.weight", shape: []uint64{2}, data: make([]float32, 2)},
&fakeTensor{name: "v.deepstack.0.linear_fc1.weight", shape: []uint64{4, 2}, data: make([]float32, 8)},
&fakeTensor{name: "v.deepstack.1.norm.bias", shape: []uint64{2}, data: make([]float32, 2)},
&fakeTensor{name: "v.blk.0.attn_qkv.weight", shape: []uint64{6, 2}, data: make([]float32, 12), sourceDType: "BF16", kind: tensorKindFP16},
&fakeTensor{name: "token_embd.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
})
byName := map[string]uint32{}
for _, tensor := range tensors {
byName[tensor.Name] = tensor.Kind
}
if _, ok := byName["token_embd.weight"]; ok {
t.Fatalf("projector tensors included text tensor")
}
if got := byName["v.position_embd.weight"]; got != tensorKindFP32 {
t.Fatalf("position embedding was not promoted to F32: %d", got)
}
if got := byName["v.blk.0.attn_qkv.weight"]; got != tensorKindBF16 {
t.Fatalf("BF16 projector tensor was not preserved: %d", got)
}
for _, name := range []string{
"mm.0.weight",
"mm.2.bias",
"v.post_ln.weight",
"v.deepstack.5.fc1.weight",
"v.deepstack.11.norm.bias",
} {
if _, ok := byName[name]; !ok {
t.Fatalf("missing projector tensor %q", name)
}
}
firstTensor := tensorsByName(tensors)["v.patch_embd.weight"]
if firstTensor == nil {
t.Fatalf("first patch embedding slice missing")
}
if got, want := firstTensor.Shape, []uint64{2, 2, 1, 2}; !slices.Equal(got, want) {
t.Fatalf("unexpected first patch shape: got %v want %v", got, want)
}
if got, want := readTensorData(t, firstTensor), []float32{0, 1, 4, 5, 8, 9, 12, 13}; !slices.Equal(got, want) {
t.Fatalf("unexpected first patch data: got %v want %v", got, want)
}
secondTensor := tensorsByName(tensors)["v.patch_embd.weight.1"]
if secondTensor == nil {
t.Fatalf("second patch embedding slice missing")
}
if got, want := readTensorData(t, secondTensor), []float32{2, 3, 6, 7, 10, 11, 14, 15}; !slices.Equal(got, want) {
t.Fatalf("unexpected second patch data: got %v want %v", got, want)
}
}
func tensorsByName(tensors []*ggml.Tensor) map[string]*ggml.Tensor {
byName := map[string]*ggml.Tensor{}
for _, tensor := range tensors {
byName[tensor.Name] = tensor
}
return byName
}
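
One observation on the image-size check in TestQwen3VLTextAndProjectorKV: 2304 position embeddings with a patch size of 16 give 48 * 16 = 768, which equals the fallback value, so that case alone cannot tell the derived path from the fallback path. The sketch below restates the derivation as a standalone function; `imageSizeFromPositions` and its `fallback` parameter are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// imageSizeFromPositions (hypothetical helper) returns gridSide*patchSize when
// numPositions is a non-zero perfect square, and fallback otherwise.
func imageSizeFromPositions(numPositions, patchSize, fallback uint32) uint32 {
	if numPositions == 0 || patchSize == 0 {
		return fallback
	}
	side := uint32(math.Sqrt(float64(numPositions)))
	if side*side != numPositions {
		return fallback
	}
	return side * patchSize
}

func main() {
	// Same inputs as the test above: the derived value coincides with the fallback.
	fmt.Println(imageSizeFromPositions(2304, 16, 768))
	// A 32x32 grid with 14-pixel patches distinguishes the two paths.
	fmt.Println(imageSizeFromPositions(1024, 14, 768))
	// Not a perfect square, so the fallback is used.
	fmt.Println(imageSizeFromPositions(1000, 14, 768))
}
```

A test case such as 1024 positions with patch size 14 (expected 448) would exercise the square-root branch directly.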


@ -22,6 +22,7 @@ type fakeTensor struct {
data []float32
sourceDType string
kind uint32
repacker Repacker
}
@ -34,6 +35,9 @@ func (f fakeTensor) Shape() []uint64 {
}
func (f fakeTensor) Kind() uint32 {
if f.kind != 0 {
return f.kind
}
return 0
}
@ -51,6 +55,7 @@ func (f fakeTensor) Clone() Tensor {
shape: slices.Clone(f.shape),
data: slices.Clone(f.data),
sourceDType: f.sourceDType,
kind: f.kind,
repacker: f.repacker,
}
}


@ -149,6 +149,7 @@ func parseTokenizer(fsys fs.FS, specialTokenTypes []string) (*Tokenizer, error)
if err := json.Unmarshal(bts, &sv.AddToken); err != nil {
return nil, err
}
sv.AddTokenSet = true
}
if bts, ok := p[fmt.Sprintf("%s_token", st)]; ok {
@ -314,6 +315,10 @@ type SpecialVocabulary struct {
ID int
Content string
AddToken bool
// AddTokenSet tracks whether tokenizer_config.json explicitly defined the
// add_*_token setting. Missing and explicit false have different GGUF
// semantics for some tokenizers.
AddTokenSet bool
// IDs is populated by generation_config.json
IDs []int32


@ -184,8 +184,8 @@ func TestParseTokenizer(t *testing.T) {
},
SpecialVocabulary: []*SpecialVocabulary{
{Type: "pad", Content: "<pad>", ID: 0, AddToken: false},
-{Type: "eos", Content: "<eos>", ID: 1, AddToken: false},
-{Type: "bos", Content: "<bos>", ID: 2, AddToken: true},
+{Type: "eos", Content: "<eos>", ID: 1, AddToken: false, AddTokenSet: true},
+{Type: "bos", Content: "<bos>", ID: 2, AddToken: true, AddTokenSet: true},
{Type: "unk", Content: "<unk>", ID: 3, AddToken: false},
},
Pre: "default",
@ -380,8 +380,8 @@ func TestParseTokenizer(t *testing.T) {
Types: []int32{3, 3, 3, 3},
},
SpecialVocabulary: []*SpecialVocabulary{
-{Type: "eos", Content: "<eos>", ID: 1, IDs: []int32{1, 2, 3}, AddToken: false},
-{Type: "bos", Content: "<bos>", ID: 0, AddToken: true},
+{Type: "eos", Content: "<eos>", ID: 1, IDs: []int32{1, 2, 3}, AddToken: false, AddTokenSet: true},
+{Type: "bos", Content: "<bos>", ID: 0, AddToken: true, AddTokenSet: true},
},
Pre: "default",
},
@ -423,3 +423,23 @@ func TestParseTokenizer(t *testing.T) {
})
}
}
func TestModelParametersKVOmitsMissingAddToken(t *testing.T) {
kv := ModelParameters{}.KV(&Tokenizer{
Vocabulary: &Vocabulary{Model: "gpt2"},
SpecialVocabulary: []*SpecialVocabulary{
{Type: "bos", Content: "<bos>", ID: 1},
{Type: "eos", Content: "<eos>", ID: 2, AddToken: false, AddTokenSet: true},
},
})
if _, ok := kv["tokenizer.ggml.add_bos_token"]; ok {
t.Errorf("tokenizer.ggml.add_bos_token should be omitted when add_bos_token is absent")
}
if got := kv["tokenizer.ggml.bos_token_id"]; got != uint32(1) {
t.Errorf("tokenizer.ggml.bos_token_id = %v, want 1", got)
}
if got, ok := kv["tokenizer.ggml.add_eos_token"]; !ok || got != false {
t.Errorf("tokenizer.ggml.add_eos_token = %v, %v; want explicit false", got, ok)
}
}
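
The behavior this test pins down is a three-state setting: absent, explicitly false, and explicitly true. A rough sketch of how a presence flag keeps "absent" distinct from "false" follows. The `specialToken` type and `tokenizerKV` function are hypothetical stand-ins, not the converter's actual `ModelParameters.KV` implementation; only the key names are taken from the test above.

```go
package main

import "fmt"

// specialToken is a hypothetical stand-in for the converter's SpecialVocabulary.
type specialToken struct {
	Type        string
	ID          int
	AddToken    bool
	AddTokenSet bool // true only when the config explicitly defined add_*_token
}

// tokenizerKV (hypothetical) always writes the token ID but writes the
// add_*_token key only when the setting was explicitly present.
func tokenizerKV(tokens []specialToken) map[string]any {
	kv := map[string]any{}
	for _, tok := range tokens {
		kv[fmt.Sprintf("tokenizer.ggml.%s_token_id", tok.Type)] = uint32(tok.ID)
		if tok.AddTokenSet {
			kv[fmt.Sprintf("tokenizer.ggml.add_%s_token", tok.Type)] = tok.AddToken
		}
	}
	return kv
}

func main() {
	kv := tokenizerKV([]specialToken{
		{Type: "bos", ID: 1},                                    // setting absent
		{Type: "eos", ID: 2, AddToken: false, AddTokenSet: true}, // explicit false
	})
	_, hasBOS := kv["tokenizer.ggml.add_bos_token"]
	eos, hasEOS := kv["tokenizer.ggml.add_eos_token"]
	fmt.Println(hasBOS, hasEOS, eos)
}
```

A plain `bool` cannot express the absent case, which is why the presence flag is carried alongside the value.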

discover/amd.go

@ -0,0 +1,487 @@
// AMD discovery needs a small amount of backend-specific handling beyond the
// generic llama-server device list. ROCm devices expose their real capability
// as gfx targets, and the shipped rocBLAS kernels define which of those
// targets are actually usable. On Linux, KFD topology and DRM sysfs attributes
// provide the integrated-vs-discrete signal needed for scheduler decisions. On
// Windows, older HIP driver installs can also leave ROCm libraries present but
// too old to support GPU inference. These helpers keep that extra validation
// and warning logic in one place.
package discover
import (
"bufio"
"log/slog"
"os"
"os/exec"
"path/filepath"
"regexp"
"runtime"
"sort"
"strconv"
"strings"
"github.com/ollama/ollama/ml"
)
// gfxTargetRegex matches ROCm stderr lines like:
//
// Device 0: AMD Radeon RX 6700 XT, gfx1031 (0x1031), VMM: no, Wave Size: 32, VRAM: 12272 MiB
// Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
var gfxTargetRegex = regexp.MustCompile(
`Device\s+(\d+):.*,\s+(gfx[0-9a-f]+)[\s:(]`,
)
var pciIDRegex = regexp.MustCompile(`^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$`)
func parseROCmGFXTargets(output string) map[int]string {
gfxByIndex := make(map[int]string)
scanner := bufio.NewScanner(strings.NewReader(output))
for scanner.Scan() {
if matches := gfxTargetRegex.FindStringSubmatch(scanner.Text()); matches != nil {
idx, _ := strconv.Atoi(matches[1])
gfxByIndex[idx] = matches[2]
}
}
return gfxByIndex
}
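// parseGFXTarget splits a target such as "gfx1031" into its compute major and
// minor values. The last two characters are the minor and the rest is the
// major, both parsed as hexadecimal, so "gfx1031" yields (0x10, 0x31). It
// returns (0, 0) when the target cannot be parsed.
func parseGFXTarget(gfx string) (int, int) {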
gfx, ok := strings.CutPrefix(gfx, "gfx")
if !ok || len(gfx) < 3 {
return 0, 0
}
major, err := strconv.ParseInt(gfx[:len(gfx)-2], 16, 32)
if err != nil {
return 0, 0
}
minor, err := strconv.ParseInt(gfx[len(gfx)-2:], 16, 32)
if err != nil {
return 0, 0
}
return int(major), int(minor)
}
// HSA_OVERRIDE_GFX_VERSION changes the effective HIP/rocBLAS target even
// though KFD/sysfs still reports the physical ASIC.
func hsaOverrideGFXTarget() string {
return rocmGFXTargetOverride(os.Getenv("HSA_OVERRIDE_GFX_VERSION"))
}
func rocmGFXTargetOverride(value string) string {
value = strings.TrimSpace(value)
if value == "" {
return ""
}
if strings.HasPrefix(value, "gfx") {
if major, minor := parseGFXTarget(value); major != 0 || minor != 0 {
return value
}
return ""
}
parts := strings.Split(value, ".")
if len(parts) != 3 {
return ""
}
var digits [3]uint64
for i, part := range parts {
digit, err := strconv.ParseUint(part, 10, 8)
if err != nil || digit > 0xf {
return ""
}
digits[i] = digit
}
return "gfx" +
strconv.FormatUint(digits[0], 10) +
strconv.FormatUint(digits[1], 16) +
strconv.FormatUint(digits[2], 16)
}
func setROCmGFXTarget(device *ml.DeviceInfo, gfx string) {
if gfx == "" || device.Library != "ROCm" {
return
}
device.GFXTarget = gfx
device.ComputeMajor, device.ComputeMinor = parseGFXTarget(gfx)
}
// rocblasGFXTargets scans the rocblas library directory for supported gfx targets
// by looking for TensileLibrary_lazy_gfxNNNN.dat files.
func rocblasGFXTargets(libDirs []string) map[string]bool {
targets := make(map[string]bool)
for _, dir := range libDirs {
files, _ := filepath.Glob(filepath.Join(dir, "rocblas", "library", "TensileLibrary_lazy_gfx*.dat"))
for _, f := range files {
base := filepath.Base(f)
if t, ok := strings.CutPrefix(base, "TensileLibrary_lazy_"); ok {
if t, ok = strings.CutSuffix(t, ".dat"); ok {
targets[t] = true
}
}
}
}
return targets
}
type rocmLinuxSysfsDevice struct {
pciID string
gfxTarget string
integrated bool
known bool
}
func refineLinuxROCmDevices(devices []ml.DeviceInfo) []ml.DeviceInfo {
if runtime.GOOS != "linux" {
return devices
}
applyLinuxROCmRefinement(devices, "/sys")
return devices
}
func applyLinuxROCmRefinement(devices []ml.DeviceInfo, sysfsRoot string) bool {
var rocmIndexes []int
for i, device := range devices {
if device.Library == "ROCm" {
rocmIndexes = append(rocmIndexes, i)
}
}
if len(rocmIndexes) == 0 {
return false
}
sysfsDevices, err := readROCmLinuxSysfsDevices(sysfsRoot)
if err != nil {
slog.Debug("linux rocm device refinement unavailable", "error", err)
return false
}
if len(sysfsDevices) != len(rocmIndexes) {
slog.Debug("linux rocm device refinement skipped: device count mismatch",
"llama_server_count", len(rocmIndexes), "kfd_count", len(sysfsDevices))
return false
}
byPCI := map[string]rocmLinuxSysfsDevice{}
byGFX := uniqueROCmSysfsDevicesByGFX(sysfsDevices)
for _, sysfsDevice := range sysfsDevices {
if sysfsDevice.pciID != "" {
byPCI[strings.ToLower(sysfsDevice.pciID)] = sysfsDevice
}
}
refined := 0
for i, rocmIndex := range rocmIndexes {
device := &devices[rocmIndex]
sysfsDevice, ok := matchROCmLinuxSysfsDevice(*device, i, sysfsDevices, byPCI, byGFX)
if !ok {
slog.Debug("linux rocm device refinement skipped: no stable match",
"device", device.Name, "pci_id", device.PCIID, "gfx", device.GFXTarget)
continue
}
applyROCmLinuxSysfsDevice(device, sysfsDevice)
refined++
}
if refined == 0 {
return false
}
slog.Debug("linux rocm device refinement applied", "devices", refined)
return true
}
func uniqueROCmSysfsDevicesByGFX(sysfsDevices []rocmLinuxSysfsDevice) map[string]rocmLinuxSysfsDevice {
byGFX := map[string]rocmLinuxSysfsDevice{}
duplicates := map[string]bool{}
for _, sysfsDevice := range sysfsDevices {
if sysfsDevice.gfxTarget == "" {
continue
}
if _, ok := byGFX[sysfsDevice.gfxTarget]; ok {
duplicates[sysfsDevice.gfxTarget] = true
continue
}
byGFX[sysfsDevice.gfxTarget] = sysfsDevice
}
for gfx := range duplicates {
delete(byGFX, gfx)
}
return byGFX
}
func matchROCmLinuxSysfsDevice(device ml.DeviceInfo, index int, sysfsDevices []rocmLinuxSysfsDevice, byPCI, byGFX map[string]rocmLinuxSysfsDevice) (rocmLinuxSysfsDevice, bool) {
// ROCm visibility envs can remap backend ordinals while sysfs stays in
// physical KFD order, so prefer stable identity before index fallback.
if device.PCIID != "" {
if sysfsDevice, ok := byPCI[strings.ToLower(device.PCIID)]; ok {
return sysfsDevice, true
}
}
if device.GFXTarget != "" {
if sysfsDevice, ok := byGFX[device.GFXTarget]; ok {
return sysfsDevice, true
}
}
if index >= len(sysfsDevices) {
return rocmLinuxSysfsDevice{}, false
}
sysfsDevice := sysfsDevices[index]
if sysfsDevice.gfxTarget != "" && device.GFXTarget != "" && sysfsDevice.gfxTarget != device.GFXTarget {
slog.Debug("linux rocm device refinement index mismatch",
"device", device.Name, "llama_server_gfx", device.GFXTarget, "kfd_gfx", sysfsDevice.gfxTarget)
return rocmLinuxSysfsDevice{}, false
}
return sysfsDevice, true
}
func applyROCmLinuxSysfsDevice(device *ml.DeviceInfo, sysfsDevice rocmLinuxSysfsDevice) {
if sysfsDevice.pciID != "" {
device.PCIID = sysfsDevice.pciID
}
if sysfsDevice.known {
device.Integrated = sysfsDevice.integrated
}
}
func readROCmLinuxSysfsDevices(sysfsRoot string) ([]rocmLinuxSysfsDevice, error) {
nodeRoot := filepath.Join(sysfsRoot, "class", "kfd", "kfd", "topology", "nodes")
entries, err := os.ReadDir(nodeRoot)
if err != nil {
return nil, err
}
sort.Slice(entries, func(i, j int) bool {
left, _ := strconv.Atoi(entries[i].Name())
right, _ := strconv.Atoi(entries[j].Name())
return left < right
})
var devices []rocmLinuxSysfsDevice
for _, entry := range entries {
if !entry.IsDir() {
continue
}
properties, err := readKFDNodeProperties(filepath.Join(nodeRoot, entry.Name(), "properties"))
if err != nil || !properties.isGPU() {
continue
}
device, err := readROCmDRMDevice(sysfsRoot, properties.drmRenderMinor)
if err != nil {
slog.Debug("linux rocm sysfs device skipped", "node", entry.Name(), "error", err)
continue
}
device.gfxTarget = gfxTargetFromKFDVersion(properties.gfxTargetVersion)
devices = append(devices, device)
}
return devices, nil
}
type kfdNodeProperties struct {
vendorID uint64
deviceID uint64
drmRenderMinor int
gfxTargetVersion uint64
}
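// isGPU reports whether a KFD topology node describes a GPU: the vendor ID,
// device ID, and DRM render minor must all be non-zero.
func (p kfdNodeProperties) isGPU() bool {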
return p.vendorID != 0 && p.deviceID != 0 && p.drmRenderMinor != 0
}
func readKFDNodeProperties(path string) (kfdNodeProperties, error) {
file, err := os.Open(path)
if err != nil {
return kfdNodeProperties{}, err
}
defer file.Close()
values := make(map[string]string)
scanner := bufio.NewScanner(file)
for scanner.Scan() {
fields := strings.Fields(scanner.Text())
if len(fields) >= 2 {
values[fields[0]] = fields[1]
}
}
if err := scanner.Err(); err != nil {
return kfdNodeProperties{}, err
}
vendorID, _ := parseSysfsUint(values["vendor_id"])
deviceID, _ := parseSysfsUint(values["device_id"])
renderMinor, _ := parseSysfsUint(values["drm_render_minor"])
gfxVersion, _ := parseSysfsUint(values["gfx_target_version"])
return kfdNodeProperties{
vendorID: vendorID,
deviceID: deviceID,
drmRenderMinor: int(renderMinor),
gfxTargetVersion: gfxVersion,
}, nil
}
func readROCmDRMDevice(sysfsRoot string, renderMinor int) (rocmLinuxSysfsDevice, error) {
devicePath := filepath.Join(sysfsRoot, "class", "drm", "renderD"+strconv.Itoa(renderMinor), "device")
resolvedDevicePath, err := filepath.EvalSymlinks(devicePath)
if err != nil {
return rocmLinuxSysfsDevice{}, err
}
vendor, err := readSysfsString(filepath.Join(resolvedDevicePath, "vendor"))
if err != nil {
return rocmLinuxSysfsDevice{}, err
}
if !strings.EqualFold(vendor, "0x1002") {
return rocmLinuxSysfsDevice{}, nil
}
driver, err := readSysfsDriverName(filepath.Join(resolvedDevicePath, "driver"))
if err != nil {
return rocmLinuxSysfsDevice{}, err
}
if driver != "amdgpu" {
return rocmLinuxSysfsDevice{}, nil
}
device := rocmLinuxSysfsDevice{pciID: pciIDFromPath(resolvedDevicePath)}
if sysfsFileExists(filepath.Join(resolvedDevicePath, "mem_info_vram_vendor")) ||
sysfsFileExists(filepath.Join(resolvedDevicePath, "board_info")) {
device.known = true
return device, nil
}
vramTotal, ok := readROCmLinuxMemoryInfo(resolvedDevicePath, "mem_info_vram_total")
if !ok {
return device, nil
}
gttTotal, ok := readROCmLinuxMemoryInfo(resolvedDevicePath, "mem_info_gtt_total")
if !ok {
return device, nil
}
const (
maxIntegratedVRAM = 4 << 30
minSharedGTT = 8 << 30
)
if vramTotal > 0 && vramTotal <= maxIntegratedVRAM && gttTotal >= minSharedGTT && gttTotal >= 4*vramTotal {
device.integrated = true
device.known = true
}
return device, nil
}
func readROCmLinuxMemoryInfo(devicePath, name string) (uint64, bool) {
value, err := readSysfsUint(filepath.Join(devicePath, name))
return value, err == nil
}
func readSysfsString(path string) (string, error) {
data, err := os.ReadFile(path)
if err != nil {
return "", err
}
return strings.TrimSpace(string(data)), nil
}
func readSysfsDriverName(path string) (string, error) {
driver, readErr := readSysfsString(path)
if readErr == nil {
return driver, nil
}
driverPath, err := filepath.EvalSymlinks(path)
if err == nil {
return filepath.Base(driverPath), nil
}
return "", readErr
}
func readSysfsUint(path string) (uint64, error) {
value, err := readSysfsString(path)
if err != nil {
return 0, err
}
return parseSysfsUint(value)
}
func parseSysfsUint(value string) (uint64, error) {
return strconv.ParseUint(strings.TrimSpace(value), 0, 64)
}
func sysfsFileExists(path string) bool {
_, err := os.Stat(path)
return err == nil
}
func pciIDFromPath(path string) string {
base := filepath.Base(path)
if pciIDRegex.MatchString(base) {
return base
}
return ""
}
func gfxTargetFromKFDVersion(version uint64) string {
if version == 0 {
return ""
}
major := version / 10000
minor := (version / 100) % 100
stepping := version % 100
if minor > 0xf || stepping > 0xf {
return ""
}
return "gfx" + strconv.FormatUint(major, 10) + strconv.FormatUint(minor, 16) + strconv.FormatUint(stepping, 16)
}
// filterUnsupportedROCmDevices removes ROCm devices whose gfx target doesn't have
// matching rocblas kernels bundled.
func filterUnsupportedROCmDevices(devices []ml.DeviceInfo, libDirs []string) []ml.DeviceInfo {
supported := rocblasGFXTargets(libDirs)
if len(supported) == 0 {
return devices
}
override := hsaOverrideGFXTarget()
var filtered []ml.DeviceInfo
for _, dev := range devices {
if dev.Library != "ROCm" {
filtered = append(filtered, dev)
continue
}
setROCmGFXTarget(&dev, override)
gfx := dev.GFXTarget
if gfx == "" {
filtered = append(filtered, dev)
continue
}
if supported[gfx] {
filtered = append(filtered, dev)
} else {
slog.Warn("dropping ROCm device — no rocblas support for gfx target",
"device", dev.Name, "gfx_target", gfx, "supported", supported,
"hint", "set HSA_OVERRIDE_GFX_VERSION to map to a supported target")
}
}
return filtered
}
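// detectOldAMDDriverWindows logs a warning on Windows when amdhip64_6.dll can
// be found but amdhip64_7.dll cannot, which indicates a driver install that
// predates ROCm v7.
func detectOldAMDDriverWindows() {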
if runtime.GOOS != "windows" {
return
}
_, errV6 := exec.LookPath("amdhip64_6.dll")
_, errV7 := exec.LookPath("amdhip64_7.dll")
if errV6 == nil && errV7 != nil {
slog.Warn("AMD driver is too old. Update your AMD driver to enable GPU inference.")
}
}
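
Two encodings feed the gfx target names in this file: the dotted `HSA_OVERRIDE_GFX_VERSION` form and the decimal `gfx_target_version` from KFD. The sketch below shows how both reduce to the same `gfx<major><minor><stepping>` string, with the major in decimal and the minor and stepping as single hex digits. It is a simplified illustration with hypothetical function names (`gfxFromOverride`, `gfxFromKFDVersion`, `formatGFX`), and it omits the `gfx`-prefixed override form handled above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatGFX (hypothetical) builds a target name; minor and stepping must each
// fit in one hex digit.
func formatGFX(major, minor, stepping uint64) (string, bool) {
	if minor > 0xf || stepping > 0xf {
		return "", false
	}
	return "gfx" + strconv.FormatUint(major, 10) +
		strconv.FormatUint(minor, 16) + strconv.FormatUint(stepping, 16), true
}

// gfxFromOverride (hypothetical) converts "major.minor.stepping" into a target name.
func gfxFromOverride(value string) (string, bool) {
	parts := strings.Split(strings.TrimSpace(value), ".")
	if len(parts) != 3 {
		return "", false
	}
	var digits [3]uint64
	for i, part := range parts {
		n, err := strconv.ParseUint(part, 10, 8)
		if err != nil || n > 0xf {
			return "", false
		}
		digits[i] = n
	}
	return formatGFX(digits[0], digits[1], digits[2])
}

// gfxFromKFDVersion (hypothetical) decodes major*10000 + minor*100 + stepping.
func gfxFromKFDVersion(version uint64) (string, bool) {
	if version == 0 {
		return "", false
	}
	return formatGFX(version/10000, (version/100)%100, version%100)
}

func main() {
	for _, v := range []string{"10.3.0", "9.0.10", "11.0"} {
		gfx, ok := gfxFromOverride(v)
		fmt.Printf("override %q -> %q (%v)\n", v, gfx, ok)
	}
	for _, v := range []uint64{90012, 110003, 100601} {
		gfx, ok := gfxFromKFDVersion(v)
		fmt.Printf("kfd %d -> %q (%v)\n", v, gfx, ok)
	}
}
```

For example, the override `10.3.0` maps to `gfx1030`, and the KFD value `90012` (major 9, minor 0, stepping 12) maps to `gfx90c`.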

discover/amd_test.go

@ -0,0 +1,289 @@
package discover
import (
"os"
"path/filepath"
"runtime"
"strconv"
"testing"
"github.com/ollama/ollama/ml"
)
func TestApplyLinuxROCmRefinement(t *testing.T) {
if runtime.GOOS == "windows" {
t.Skip("fake Linux PCI sysfs paths use ':' which is not valid in Windows filenames")
}
tests := []struct {
name string
nodes []fakeROCmNode
devices []ml.DeviceInfo
applied bool
wantIntegrated []bool
wantPCIIDs []string
}{
{
name: "apu is integrated",
nodes: []fakeROCmNode{{
node: 1,
renderMinor: 128,
gfxVersion: "90012",
vramTotal: 2 << 30,
gttTotal: 32 << 30,
}},
devices: []ml.DeviceInfo{{
DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"},
Name: "ROCm0",
GFXTarget: "gfx90c",
}},
applied: true,
wantIntegrated: []bool{true},
},
{
name: "low vram dgpu is not integrated",
nodes: []fakeROCmNode{{
node: 1,
renderMinor: 128,
gfxVersion: "100601",
vramTotal: 4 << 30,
gttTotal: 32 << 30,
vramVendor: true,
boardInfo: true,
}},
devices: []ml.DeviceInfo{{
DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"},
Name: "ROCm0",
GFXTarget: "gfx1061",
}},
applied: true,
wantIntegrated: []bool{false},
},
{
name: "mixed system follows kfd order not drm order",
nodes: []fakeROCmNode{
{
node: 1,
renderMinor: 129,
gfxVersion: "110000",
vramTotal: 48 << 30,
gttTotal: 64 << 30,
vramVendor: true,
boardInfo: true,
},
{
node: 2,
renderMinor: 128,
gfxVersion: "110003",
vramTotal: 512 << 20,
gttTotal: 32 << 30,
},
},
devices: []ml.DeviceInfo{
{DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"}, Name: "ROCm0", GFXTarget: "gfx1100"},
{DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"}, Name: "ROCm1", GFXTarget: "gfx1103"},
},
applied: true,
wantIntegrated: []bool{false, true},
},
{
name: "remapped visible order matches existing pci identity",
nodes: []fakeROCmNode{
{
node: 1,
renderMinor: 128,
pciID: "0000:e3:00.0",
gfxVersion: "110000",
vramTotal: 48 << 30,
gttTotal: 64 << 30,
vramVendor: true,
boardInfo: true,
},
{
node: 2,
renderMinor: 129,
pciID: "0000:c3:00.0",
gfxVersion: "120000",
vramTotal: 2 << 30,
gttTotal: 32 << 30,
},
},
devices: []ml.DeviceInfo{
{DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"}, Name: "ROCm0", GFXTarget: "gfx1200", PCIID: "0000:c3:00.0"},
{DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"}, Name: "ROCm1", GFXTarget: "gfx1100", PCIID: "0000:e3:00.0"},
},
applied: true,
wantIntegrated: []bool{true, false},
wantPCIIDs: []string{"0000:c3:00.0", "0000:e3:00.0"},
},
{
name: "remapped visible order matches unique gfx when pci is absent",
nodes: []fakeROCmNode{
{
node: 1,
renderMinor: 128,
pciID: "0000:e3:00.0",
gfxVersion: "110000",
vramTotal: 48 << 30,
gttTotal: 64 << 30,
vramVendor: true,
boardInfo: true,
},
{
node: 2,
renderMinor: 129,
pciID: "0000:c3:00.0",
gfxVersion: "120000",
vramTotal: 2 << 30,
gttTotal: 32 << 30,
},
},
devices: []ml.DeviceInfo{
{DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"}, Name: "ROCm0", GFXTarget: "gfx1200"},
{DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"}, Name: "ROCm1", GFXTarget: "gfx1100"},
},
applied: true,
wantIntegrated: []bool{true, false},
wantPCIIDs: []string{"0000:c3:00.0", "0000:e3:00.0"},
},
{
name: "missing kfd data leaves devices unchanged",
devices: []ml.DeviceInfo{{
DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"},
Name: "ROCm0",
Integrated: true,
}},
wantIntegrated: []bool{true},
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
sysfsRoot := t.TempDir()
for _, node := range tt.nodes {
writeFakeROCmNode(t, sysfsRoot, node)
}
devices := append([]ml.DeviceInfo(nil), tt.devices...)
applied := applyLinuxROCmRefinement(devices, sysfsRoot)
if applied != tt.applied {
t.Fatalf("applied = %v, want %v", applied, tt.applied)
}
for i, want := range tt.wantIntegrated {
if devices[i].Integrated != want {
t.Fatalf("device %d integrated = %v, want %v", i, devices[i].Integrated, want)
}
}
for i, want := range tt.wantPCIIDs {
if devices[i].PCIID != want {
t.Fatalf("device %d PCIID = %q, want %q", i, devices[i].PCIID, want)
}
}
})
}
}
func TestSameRefreshDeviceMatchesROCmByPCI(t *testing.T) {
updated := ml.DeviceInfo{
DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"},
PCIID: "0000:c3:00.0",
}
existing := ml.DeviceInfo{
DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"},
PCIID: "0000:C3:00.0",
}
if !sameRefreshDevice(updated, existing) {
t.Fatal("sameRefreshDevice did not match remapped ROCm device by PCI ID")
}
}
func TestFilterUnsupportedROCmDevicesRespectsHSAOverride(t *testing.T) {
t.Setenv("HSA_OVERRIDE_GFX_VERSION", "10.3.0")
libDir := t.TempDir()
rocblasDir := filepath.Join(libDir, "rocblas", "library")
if err := os.MkdirAll(rocblasDir, 0o755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(rocblasDir, "TensileLibrary_lazy_gfx1030.dat"), nil, 0o644); err != nil {
t.Fatal(err)
}
devices := filterUnsupportedROCmDevices([]ml.DeviceInfo{{
DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"},
Name: "ROCm0",
GFXTarget: "gfx1031",
ComputeMajor: 0x10,
ComputeMinor: 0x31,
}}, []string{libDir})
if len(devices) != 1 {
t.Fatalf("got %d devices, want 1", len(devices))
}
if got := devices[0].GFXTarget; got != "gfx1030" {
t.Fatalf("GFXTarget = %q, want gfx1030", got)
}
if got := devices[0].Compute(); got != "gfx1030" {
t.Fatalf("Compute() = %q, want gfx1030", got)
}
}
type fakeROCmNode struct {
node int
renderMinor int
pciID string
gfxVersion string
vramTotal uint64
gttTotal uint64
vramVendor bool
boardInfo bool
}
func writeFakeROCmNode(t *testing.T, sysfsRoot string, node fakeROCmNode) {
t.Helper()
nodeDir := filepath.Join(sysfsRoot, "class", "kfd", "kfd", "topology", "nodes", strconv.Itoa(node.node))
if err := os.MkdirAll(nodeDir, 0o755); err != nil {
t.Fatal(err)
}
properties := "vendor_id 4098\n" +
"device_id 1234\n" +
"drm_render_minor " + strconv.Itoa(node.renderMinor) + "\n" +
"gfx_target_version " + node.gfxVersion + "\n"
if err := os.WriteFile(filepath.Join(nodeDir, "properties"), []byte(properties), 0o644); err != nil {
t.Fatal(err)
}
deviceDir := filepath.Join(sysfsRoot, "class", "drm", "renderD"+strconv.Itoa(node.renderMinor), "device")
if node.pciID != "" {
targetDir := filepath.Join(sysfsRoot, "devices", node.pciID)
if err := os.MkdirAll(targetDir, 0o755); err != nil {
t.Fatal(err)
}
if err := os.MkdirAll(filepath.Dir(deviceDir), 0o755); err != nil {
t.Fatal(err)
}
if err := os.Symlink(targetDir, deviceDir); err != nil {
t.Skipf("symlink unavailable for fake sysfs PCI path: %v", err)
}
deviceDir = targetDir
} else if err := os.MkdirAll(deviceDir, 0o755); err != nil {
t.Fatal(err)
}
writeFakeSysfsFile(t, deviceDir, "vendor", "0x1002\n")
writeFakeSysfsFile(t, deviceDir, "driver", "amdgpu\n")
writeFakeSysfsFile(t, deviceDir, "mem_info_vram_total", strconv.FormatUint(node.vramTotal, 10)+"\n")
writeFakeSysfsFile(t, deviceDir, "mem_info_gtt_total", strconv.FormatUint(node.gttTotal, 10)+"\n")
if node.vramVendor {
writeFakeSysfsFile(t, deviceDir, "mem_info_vram_vendor", "samsung\n")
}
if node.boardInfo {
writeFakeSysfsFile(t, deviceDir, "board_info", "type : cem\n")
}
}
func writeFakeSysfsFile(t *testing.T, dir, name, content string) {
t.Helper()
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
t.Fatal(err)
}
}


@ -4,13 +4,8 @@ import (
"bufio"
"errors"
"fmt"
"io"
"log/slog"
"os"
"path/filepath"
"reflect"
"regexp"
"sort"
"strconv"
"strings"
@ -92,143 +87,6 @@ func getUint64ValueFromFile(path string) (uint64, error) {
return 0, errors.New("empty file content")
}
const CpuInfoFilename = "/proc/cpuinfo"
type linuxCpuInfo struct {
ID string `cpuinfo:"processor"`
VendorID string `cpuinfo:"vendor_id"`
ModelName string `cpuinfo:"model name"`
PhysicalID string `cpuinfo:"physical id"`
Siblings string `cpuinfo:"siblings"`
CoreID string `cpuinfo:"core id"`
}
func GetCPUDetails() []CPU {
file, err := os.Open(CpuInfoFilename)
if err != nil {
slog.Warn("failed to get CPU details", "error", err)
return nil
}
defer file.Close()
cpus := linuxCPUDetails(file)
return overwriteThreadCountByLinuxCgroups(cpus)
}
func overwriteThreadCountByLinuxCgroups(cpus []CPU) []CPU {
file, err := os.Open("/sys/fs/cgroup/cpu.max")
if err != nil {
return cpus
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
if sl := strings.Split(line, " "); len(sl) == 2 {
allowdUs, err := strconv.ParseInt(sl[0], 10, 64)
if err != nil {
slog.Warn("failed to parse CPU allowed micro secs", "error", err)
return cpus
}
unitUs, err := strconv.ParseInt(sl[1], 10, 64)
if err != nil {
slog.Warn("failed to parse CPU unit micro secs", "error", err)
return cpus
}
threads := int(max(allowdUs/unitUs, 1))
cpu := cpus[0]
cpu.CoreCount = threads
cpu.ThreadCount = threads
return []CPU{cpu}
}
}
return cpus
}
func linuxCPUDetails(file io.Reader) []CPU {
reColumns := regexp.MustCompile("\t+: ")
scanner := bufio.NewScanner(file)
cpuInfos := []linuxCpuInfo{}
cpu := &linuxCpuInfo{}
for scanner.Scan() {
line := scanner.Text()
if sl := reColumns.Split(line, 2); len(sl) > 1 {
t := reflect.TypeOf(cpu).Elem()
s := reflect.ValueOf(cpu).Elem()
for i := range t.NumField() {
field := t.Field(i)
tag := field.Tag.Get("cpuinfo")
if tag == sl[0] {
s.FieldByName(field.Name).SetString(sl[1])
break
}
}
} else if strings.TrimSpace(line) == "" && cpu.ID != "" {
cpuInfos = append(cpuInfos, *cpu)
cpu = &linuxCpuInfo{}
}
}
if cpu.ID != "" {
cpuInfos = append(cpuInfos, *cpu)
}
// Process the sockets/cores/threads
socketByID := map[string]*CPU{}
coreBySocket := map[string]map[string]struct{}{}
threadsByCoreBySocket := map[string]map[string]int{}
for _, c := range cpuInfos {
if _, found := socketByID[c.PhysicalID]; !found {
socketByID[c.PhysicalID] = &CPU{
ID: c.PhysicalID,
VendorID: c.VendorID,
ModelName: c.ModelName,
}
coreBySocket[c.PhysicalID] = map[string]struct{}{}
threadsByCoreBySocket[c.PhysicalID] = map[string]int{}
}
if c.CoreID != "" {
coreBySocket[c.PhysicalID][c.PhysicalID+":"+c.CoreID] = struct{}{}
threadsByCoreBySocket[c.PhysicalID][c.PhysicalID+":"+c.CoreID]++
} else {
coreBySocket[c.PhysicalID][c.PhysicalID+":"+c.ID] = struct{}{}
threadsByCoreBySocket[c.PhysicalID][c.PhysicalID+":"+c.ID]++
}
}
// Tally up the values from the tracking maps
for id, s := range socketByID {
s.CoreCount = len(coreBySocket[id])
s.ThreadCount = 0
// This only works if HT is enabled, consider a more reliable model, maybe cache size comparisons?
efficiencyCoreCount := 0
for _, threads := range threadsByCoreBySocket[id] {
s.ThreadCount += threads
if threads == 1 {
efficiencyCoreCount++
}
}
if efficiencyCoreCount == s.CoreCount {
// 1:1 mapping means they're not actually efficiency cores, but regular cores
s.EfficiencyCoreCount = 0
} else {
s.EfficiencyCoreCount = efficiencyCoreCount
}
}
keys := make([]string, 0, len(socketByID))
result := make([]CPU, 0, len(socketByID))
for k := range socketByID {
keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
result = append(result, *socketByID[k])
}
return result
}
func IsNUMA() bool {
ids := map[string]any{}
packageIds, _ := filepath.Glob("/sys/devices/system/cpu/cpu*/topology/physical_package_id")

File diff suppressed because it is too large


@ -2,11 +2,8 @@ package discover
import (
"fmt"
"log/slog"
"syscall"
"unsafe"
"github.com/ollama/ollama/logutil"
)
type MEMORYSTATUSEX struct {
@ -22,10 +19,9 @@ type MEMORYSTATUSEX struct {
}
var (
-k32 = syscall.NewLazyDLL("kernel32.dll")
-globalMemoryStatusExProc = k32.NewProc("GlobalMemoryStatusEx")
-sizeofMemoryStatusEx = uint32(unsafe.Sizeof(MEMORYSTATUSEX{}))
-GetLogicalProcessorInformationEx = k32.NewProc("GetLogicalProcessorInformationEx")
+k32 = syscall.NewLazyDLL("kernel32.dll")
+globalMemoryStatusExProc = k32.NewProc("GlobalMemoryStatusEx")
+sizeofMemoryStatusEx = uint32(unsafe.Sizeof(MEMORYSTATUSEX{}))
)
func GetCPUMem() (memInfo, error) {
@ -37,184 +33,6 @@ func GetCPUMem() (memInfo, error) {
return memInfo{TotalMemory: memStatus.TotalPhys, FreeMemory: memStatus.AvailPhys, FreeSwap: memStatus.AvailPageFile}, nil
}
type LOGICAL_PROCESSOR_RELATIONSHIP uint32
const (
RelationProcessorCore LOGICAL_PROCESSOR_RELATIONSHIP = iota
RelationNumaNode
RelationCache
RelationProcessorPackage
RelationGroup
RelationProcessorDie
RelationNumaNodeEx
RelationProcessorModule
)
const RelationAll LOGICAL_PROCESSOR_RELATIONSHIP = 0xffff
type GROUP_AFFINITY struct {
Mask uintptr // KAFFINITY
Group uint16
Reserved [3]uint16
}
type PROCESSOR_RELATIONSHIP struct {
Flags byte
EfficiencyClass byte
Reserved [20]byte
GroupCount uint16
GroupMask [1]GROUP_AFFINITY // len GroupCount
}
// Omitted unused structs: NUMA_NODE_RELATIONSHIP CACHE_RELATIONSHIP GROUP_RELATIONSHIP
type SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX struct {
Relationship LOGICAL_PROCESSOR_RELATIONSHIP
Size uint32
U [1]byte // Union len Size
// PROCESSOR_RELATIONSHIP
// NUMA_NODE_RELATIONSHIP
// CACHE_RELATIONSHIP
// GROUP_RELATIONSHIP
}
func (group *GROUP_AFFINITY) IsMember(target *GROUP_AFFINITY) bool {
if group == nil || target == nil {
return false
}
return group.Mask&target.Mask != 0
}
type winPackage struct {
groups []*GROUP_AFFINITY
coreCount int // performance cores = coreCount - efficiencyCoreCount
efficiencyCoreCount int
threadCount int
}
func (pkg *winPackage) IsMember(target *GROUP_AFFINITY) bool {
for _, group := range pkg.groups {
if group.IsMember(target) {
return true
}
}
return false
}
func getLogicalProcessorInformationEx() ([]byte, error) {
buf := make([]byte, 1)
bufSize := len(buf)
ret, _, err := GetLogicalProcessorInformationEx.Call(
uintptr(RelationAll),
uintptr(unsafe.Pointer(&buf[0])),
uintptr(unsafe.Pointer(&bufSize)),
)
if ret != 0 {
logutil.Trace("failed to retrieve CPU payload size", "ret", ret, "size", bufSize, "error", err)
return nil, fmt.Errorf("failed to determine size info ret:%d %w", ret, err)
}
buf = make([]byte, bufSize)
ret, _, err = GetLogicalProcessorInformationEx.Call(
uintptr(RelationAll),
uintptr(unsafe.Pointer(&buf[0])),
uintptr(unsafe.Pointer(&bufSize)),
)
if ret == 0 {
logutil.Trace("failed to retrieve CPU information", "ret", ret, "size", len(buf), "new_size", bufSize, "error", err)
return nil, fmt.Errorf("failed to gather processor information ret:%d buflen:%d %w", ret, bufSize, err)
}
return buf, nil
}
func processSystemLogicalProcessorInforationList(buf []byte) []*winPackage {
var slpi *SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
// Find all the packages first
packages := []*winPackage{}
for bufOffset := 0; bufOffset < len(buf); bufOffset += int(slpi.Size) {
slpi = (*SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(unsafe.Pointer(&buf[bufOffset]))
if slpi.Relationship != RelationProcessorPackage {
continue
}
pr := (*PROCESSOR_RELATIONSHIP)(unsafe.Pointer(&slpi.U[0]))
pkg := &winPackage{}
ga0 := unsafe.Pointer(&pr.GroupMask[0])
for j := range pr.GroupCount {
gm := (*GROUP_AFFINITY)(unsafe.Pointer(uintptr(ga0) + uintptr(j)*unsafe.Sizeof(GROUP_AFFINITY{})))
pkg.groups = append(pkg.groups, gm)
}
packages = append(packages, pkg)
}
slog.Info("packages", "count", len(packages))
// To identify efficiency cores we have to compare the relative values
// Larger values are "less efficient" (aka, more performant)
var maxEfficiencyClass byte
for bufOffset := 0; bufOffset < len(buf); bufOffset += int(slpi.Size) {
slpi = (*SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(unsafe.Pointer(&buf[bufOffset]))
if slpi.Relationship != RelationProcessorCore {
continue
}
pr := (*PROCESSOR_RELATIONSHIP)(unsafe.Pointer(&slpi.U[0]))
if pr.EfficiencyClass > maxEfficiencyClass {
maxEfficiencyClass = pr.EfficiencyClass
}
}
if maxEfficiencyClass > 0 {
slog.Info("efficiency cores detected", "maxEfficiencyClass", maxEfficiencyClass)
}
// then match up the Cores to the Packages, count up cores, threads and efficiency cores
for bufOffset := 0; bufOffset < len(buf); bufOffset += int(slpi.Size) {
slpi = (*SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(unsafe.Pointer(&buf[bufOffset]))
if slpi.Relationship != RelationProcessorCore {
continue
}
pr := (*PROCESSOR_RELATIONSHIP)(unsafe.Pointer(&slpi.U[0]))
ga0 := unsafe.Pointer(&pr.GroupMask[0])
for j := range pr.GroupCount {
gm := (*GROUP_AFFINITY)(unsafe.Pointer(uintptr(ga0) + uintptr(j)*unsafe.Sizeof(GROUP_AFFINITY{})))
for _, pkg := range packages {
if pkg.IsMember(gm) {
pkg.coreCount++
if pr.Flags == 0 {
pkg.threadCount++
} else {
pkg.threadCount += 2
}
if pr.EfficiencyClass < maxEfficiencyClass {
pkg.efficiencyCoreCount++
}
}
}
}
}
// Summarize the results
for i, pkg := range packages {
slog.Info("", "package", i, "cores", pkg.coreCount, "efficiency", pkg.efficiencyCoreCount, "threads", pkg.threadCount)
}
return packages
}
func GetCPUDetails() []CPU {
buf, err := getLogicalProcessorInformationEx()
if err != nil {
slog.Warn("failed to get CPU details", "error", err)
return nil
}
packages := processSystemLogicalProcessorInforationList(buf)
cpus := make([]CPU, len(packages))
for i, pkg := range packages {
cpus[i].CoreCount = pkg.coreCount
cpus[i].EfficiencyCoreCount = pkg.efficiencyCoreCount
cpus[i].ThreadCount = pkg.threadCount
}
return cpus
}
func IsNUMA() bool {
// numa support in ggml is linux only
return false

File diff suppressed because one or more lines are too long

discover/cuda_compat.go (new file, 54 lines)

@ -0,0 +1,54 @@
package discover
import (
"context"
"log/slog"
"github.com/ollama/ollama/ml"
)
func filterOldCUDADriver(_ context.Context, devices []ml.DeviceInfo) []ml.DeviceInfo {
oldCUDA := func(dev ml.DeviceInfo) bool {
return dev.Library == "CUDA" && dev.ComputeMajor > 0 && dev.ComputeMajor < 7
}
needsCheck := false
for _, dev := range devices {
if oldCUDA(dev) {
needsCheck = true
break
}
}
if !needsCheck {
return devices
}
driver := nvidiaDriverMajorFromDevices(devices)
if driver == 0 {
slog.Warn("could not verify NVIDIA driver compatibility for an older NVIDIA GPU")
return devices
}
if driver >= 570 {
return devices
}
filtered := devices[:0]
for _, dev := range devices {
if oldCUDA(dev) {
slog.Warn("NVIDIA driver too old",
"device", dev.Description, "compute", dev.Compute(), "driver", driver, "required_driver", "570 or newer")
continue
}
filtered = append(filtered, dev)
}
return filtered
}
func nvidiaDriverMajorFromDevices(devices []ml.DeviceInfo) int {
for _, dev := range devices {
if dev.Library == "CUDA" && dev.NVIDIADriverMajor > 0 {
return dev.NVIDIADriverMajor
}
}
return 0
}
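
The policy in filterOldCUDADriver can be illustrated in isolation. The following is a minimal sketch, assuming a simplified hypothetical `device` struct in place of ml.DeviceInfo and a driver version passed in directly; it copies into a fresh slice rather than reusing the input's backing array via `devices[:0]` as the diff does.

```go
package main

import "fmt"

// device is a hypothetical, simplified stand-in for ml.DeviceInfo.
type device struct {
	Name         string
	Library      string
	ComputeMajor int
}

// minDriver mirrors the 570 threshold used in the diff.
const minDriver = 570

// isOldCUDA reports whether d is a CUDA device with compute capability below 7.0.
func isOldCUDA(d device) bool {
	return d.Library == "CUDA" && d.ComputeMajor > 0 && d.ComputeMajor < 7
}

// filterOldCUDA drops pre-7.0 CUDA devices when the driver major version is
// known and older than minDriver. An unknown driver (0) keeps every device.
// Unlike the original, it allocates a new slice, so the caller's slice is
// never modified.
func filterOldCUDA(devices []device, driverMajor int) []device {
	if driverMajor == 0 || driverMajor >= minDriver {
		return devices
	}
	kept := make([]device, 0, len(devices))
	for _, d := range devices {
		if !isOldCUDA(d) {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	devices := []device{
		{Name: "GTX 1060", Library: "CUDA", ComputeMajor: 6},
		{Name: "RTX 4090", Library: "CUDA", ComputeMajor: 8},
	}
	for _, driver := range []int{0, 550, 570} {
		fmt.Println("driver", driver, "devices kept:", len(filterOldCUDA(devices, driver)))
	}
}
```

The in-place `devices[:0]` idiom avoids an allocation but overwrites the caller's slice contents; the copying variant here trades that allocation for leaving the input untouched.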


@ -17,31 +17,20 @@ import (
// Included to drive logic for reducing Ollama-allocated overhead on L4T/Jetson devices.
var CudaTegra string = os.Getenv("JETSON_JETPACK")
-// GetSystemInfo returns the last cached state of the GPUs on the system
+// GetSystemInfo returns host memory information used by scheduling.
func GetSystemInfo() ml.SystemInfo {
-logutil.Trace("performing CPU discovery")
+logutil.Trace("performing system memory discovery")
startDiscovery := time.Now()
defer func() {
-logutil.Trace("CPU discovery completed", "duration", time.Since(startDiscovery))
+logutil.Trace("system memory discovery completed", "duration", time.Since(startDiscovery))
}()
memInfo, err := GetCPUMem()
if err != nil {
slog.Warn("error looking up system memory", "error", err)
}
-var threadCount int
-cpus := GetCPUDetails()
-for _, c := range cpus {
-threadCount += c.CoreCount - c.EfficiencyCoreCount
-}
-if threadCount == 0 {
-// Fall back to Go's num CPU
-threadCount = runtime.NumCPU()
-}
return ml.SystemInfo{
-ThreadCount: threadCount,
TotalMemory: memInfo.TotalMemory,
FreeMemory: memInfo.FreeMemory,
FreeSwap: memInfo.FreeSwap,


@ -8,9 +8,6 @@ package discover
import "C"
import (
"log/slog"
"syscall"
"github.com/ollama/ollama/format"
)
@ -26,28 +23,6 @@ func GetCPUMem() (memInfo, error) {
}, nil
}
func GetCPUDetails() []CPU {
query := "hw.perflevel0.physicalcpu"
perfCores, err := syscall.SysctlUint32(query)
if err != nil {
slog.Warn("failed to discover physical CPU details", "query", query, "error", err)
}
query = "hw.perflevel1.physicalcpu"
efficiencyCores, _ := syscall.SysctlUint32(query) // On x86 Xeon this won't return data
// Determine thread count
query = "hw.logicalcpu"
logicalCores, _ := syscall.SysctlUint32(query)
return []CPU{
{
CoreCount: int(perfCores + efficiencyCores),
EfficiencyCoreCount: int(efficiencyCores),
ThreadCount: int(logicalCores),
},
}
}
func IsNUMA() bool {
// numa support in ggml is linux only
return false


@ -27,9 +27,17 @@ uint64_t getFreeMemory() {
return 0;
}
-uint64_t free_memory = (uint64_t)vm_stat.free_count * pagesize;
-free_memory += (uint64_t)vm_stat.speculative_count * pagesize;
-free_memory += (uint64_t)vm_stat.inactive_count * pagesize;
+uint64_t used = (uint64_t)vm_stat.active_count * pagesize
+    + (uint64_t)vm_stat.inactive_count * pagesize
+    + (uint64_t)vm_stat.speculative_count * pagesize
+    + (uint64_t)vm_stat.wire_count * pagesize
+    + (uint64_t)vm_stat.compressor_page_count * pagesize
+    - (uint64_t)vm_stat.purgeable_count * pagesize
+    - (uint64_t)vm_stat.external_page_count * pagesize;
-return free_memory;
+uint64_t total_memory = [NSProcessInfo processInfo].physicalMemory;
+if (used >= total_memory) {
+return 0;
+}
+return total_memory - used;
}
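
The revised getFreeMemory estimates free memory as total physical memory minus a "used" figure built from the page counters. A minimal sketch of that arithmetic in Go, assuming a hypothetical `vmPages` struct holding the counters the native code reads from vm_stat; the underflow guard on the subtraction is an addition of this sketch and is not in the original.

```go
package main

import "fmt"

// vmPages is a hypothetical container for the page counts used by getFreeMemory.
type vmPages struct {
	Active      uint64
	Inactive    uint64
	Speculative uint64
	Wire        uint64
	Compressor  uint64
	Purgeable   uint64
	External    uint64
}

// freeBytes returns totalBytes minus the estimated used bytes, clamped at zero.
func freeBytes(p vmPages, pageSize, totalBytes uint64) uint64 {
	counted := p.Active + p.Inactive + p.Speculative + p.Wire + p.Compressor
	reclaimable := p.Purgeable + p.External
	if reclaimable > counted {
		// Guard added in this sketch: avoid unsigned underflow.
		reclaimable = counted
	}
	used := (counted - reclaimable) * pageSize
	if used >= totalBytes {
		return 0
	}
	return totalBytes - used
}

func main() {
	p := vmPages{Active: 100, Inactive: 50, Speculative: 10, Wire: 40, Compressor: 20, Purgeable: 5, External: 15}
	const pageSize = 16384
	fmt.Println(freeBytes(p, pageSize, 1000*pageSize))
}
```

Here 220 counted pages minus 20 reclaimable pages leaves 200 used pages, so 800 of the 1000 pages are reported as free.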

discover/llama_server.go (new file, 513 lines)

@ -0,0 +1,513 @@
package discover
import (
"bufio"
"context"
"fmt"
"io"
"log/slog"
"os"
"os/exec"
"path/filepath"
"regexp"
"runtime"
"strconv"
"strings"
"time"
"github.com/ollama/ollama/llm"
"github.com/ollama/ollama/logutil"
"github.com/ollama/ollama/ml"
)
// llamaServerDiscoveryWaitDelay bounds how long Wait can hang after we stop
// the short-lived discovery subprocess.
const llamaServerDiscoveryWaitDelay = 5 * time.Second
// llamaServerDiscoverDevices spawns llama-server briefly (without a model) to
// discover GPU devices and their capabilities. The server prints device info
// and system_info (including compiled CUDA architectures) on startup before
// any model load, then we kill it.
//
// Captured from combined stderr output:
//
// Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
// Device 0: AMD Radeon RX 6700 XT, gfx1031 (0x1031), VMM: no, Wave Size: 32, VRAM: 12272 MiB
//
// Captured from stdout device list:
//
// CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
// Metal: Apple M3 Max (98304 MiB, 98303 MiB free)
func llamaServerDiscoverDevices(ctx context.Context, libDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
status := llm.NewStatusWriter(llamaServerDiscoveryOutput(ctx))
llamaServer, err := llm.FindLlamaServer()
if err != nil {
slog.Debug("llama-server not available for device discovery", "error", err)
return nil, status, err
}
start := time.Now()
defer func() {
slog.Debug("llama-server device discovery took", "duration", time.Since(start), "libDirs", libDirs)
}()
// Derive a port in the dynamic range (49152-65534) from the clock to make
// conflicts unlikely. The server may start listening before it emits
// system_info, but we stop it as soon as we have the GPU discovery output
// we need.
port := 49152 + time.Now().UnixNano()%16383
cmd := exec.CommandContext(ctx, llamaServer,
"--port", strconv.FormatInt(port, 10),
"--host", "127.0.0.1",
"--no-webui",
"--offline",
"--verbose",
)
cmd.WaitDelay = llamaServerDiscoveryWaitDelay
cmd.Env = os.Environ()
llm.SetupLlamaServerCommandEnv(cmd, llamaServer, libDirs, extraEnvs)
logutil.Trace("running llama-server for discovery", "cmd", cmd.Path, "libDirs", libDirs)
// Capture stderr (device info + system_info) via pipe so we can
// read it line-by-line and kill the server as soon as we have what we need.
stderrPipe, err := cmd.StderrPipe()
if err != nil {
slog.Debug("llama-server discovery: failed to create stderr pipe", "error", err)
return nil, status, err
}
// Forward stdout through the same status writer so trace logging captures
// all llama-server discovery output.
cmd.Stdout = status
if err := cmd.Start(); err != nil {
slog.Debug("llama-server discovery: failed to start", "error", err)
return nil, status, err
}
// Read stderr until we see system_info or timeout
var stderrLines []string
gotSystemInfo := false
done := make(chan struct{})
go func() {
scanner := bufio.NewScanner(stderrPipe)
for scanner.Scan() {
line := scanner.Text()
_, _ = status.Write([]byte(line + "\n"))
stderrLines = append(stderrLines, line)
if strings.Contains(line, "system_info:") {
gotSystemInfo = true
break
}
}
close(done)
}()
select {
case <-done:
case <-ctx.Done():
}
// Kill the server - we have what we need, or timed out.
stoppedForDiscovery := false
if cmd.Process != nil {
stoppedForDiscovery = cmd.Process.Kill() == nil
}
waitErr := cmd.Wait()
if waitErr != nil {
exit := llm.ExitStatusFromError(waitErr)
if stoppedForDiscovery {
slog.Debug("llama-server discovery: stopped subprocess after collecting GPU info", "exit", exit, "libDirs", libDirs)
} else {
slog.Debug("llama-server discovery: server startup exited", "error", waitErr, "exit", exit, "libDirs", libDirs)
}
}
<-done
if ctx.Err() != nil {
slog.Warn("llama-server discovery: timed out waiting for server startup", "error", ctx.Err(), "libDirs", libDirs, "lines_captured", len(stderrLines))
return nil, status, ctx.Err()
}
if !gotSystemInfo {
slog.Warn("llama-server discovery: system_info line not found in output - "+
"CUDA architecture filtering will be disabled. If GPU inference fails, "+
"this may indicate an incompatible llama-server version.",
"libDirs", libDirs, "lines_captured", len(stderrLines))
}
// Also run --list-devices to get the stdout device list with free memory
// (the brief server startup doesn't print that)
cmd2 := exec.CommandContext(ctx, llamaServer, "--list-devices", "--offline", "--verbose")
cmd2.WaitDelay = llamaServerDiscoveryWaitDelay
cmd2.Env = cmd.Env // reuse same environment
listOutput, err := cmd2.CombinedOutput()
_, _ = status.Write(listOutput)
if err != nil {
exit := llm.ExitStatusFromError(err)
slog.Debug("llama-server --list-devices failed", "error", err, "exit", exit)
if exit.Known() {
return nil, status, fmt.Errorf("llama-server --list-devices failed: %s", exit)
}
return nil, status, fmt.Errorf("llama-server --list-devices failed: %w", err)
}
nativeDevices, nativeStderr, nativeErr := discoverNativeDevices(ctx, llamaServer, libDirs, extraEnvs)
_, _ = status.Write([]byte(nativeStderr))
if nativeErr != nil {
logNativeProbeFailure(nativeErr, nativeStderr, libDirs)
}
combined := string(listOutput) + "\n" + strings.Join(stderrLines, "\n") + "\n" + nativeStderr
return parseLlamaServerDevicesWithNative(combined, libDirs, nativeDevices), status, nil
}
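
The core of llamaServerDiscoverDevices is a "read lines until a marker appears, then stop" loop over the subprocess's stderr. A minimal sketch of that loop over any io.Reader, leaving out process handling, the goroutine, and context cancellation; `readUntilMarker` and `readUntilMarkerString` are hypothetical names used only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readUntilMarker collects lines from r until one contains marker (that line
// is included) or the reader is exhausted. found reports whether the marker
// was seen.
func readUntilMarker(r io.Reader, marker string) (lines []string, found bool) {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		lines = append(lines, line)
		if strings.Contains(line, marker) {
			return lines, true
		}
	}
	return lines, false
}

// readUntilMarkerString is a convenience wrapper for in-memory input.
func readUntilMarkerString(s, marker string) ([]string, bool) {
	return readUntilMarker(strings.NewReader(s), marker)
}

func main() {
	out := "ggml_cuda_init: found 1 CUDA devices\n" +
		"system_info: n_threads = 4\n" +
		"srv: listening\n"
	lines, found := readUntilMarkerString(out, "system_info:")
	fmt.Println(len(lines), found)
}
```

In the diff the same loop runs in a goroutine so the caller can select on either completion or ctx.Done(), and the shared variables are only read after the done channel has been closed.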
func llamaServerDiscoveryOutput(ctx context.Context) io.Writer {
if slog.Default().Enabled(ctx, logutil.LevelTrace) {
return os.Stderr
}
return io.Discard
}
// deviceLineRegex matches stdout lines like:
//
// CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
// Metal: Apple M3 Max (98304 MiB, 98303 MiB free)
var deviceLineRegex = regexp.MustCompile(
`^\s+(.+?):\s+(.+?)\s+\((\d+)\s+MiB,\s+(\d+)\s+MiB\s+free\)`,
)
// cudaCCRegex matches CUDA stderr lines like:
//
// Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
var cudaCCRegex = regexp.MustCompile(
`Device\s+(\d+):.*compute capability\s+(\d+)\.(\d+)`,
)
// cudaArchsRegex matches the CUDA architecture list from system_info like:
//
// CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210
var cudaArchsRegex = regexp.MustCompile(
`CUDA\s*:\s*ARCHS\s*=\s*([\d,]+)`,
)
var (
cudaRuntimeSORegex = regexp.MustCompile(`^libcudart\.so\.(\d+)(?:\.(\d+))?`)
cudaRuntimeDLLRegex = regexp.MustCompile(`^cudart64_(\d{2})(\d)\.dll$`)
cudaRuntimeDirRegex = regexp.MustCompile(`^cuda_v(\d+)$`)
)
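
To make the device-line pattern concrete, here is a small sketch that applies the same expression as deviceLineRegex to one of the sample lines from the comments. The `parsedDevice` type and `parseDeviceLine` helper are hypothetical and exist only for this illustration. Note that the pattern requires leading whitespace, so an unindented line does not match.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same expression as deviceLineRegex in the diff.
var deviceLine = regexp.MustCompile(
	`^\s+(.+?):\s+(.+?)\s+\((\d+)\s+MiB,\s+(\d+)\s+MiB\s+free\)`,
)

// parsedDevice is a hypothetical result type for this sketch.
type parsedDevice struct {
	Name        string
	Description string
	TotalMiB    uint64
	FreeMiB     uint64
}

// parseDeviceLine extracts the name, description, and memory figures from a
// single --list-devices style line. ok is false when the line does not match.
func parseDeviceLine(line string) (dev parsedDevice, ok bool) {
	m := deviceLine.FindStringSubmatch(line)
	if m == nil {
		return parsedDevice{}, false
	}
	total, err := strconv.ParseUint(m[3], 10, 64)
	if err != nil {
		return parsedDevice{}, false
	}
	free, err := strconv.ParseUint(m[4], 10, 64)
	if err != nil {
		return parsedDevice{}, false
	}
	return parsedDevice{Name: m[1], Description: m[2], TotalMiB: total, FreeMiB: free}, true
}

func main() {
	dev, ok := parseDeviceLine("  CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)")
	fmt.Printf("%+v %v\n", dev, ok)
}
```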
// parseLlamaServerDevices parses the combined output of llama-server discovery.
// It extracts device info, ROCm gfx targets, CUDA compute capabilities, and
// CUDA compiled architecture lists.
func parseLlamaServerDevices(output string, libDirs []string) []ml.DeviceInfo {
return parseLlamaServerDevicesWithNative(output, libDirs, nil)
}
func parseLlamaServerDevicesWithNative(output string, libDirs []string, nativeDevices []nativeProbeDevice) []ml.DeviceInfo {
// Extract per-device metadata from stderr
gfxByIndex := parseROCmGFXTargets(output)
rocmGFXOverride := hsaOverrideGFXTarget()
integratedByIndex := parseVulkanUMA(output)
ccByIndex := make(map[int]cudaComputeCapability)
var cudaArchs []string // compiled architectures for this variant
nativeByIndex := nativeProbeByLibraryIndex(nativeDevices)
for idx, dev := range nativeByIndex["ROCm"] {
if rocmGFXOverride != "" {
gfxByIndex[idx] = rocmGFXOverride
} else if dev.GFXTarget != "" {
gfxByIndex[idx] = dev.GFXTarget
}
}
scanner := bufio.NewScanner(strings.NewReader(output))
for scanner.Scan() {
line := scanner.Text()
if matches := cudaCCRegex.FindStringSubmatch(line); matches != nil {
idx, _ := strconv.Atoi(matches[1])
major, _ := strconv.Atoi(matches[2])
minor, _ := strconv.Atoi(matches[3])
ccByIndex[idx] = cudaComputeCapability{
major: major,
minor: minor,
arch: fmt.Sprintf("%d%d0", major, minor),
}
}
if matches := cudaArchsRegex.FindStringSubmatch(line); matches != nil {
cudaArchs = strings.Split(matches[1], ",")
}
}
if cudaDevices := nativeByIndex["CUDA"]; len(cudaDevices) > 0 {
for idx, dev := range cudaDevices {
if dev.ComputeMajor <= 0 {
continue
}
ccByIndex[idx] = cudaComputeCapability{
major: dev.ComputeMajor,
minor: dev.ComputeMinor,
arch: fmt.Sprintf("%d%d0", dev.ComputeMajor, dev.ComputeMinor),
}
}
}
// Validate CUDA devices against compiled architectures
cudaArchSet := make(map[string]bool, len(cudaArchs))
for _, arch := range cudaArchs {
cudaArchSet[strings.TrimSpace(arch)] = true
}
cudaRuntimeMajor, cudaRuntimeMinor, hasCUDARuntime := cudaRuntimeVersion(libDirs)
// Parse stdout device lines
var devices []ml.DeviceInfo
deviceIndex := 0
scanner = bufio.NewScanner(strings.NewReader(output))
for scanner.Scan() {
matches := deviceLineRegex.FindStringSubmatch(scanner.Text())
if matches == nil {
continue
}
name := matches[1]
description := matches[2]
totalMiB, _ := strconv.ParseUint(matches[3], 10, 64)
freeMiB, _ := strconv.ParseUint(matches[4], 10, 64)
library := inferLibrary(name, description)
// Skip pseudo-devices like BLAS/Accelerate that report zero memory.
// These are CPU math libraries, not real GPUs — they shouldn't appear
// as inference compute devices or inflate the scheduler's GPU count.
if totalMiB == 0 {
slog.Debug("skipping pseudo-device with zero memory", "name", name, "description", description)
deviceIndex++
continue
}
// For CUDA devices, check if this variant supports the device's CC
if library == "CUDA" {
cc, ok := ccByIndex[deviceIndex]
if ok && len(cudaArchSet) > 0 {
if !cudaArchSet[cc.arch] {
slog.Info("skipping CUDA device — compute capability not in compiled architectures",
"device", description, "cc", cc.arch, "archs", cudaArchs,
"libDirs", libDirs)
deviceIndex++
continue
}
} else if !ok {
slog.Warn("llama-server discovery: could not determine compute capability for CUDA device — "+
"architecture filtering disabled for this device. If inference crashes, "+
"check that the CUDA backend supports this GPU.",
"device", description, "libDirs", libDirs)
} else if len(cudaArchSet) == 0 {
slog.Warn("llama-server discovery: could not determine compiled CUDA architectures — "+
"architecture filtering disabled. If inference crashes on older GPUs, "+
"check llama-server system_info output for ARCHS.",
"device", description, "libDirs", libDirs)
}
}
nativeDevice, hasNativeDevice := nativeByIndex[library][deviceIndex]
totalBytes := totalMiB * 1024 * 1024
if hasNativeDevice && !nativeProbeMatchesLlamaServerDevice(library, description, totalBytes, nativeDevice) {
hasNativeDevice = false
}
computeMajor, computeMinor := computeVersion(library, deviceIndex, gfxByIndex, ccByIndex)
dev := ml.DeviceInfo{
DeviceID: ml.DeviceID{
ID: strconv.Itoa(deviceIndex),
Library: library,
},
Name: name,
Description: description,
TotalMemory: totalBytes,
FreeMemory: freeMiB * 1024 * 1024,
ComputeMajor: computeMajor,
ComputeMinor: computeMinor,
LibraryPath: libDirs,
GFXTarget: gfxByIndex[deviceIndex],
Integrated: isIntegratedLlamaServerDevice(library, deviceIndex, integratedByIndex),
}
if hasNativeDevice {
if nativeDevice.DeviceID != "" {
dev.PCIID = nativeDevice.DeviceID
}
if nativeDevice.IntegratedKnown {
dev.Integrated = nativeDevice.Integrated
} else {
dev.Integrated = dev.Integrated || nativeDevice.Integrated
}
if dev.ComputeMajor == 0 && nativeDevice.ComputeMajor > 0 {
dev.ComputeMajor = nativeDevice.ComputeMajor
dev.ComputeMinor = nativeDevice.ComputeMinor
}
if nativeDevice.CUDADriverMajor > 0 {
dev.DriverMajor = nativeDevice.CUDADriverMajor
dev.DriverMinor = nativeDevice.CUDADriverMinor
}
if nativeDevice.NVIDIADriverMajor > 0 {
dev.NVIDIADriverMajor = nativeDevice.NVIDIADriverMajor
}
setROCmGFXTarget(&dev, nativeDevice.GFXTarget)
}
setROCmGFXTarget(&dev, rocmGFXOverride)
if library == "CUDA" && dev.DriverMajor == 0 && hasCUDARuntime {
dev.DriverMajor = cudaRuntimeMajor
dev.DriverMinor = cudaRuntimeMinor
}
devices = append(devices, dev)
deviceIndex++
}
return refineLlamaServerDevices(devices, libDirs)
}
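
The CUDA architecture check inside the parser reduces to two steps: turn a compute capability into an arch key ("6.1" becomes "610") and look it up in the ARCHS list from system_info. A minimal sketch under that reading, using the same expression as cudaArchsRegex; `compiledArchs`, `archKey`, and `supported` are hypothetical helper names, and an empty set is treated as "filtering disabled", matching the warning path in the diff.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same expression as cudaArchsRegex in the diff.
var archsPattern = regexp.MustCompile(`CUDA\s*:\s*ARCHS\s*=\s*([\d,]+)`)

// compiledArchs extracts the set of compiled architectures from a
// system_info line. It returns an empty set when no list is present.
func compiledArchs(systemInfo string) map[string]bool {
	set := map[string]bool{}
	m := archsPattern.FindStringSubmatch(systemInfo)
	if m == nil {
		return set
	}
	for _, arch := range strings.Split(m[1], ",") {
		if arch = strings.TrimSpace(arch); arch != "" {
			set[arch] = true
		}
	}
	return set
}

// archKey formats a compute capability the way the diff does: 6.1 -> "610".
func archKey(major, minor int) string {
	return fmt.Sprintf("%d%d0", major, minor)
}

// supported reports whether the device should be kept. With no known
// architectures, filtering is disabled and every device is kept.
func supported(archs map[string]bool, major, minor int) bool {
	if len(archs) == 0 {
		return true
	}
	return archs[archKey(major, minor)]
}

func main() {
	info := "system_info: n_threads = 4 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 |"
	archs := compiledArchs(info)
	fmt.Println("6.1 supported:", supported(archs, 6, 1))
	fmt.Println("8.9 supported:", supported(archs, 8, 9))
}
```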
func nativeProbeMatchesLlamaServerDevice(library, description string, totalBytes uint64, nativeDevice nativeProbeDevice) bool {
if library != "Vulkan" {
return true
}
nativeDescription := nativeDevice.Description
if nativeDescription == "" {
nativeDescription = nativeDevice.Name
}
if nativeDescription == "" || !ml.SimilarDeviceDescription(description, nativeDescription) {
slog.Debug("skipping Vulkan native metadata with mismatched device name",
"llama_server_name", description,
"native_name", nativeDescription)
return false
}
if nativeDevice.TotalMemory != 0 && !ml.SimilarDeviceMemory(totalBytes, nativeDevice.TotalMemory) {
slog.Debug("skipping Vulkan native metadata with mismatched memory",
"llama_server_name", description,
"llama_server_total", totalBytes,
"native_total", nativeDevice.TotalMemory)
return false
}
return true
}
func cudaRuntimeVersion(libDirs []string) (int, int, bool) {
bestMajor, bestMinor := -1, -1
update := func(major, minor int) {
if major > bestMajor || (major == bestMajor && minor > bestMinor) {
bestMajor, bestMinor = major, minor
}
}
for _, dir := range libDirs {
for _, entry := range readDirNames(dir) {
if matches := cudaRuntimeSORegex.FindStringSubmatch(entry); matches != nil {
major, _ := strconv.Atoi(matches[1])
minor := 0
if matches[2] != "" {
minor, _ = strconv.Atoi(matches[2])
}
update(major, minor)
}
if matches := cudaRuntimeDLLRegex.FindStringSubmatch(entry); matches != nil {
major, _ := strconv.Atoi(matches[1])
minor, _ := strconv.Atoi(matches[2])
update(major, minor)
}
}
if matches := cudaRuntimeDirRegex.FindStringSubmatch(filepath.Base(dir)); matches != nil {
major, _ := strconv.Atoi(matches[1])
update(major, 0)
}
}
if bestMajor < 0 {
return 0, 0, false
}
return bestMajor, bestMinor, true
}
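
cudaRuntimeVersion infers a CUDA runtime version from library file names and keeps the highest one. The sketch below shows that selection on an in-memory list of names instead of directory listings, reusing the two file-name expressions from the diff; `runtimeVersion` and `newestRuntime` are hypothetical names for this illustration, and the directory-name fallback (`cuda_v12`) is omitted.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same expressions as cudaRuntimeSORegex and cudaRuntimeDLLRegex in the diff.
var (
	soPattern  = regexp.MustCompile(`^libcudart\.so\.(\d+)(?:\.(\d+))?`)
	dllPattern = regexp.MustCompile(`^cudart64_(\d{2})(\d)\.dll$`)
)

// runtimeVersion parses a CUDA runtime version from a library file name.
func runtimeVersion(name string) (major, minor int, ok bool) {
	if m := soPattern.FindStringSubmatch(name); m != nil {
		major, _ = strconv.Atoi(m[1])
		if m[2] != "" {
			minor, _ = strconv.Atoi(m[2])
		}
		return major, minor, true
	}
	if m := dllPattern.FindStringSubmatch(name); m != nil {
		major, _ = strconv.Atoi(m[1])
		minor, _ = strconv.Atoi(m[2])
		return major, minor, true
	}
	return 0, 0, false
}

// newestRuntime returns the highest version found among names.
func newestRuntime(names []string) (major, minor int, found bool) {
	for _, name := range names {
		ma, mi, ok := runtimeVersion(name)
		if !ok {
			continue
		}
		if !found || ma > major || (ma == major && mi > minor) {
			major, minor, found = ma, mi, true
		}
	}
	return major, minor, found
}

func main() {
	names := []string{"libggml-cuda.so", "libcudart.so.12", "libcudart.so.12.8.90", "cudart64_110.dll"}
	major, minor, found := newestRuntime(names)
	fmt.Println(major, minor, found)
}
```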
func readDirNames(dir string) []string {
entries, err := os.ReadDir(dir)
if err != nil {
return nil
}
names := make([]string, 0, len(entries))
for _, entry := range entries {
names = append(names, entry.Name())
}
return names
}
type cudaComputeCapability struct {
major int
minor int
arch string
}
func computeVersion(library string, deviceIndex int, gfxByIndex map[int]string, ccByIndex map[int]cudaComputeCapability) (int, int) {
switch library {
case "CUDA":
if cc, ok := ccByIndex[deviceIndex]; ok {
return cc.major, cc.minor
}
case "ROCm":
return parseGFXTarget(gfxByIndex[deviceIndex])
}
return 0, 0
}
// inferLibrary determines the GPU library type from the llama-server device name and description.
func inferLibrary(name, description string) string {
combined := strings.ToLower(name + " " + description)
switch {
case strings.Contains(combined, "cuda"):
return "CUDA"
case strings.Contains(combined, "rocm") || strings.Contains(combined, "hip"):
return "ROCm"
case strings.Contains(combined, "metal") || strings.Contains(combined, "apple"):
return "Metal"
case strings.Contains(combined, "vulkan"):
return "Vulkan"
default:
return description
}
}
func isIntegratedLlamaServerDevice(library string, deviceIndex int, integratedByIndex map[int]bool) bool {
if library == "Vulkan" && integratedByIndex[deviceIndex] {
return true
}
// llama-server discovery does not expose a stable backend device-type field,
// so we only infer "integrated" here for cases where the contract is strong:
// explicit Vulkan UMA metadata, or the single Apple Silicon Metal device.
//
// Other backends stay unclassified unless discovery provides a stronger
// signal. That keeps scheduling conservative instead of guessing from
// device names or backend-specific heuristics.
return library == "Metal" && runtime.GOOS == "darwin" && runtime.GOARCH == "arm64"
}
func llamaServerBootstrapDevicesWithStatus(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
devices, status, err := llamaServerDiscoverDevices(ctx, ollamaLibDirs, extraEnvs)
if err != nil {
return devices, status, err
}
hasROCm := false
for _, d := range devices {
if d.Library == "ROCm" {
hasROCm = true
break
}
}
if !hasROCm {
return devices, status, nil
}
return filterUnsupportedROCmDevices(devices, ollamaLibDirs), status, nil
}
// Blank declaration that references the io package; it has no runtime effect.
var _ io.Reader


@ -0,0 +1,515 @@
package discover
import (
"io"
"log/slog"
"os"
"path/filepath"
"testing"
"github.com/ollama/ollama/logutil"
"github.com/ollama/ollama/ml"
)
func TestLlamaServerDiscovery(t *testing.T) {
originalProbe := probeLlamaServerVulkanDevices
probeLlamaServerVulkanDevices = func(_ []string) ([]vulkanPhysicalDevice, error) {
return nil, errWindowsVulkanProbeUnsupported
}
t.Cleanup(func() {
probeLlamaServerVulkanDevices = originalProbe
})
t.Run("output only trace", func(t *testing.T) {
original := slog.Default()
t.Cleanup(func() {
slog.SetDefault(original)
})
slog.SetDefault(logutil.NewLogger(io.Discard, slog.LevelDebug))
if got := llamaServerDiscoveryOutput(t.Context()); got != io.Discard {
t.Fatal("debug logging should discard raw llama-server discovery output")
}
slog.SetDefault(logutil.NewLogger(io.Discard, logutil.LevelTrace))
if got := llamaServerDiscoveryOutput(t.Context()); got == io.Discard {
t.Fatal("trace logging should emit raw llama-server discovery output")
}
})
t.Run("parse devices", func(t *testing.T) {
type wantDevice struct {
name string
library string
totalMiB uint64
compute string
driver string
gfxTarget string
checkIntegrated bool
integrated bool
}
tests := []struct {
name string
output string
libDirs []string
want []wantDevice
}{
{
name: "NVIDIA CUDA",
output: `load_backend: loaded CUDA backend from /lib/ollama/cuda_v12/libggml-cuda.so
Available devices:
NVIDIA GeForce RTX 4090: NVIDIA CUDA (24564 MiB, 23592 MiB free)
`,
libDirs: []string{"/lib/ollama", "/lib/ollama/cuda_v12"},
want: []wantDevice{{
name: "NVIDIA GeForce RTX 4090",
library: "CUDA",
totalMiB: 24564,
driver: "12.0",
}},
},
{
name: "Metal",
output: `Available devices:
Metal: Apple M3 Max (98304 MiB, 98303 MiB free)
`,
want: []wantDevice{{
name: "Metal",
library: "Metal",
totalMiB: 98304,
}},
},
{
name: "ROCm with gfx target",
output: ` Device 0: AMD Radeon RX 6700 XT, gfx1031 (0x1031), VMM: no, Wave Size: 32, VRAM: 12272 MiB
Available devices:
ROCm0: AMD Radeon RX 6700 XT (12272 MiB, 12248 MiB free)
`,
libDirs: []string{"/lib/ollama", "/lib/ollama/rocm_v7_2"},
want: []wantDevice{{
name: "ROCm0",
library: "ROCm",
totalMiB: 12272,
compute: "gfx1031",
gfxTarget: "gfx1031",
}},
},
{
name: "multi GPU",
output: `Available devices:
CUDA0: NVIDIA GeForce RTX 4090 (24564 MiB, 23592 MiB free)
CUDA1: NVIDIA GeForce RTX 3060 (12288 MiB, 11500 MiB free)
`,
libDirs: []string{"/lib/ollama", "/lib/ollama/cuda_v12"},
want: []wantDevice{
{name: "CUDA0", library: "CUDA", totalMiB: 24564},
{name: "CUDA1", library: "CUDA", totalMiB: 12288},
},
},
{
name: "Vulkan UMA",
output: `ggml_vulkan: 0 = Intel(R) Graphics (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Available devices:
Vulkan0: Intel(R) Graphics (16384 MiB, 12288 MiB free)
`,
libDirs: []string{"/lib/ollama", "/lib/ollama/vulkan"},
want: []wantDevice{{
name: "Vulkan0",
library: "Vulkan",
totalMiB: 16384,
checkIntegrated: true,
integrated: true,
}},
},
{
name: "Vulkan without UMA metadata",
output: `Available devices:
Vulkan0: AMD Radeon(TM) Graphics (32768 MiB, 31000 MiB free)
`,
libDirs: []string{"/lib/ollama", "/lib/ollama/vulkan"},
want: []wantDevice{{
name: "Vulkan0",
library: "Vulkan",
totalMiB: 32768,
checkIntegrated: true,
}},
},
{
name: "CUDA device filtered by compiled archs",
output: `ggml_cuda_init: found 1 CUDA devices (Total VRAM: 6063 MiB):
Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
load_backend: loaded CUDA backend from /lib/ollama/cuda_v13/libggml-cuda.so
system_info: n_threads = 4 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 |
Available devices:
CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
`,
libDirs: []string{"/lib/ollama", "/lib/ollama/cuda_v13"},
},
{
name: "CUDA device kept by compiled archs",
output: `ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16379 MiB):
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
system_info: n_threads = 16 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 |
Available devices:
CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
`,
want: []wantDevice{{
name: "CUDA0",
library: "CUDA",
totalMiB: 16379,
compute: "8.9",
}},
},
{
name: "CUDA without compiled archs fails open",
output: `ggml_cuda_init: found 1 CUDA devices (Total VRAM: 6063 MiB):
Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
Available devices:
CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
`,
want: []wantDevice{{
name: "CUDA0",
library: "CUDA",
totalMiB: 6063,
compute: "6.1",
}},
},
{
name: "CUDA without compute capability fails open",
output: `system_info: n_threads = 4 | CUDA : ARCHS = 750,800 |
Available devices:
CUDA0: Some Future GPU (8192 MiB, 8000 MiB free)
`,
want: []wantDevice{{
name: "CUDA0",
library: "CUDA",
totalMiB: 8192,
}},
},
{
name: "CUDA mixed arch support",
output: `ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
system_info: n_threads = 8 | CUDA : ARCHS = 750,800,860,890 |
Available devices:
CUDA0: NVIDIA GeForce GTX 1060 (6063 MiB, 5900 MiB free)
CUDA1: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
`,
want: []wantDevice{{
name: "CUDA1",
library: "CUDA",
totalMiB: 16379,
compute: "8.9",
}},
},
{
name: "ROCm gfx target with xnack suffix",
output: `ggml_cuda_init: found 2 ROCm devices (Total VRAM: 32736 MiB):
Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 16368 MiB
Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
Available devices:
ROCm0: AMD Radeon RX 6800 (16368 MiB, 16342 MiB free)
ROCm1: AMD Radeon Pro VII (16368 MiB, 16348 MiB free)
`,
want: []wantDevice{
{name: "ROCm0", library: "ROCm", totalMiB: 16368, compute: "gfx1030", gfxTarget: "gfx1030"},
{name: "ROCm1", library: "ROCm", totalMiB: 16368, compute: "gfx906", gfxTarget: "gfx906"},
},
},
{
name: "unknown library",
output: `Available devices:
Future0: Mystery Accelerator (8192 MiB, 8000 MiB free)
`,
want: []wantDevice{{
name: "Future0",
library: "Mystery Accelerator",
totalMiB: 8192,
}},
},
{
name: "no devices",
output: "Available devices:\n",
},
{
name: "empty output",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if tt.libDirs == nil {
tt.libDirs = []string{"/lib/ollama"}
}
devices := parseLlamaServerDevices(tt.output, tt.libDirs)
if len(devices) != len(tt.want) {
t.Fatalf("got %d devices, want %d", len(devices), len(tt.want))
}
for i, want := range tt.want {
got := devices[i]
if want.name != "" && got.Name != want.name {
t.Errorf("device %d name = %q, want %q", i, got.Name, want.name)
}
if want.library != "" && got.Library != want.library {
t.Errorf("device %d library = %q, want %q", i, got.Library, want.library)
}
if want.totalMiB > 0 && got.TotalMemory != want.totalMiB*1024*1024 {
t.Errorf("device %d total memory = %d bytes, want %d MiB", i, got.TotalMemory, want.totalMiB)
}
if want.compute != "" && got.Compute() != want.compute {
t.Errorf("device %d compute = %q, want %q", i, got.Compute(), want.compute)
}
if want.driver != "" && got.Driver() != want.driver {
t.Errorf("device %d driver = %q, want %q", i, got.Driver(), want.driver)
}
if want.gfxTarget != "" && got.GFXTarget != want.gfxTarget {
t.Errorf("device %d gfx target = %q, want %q", i, got.GFXTarget, want.gfxTarget)
}
if want.checkIntegrated && got.Integrated != want.integrated {
t.Errorf("device %d integrated = %v, want %v", i, got.Integrated, want.integrated)
}
}
})
}
})
t.Run("parse fixtures", func(t *testing.T) {
type wantDevice struct {
name string
library string
totalMiB uint64
compute string
gfxTarget string
integrated bool
}
tests := []struct {
name string
output string
libDirs []string
want []wantDevice
wantSkip string
}{
{
name: "cuda mixed archs filters unsupported device",
output: `ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
system_info: n_threads = 8 | CUDA : ARCHS = 750,800,860,890 |
Available devices:
CUDA0: NVIDIA GeForce GTX 1060 (6063 MiB, 5900 MiB free)
CUDA1: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
`,
want: []wantDevice{{
name: "CUDA1",
library: "CUDA",
totalMiB: 16379,
compute: "8.9",
}},
},
{
name: "rocm gfx targets preserve suffix-free compute",
output: `ggml_cuda_init: found 2 ROCm devices (Total VRAM: 32736 MiB):
Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 16368 MiB
Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
Available devices:
ROCm0: AMD Radeon RX 6800 (16368 MiB, 16342 MiB free)
ROCm1: AMD Radeon Pro VII (16368 MiB, 16348 MiB free)
`,
want: []wantDevice{
{name: "ROCm0", library: "ROCm", totalMiB: 16368, compute: "gfx1030", gfxTarget: "gfx1030"},
{name: "ROCm1", library: "ROCm", totalMiB: 16368, compute: "gfx906", gfxTarget: "gfx906"},
},
},
{
name: "vulkan uma marks integrated",
output: `ggml_vulkan: 0 = Intel(R) Graphics (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Available devices:
Vulkan0: Intel(R) Graphics (16384 MiB, 12288 MiB free)
`,
want: []wantDevice{{
name: "Vulkan0",
library: "Vulkan",
totalMiB: 16384,
integrated: true,
}},
},
{
name: "windows vulkan without uma stays unclassified",
output: `load_backend: loaded Vulkan backend from C:\ollama\lib\ollama\vulkan\ggml-vulkan.dll
Available devices:
Vulkan0: AMD Radeon(TM) Graphics (32768 MiB, 31000 MiB free)
Vulkan1: AMD Radeon RX 7900 XTX (24564 MiB, 23000 MiB free)
`,
want: []wantDevice{
{name: "Vulkan0", library: "Vulkan", totalMiB: 32768},
{name: "Vulkan1", library: "Vulkan", totalMiB: 24564},
},
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
libDirs := tt.libDirs
if libDirs == nil {
libDirs = []string{"/lib/ollama"}
}
got := parseLlamaServerDevices(tt.output, libDirs)
if len(got) != len(tt.want) {
t.Fatalf("got %d devices, want %d", len(got), len(tt.want))
}
for i, want := range tt.want {
if got[i].Name != want.name {
t.Fatalf("device %d name = %q, want %q", i, got[i].Name, want.name)
}
if got[i].Library != want.library {
t.Fatalf("device %d library = %q, want %q", i, got[i].Library, want.library)
}
if got[i].TotalMemory != want.totalMiB*1024*1024 {
t.Fatalf("device %d total memory = %d bytes, want %d MiB", i, got[i].TotalMemory, want.totalMiB)
}
if want.compute != "" && got[i].Compute() != want.compute {
t.Fatalf("device %d compute = %q, want %q", i, got[i].Compute(), want.compute)
}
if want.gfxTarget != "" && got[i].GFXTarget != want.gfxTarget {
t.Fatalf("device %d gfx target = %q, want %q", i, got[i].GFXTarget, want.gfxTarget)
}
if got[i].Integrated != want.integrated {
t.Fatalf("device %d integrated = %v, want %v", i, got[i].Integrated, want.integrated)
}
}
})
}
})
t.Run("skips mismatched Vulkan native metadata", func(t *testing.T) {
output := `Available devices:
Vulkan0: Intel(R) UHD Graphics 770 (32768 MiB, 31000 MiB free)
`
nativeDevices := []nativeProbeDevice{{
Library: "Vulkan",
Index: 0,
IndexMatchesBackend: true,
Description: "NVIDIA GeForce RTX 4060 Ti",
DeviceID: "0000:05:00.0",
IntegratedKnown: true,
TotalMemory: 16107 * 1024 * 1024,
}}
devices := parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/vulkan"}, nativeDevices)
if len(devices) != 1 {
t.Fatalf("got %d devices, want 1", len(devices))
}
if devices[0].PCIID != "" {
t.Fatalf("PCIID = %q, want empty", devices[0].PCIID)
}
if devices[0].Integrated {
t.Fatal("Integrated = true, want false")
}
})
t.Run("cuda runtime version", func(t *testing.T) {
dir := t.TempDir()
if err := os.WriteFile(filepath.Join(dir, "libcudart.so.12.8.90"), nil, 0o644); err != nil {
t.Fatal(err)
}
major, minor, ok := cudaRuntimeVersion([]string{dir})
if !ok || major != 12 || minor != 8 {
t.Fatalf("cudaRuntimeVersion = %d.%d, %v, want 12.8, true", major, minor, ok)
}
major, minor, ok = cudaRuntimeVersion([]string{filepath.Join(t.TempDir(), "cuda_v13")})
if !ok || major != 13 || minor != 0 {
t.Fatalf("cudaRuntimeVersion fallback = %d.%d, %v, want 13.0, true", major, minor, ok)
}
})
t.Run("refine windows vulkan devices", func(t *testing.T) {
makeDevices := func() []ml.DeviceInfo {
return []ml.DeviceInfo{
{DeviceID: ml.DeviceID{ID: "0", Library: "Vulkan"}, Description: "AMD Radeon(TM) Graphics"},
{DeviceID: ml.DeviceID{ID: "1", Library: "Vulkan"}, Description: "AMD Radeon RX 7900 XTX"},
{DeviceID: ml.DeviceID{ID: "0", Library: "CUDA"}, Description: "NVIDIA GeForce RTX 4090"},
}
}
tests := []struct {
name string
devices []ml.DeviceInfo
probed []vulkanPhysicalDevice
want []bool
applied bool
}{
{
name: "fills missing integrated bit",
probed: []vulkanPhysicalDevice{
{Name: "AMD Radeon(TM) Graphics", Integrated: true},
{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
},
want: []bool{true, false, false},
applied: true,
},
{
name: "matches names when order differs",
probed: []vulkanPhysicalDevice{
{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
{Name: "AMD Radeon(TM) Graphics", Integrated: true},
},
want: []bool{true, false, false},
applied: true,
},
{
name: "skips when names do not line up",
probed: []vulkanPhysicalDevice{
{Name: "Wrong GPU", Integrated: true},
{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
},
want: []bool{false, false, false},
},
{
name: "skips when counts do not line up",
probed: []vulkanPhysicalDevice{{Name: "AMD Radeon(TM) Graphics", Integrated: true}},
want: []bool{false, false, false},
},
{
name: "overwrites stale classification",
devices: []ml.DeviceInfo{
{DeviceID: ml.DeviceID{ID: "0", Library: "Vulkan"}, Description: "AMD Radeon(TM) Graphics", Integrated: true},
{DeviceID: ml.DeviceID{ID: "1", Library: "Vulkan"}, Description: "AMD Radeon RX 7900 XTX"},
},
probed: []vulkanPhysicalDevice{
{Name: "AMD Radeon(TM) Graphics", Integrated: false},
{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
},
want: []bool{false, false},
applied: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
devices := tt.devices
if devices == nil {
devices = makeDevices()
}
applied := applyWindowsVulkanRefinement(devices, tt.probed)
if applied != tt.applied {
t.Fatalf("applied = %v, want %v", applied, tt.applied)
}
got := devices
if len(got) != len(tt.want) {
t.Fatalf("got %d devices, want %d", len(got), len(tt.want))
}
for i, want := range tt.want {
if got[i].Integrated != want {
t.Fatalf("device %d integrated = %v, want %v", i, got[i].Integrated, want)
}
}
})
}
})
}
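
The fixtures in this test file all end with an `Available devices:` block whose entries follow the shape `NAME: DESCRIPTION (TOTAL MiB, FREE MiB free)`. As an illustration only, the following sketch splits one such line with a regular expression. The names `deviceLine`, `deviceLineRE`, and `parseDeviceLine` are hypothetical and are not taken from this change; the real `parseLlamaServerDevices` also derives the library, compute capability, and integrated flag, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// deviceLine is a hypothetical holder for one parsed entry.
type deviceLine struct {
	Name        string
	Description string
	TotalMiB    uint64
	FreeMiB     uint64
}

// deviceLineRE matches entries such as
//   CUDA0: NVIDIA GeForce RTX 4090 (24564 MiB, 23592 MiB free)
// The description group is greedy, so parentheses inside the description
// (for example "Intel(R) Graphics") stay part of it.
var deviceLineRE = regexp.MustCompile(`^\s*([^:]+):\s+(.+)\s+\((\d+) MiB, (\d+) MiB free\)\s*$`)

// parseDeviceLine returns the parsed entry and true, or false when the line
// is not a device entry (for example the "Available devices:" header).
func parseDeviceLine(line string) (deviceLine, bool) {
	m := deviceLineRE.FindStringSubmatch(line)
	if m == nil {
		return deviceLine{}, false
	}
	total, err := strconv.ParseUint(m[3], 10, 64)
	if err != nil {
		return deviceLine{}, false
	}
	free, err := strconv.ParseUint(m[4], 10, 64)
	if err != nil {
		return deviceLine{}, false
	}
	return deviceLine{Name: m[1], Description: m[2], TotalMiB: total, FreeMiB: free}, true
}

func main() {
	d, ok := parseDeviceLine("  CUDA0: NVIDIA GeForce RTX 4090 (24564 MiB, 23592 MiB free)")
	fmt.Printf("%+v %v\n", d, ok)
}
```

A line-oriented regular expression keeps the sketch short; it says nothing about how the backend log lines that precede the device list (compute capability, `uma`, gfx target) are correlated with each entry.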

discover/native_probe.go Normal file

@ -0,0 +1,263 @@
package discover
import (
"bytes"
"context"
"encoding/json"
"io"
"log/slog"
"os"
"os/exec"
"runtime"
"strings"
"time"
"github.com/ollama/ollama/llm"
"github.com/ollama/ollama/ml"
)
// Native GPU discovery runs in a short-lived Ollama subprocess so loading GGML
// and driver libraries cannot crash the main server process. The subprocess
// keeps stdout reserved for JSON and lets GGML's default logger write to
// stderr; the parent captures that stderr for trace/debug diagnostics.
const nativeProbeTimeout = 15 * time.Second
type nativeProbeDevice struct {
Library string `json:"library"`
Index int `json:"index"`
// IndexMatchesBackend means Index is in the same visible-device order that
// llama-server reports, so it is safe to correlate when PCI ID is missing.
IndexMatchesBackend bool `json:"index_matches_backend,omitempty"`
Name string `json:"name,omitempty"`
Description string `json:"description,omitempty"`
DeviceID string `json:"device_id,omitempty"`
Integrated bool `json:"integrated,omitempty"`
IntegratedKnown bool `json:"integrated_known"`
TotalMemory uint64 `json:"total_memory,omitempty"`
FreeMemory uint64 `json:"free_memory,omitempty"`
ComputeMajor int `json:"compute_major,omitempty"`
ComputeMinor int `json:"compute_minor,omitempty"`
CUDADriverMajor int `json:"cuda_driver_major,omitempty"`
CUDADriverMinor int `json:"cuda_driver_minor,omitempty"`
NVIDIADriverMajor int `json:"nvidia_driver_major,omitempty"`
GFXTarget string `json:"gfx_target,omitempty"`
}
type nativeProbeResult struct {
Devices []nativeProbeDevice `json:"devices"`
}
type ggmlBackendDevCaps struct {
Async uint8
HostBuffer uint8
BufferFromHostPtr uint8
Events uint8
}
type ggmlBackendDevProps struct {
Name uintptr
Description uintptr
MemoryFree uintptr
MemoryTotal uintptr
Type int32
_ [4]byte
DeviceID uintptr
Caps ggmlBackendDevCaps
_ [4]byte
}
func discoverNativeDevices(ctx context.Context, llamaServer string, libDirs []string, extraEnvs map[string]string) ([]nativeProbeDevice, string, error) {
if runtime.GOOS != "linux" && runtime.GOOS != "windows" {
return nil, "", nil
}
exe, err := os.Executable()
if err != nil {
return nil, "", err
}
ctx, cancel := context.WithTimeout(ctx, nativeProbeTimeout)
defer cancel()
args := []string{"gpu-discover"}
for _, dir := range libDirs {
args = append(args, "--lib-dir", dir)
}
cmd := exec.CommandContext(ctx, exe, args...)
cmd.WaitDelay = llamaServerDiscoveryWaitDelay
llm.SetupLlamaServerCommandEnv(cmd, llamaServer, libDirs, extraEnvs)
var stderr bytes.Buffer
cmd.Stderr = &stderr
stdout, err := cmd.Output()
if err != nil {
if ctx.Err() != nil {
return nil, stderr.String(), ctx.Err()
}
return nil, stderr.String(), err
}
var result nativeProbeResult
if err := json.Unmarshal(stdout, &result); err != nil {
return nil, stderr.String(), err
}
return result.Devices, stderr.String(), nil
}
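// RunNativeProbeCommand runs the native GPU probe in the current process and
// writes the result to out as JSON. An empty libDirs defaults to
// ml.LibOllamaPath.
func RunNativeProbeCommand(ctx context.Context, libDirs []string, out io.Writer) error {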
if len(libDirs) == 0 {
libDirs = []string{ml.LibOllamaPath}
}
devices, err := runNativeProbe(ctx, libDirs)
if err != nil {
return err
}
return json.NewEncoder(out).Encode(nativeProbeResult{Devices: devices})
}
func runNativeProbe(ctx context.Context, libDirs []string) ([]nativeProbeDevice, error) {
return runPlatformNativeProbe(ctx, libDirs)
}
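// mergeNativeProbeDevices folds supplement into base. A supplement entry that
// matches an existing device (see sameNativeProbeDevice) fills in that
// device's missing fields. An unmatched entry is appended only when its index
// is in backend order and that library/index slot is not already taken.
func mergeNativeProbeDevices(base, supplement []nativeProbeDevice) []nativeProbeDevice {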
if len(base) == 0 {
var out []nativeProbeDevice
for _, extra := range supplement {
if extra.IndexMatchesBackend {
out = append(out, extra)
}
}
return out
}
out := append([]nativeProbeDevice(nil), base...)
for _, extra := range supplement {
idx := -1
for i := range out {
if sameNativeProbeDevice(out[i], extra) {
idx = i
break
}
}
if idx < 0 {
if !extra.IndexMatchesBackend || nativeProbeLibraryIndexExists(out, extra) {
continue
}
out = append(out, extra)
continue
}
mergeNativeProbeDevice(&out[idx], extra)
}
return out
}
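// sameNativeProbeDevice reports whether a and b describe the same device.
// Libraries must match; device IDs decide when both are set; otherwise the
// indexes are compared, but only if both are in backend order.
func sameNativeProbeDevice(a, b nativeProbeDevice) bool {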
if !strings.EqualFold(a.Library, b.Library) {
return false
}
if a.DeviceID != "" && b.DeviceID != "" {
return strings.EqualFold(a.DeviceID, b.DeviceID)
}
if !a.IndexMatchesBackend || !b.IndexMatchesBackend {
return false
}
return a.Index == b.Index
}
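// mergeNativeProbeDevice copies fields from src into dst where dst is unset.
// The exception is the integrated flag: a known classification in src always
// replaces the one in dst.
func mergeNativeProbeDevice(dst *nativeProbeDevice, src nativeProbeDevice) {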
dst.IndexMatchesBackend = dst.IndexMatchesBackend || src.IndexMatchesBackend
if dst.Name == "" {
dst.Name = src.Name
}
if dst.Description == "" {
dst.Description = src.Description
}
if dst.DeviceID == "" {
dst.DeviceID = src.DeviceID
}
if src.IntegratedKnown {
dst.Integrated = src.Integrated
dst.IntegratedKnown = true
} else if !dst.IntegratedKnown && src.Integrated {
dst.Integrated = true
}
if dst.TotalMemory == 0 {
dst.TotalMemory = src.TotalMemory
}
if dst.FreeMemory == 0 {
dst.FreeMemory = src.FreeMemory
}
if dst.ComputeMajor == 0 && src.ComputeMajor != 0 {
dst.ComputeMajor = src.ComputeMajor
dst.ComputeMinor = src.ComputeMinor
}
if dst.CUDADriverMajor == 0 && src.CUDADriverMajor != 0 {
dst.CUDADriverMajor = src.CUDADriverMajor
dst.CUDADriverMinor = src.CUDADriverMinor
}
if dst.NVIDIADriverMajor == 0 && src.NVIDIADriverMajor != 0 {
dst.NVIDIADriverMajor = src.NVIDIADriverMajor
}
if dst.GFXTarget == "" {
dst.GFXTarget = src.GFXTarget
}
}
func nativeProbeLibraryIndexExists(devices []nativeProbeDevice, target nativeProbeDevice) bool {
if !target.IndexMatchesBackend {
return false
}
for _, dev := range devices {
if strings.EqualFold(dev.Library, target.Library) && dev.Index == target.Index {
return true
}
}
return false
}
func nativeProbeByLibraryIndex(devices []nativeProbeDevice) map[string]map[int]nativeProbeDevice {
out := map[string]map[int]nativeProbeDevice{}
for _, dev := range devices {
if !dev.IndexMatchesBackend {
continue
}
lib := normalizeNativeProbeLibrary(dev.Library)
if lib == "" {
continue
}
if _, ok := out[lib]; !ok {
out[lib] = map[int]nativeProbeDevice{}
}
out[lib][dev.Index] = dev
}
return out
}
func normalizeNativeProbeLibrary(library string) string {
switch strings.ToLower(library) {
case "cuda":
return "CUDA"
case "hip", "rocm":
return "ROCm"
case "vulkan":
return "Vulkan"
case "metal":
return "Metal"
default:
return library
}
}
func logNativeProbeFailure(err error, stderr string, libDirs []string) {
if err == nil {
return
}
if stderr != "" {
slog.Debug("native GPU discovery failed", "error", err, "stderr", stderr, "libDirs", libDirs)
return
}
slog.Debug("native GPU discovery failed", "error", err, "libDirs", libDirs)
}
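
The comment at the top of this file describes the probe as a short-lived subprocess that reserves stdout for JSON and leaves stderr for library logging. The sketch below illustrates that split in a self-contained program that re-executes itself. It is not the code from this change: `probeDevice`, `probeResult`, `decodeProbeOutput`, `runChild`, and the `probe-child` argument are hypothetical names, and the child only pretends to load libraries.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// probeDevice is a trimmed-down, hypothetical stand-in for nativeProbeDevice.
type probeDevice struct {
	Library     string `json:"library"`
	Index       int    `json:"index"`
	TotalMemory uint64 `json:"total_memory,omitempty"`
}

type probeResult struct {
	Devices []probeDevice `json:"devices"`
}

// decodeProbeOutput parses the JSON document the child wrote to stdout.
func decodeProbeOutput(stdout []byte) ([]probeDevice, error) {
	var result probeResult
	if err := json.Unmarshal(stdout, &result); err != nil {
		return nil, err
	}
	return result.Devices, nil
}

// runChild plays the role of the probe subprocess: diagnostics go to stderr
// and stdout carries nothing but JSON.
func runChild() error {
	fmt.Fprintln(os.Stderr, "child: pretending to load driver libraries")
	return json.NewEncoder(os.Stdout).Encode(probeResult{
		Devices: []probeDevice{{Library: "CUDA", Index: 0, TotalMemory: 24564 << 20}},
	})
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "probe-child" {
		if err := runChild(); err != nil {
			os.Exit(1)
		}
		return
	}

	exe, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot locate executable:", err)
		os.Exit(1)
	}
	// Bound the child's lifetime so a hung driver cannot stall the parent.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, exe, "probe-child")
	var stderr bytes.Buffer
	cmd.Stderr = &stderr // kept separate so it cannot corrupt the JSON
	stdout, err := cmd.Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err, stderr.String())
		os.Exit(1)
	}
	devices, err := decodeProbeOutput(stdout)
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad probe output:", err)
		os.Exit(1)
	}
	fmt.Printf("devices=%+v stderr=%q\n", devices, stderr.String())
}
```

If the child crashes while loading a driver, the parent only sees a failed `cmd.Output()` call, which is the isolation property the file comment describes.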


@ -0,0 +1,510 @@
//go:build linux
package discover
/*
#cgo linux LDFLAGS: -ldl
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
static void * ollama_dlopen(const char * path, int global) {
return dlopen(path, RTLD_NOW | (global ? RTLD_GLOBAL : RTLD_LOCAL));
}
static void * ollama_dlsym(void * handle, const char * name) {
return dlsym(handle, name);
}
static const char * ollama_dlerror(void) {
const char * err = dlerror();
return err ? err : "";
}
typedef void * (*ollama_ggml_backend_load_fn)(const char *);
typedef size_t (*ollama_ggml_backend_reg_dev_count_fn)(void *);
typedef void * (*ollama_ggml_backend_reg_dev_get_fn)(void *, size_t);
typedef const char * (*ollama_ggml_backend_reg_name_fn)(void *);
typedef void (*ollama_ggml_backend_dev_get_props_fn)(void *, void *);
static void * ollama_call_ggml_backend_load(void * fn, const char * path) {
return ((ollama_ggml_backend_load_fn) fn)(path);
}
static size_t ollama_call_ggml_backend_reg_dev_count(void * fn, void * reg) {
return ((ollama_ggml_backend_reg_dev_count_fn) fn)(reg);
}
static void * ollama_call_ggml_backend_reg_dev_get(void * fn, void * reg, size_t index) {
return ((ollama_ggml_backend_reg_dev_get_fn) fn)(reg, index);
}
static const char * ollama_call_ggml_backend_reg_name(void * fn, void * reg) {
return ((ollama_ggml_backend_reg_name_fn) fn)(reg);
}
static void ollama_call_ggml_backend_dev_get_props(void * fn, void * dev, void * props) {
((ollama_ggml_backend_dev_get_props_fn) fn)(dev, props);
}
static const char * ollama_cstr_from_uintptr(uintptr_t ptr) {
return (const char *) ptr;
}
typedef int (*ollama_cu_init_fn)(unsigned int);
typedef int (*ollama_cu_driver_get_version_fn)(int *);
typedef int (*ollama_cu_device_get_count_fn)(int *);
typedef int (*ollama_cu_device_get_fn)(int *, int);
typedef int (*ollama_cu_device_get_attribute_fn)(int *, int, int);
typedef int (*ollama_cu_device_get_name_fn)(char *, int, int);
typedef int (*ollama_cu_device_total_mem_fn)(size_t *, int);
typedef int (*ollama_cu_device_get_pci_bus_id_fn)(char *, int, int);
static int ollama_call_cu_init(void * fn) {
return ((ollama_cu_init_fn) fn)(0);
}
static int ollama_call_cu_driver_get_version(void * fn, int * version) {
return ((ollama_cu_driver_get_version_fn) fn)(version);
}
static int ollama_call_cu_device_get_count(void * fn, int * count) {
return ((ollama_cu_device_get_count_fn) fn)(count);
}
static int ollama_call_cu_device_get(void * fn, int * device, int index) {
return ((ollama_cu_device_get_fn) fn)(device, index);
}
static int ollama_call_cu_device_get_attribute(void * fn, int * value, int attr, int device) {
return ((ollama_cu_device_get_attribute_fn) fn)(value, attr, device);
}
static int ollama_call_cu_device_get_name(void * fn, char * name, int len, int device) {
return ((ollama_cu_device_get_name_fn) fn)(name, len, device);
}
static int ollama_call_cu_device_total_mem(void * fn, size_t * total, int device) {
return ((ollama_cu_device_total_mem_fn) fn)(total, device);
}
static int ollama_call_cu_device_get_pci_bus_id(void * fn, char * pci, int len, int device) {
return ((ollama_cu_device_get_pci_bus_id_fn) fn)(pci, len, device);
}
typedef int (*ollama_nvml_init_fn)(void);
typedef int (*ollama_nvml_shutdown_fn)(void);
typedef int (*ollama_nvml_system_get_driver_version_fn)(char *, unsigned int);
static int ollama_call_nvml_init(void * fn) {
return ((ollama_nvml_init_fn) fn)();
}
static int ollama_call_nvml_shutdown(void * fn) {
return ((ollama_nvml_shutdown_fn) fn)();
}
static int ollama_call_nvml_system_get_driver_version(void * fn, char * version, unsigned int len) {
return ((ollama_nvml_system_get_driver_version_fn) fn)(version, len);
}
*/
import "C"
import (
"context"
"errors"
"fmt"
"log/slog"
"os"
"strings"
"unsafe"
)
const (
cuSuccess = 0
cuDeviceAttributeComputeCapabilityMajor = 75
cuDeviceAttributeComputeCapabilityMinor = 76
cuDeviceAttributeIntegrated = 18
)
type dlHandle struct {
ptr unsafe.Pointer
}
func runPlatformNativeProbe(ctx context.Context, libDirs []string) ([]nativeProbeDevice, error) {
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
ggmlDevices, ggmlErr := probeGGMLDevicesLinux(libDirs)
var cudaDevices []nativeProbeDevice
var cudaErr error
if nativeProbeHasCUDA(libDirs) {
cudaDevices, cudaErr = probeCUDADriverLinux()
}
var rocmDevices []nativeProbeDevice
var rocmErr error
if nativeProbeHasROCm(libDirs) {
rocmDevices, rocmErr = probeROCmSysfsLinux()
}
devices := mergeNativeProbeDevices(mergeNativeProbeDevices(ggmlDevices, cudaDevices), rocmDevices)
if len(devices) > 0 {
return devices, nil
}
if ggmlErr != nil {
return nil, ggmlErr
}
if rocmErr != nil {
return nil, rocmErr
}
return nil, cudaErr
}
func probeGGMLDevicesLinux(libDirs []string) ([]nativeProbeDevice, error) {
if len(libDirs) == 0 {
return nil, errors.New("no library directories provided")
}
baseDir := libDirs[0]
if baseDir == "" {
return nil, errors.New("empty GGML library directory")
}
base, err := dlopen(ggmlLibraryFile(baseDir, "ggml-base"), true)
if err != nil {
return nil, err
}
ggml, err := dlopen(ggmlLibraryFile(baseDir, "ggml"), true)
if err != nil {
return nil, err
}
backendLoad, err := dlsym(ggml, "ggml_backend_load")
if err != nil {
return nil, err
}
regDevCount, err := dlsym(base, "ggml_backend_reg_dev_count")
if err != nil {
return nil, err
}
regDevGet, err := dlsym(base, "ggml_backend_reg_dev_get")
if err != nil {
return nil, err
}
regName, err := dlsym(base, "ggml_backend_reg_name")
if err != nil {
return nil, err
}
devGetProps, err := dlsym(base, "ggml_backend_dev_get_props")
if err != nil {
return nil, err
}
var devices []nativeProbeDevice
for _, backendPath := range nativeProbeBackendFiles(libDirs) {
reg := callGGMLBackendLoad(backendLoad, backendPath)
if reg == nil {
continue
}
library := ggmlProbeLibraryName(callGGMLRegName(regName, reg))
count := int(callGGMLRegDevCount(regDevCount, reg))
for i := range count {
dev := callGGMLRegDevGet(regDevGet, reg, i)
if dev == nil {
continue
}
props := callGGMLDeviceProps(devGetProps, dev)
if props.MemoryTotal == 0 {
continue
}
devices = append(devices, nativeProbeDevice{
Library: library,
Index: i,
IndexMatchesBackend: true,
Name: cString(props.Name),
Description: cString(props.Description),
DeviceID: cString(props.DeviceID),
Integrated: ggmlDeviceTypeIntegrated(props.Type),
IntegratedKnown: props.Type == ggmlBackendDeviceTypeGPU ||
props.Type == ggmlBackendDeviceTypeIGPU,
TotalMemory: uint64(props.MemoryTotal),
FreeMemory: uint64(props.MemoryFree),
})
slog.Debug("GGML GPU device type", "library", library, "index", i, "ggml_type", props.Type, "integrated", ggmlDeviceTypeIntegrated(props.Type))
}
}
return devices, nil
}
func probeCUDADriverLinux() ([]nativeProbeDevice, error) {
cuda, err := dlopenFirst([]string{"libcuda.so.1", "libcuda.so"}, false)
if err != nil {
return nil, err
}
cuInit, err := dlsym(cuda, "cuInit")
if err != nil {
return nil, err
}
cuDriverGetVersion, err := dlsym(cuda, "cuDriverGetVersion")
if err != nil {
return nil, err
}
cuDeviceGetCount, err := dlsym(cuda, "cuDeviceGetCount")
if err != nil {
return nil, err
}
cuDeviceGet, err := dlsym(cuda, "cuDeviceGet")
if err != nil {
return nil, err
}
cuDeviceGetAttribute, err := dlsym(cuda, "cuDeviceGetAttribute")
if err != nil {
return nil, err
}
cuDeviceGetName, err := dlsym(cuda, "cuDeviceGetName")
if err != nil {
return nil, err
}
cuDeviceTotalMem, err := dlsymAny(cuda, "cuDeviceTotalMem_v2", "cuDeviceTotalMem")
if err != nil {
return nil, err
}
cuDeviceGetPCIBusID, _ := dlsym(cuda, "cuDeviceGetPCIBusId")
if ret := C.ollama_call_cu_init(cuInit); ret != cuSuccess {
return nil, fmt.Errorf("cuInit failed: %d", int(ret))
}
var driverVersion C.int
driverMajor, driverMinor := 0, 0
if ret := C.ollama_call_cu_driver_get_version(cuDriverGetVersion, &driverVersion); ret == cuSuccess {
version := int(driverVersion)
// cuDriverGetVersion encodes the version as 1000*major + 10*minor.
driverMajor = version / 1000
driverMinor = (version - driverMajor*1000) / 10
}
nvidiaDriverMajor := 0
if driver, err := probeNVIDIADriverMajorLinux(); err == nil {
nvidiaDriverMajor = driver
}
var count C.int
if ret := C.ollama_call_cu_device_get_count(cuDeviceGetCount, &count); ret != cuSuccess {
return nil, fmt.Errorf("cuDeviceGetCount failed: %d", int(ret))
}
deviceCount := int(count)
devices := make([]nativeProbeDevice, 0, deviceCount)
for i := range deviceCount {
var device C.int
if ret := C.ollama_call_cu_device_get(cuDeviceGet, &device, C.int(i)); ret != cuSuccess {
continue
}
major := cudaDeviceAttribute(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMajor, device)
minor := cudaDeviceAttribute(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMinor, device)
integrated := cudaDeviceAttribute(cuDeviceGetAttribute, cuDeviceAttributeIntegrated, device) == 1
var name [128]C.char
_ = C.ollama_call_cu_device_get_name(cuDeviceGetName, &name[0], C.int(len(name)), device)
var total C.size_t
_ = C.ollama_call_cu_device_total_mem(cuDeviceTotalMem, &total, device)
pci := ""
if cuDeviceGetPCIBusID != nil {
var pciBuf [32]C.char
if ret := C.ollama_call_cu_device_get_pci_bus_id(cuDeviceGetPCIBusID, &pciBuf[0], C.int(len(pciBuf)), device); ret == cuSuccess {
pci = strings.ToLower(C.GoString(&pciBuf[0]))
}
}
devices = append(devices, nativeProbeDevice{
Library: "CUDA",
Index: i,
IndexMatchesBackend: true,
Description: C.GoString(&name[0]),
DeviceID: pci,
Integrated: integrated,
IntegratedKnown: true,
TotalMemory: uint64(total),
ComputeMajor: major,
ComputeMinor: minor,
CUDADriverMajor: driverMajor,
CUDADriverMinor: driverMinor,
NVIDIADriverMajor: nvidiaDriverMajor,
})
}
return devices, nil
}
func probeROCmSysfsLinux() ([]nativeProbeDevice, error) {
sysfsDevices, err := readROCmLinuxSysfsDevices("/sys")
if err != nil {
return nil, err
}
override := hsaOverrideGFXTarget()
// Sysfs stays in physical KFD order; ROCm visibility envs can reindex the
// backend device list, so filtered sysfs data must merge by PCI ID only.
backendIndex := !rocmVisibleDevicesEnvSet()
devices := make([]nativeProbeDevice, 0, len(sysfsDevices))
for i, sysfsDevice := range sysfsDevices {
gfxTarget := sysfsDevice.gfxTarget
if override != "" {
gfxTarget = override
}
devices = append(devices, nativeProbeDevice{
Library: "ROCm",
Index: i,
IndexMatchesBackend: backendIndex,
DeviceID: sysfsDevice.pciID,
Integrated: sysfsDevice.integrated,
IntegratedKnown: sysfsDevice.known,
GFXTarget: gfxTarget,
})
}
return devices, nil
}
func rocmVisibleDevicesEnvSet() bool {
for _, name := range []string{"HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL"} {
if os.Getenv(name) != "" {
return true
}
}
return false
}
func probeNVIDIADriverMajorLinux() (int, error) {
nvml, err := dlopenFirst([]string{"libnvidia-ml.so.1", "libnvidia-ml.so"}, false)
if err != nil {
return 0, err
}
initFn, err := dlsym(nvml, "nvmlInit_v2")
if err != nil {
return 0, err
}
shutdownFn, err := dlsym(nvml, "nvmlShutdown")
if err != nil {
return 0, err
}
driverFn, err := dlsym(nvml, "nvmlSystemGetDriverVersion")
if err != nil {
return 0, err
}
if ret := C.ollama_call_nvml_init(initFn); ret != 0 {
return 0, fmt.Errorf("nvmlInit_v2 failed: %d", int(ret))
}
defer C.ollama_call_nvml_shutdown(shutdownFn)
var version [80]C.char
if ret := C.ollama_call_nvml_system_get_driver_version(driverFn, &version[0], C.uint(len(version))); ret != 0 {
return 0, fmt.Errorf("nvmlSystemGetDriverVersion failed: %d", int(ret))
}
return parseNVIDIADriverMajor(C.GoString(&version[0]))
}
func cudaDeviceAttribute(fn unsafe.Pointer, attr int, device C.int) int {
var value C.int
if ret := C.ollama_call_cu_device_get_attribute(fn, &value, C.int(attr), device); ret != cuSuccess {
return 0
}
return int(value)
}
func dlopenFirst(names []string, global bool) (dlHandle, error) {
var errs []string
for _, name := range names {
handle, err := dlopen(name, global)
if err == nil {
return handle, nil
}
errs = append(errs, err.Error())
}
return dlHandle{}, errors.New(strings.Join(errs, "; "))
}
func dlopen(path string, global bool) (dlHandle, error) {
cpath := C.CString(path)
defer C.free(unsafe.Pointer(cpath))
handle := C.ollama_dlopen(cpath, boolToCInt(global))
if handle == nil {
return dlHandle{}, fmt.Errorf("dlopen %s: %s", path, C.GoString(C.ollama_dlerror()))
}
return dlHandle{ptr: handle}, nil
}
func dlsym(handle dlHandle, name string) (unsafe.Pointer, error) {
cname := C.CString(name)
defer C.free(unsafe.Pointer(cname))
sym := C.ollama_dlsym(handle.ptr, cname)
if sym == nil {
return nil, fmt.Errorf("dlsym %s: %s", name, C.GoString(C.ollama_dlerror()))
}
return sym, nil
}
func dlsymAny(handle dlHandle, names ...string) (unsafe.Pointer, error) {
var errs []string
for _, name := range names {
sym, err := dlsym(handle, name)
if err == nil {
return sym, nil
}
errs = append(errs, err.Error())
}
return nil, errors.New(strings.Join(errs, "; "))
}
func callGGMLBackendLoad(fn unsafe.Pointer, path string) unsafe.Pointer {
cpath := C.CString(path)
defer C.free(unsafe.Pointer(cpath))
return C.ollama_call_ggml_backend_load(fn, cpath)
}
func callGGMLRegDevCount(fn unsafe.Pointer, reg unsafe.Pointer) uintptr {
return uintptr(C.ollama_call_ggml_backend_reg_dev_count(fn, reg))
}
func callGGMLRegDevGet(fn unsafe.Pointer, reg unsafe.Pointer, index int) unsafe.Pointer {
return C.ollama_call_ggml_backend_reg_dev_get(fn, reg, C.size_t(index))
}
func callGGMLRegName(fn unsafe.Pointer, reg unsafe.Pointer) string {
return C.GoString(C.ollama_call_ggml_backend_reg_name(fn, reg))
}
func callGGMLDeviceProps(fn unsafe.Pointer, dev unsafe.Pointer) ggmlBackendDevProps {
var props ggmlBackendDevProps
C.ollama_call_ggml_backend_dev_get_props(fn, dev, unsafe.Pointer(&props))
return props
}
func cString(ptr uintptr) string {
if ptr == 0 {
return ""
}
return C.GoString(C.ollama_cstr_from_uintptr(C.uintptr_t(ptr)))
}
func boolToCInt(v bool) C.int {
if v {
return 1
}
return 0
}

@@ -0,0 +1,12 @@
//go:build linux && !cgo
package discover
import (
"context"
"errors"
)
func runPlatformNativeProbe(context.Context, []string) ([]nativeProbeDevice, error) {
return nil, errors.New("native GPU discovery requires cgo on Linux")
}

@@ -0,0 +1,132 @@
//go:build (linux && cgo) || windows
package discover
import (
"errors"
"fmt"
"os"
"path/filepath"
"runtime"
"sort"
"strconv"
"strings"
)
const (
ggmlBackendDeviceTypeGPU = 1
ggmlBackendDeviceTypeIGPU = 2
)
func ggmlDeviceTypeIntegrated(deviceType int32) bool {
return deviceType == ggmlBackendDeviceTypeIGPU
}
func ggmlProbeLibraryName(name string) string {
switch strings.ToLower(name) {
case "cuda":
return "CUDA"
case "hip", "rocm":
return "ROCm"
case "vulkan":
return "Vulkan"
case "metal":
return "Metal"
default:
return name
}
}
func ggmlLibraryFile(dir, name string) string {
if runtime.GOOS == "windows" {
return filepath.Join(dir, name+".dll")
}
exact := filepath.Join(dir, "lib"+name+".so")
if _, err := os.Stat(exact); err == nil {
return exact
}
// Fall back to a versioned name such as lib<name>.so.N; the lexically last match wins.
matches, _ := filepath.Glob(exact + ".*")
if len(matches) > 0 {
sort.Strings(matches)
return matches[len(matches)-1]
}
return exact
}
func nativeProbeBackendFiles(libDirs []string) []string {
var files []string
seen := map[string]bool{}
for _, dir := range libDirs {
for _, pattern := range nativeProbeBackendPatterns(dir) {
matches, _ := filepath.Glob(pattern)
for _, match := range matches {
if seen[match] {
continue
}
seen[match] = true
files = append(files, match)
}
}
}
return files
}
func nativeProbeBackendPatterns(dir string) []string {
if runtime.GOOS == "windows" {
return []string{
filepath.Join(dir, "ggml-cuda.dll"),
filepath.Join(dir, "ggml-hip.dll"),
filepath.Join(dir, "ggml-vulkan.dll"),
}
}
return []string{
filepath.Join(dir, "libggml-cuda.so"),
filepath.Join(dir, "libggml-hip.so"),
filepath.Join(dir, "libggml-vulkan.so"),
}
}
func nativeProbeHasCUDA(libDirs []string) bool {
for _, dir := range libDirs {
if strings.Contains(strings.ToLower(filepath.Base(dir)), "cuda") {
return true
}
}
for _, file := range nativeProbeBackendFiles(libDirs) {
if strings.Contains(strings.ToLower(filepath.Base(file)), "cuda") {
return true
}
}
return false
}
func nativeProbeHasROCm(libDirs []string) bool {
for _, dir := range libDirs {
base := strings.ToLower(filepath.Base(dir))
if strings.Contains(base, "rocm") || strings.Contains(base, "hip") {
return true
}
}
for _, file := range nativeProbeBackendFiles(libDirs) {
base := strings.ToLower(filepath.Base(file))
if strings.Contains(base, "hip") {
return true
}
}
return false
}
func parseNVIDIADriverMajor(version string) (int, error) {
version = strings.TrimSpace(version)
if version == "" {
return 0, errors.New("empty NVIDIA driver version")
}
major, _, _ := strings.Cut(version, ".")
driver, err := strconv.Atoi(major)
if err != nil {
return 0, fmt.Errorf("parse NVIDIA driver version %q: %w", version, err)
}
return driver, nil
}

@@ -0,0 +1,12 @@
//go:build !linux && !windows
package discover
import (
"context"
"errors"
)
func runPlatformNativeProbe(context.Context, []string) ([]nativeProbeDevice, error) {
return nil, errors.New("native GPU discovery is not implemented on this platform")
}

@@ -0,0 +1,203 @@
package discover
import (
"testing"
"unsafe"
"github.com/ollama/ollama/ml"
)
func TestGGMLBackendDevPropsLayout(t *testing.T) {
if unsafe.Sizeof(uintptr(0)) != 8 {
t.Skip("GGML probe layout assertions are for 64-bit builds")
}
var props ggmlBackendDevProps
if got, want := unsafe.Sizeof(props), uintptr(56); got != want {
t.Fatalf("ggmlBackendDevProps size = %d, want %d", got, want)
}
checks := []struct {
name string
got uintptr
want uintptr
}{
{"Name", unsafe.Offsetof(props.Name), 0},
{"Description", unsafe.Offsetof(props.Description), 8},
{"MemoryFree", unsafe.Offsetof(props.MemoryFree), 16},
{"MemoryTotal", unsafe.Offsetof(props.MemoryTotal), 24},
{"Type", unsafe.Offsetof(props.Type), 32},
{"DeviceID", unsafe.Offsetof(props.DeviceID), 40},
{"Caps", unsafe.Offsetof(props.Caps), 48},
}
for _, tt := range checks {
t.Run(tt.name, func(t *testing.T) {
if tt.got != tt.want {
t.Fatalf("offset = %d, want %d", tt.got, tt.want)
}
})
}
if got, want := unsafe.Sizeof(ggmlBackendDevCaps{}), uintptr(4); got != want {
t.Fatalf("ggmlBackendDevCaps size = %d, want %d", got, want)
}
}
func TestParseLlamaServerDevicesUsesNativeCUDAComputeCapability(t *testing.T) {
output := `system_info: n_threads = 4 | CUDA : ARCHS = 750,800 |
Available devices:
CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
`
devices := parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/cuda_v13"}, []nativeProbeDevice{{
Library: "CUDA",
Index: 0,
IndexMatchesBackend: true,
DeviceID: "0000:01:00.0",
ComputeMajor: 6,
ComputeMinor: 1,
CUDADriverMajor: 13,
NVIDIADriverMajor: 570,
}})
if len(devices) != 0 {
t.Fatalf("got %d devices, want unsupported CUDA device filtered", len(devices))
}
output = `system_info: n_threads = 4 | CUDA : ARCHS = 610,750,800 |
Available devices:
CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
`
devices = parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/cuda_v12"}, []nativeProbeDevice{{
Library: "CUDA",
Index: 0,
IndexMatchesBackend: true,
DeviceID: "0000:01:00.0",
ComputeMajor: 6,
ComputeMinor: 1,
CUDADriverMajor: 12,
NVIDIADriverMajor: 570,
}})
if len(devices) != 1 {
t.Fatalf("got %d devices, want 1", len(devices))
}
got := devices[0]
if got.Compute() != "6.1" {
t.Fatalf("compute = %q, want 6.1", got.Compute())
}
if got.PCIID != "0000:01:00.0" {
t.Fatalf("PCIID = %q, want 0000:01:00.0", got.PCIID)
}
if got.Driver() != "12.0" {
t.Fatalf("driver = %q, want 12.0", got.Driver())
}
if got.NVIDIADriverMajor != 570 {
t.Fatalf("NVIDIADriverMajor = %d, want 570", got.NVIDIADriverMajor)
}
}
func TestParseLlamaServerDevicesUsesNativeROCmMetadata(t *testing.T) {
output := `ggml_vulkan: 0 = AMD Radeon RX 7600 | uma: 1 | fp16: 1 |
Available devices:
ROCm0: AMD Radeon RX 7600 (8176 MiB, 7900 MiB free)
`
devices := parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/rocm_v7_2"}, []nativeProbeDevice{{
Library: "ROCm",
Index: 0,
IndexMatchesBackend: true,
DeviceID: "0000:03:00.0",
GFXTarget: "gfx1102",
Integrated: false,
IntegratedKnown: true,
}})
if len(devices) != 1 {
t.Fatalf("got %d devices, want 1", len(devices))
}
got := devices[0]
if got.PCIID != "0000:03:00.0" {
t.Fatalf("PCIID = %q, want 0000:03:00.0", got.PCIID)
}
if got.GFXTarget != "gfx1102" {
t.Fatalf("GFXTarget = %q, want gfx1102", got.GFXTarget)
}
if got.Compute() != "gfx1102" {
t.Fatalf("compute = %q, want gfx1102", got.Compute())
}
if got.Integrated {
t.Fatalf("Integrated = true, want false")
}
}
func TestNVIDIADriverMajorFromDevices(t *testing.T) {
devices := []ml.DeviceInfo{
{DeviceID: ml.DeviceID{Library: "CUDA"}, NVIDIADriverMajor: 565},
}
if got := nvidiaDriverMajorFromDevices(devices); got != 565 {
t.Fatalf("driver = %d, want 565", got)
}
}
func TestMergeNativeProbeDevicesAvoidsUnreliableIndexMatch(t *testing.T) {
tests := []struct {
name string
base []nativeProbeDevice
supplement []nativeProbeDevice
wantLen int
wantPCI string
wantKnown bool
wantIGPU bool
}{
{
name: "filtered sysfs cannot overwrite a different backend device by index",
base: []nativeProbeDevice{{
Library: "ROCm",
Index: 0,
IndexMatchesBackend: true,
DeviceID: "0000:03:00.0",
}},
supplement: []nativeProbeDevice{{
Library: "ROCm",
Index: 0,
DeviceID: "0000:04:00.0",
Integrated: true,
IntegratedKnown: true,
}},
wantLen: 1,
wantPCI: "0000:03:00.0",
},
{
name: "filtered sysfs can still merge by PCI ID",
base: []nativeProbeDevice{{
Library: "ROCm",
Index: 0,
IndexMatchesBackend: true,
DeviceID: "0000:04:00.0",
}},
supplement: []nativeProbeDevice{{
Library: "ROCm",
Index: 1,
DeviceID: "0000:04:00.0",
Integrated: true,
IntegratedKnown: true,
}},
wantLen: 1,
wantPCI: "0000:04:00.0",
wantKnown: true,
wantIGPU: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := mergeNativeProbeDevices(tt.base, tt.supplement)
if len(got) != tt.wantLen {
t.Fatalf("got %d devices, want %d: %#v", len(got), tt.wantLen, got)
}
if got[0].DeviceID != tt.wantPCI {
t.Fatalf("DeviceID = %q, want %q", got[0].DeviceID, tt.wantPCI)
}
if got[0].IntegratedKnown != tt.wantKnown {
t.Fatalf("IntegratedKnown = %v, want %v", got[0].IntegratedKnown, tt.wantKnown)
}
if got[0].Integrated != tt.wantIGPU {
t.Fatalf("Integrated = %v, want %v", got[0].Integrated, tt.wantIGPU)
}
})
}
}

@@ -0,0 +1,466 @@
//go:build windows
package discover
import (
"context"
"errors"
"fmt"
"log/slog"
"os"
"path/filepath"
"strings"
"unsafe"
"github.com/ollama/ollama/llm"
"golang.org/x/sys/windows"
)
const (
cuSuccessWindows = 0
cuDeviceAttributeComputeCapabilityMajorW = 75
cuDeviceAttributeComputeCapabilityMinorW = 76
cuDeviceAttributeIntegratedW = 18
hipSuccessWindows = 0
hipDeviceAttributeIntegratedWindows = 16
)
func runPlatformNativeProbe(ctx context.Context, libDirs []string) ([]nativeProbeDevice, error) {
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
ggmlDevices, ggmlErr := probeGGMLDevicesWindows(libDirs)
var cudaDevices []nativeProbeDevice
var cudaErr error
if nativeProbeHasCUDA(libDirs) {
cudaDevices, cudaErr = probeCUDADriverWindows()
}
var rocmDevices []nativeProbeDevice
var rocmErr error
if nativeProbeHasROCm(libDirs) {
rocmDevices, rocmErr = probeHIPRuntimeWindows(libDirs)
}
devices := mergeNativeProbeDevices(mergeNativeProbeDevices(ggmlDevices, cudaDevices), rocmDevices)
if len(devices) > 0 {
return devices, nil
}
if ggmlErr != nil {
return nil, ggmlErr
}
if rocmErr != nil {
return nil, rocmErr
}
return nil, cudaErr
}
func probeGGMLDevicesWindows(libDirs []string) ([]nativeProbeDevice, error) {
if len(libDirs) == 0 || libDirs[0] == "" {
return nil, errors.New("empty GGML library directory")
}
base, err := loadDLLFromPath(ggmlLibraryFile(libDirs[0], "ggml-base"))
if err != nil {
return nil, err
}
ggml, err := loadDLLFromPath(ggmlLibraryFile(libDirs[0], "ggml"))
if err != nil {
return nil, err
}
backendLoad, err := findProc(ggml, "ggml_backend_load")
if err != nil {
return nil, err
}
regDevCount, err := findProc(base, "ggml_backend_reg_dev_count")
if err != nil {
return nil, err
}
regDevGet, err := findProc(base, "ggml_backend_reg_dev_get")
if err != nil {
return nil, err
}
regName, err := findProc(base, "ggml_backend_reg_name")
if err != nil {
return nil, err
}
devGetProps, err := findProc(base, "ggml_backend_dev_get_props")
if err != nil {
return nil, err
}
var devices []nativeProbeDevice
for _, backendPath := range nativeProbeBackendFiles(libDirs) {
cpath, err := windows.BytePtrFromString(backendPath)
if err != nil {
return nil, err
}
reg, _, _ := backendLoad.Call(uintptr(unsafe.Pointer(cpath)))
if reg == 0 {
continue
}
regNamePtr, _, _ := regName.Call(reg)
library := ggmlProbeLibraryName(windowsCString(regNamePtr))
count, _, _ := regDevCount.Call(reg)
for i := uintptr(0); i < count; i++ {
dev, _, _ := regDevGet.Call(reg, i)
if dev == 0 {
continue
}
var props ggmlBackendDevProps
devGetProps.Call(dev, uintptr(unsafe.Pointer(&props)))
if props.MemoryTotal == 0 {
continue
}
devices = append(devices, nativeProbeDevice{
Library: library,
Index: int(i),
IndexMatchesBackend: true,
Name: windowsCString(props.Name),
Description: windowsCString(props.Description),
DeviceID: windowsCString(props.DeviceID),
Integrated: ggmlDeviceTypeIntegrated(props.Type),
IntegratedKnown: props.Type == ggmlBackendDeviceTypeGPU ||
props.Type == ggmlBackendDeviceTypeIGPU,
TotalMemory: uint64(props.MemoryTotal),
FreeMemory: uint64(props.MemoryFree),
})
slog.Debug("GGML GPU device type", "library", library, "index", i, "ggml_type", props.Type, "integrated", ggmlDeviceTypeIntegrated(props.Type))
}
}
return devices, nil
}
func probeCUDADriverWindows() ([]nativeProbeDevice, error) {
cuda, err := loadDLLFromSystem32("nvcuda.dll")
if err != nil {
return nil, err
}
cuInit, err := findProc(cuda, "cuInit")
if err != nil {
return nil, err
}
cuDriverGetVersion, err := findProc(cuda, "cuDriverGetVersion")
if err != nil {
return nil, err
}
cuDeviceGetCount, err := findProc(cuda, "cuDeviceGetCount")
if err != nil {
return nil, err
}
cuDeviceGet, err := findProc(cuda, "cuDeviceGet")
if err != nil {
return nil, err
}
cuDeviceGetAttribute, err := findProc(cuda, "cuDeviceGetAttribute")
if err != nil {
return nil, err
}
cuDeviceGetName, err := findProc(cuda, "cuDeviceGetName")
if err != nil {
return nil, err
}
cuDeviceTotalMem, err := procAny(cuda, "cuDeviceTotalMem_v2", "cuDeviceTotalMem")
if err != nil {
return nil, err
}
cuDeviceGetPCIBusID, _ := findProc(cuda, "cuDeviceGetPCIBusId")
if ret, _, _ := cuInit.Call(0); ret != cuSuccessWindows {
return nil, fmt.Errorf("cuInit failed: %d", ret)
}
driverMajor, driverMinor := 0, 0
var driverVersion int32
if ret, _, _ := cuDriverGetVersion.Call(uintptr(unsafe.Pointer(&driverVersion))); ret == cuSuccessWindows {
// cuDriverGetVersion encodes the version as 1000*major + 10*minor (e.g. 12040 is 12.4).
version := int(driverVersion)
driverMajor = version / 1000
driverMinor = (version - driverMajor*1000) / 10
}
nvidiaDriverMajor := 0
if driver, err := probeNVIDIADriverMajorWindows(); err == nil {
nvidiaDriverMajor = driver
}
var count int32
if ret, _, _ := cuDeviceGetCount.Call(uintptr(unsafe.Pointer(&count))); ret != cuSuccessWindows {
return nil, fmt.Errorf("cuDeviceGetCount failed: %d", ret)
}
devices := make([]nativeProbeDevice, 0, int(count))
for i := range int(count) {
var device int32
if ret, _, _ := cuDeviceGet.Call(uintptr(unsafe.Pointer(&device)), uintptr(i)); ret != cuSuccessWindows {
continue
}
major := cudaDeviceAttributeWindows(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMajorW, device)
minor := cudaDeviceAttributeWindows(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMinorW, device)
integrated := cudaDeviceAttributeWindows(cuDeviceGetAttribute, cuDeviceAttributeIntegratedW, device) == 1
name := make([]byte, 128)
cuDeviceGetName.Call(uintptr(unsafe.Pointer(&name[0])), uintptr(len(name)), uintptr(device))
var total uintptr
cuDeviceTotalMem.Call(uintptr(unsafe.Pointer(&total)), uintptr(device))
pci := ""
if cuDeviceGetPCIBusID != nil {
pciBuf := make([]byte, 32)
if ret, _, _ := cuDeviceGetPCIBusID.Call(uintptr(unsafe.Pointer(&pciBuf[0])), uintptr(len(pciBuf)), uintptr(device)); ret == cuSuccessWindows {
pci = strings.ToLower(byteCString(pciBuf))
}
}
devices = append(devices, nativeProbeDevice{
Library: "CUDA",
Index: i,
IndexMatchesBackend: true,
Description: byteCString(name),
DeviceID: pci,
Integrated: integrated,
IntegratedKnown: true,
TotalMemory: uint64(total),
ComputeMajor: major,
ComputeMinor: minor,
CUDADriverMajor: driverMajor,
CUDADriverMinor: driverMinor,
NVIDIADriverMajor: nvidiaDriverMajor,
})
}
return devices, nil
}
func probeHIPRuntimeWindows(libDirs []string) ([]nativeProbeDevice, error) {
hipPath, err := llm.WindowsROCmRuntimeDLLPath(libDirs)
if err != nil {
return nil, err
}
hip, err := loadDLLFromPath(hipPath)
if err != nil {
return nil, err
}
hipGetDeviceCount, err := findProc(hip, "hipGetDeviceCount")
if err != nil {
return nil, err
}
hipDeviceGetName, err := findProc(hip, "hipDeviceGetName")
if err != nil {
return nil, err
}
hipDeviceTotalMem, err := findProc(hip, "hipDeviceTotalMem")
if err != nil {
return nil, err
}
hipDeviceGetPCIBusID, _ := findProc(hip, "hipDeviceGetPCIBusId")
hipDeviceGetAttribute, _ := findProc(hip, "hipDeviceGetAttribute")
var count int32
if ret, _, _ := hipGetDeviceCount.Call(uintptr(unsafe.Pointer(&count))); ret != hipSuccessWindows {
return nil, fmt.Errorf("hipGetDeviceCount failed: %d", ret)
}
devices := make([]nativeProbeDevice, 0, int(count))
for i := range int(count) {
name := make([]byte, 128)
hipDeviceGetName.Call(uintptr(unsafe.Pointer(&name[0])), uintptr(len(name)), uintptr(i))
var total uintptr
hipDeviceTotalMem.Call(uintptr(unsafe.Pointer(&total)), uintptr(i))
pci := ""
if hipDeviceGetPCIBusID != nil {
pciBuf := make([]byte, 32)
if ret, _, _ := hipDeviceGetPCIBusID.Call(uintptr(unsafe.Pointer(&pciBuf[0])), uintptr(len(pciBuf)), uintptr(i)); ret == hipSuccessWindows {
pci = strings.ToLower(byteCString(pciBuf))
}
}
integrated, integratedKnown := false, false
if hipDeviceGetAttribute != nil {
integrated = hipDeviceAttributeWindows(hipDeviceGetAttribute, hipDeviceAttributeIntegratedWindows, int32(i)) == 1
integratedKnown = true
}
devices = append(devices, nativeProbeDevice{
Library: "ROCm",
Index: i,
IndexMatchesBackend: true,
Description: byteCString(name),
DeviceID: pci,
Integrated: integrated,
IntegratedKnown: integratedKnown,
TotalMemory: uint64(total),
})
}
return devices, nil
}
func probeNVIDIADriverMajorWindows() (int, error) {
nvml, err := loadDLLFromSystem32("nvml.dll")
if err != nil {
nvml, err = loadDLLFromDirs([]string{"nvml.dll"}, nvidiaNVMLDirsWindows())
}
if err != nil {
return 0, err
}
initFn, err := findProc(nvml, "nvmlInit_v2")
if err != nil {
return 0, err
}
shutdownFn, err := findProc(nvml, "nvmlShutdown")
if err != nil {
return 0, err
}
driverFn, err := findProc(nvml, "nvmlSystemGetDriverVersion")
if err != nil {
return 0, err
}
if ret, _, _ := initFn.Call(); ret != 0 {
return 0, fmt.Errorf("nvmlInit_v2 failed: %d", ret)
}
defer shutdownFn.Call()
version := make([]byte, 80)
if ret, _, _ := driverFn.Call(uintptr(unsafe.Pointer(&version[0])), uintptr(len(version))); ret != 0 {
return 0, fmt.Errorf("nvmlSystemGetDriverVersion failed: %d", ret)
}
return parseNVIDIADriverMajor(byteCString(version))
}
func cudaDeviceAttributeWindows(fn *windows.Proc, attr int, device int32) int {
var value int32
if ret, _, _ := fn.Call(uintptr(unsafe.Pointer(&value)), uintptr(attr), uintptr(device)); ret != cuSuccessWindows {
return 0
}
return int(value)
}
func hipDeviceAttributeWindows(fn *windows.Proc, attr int, device int32) int {
var value int32
if ret, _, _ := fn.Call(uintptr(unsafe.Pointer(&value)), uintptr(attr), uintptr(device)); ret != hipSuccessWindows {
return 0
}
return int(value)
}
func findProc(dll *windows.DLL, name string) (*windows.Proc, error) {
return dll.FindProc(name)
}
// Use LoadLibraryEx so GPU discovery does not honor the current directory or PATH for DLL resolution.
func loadDLLFromSystem32(name string) (*windows.DLL, error) {
return loadDLLWithFlags(name, windows.LOAD_LIBRARY_SEARCH_SYSTEM32)
}
func loadDLLFromPath(path string) (*windows.DLL, error) {
absPath, err := filepath.Abs(path)
if err != nil {
return nil, err
}
return loadDLLWithFlags(absPath, windows.LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR|windows.LOAD_LIBRARY_SEARCH_DEFAULT_DIRS)
}
func loadDLLWithFlags(name string, flags uintptr) (*windows.DLL, error) {
handle, err := windows.LoadLibraryEx(name, 0, flags)
if err != nil {
return nil, fmt.Errorf("failed to load %s: %w", name, err)
}
return &windows.DLL{Name: name, Handle: handle}, nil
}
func loadDLLFromDirs(names, dirs []string) (*windows.DLL, error) {
var errs []string
for _, name := range names {
for _, dir := range dirs {
path := filepath.Join(dir, name)
if _, err := os.Stat(path); err != nil {
continue
}
dll, err := loadDLLFromPath(path)
if err == nil {
return dll, nil
}
errs = append(errs, err.Error())
}
}
if len(errs) == 0 {
return nil, fmt.Errorf("no matching DLL found: %s", strings.Join(names, ", "))
}
return nil, errors.New(strings.Join(errs, "; "))
}
func nvidiaNVMLDirsWindows() []string {
var dirs []string
for _, root := range windowsProgramFilesDirs() {
dirs = append(dirs, filepath.Join(root, "NVIDIA Corporation", "NVSMI"))
}
return uniqueAbsDirs(dirs)
}
func windowsProgramFilesDirs() []string {
return uniqueAbsDirs([]string{
os.Getenv("ProgramW6432"),
os.Getenv("ProgramFiles"),
})
}
func uniqueAbsDirs(dirs []string) []string {
seen := map[string]bool{}
var out []string
for _, dir := range dirs {
if dir == "" {
continue
}
absDir, err := filepath.Abs(dir)
if err != nil {
continue
}
absDir = filepath.Clean(absDir)
key := strings.ToLower(absDir)
if seen[key] {
continue
}
seen[key] = true
out = append(out, absDir)
}
return out
}
func procAny(dll *windows.DLL, names ...string) (*windows.Proc, error) {
var errs []string
for _, name := range names {
proc, err := dll.FindProc(name)
if err == nil {
return proc, nil
}
errs = append(errs, err.Error())
}
return nil, errors.New(strings.Join(errs, "; "))
}
//nolint:govet // Windows Proc.Call returns C string pointers as uintptr.
func windowsCString(ptr uintptr) string {
if ptr == 0 {
return ""
}
return windows.BytePtrToString((*byte)(unsafe.Pointer(ptr)))
}
func byteCString(data []byte) string {
for i, b := range data {
if b == 0 {
return string(data[:i])
}
}
return string(data)
}

@@ -4,9 +4,7 @@ package discover
import (
"context"
"io"
"log/slog"
"os"
"os/exec"
"path/filepath"
"runtime"
@@ -27,7 +25,6 @@ var (
deviceMu sync.Mutex
devices []ml.DeviceInfo
libDirs map[string]struct{}
exe string
bootstrapped bool
)
@@ -43,15 +40,6 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
if !bootstrapped {
msg = "GPU bootstrap discovery took"
libDirs = make(map[string]struct{})
var err error
exe, err = os.Executable()
if err != nil {
slog.Error("unable to lookup executable path", "error", err)
return nil
}
if eval, err := filepath.EvalSymlinks(exe); err == nil {
exe = eval
}
files, err := filepath.Glob(filepath.Join(ml.LibOllamaPath, "*", "*ggml-*"))
if err != nil {
slog.Debug("unable to lookup runner library directories", "error", err)
@@ -66,6 +54,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
slog.Info("discovering available GPUs...")
detectIncompatibleLibraries()
detectOldAMDDriverWindows()
// Warn if any user-overrides are set which could lead to incorrect GPU discovery
overrideWarnings()
@@ -102,8 +91,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
} else if jetpack == "" && strings.Contains(filepath.Base(dir), "cuda_jetpack") {
slog.Debug("jetpack not detected (set JETSON_JETPACK or OLLAMA_LLM_LIBRARY to override), skipping", "libDir", dir)
continue
} else if !envconfig.EnableVulkan() && strings.Contains(filepath.Base(dir), "vulkan") {
slog.Info("experimental Vulkan support disabled. To enable, set OLLAMA_VULKAN=1")
} else if !envconfig.EnableVulkan(true) && strings.Contains(filepath.Base(dir), "vulkan") {
continue
}
dirs = []string{ml.LibOllamaPath, dir}
@@ -113,10 +101,16 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
ctx1stPass, cancel := context.WithTimeout(ctx, bootstrapTimeout)
// For this pass, we retain duplicates in case any are incompatible with some libraries
devices = append(devices, bootstrapDevicesWithMetalRetry(ctx1stPass, ctx, bootstrapTimeout, dirs, nil)...)
discovered := bootstrapDevicesWithMetalRetry(ctx1stPass, ctx, bootstrapTimeout, dirs, nil)
if filepath.Base(dirs[len(dirs)-1]) == "cuda_v12" {
discovered = filterOldCUDADriver(ctx, discovered)
}
devices = append(devices, discovered...)
cancel()
}
devices = filterIntegratedGPUs(devices)
// In the second pass, we more deeply initialize the GPUs to weed out devices that
// aren't supported by a given library. We run this phase in parallel to speed up discovery.
// Only devices that need verification are included in this pass
@@ -146,7 +140,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
wg.Add(1)
go func(i int) {
defer wg.Done()
extraEnvs := ml.GetDevicesEnv(devices[i:i+1], true)
extraEnvs := ml.GetDevicesEnv(devices[i : i+1])
devices[i].AddInitValidation(extraEnvs)
if len(bootstrapDevicesWithMetalRetry(ctx2ndPass, ctx, 30*time.Second, devices[i].LibraryPath, extraEnvs)) == 0 {
slog.Debug("filtering device which didn't fully initialize",
@@ -193,6 +187,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
devices[i].FilterID = devices[i].ID
devices[i].ID = strconv.Itoa(postFilteredID[devices[i].Library])
}
remapFilterIDForUserVisibleDevices(&devices[i])
postFilteredID[devices[i].Library]++
}
}
@@ -328,18 +323,18 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
// Bootstrapping may take longer in some cases (AMD windows), but we
// would rather use stale free data to get the model running sooner
ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
rctx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
// Apply any dev filters to avoid re-discovering unsupported devices, and get IDs correct
// We avoid CUDA filters here to keep ROCm from failing to discover GPUs in a mixed environment
devFilter := ml.GetDevicesEnv(devices, false)
// Apply any device filters to avoid re-discovering unsupported devices
// and keep remapped IDs aligned.
devFilter := ml.GetDevicesEnv(devices)
for dir := range libDirs {
updatedDevices := bootstrapDevices(ctx, []string{ml.LibOllamaPath, dir}, devFilter)
updatedDevices := bootstrapDevicesWithMetalRetry(rctx, ctx, 3*time.Second, []string{ml.LibOllamaPath, dir}, devFilter)
for _, u := range updatedDevices {
for i := range devices {
if u.DeviceID == devices[i].DeviceID && u.PCIID == devices[i].PCIID {
if sameRefreshDevice(u, devices[i]) {
updated[i] = true
devices[i].FreeMemory = u.FreeMemory
break
@@ -360,6 +355,64 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
return append([]ml.DeviceInfo{}, devices...)
}
func sameRefreshDevice(updated, existing ml.DeviceInfo) bool {
if updated.Library != existing.Library {
return false
}
if updated.PCIID != "" && existing.PCIID != "" {
return strings.EqualFold(updated.PCIID, existing.PCIID)
}
return updated.DeviceID == existing.DeviceID
}
func filterIntegratedGPUs(devices []ml.DeviceInfo) []ml.DeviceInfo {
if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
return devices
}
allow, explicit := integratedGPUAdmission()
filtered := devices[:0]
for _, device := range devices {
if !device.Integrated {
filtered = append(filtered, device)
continue
}
if explicit {
if allow {
filtered = append(filtered, device)
continue
}
} else if integratedGPUAllowedByDefault(device) {
filtered = append(filtered, device)
continue
}
slog.Info("dropping integrated GPU",
"id", device.ID,
"library", device.Library,
"compute", device.Compute(),
"name", device.Name,
"description", device.Description,
"pci_id", device.PCIID)
}
return filtered
}
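// integratedGPUAdmission reports whether integrated GPUs are allowed and whether
// that choice was set explicitly. It reads the setting with both possible defaults:
// if the two results agree, the default was not consulted, so the value is explicit.
func integratedGPUAdmission() (allow, explicit bool) {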
enabledWithTrueDefault := envconfig.EnableIntegratedGPU(true)
enabledWithFalseDefault := envconfig.EnableIntegratedGPU(false)
if enabledWithTrueDefault == enabledWithFalseDefault {
return enabledWithTrueDefault, true
}
return false, false
}
func integratedGPUAllowedByDefault(device ml.DeviceInfo) bool {
return device.Library == "CUDA"
}
func filterOverlapByLibrary(supported map[string]map[string]map[string]int, needsDelete []bool) {
// For multi-GPU systems, use the newest version that supports all the GPUs
for _, byLibDirs := range supported {
@@ -410,67 +463,216 @@ func filterOverlapByLibrary(supported map[string]map[string]map[string]int, need
}
}
type bootstrapRunner struct {
port int
cmd *exec.Cmd
}
func (r *bootstrapRunner) GetPort() int {
return r.port
}
func (r *bootstrapRunner) HasExited() bool {
if r.cmd != nil && r.cmd.ProcessState != nil {
return true
}
return false
}
func bootstrapDevicesWithMetalRetry(firstAttemptCtx, retryParentCtx context.Context, timeout time.Duration, ollamaLibDirs []string, extraEnvs map[string]string) []ml.DeviceInfo {
runDiscovery := func(ctx context.Context, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, int, error) {
extraEnvs = normalizeDiscoveryEnv(ollamaLibDirs, extraEnvs)
runDiscovery := func(ctx context.Context, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
start := time.Now()
defer func() {
slog.Debug("bootstrap discovery took", "duration", time.Since(start), "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs)
}()
return bootstrapDevicesWithStatus(ctx, ollamaLibDirs, extraEnvs)
return bootstrapDevicesWithStatusWatchdog(ctx, ollamaLibDirs, extraEnvs)
}
devices, status, exitCode, err := runDiscovery(firstAttemptCtx, extraEnvs)
devices, status, err := runDiscovery(firstAttemptCtx, extraEnvs)
if err == nil {
recordPersistentRunnerEnv(devices, extraEnvs)
return devices
}
if err != nil && llm.ShouldRetryWithMetalTensorDisabled(err, status) && (extraEnvs == nil || extraEnvs["GGML_METAL_TENSOR_DISABLE"] != "1") {
if llm.ShouldRetryWithMetalTensorDisabled(err, status) && (extraEnvs == nil || extraEnvs["GGML_METAL_TENSOR_DISABLE"] != "1") {
retryEnvs := map[string]string{}
for k, v := range extraEnvs {
retryEnvs[k] = v
}
retryEnvs["GGML_METAL_TENSOR_DISABLE"] = "1"
slog.Warn("retrying GPU discovery with Metal tensor API disabled", "error", err)
slog.Warn("retrying llama-server GPU discovery with Metal tensor API disabled", "error", err, "detail", lastDiscoveryStatusError(status))
retryCtx, cancel := context.WithTimeout(retryParentCtx, timeout)
defer cancel()
devices, status, exitCode, err = runDiscovery(retryCtx, retryEnvs)
devices, status, err = runDiscovery(retryCtx, retryEnvs)
if err == nil {
recordPersistentRunnerEnv(devices, retryEnvs)
return devices
}
}
if err != nil {
if exitCode >= 0 {
// Expected during bootstrapping while we filter out unsupported GPUs.
logutil.Trace("runner exited", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "code", exitCode, "detail", status.LastError())
} else {
slog.Info("failure during GPU discovery", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", err, "detail", status.LastError())
}
}
slog.Info("failure during llama-server GPU discovery", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", err, "detail", lastDiscoveryStatusError(status))
return devices
}
func normalizeDiscoveryEnv(ollamaLibDirs []string, extraEnvs map[string]string) map[string]string {
return normalizeDiscoveryEnvForGOOS(runtime.GOOS, ollamaLibDirs, extraEnvs)
}
func normalizeDiscoveryEnvForGOOS(goos string, ollamaLibDirs []string, extraEnvs map[string]string) map[string]string {
if goos != "linux" || len(ollamaLibDirs) == 0 || !isROCmLibraryDir(filepath.Base(ollamaLibDirs[len(ollamaLibDirs)-1])) {
return extraEnvs
}
if extraEnvs["ROCR_VISIBLE_DEVICES"] != "" || envconfig.RocrVisibleDevices() != "" {
return extraEnvs
}
source, tokens := rocmNumericVisibleDeviceSource(extraEnvs)
if len(tokens) == 0 {
return extraEnvs
}
env := make(map[string]string, len(extraEnvs)+1)
for k, v := range extraEnvs {
env[k] = v
}
env["ROCR_VISIBLE_DEVICES"] = strings.Join(tokens, ",")
env[source] = visibleDeviceOrdinals(len(tokens))
slog.Debug("normalizing AMD visible devices for ROCm discovery", "from_env", source, "ROCR_VISIBLE_DEVICES", env["ROCR_VISIBLE_DEVICES"], "visible_ordinals", env[source])
return env
}
func isROCmLibraryDir(name string) bool {
return strings.HasPrefix(name, "rocm")
}
type bootstrapDevicesResult struct {
devices []ml.DeviceInfo
status *llm.StatusWriter
err error
}
func bootstrapDevicesWithStatusWatchdog(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
return runBootstrapDevicesWithStatusWatchdog(ctx, ollamaLibDirs, extraEnvs, llamaServerBootstrapDevicesWithStatus)
}
func runBootstrapDevicesWithStatusWatchdog(
ctx context.Context,
ollamaLibDirs []string,
extraEnvs map[string]string,
discover func(context.Context, []string, map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error),
) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
resultCh := make(chan bootstrapDevicesResult, 1)
go func() {
devices, status, err := discover(ctx, ollamaLibDirs, extraEnvs)
resultCh <- bootstrapDevicesResult{devices: devices, status: status, err: err}
}()
select {
case result := <-resultCh:
return result.devices, result.status, result.err
case <-ctx.Done():
slog.Warn("llama-server GPU discovery watchdog timed out", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", ctx.Err())
return nil, nil, ctx.Err()
}
}
func remapFilterIDForUserVisibleDevices(device *ml.DeviceInfo) {
tokens := visibleDeviceFilterTokens(runtime.GOOS, device.Library)
if len(tokens) == 0 {
return
}
id := device.FilterID
if id == "" {
id = device.ID
}
index, err := strconv.Atoi(id)
if err != nil || index < 0 || index >= len(tokens) {
return
}
device.FilterID = tokens[index]
}
func visibleDeviceFilterTokens(goos, library string) []string {
switch library {
case "CUDA":
return splitVisibleDeviceList(envconfig.CudaVisibleDevices())
case "ROCm":
if goos == "linux" {
if tokens := splitVisibleDeviceList(envconfig.RocrVisibleDevices()); len(tokens) > 0 {
return tokens
}
if _, tokens := rocmNumericVisibleDeviceSource(nil); len(tokens) > 0 {
return tokens
}
return nil
}
for _, value := range []string{envconfig.HipVisibleDevices(), envconfig.GpuDeviceOrdinal(), envconfig.CudaVisibleDevices()} {
if tokens := splitNumericVisibleDeviceList(value); len(tokens) > 0 {
return tokens
}
}
case "Vulkan":
return splitVisibleDeviceList(envconfig.VkVisibleDevices())
}
return nil
}
func rocmNumericVisibleDeviceSource(extraEnvs map[string]string) (string, []string) {
for _, name := range []string{"HIP_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "CUDA_VISIBLE_DEVICES"} {
value := extraEnvs[name]
if value == "" {
switch name {
case "HIP_VISIBLE_DEVICES":
value = envconfig.HipVisibleDevices()
case "GPU_DEVICE_ORDINAL":
value = envconfig.GpuDeviceOrdinal()
case "CUDA_VISIBLE_DEVICES":
value = envconfig.CudaVisibleDevices()
}
}
if tokens := splitNumericVisibleDeviceList(value); len(tokens) > 0 {
return name, tokens
}
}
return "", nil
}
func splitVisibleDeviceList(value string) []string {
fields := strings.Split(value, ",")
tokens := make([]string, 0, len(fields))
for _, field := range fields {
field = strings.TrimSpace(field)
if field != "" {
tokens = append(tokens, field)
}
}
return tokens
}
func splitNumericVisibleDeviceList(value string) []string {
tokens := splitVisibleDeviceList(value)
if len(tokens) == 0 {
return nil
}
for _, token := range tokens {
index, err := strconv.Atoi(token)
if err != nil || index < 0 {
return nil
}
}
return tokens
}
func visibleDeviceOrdinals(count int) string {
ordinals := make([]string, count)
for i := range ordinals {
ordinals[i] = strconv.Itoa(i)
}
return strings.Join(ordinals, ",")
}
func lastDiscoveryStatusError(status *llm.StatusWriter) string {
if status == nil {
return ""
}
return status.LastError()
}
func recordPersistentRunnerEnv(devices []ml.DeviceInfo, extraEnvs map[string]string) {
if extraEnvs["GGML_METAL_TENSOR_DISABLE"] != "1" {
return
}
for i := range devices {
if devices[i].Library != "Metal" {
continue
@ -482,44 +684,6 @@ func recordPersistentRunnerEnv(devices []ml.DeviceInfo, extraEnvs map[string]str
}
}
func bootstrapDevices(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) []ml.DeviceInfo {
devices, _, _, _ := bootstrapDevicesWithStatus(ctx, ollamaLibDirs, extraEnvs)
return devices
}
func bootstrapDevicesWithStatus(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, int, error) {
var baseOut io.Writer = io.Discard
if envconfig.LogLevel() == logutil.LevelTrace {
baseOut = os.Stderr
}
status := llm.NewStatusWriter(baseOut)
cmd, port, err := llm.StartRunner(
true, // ollama engine
"", // no model
ollamaLibDirs,
status,
extraEnvs,
)
if err != nil {
slog.Debug("failed to start runner to discovery GPUs", "error", err)
return nil, status, -1, err
}
go func() {
cmd.Wait() // exit status ignored
}()
defer cmd.Process.Kill()
devices, err := ml.GetDevicesFromRunner(ctx, &bootstrapRunner{port: port, cmd: cmd})
exitCode := -1
if cmd.ProcessState != nil {
exitCode = cmd.ProcessState.ExitCode()
}
return devices, status, exitCode, err
}
func overrideWarnings() {
anyFound := false
m := envconfig.AsMap()


@ -1,18 +1,16 @@
package discover
import (
"log/slog"
"os"
"context"
"errors"
"runtime"
"testing"
"time"
"github.com/ollama/ollama/llm"
"github.com/ollama/ollama/ml"
)
func init() {
logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelDebug}))
slog.SetDefault(logger)
}
func TestFilterOverlapByLibrary(t *testing.T) {
type testcase struct {
name string
@ -133,3 +131,266 @@ func TestRecordPersistentRunnerEnv(t *testing.T) {
t.Fatalf("unexpected RunnerEnvOverrides recorded for non-Metal device: %#v", devices[1].RunnerEnvOverrides)
}
}
func TestFilterIntegratedGPUs(t *testing.T) {
devices := []ml.DeviceInfo{
{DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"}, Description: "NVIDIA integrated", Integrated: true},
{DeviceID: ml.DeviceID{Library: "Metal", ID: "0"}, Description: "Apple GPU", Integrated: true},
{DeviceID: ml.DeviceID{Library: "Vulkan", ID: "0"}, Description: "AMD Radeon(TM) Graphics", Integrated: true},
{DeviceID: ml.DeviceID{Library: "ROCm", ID: "0"}, Description: "AMD Radeon(TM) Graphics", Integrated: true},
{DeviceID: ml.DeviceID{Library: "Vulkan", ID: "1"}, Description: "AMD Radeon RX 6800"},
}
if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
t.Setenv("OLLAMA_IGPU_ENABLE", "false")
got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
want := []ml.DeviceID{
{Library: "CUDA", ID: "0"},
{Library: "Metal", ID: "0"},
{Library: "Vulkan", ID: "0"},
{Library: "ROCm", ID: "0"},
{Library: "Vulkan", ID: "1"},
}
assertDeviceIDs(t, got, want)
return
}
t.Run("auto admits only allowlisted integrated GPUs", func(t *testing.T) {
got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
want := []ml.DeviceID{
{Library: "CUDA", ID: "0"},
{Library: "Vulkan", ID: "1"},
}
assertDeviceIDs(t, got, want)
})
t.Run("explicit true admits all integrated GPUs", func(t *testing.T) {
t.Setenv("OLLAMA_IGPU_ENABLE", "true")
got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
want := []ml.DeviceID{
{Library: "CUDA", ID: "0"},
{Library: "Metal", ID: "0"},
{Library: "Vulkan", ID: "0"},
{Library: "ROCm", ID: "0"},
{Library: "Vulkan", ID: "1"},
}
assertDeviceIDs(t, got, want)
})
t.Run("explicit false drops integrated GPUs", func(t *testing.T) {
t.Setenv("OLLAMA_IGPU_ENABLE", "false")
got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
want := []ml.DeviceID{{Library: "Vulkan", ID: "1"}}
assertDeviceIDs(t, got, want)
})
}
func assertDeviceIDs(t *testing.T, got []ml.DeviceInfo, want []ml.DeviceID) {
t.Helper()
if len(got) != len(want) {
t.Fatalf("got %d devices, want %d: %#v", len(got), len(want), got)
}
for i := range want {
if got[i].DeviceID != want[i] {
t.Fatalf("device %d = %#v, want %#v", i, got[i].DeviceID, want[i])
}
}
}
func TestRemapFilterIDForUserVisibleDevices(t *testing.T) {
tests := []struct {
name string
env map[string]string
device ml.DeviceInfo
wantID string
wantFilter string
}{
{
name: "cuda numeric parent filter",
env: map[string]string{"CUDA_VISIBLE_DEVICES": "1"},
device: ml.DeviceInfo{
DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"},
FilterID: "0",
},
wantID: "0",
wantFilter: "1",
},
{
name: "cuda uuid parent filter",
env: map[string]string{"CUDA_VISIBLE_DEVICES": "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95"},
device: ml.DeviceInfo{
DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"},
FilterID: "0",
},
wantID: "0",
wantFilter: "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95",
},
{
name: "rocm hip parent filter",
env: map[string]string{"HIP_VISIBLE_DEVICES": "2,0"},
device: ml.DeviceInfo{
DeviceID: ml.DeviceID{Library: "ROCm", ID: "1"},
FilterID: "1",
},
wantID: "1",
wantFilter: "0",
},
{
name: "vulkan parent filter",
env: map[string]string{"GGML_VK_VISIBLE_DEVICES": "1"},
device: ml.DeviceInfo{
DeviceID: ml.DeviceID{Library: "Vulkan", ID: "0"},
FilterID: "0",
},
wantID: "0",
wantFilter: "1",
},
{
name: "no parent filter keeps internal filter id",
device: ml.DeviceInfo{
DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"},
FilterID: "3",
},
wantID: "0",
wantFilter: "3",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
for key, value := range tt.env {
t.Setenv(key, value)
}
remapFilterIDForUserVisibleDevices(&tt.device)
if tt.device.ID != tt.wantID {
t.Fatalf("ID = %q, want %q", tt.device.ID, tt.wantID)
}
if tt.device.FilterID != tt.wantFilter {
t.Fatalf("FilterID = %q, want %q", tt.device.FilterID, tt.wantFilter)
}
})
}
}
func TestNormalizeROCmDiscoveryEnv(t *testing.T) {
tests := []struct {
name string
env map[string]string
extra map[string]string
wantROCR string
wantSource string
wantOrdinal string
wantSame bool
}{
{
name: "hip becomes rocr",
env: map[string]string{"HIP_VISIBLE_DEVICES": "2"},
wantROCR: "2",
wantSource: "HIP_VISIBLE_DEVICES",
wantOrdinal: "0",
},
{
name: "gpu ordinal becomes rocr",
env: map[string]string{"GPU_DEVICE_ORDINAL": "3"},
wantROCR: "3",
wantSource: "GPU_DEVICE_ORDINAL",
wantOrdinal: "0",
},
{
name: "cuda numeric becomes rocr",
env: map[string]string{"CUDA_VISIBLE_DEVICES": "2,0"},
wantROCR: "2,0",
wantSource: "CUDA_VISIBLE_DEVICES",
wantOrdinal: "0,1",
},
{
name: "rocr wins",
env: map[string]string{"ROCR_VISIBLE_DEVICES": "1", "HIP_VISIBLE_DEVICES": "2"},
wantSame: true,
},
{
name: "cuda uuid does not become rocr",
env: map[string]string{"CUDA_VISIBLE_DEVICES": "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95"},
wantSame: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
for key, value := range tt.env {
t.Setenv(key, value)
}
got := normalizeDiscoveryEnvForGOOS("linux", []string{"/lib/ollama", "/lib/ollama/rocm_v7_2"}, tt.extra)
if tt.wantSame {
if got != nil && got["ROCR_VISIBLE_DEVICES"] != "" {
t.Fatalf("ROCR_VISIBLE_DEVICES = %q, want unset", got["ROCR_VISIBLE_DEVICES"])
}
return
}
if got["ROCR_VISIBLE_DEVICES"] != tt.wantROCR {
t.Fatalf("ROCR_VISIBLE_DEVICES = %q, want %q", got["ROCR_VISIBLE_DEVICES"], tt.wantROCR)
}
if got[tt.wantSource] != tt.wantOrdinal {
t.Fatalf("%s = %q, want %q", tt.wantSource, got[tt.wantSource], tt.wantOrdinal)
}
})
}
}
func TestBootstrapDevicesWithStatusWatchdogReturnsResult(t *testing.T) {
want := []ml.DeviceInfo{{DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"}}}
devices, _, err := runBootstrapDevicesWithStatusWatchdog(
t.Context(),
[]string{"/lib/ollama", "/lib/ollama/cuda_v12"},
nil,
func(context.Context, []string, map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
return want, nil, nil
},
)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if len(devices) != 1 || devices[0].DeviceID != want[0].DeviceID {
t.Fatalf("devices = %#v, want %#v", devices, want)
}
}
func TestBootstrapDevicesWithStatusWatchdogReturnsOnDeadline(t *testing.T) {
ctx, cancel := context.WithTimeout(t.Context(), 10*time.Millisecond)
defer cancel()
started := make(chan struct{})
release := make(chan struct{})
finished := make(chan struct{})
_, _, err := runBootstrapDevicesWithStatusWatchdog(
ctx,
[]string{"/lib/ollama", "/lib/ollama/rocm_v7_2"},
nil,
func(context.Context, []string, map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
close(started)
defer close(finished)
<-release
return nil, nil, nil
},
)
if !errors.Is(err, context.DeadlineExceeded) {
t.Fatalf("err = %v, want context deadline exceeded", err)
}
close(release)
select {
case <-started:
case <-time.After(time.Second):
t.Fatal("discovery function was not called")
}
select {
case <-finished:
case <-time.After(time.Second):
t.Fatal("discovery function did not finish after release")
}
}
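The table in `TestNormalizeROCmDiscoveryEnv` above can be read as a worked example: a numeric selection such as `HIP_VISIBLE_DEVICES=2,0` moves to `ROCR_VISIBLE_DEVICES`, and the original variable is rewritten to dense ordinals. The following is a minimal standalone sketch of that idea, not the code from this change; `normalizeSketch` and `splitNumericList` are hypothetical names, and the sketch reads from a plain map instead of the process environment or `envconfig`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitNumericList returns the trimmed tokens of a comma-separated list, or
// nil if any token is not a non-negative integer (for example a GPU UUID).
func splitNumericList(value string) []string {
	var tokens []string
	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		if n, err := strconv.Atoi(field); err != nil || n < 0 {
			return nil
		}
		tokens = append(tokens, field)
	}
	return tokens
}

// normalizeSketch (hypothetical) copies env, moves the numeric selection
// stored under source into ROCR_VISIBLE_DEVICES, and rewrites source to the
// ordinals 0..n-1. An existing ROCR_VISIBLE_DEVICES always wins.
func normalizeSketch(env map[string]string, source string) map[string]string {
	tokens := splitNumericList(env[source])
	if len(tokens) == 0 || env["ROCR_VISIBLE_DEVICES"] != "" {
		return env
	}
	out := make(map[string]string, len(env)+1)
	for k, v := range env {
		out[k] = v
	}
	ordinals := make([]string, len(tokens))
	for i := range ordinals {
		ordinals[i] = strconv.Itoa(i)
	}
	out["ROCR_VISIBLE_DEVICES"] = strings.Join(tokens, ",")
	out[source] = strings.Join(ordinals, ",")
	return out
}

func main() {
	env := map[string]string{"HIP_VISIBLE_DEVICES": "2,0"}
	got := normalizeSketch(env, "HIP_VISIBLE_DEVICES")
	fmt.Println(got["ROCR_VISIBLE_DEVICES"], got["HIP_VISIBLE_DEVICES"]) // prints "2,0 0,1"
}
```

Returning a copy rather than mutating the input keeps the caller's map usable for a retry with different settings.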


@ -16,16 +16,6 @@ type memInfo struct {
FreeSwap uint64 `json:"free_swap,omitempty"` // TODO split this out for system only
}
// CPU type represents a CPU Package occupying a socket
type CPU struct {
ID string `cpuinfo:"processor"`
VendorID string `cpuinfo:"vendor_id"`
ModelName string `cpuinfo:"model name"`
CoreCount int
EfficiencyCoreCount int // Performance = CoreCount - Efficiency
ThreadCount int
}
func LogDetails(devices []ml.DeviceInfo) {
sort.Sort(sort.Reverse(ml.ByFreeMemory(devices))) // Report devices in order of scheduling preference
for _, dev := range devices {

discover/vulkan.go (new file)

@ -0,0 +1,136 @@
// Vulkan discovery needs a small amount of normalization around device type.
// llama-server discovery output does not currently expose a stable structured
// backend type field, so we use explicit Vulkan UMA metadata when it is
// present and, on Windows, refine the result with a direct Vulkan API query.
// The goal is to preserve correct integrated-vs-discrete scheduling decisions
// without relying on device-name heuristics.
package discover
import (
"bufio"
"errors"
"log/slog"
"regexp"
"runtime"
"strconv"
"strings"
"github.com/ollama/ollama/ml"
)
// vulkanUMARegex matches Vulkan debug lines like:
//
// ggml_vulkan: 0 = Intel(R) Graphics (...) | uma: 1 | fp16: 1 |
var vulkanUMARegex = regexp.MustCompile(
`ggml_vulkan:\s+(\d+)\s+=.*\|\s+uma:\s+([01])\s+\|`,
)
func parseVulkanUMA(output string) map[int]bool {
integratedByIndex := make(map[int]bool)
scanner := bufio.NewScanner(strings.NewReader(output))
for scanner.Scan() {
if matches := vulkanUMARegex.FindStringSubmatch(scanner.Text()); matches != nil {
idx, _ := strconv.Atoi(matches[1])
integratedByIndex[idx] = matches[2] == "1"
}
}
return integratedByIndex
}
var errWindowsVulkanProbeUnsupported = errors.New("windows vulkan probe unsupported")
type vulkanPhysicalDevice struct {
Name string
Integrated bool
}
var probeLlamaServerVulkanDevices = func(_ []string) ([]vulkanPhysicalDevice, error) {
return nil, errWindowsVulkanProbeUnsupported
}
func refineLlamaServerDevices(devices []ml.DeviceInfo, libDirs []string) []ml.DeviceInfo {
devices = refineLinuxROCmDevices(devices)
return refineWindowsVulkanDevices(devices, libDirs)
}
func refineWindowsVulkanDevices(devices []ml.DeviceInfo, libDirs []string) []ml.DeviceInfo {
if runtime.GOOS != "windows" {
return devices
}
var vulkanIndexes []int
for i, device := range devices {
if device.Library != "Vulkan" {
continue
}
vulkanIndexes = append(vulkanIndexes, i)
}
if len(vulkanIndexes) == 0 {
return devices
}
probed, err := probeLlamaServerVulkanDevices(libDirs)
if err != nil {
if !errors.Is(err, errWindowsVulkanProbeUnsupported) {
slog.Debug("windows vulkan device refinement unavailable", "error", err)
}
return devices
}
if !applyWindowsVulkanRefinement(devices, probed) {
return devices
}
return devices
}
func applyWindowsVulkanRefinement(devices []ml.DeviceInfo, probed []vulkanPhysicalDevice) bool {
var vulkanIndexes []int
for i, device := range devices {
if device.Library == "Vulkan" {
vulkanIndexes = append(vulkanIndexes, i)
}
}
if len(probed) != len(vulkanIndexes) {
slog.Debug("windows vulkan device refinement skipped: device count mismatch",
"llama_server_count", len(vulkanIndexes), "vulkan_count", len(probed))
return false
}
matches := make([]int, len(vulkanIndexes))
for i := range matches {
matches[i] = -1
}
used := make([]bool, len(probed))
for i, deviceIndex := range vulkanIndexes {
description := devices[deviceIndex].Description
for j, probedDevice := range probed {
if used[j] || !sameVulkanDeviceName(description, probedDevice.Name) {
continue
}
matches[i] = j
used[j] = true
break
}
if matches[i] < 0 {
slog.Debug("windows vulkan device refinement skipped: device name mismatch",
"index", i, "llama_server_name", description)
return false
}
}
for i, probedIndex := range matches {
devices[vulkanIndexes[i]].Integrated = probed[probedIndex].Integrated
}
slog.Debug("windows vulkan device refinement applied", "devices", len(vulkanIndexes))
return true
}
func sameVulkanDeviceName(a, b string) bool {
return ml.SimilarDeviceDescription(a, b)
}
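To make the UMA parsing in `parseVulkanUMA` concrete, here is a small standalone sketch that applies the same regular expression to sample log text. The log lines, including the driver strings in parentheses, are invented for illustration, and `parseUMA` is a hypothetical stand-in rather than the function from this file.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// umaLine uses the same pattern as vulkanUMARegex: it captures the device
// index and the 0/1 value of the "uma" field.
var umaLine = regexp.MustCompile(`ggml_vulkan:\s+(\d+)\s+=.*\|\s+uma:\s+([01])\s+\|`)

// parseUMA maps each Vulkan device index to whether the log reports it as
// unified-memory (integrated). Lines that do not match are ignored.
func parseUMA(output string) map[int]bool {
	integrated := make(map[int]bool)
	for _, line := range strings.Split(output, "\n") {
		m := umaLine.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		idx, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		integrated[idx] = m[2] == "1"
	}
	return integrated
}

func main() {
	// Sample output is made up; only the field layout follows the comment above.
	sample := "ggml_vulkan: 0 = Intel(R) Graphics (example driver) | uma: 1 | fp16: 1 |\n" +
		"ggml_vulkan: 1 = AMD Radeon RX 6800 (example driver) | uma: 0 | fp16: 1 |\n" +
		"some unrelated log line"
	for idx, isIntegrated := range parseUMA(sample) {
		fmt.Printf("device %d integrated=%v\n", idx, isIntegrated)
	}
}
```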


@ -0,0 +1,3 @@
//go:build !windows
package discover


@ -0,0 +1,119 @@
package discover
import (
"fmt"
"unsafe"
"github.com/ollama/ollama/llm"
)
const (
vkSuccess = 0
vkStructureTypeInstanceCreateInfo = 1
vkPhysicalDeviceTypeIntegratedGPU = 1
vkMaxPhysicalDeviceNameSize = 256
vkPhysicalDevicePropertiesByteCount = 4096
)
type vkInstanceCreateInfo struct {
SType uint32
PNext uintptr
Flags uint32
PApplicationInfo uintptr
EnabledLayerCount uint32
PpEnabledLayerNames uintptr
EnabledExtensionCount uint32
PpEnabledExtensionNames uintptr
}
func init() {
probeLlamaServerVulkanDevices = windowsVulkanPhysicalDevices
}
func windowsVulkanPhysicalDevices(libDirs []string) ([]vulkanPhysicalDevice, error) {
vulkanPath, err := llm.WindowsVulkanRuntimeDLLPath(libDirs)
if err != nil {
return nil, err
}
vulkanDLL, err := loadDLLFromPath(vulkanPath)
if err != nil {
return nil, err
}
vkCreateInstanceProc, err := findProc(vulkanDLL, "vkCreateInstance")
if err != nil {
return nil, fmt.Errorf("vkCreateInstance unavailable: %w", err)
}
vkDestroyInstanceProc, err := findProc(vulkanDLL, "vkDestroyInstance")
if err != nil {
return nil, fmt.Errorf("vkDestroyInstance unavailable: %w", err)
}
vkEnumeratePhysicalDevices, err := findProc(vulkanDLL, "vkEnumeratePhysicalDevices")
if err != nil {
return nil, fmt.Errorf("vkEnumeratePhysicalDevices unavailable: %w", err)
}
vkGetPhysicalDeviceProperties, err := findProc(vulkanDLL, "vkGetPhysicalDeviceProperties")
if err != nil {
return nil, fmt.Errorf("vkGetPhysicalDeviceProperties unavailable: %w", err)
}
createInfo := vkInstanceCreateInfo{SType: vkStructureTypeInstanceCreateInfo}
var instance uintptr
result, _, callErr := vkCreateInstanceProc.Call(
uintptr(unsafe.Pointer(&createInfo)),
0,
uintptr(unsafe.Pointer(&instance)),
)
if result != vkSuccess {
return nil, fmt.Errorf("vkCreateInstance failed: result=%d error=%w", result, callErr)
}
defer vkDestroyInstanceProc.Call(instance, 0)
var count uint32
result, _, callErr = vkEnumeratePhysicalDevices.Call(
instance,
uintptr(unsafe.Pointer(&count)),
0,
)
if result != vkSuccess {
return nil, fmt.Errorf("vkEnumeratePhysicalDevices count failed: result=%d error=%w", result, callErr)
}
if count == 0 {
return nil, nil
}
physicalDevices := make([]uintptr, int(count))
result, _, callErr = vkEnumeratePhysicalDevices.Call(
instance,
uintptr(unsafe.Pointer(&count)),
uintptr(unsafe.Pointer(&physicalDevices[0])),
)
if result != vkSuccess {
return nil, fmt.Errorf("vkEnumeratePhysicalDevices failed: result=%d error=%w", result, callErr)
}
devices := make([]vulkanPhysicalDevice, 0, count)
for _, physicalDevice := range physicalDevices[:int(count)] {
properties := make([]byte, vkPhysicalDevicePropertiesByteCount)
vkGetPhysicalDeviceProperties.Call(
physicalDevice,
uintptr(unsafe.Pointer(&properties[0])),
)
deviceType := *(*uint32)(unsafe.Pointer(&properties[16]))
deviceNameBytes := properties[20 : 20+vkMaxPhysicalDeviceNameSize]
devices = append(devices, vulkanPhysicalDevice{
Name: nulTerminatedString(deviceNameBytes),
Integrated: deviceType == vkPhysicalDeviceTypeIntegratedGPU,
})
}
return devices, nil
}
func nulTerminatedString(data []byte) string {
for i, b := range data {
if b == 0 {
return string(data[:i])
}
}
return string(data)
}
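The offsets 16 and 20 used in `windowsVulkanPhysicalDevices` follow from the layout of `VkPhysicalDeviceProperties`: four 32-bit fields (`apiVersion`, `driverVersion`, `vendorID`, `deviceID`) precede `deviceType`, which is followed by the 256-byte `deviceName`. The sketch below decodes those two fields from a raw buffer without `unsafe`. It assumes a little-endian host, which holds for the Windows targets discussed here; `decodeProperties` and the sample device name are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const (
	deviceTypeOffset = 16  // after apiVersion, driverVersion, vendorID, deviceID (4 bytes each)
	deviceNameOffset = 20  // deviceType is a 4-byte enum
	deviceNameSize   = 256 // VK_MAX_PHYSICAL_DEVICE_NAME_SIZE
	integratedGPU    = 1   // VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
)

// decodeProperties extracts the device name and integrated flag from the
// leading bytes of a VkPhysicalDeviceProperties structure.
func decodeProperties(buf []byte) (name string, integrated bool, err error) {
	if len(buf) < deviceNameOffset+deviceNameSize {
		return "", false, fmt.Errorf("buffer too short: %d bytes", len(buf))
	}
	deviceType := binary.LittleEndian.Uint32(buf[deviceTypeOffset:])
	raw := buf[deviceNameOffset : deviceNameOffset+deviceNameSize]
	if i := bytes.IndexByte(raw, 0); i >= 0 {
		raw = raw[:i] // the name is NUL-terminated inside the fixed-size array
	}
	return string(raw), deviceType == integratedGPU, nil
}

func main() {
	// Build a fake properties buffer instead of calling the Vulkan loader.
	buf := make([]byte, 512)
	binary.LittleEndian.PutUint32(buf[deviceTypeOffset:], integratedGPU)
	copy(buf[deviceNameOffset:], "Example iGPU")

	name, integrated, err := decodeProperties(buf)
	fmt.Println(name, integrated, err) // prints "Example iGPU true <nil>"
}
```

The length check is the main difference from a direct pointer cast: a short buffer produces an error instead of an out-of-bounds read.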


@ -398,6 +398,7 @@ curl http://localhost:11434/api/generate -d '{
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"draft_num_predict": 4,
"top_k": 20,
"top_p": 0.9,
"min_p": 0.0,


@ -3,9 +3,11 @@
Install prerequisites:
- [Go](https://go.dev/doc/install)
- C/C++ Compiler e.g. Clang on macOS, [TDM-GCC](https://github.com/jmeubank/tdm-gcc/releases/latest) (Windows amd64) or [llvm-mingw](https://github.com/mstorsjo/llvm-mingw) (Windows arm64), GCC/Clang on Linux.
- [CMake](https://cmake.org/download/) 3.24 or newer
- C/C++ compiler: Clang on macOS, Visual Studio 2022 C++ tools on Windows, or GCC/Clang on Linux
- [Ninja](https://github.com/ninja-build/ninja/releases) in `PATH` is recommended, especially on Windows
Then build and run Ollama from the root directory of the repository:
For pure Go iteration against an existing native payload, run Ollama from the repository root:
```shell
go run . serve
@ -14,53 +16,73 @@ go run . serve
> [!NOTE]
> Ollama includes native code compiled with CGO. From time to time these data structures can change and CGO can get out of sync resulting in unexpected crashes. You can force a full build of the native code by running `go clean -cache` first.
## Native build model
For a fresh checkout, or after changing native code, build from the repository root. On macOS arm64, this builds Metal inference. On all other platforms this builds CPU-only inference. It builds the Go binary at the repository root and installs the native runtime payload under `build/lib/ollama`.
```shell
cmake -B build .
cmake --build build --parallel 8
./ollama serve
```
To install into a standard prefix layout:
```shell
cmake --install build --prefix /path/to/install
```
On all platforms except macOS arm64, GPU backends are not built by default. Select them explicitly:
```shell
cmake -B build . -DOLLAMA_LLAMA_BACKENDS="cuda_v13;vulkan"
cmake --build build --parallel 8
```
Supported backend values are `cuda_v12`, `cuda_v13`, `rocm_v7_1`, `rocm_v7_2`, `vulkan`, `cuda_jetpack5`, and `cuda_jetpack6`.
Use standard CMake architecture overrides to narrow GPU builds for local hardware:
```shell
# CUDA
cmake -B build . -DOLLAMA_LLAMA_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=native
# ROCm / HIP
cmake -B build . -DOLLAMA_LLAMA_BACKENDS=rocm_v7_2 -DCMAKE_HIP_ARCHITECTURES=gfx1100
```
You can tune GGML build options by setting `GGML_*` values during configure. For example, to build CUDA v12 for Pascal without flash attention kernels:
```shell
cmake -B build . -DOLLAMA_LLAMA_BACKENDS=cuda_v12 -DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FA=OFF
```
## macOS (Apple Silicon)
macOS Apple Silicon supports Metal which is built-in to the Ollama binary. No additional steps are required.
Additional prerequisites:
## macOS (Intel)
Install prerequisites:
- [CMake](https://cmake.org/download/) or `brew install cmake`
Then, configure and build the project:
MLX Metal requires the Metal toolchain. Install [Xcode](https://developer.apple.com/xcode/) first, then:
```shell
cmake -B build
cmake --build build
```
Lastly, run Ollama:
```shell
go run . serve
xcodebuild -downloadComponent MetalToolchain
```
## Windows
Install prerequisites:
Additional prerequisites:
- [CMake](https://cmake.org/download/)
- [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/) including the Native Desktop Workload
- (Optional) AMD GPU support
- [ROCm](https://rocm.docs.amd.com/en/latest/)
- [Ninja](https://github.com/ninja-build/ninja/releases)
- (Optional) NVIDIA GPU support
- [CUDA SDK](https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=11&target_type=exe_network)
- (Optional) VULKAN GPU support
- [VULKAN SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
- [CUDA SDK](https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_type=exe_network)
- (Optional) Vulkan GPU support
- [Vulkan SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
- (Optional) MLX engine support
- [CUDA 13+ SDK](https://developer.nvidia.com/cuda-downloads)
- [cuDNN 9+](https://developer.nvidia.com/cudnn)
Then, configure and build the project:
```shell
cmake -B build
cmake --build build --config Release
```
For Ninja builds, run CMake from a Developer PowerShell/Command Prompt or another shell where the Visual Studio compiler is available.
> Building for Vulkan requires the `VULKAN_SDK` environment variable:
>
@ -73,36 +95,20 @@ cmake --build build --config Release
> set VULKAN_SDK=C:\VulkanSDK\<version>
> ```
> [!IMPORTANT]
> Building for ROCm requires additional flags:
> ```
> cmake -B build -G Ninja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
> cmake --build build --config Release
> ```
Lastly, run Ollama:
```shell
go run . serve
```
## Windows (ARM)
Windows ARM does not support additional acceleration libraries at this time. Do not use cmake, simply `go run` or `go build`.
Windows ARM does not support additional acceleration libraries at this time.
## Linux
Install prerequisites:
Additional prerequisites:
- [CMake](https://cmake.org/download/) or `sudo apt install cmake` or `sudo dnf install cmake`
- (Optional) AMD GPU support
- [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
- (Optional) NVIDIA GPU support
- [CUDA SDK](https://developer.nvidia.com/cuda-downloads)
- (Optional) VULKAN GPU support
- [VULKAN SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
- (Optional) Vulkan GPU support
- [Vulkan SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
- Or install via package manager: `sudo apt install vulkan-sdk` (Ubuntu/Debian) or `sudo dnf install vulkan-sdk` (Fedora/CentOS)
- (Optional) MLX engine support
- [CUDA 13+ SDK](https://developer.nvidia.com/cuda-downloads)
@ -111,57 +117,17 @@ Install prerequisites:
> [!IMPORTANT]
> Ensure prerequisites are in `PATH` before running CMake.
Then, configure and build the project:
```shell
cmake -B build
cmake --build build
```
Lastly, run Ollama:
```shell
go run . serve
```
## MLX Engine (Optional)
The MLX engine enables running safetensor based models. It requires building the [MLX](https://github.com/ml-explore/mlx) and [MLX-C](https://github.com/ml-explore/mlx-c) shared libraries separately via CMake. On MacOS, MLX leverages the Metal library to run on the GPU, and on Windows and Linux, runs on NVIDIA GPUs via CUDA v13.
The MLX engine enables running safetensor-based models. On macOS arm64, MLX is enabled by default. On other platforms, MLX backends are selected with `OLLAMA_MLX_BACKENDS`.
### macOS (Apple Silicon)
Requires the Metal toolchain. Install [Xcode](https://developer.apple.com/xcode/) first, then:
```shell
xcodebuild -downloadComponent MetalToolchain
```
Verify it's installed correctly (should print "no input files"):
```shell
xcrun metal
```
Then build:
```shell
cmake -B build --preset MLX
cmake --build build --preset MLX --parallel
cmake --install build --component MLX
```
> [!NOTE]
> Without the Metal toolchain, cmake will silently complete with Metal disabled. Check the cmake output for `Setting MLX_BUILD_METAL=OFF` which indicates the toolchain is missing.
### Windows / Linux (CUDA)
### CUDA
Requires CUDA 13+ and [cuDNN](https://developer.nvidia.com/cudnn) 9+.
```shell
cmake -B build --preset "MLX CUDA 13"
cmake --build build --target mlx --target mlxc --config Release --parallel
cmake --install build --component MLX --strip
cmake -B build . -DOLLAMA_MLX_BACKENDS=cuda_v13
cmake --build build --parallel 8
```
### Local MLX source overrides
@ -173,17 +139,20 @@ export OLLAMA_MLX_SOURCE=/path/to/mlx
export OLLAMA_MLX_C_SOURCE=/path/to/mlx-c
```
For example, using the helper scripts with local mlx and mlx-c repos:
```shell
OLLAMA_MLX_SOURCE=../mlx OLLAMA_MLX_C_SOURCE=../mlx-c ./scripts/build_linux.sh
On macOS arm64:
OLLAMA_MLX_SOURCE=../mlx OLLAMA_MLX_C_SOURCE=../mlx-c ./scripts/build_darwin.sh
```shell
OLLAMA_MLX_SOURCE=../mlx OLLAMA_MLX_C_SOURCE=../mlx-c cmake -B build .
cmake --build build --parallel 8
```
For CUDA:
```powershell
$env:OLLAMA_MLX_SOURCE="../mlx"
$env:OLLAMA_MLX_C_SOURCE="../mlx-c"
./scripts/build_darwin.ps1
cmake -B build . -DOLLAMA_MLX_BACKENDS=cuda_v13
cmake --build build --parallel 8
```
## Docker
@ -208,11 +177,11 @@ go test ./...
## Library detection
Ollama looks for acceleration libraries in the following paths relative to the `ollama` executable:
Ollama looks for native helper binaries and acceleration libraries in installed and local development layouts:
* `./lib/ollama` (Windows)
* `../lib/ollama` (Linux)
* `.` (macOS)
* `build/lib/ollama` (for development)
* `../lib/ollama` for standard installs where `ollama` is under `bin/`
* `./lib/ollama` for Windows release-style payloads and local dist output
* `.` for macOS release artifacts that colocate helpers with `ollama`
* `build/lib/ollama` and `dist/<platform>/lib/ollama` for local development builds
If the libraries are not found, Ollama will not run with any acceleration libraries.
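As an illustration of the layouts listed above, the following sketch builds candidate directories relative to the directory that contains the `ollama` executable. It is not Ollama's lookup code: `candidateLibDirs` is a hypothetical helper, resolving `build/lib/ollama` against the executable directory is an assumption, the order is simply the order of the list, and the `dist/<platform>/lib/ollama` layout is left out because the platform name is not specified here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// candidateLibDirs (hypothetical) returns the locations from the list above,
// resolved against the directory that holds the ollama executable.
func candidateLibDirs(exeDir string) []string {
	return []string{
		filepath.Join(exeDir, "..", "lib", "ollama"),    // standard install with ollama under bin/
		filepath.Join(exeDir, "lib", "ollama"),          // Windows release-style payloads, local dist output
		exeDir,                                          // macOS artifacts with helpers next to ollama
		filepath.Join(exeDir, "build", "lib", "ollama"), // local development build (assumed exe-relative)
	}
}

func main() {
	for _, dir := range candidateLibDirs(filepath.Join("/opt", "ollama", "bin")) {
		fmt.Println(dir)
	}
}
```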


@ -70,12 +70,16 @@ docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 114
## Vulkan Support
Vulkan is bundled into the `ollama/ollama` image.
Vulkan is bundled into the `ollama/ollama` image and is enabled by default when
the container can access the GPU devices.
```shell
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_VULKAN=1 --name ollama ollama/ollama
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
Use `OLLAMA_VULKAN=0` to disable Vulkan, or `GGML_VK_VISIBLE_DEVICES=<ids>` to
select specific Vulkan devices.
## Run model locally
@ -88,4 +92,3 @@ docker exec -it ollama ollama run llama3.2
## Try different models
More models can be found on the [Ollama library](https://ollama.com/library).


@ -4,6 +4,7 @@ title: Hardware support
## Nvidia
Ollama supports Nvidia GPUs with compute capability 5.0+ and driver version 531 and newer.
Nvidia GPUs with compute capability 5.0 through 6.2 require driver version 570 or newer.
Check your compute compatibility to see if your card is supported:
[https://developer.nvidia.com/cuda-gpus](https://developer.nvidia.com/cuda-gpus)
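The two driver requirements above can be written as a small lookup. This is only a restatement of the documented rule, not a check that Ollama performs; `minDriverVersion` is a hypothetical name.

```go
package main

import "fmt"

// minDriverVersion returns the minimum NVIDIA driver version stated above for
// a compute capability given as major.minor. The second result is false when
// the GPU is below compute capability 5.0 and therefore unsupported.
func minDriverVersion(major, minor int) (int, bool) {
	cc := major*10 + minor
	switch {
	case cc < 50:
		return 0, false
	case cc <= 62:
		return 570, true // compute capability 5.0 through 6.2
	default:
		return 531, true
	}
}

func main() {
	for _, cc := range [][2]int{{3, 5}, {6, 1}, {8, 6}} {
		version, ok := minDriverVersion(cc[0], cc[1])
		fmt.Printf("compute %d.%d: supported=%v min driver=%d\n", cc[0], cc[1], ok, version)
	}
}
```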
@ -75,7 +76,7 @@ using the `amdgpu-install` utility from
### Windows Support
With ROCm v6.1, the following GPUs are supported on Windows.
Ollama requires an AMD ROCm v7 / HIP7-capable driver stack on Windows.
| Family | Cards and accelerators |
| -------------- | -------------------------------------------------------------------------------------------------------------------- |
@ -142,12 +143,9 @@ Ollama supports GPU acceleration on Apple devices via the Metal API.
## Vulkan GPU Support
> **NOTE:**
> Vulkan is currently an Experimental feature. To enable, you must set OLLAMA_VULKAN=1 for the Ollama server as
described in the [FAQ](faq#how-do-i-configure-ollama-server)
Additional GPU support on Windows and Linux is provided via
[Vulkan](https://www.vulkan.org/). On Windows most GPU vendors drivers come
[Vulkan](https://www.vulkan.org/). Vulkan is enabled by default when the
backend is installed. On Windows, most GPU vendors' drivers come
bundled with Vulkan support and require no additional setup steps. Most Linux
distributions require installing additional components, and you may have
multiple options for Vulkan drivers between Mesa and GPU Vendor specific packages
@ -173,4 +171,9 @@ To select specific Vulkan GPU(s), you can set the environment variable
`GGML_VK_VISIBLE_DEVICES` to one or more numeric IDs on the Ollama server as
described in the [FAQ](faq#how-do-i-configure-ollama-server). If you
encounter any problems with Vulkan based GPUs, you can disable all Vulkan GPUs
by setting `GGML_VK_VISIBLE_DEVICES=-1`
by setting `OLLAMA_VULKAN=0` or `GGML_VK_VISIBLE_DEVICES=-1`.
On mixed iGPU/dGPU systems where the Vulkan iGPU is unstable, keep Vulkan
enabled and set `GGML_VK_VISIBLE_DEVICES` to the discrete GPU index. For
example, use `GGML_VK_VISIBLE_DEVICES=1` when `Vulkan1` is the discrete
GPU.

@ -157,6 +157,7 @@ PARAMETER <parameter> <parametervalue>
| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
| stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" |
| num_predict | Maximum number of tokens to predict when generating text. (Default: -1, infinite generation) | int | num_predict 42 |
| draft_num_predict | Maximum number of speculative draft tokens to predict per step when a draft model is available. Separate draft models default to 4; embedded MTP tensors require setting this parameter. Set to 0 to disable speculative drafting. | int | draft_num_predict 4 |
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
| top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9) | float | top_p 0.9 |
| min_p | Alternative to top_p that aims to ensure a balance of quality and variety. The parameter _p_ represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with _p_=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. (Default: 0.0) | float | min_p 0.05 |

@ -12,10 +12,18 @@ terminal application. As usual the Ollama [API](/api) will be served on
- Windows 10 22H2 or newer, Home or Pro
- NVIDIA 452.39 or newer Drivers if you have an NVIDIA card
- AMD Radeon Driver https://www.amd.com/en/support if you have a Radeon card
- AMD ROCm v7 / HIP7-capable driver stack for ROCm acceleration, or a Vulkan-capable AMD Radeon driver for Vulkan acceleration
Ollama uses unicode characters for progress indication, which may render as unknown squares in some older terminal fonts in Windows 10. If you see this, try changing your terminal font settings.
<Note>
Some RDNA2 / Radeon RX 6000 systems, including RX 6800-class cards, may not
expose ROCm v7 on current Windows AMD drivers. Vulkan is enabled by default
and is the recommended fallback for those systems. If a mixed iGPU/dGPU
system selects an unstable Vulkan iGPU, set `GGML_VK_VISIBLE_DEVICES` to the
discrete GPU index.
</Note>
## Filesystem Requirements
The Ollama install does not require Administrator, and installs in your home directory by default. You'll need at least 4GB of space for the binary install. Once you've installed Ollama, you'll need additional space for storing the Large Language models, which can be tens to hundreds of GB in size. If your home directory doesn't have enough space, you can change where the binaries are installed, and where the models are stored.

@ -214,6 +214,8 @@ func LogLevel() slog.Level {
var (
// FlashAttention enables the experimental flash attention feature.
FlashAttention = BoolWithDefault("OLLAMA_FLASH_ATTENTION")
// GoTemplate enables Modelfile TEMPLATE rendering when a model has one.
GoTemplate = BoolWithDefault("OLLAMA_GO_TEMPLATE")
// DebugLogRequests logs inference requests to disk for replay/debugging.
DebugLogRequests = Bool("OLLAMA_DEBUG_LOG_REQUESTS")
// KvCacheType is the quantization type for the K/V cache.
@ -224,16 +226,14 @@ var (
NoPrune = Bool("OLLAMA_NOPRUNE")
// SchedSpread allows scheduling models across all GPUs.
SchedSpread = Bool("OLLAMA_SCHED_SPREAD")
// MultiUserCache optimizes prompt caching for multi-user scenarios
MultiUserCache = Bool("OLLAMA_MULTIUSER_CACHE")
// Enable the new Ollama engine
NewEngine = Bool("OLLAMA_NEW_ENGINE")
// ContextLength sets the default context length
ContextLength = Uint("OLLAMA_CONTEXT_LENGTH", 0)
// Auth enables authentication between the Ollama client and server
UseAuth = Bool("OLLAMA_AUTH")
// Enable Vulkan backend
EnableVulkan = Bool("OLLAMA_VULKAN")
// EnableVulkan controls Vulkan backend discovery.
EnableVulkan = BoolWithDefault("OLLAMA_VULKAN")
// EnableIntegratedGPU controls whether integrated GPUs may be selected.
EnableIntegratedGPU = BoolWithDefault("OLLAMA_IGPU_ENABLE")
// NoCloudEnv checks the OLLAMA_NO_CLOUD environment variable.
NoCloudEnv = Bool("OLLAMA_NO_CLOUD")
)
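The hunk above switches `OLLAMA_VULKAN` from `Bool` to `BoolWithDefault`, and the `AsMap` hunk later calls it as `EnableVulkan(true)`. The following is a minimal sketch of how such a getter could behave, not the actual envconfig implementation: `parseBoolOr` and `boolWithDefault` are hypothetical names, and treating an unparsable value as true is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseBoolOr interprets raw as a boolean and returns def when raw is empty.
// Treating unparsable input as true is an assumption of this sketch.
func parseBoolOr(raw string, def bool) bool {
	if raw == "" {
		return def
	}
	b, err := strconv.ParseBool(raw)
	if err != nil {
		return true
	}
	return b
}

// boolWithDefault (hypothetical) returns a getter that reads key from the
// environment each time it is called, so the caller chooses the default.
func boolWithDefault(key string) func(def bool) bool {
	return func(def bool) bool {
		return parseBoolOr(os.Getenv(key), def)
	}
}

func main() {
	enableVulkan := boolWithDefault("OLLAMA_VULKAN")
	// With the variable unset this reports the caller's default (true);
	// OLLAMA_VULKAN=0 turns it off.
	fmt.Println("vulkan enabled:", enableVulkan(true))
}
```

Passing the default at the call site lets one variable be on by default in one place and off in another, which matches the `GoTemplate(true)` and `FlashAttention(false)` calls in the same diff.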
@ -312,9 +312,13 @@ func AsMap() map[string]EnvVar {
ret := map[string]EnvVar{
"OLLAMA_DEBUG": {"OLLAMA_DEBUG", LogLevel(), "Show additional debug information (e.g. OLLAMA_DEBUG=1)"},
"OLLAMA_DEBUG_LOG_REQUESTS": {"OLLAMA_DEBUG_LOG_REQUESTS", DebugLogRequests(), "Log inference request bodies and replay curl commands to a temp directory"},
"OLLAMA_GO_TEMPLATE": {"OLLAMA_GO_TEMPLATE", GoTemplate(true), "Enable Modelfile TEMPLATE based rendering when available"},
"OLLAMA_FLASH_ATTENTION": {"OLLAMA_FLASH_ATTENTION", FlashAttention(false), "Enabled flash attention"},
"OLLAMA_KV_CACHE_TYPE": {"OLLAMA_KV_CACHE_TYPE", KvCacheType(), "Quantization type for the K/V cache (default: f16)"},
"OLLAMA_GPU_OVERHEAD": {"OLLAMA_GPU_OVERHEAD", GpuOverhead(), "Reserve a portion of VRAM per GPU (bytes)"},
"OLLAMA_IGPU_ENABLE": {"OLLAMA_IGPU_ENABLE", String("OLLAMA_IGPU_ENABLE")(), "Enable integrated GPUs"},
"LLAMA_ARG_FIT": {"LLAMA_ARG_FIT", String("LLAMA_ARG_FIT")(), "Enable llama.cpp automatic fit of unset memory options (default \"on\")"},
"LLAMA_ARG_FIT_TARGET": {"LLAMA_ARG_FIT_TARGET", String("LLAMA_ARG_FIT_TARGET")(), "Target free VRAM margin per device for llama.cpp fit (MiB)"},
"OLLAMA_HOST": {"OLLAMA_HOST", Host(), "IP Address for the ollama server (default 127.0.0.1:11434)"},
"OLLAMA_KEEP_ALIVE": {"OLLAMA_KEEP_ALIVE", KeepAlive(), "The duration that models stay loaded in memory (default \"5m\")"},
"OLLAMA_LLM_LIBRARY": {"OLLAMA_LLM_LIBRARY", LLMLibrary(), "Set LLM library to bypass autodetection"},
@ -329,10 +333,8 @@ func AsMap() map[string]EnvVar {
"OLLAMA_NUM_PARALLEL": {"OLLAMA_NUM_PARALLEL", NumParallel(), "Maximum number of parallel requests"},
"OLLAMA_ORIGINS": {"OLLAMA_ORIGINS", AllowedOrigins(), "A comma separated list of allowed origins"},
"OLLAMA_SCHED_SPREAD": {"OLLAMA_SCHED_SPREAD", SchedSpread(), "Always schedule model across all GPUs"},
"OLLAMA_MULTIUSER_CACHE": {"OLLAMA_MULTIUSER_CACHE", MultiUserCache(), "Optimize prompt caching for multi-user scenarios"},
"OLLAMA_CONTEXT_LENGTH": {"OLLAMA_CONTEXT_LENGTH", ContextLength(), "Context length to use unless otherwise specified (default: 4k/32k/256k based on VRAM)"},
"OLLAMA_EDITOR": {"OLLAMA_EDITOR", Editor(), "Path to editor for interactive prompt editing (Ctrl+G)"},
"OLLAMA_NEW_ENGINE": {"OLLAMA_NEW_ENGINE", NewEngine(), "Enable the new Ollama engine"},
"OLLAMA_REMOTES": {"OLLAMA_REMOTES", Remotes(), "Allowed hosts for remote models (default \"ollama.com\")"},
// Informational
@ -355,7 +357,7 @@ func AsMap() map[string]EnvVar {
ret["GGML_VK_VISIBLE_DEVICES"] = EnvVar{"GGML_VK_VISIBLE_DEVICES", VkVisibleDevices(), "Set which Vulkan devices are visible by numeric ID"}
ret["GPU_DEVICE_ORDINAL"] = EnvVar{"GPU_DEVICE_ORDINAL", GpuDeviceOrdinal(), "Set which AMD devices are visible by numeric ID"}
ret["HSA_OVERRIDE_GFX_VERSION"] = EnvVar{"HSA_OVERRIDE_GFX_VERSION", HsaOverrideGfxVersion(), "Override the gfx used for all detected AMD GPUs"}
ret["OLLAMA_VULKAN"] = EnvVar{"OLLAMA_VULKAN", EnableVulkan(), "Enable experimental Vulkan support"}
ret["OLLAMA_VULKAN"] = EnvVar{"OLLAMA_VULKAN", EnableVulkan(true), "Enable Vulkan support"}
}
return ret

@ -424,8 +424,12 @@ func (t TensorType) BlockSize() uint64 {
TensorTypeQ8_0,
TensorTypeQ8_1,
tensorTypeIQ4_NL,
4, TensorTypeMXFP4:
TensorTypeMXFP4:
return 32
case TensorTypeNVFP4:
return 64
case TensorTypeQ1_0:
return 128
default:
return 256
}
@ -497,8 +501,12 @@ func (t TensorType) TypeSize() uint64 {
return blockSize/8 + blockSize/16 + blockSize/32
case TensorTypeBF16:
return 2
case 4, TensorTypeMXFP4:
case TensorTypeMXFP4:
return 1 + blockSize/2
case TensorTypeNVFP4:
return 4 + blockSize/2
case TensorTypeQ1_0:
return 2 + blockSize/8
default:
return 0
}
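As a worked check of the arithmetic in the two hunks above, the sketch below plugs the `BlockSize` values (32, 64, 128) into the `TypeSize` expressions for MXFP4, NVFP4, and Q1_0 and derives the average bits per weight. The `blockLayout` type and helper names are hypothetical and exist only for this illustration; `tensorBytes` assumes the element count is a multiple of the block size.

```go
package main

import "fmt"

// blockLayout (hypothetical) pairs a block size with its encoded size.
type blockLayout struct {
	name      string
	blockSize uint64 // elements per block
	typeSize  uint64 // bytes per block
}

// layouts reproduces the expressions from the diff:
// MXFP4: 1 + blockSize/2, NVFP4: 4 + blockSize/2, Q1_0: 2 + blockSize/8.
func layouts() []blockLayout {
	return []blockLayout{
		{"MXFP4", 32, 1 + 32/2},
		{"NVFP4", 64, 4 + 64/2},
		{"Q1_0", 128, 2 + 128/8},
	}
}

// bitsPerWeight is the average storage cost of a single element.
func bitsPerWeight(l blockLayout) float64 {
	return float64(l.typeSize*8) / float64(l.blockSize)
}

// tensorBytes assumes elements is a multiple of the block size.
func tensorBytes(l blockLayout, elements uint64) uint64 {
	return elements / l.blockSize * l.typeSize
}

func main() {
	for _, l := range layouts() {
		fmt.Printf("%-6s block=%d bytes/block=%d bits/weight=%.3f\n",
			l.name, l.blockSize, l.typeSize, bitsPerWeight(l))
	}
}
```

The per-block byte counts come out to 17, 36, and 18, so the fixed per-block overhead adds 0.25, 0.5, and 0.125 bits per weight on top of the packed 4-bit or 1-bit values.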

@ -14,9 +14,9 @@ const (
FileTypeF16
fileTypeQ4_0
fileTypeQ4_1
fileTypeMXFP4 // originally fileTypeQ4_1_F16 // unused by GGML
fileTypeQ4_2 // unused by GGML
fileTypeQ4_3 // unused by GGML
fileTypeQ4_1_F16 // removed from GGUF files
fileTypeQ4_2 // removed from GGUF files
fileTypeQ4_3 // removed from GGUF files
FileTypeQ8_0
fileTypeQ5_0
fileTypeQ5_1
@ -48,6 +48,9 @@ const (
fileTypeQ4_0_8_8 // unused by GGML
fileTypeTQ1_0
fileTypeTQ2_0
fileTypeMXFP4_MOE
fileTypeNVFP4
fileTypeQ1_0
FileTypeUnknown = 1024
)
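The hunk above keeps `fileTypeQ4_1_F16`, `fileTypeQ4_2`, and `fileTypeQ4_3` in the constant block even though they are no longer valid file types. A shortened sketch of why that matters, using hypothetical `ft*` names rather than the package's own: with `iota`, deleting a placeholder would shift every later value, so the slots stay and simply have no `String` case.

```go
package main

import "fmt"

// fileType is a hypothetical, shortened stand-in for the FileType enum.
type fileType uint32

const (
	ftF32      fileType = iota // 0
	ftF16                      // 1
	ftQ4_0                     // 2
	ftQ4_1                     // 3
	ftQ4_1_F16                 // 4, placeholder for a removed type
	ftQ4_2                     // 5, placeholder for a removed type
	ftQ4_3                     // 6, placeholder for a removed type
	ftQ8_0                     // 7, stays at 7 only because 4-6 are reserved
)

// String names the valid types; placeholders fall through to "unknown".
func (t fileType) String() string {
	switch t {
	case ftF32:
		return "F32"
	case ftF16:
		return "F16"
	case ftQ4_0:
		return "Q4_0"
	case ftQ4_1:
		return "Q4_1"
	case ftQ8_0:
		return "Q8_0"
	default:
		return "unknown"
	}
}

func main() {
	for _, v := range []fileType{3, 4, 7} {
		fmt.Printf("fileType(%d) = %s\n", uint32(v), v)
	}
}
```

This mirrors the expectations in the new `type_test.go`, where value 7 maps to `Q8_0` and values 4 through 6 map to `unknown`.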
@ -97,8 +100,6 @@ func (t FileType) String() string {
return "Q4_0"
case fileTypeQ4_1:
return "Q4_1"
case fileTypeMXFP4:
return "MXFP4"
case FileTypeQ8_0:
return "Q8_0"
case fileTypeQ5_0:
@ -123,10 +124,44 @@ func (t FileType) String() string {
return "Q5_K_M"
case fileTypeQ6_K:
return "Q6_K"
case fileTypeIQ2_XXS:
return "IQ2_XXS"
case fileTypeIQ2_XS:
return "IQ2_XS"
case fileTypeQ2_K_S:
return "Q2_K_S"
case fileTypeIQ3_XS:
return "IQ3_XS"
case fileTypeIQ3_XXS:
return "IQ3_XXS"
case fileTypeIQ1_S:
return "IQ1_S"
case fileTypeIQ4_NL:
return "IQ4_NL"
case fileTypeIQ3_S:
return "IQ3_S"
case fileTypeIQ3_M:
return "IQ3_M"
case fileTypeIQ2_S:
return "IQ2_S"
case fileTypeIQ2_M:
return "IQ2_M"
case fileTypeIQ4_XS:
return "IQ4_XS"
case fileTypeIQ1_M:
return "IQ1_M"
case FileTypeBF16:
return "BF16"
case fileTypeTQ1_0:
return "TQ1_0"
case fileTypeTQ2_0:
return "TQ2_0"
case fileTypeMXFP4_MOE:
return "MXFP4_MOE"
case fileTypeNVFP4:
return "NVFP4"
case fileTypeQ1_0:
return "Q1_0"
default:
return "unknown"
}
@ -170,12 +205,40 @@ func (ftype FileType) ToTensorType() TensorType {
return TensorTypeQ5_K
case fileTypeQ6_K:
return TensorTypeQ6_K
case fileTypeIQ2_XXS:
return tensorTypeIQ2_XXS
case fileTypeIQ2_XS:
return tensorTypeIQ2_XS
case fileTypeQ2_K_S:
return TensorTypeQ2_K
case fileTypeIQ3_XS:
return tensorTypeIQ3_S
case fileTypeIQ3_XXS:
return tensorTypeIQ3_XXS
case fileTypeIQ1_S:
return tensorTypeIQ1_S
case fileTypeIQ4_NL:
return tensorTypeIQ4_NL
case fileTypeIQ3_S, fileTypeIQ3_M:
return tensorTypeIQ3_S
case fileTypeIQ2_S, fileTypeIQ2_M:
return tensorTypeIQ2_S
case fileTypeIQ4_XS:
return tensorTypeIQ4_XS
case fileTypeIQ1_M:
return tensorTypeIQ1_M
case FileTypeBF16:
return TensorTypeBF16
case fileTypeMXFP4:
case fileTypeTQ1_0:
return tensorTypeTQ1_0
case fileTypeTQ2_0:
return tensorTypeTQ2_0
case fileTypeMXFP4_MOE:
return TensorTypeMXFP4
case fileTypeNVFP4:
return TensorTypeNVFP4
case fileTypeQ1_0:
return TensorTypeQ1_0
default:
slog.Warn("unsupported file type", "type", ftype)
return 0 // F32
@ -227,6 +290,8 @@ const (
tensorTypeIQ4_NL_4_8 // unused by GGML
tensorTypeIQ4_NL_8_8 // unused by GGML
TensorTypeMXFP4
TensorTypeNVFP4
TensorTypeQ1_0
)
// ParseTensorType parses the provided GGUF tensor type
@ -315,12 +380,46 @@ func (t TensorType) String() string {
return "Q6_K"
case TensorTypeQ8_K:
return "Q8_K"
case tensorTypeIQ2_XXS:
return "IQ2_XXS"
case tensorTypeIQ2_XS:
return "IQ2_XS"
case tensorTypeIQ3_XXS:
return "IQ3_XXS"
case tensorTypeIQ1_S:
return "IQ1_S"
case tensorTypeIQ4_NL:
return "IQ4_NL"
case tensorTypeIQ3_S:
return "IQ3_S"
case tensorTypeIQ2_S:
return "IQ2_S"
case tensorTypeIQ4_XS:
return "IQ4_XS"
case TensorTypeI8:
return "I8"
case TensorTypeI16:
return "I16"
case TensorTypeI32:
return "I32"
case TensorTypeI64:
return "I64"
case TensorTypeF64:
return "F64"
case tensorTypeIQ1_M:
return "IQ1_M"
case TensorTypeBF16:
return "BF16"
case 4, TensorTypeMXFP4:
case tensorTypeTQ1_0:
return "TQ1_0"
case tensorTypeTQ2_0:
return "TQ2_0"
case TensorTypeMXFP4:
return "MXFP4"
case TensorTypeNVFP4:
return "NVFP4"
case TensorTypeQ1_0:
return "Q1_0"
default:
return "unknown"
}

fs/ggml/type_test.go

@ -0,0 +1,115 @@
package ggml
import "testing"
func TestFileTypeStringMatchesLlamaFType(t *testing.T) {
tests := []struct {
ftype FileType
want string
}{
{0, "F32"},
{1, "F16"},
{2, "Q4_0"},
{3, "Q4_1"},
{7, "Q8_0"},
{8, "Q5_0"},
{9, "Q5_1"},
{10, "Q2_K"},
{11, "Q3_K_S"},
{12, "Q3_K_M"},
{13, "Q3_K_L"},
{14, "Q4_K_S"},
{15, "Q4_K_M"},
{16, "Q5_K_S"},
{17, "Q5_K_M"},
{18, "Q6_K"},
{19, "IQ2_XXS"},
{20, "IQ2_XS"},
{21, "Q2_K_S"},
{22, "IQ3_XS"},
{23, "IQ3_XXS"},
{24, "IQ1_S"},
{25, "IQ4_NL"},
{26, "IQ3_S"},
{27, "IQ3_M"},
{28, "IQ2_S"},
{29, "IQ2_M"},
{30, "IQ4_XS"},
{31, "IQ1_M"},
{32, "BF16"},
{36, "TQ1_0"},
{37, "TQ2_0"},
{38, "MXFP4_MOE"},
{39, "NVFP4"},
{40, "Q1_0"},
{FileTypeUnknown, "unknown"},
}
for _, tt := range tests {
t.Run(tt.want, func(t *testing.T) {
if got := tt.ftype.String(); got != tt.want {
t.Fatalf("FileType(%d).String() = %q, want %q", tt.ftype, got, tt.want)
}
})
}
}
func TestRemovedFileTypesAreUnknown(t *testing.T) {
for _, ftype := range []FileType{4, 5, 6, 33, 34, 35} {
t.Run(ftype.String(), func(t *testing.T) {
if got := ftype.String(); got != "unknown" {
t.Fatalf("FileType(%d).String() = %q, want unknown", ftype, got)
}
})
}
}
func TestTensorTypeStringMatchesGGMLType(t *testing.T) {
tests := []struct {
tt TensorType
want string
}{
{0, "F32"},
{1, "F16"},
{2, "Q4_0"},
{3, "Q4_1"},
{6, "Q5_0"},
{7, "Q5_1"},
{8, "Q8_0"},
{9, "Q8_1"},
{10, "Q2_K"},
{11, "Q3_K"},
{12, "Q4_K"},
{13, "Q5_K"},
{14, "Q6_K"},
{15, "Q8_K"},
{16, "IQ2_XXS"},
{17, "IQ2_XS"},
{18, "IQ3_XXS"},
{19, "IQ1_S"},
{20, "IQ4_NL"},
{21, "IQ3_S"},
{22, "IQ2_S"},
{23, "IQ4_XS"},
{24, "I8"},
{25, "I16"},
{26, "I32"},
{27, "I64"},
{28, "F64"},
{29, "IQ1_M"},
{30, "BF16"},
{34, "TQ1_0"},
{35, "TQ2_0"},
{39, "MXFP4"},
{40, "NVFP4"},
{41, "Q1_0"},
}
for _, tt := range tests {
t.Run(tt.want, func(t *testing.T) {
if got := tt.tt.String(); got != tt.want {
t.Fatalf("TensorType(%d).String() = %q, want %q", tt.tt, got, tt.want)
}
})
}
}

go.mod

@ -24,7 +24,7 @@ require (
github.com/charmbracelet/bubbletea v1.3.10
github.com/charmbracelet/lipgloss v1.1.0
github.com/d4l3k/go-bfloat16 v0.0.0-20211005043715-690c3bdd05f1
github.com/dlclark/regexp2 v1.11.4
github.com/dlclark/regexp2 v1.11.5
github.com/emirpasic/gods/v2 v2.0.0-alpha
github.com/klauspost/compress v1.18.3
github.com/mattn/go-runewidth v0.0.16

go.sum

@ -62,8 +62,8 @@ github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dgryski/trifles v0.0.0-20200323201526-dd97f9abfb48 h1:fRzb/w+pyskVMQ+UbP35JkH8yB7MYb4q/qhBarqZE6g=
github.com/dgryski/trifles v0.0.0-20200323201526-dd97f9abfb48/go.mod h1:if7Fbed8SFyPtHLHbg49SI7NAdJiC5WIA09pe59rfAA=
github.com/dlclark/regexp2 v1.11.4 h1:rPYF9/LECdNymJufQKmri9gV604RvvABwgOA8un7yAo=
github.com/dlclark/regexp2 v1.11.4/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
github.com/dlclark/regexp2 v1.11.5 h1:Q/sSnsKerHeCkc/jSTNq1oCm7KiVgUMZRDUoRu0JQZQ=
github.com/dlclark/regexp2 v1.11.5/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
github.com/emirpasic/gods/v2 v2.0.0-alpha h1:dwFlh8pBg1VMOXWGipNMRt8v96dKAIvBehtCt6OtunU=
github.com/emirpasic/gods/v2 v2.0.0-alpha/go.mod h1:W0y4M2dtBB9U5z3YlghmpuUhiaZT2h6yoeE+C1sCp6A=
github.com/envoyproxy/go-control-plane v0.9.0/go.mod h1:YTl/9mNaCwkRvm6d1a2C3ymFceY/DCBVvsKhRF0iEA4=

@ -461,6 +461,18 @@ func (h *HarmonyMessageHandler) HasThinkingSupport() bool {
return true
}
func (h *HarmonyMessageHandler) PreservedTokens() []string {
// <|call|> is an EOG marker for tool calls. Preserve structural tokens
// used by the parser, but let llama-server stop on the call terminator.
return []string{
"<|start|>",
"<|end|>",
"<|message|>",
"<|channel|>",
"<|constrain|>",
}
}
func (m *FunctionNameMap) ConvertAndAdd(userFunctionName string) string {
harmonyFunctionName := m.deriveName(userFunctionName)
// built-in functions should not be renamed

@ -10,6 +10,10 @@ The integration tests have 2 modes of operating.
1. By default, on Unix systems, they will start the server on a random port, run the tests, and then shutdown the server. On Windows you must ALWAYS run the server on OLLAMA_HOST for the tests to work.
2. If `OLLAMA_TEST_EXISTING` is set to a non-empty string, the tests will run against an existing running server, which can be remote based on your `OLLAMA_HOST` environment variable
Set `OLLAMA_TEST_LOG_SERVER=1` to print the managed server log after each test
run, even when the tests pass. This only applies when the integration test
harness starts the server.
> [!IMPORTANT]
> Before running the tests locally without the "test existing" setting, compile ollama from the top of the source tree `go build .` in addition to GPU support with cmake if applicable on your platform. The integration tests expect to find an ollama binary at the top of the tree.

@ -67,11 +67,11 @@ func TestAudioTranscription(t *testing.T) {
Messages: []api.Message{
{
Role: "system",
Content: "Transcribe the audio exactly as spoken. Output only the transcription.",
Content: "Transcribe the audio exactly as spoken. Output only the spoken words. Do not answer any question in the audio.",
},
{
Role: "user",
Content: "Transcribe this audio.",
Content: "What exact words are spoken in this audio?",
Images: []api.ImageData{audio},
},
},

@ -29,7 +29,7 @@ func TestImageGeneration(t *testing.T) {
testCases := []testCase{
{
imageGenModel: "jmorgan/z-image-turbo",
visionModel: "llama3.2-vision",
visionModel: "qwen2.5vl:3b",
prompt: "A cartoon style llama flying like a superhero through the air with clouds in the background",
expectedWords: []string{"llama", "flying", "cartoon", "cloud", "sky", "superhero", "air", "animal", "camelid"},
},

@ -17,7 +17,7 @@ func TestVisionModels(t *testing.T) {
defaultVisionModels := []string{
"gemma4",
"qwen2.5vl",
"llama3.2-vision",
// "llama3.2-vision", // TODO: re-enable when llama.cpp supports mllama.
"gemma3",
"qwen3-vl:8b",
"qwen3-vl:30b",

@ -39,13 +39,8 @@ func TestModelsChat(t *testing.T) {
slog.Warn("No VRAM info available, testing all models, so larger ones might timeout...")
}
var chatModels []string
if s := os.Getenv("OLLAMA_NEW_ENGINE"); s != "" {
chatModels = append(ollamaEngineChatModels, mlxEngineChatModels...)
} else {
chatModels = append(ollamaEngineChatModels, llamaRunnerChatModels...)
chatModels = append(chatModels, mlxEngineChatModels...)
}
chatModels := append(ollamaEngineChatModels, llamaRunnerChatModels...)
chatModels = append(chatModels, mlxEngineChatModels...)
for _, model := range testModels(chatModels) {
t.Run(model, func(t *testing.T) {

@ -28,7 +28,6 @@ var (
"falcon2:latest", // 2k model
"minicpm-v:latest",
"qwen:latest",
"solar-pro:latest",
}
)
@ -40,11 +39,7 @@ var (
// cat int.log | grep MODEL_PERF_HEADER | head -1| cut -f2- -d: > perf.csv
// cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
func TestModelsPerf(t *testing.T) {
if s := os.Getenv("OLLAMA_NEW_ENGINE"); s != "" {
doModelPerfTest(t, ollamaEngineChatModels)
} else {
doModelPerfTest(t, append(ollamaEngineChatModels, llamaRunnerChatModels...))
}
doModelPerfTest(t, append(ollamaEngineChatModels, llamaRunnerChatModels...))
}
func TestLibraryModelsPerf(t *testing.T) {

@ -46,7 +46,7 @@ var (
// Note: add newer models at the top of the list to test them first
ollamaEngineChatModels = []string{
"nemotron3:33b",
"laguna-xs.2:q4_K_M",
// "laguna-xs.2:q4_K_M", // TODO: re-enable when llama.cpp supports laguna.
"gemma4",
"lfm2.5-thinking",
"ministral-3",
@ -55,7 +55,7 @@ var (
"gemma3n:e2b",
"mistral-small3.2:latest",
"deepseek-r1:1.5b",
"llama3.2-vision:latest",
// "llama3.2-vision:latest", // TODO: re-enable when llama.cpp supports mllama.
"qwen2.5-coder:latest",
"qwen2.5vl:3b",
"qwen3:0.6b", // dense
@ -74,8 +74,8 @@ var (
// failure into a test skip.
mlxEngineChatModels = []string{
"laguna-xs.2:nvfp4",
"qwen3.5:2b-nvfp4", // ~2.5GB, Qwen3_5 arch
"gemma4:e2b-nvfp4", // ~7.1GB, Gemma4 arch (skipped under low VRAM)
"qwen3.5:2b-nvfp4", // ~2.5GB, Qwen3_5 arch
"gemma4:e2b-nvfp4", // ~7.1GB, Gemma4 arch (skipped under low VRAM)
}
llamaRunnerChatModels = []string{
"mistral:latest",
@ -84,7 +84,6 @@ var (
"command-r:latest",
"nemotron-mini:latest",
"phi3.5:latest",
"solar-pro:latest",
"internlm2:latest",
"codellama:latest", // arch=llama
"phi3:latest",
@ -166,7 +165,7 @@ var (
"llama3-gradient",
"llama3-groq-tool-use",
"llama3.1",
"llama3.2-vision",
// "llama3.2-vision", // TODO: re-enable when llama.cpp supports mllama.
"llama3.2",
"llama3.3",
"llama3",
@ -236,7 +235,6 @@ var (
"smallthinker",
"smollm",
"smollm2",
"solar-pro",
"solar",
"sqlcoder",
"stable-beluga",
@ -278,7 +276,7 @@ var (
}
libraryToolsModels = []string{
"nemotron3:33b",
"laguna-xs.2",
// "laguna-xs.2", // TODO: re-enable when llama.cpp supports laguna.
"gemma4",
"lfm2.5-thinking",
"qwen3-vl",
@ -555,7 +553,7 @@ func InitServerConnection(ctx context.Context, t *testing.T) (*api.Client, strin
<-serverDone
slog.Info("terminate complete")
if t.Failed() {
if t.Failed() || os.Getenv("OLLAMA_TEST_LOG_SERVER") != "" {
slog.Warn("SERVER LOG FOLLOWS")
io.Copy(os.Stderr, &serverLog)
slog.Warn("END OF SERVER")

@ -19,7 +19,7 @@ var defaultVisionModels = []string{
"nemotron3:33b",
"gemma4",
"gemma3",
"llama3.2-vision",
// "llama3.2-vision", // TODO: re-enable when llama.cpp supports mllama.
"qwen2.5vl",
"qwen3-vl:8b",
}
@ -116,7 +116,7 @@ func TestVisionMultiTurn(t *testing.T) {
Images: []api.ImageData{abbeyRoad},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}
@ -182,7 +182,7 @@ func TestVisionObjectCounting(t *testing.T) {
Images: []api.ImageData{docs},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}
@ -225,7 +225,7 @@ func TestVisionSceneUnderstanding(t *testing.T) {
Images: []api.ImageData{abbeyRoad},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}
@ -263,7 +263,7 @@ func TestVisionSpatialReasoning(t *testing.T) {
Images: []api.ImageData{docs},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}
@ -299,7 +299,7 @@ func TestVisionDetailRecognition(t *testing.T) {
Images: []api.ImageData{docs},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}
@ -344,7 +344,7 @@ func TestVisionMultiImage(t *testing.T) {
Images: []api.ImageData{abbeyRoad, docs},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}
@ -383,7 +383,7 @@ func TestVisionImageDescription(t *testing.T) {
Images: []api.ImageData{ollamaHome},
},
},
Stream: &stream,
Stream: &stream,
KeepAlive: &api.Duration{Duration: 10 * time.Second},
Options: map[string]any{"temperature": 0.0, "seed": 42},
}

@ -1,55 +0,0 @@
# `llama`
This package provides Go bindings to [llama.cpp](https://github.com/ggerganov/llama.cpp).
## Vendoring
Ollama vendors [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [ggml](https://github.com/ggerganov/llama.cpp/tree/master/ggml/src). While we generally strive to contribute changes back upstream to avoid drift, we carry a small set of patches which are applied to the tracking commit.
If you update the vendoring code, start by running the following command to establish the tracking llama.cpp repo in the `./vendor/` directory.
```shell
make -f Makefile.sync apply-patches
```
### Updating Base Commit
**Pin to new base commit**
To change the base commit, update `FETCH_HEAD` in Makefile.sync.
When updating to a newer base commit, the existing patches may not apply cleanly and require manual merge resolution.
Start by applying the patches. If any of the patches have conflicts, the `git am` will stop at the first failure.
```shell
make -f Makefile.sync apply-patches
```
If there are conflicts, you will see an error message. Resolve the conflicts in `./vendor/`, and continue the patch series with `git am --continue` and rerun `make -f Makefile.sync apply-patches`. Repeat until all patches are successfully applied.
Once all patches are applied, commit the changes to the tracking repository.
```shell
make -f Makefile.sync format-patches sync
```
### Generating Patches
When working on new fixes or features that impact vendored code, use the following model. First get a clean tracking repo with all current patches applied:
```shell
make -f Makefile.sync clean apply-patches
```
Iterate until you're ready to submit PRs. Once your code is ready, commit a change in the `./vendor/` directory, then generate the patches for ollama with
```shell
make -f Makefile.sync format-patches
```
In your `./vendor/` directory, create a branch, and cherry-pick the new commit to that branch, then submit a PR upstream to llama.cpp.
Commit the changes in the ollama repo and submit a PR to Ollama, which will include the vendored code update with your change, along with the patches.
After your PR upstream is merged, follow the **Updating Base Commit** instructions above, however first remove your patch before running `apply-patches` since the new base commit contains your change already.

llama/build-info.cpp

@ -1,4 +0,0 @@
int LLAMA_BUILD_NUMBER = 0;
char const *LLAMA_COMMIT = "ec98e2002";
char const *LLAMA_COMPILER = "";
char const *LLAMA_BUILD_TARGET = "";

@ -1,4 +0,0 @@
int LLAMA_BUILD_NUMBER = 0;
char const *LLAMA_COMMIT = "@FETCH_HEAD@";
char const *LLAMA_COMPILER = "";
char const *LLAMA_BUILD_TARGET = "";

llama/compat/README.md

@ -0,0 +1,137 @@
# llama.cpp compatibility layer
This directory holds a temporary in-process compatibility layer for existing
published Ollama GGUFs whose metadata or tensor layout does not yet match what
llama.cpp expects directly. The layer translates those files in memory at load
time so users do not need to re-pull or re-create models during the transition
to llama-server.
This patch model is intended to be short lived. The target end state is that
published models and newly created models use llama.cpp-compatible metadata and
tensor layouts on disk, and this directory can be removed.
The layer is applied automatically at build time via CMake `FetchContent`'s
`PATCH_COMMAND` for normal fetched builds. If CMake is pointed at a source
override through `FETCHCONTENT_SOURCE_DIR_LLAMA_CPP`, the same patch is applied
during configure. If `OLLAMA_LLAMA_CPP_SOURCE` is set, the patch is
intentionally skipped so a developer can iterate on a local llama.cpp tree.
## Files
- `llama-ollama-compat.h`, `llama-ollama-compat.cpp` - the compatibility
entry points and per-architecture handlers.
- `llama-ollama-compat-util.h`, `llama-ollama-compat-util.cpp` - helpers for
KV edits, tensor renames, skip-prefix tracking, tensor load operations, and
small tensor repacking primitives.
- `llama-cpp-hooks.patch` - small additive call-site edits in llama.cpp files.
It currently touches `src/llama-model-loader.cpp` and `tools/mtmd/clip.cpp`.
- `compat.cmake`, `apply-patch.cmake` - CMake glue and an idempotent patch
applier used by `llama/server/CMakeLists.txt`.
The compatibility source files stay in this directory and are linked into the
fetched llama.cpp targets. The patch file only adds call sites.
## Load-Time Hooks
The layer runs at a small set of loader hook points:
1. Main model constructor: `translate_metadata` inspects the parsed metadata
and mutates the in-memory `gguf_context` and `ggml_context` when a handler
recognizes an existing published model format. It can also request mmap
disablement when a handler needs writable backend buffers for transformed
tensor data.
2. Main model tensor indexing: `should_skip_tensor` hides embedded projector,
vision, audio, MTP, or other tensors that the text loader should not claim.
3. Main model tensor reads: `maybe_load_text_tensor` applies registered
text-side load operations, such as FFN concat or dtype promotion, before
the normal llama.cpp file read. This is wired into both full model loading
and single-tensor reads used by tools such as `llama-quantize`.
4. `mtmd/clip` constructor: `translate_clip_metadata` rewrites a clip-facing
view of monolithic GGUFs into the mmproj form expected by llama.cpp.
5. `mtmd/clip` tensor load loop: `maybe_load_tensor` applies clip-side load
operations, such as F16 to F32 promotion, QKV merge, tensor repack, tensor
split, or zero-fill.
Files that do not match a supported published-model marker are left unchanged.
Setting `OLLAMA_LLAMA_CPP_COMPAT=0` disables the hook bodies for internal
create-time validation and for models that are already known to be
llama.cpp-compatible on disk.
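The tensor-indexing hook in step 2 can be pictured as a prefix filter. The sketch below is written in Go purely for illustration; the real layer is C++, and the `skipSet` type, its methods, and the `v.` / `mm.` prefixes are all hypothetical stand-ins rather than names taken from `llama-ollama-compat-util.cpp`.

```go
package main

import (
	"fmt"
	"strings"
)

// skipSet (hypothetical) tracks tensor-name prefixes that the text loader
// should not claim, such as embedded vision or projector tensors.
type skipSet struct {
	prefixes []string
}

// add registers one prefix; a handler would call this while translating
// metadata for a recognized model.
func (s *skipSet) add(prefix string) {
	s.prefixes = append(s.prefixes, prefix)
}

// shouldSkip reports whether name starts with any registered prefix.
func (s *skipSet) shouldSkip(name string) bool {
	for _, p := range s.prefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

// visibleTensors returns the names the text loader would still index.
func visibleTensors(names []string, s *skipSet) []string {
	var out []string
	for _, n := range names {
		if !s.shouldSkip(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	s := &skipSet{}
	s.add("v.")  // hypothetical vision-tower prefix
	s.add("mm.") // hypothetical projector prefix
	names := []string{
		"token_embd.weight",
		"v.patch_embd.weight",
		"mm.input_projection.weight",
		"blk.0.attn_q.weight",
	}
	fmt.Println(visibleTensors(names, s))
}
```

Because the filter only hides names from the text loader, the same file can still be opened by the clip loader, which applies its own translation to the tensors hidden here.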
## Supported Transformations
This table tracks the dispatch surface. Keep it brief; the handler comments in
`llama-ollama-compat.cpp` are the source of truth for exact KV and tensor maps.
| Internal arch / marker | Text handling | Clip/mmproj handling |
|---|---|---|
| `gemma3` | Normalizes Gemma 3 metadata, tokenizer fields, and embedded vision/projector tensors. | Gemma 3 projector translation. |
| `gemma3` + embedding markers (`embeddinggemma`) | Maps to `gemma-embedding` metadata and fixes embedding dense/norm tensors. | n/a |
| `bert` + Snowflake markers (`snowflake-arctic-embed2`) | Fixes Snowflake Arctic Embed 2 tokenizer metadata. | n/a |
| `gemma3n` | Normalizes tokenizer/EOS metadata, truncates vocab-shaped tensors, and hides unused embedded vision/audio/projector tensors. | n/a |
| `gemma4` | Normalizes tokenizer metadata and hides embedded audio/vision/projector tensors from the text loader. | Gemma 4 projector translation; audio remains disabled. |
| `gptoss` | Maps to `gpt-oss`, copies KVs, injects missing expert FFN metadata, and renames tensors. | n/a |
| `lfm2` | Renames norm tensors and fixes feed-forward metadata. | n/a |
| `olmo3` | Maps to the OLMo2-compatible loader path. | n/a |
| `mistral3` | Fixes RoPE/YaRN metadata and hides embedded vision/projector tensors. | Pixtral-style projector translation. |
| `qwen35`, `qwen35moe` | Fixes Qwen3.5/Qwen3-VL-style text metadata, translates embedded MTP tensors, and hides embedded vision/projector tensors. | Qwen3-VL merger-style projector translation. |
| `qwen3next` | Normalizes hybrid attention KV-head metadata and renames SSM dt tensors to the names expected by llama.cpp. | n/a |
| `qwen25vl` | Maps to `qwen2vl` metadata conventions. | Qwen2.5-VL projector translation. |
| `qwen3vl`, `qwen3vlmoe` | Adds missing Qwen3-VL metadata and hides embedded vision/projector tensors. | Qwen3-VL projector translation, including QKV merge and patch-embedding split/repack. |
| `deepseekocr` | Maps to `deepseek2-ocr`, injects missing OCR/MoE metadata, and hides embedded SAM/vision/projector tensors. | DeepSeek OCR projector translation. |
| `glmocr` | Maps GLM OCR metadata/tensors to the llama.cpp-compatible view. | GLM OCR projector translation. |
| `glm4moelite` | Maps GLM-4.7 Flash MLA metadata to the `deepseek2` path and fixes special-token metadata. | n/a |
| `nemotron_h_moe` | Fixes latent-FFN variants and hides MTP tensors. | n/a |
| `nemotron_h_omni` | Selects the Nemotron text loader and hides audio/vision/projector tensors from the text loader. | Nemotron V2 VL projector translation; audio remains disabled. |
| `llama` with Llama 3 markers | Fixes Llama 3 tokenizer metadata. | n/a |
| `llama4` | Hides embedded vision/projector tensors from the text loader. | Llama 4 projector translation. |
| `clip` projector without `clip.projector_type` | n/a | Defaults LLaVA/BakLLaVA projectors to `clip.projector_type=mlp`. |

Usage:
```sh
llama-server --model /path/to/ollama-blob --mmproj /path/to/ollama-blob
```
Passing the same monolithic GGUF as both `--model` and `--mmproj` works because
each loader applies its own translation.
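
As a rough illustration of how one file can serve both loaders, the sketch below shows a per-loader skip registry that lets a text loader ignore embedded vision tensors. It is not the shim's actual code: `register_skip_prefix`, the `v.` prefix, and the use of `const void *` keys are assumptions made for the example; only the name `should_skip_tensor` comes from the patch hooks.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-loader registry: maps an opaque loader key to the
// tensor-name prefixes that loader should ignore. The key is only compared,
// never dereferenced.
static std::unordered_map<const void *, std::vector<std::string>> g_skip_prefixes;

// Assumed helper: a translation step would call this for each embedded
// tensor family that the text loader must not see.
void register_skip_prefix(const void * loader, const std::string & prefix) {
    g_skip_prefixes[loader].push_back(prefix);
}

// Returns true when `name` starts with any prefix registered for `loader`.
bool should_skip_tensor(const void * loader, const char * name) {
    auto it = g_skip_prefixes.find(loader);
    if (it == g_skip_prefixes.end()) {
        return false; // no translation was registered for this loader
    }
    const std::string n(name);
    for (const auto & p : it->second) {
        if (n.compare(0, p.size(), p) == 0) {
            return true;
        }
    }
    return false;
}
```

A loader with no registered prefixes sees every tensor, so files that need no translation are unaffected in this sketch.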
Additional architectures are added by implementing a `handle_<arch>()` and,
for vision models, `handle_<arch>_clip()` in `llama-ollama-compat.cpp`, then
dispatching them from `translate_metadata` / `translate_clip_metadata`. For
monolithic vision models, also update the `compatClipArches` allowlist in
`llm/llama_server.go` so Ollama passes the main GGUF as `--mmproj`.
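
The handler-plus-dispatch pattern described above might look roughly like the following. This is a simplified sketch, not the real signatures: the actual shim operates on GGUF contexts, whereas here metadata is modeled as a plain string map, and `myarch`, `upstream_arch`, `handle_myarch`, and `translate_metadata_sketch` are all hypothetical names.

```cpp
#include <map>
#include <string>

// Simplified stand-in for GGUF key/value metadata (an assumption for this
// sketch; the real shim works on gguf_context objects).
using metadata = std::map<std::string, std::string>;

// Hypothetical handler for an imaginary architecture "myarch": map it to the
// architecture name the upstream loader understands and add a missing key.
static void handle_myarch(metadata & kv, std::string & arch_name) {
    arch_name = "upstream_arch";
    kv["general.architecture"] = arch_name;
    kv.emplace("upstream_arch.example_key", "1"); // only added if absent
}

// Dispatcher: returns true when a handler translated the metadata, so the
// caller can react (the real hook disables mmap in that case).
bool translate_metadata_sketch(metadata & kv, std::string & arch_name) {
    if (arch_name == "myarch") {
        handle_myarch(kv, arch_name);
        return true;
    }
    return false; // unknown architecture: leave the file untouched
}
```

Keeping each architecture in its own handler means a new model only touches one function and one dispatch line.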
## Regenerating the Patch File
After a llama.cpp bump moves the insertion points, re-apply the edits to a
fresh checkout and run:
```sh
cd /path/to/llama.cpp
git diff -- \
src/llama-model-loader.cpp \
tools/mtmd/clip.cpp \
> /path/to/ollama/llama/compat/llama-cpp-hooks.patch
```
## Implementation Notes
The compatibility code is mostly written against public APIs (`gguf.h`,
`ggml.h`, `ggml-backend.h`). A few operations rely on implementation details
because the public API does not expose equivalent mutators:

| Dependency | Use | Replacement if needed |
|---|---|---|
| Direct writes to `ggml_tensor::type` / `ne[]` / `nb[]` | Post-creation tensor reshape/retype for in-memory translation. | Add public tensor shape/type mutators. |
| `const_cast<char *>(gguf_get_tensor_name(...))` in `rename_tensor` | Renames gguf tensors in place. | Add a public `gguf_rename_tensor` helper. |
| `llama_model_loader` forward declaration from `src/llama-model-loader.h` | Opaque key for per-loader registries. The pointer is never dereferenced. | Replace registry keys with `const void *`. |

Two helpers need extra context:
- `reclaim_slot_as` repurposes an orphaned tensor slot as a synthesized tensor
when a clip handler splits one source tensor into multiple destination
tensors. This is needed because clip metadata loading allocates exactly enough
tensor slots for the source file.
- Load-op registry overrides ignore the caller-provided `file_offset` when a
registered operation exists. The operations capture their own source offsets
at translation time, before renames change tensor names.
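
The load-op override behaviour can be pictured with a minimal sketch like the one below. It assumes the file is an in-memory byte vector and that operations are keyed by the tensor's final name; `register_copy_op` and `maybe_load_sketch` are hypothetical names, and the real hooks take different arguments.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical load operation: fills `dst` with `n` bytes derived from the file.
using load_op = std::function<void(const std::vector<unsigned char> & file,
                                   unsigned char * dst, size_t n)>;

static std::unordered_map<std::string, load_op> g_load_ops;

// Register a plain copy under the tensor's final (post-rename) name. The
// source offset is captured by value here, at translation time.
void register_copy_op(const std::string & final_name, size_t src_offset) {
    g_load_ops[final_name] = [src_offset](const std::vector<unsigned char> & file,
                                          unsigned char * dst, size_t n) {
        std::memcpy(dst, file.data() + src_offset, n);
    };
}

// Returns true if a registered op handled the load. In that case the
// caller-provided file_offset is deliberately ignored.
bool maybe_load_sketch(const std::string & name,
                       const std::vector<unsigned char> & file,
                       size_t file_offset, unsigned char * dst, size_t n) {
    (void) file_offset;
    auto it = g_load_ops.find(name);
    if (it == g_load_ops.end()) {
        return false; // caller falls back to its normal read path
    }
    it->second(file, dst, n);
    return true;
}
```

Capturing the offset inside the operation is what lets a rename happen later without the loader reading from a stale or mismatched location.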


@ -0,0 +1,50 @@
# Idempotent patch applier used by compat.cmake.
#
# Invocation (from a CMake PATCH_COMMAND):
# cmake -DPATCH_FILE=<abs path> -P apply-patch.cmake
#
# The patch is applied in the current working directory (which ExternalProject
# / FetchContent sets to the fetched source's SOURCE_DIR). If the patch is
# already applied (detected via `git apply --reverse --check`), this script
# is a no-op. This makes re-configuring and re-building the project safe.

if(NOT DEFINED PATCH_FILE)
    message(FATAL_ERROR "apply-patch.cmake: PATCH_FILE not set")
endif()
if(NOT EXISTS "${PATCH_FILE}")
    message(FATAL_ERROR "apply-patch.cmake: PATCH_FILE does not exist: ${PATCH_FILE}")
endif()

find_package(Git QUIET REQUIRED)

get_filename_component(_patch_workdir "." ABSOLUTE)
get_filename_component(_git_ceiling "${_patch_workdir}" DIRECTORY)
set(_git_apply_env GIT_CEILING_DIRECTORIES=${_git_ceiling})

# If the patch can be REVERSED cleanly, it's already applied. Skip.
execute_process(
    COMMAND ${CMAKE_COMMAND} -E env ${_git_apply_env}
            ${GIT_EXECUTABLE} apply --reverse --check "${PATCH_FILE}"
    RESULT_VARIABLE _reverse_check
    OUTPUT_QUIET ERROR_QUIET
)
if(_reverse_check EQUAL 0)
    message(STATUS "llama/compat: patch already applied, skipping")
    return()
endif()

# Otherwise, apply forward.
execute_process(
    COMMAND ${CMAKE_COMMAND} -E env ${_git_apply_env}
            ${GIT_EXECUTABLE} apply --whitespace=nowarn "${PATCH_FILE}"
    RESULT_VARIABLE _apply_result
)
if(NOT _apply_result EQUAL 0)
    message(FATAL_ERROR
        "llama/compat: failed to apply ${PATCH_FILE}\n"
        "This usually means the pinned llama.cpp source has changed. "
        "Regenerate the patch (see llama/compat/README.md) against the "
        "pinned LLAMA_CPP_VERSION and retry.")
endif()

message(STATUS "llama/compat: applied patch")

llama/compat/compat.cmake Normal file

@ -0,0 +1,57 @@
# llama.cpp compatibility shim CMake integration
#
# Include this file BEFORE calling FetchContent_Declare(llama_cpp ...) to
# patch the fetched llama.cpp with Ollama's in-process compatibility
# layer. Example usage:
#
# include(${CMAKE_CURRENT_SOURCE_DIR}/../compat/compat.cmake)
#
# FetchContent_Declare(
# llama_cpp
# GIT_REPOSITORY ...
# GIT_TAG ${LLAMA_CPP_GIT_TAG}
# GIT_SHALLOW TRUE
# PATCH_COMMAND ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND}
# UPDATE_DISCONNECTED TRUE
# )
#
# The compat layer consists of:
# 1. Ollama-owned compat source files linked into the fetched llama.cpp
# targets from this directory.
# 2. A small patch file that adds call-sites in llama.cpp loaders.
set(_compat_dir ${CMAKE_CURRENT_LIST_DIR})
# Expose a single variable the main CMakeLists passes into FetchContent's
# PATCH_COMMAND. The patch is applied via a small CMake script so the step
# is idempotent: re-configuring or rebuilding won't fail with "already
# applied".
#
# The compat source files are NOT copied into the fetched tree.
# Instead, llama/server/CMakeLists.txt does target_sources() on the llama
# target after FetchContent_MakeAvailable. That keeps Ollama's code in
# Ollama's tree and makes the patch pure call-site insertions.
set(OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND
    ${CMAKE_COMMAND}
        -DPATCH_FILE=${_compat_dir}/llama-cpp-hooks.patch
        -P ${_compat_dir}/apply-patch.cmake
    CACHE INTERNAL "llama.cpp compat patch command for FetchContent")

# Where the compat source files live, so the main CMakeLists can wire them
# into the llama.cpp targets that need the hooks.
set(OLLAMA_LLAMA_CPP_COMPAT_DIR
    "${_compat_dir}"
    CACHE INTERNAL "Directory holding llama.cpp compat sources")

# Also export the individual paths in case callers want to do something
# custom (e.g. emit a dependency on the patch so reconfigures re-apply).
set(OLLAMA_LLAMA_CPP_COMPAT_PATCH_FILE
    "${_compat_dir}/llama-cpp-hooks.patch"
    CACHE INTERNAL "Path to the llama.cpp compat patch")
set(OLLAMA_LLAMA_CPP_COMPAT_SOURCES
    "${_compat_dir}/llama-ollama-compat.h"
    "${_compat_dir}/llama-ollama-compat.cpp"
    "${_compat_dir}/llama-ollama-compat-util.h"
    "${_compat_dir}/llama-ollama-compat-util.cpp"
    CACHE INTERNAL "Source files linked into llama.cpp targets")


@ -0,0 +1,89 @@
diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 4e65a45..a6e4fe2 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -4,6 +4,7 @@
#include "ggml.h"
#include "gguf.h"
#include "llama-hparams.h"
+#include "llama-ollama-compat.h"
#include <algorithm>
#include <array>
@@ -549,6 +550,7 @@ llama_model_loader::llama_model_loader(
}
get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
+ if (llama_ollama_compat::translate_metadata(this, metadata, ctx, arch_name, fname.c_str())) use_mmap = false;
llm_kv = LLM_KV(llm_arch_from_string(arch_name));
files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
@@ -573,6 +575,9 @@ llama_model_loader::llama_model_loader(
// so we build a unified tensors index for weights.
for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
std::string tensor_name = std::string(cur->name);
+ if (llama_ollama_compat::should_skip_tensor(this, tensor_name.c_str())) {
+ continue;
+ }
// make sure there is no duplicated tensor names
if (weights_map.find(tensor_name) != weights_map.end()) {
throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
@@ -683,6 +688,9 @@ llama_model_loader::llama_model_loader(
// Save tensors data offset info of the main file.
for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
std::string tensor_name = std::string(cur->name);
+ if (llama_ollama_compat::should_skip_tensor(this, tensor_name.c_str())) {
+ continue;
+ }
// make sure there is no duplicated tensor names
if (weights_map.find(tensor_name) != weights_map.end()) {
throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
@@ -1375,6 +1383,7 @@ void llama_model_loader::get_mapping_range(size_t * first, size_t * last, void *
void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
const auto & w = require_weight(ggml_get_name(cur));
+ if (llama_ollama_compat::maybe_load_text_tensor(this, cur, w.offs)) return;
if (use_mmap) {
const auto & mapping = mappings.at(w.idx);
@@ -1525,6 +1534,7 @@ bool llama_model_loader::load_all_data(
}
size_t n_size = ggml_nbytes(cur);
+ if (llama_ollama_compat::maybe_load_text_tensor(this, cur, weight->offs)) continue;
if (use_mmap) {
const auto & mapping = mappings.at(weight->idx);
diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index 2e0cfa6..a0f2955 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -10,6 +10,8 @@
#include "ggml-backend.h"
#include "gguf.h"
+#include "llama-ollama-compat.h"
+
#include <algorithm>
#include <cassert>
#include <cmath>
@@ -1009,6 +1011,11 @@ struct clip_model_loader {
ctx_meta.reset(meta);
+ // If this is an Ollama-format monolithic GGUF (text + embedded
+ // vision), translate its metadata and tensor names into the
+ // upstream mmproj shape so the rest of this loader runs unchanged.
+ llama_ollama_compat::translate_clip_metadata(ctx_gguf.get(), meta);
+
const int n_tensors = gguf_get_n_tensors(ctx_gguf.get());
// print gguf info
@@ -2611,6 +2618,7 @@ struct clip_model_loader {
auto it_off = tensor_offset.find(t->name);
GGML_ASSERT(it_off != tensor_offset.end() && "no offset for tensor");
const size_t offset = it_off->second;
+ if (llama_ollama_compat::maybe_load_tensor(cur, fname.c_str(), offset, buft)) continue;
fin.seekg(offset, std::ios::beg);
if (!fin) {
throw std::runtime_error(string_format("%s: failed to seek for tensor %s\n", __func__, t->name));

Some files were not shown because too many files have changed in this diff.