runner: Remove CGO engines, use llama-server exclusively for GGML models (#16031)

* broad lint fixes to sidestep CI scope glitch
* runner: Remove CGO engines, use llama-server exclusively for GGML models

  Remove the vendored GGML and llama.cpp backend, CGO runner, Go model implementations, and sample. llama-server (built from upstream llama.cpp via FetchContent) is now the sole inference engine for GGUF-based models. (Safetensor-based models continue to run on the new MLX engine.) This allows us to more rapidly pick up new capabilities and fixes from llama.cpp as they come out.

  On Windows, this now requires recent AMD driver versions to support ROCm v7, as llama.cpp currently does not support building against v6.
* llama/compat: load Ollama-format GGUFs in llama-server

  Squashed from upstream/jmorganca/llama-compat on 2026-04-29. Source tip: 0c33775d37.

  Original source commits:
  - 25223160d llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs
  - 7449b539a llm,server: route Ollama-format gemma3 blobs through llama/compat
  - 436f2e2b1 llama/compat: make patch-apply idempotent
  - 8c2c9d4c8 llama/compat: extend gemma3 handler to cover 1B and 270M blobs
  - 021389f7b llama/compat: shrink clip.cpp injection from 18 lines to 1
  - 61b367ec2 llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines)
  - 36049361c llama/compat: simplify shim (gemma3-tested)
  - 8fa664865 llama/compat: add qwen35moe text handler
  - db0c74530 llama/compat: add qwen35moe vision (clip) support
  - 2a388da77 llama/compat: split shared infra into a util TU
  - 9a69a17dc llama/compat: document non-public API dependencies
  - d0f38a915 llama/compat: add gpt-oss and lfm2 handlers
  - 086071822 llama/compat: add mistral3 text handler (vision TODO)
  - 63bde9ff7 llama/compat: add mistral3 vision (clip) support
  - 3a57b89d5 llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K
  - 99cb87439 llama/compat: add qwen35, gemma4, deepseek-ocr handlers
  - 2c7850dba llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip)
  - 9e3b54225 llama/compat: add llama4 text + clip handlers
  - 034fee349 llama/compat: add gemma4 clip handler (gemma4v projector)
  - 9945c5a93 server: remove dhiltgen/* compat redirect table
  - 5d4539101 llama/compat: rewrite gemma4 tokenizer model to BPE
  - 7e0765327 llama/compat: add glm-ocr text handler + text-loader load-op hook
  - f1bd1a25a llama/compat: add glm-ocr clip handler (glm4v projector)
  - 4b5cf3420 llama/compat: collapse text-loader hook back to one new patch line
  - eb4ecf4fc llama/compat: extend gemma4 clip handler to gemma4a (audio)
  - a23a5e76f llama/compat: fix gemma4a per-block norm tensor mapping
  - cd2dcaff4 llama/compat: add embeddinggemma handler
  - 1ce8a6b26 llama/compat: add qwen3-vl + qwen2.5-vl handlers
  - fd98ffa1e llama/compat: add gemma3n + glm4moelite handlers
  - cc7bdf0bc llama/compat: handle null buft in maybe_load_tensor
  - 0c33775d3 llama/compat: disable mmap when load_op transforms text-side tensors
* refine implementation
* ci: fix windows MLX build
* ci: fix windows llama-server build
* ci: fix windows rocm build
* ci: windows mlx tuning

  Shorten long-tail on build, and get OllamaSetup.exe back under 2g limit
* ci: fix windows dependencies
* win: fix dependency gathering
* disable openmp
* win: arm64 cross-compile build also DRY out CI steps
* scheduler improvements
* ci: improvements from #15982
* win: favor ninja for faster developer builds
* win: fix build
* win: fix arm64 cross-compile
* win: avoid spaces in compiler path
* misc discovery fixes, and bos handling
* lint fixes
* win: fix arm cross-compile build/CI bugs
* llama.cpp update
* win: handle multiple CRT dirs
* vulkan: add windows iGPU detection
* fix creation bugs for patched models, other refactoring work
* tune batch size for better performance
* ci and lint fixes
* fix repeat_last_n bug
* build: revamp build for better developer UX
* amd, sampler, qwen3next fixes
* version bump
* fix mlx build
* revamp GPU discovery

  Scanning the output of llama-server is turning out to be too error prone across llama.cpp updates, so this switches to a thin dynamic library load against the bundled GGML libraries so more details can be gathered from the API.
* version bump
* missing file
* ci: fix cache miss on rocm build
* refine vulkan dep handling
* fix ps reporting bug on full GPU load
* improve cmake wiring for customized local builds
* version bump
* docker build arg cleanup
* improve windows exit error logs
* fix community gemma4 support and ci flakes
* fix mlx unit test
* tighten up ps logic to avoid double counting fit log lines
* version bump
* fix ps view for full gpu layer offload
* add MTP wiring for llama-server and create with GGUFs
* pick best template by capabilities
* version bump
* ci: harden apt repos
* remove unused cpu core discovery
* adjust batch default logic to reduce OOMs
* support larger tool calls
* fix audio support, template show
* qwen35 mtp patch support
* flesh out dtypes
* rocm deps
* version bump
* lint fix
* block broken gfx1150 on windows
* fix qwen3.5 moe mtp tensors in patch
* mmproj oom fallback and vulkan on by default
* qwen MTP compat fix
* version bump
* ci: fix WoA cross-compile
* ci: workaround ui tool in cross-compile
* version bump
* win: enable OpenMP for CPU builds
* build: improve developer UX
* ci: windows path workaround for CPU build
* win: fix WoA dependencies
* win: fix large offset reads for mmproj patched loads
* version bump
* fix vulkan dup detection
* add OLLAMA_IGPU_ENABLE and largely disable iGPUs by default
* opt-in MTP, win large offset, integration fixes
* fix unit test scheduler interaction hang
* fix multi-gpu filtering
* version bump
* review comments
* fix thinking level
* fix linux rocm ordering and granite 3.3 template
* version bump
* ci fix - non-shallow MLX checkout
* bypass linux sysfs unit test on windows

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
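With llama-server as the only GGUF engine, a bundle that ships without the llama-server binary cannot serve those models, which is presumably why the workflows changed in this commit check for it before packaging. A minimal sketch of that kind of check follows. `verify_payloads` is a hypothetical helper name, not something defined in this commit, and unlike the workflow loops (which exit at the first missing file) it reports every missing path before failing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): succeed only if every
# path passed as an argument exists as a regular file.
verify_payloads() {
  local missing=0 payload
  for payload in "$@"; do
    if [ ! -f "$payload" ]; then
      # Report on stderr and keep going so all gaps are listed at once.
      echo "missing $payload" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, using payload paths that appear in the workflow diff:
#   verify_payloads \
#     dist/windows-amd64/lib/ollama/llama-server.exe \
#     dist/windows-arm64/lib/ollama/llama-server.exe
```

Collecting all missing paths in one pass is a design choice for readability of CI logs; exiting on the first miss, as the workflow does, is equally valid.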
2026-07-03 03:38:52 +00:00 · 2026-05-29 13:35:47 -07:00 · 9db4bdbad6
commit 9db4bdbad6
parent f63eea3d27
1100 changed files with 28510 additions and 430069 deletions
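Both workflow files in this commit stop calling `make -f Makefile.sync print-base` and instead build their cache key (`vendorsha` in release.yaml, `enginehash` in test.yaml) by joining the contents of `LLAMA_CPP_VERSION`, `MLX_VERSION`, and `MLX_C_VERSION` with dashes. A minimal sketch of that derivation, assuming each pin file holds a single-line version string; `engine_cache_key` and its directory argument are hypothetical and only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print
# "<llama.cpp pin>-<MLX pin>-<MLX-C pin>" from the pin files in DIR.
engine_cache_key() {
  local dir="${1:-.}" f
  # Fail early if a pin file is absent, rather than emitting a partial key.
  for f in LLAMA_CPP_VERSION MLX_VERSION MLX_C_VERSION; do
    [ -r "$dir/$f" ] || { echo "missing pin file: $dir/$f" >&2; return 1; }
  done
  printf '%s-%s-%s\n' \
    "$(cat "$dir/LLAMA_CPP_VERSION")" \
    "$(cat "$dir/MLX_VERSION")" \
    "$(cat "$dir/MLX_C_VERSION")"
}

# In a workflow step this could feed the job output, for example:
#   echo "vendorsha=$(engine_cache_key .)" | tee -a "$GITHUB_OUTPUT"
```

Because the key changes whenever any of the three pins changes, caches keyed on it are invalidated exactly when an upstream engine version is bumped.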
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -16,7 +16,7 @@ jobs:
    outputs:
      GOFLAGS: ${{ steps.goflags.outputs.GOFLAGS }}
      VERSION: ${{ steps.goflags.outputs.VERSION }}
-      vendorsha: ${{ steps.changes.outputs.vendorsha }}
+      vendorsha: ${{ steps.goflags.outputs.vendorsha }}
    steps:
      - uses: actions/checkout@v4
      - name: Set environment
@@ -24,7 +24,7 @@ jobs:
        run: |
          echo GOFLAGS="'-ldflags=-w -s \"-X=github.com/ollama/ollama/version.Version=${GITHUB_REF_NAME#v}\" \"-X=github.com/ollama/ollama/server.mode=release\"'" | tee -a $GITHUB_OUTPUT
          echo VERSION="${GITHUB_REF_NAME#v}" | tee -a $GITHUB_OUTPUT
-          echo vendorsha=$(make -f Makefile.sync print-base) | tee -a $GITHUB_OUTPUT
+          echo vendorsha=$(cat LLAMA_CPP_VERSION)-$(cat MLX_VERSION)-$(cat MLX_C_VERSION) | tee -a $GITHUB_OUTPUT

  darwin-build:
    runs-on: macos-26-xlarge
@@ -57,7 +57,9 @@ jobs:
          go-version-file: go.mod
          cache-dependency-path: |
            go.sum
-            Makefile.sync
+            LLAMA_CPP_VERSION
+            MLX_VERSION
+            MLX_C_VERSION
      - run: |
          ./scripts/build_darwin.sh
      - name: Log build results
@@ -73,15 +75,18 @@ jobs:
            dist/*.dmg

  windows-depends:
+    needs: setup-environment
    strategy:
      matrix:
        os: [windows]
        arch: [amd64]
        preset: ['CPU']
+        build-steps: ['cpu cpuArm64']
        include:
          - os: windows
            arch: amd64
            preset: 'CUDA 12'
+            build-steps: cuda12
            install: https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_571.96_windows.exe
            cuda-components:
              - '"cudart"'
@@ -89,10 +94,10 @@ jobs:
              - '"cublas"'
              - '"cublas_dev"'
            cuda-version: '12.8'
-            flags: ''
          - os: windows
            arch: amd64
            preset: 'CUDA 13'
+            build-steps: cuda13
            install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
            cuda-components:
              - '"cudart"'
@@ -103,23 +108,23 @@ jobs:
              - '"nvvm"'
              - '"nvptxcompiler"'
            cuda-version: '13.0'
-            flags: ''
          - os: windows
            arch: amd64
-            preset: 'ROCm 6'
-            install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-WinSvr2022-For-HIP.exe
-            rocm-version: '6.2'
-            flags: '-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma" -DCMAKE_CXX_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma"'
-            runner_dir: 'rocm'
+            preset: 'ROCm 7'
+            build-steps: rocm7
+            install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-26.Q1-Win11-For-HIP.exe
+            rocm-version: '7.1'
          - os: windows
            arch: amd64
            preset: Vulkan
+            build-steps: vulkan
            install: https://sdk.lunarg.com/sdk/download/1.4.321.1/windows/vulkansdk-windows-X64-1.4.321.1.exe
-            flags: ''
-            runner_dir: 'vulkan'
          - os: windows
            arch: amd64
            preset: 'MLX CUDA 13'
+            build-steps: mlxCuda13
+            build-parallel: '16'
+            cmake-cuda-flags: '-t 6'
            install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
            cudnn-install: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.18.1.3_cuda13-archive.zip
            cuda-components:
@@ -135,13 +140,12 @@ jobs:
              - '"nvvm"'
              - '"nvptxcompiler"'
            cuda-version: '13.0'
-            flags: ''
    runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
    environment: release
    env:
      GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
+      VERSION: ${{ needs.setup-environment.outputs.VERSION }}
    steps:
-      # Increase pagefile to handle momentary spikes in RAM from NVCC compiles
      - if: startsWith(matrix.preset, 'MLX ')
        name: Increase pagefile to 200 GB
        uses: al-cheb/configure-pagefile-action@v1.5
@@ -155,6 +159,15 @@ jobs:
          if (Get-Command ccache -ErrorAction SilentlyContinue) {
            ccache -o cache_dir=${{ github.workspace }}\.ccache
          }
+      - if: matrix.preset == 'CPU'
+        name: Install Windows ARM64 cross compiler
+        run: |
+          Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-x86_64.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
+          Expand-Archive -Path ${{ runner.temp }}\llvm-mingw-ucrt.zip -DestinationPath "C:\Program Files\"
+          $installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt-x86_64").path
+          if (!(Test-Path "$installPath\bin\aarch64-w64-mingw32-gcc.exe")) {
+            throw "llvm-mingw x86_64 package is missing the aarch64 cross compiler"
+          }
      - if: startsWith(matrix.preset, 'CUDA ') || startsWith(matrix.preset, 'ROCm ') || startsWith(matrix.preset, 'Vulkan') || startsWith(matrix.preset, 'MLX ')
        id: cache-install
        uses: actions/cache/restore@v4
@@ -203,12 +216,12 @@ jobs:
          }
          
          $vulkanPath = (Resolve-Path "C:\VulkanSDK\*").path
+          $vulkanRuntime = Join-Path $vulkanPath "Helpers\VulkanRT.exe"
+          if (Test-Path $vulkanRuntime) {
+            Start-Process -FilePath $vulkanRuntime -ArgumentList "/s" -NoNewWindow -Wait
+          }
          echo "$vulkanPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
          echo "VULKAN_SDK=$vulkanPath" >> $env:GITHUB_ENV
-      - if: matrix.preset == 'CPU'
-        run: |
-          echo "CC=clang.exe" | Out-File -FilePath $env:GITHUB_ENV -Append
-          echo "CXX=clang++.exe" | Out-File -FilePath $env:GITHUB_ENV -Append
      - if: startsWith(matrix.preset, 'MLX ')
        name: Install cuDNN for MLX
        run: |
@@ -240,73 +253,63 @@ jobs:
        with:
          path: ${{ github.workspace }}\.ccache
          key: ccache-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.preset }}-${{ needs.setup-environment.outputs.vendorsha }}
-      - name: Build target "${{ matrix.preset }}"
+      - name: Build Windows dependencies
        run: |
          Import-Module 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\Tools\Microsoft.VisualStudio.DevShell.dll'
          Enter-VsDevShell -VsInstallPath 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise' -SkipAutomaticLocation  -DevCmdArguments '-arch=x64 -no_logo'
-          cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }} --install-prefix "$((pwd).Path)\dist\${{ matrix.os }}-${{ matrix.arch }}"
-          cmake --build --preset "${{ matrix.preset }}" -- -l $([Environment]::ProcessorCount)
-          cmake --install build --component "${{ startsWith(matrix.preset, 'MLX ') && 'MLX' || startsWith(matrix.preset, 'CUDA ') && 'CUDA' || startsWith(matrix.preset, 'ROCm ') && 'HIP' || startsWith(matrix.preset, 'Vulkan') && 'Vulkan' || 'CPU' }}" --strip
-          if ('${{ matrix.preset }}'.StartsWith('MLX ')) { cmake --install build --component MLX_VENDOR }
-          Remove-Item -Path dist\lib\ollama\rocm\rocblas\library\*gfx906* -ErrorAction SilentlyContinue
+          $steps = "${{ matrix.build-steps }}".Split(' ', [System.StringSplitOptions]::RemoveEmptyEntries)
+          ./scripts/build_windows.ps1 @steps
        env:
          CMAKE_GENERATOR: Ninja
+          OLLAMA_BUILD_PARALLEL: ${{ matrix.build-parallel || '' }}
+          OLLAMA_CMAKE_CUDA_FLAGS: ${{ matrix.cmake-cuda-flags || '' }}
      - name: Log build results
        run: |
          gci -path .\dist -Recurse -File | ForEach-Object { get-filehash -path $_.FullName -Algorithm SHA256 } | format-list
+      - if: matrix.preset == 'CPU'
+        name: Verify Windows CPU payloads
+        shell: bash
+        run: |
+          set -euo pipefail
+          for payload in \
+            dist/windows-amd64/lib/ollama/llama-server.exe \
+            dist/windows-arm64/lib/ollama/llama-server.exe
+          do
+            [ -f "$payload" ] || { echo "missing $payload"; exit 1; }
+          done
      - uses: actions/upload-artifact@v4
        with:
          name: depends-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.preset }}
          path: dist\*

  windows-build:
-    strategy:
-      matrix:
-        os: [windows]
-        arch: [amd64, arm64]
-        include:
-        - os: windows
-          arch: amd64
-          llvmarch: x86_64
-        - os: windows
-          arch: arm64
-          llvmarch: aarch64
-    runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
+    runs-on: windows
    environment: release
    needs: [setup-environment]
    env:
      GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
      VERSION: ${{ needs.setup-environment.outputs.VERSION }}
    steps:
-      - name: Install ARM64 system dependencies
-        if: matrix.arch == 'arm64'
-        run: |
-          $ErrorActionPreference = "Stop"
-          Set-ExecutionPolicy Bypass -Scope Process -Force
-          [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072
-          iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
-          echo "C:\ProgramData\chocolatey\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
-
-          Invoke-WebRequest -Uri https://aka.ms/vs/17/release/vc_redist.arm64.exe -OutFile "${{ runner.temp }}\vc_redist.arm64.exe"
-          Start-Process -FilePath "${{ runner.temp }}\vc_redist.arm64.exe" -ArgumentList @("/install", "/quiet", "/norestart") -NoNewWindow -Wait
-
-          choco install -y --no-progress git gzip
-          echo "C:\Program Files\Git\cmd" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
      - name: Install clang and gcc-compat
        run: |
          $ErrorActionPreference = "Stop"
          Set-ExecutionPolicy Bypass -Scope Process -Force
-          Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-${{ matrix.llvmarch }}.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
+          Invoke-WebRequest -Uri "https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-x86_64.zip" -OutFile "${{ runner.temp }}\llvm-mingw-ucrt.zip"
          Expand-Archive -Path ${{ runner.temp }}\llvm-mingw-ucrt.zip -DestinationPath "C:\Program Files\"
-          $installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt*").path
+          $installPath=(Resolve-Path -Path "C:\Program Files\llvm-mingw-*-ucrt-x86_64").path
          echo "$installPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
+          if (!(Test-Path "$installPath\bin\aarch64-w64-mingw32-gcc.exe")) {
+            throw "llvm-mingw x86_64 package is missing the aarch64 cross compiler"
+          }
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
          cache-dependency-path: |
            go.sum
-            Makefile.sync
+            LLAMA_CPP_VERSION
+            MLX_VERSION
+            MLX_C_VERSION
      - name: Verify gcc is actually clang
        run: |
          $ErrorActionPreference='Continue'
@@ -323,20 +326,30 @@ jobs:
        with:
          node-version: "20"
      - run: |
-          ./scripts/build_windows ollama app
+          ./scripts/build_windows ollama ollamaArm64 app appArm64
+      - name: Verify Windows build payloads
+        shell: bash
+        run: |
+          set -euo pipefail
+          for payload in \
+            dist/windows-amd64/ollama.exe \
+            dist/windows-arm64/ollama.exe
+          do
+            [ -f "$payload" ] || { echo "missing $payload"; exit 1; }
+          done
      - name: Log build results
        run: |
          gci -path .\dist -Recurse -File | ForEach-Object { get-filehash -path $_.FullName -Algorithm SHA256 } | format-list
      - uses: actions/upload-artifact@v4
        with:
-          name: build-${{ matrix.os }}-${{ matrix.arch }}
+          name: build-windows-amd64
          path: |
            dist\*

  windows-app:
    runs-on: windows
    environment: release
-    needs: [windows-build, windows-depends]
+    needs: [setup-environment, windows-build, windows-depends]
    env:
      GOFLAGS: ${{ needs.setup-environment.outputs.GOFLAGS }}
      VERSION: ${{ needs.setup-environment.outputs.VERSION }}
@@ -362,7 +375,9 @@ jobs:
          go-version-file: go.mod
          cache-dependency-path: |
            go.sum
-            Makefile.sync
+            LLAMA_CPP_VERSION
+            MLX_VERSION
+            MLX_C_VERSION
      - uses: actions/download-artifact@v4
        with:
          pattern: depends-windows*
@@ -376,6 +391,18 @@ jobs:
      - name: Log dist contents after download
        run: |
          gci -path .\dist -recurse
+      - name: Verify Windows package inputs
+        shell: bash
+        run: |
+          set -euo pipefail
+          for payload in \
+            dist/windows-amd64/ollama.exe \
+            dist/windows-amd64/lib/ollama/llama-server.exe \
+            dist/windows-arm64/ollama.exe \
+            dist/windows-arm64/lib/ollama/llama-server.exe
+          do
+            [ -f "$payload" ] || { echo "missing $payload"; exit 1; }
+          done
      - run: |
          ./scripts/build_windows.ps1 deps sign installer zip
      - name: Log contents after build
@@ -389,31 +416,28 @@ jobs:
            dist/*.ps1
            dist/OllamaSetup.exe

-  # Pre-build each Dockerfile stage on its own runner in parallel and push the
-  # resulting layers to a per-stage registry cache.  The downstream
-  # docker-build-push job then assembles cache-hit-only.
  linux-depends:
    strategy:
      matrix:
        include:
          - arch: amd64
-            target: cpu
+            target: llama-server-cpu
          - arch: amd64
-            target: cuda-12
+            target: llama-server-cuda_v12
          - arch: amd64
-            target: cuda-13
+            target: llama-server-cuda_v13
          - arch: amd64
            target: mlx
          - arch: amd64
-            target: rocm-7
+            target: llama-server-rocm_v7_2
          - arch: amd64
-            target: vulkan
+            target: llama-server-vulkan
          - arch: arm64
-            target: cpu
+            target: llama-server-cpu
          - arch: arm64
-            target: cuda-12
+            target: llama-server-cuda_v12
          - arch: arm64
-            target: cuda-13
+            target: llama-server-cuda_v13
          - arch: arm64
            target: jetpack-5
          - arch: arm64
@@ -430,7 +454,6 @@ jobs:
        with:
          username: ${{ vars.DOCKER_USER }}
          password: ${{ secrets.DOCKER_ACCESS_TOKEN }}
-      # Increase swap to handle momentary spikes in RAM from NVCC compiles
      - if: matrix.target == 'mlx'
        name: Increase Linux swap to 200 GB
        shell: bash
@@ -459,12 +482,13 @@ jobs:
          provenance: false
          sbom: false
          build-args: |
+            GOFLAGS=${{ env.GOFLAGS }}
            CGO_CFLAGS=${{ env.CGO_CFLAGS }}
            CGO_CXXFLAGS=${{ env.CGO_CXXFLAGS }}
-            GOFLAGS=${{ env.GOFLAGS }}
-            APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
            OLLAMA_MLX_BUILD_JOBS=16
            OLLAMA_MLX_NVCC_THREADS=6
+            APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
+            APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
          cache-from: |
            type=registry,ref=ollama/release:cache-${{ matrix.arch }}-${{ matrix.target }}
            type=registry,ref=${{ vars.DOCKER_REPO }}:latest
@@ -472,58 +496,65 @@ jobs:

  # Build each Docker variant (OS, arch, and flavor) separately. Using QEMU is unreliable and slower.
  # Heavy stages were pre-built by linux-depends; this job is cache-hit-only for those layers
-  # and just assembles, runs the Go build, and pushes the final image.
+  # and just assembles, runs the Go build, pushes the final image, and extracts release bundles.
  docker-build-push:
    strategy:
      matrix:
        include:
          - os: linux
            arch: arm64
+            archive-target: archive
            build-args: |
              CGO_CFLAGS
              CGO_CXXFLAGS
              GOFLAGS
              APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
+              APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
              OLLAMA_MLX_BUILD_JOBS=16
              OLLAMA_MLX_NVCC_THREADS=6
            cache-from: |
-              type=registry,ref=${{ vars.DOCKER_REPO }}:latest
-              type=registry,ref=ollama/release:cache-arm64-cpu
-              type=registry,ref=ollama/release:cache-arm64-cuda-12
-              type=registry,ref=ollama/release:cache-arm64-cuda-13
+              type=registry,ref=ollama/release:cache-arm64-llama-server-cpu
+              type=registry,ref=ollama/release:cache-arm64-llama-server-cuda_v12
+              type=registry,ref=ollama/release:cache-arm64-llama-server-cuda_v13
              type=registry,ref=ollama/release:cache-arm64-jetpack-5
              type=registry,ref=ollama/release:cache-arm64-jetpack-6
+              type=registry,ref=${{ vars.DOCKER_REPO }}:latest
          - os: linux
            arch: amd64
+            archive-target: archive
            build-args: |
              CGO_CFLAGS
              CGO_CXXFLAGS
              GOFLAGS
              APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
+              APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
              OLLAMA_MLX_BUILD_JOBS=16
              OLLAMA_MLX_NVCC_THREADS=6
            cache-from: |
-              type=registry,ref=${{ vars.DOCKER_REPO }}:latest
-              type=registry,ref=ollama/release:cache-amd64-cpu
-              type=registry,ref=ollama/release:cache-amd64-cuda-12
-              type=registry,ref=ollama/release:cache-amd64-cuda-13
+              type=registry,ref=ollama/release:cache-amd64-llama-server-cpu
+              type=registry,ref=ollama/release:cache-amd64-llama-server-cuda_v12
+              type=registry,ref=ollama/release:cache-amd64-llama-server-cuda_v13
              type=registry,ref=ollama/release:cache-amd64-mlx
-              type=registry,ref=ollama/release:cache-amd64-vulkan
+              type=registry,ref=ollama/release:cache-amd64-llama-server-rocm_v7_2
+              type=registry,ref=ollama/release:cache-amd64-llama-server-vulkan
+              type=registry,ref=${{ vars.DOCKER_REPO }}:latest
          - os: linux
            arch: amd64
            suffix: '-rocm'
+            archive-target: image-archive
            build-args: |
              CGO_CFLAGS
              CGO_CXXFLAGS
              GOFLAGS
              FLAVOR=rocm
              APT_MIRROR=http://azure.archive.ubuntu.com/ubuntu
+              APT_PORTS_MIRROR=http://azure.ports.ubuntu.com/ubuntu-ports
              OLLAMA_MLX_BUILD_JOBS=16
              OLLAMA_MLX_NVCC_THREADS=6
            cache-from: |
+              type=registry,ref=ollama/release:cache-amd64-llama-server-cpu
+              type=registry,ref=ollama/release:cache-amd64-llama-server-rocm_v7_2
              type=registry,ref=${{ vars.DOCKER_REPO }}:latest
-              type=registry,ref=ollama/release:cache-amd64-cpu
-              type=registry,ref=ollama/release:cache-amd64-rocm-7
    runs-on: ${{ matrix.arch == 'arm64' && format('{0}-{1}', matrix.os, matrix.arch) || matrix.os }}
    environment: release
    needs: [setup-environment, linux-depends]
@@ -556,14 +587,11 @@ jobs:
          name: digest-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.suffix }}
          path: |
            ${{ runner.temp }}/${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.suffix }}.txt
-      # Re-run buildx with --target archive against buildkit's local cache to
-      # extract the release directory layout.  All upstream stages were just
-      # built above, so this is a cache-hit-only pass that just writes files.
      - uses: docker/build-push-action@v6
        with:
          context: .
          platforms: ${{ matrix.os }}/${{ matrix.arch }}
-          target: archive
+          target: ${{ matrix.archive-target }}
          provenance: false
          sbom: false
          build-args: ${{ matrix.build-args }}
@@ -572,24 +600,33 @@ jobs:
      - name: Deduplicate CUDA libraries
        run: |
          ./scripts/deduplicate_cuda_libs.sh dist/${{ matrix.os }}-${{ matrix.arch }}
+      - name: Verify Linux build payloads
+        shell: bash
+        run: |
+          set -euo pipefail
+          base="dist/${{ matrix.os }}-${{ matrix.arch }}"
+          for payload in \
+            "$base/bin/ollama" \
+            "$base/lib/ollama/llama-server"
+          do
+            [ -f "$payload" ] || { echo "missing $payload"; exit 1; }
+          done
      - run: |
          for COMPONENT in bin/* lib/ollama/*; do
            case "$COMPONENT" in
              bin/ollama*)               echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
              lib/ollama/*.so*)          echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
+              lib/ollama/llama-server*|lib/ollama/llama-quantize*)  echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
              lib/ollama/cuda_v*)        echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
              lib/ollama/vulkan*)        echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
              lib/ollama/mlx*)           echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-mlx.tar.in ;;
-              lib/ollama/include*)       echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in ;;
+              lib/ollama/include*)       echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-mlx.tar.in ;;
              lib/ollama/cuda_jetpack5)  echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-jetpack5.tar.in ;;
              lib/ollama/cuda_jetpack6)  echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-jetpack6.tar.in ;;
-              lib/ollama/rocm)           echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-rocm.tar.in ;;
+              lib/ollama/rocm_v*)        echo $COMPONENT >>ollama-${{ matrix.os }}-${{ matrix.arch }}-rocm.tar.in ;;
            esac
          done
        working-directory: dist/${{ matrix.os }}-${{ matrix.arch }}
-      # rocm builds cpu + rocm libs for the container image, which
-      # creates a CPU-only amd64 tarball that would collide with the full
-      # bundle when the release job merges artifacts.
      - if: matrix.suffix == '-rocm'
        run: rm -f dist/${{ matrix.os }}-${{ matrix.arch }}/ollama-${{ matrix.os }}-${{ matrix.arch }}.tar.in
      - run: |
@@ -665,6 +702,21 @@ jobs:
      - name: Copy install scripts to dist
        run: |
          cp scripts/install.sh dist/install.sh
+      - name: Verify release artifacts
+        run: |
+          required=(
+            dist/OllamaSetup.exe
+            dist/install.ps1
+            dist/install.sh
+            dist/ollama-windows-amd64.zip
+            dist/ollama-windows-arm64.zip
+          )
+          for payload in "${required[@]}"; do
+            if [ ! -f "$payload" ]; then
+              echo "::error::Missing expected release artifact: $payload"
+              exit 1
+            fi
+          done
      - name: Generate checksum file
        run: find . -type f -not -name 'sha256sum.txt' | xargs sha256sum | tee sha256sum.txt
        working-directory: dist
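The `case` statement changed in the release.yaml diff decides which release tarball each component under `bin/` and `lib/ollama/` lands in. After this commit, `lib/ollama/include*` moves to the MLX bundle, `llama-server*` and `llama-quantize*` join the base bundle, and ROCm is matched by `rocm_v*` instead of the bare `rocm` directory. The sketch below restates that routing as a function so the rules can be checked in isolation; `bundle_for_component` and its `prefix` argument are hypothetical, and the pattern order mirrors the workflow because the first matching pattern wins.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print the bundle name a
# component would be listed in, or return 1 if no pattern matches.
# PREFIX stands in for "ollama-<os>-<arch>".
bundle_for_component() {
  local prefix="$1" component="$2"
  case "$component" in
    bin/ollama*)                                         echo "$prefix" ;;
    lib/ollama/*.so*)                                    echo "$prefix" ;;
    lib/ollama/llama-server*|lib/ollama/llama-quantize*) echo "$prefix" ;;
    lib/ollama/cuda_v*)                                  echo "$prefix" ;;
    lib/ollama/vulkan*)                                  echo "$prefix" ;;
    lib/ollama/mlx*)                                     echo "$prefix-mlx" ;;
    lib/ollama/include*)                                 echo "$prefix-mlx" ;;
    lib/ollama/cuda_jetpack5)                            echo "$prefix-jetpack5" ;;
    lib/ollama/cuda_jetpack6)                            echo "$prefix-jetpack6" ;;
    lib/ollama/rocm_v*)                                  echo "$prefix-rocm" ;;
    *)                                                   return 1 ;;
  esac
}

# Example:
#   bundle_for_component ollama-linux-amd64 lib/ollama/rocm_v7_2
```

Note that a directory named plain `lib/ollama/rocm` no longer matches any pattern, so it would be left out of every bundle.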
--- a/.github/workflows/test.yaml
+++ b/.github/workflows/test.yaml
@@ -23,7 +23,7 @@ jobs:
    outputs:
      changed: ${{ steps.changes.outputs.changed }}
      app_changed: ${{ steps.changes.outputs.app_changed }}
-      vendorsha: ${{ steps.changes.outputs.vendorsha }}
+      enginehash: ${{ steps.changes.outputs.enginehash }}
    steps:
      - uses: actions/checkout@v4
        with:
@@ -38,9 +38,42 @@ jobs:
              | xargs python3 -c "import sys; from pathlib import Path; print(any(Path(x).match(glob) for x in sys.argv[1:] for glob in '$*'.split(' ')))"
          }

-          echo changed=$(changed 'llama/llama.cpp/**/*' 'ml/backend/ggml/ggml/**/*' '.github/**/*') | tee -a $GITHUB_OUTPUT
+          echo changed=$(changed \
+            'CMakeLists.txt' \
+            'CMakePresets.json' \
+            'cmake/**' \
+            'cmake/**/*' \
+            'llama/server/**/*' \
+            'llama/compat/**/*' \
+            'LLAMA_CPP_VERSION' \
+            'MLX_VERSION' \
+            'MLX_C_VERSION' \
+            'llama/llama.cpp/**/*' \
+            'ml/backend/ggml/ggml/**/*' \
+            'x/imagegen/mlx/**' \
+            'x/imagegen/mlx/**/*' \
+            '.github/**/*') | tee -a $GITHUB_OUTPUT
          echo app_changed=$(changed 'app/**' 'app/**/*') | tee -a $GITHUB_OUTPUT
-          echo vendorsha=$(make -f Makefile.sync print-base) | tee -a $GITHUB_OUTPUT
+          echo enginehash=$(cat LLAMA_CPP_VERSION)-$(cat MLX_VERSION)-$(cat MLX_C_VERSION) | tee -a $GITHUB_OUTPUT
+
+  patches:
+    strategy:
+      matrix:
+        os: [ubuntu-latest, windows-latest]
+    runs-on: ${{ matrix.os }}
+    steps:
+      - uses: actions/checkout@v4
+      - name: Verify patches apply cleanly
+        shell: bash
+        run: |
+          cmake -S llama/server -B "$RUNNER_TEMP/llama-server-patch-check" \
+            -DCMAKE_BUILD_TYPE=Release \
+            -DBUILD_SHARED_LIBS=ON \
+            -DGGML_BACKEND_DL=ON \
+            -DGGML_NATIVE=OFF \
+            -DGGML_OPENMP=OFF \
+            -DGGML_CPU_ALL_VARIANTS=ON \
+            -DOLLAMA_RUNNER_DIR=

  linux:
    needs: [changes]
@@ -49,23 +82,41 @@ jobs:
      matrix:
        include:
          - preset: CPU
+            superbuild_target: ollama-local
+            superbuild_dir: build/local-superbuild
+            superbuild_args: ''
+            expected_payload: lib/ollama/llama-server
+            install-go: true
          - preset: CUDA
            container: nvidia/cuda:13.0.0-devel-ubuntu22.04
-            flags: '-DCMAKE_CUDA_ARCHITECTURES=87'
+            superbuild_target: ollama-llama-server-cuda_v13
+            superbuild_dir: build/local-superbuild-cuda_v13
+            superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=87'
+            expected_payload: lib/ollama/cuda_v13/libggml-cuda.so
          - preset: ROCm
            container: rocm/dev-ubuntu-22.04:7.2.1
            extra-packages: rocm-libs
-            flags: '-DAMDGPU_TARGETS=gfx1010 -DCMAKE_PREFIX_PATH=/opt/rocm'
+            superbuild_target: ollama-llama-server-rocm_v7_2
+            superbuild_dir: build/local-superbuild-rocm_v7_2
+            superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=rocm_v7_2 -DAMDGPU_TARGETS=gfx1010 -DCMAKE_PREFIX_PATH=/opt/rocm'
+            expected_payload: lib/ollama/rocm_v7_2/libggml-hip.so
          - preset: Vulkan
            container: ubuntu:22.04
            extra-packages: >
              mesa-vulkan-drivers vulkan-tools
              libvulkan1 libvulkan-dev
-              vulkan-sdk cmake ccache g++ make
+              vulkan-sdk spirv-headers cmake ccache g++ make
+            superbuild_target: ollama-llama-server-vulkan
+            superbuild_dir: build/local-superbuild-vulkan
+            superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=vulkan'
+            expected_payload: lib/ollama/vulkan/libggml-vulkan.so
          - preset: 'MLX CUDA 13'
            container: nvidia/cuda:13.0.0-devel-ubuntu22.04
            extra-packages: libcudnn9-dev-cuda-13 libopenblas-dev liblapack-dev liblapacke-dev git curl
-            flags: '-DCMAKE_CUDA_ARCHITECTURES=87 -DMLX_CUDA_ARCHITECTURES=80-virtual -DBLAS_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu -DLAPACK_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu'
+            superbuild_target: ollama-mlx-cuda_v13
+            superbuild_dir: build/local-superbuild-mlx-cuda_v13
+            superbuild_args: '-DOLLAMA_MLX_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=87 -DMLX_CUDA_ARCHITECTURES=80-virtual -DBLAS_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu -DLAPACK_INCLUDE_DIRS=/usr/include/x86_64-linux-gnu'
+            expected_payload: lib/ollama/mlx_cuda_v13/libmlx.so
            install-go: true
    runs-on: linux
    container: ${{ matrix.container }}
@@ -82,11 +133,9 @@ jobs:
            echo "deb [signed-by=/usr/share/keyrings/lunarg-archive-keyring.gpg]  https://packages.lunarg.com/vulkan/1.4.313 jammy main" | $sudo tee /etc/apt/sources.list.d/lunarg-vulkan-1.4.313-jammy.list > /dev/null
            $sudo apt-get update
          fi
-          $sudo apt-get install -y cmake ccache ${{ matrix.extra-packages }}
-          # MLX requires CMake 3.25+, install from official releases
-          if [ "${{ matrix.preset }}" = "MLX CUDA 13" ]; then
-            curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.31.2/cmake-3.31.2-linux-$(uname -m).tar.gz | $sudo tar xz -C /usr/local --strip-components 1
-          fi
+          $sudo apt-get install -y cmake ccache curl git ${{ matrix.extra-packages }}
+          # Use a current CMake for upstream llama.cpp and Vulkan dependency discovery.
+          curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.31.2/cmake-3.31.2-linux-$(uname -m).tar.gz | $sudo tar xz -C /usr/local --strip-components 1
          # Export VULKAN_SDK if provided by LunarG package (defensive)
          if [ -d "/usr/lib/x86_64-linux-gnu/vulkan" ] && [ "${{ matrix.preset }}" = "Vulkan" ]; then
            echo "VULKAN_SDK=/usr" >> $GITHUB_ENV
@@ -96,17 +145,30 @@ jobs:
      - if: matrix.install-go
        name: Install Go
        run: |
+          [ -n "${{ matrix.container }}" ] || sudo=sudo
          GO_VERSION=$(awk '/^go / { print $2 }' go.mod)
-          curl -fsSL "https://golang.org/dl/go${GO_VERSION}.linux-$(dpkg --print-architecture).tar.gz" | tar xz -C /usr/local
+          curl -fsSL "https://golang.org/dl/go${GO_VERSION}.linux-$(dpkg --print-architecture).tar.gz" | $sudo tar xz -C /usr/local
          echo "/usr/local/go/bin" >> $GITHUB_PATH
      - uses: actions/cache@v4
        with:
          path: /github/home/.cache/ccache
-          key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.vendorsha }}
-      - run: |
-          cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }}
-          cmake --build --preset "${{ matrix.preset }}" -- -l $(nproc)
-
+          key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.enginehash }}
+      - name: Build native superbuild
+        if: matrix.superbuild_target
+        run: |
+          cmake -S . -B "${{ matrix.superbuild_dir }}" ${{ matrix.superbuild_args }}
+          CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) cmake --build "${{ matrix.superbuild_dir }}" --target "${{ matrix.superbuild_target }}" -- -l $(nproc)
+          test -e "${{ matrix.superbuild_dir }}/${{ matrix.expected_payload }}"
+      - name: Verify local superbuild install
+        if: matrix.superbuild_target == 'ollama-local'
+        run: |
+          ./ollama --version
+          "${{ matrix.superbuild_dir }}/lib/ollama/llama-server" --version
+          test -x "${{ matrix.superbuild_dir }}/lib/ollama/llama-quantize"
+          cmake --install "${{ matrix.superbuild_dir }}" --component ollama-local --prefix "$RUNNER_TEMP/ollama-local"
+          "$RUNNER_TEMP/ollama-local/bin/ollama" --version
+          "$RUNNER_TEMP/ollama-local/lib/ollama/llama-server" --version
+          test -x "$RUNNER_TEMP/ollama-local/lib/ollama/llama-quantize"
  windows:
    needs: [changes]
    if: needs.changes.outputs.changed == 'True'
@@ -114,9 +176,16 @@ jobs:
      matrix:
        include:
          - preset: CPU
+            superbuild_target: ollama-local
+            superbuild_dir: build\local-superbuild
+            superbuild_args: ''
+            expected_payload: lib\ollama\llama-server.exe
          - preset: CUDA
            install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
-            flags: '-DCMAKE_CUDA_ARCHITECTURES=80'
+            superbuild_target: ollama-llama-server-cuda_v13
+            superbuild_dir: build\local-superbuild-cuda_v13
+            superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=80'
+            expected_payload: lib\ollama\cuda_v13\ggml-cuda.dll
            cuda-components:
              - '"cudart"'
              - '"nvcc"'
@@ -127,14 +196,26 @@ jobs:
              - '"nvptxcompiler"'
            cuda-version: '13.0'
          - preset: ROCm
-            install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-WinSvr2022-For-HIP.exe
-            flags: '-DAMDGPU_TARGETS=gfx1010 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma" -DCMAKE_CXX_FLAGS="-parallel-jobs=4 -Wno-ignored-attributes -Wno-deprecated-pragma"'
+            install: https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-26.Q1-Win11-For-HIP.exe
+            rocm-version: '7.1'
+            superbuild_target: ollama-llama-server-rocm_v7_1
+            superbuild_dir: build\local-superbuild-rocm_v7_1
+            superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=rocm_v7_1 -DAMDGPU_TARGETS=gfx1010'
+            expected_payload: lib\ollama\rocm_v7_1\ggml-hip.dll
          - preset: Vulkan
            install: https://sdk.lunarg.com/sdk/download/1.4.321.1/windows/vulkansdk-windows-X64-1.4.321.1.exe
+            superbuild_target: ollama-llama-server-vulkan
+            superbuild_dir: build\local-superbuild-vulkan
+            superbuild_args: '-DOLLAMA_LLAMA_BACKENDS=vulkan'
+            expected_payload: lib\ollama\vulkan\ggml-vulkan.dll
          - preset: 'MLX CUDA 13'
            install: https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda_13.0.0_windows.exe
            cudnn-install: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.18.1.3_cuda13-archive.zip
-            flags: '-DCMAKE_CUDA_ARCHITECTURES=80 -DMLX_CUDA_ARCHITECTURES=80-virtual'
+            superbuild_target: ollama-mlx-cuda_v13
+            superbuild_dir: build\local-superbuild-mlx-cuda_v13
+            superbuild_args: '-DOLLAMA_MLX_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=80 -DMLX_CUDA_ARCHITECTURES=80-virtual'
+            expected_payload: lib\ollama\mlx_cuda_v13\mlx.dll
+            install-go: true
            cuda-components:
              - '"cudart"'
              - '"nvcc"'
@@ -203,6 +284,10 @@ jobs:
          }

          $vulkanPath = (Resolve-Path "C:\VulkanSDK\*").path
+          $vulkanRuntime = Join-Path $vulkanPath "Helpers\VulkanRT.exe"
+          if (Test-Path $vulkanRuntime) {
+            Start-Process -FilePath $vulkanRuntime -ArgumentList "/s" -NoNewWindow -Wait
+          }
          echo "$vulkanPath\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
          echo "VULKAN_SDK=$vulkanPath" >> $env:GITHUB_ENV
      - if: matrix.preset == 'MLX CUDA 13'
@@ -232,18 +317,44 @@ jobs:
            C:\Program Files\NVIDIA\CUDNN
          key: ${{ matrix.install }}-${{ matrix.cudnn-install }}
      - uses: actions/checkout@v4
+      - if: matrix.superbuild_target == 'ollama-local' || matrix.install-go
+        uses: actions/setup-go@v5
+        with:
+          go-version-file: 'go.mod'
      - uses: actions/cache@v4
        with:
          path: ${{ github.workspace }}\.ccache
-          key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.vendorsha }}
-      - run: |
+          key: ccache-${{ runner.os }}-${{ runner.arch }}-${{ matrix.preset }}-${{ needs.changes.outputs.enginehash }}
+      - name: Build native superbuild
+        if: matrix.superbuild_target
+        run: |
+          $ErrorActionPreference = "Stop"
          Import-Module 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\Tools\Microsoft.VisualStudio.DevShell.dll'
          Enter-VsDevShell -VsInstallPath 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise' -SkipAutomaticLocation  -DevCmdArguments '-arch=x64 -no_logo'
-          cmake --preset "${{ matrix.preset }}" ${{ matrix.flags }}
-          cmake --build --preset "${{ matrix.preset }}" -- -l $([Environment]::ProcessorCount)
+          cmake -S . -B "${{ matrix.superbuild_dir }}" ${{ matrix.superbuild_args }}
+          $env:CMAKE_BUILD_PARALLEL_LEVEL = [Environment]::ProcessorCount
+          cmake --build "${{ matrix.superbuild_dir }}" --target "${{ matrix.superbuild_target }}" -- -l $([Environment]::ProcessorCount)
+          if (!(Test-Path "${{ matrix.superbuild_dir }}\${{ matrix.expected_payload }}")) {
+            throw "missing ${{ matrix.expected_payload }}"
+          }
        env:
          CMAKE_GENERATOR: Ninja
-
+      - name: Verify local superbuild install
+        if: matrix.superbuild_target == 'ollama-local'
+        run: |
+          $ErrorActionPreference = "Stop"
+          & ".\ollama.exe" --version
+          & "${{ matrix.superbuild_dir }}\lib\ollama\llama-server.exe" --version
+          if (!(Test-Path "${{ matrix.superbuild_dir }}\lib\ollama\llama-quantize.exe")) {
+            throw "missing llama-quantize.exe"
+          }
+          $installPrefix = Join-Path $env:RUNNER_TEMP "ollama-local"
+          cmake --install "${{ matrix.superbuild_dir }}" --component ollama-local --prefix "$installPrefix"
+          & "$installPrefix\bin\ollama.exe" --version
+          & "$installPrefix\lib\ollama\llama-server.exe" --version
+          if (!(Test-Path "$installPrefix\lib\ollama\llama-quantize.exe")) {
+            throw "missing installed llama-quantize.exe"
+          }
  go_mod_tidy:
    runs-on: ubuntu-latest
    steps:
@@ -266,7 +377,9 @@ jobs:
          go-version-file: 'go.mod'
          cache-dependency-path: |
            go.sum
-            Makefile.sync
+            LLAMA_CPP_VERSION
+            MLX_VERSION
+            MLX_C_VERSION
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
@@ -280,6 +393,17 @@ jobs:
        if: ${{ startsWith(matrix.os, 'ubuntu') }}
        working-directory: ./app/ui/app
        run: npm test
+      - name: Verify MLX generated files are current
+        if: ${{ startsWith(matrix.os, 'ubuntu') }}
+        run: |
+          cmake -S . -B build/mlx-generate -DOLLAMA_MLX_BACKENDS=cuda_v13
+          cmake --build build/mlx-generate --target ollama-mlx-generate-wrappers
+          git diff --exit-code -- \
+            x/imagegen/mlx/mlx.h \
+            x/imagegen/mlx/mlx.c \
+            x/mlxrunner/mlx/generated.h \
+            x/mlxrunner/mlx/generated.c \
+            x/mlxrunner/mlx/include/mlx/c
      - name: Run go generate
        run: go generate ./...

@@ -294,12 +418,3 @@ jobs:
      - uses: golangci/golangci-lint-action@v9
        with:
          only-new-issues: true
-
-  patches:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - name: Verify patches apply cleanly and do not change files
-        run: |
-          make -f Makefile.sync clean checkout apply-patches sync
-          git diff --compact-summary --exit-code
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.21)
+cmake_minimum_required(VERSION 3.24)

 project(Ollama C CXX)

@@ -23,30 +23,23 @@ include(GNUInstallDirs)

 find_package(Threads REQUIRED)

-set(CMAKE_BUILD_TYPE Release)
-set(BUILD_SHARED_LIBS ON)
+if(NOT CMAKE_CONFIGURATION_TYPES AND NOT CMAKE_BUILD_TYPE)
+    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
+endif()
+
+# These defaults can be overridden by presets (e.g., for static macOS llama-server builds)
+if(NOT DEFINED BUILD_SHARED_LIBS)
+    set(BUILD_SHARED_LIBS ON)
+endif()

 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
-set(CMAKE_CXX_EXTENSIONS ON) # Recent versions of MLX Requires gnu++17 extensions to compile properly
+set(CMAKE_CXX_EXTENSIONS ON) # Recent versions of MLX require gnu++17 extensions to compile properly

-set(GGML_BUILD ON)
-set(GGML_SHARED ON)
-set(GGML_CCACHE ON)
-set(GGML_BACKEND_DL ON)
-set(GGML_BACKEND_SHARED ON)
-set(GGML_SCHED_MAX_COPIES 4)
-
-set(GGML_LLAMAFILE ON)
-set(GGML_CUDA_PEER_MAX_BATCH_SIZE 128)
-set(GGML_CUDA_GRAPHS ON)
-set(GGML_CUDA_FA ON)
-set(GGML_CUDA_COMPRESSION_MODE default)
-
-if((CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
-    OR (NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_SYSTEM_PROCESSOR MATCHES "arm|aarch64|ARM64|ARMv[0-9]+"))
-    set(GGML_CPU_ALL_VARIANTS ON)
-endif()
+# GGML backend for inference is provided by llama-server (built separately via
+# llama/server/CMakeLists.txt using FetchContent from the pinned llama.cpp source).
+# The root CMake project is the orchestration entrypoint; backend-specific
+# build rules live in subprojects under cmake/.

 if(APPLE)
    set(CMAKE_BUILD_RPATH "@loader_path")
@@ -55,7 +48,8 @@ if(APPLE)
 endif()

 set(OLLAMA_BUILD_DIR ${CMAKE_BINARY_DIR}/lib/ollama)
-set(OLLAMA_INSTALL_DIR ${CMAKE_INSTALL_PREFIX}/lib/ollama/${OLLAMA_RUNNER_DIR})
+set(OLLAMA_LIB_DIR "lib/ollama" CACHE STRING "Install destination for Ollama runtime payloads")
+set(OLLAMA_INSTALL_DIR ${OLLAMA_LIB_DIR}/${OLLAMA_RUNNER_DIR})

 set(CMAKE_RUNTIME_OUTPUT_DIRECTORY         ${OLLAMA_BUILD_DIR})
 set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_DEBUG   ${OLLAMA_BUILD_DIR})
@@ -64,314 +58,9 @@ set(CMAKE_LIBRARY_OUTPUT_DIRECTORY         ${OLLAMA_BUILD_DIR})
 set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_DEBUG   ${OLLAMA_BUILD_DIR})
 set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE ${OLLAMA_BUILD_DIR})

-# Store ggml include paths for use with target_include_directories later.
-# We avoid global include_directories() to prevent polluting the include path
-# for other projects like MLX (whose openblas dependency has its own common.h).
-set(GGML_INCLUDE_DIRS
-    ${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src
-    ${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/include
-    ${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cpu
-    ${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cpu/amx
-)
-
-add_compile_definitions(NDEBUG GGML_VERSION=0x0 GGML_COMMIT=0x0)
-
-# Define GGML version variables for shared library SOVERSION
-# These are required by ggml/src/CMakeLists.txt for proper library versioning
-set(GGML_VERSION_MAJOR 0)
-set(GGML_VERSION_MINOR 0)
-set(GGML_VERSION_PATCH 0)
-set(GGML_VERSION "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
-
-set(GGML_CPU ON)
-add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src)
-set_property(TARGET ggml PROPERTY EXCLUDE_FROM_ALL TRUE)
-
-get_target_property(CPU_VARIANTS ggml-cpu MANUALLY_ADDED_DEPENDENCIES)
-if(NOT CPU_VARIANTS)
-    set(CPU_VARIANTS "ggml-cpu")
-endif()
-
-# Apply ggml include directories to ggml targets only (not globally)
-target_include_directories(ggml-base PRIVATE ${GGML_INCLUDE_DIRS})
-foreach(variant ${CPU_VARIANTS})
-    if(TARGET ${variant})
-        target_include_directories(${variant} PRIVATE ${GGML_INCLUDE_DIRS})
-    endif()
-endforeach()
-
-install(TARGETS ggml-base ${CPU_VARIANTS}
-    RUNTIME_DEPENDENCIES
-        PRE_EXCLUDE_REGEXES ".*"
-    RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CPU
-    LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CPU
-    FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CPU
-)
-
-check_language(CUDA)
-if(CMAKE_CUDA_COMPILER)
-    if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.24" AND NOT CMAKE_CUDA_ARCHITECTURES)
-        set(CMAKE_CUDA_ARCHITECTURES "native")
-    endif()
-
-    find_package(CUDAToolkit)
-    add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-cuda)
-    target_include_directories(ggml-cuda PRIVATE ${GGML_INCLUDE_DIRS})
-    install(TARGETS ggml-cuda
-        RUNTIME_DEPENDENCIES
-            DIRECTORIES ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR}
-            PRE_INCLUDE_REGEXES cublas cublasLt cudart
-            PRE_EXCLUDE_REGEXES ".*"
-        RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CUDA
-        LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT CUDA
-    )
-endif()
-
-set(WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX "^gfx(908|90a|1200|1201):xnack[+-]$"
-    CACHE STRING
-    "Regular expression describing AMDGPU_TARGETS not supported on Windows. Override to force building these targets. Default \"^gfx(908|90a|1200|1201):xnack[+-]$\"."
-)
-
-check_language(HIP)
-if(CMAKE_HIP_COMPILER)
-    set(HIP_PLATFORM "amd")
-
-    if(NOT AMDGPU_TARGETS)
-        find_package(hip REQUIRED)
-        list(FILTER AMDGPU_TARGETS INCLUDE REGEX "^gfx(94[012]|101[02]|1030|110[012]|120[01])$")
-    endif()
-
-    if(WIN32 AND WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX)
-        list(FILTER AMDGPU_TARGETS EXCLUDE REGEX ${WINDOWS_AMDGPU_TARGETS_EXCLUDE_REGEX})
-    endif()
-
-    if(AMDGPU_TARGETS)
-        find_package(hip REQUIRED)
-        add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-hip)
-        target_include_directories(ggml-hip PRIVATE ${GGML_INCLUDE_DIRS})
-
-        if (WIN32)
-            target_compile_definitions(ggml-hip PRIVATE GGML_CUDA_NO_PEER_COPY)
-        endif()
-
-        target_compile_definitions(ggml-hip PRIVATE GGML_HIP_NO_VMM)
-
-        install(TARGETS ggml-hip
-            RUNTIME_DEPENDENCY_SET rocm
-            RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
-            LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
-        )
-        install(RUNTIME_DEPENDENCY_SET rocm
-                DIRECTORIES ${HIP_BIN_INSTALL_DIR} ${HIP_LIB_INSTALL_DIR}
-                PRE_INCLUDE_REGEXES hipblas rocblas amdhip64 rocsolver amd_comgr hsa-runtime64 rocsparse tinfo rocprofiler-register roctx64 rocroller drm drm_amdgpu numa elf
-                PRE_EXCLUDE_REGEXES ".*"
-                POST_EXCLUDE_REGEXES "system32"
-            RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
-            LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP
-        )
-
-        foreach(HIP_LIB_BIN_INSTALL_DIR IN ITEMS ${HIP_BIN_INSTALL_DIR} ${HIP_LIB_INSTALL_DIR})
-            if(EXISTS ${HIP_LIB_BIN_INSTALL_DIR}/rocblas)
-                install(DIRECTORY ${HIP_LIB_BIN_INSTALL_DIR}/rocblas DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT HIP)
-                break()
-            endif()
-        endforeach()
-    endif()
-endif()
-
-if(NOT APPLE)
-    find_package(Vulkan)
-    if(Vulkan_FOUND)
-        add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/ml/backend/ggml/ggml/src/ggml-vulkan)
-        target_include_directories(ggml-vulkan PRIVATE ${GGML_INCLUDE_DIRS})
-        install(TARGETS ggml-vulkan
-            RUNTIME_DEPENDENCIES
-                PRE_INCLUDE_REGEXES vulkan
-                PRE_EXCLUDE_REGEXES ".*"
-            RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT Vulkan
-            LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT Vulkan
-        )
-    endif()
-endif()
-
-option(MLX_ENGINE "Enable MLX backend" OFF)
-if(MLX_ENGINE)
-    message(STATUS "Setting up MLX (this takes a while...)")
-    add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/x/imagegen/mlx)
-
-    # Find CUDA toolkit if MLX is built with CUDA support
-    find_package(CUDAToolkit)
-
-    # Build list of directories for runtime dependency resolution
-    set(MLX_RUNTIME_DIRS ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR})
-    # Add cuDNN bin paths for DLLs (Windows MLX CUDA builds)
-    # CUDNN_ROOT_DIR is the standard CMake variable for cuDNN location
-    if(DEFINED ENV{CUDNN_ROOT_DIR})
-        # cuDNN 9.x has versioned subdirectories under bin/ (e.g., bin/13.0/)
-        file(GLOB CUDNN_BIN_SUBDIRS "$ENV{CUDNN_ROOT_DIR}/bin/*")
-        list(APPEND MLX_RUNTIME_DIRS ${CUDNN_BIN_SUBDIRS})
-    endif()
-    # Add build output directory and MLX dependency build directories
-    list(APPEND MLX_RUNTIME_DIRS ${OLLAMA_BUILD_DIR})
-    # OpenBLAS DLL location (pre-built zip extracts into openblas-src/bin/)
-    list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/openblas-src/bin)
-    # NCCL: on Linux, if real NCCL is found, cmake bundles libnccl.so via the
-    # regex below. If NCCL is not found, MLX links a static stub (OBJECT lib)
-    # so there is no runtime dependency. This path covers the stub build dir
-    # for windows so we include the DLL in our dependencies.
-    list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/distributed/nccl/nccl_stub-prefix/src/nccl_stub-build/Release)
-
-    # Base regexes for runtime dependencies (cross-platform)
-    set(MLX_INCLUDE_REGEXES cublas cublasLt cudart cufft nvrtc nvrtc-builtins cudnn nccl openblas gfortran)
-    # On Windows, also include dl.dll (dlfcn-win32 POSIX emulation layer)
-    if(WIN32)
-        list(APPEND MLX_INCLUDE_REGEXES "^dl\\.dll$")
-    endif()
-
-    # Split mlx/mlxc libraries from runtime deps to avoid stripping deps
-    install(TARGETS mlx mlxc
-        RUNTIME_DEPENDENCY_SET mlx_runtime_deps
-        RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
-        LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
-        FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
-    )
-    install(RUNTIME_DEPENDENCY_SET mlx_runtime_deps
-        DIRECTORIES ${MLX_RUNTIME_DIRS}
-        PRE_INCLUDE_REGEXES ${MLX_INCLUDE_REGEXES}
-        PRE_EXCLUDE_REGEXES ".*"
-        RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
-        LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
-    )
-
-    if(TARGET jaccl)
-        install(TARGETS jaccl
-            RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
-            LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
-            FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
-        )
-    endif()
-
-    # Install the Metal library for macOS arm64 (must be colocated with the binary)
-    # Metal backend is only built for arm64, not x86_64
-    if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
-        install(FILES ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/backend/metal/kernels/mlx.metallib
-            DESTINATION ${OLLAMA_INSTALL_DIR}
-            COMPONENT MLX)
-    endif()
-
-    # Install headers for NVRTC JIT compilation at runtime.
-    # MLX's own install rules use the default component so they get skipped by
-    # --component MLX. Headers are installed alongside libmlx in OLLAMA_INSTALL_DIR.
-    #
-    # Layout:
-    #   ${OLLAMA_INSTALL_DIR}/include/cccl/{cuda,nv}/  — CCCL headers
-    #   ${OLLAMA_INSTALL_DIR}/include/*.h               — CUDA toolkit headers
-    #
-    # MLX's jit_module.cpp resolves CCCL via
-    #   current_binary_dir()[.parent_path()] / "include" / "cccl"
-    # On Linux, MLX's jit_module.cpp resolves CCCL via
-    #   current_binary_dir().parent_path() / "include" / "cccl", so we create a
-    #   symlink from lib/ollama/include -> ${OLLAMA_RUNNER_DIR}/include
-    #   This will need refinement if we add multiple CUDA versions for MLX in the future.
-    # CUDA runtime headers are found via CUDA_PATH env var (set by mlxrunner).
-    if(EXISTS ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda)
-        install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda
-            DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
-            COMPONENT MLX)
-        install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/nv
-            DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
-            COMPONENT MLX)
-        if(NOT WIN32 AND NOT APPLE)
-            install(CODE "
-                set(_link \"${CMAKE_INSTALL_PREFIX}/lib/ollama/include\")
-                set(_target \"${OLLAMA_RUNNER_DIR}/include\")
-                if(NOT EXISTS \${_link})
-                    execute_process(COMMAND \${CMAKE_COMMAND} -E create_symlink \${_target} \${_link})
-                endif()
-            " COMPONENT MLX)
-        endif()
-    endif()
-
-    # Install minimal CUDA toolkit headers needed by MLX JIT kernels.
-    # These are the transitive closure of includes from mlx/backend/cuda/device/*.cuh.
-    # The Go mlxrunner sets CUDA_PATH to OLLAMA_INSTALL_DIR so MLX finds them at
-    # $CUDA_PATH/include/*.h via NVRTC --include-path.
-    if(CUDAToolkit_FOUND)
-        # CUDAToolkit_INCLUDE_DIRS may be a semicolon-separated list
-        # (e.g. ".../include;.../include/cccl"). Find the entry that
-        # contains the CUDA runtime headers we need.
-        set(_cuda_inc "")
-        foreach(_dir ${CUDAToolkit_INCLUDE_DIRS})
-            if(EXISTS "${_dir}/cuda_runtime_api.h")
-                set(_cuda_inc "${_dir}")
-                break()
-            endif()
-        endforeach()
-        if(NOT _cuda_inc)
-            message(WARNING "Could not find cuda_runtime_api.h in CUDAToolkit_INCLUDE_DIRS: ${CUDAToolkit_INCLUDE_DIRS}")
-        else()
-            set(_dst "${OLLAMA_INSTALL_DIR}/include")
-            set(_MLX_JIT_CUDA_HEADERS
-                builtin_types.h
-                cooperative_groups.h
-                cuda_bf16.h
-                cuda_bf16.hpp
-                cuda_device_runtime_api.h
-                cuda_fp16.h
-                cuda_fp16.hpp
-                cuda_fp8.h
-                cuda_fp8.hpp
-                cuda_runtime_api.h
-                device_types.h
-                driver_types.h
-                math_constants.h
-                surface_types.h
-                texture_types.h
-                vector_functions.h
-                vector_functions.hpp
-                vector_types.h
-            )
-            foreach(_hdr ${_MLX_JIT_CUDA_HEADERS})
-                install(FILES "${_cuda_inc}/${_hdr}"
-                    DESTINATION ${_dst}
-                    COMPONENT MLX)
-            endforeach()
-            # Subdirectory headers
-            install(DIRECTORY "${_cuda_inc}/cooperative_groups"
-                DESTINATION ${_dst}
-                COMPONENT MLX
-                FILES_MATCHING PATTERN "*.h")
-            install(FILES "${_cuda_inc}/crt/host_defines.h"
-                DESTINATION "${_dst}/crt"
-                COMPONENT MLX)
-        endif()
-    endif()
-
-    # On Windows, explicitly install dl.dll (dlfcn-win32 POSIX dlopen emulation)
-    # RUNTIME_DEPENDENCIES auto-excludes it via POST_EXCLUDE_FILES_STRICT because
-    # dlfcn-win32 is a known CMake target with its own install rules (which install
-    # to the wrong destination). We must install it explicitly here.
-    if(WIN32)
-        install(FILES ${OLLAMA_BUILD_DIR}/dl.dll
-            DESTINATION ${OLLAMA_INSTALL_DIR}
-            COMPONENT MLX)
-    endif()
-
-    # Manually install CUDA runtime libraries that MLX loads via dlopen
-    # (not detected by RUNTIME_DEPENDENCIES since they aren't link-time deps)
-    if(CUDAToolkit_FOUND)
-        file(GLOB MLX_CUDA_LIBS
-            "${CUDAToolkit_LIBRARY_DIR}/libcudart.so*"
-            "${CUDAToolkit_LIBRARY_DIR}/libcublas.so*"
-            "${CUDAToolkit_LIBRARY_DIR}/libcublasLt.so*"
-            "${CUDAToolkit_LIBRARY_DIR}/libnvrtc.so*"
-            "${CUDAToolkit_LIBRARY_DIR}/libnvrtc-builtins.so*"
-            "${CUDAToolkit_LIBRARY_DIR}/libcufft.so*"
-            "${CUDAToolkit_LIBRARY_DIR}/libcudnn.so*")
-        if(MLX_CUDA_LIBS)
-            install(FILES ${MLX_CUDA_LIBS}
-                DESTINATION ${OLLAMA_INSTALL_DIR}
-                COMPONENT MLX_VENDOR)
-        endif()
-    endif()
+if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/llama/server/CMakeLists.txt")
+    set(OLLAMA_HAVE_LLAMA_SERVER TRUE)
+else()
+    set(OLLAMA_HAVE_LLAMA_SERVER FALSE)
 endif()
+include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/local.cmake)
--- a/CMakePresets.json
+++ b/CMakePresets.json
@@ -11,109 +11,10 @@
      }
    },
    {
-      "name": "CPU",
-      "inherits": [ "Default" ]
-    },
-    {
-      "name": "CUDA",
-      "inherits": [ "Default" ]
-    },
-    {
-      "name": "CUDA 11",
-      "inherits": [ "CUDA" ],
-      "cacheVariables": {
-        "CMAKE_CUDA_ARCHITECTURES": "50-virtual;60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-virtual;87-virtual;89-virtual;90-virtual",
-        "CMAKE_CUDA_FLAGS": "-Wno-deprecated-gpu-targets -t 2",
-        "OLLAMA_RUNNER_DIR": "cuda_v11"
-      }
-    },
-    {
-      "name": "CUDA 12",
-      "inherits": [ "CUDA" ],
-      "cacheVariables": {
-        "CMAKE_CUDA_ARCHITECTURES": "50;52;60;61;70;75;80;86;89;90;90a;120",
-        "CMAKE_CUDA_FLAGS": "-Wno-deprecated-gpu-targets -t 2",
-        "OLLAMA_RUNNER_DIR": "cuda_v12"
-      }
-    },
-    {
-      "name": "CUDA 13",
-      "inherits": [ "CUDA" ],
-      "cacheVariables": {
-        "CMAKE_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;87-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual",
-        "CMAKE_CUDA_FLAGS": "-t 2",
-        "OLLAMA_RUNNER_DIR": "cuda_v13"
-      }
-    },
-    {
-      "name": "JetPack 5",
-      "inherits": [ "CUDA" ],
-      "cacheVariables": {
-        "CMAKE_CUDA_ARCHITECTURES": "72;87",
-        "OLLAMA_RUNNER_DIR": "cuda_jetpack5"
-      }
-    },
-    {
-      "name": "JetPack 6",
-      "inherits": [ "CUDA" ],
-      "cacheVariables": {
-        "CMAKE_CUDA_ARCHITECTURES": "87",
-        "OLLAMA_RUNNER_DIR": "cuda_jetpack6"
-      }
-    },
-    {
-      "name": "ROCm",
+      "name": "MLX Metal",
      "inherits": [ "Default" ],
      "cacheVariables": {
-        "CMAKE_HIP_PLATFORM": "amd"
-      }
-    },
-    {
-      "name": "ROCm 6",
-      "inherits": [ "ROCm" ],
-      "cacheVariables": {
-        "CMAKE_HIP_FLAGS": "-parallel-jobs=4",
-        "AMDGPU_TARGETS": "gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-",
-        "OLLAMA_RUNNER_DIR": "rocm"
-      }
-    },
-    {
-      "name": "ROCm 7",
-      "inherits": [ "ROCm" ],
-      "cacheVariables": {
-        "CMAKE_HIP_FLAGS": "-parallel-jobs=4",
-        "AMDGPU_TARGETS": "gfx942;gfx950;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1103;gfx1150;gfx1151;gfx1200;gfx1201;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-",
-        "OLLAMA_RUNNER_DIR": "rocm"
-      }
-    },
-    {
-      "name": "Vulkan",
-      "inherits": [ "Default" ],
-      "cacheVariables": {
-        "OLLAMA_RUNNER_DIR": "vulkan"
-      }
-    },
-    {
-      "name": "MLX",
-      "inherits": [ "Default" ],
-      "cacheVariables": {
-        "MLX_ENGINE": "ON",
-        "OLLAMA_RUNNER_DIR": "mlx"
-      }
-    },
-    {
-      "name": "MLX CUDA 12",
-      "inherits": [ "MLX", "CUDA 12" ],
-      "cacheVariables": {
-        "OLLAMA_RUNNER_DIR": "mlx_cuda_v12"
-      }
-    },
-    {
-      "name": "MLX CUDA 13",
-      "inherits": [ "MLX", "CUDA 13" ],
-      "cacheVariables": {
-        "MLX_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual",
-        "OLLAMA_RUNNER_DIR": "mlx_cuda_v13"
+        "OLLAMA_MLX_BACKENDS": "metal_v3;metal_v4"
      }
    }
  ],
@ -124,74 +25,9 @@
      "configuration": "Release"
    },
    {
-      "name": "CPU",
-      "configurePreset": "Default",
-      "targets": [ "ggml-cpu" ]
-    },
-    {
-      "name": "CUDA",
-      "configurePreset": "CUDA",
-      "targets": [ "ggml-cuda" ]
-    },
-    {
-      "name": "CUDA 11",
-      "inherits": [ "CUDA" ],
-      "configurePreset": "CUDA 11"
-    },
-    {
-      "name": "CUDA 12",
-      "inherits": [ "CUDA" ],
-      "configurePreset": "CUDA 12"
-    },
-    {
-      "name": "CUDA 13",
-      "inherits": [ "CUDA" ],
-      "configurePreset": "CUDA 13"
-    },
-    {
-      "name": "JetPack 5",
-      "inherits": [ "CUDA" ],
-      "configurePreset": "JetPack 5"
-    },
-    {
-      "name": "JetPack 6",
-      "inherits": [ "CUDA" ],
-      "configurePreset": "JetPack 6"
-    },
-    {
-      "name": "ROCm",
-      "configurePreset": "ROCm",
-      "targets": [ "ggml-hip" ]
-    },
-    {
-      "name": "ROCm 6",
-      "inherits": [ "ROCm" ],
-      "configurePreset": "ROCm 6"
-    },
-    {
-      "name": "ROCm 7",
-      "inherits": [ "ROCm" ],
-      "configurePreset": "ROCm 7"
-    },
-    {
-      "name": "Vulkan",
-      "targets": [ "ggml-vulkan" ],
-      "configurePreset": "Vulkan"
-    },
-    {
-      "name": "MLX",
-      "targets": [ "mlx", "mlxc" ],
-      "configurePreset": "MLX"
-    },
-    {
-      "name": "MLX CUDA 12",
-      "targets": [ "mlx", "mlxc" ],
-      "configurePreset": "MLX CUDA 12"
-    },
-    {
-      "name": "MLX CUDA 13",
-      "targets": [ "mlx", "mlxc" ],
-      "configurePreset": "MLX CUDA 13"
+      "name": "MLX Metal",
+      "targets": [ "ollama-mlx-backends" ],
+      "configurePreset": "MLX Metal"
    }
  ]
 }
--- a/259
+++ b/259
@ -37,116 +37,150 @@ RUN dnf install -y unzip \
 ENV CMAKE_GENERATOR=Ninja
 ENV LDFLAGS=-s

-FROM base AS cpu
+#
+# GPU toolchain stages — provide compilers for llama-server GPU builds
+#
+
+FROM base AS cpu-deps
 RUN dnf install -y gcc-toolset-11-gcc gcc-toolset-11-gcc-c++
 ENV PATH=/opt/rh/gcc-toolset-11/root/usr/bin:$PATH
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'CPU' \
-        && cmake --build --preset 'CPU' -- -l $(nproc) \
-        && cmake --install build --component CPU --strip

-FROM base AS cuda-11
-ARG CUDA11VERSION=11.8
-RUN dnf install -y cuda-toolkit-${CUDA11VERSION//./-}
-ENV PATH=/usr/local/cuda-11/bin:$PATH
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'CUDA 11' \
-        && cmake --build --preset 'CUDA 11' -- -l $(nproc) \
-        && cmake --install build --component CUDA --strip
-
-FROM base AS cuda-12
+FROM base AS cuda-12-deps
 ARG CUDA12VERSION=12.8
 RUN dnf install -y cuda-toolkit-${CUDA12VERSION//./-}
 ENV PATH=/usr/local/cuda-12/bin:$PATH
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'CUDA 12' \
-        && cmake --build --preset 'CUDA 12' -- -l $(nproc) \
-        && cmake --install build --component CUDA --strip

-
-FROM base AS cuda-13
+FROM base AS cuda-13-deps
 ARG CUDA13VERSION=13.0
 RUN dnf install -y cuda-toolkit-${CUDA13VERSION//./-}
 ENV PATH=/usr/local/cuda-13/bin:$PATH
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'CUDA 13' \
-        && cmake --build --preset 'CUDA 13' -- -l $(nproc) \
-        && cmake --install build --component CUDA --strip

+FROM base AS rocm-7-deps
+ENV PATH=/opt/rocm/llvm/bin:/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/opt/rocm/bin:$PATH

-FROM base AS rocm-7
-ENV PATH=/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/opt/rocm/bin:/opt/rocm/hcc/bin:$PATH
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'ROCm 7' \
-        && cmake --build --preset 'ROCm 7' -- -l $(nproc) \
-        && cmake --install build --component HIP --strip
-RUN rm -f dist/lib/ollama/rocm/rocblas/library/*gfx90[06]*
-
-FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK5VERSION} AS jetpack-5
-ARG CMAKEVERSION
-ARG NINJAVERSION
-RUN apt-get update && apt-get install -y curl ccache unzip \
-    && curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
-    && curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
-    && unzip /tmp/ninja.zip -d /usr/local/bin \
-    && rm /tmp/ninja.zip
-ENV CMAKE_GENERATOR=Ninja
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'JetPack 5' \
-        && cmake --build --preset 'JetPack 5' -- -l $(nproc) \
-        && cmake --install build --component CUDA --strip
-
-FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK6VERSION} AS jetpack-6
-ARG CMAKEVERSION
-ARG NINJAVERSION
-RUN apt-get update && apt-get install -y curl ccache unzip \
-    && curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
-    && curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
-    && unzip /tmp/ninja.zip -d /usr/local/bin \
-    && rm /tmp/ninja.zip
-ENV CMAKE_GENERATOR=Ninja
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
-RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'JetPack 6' \
-        && cmake --build --preset 'JetPack 6' -- -l $(nproc) \
-        && cmake --install build --component CUDA --strip
-
-FROM base AS vulkan
+FROM base AS vulkan-deps
 ARG VULKANVERSION
 RUN ln -s /usr/bin/python3 /usr/bin/python \
    && wget https://sdk.lunarg.com/sdk/download/${VULKANVERSION}/linux/vulkansdk-linux-x86_64-${VULKANVERSION}.tar.xz -O /tmp/vulkansdk.tar.xz \
    && tar xvf /tmp/vulkansdk.tar.xz -C /tmp \
    && /tmp/${VULKANVERSION}/vulkansdk -j 8 vulkan-headers \
+    && /tmp/${VULKANVERSION}/vulkansdk -j 8 spirv-headers \
    && /tmp/${VULKANVERSION}/vulkansdk -j 8 shaderc \
    && cp -r /tmp/${VULKANVERSION}/x86_64/include/* /usr/local/include/ \
    && cp -r /tmp/${VULKANVERSION}/x86_64/lib/* /usr/local/lib \
+    && cp -r /tmp/${VULKANVERSION}/x86_64/share/* /usr/local/share/ \
    && cp -r /tmp/${VULKANVERSION}/x86_64/bin/* /usr/local/bin/ \
    && rm -rf /tmp/${VULKANVERSION} /tmp/vulkansdk.tar.xz
-COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
+ENV VULKAN_SDK=/usr/local
+
+#
+# llama-server stages — rebuild when LLAMA_CPP_VERSION, llama/server/, or llama/compat/ changes.
+#
+# CPU stage: llama-server + ggml-base + ggml-cpu variants → lib/ollama/
+# GPU stages: GPU backend .so only → lib/ollama/<variant>/
+#
+
+FROM cpu-deps AS llama-server-cpu
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
 RUN --mount=type=cache,target=/root/.ccache \
-    cmake --preset 'Vulkan' \
-        && cmake --build --preset 'Vulkan' -- -l $(nproc) \
-        && cmake --install build --component Vulkan --strip
+    cmake -S llama/server --preset cpu \
+        && cmake --build build/llama-server-cpu -- -l $(nproc) \
+        && cmake --install build/llama-server-cpu --component llama-server --strip \
+        && for lib in \
+            /usr/lib64/libgomp.so* \
+            /usr/lib64/libomp.so* \
+            /opt/rh/gcc-toolset-11/root/usr/lib64/libgomp.so* \
+            /opt/rh/gcc-toolset-11/root/usr/lib64/libomp.so*; do \
+                [ -e "$lib" ] && cp -a "$lib" dist/lib/ollama/ || true; \
+            done
+
+FROM cuda-12-deps AS llama-server-cuda_v12
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
+RUN --mount=type=cache,target=/root/.ccache \
+    cmake -S llama/server --preset llama_cuda_v12_linux \
+        && cmake --build build/llama-server-cuda_v12 -- -l $(nproc) \
+        && cmake --install build/llama-server-cuda_v12 --component llama-server --strip
+
+FROM cuda-13-deps AS llama-server-cuda_v13
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
+RUN --mount=type=cache,target=/root/.ccache \
+    cmake -S llama/server --preset llama_cuda_v13_linux \
+        && cmake --build build/llama-server-cuda_v13 -- -l $(nproc) \
+        && cmake --install build/llama-server-cuda_v13 --component llama-server --strip
+
+FROM rocm-7-deps AS llama-server-rocm_v7_2
+ENV CC=clang CXX=clang++
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
+RUN --mount=type=cache,target=/root/.ccache \
+    cmake -S llama/server --preset rocm_v7_2_linux \
+        && cmake --build build/llama-server-rocm_v7_2 -- -l $(nproc) \
+        && cmake --install build/llama-server-rocm_v7_2 --component llama-server --strip
+RUN rm -f dist/lib/ollama/rocm_v7_2/rocblas/library/*gfx90[06]*
+
+FROM vulkan-deps AS llama-server-vulkan
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
+RUN --mount=type=cache,target=/root/.ccache \
+    cmake -S llama/server --preset vulkan \
+        && cmake --build build/llama-server-vulkan -- -l $(nproc) \
+        && cmake --install build/llama-server-vulkan --component llama-server --strip
+
+#
+# JetPack stages — self-contained with their own base images
+#
+
+FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK5VERSION} AS jetpack-5
+ARG CMAKEVERSION
+ARG NINJAVERSION
+RUN apt-get update && apt-get install -y curl ccache git unzip \
+    && curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
+    && curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
+    && unzip /tmp/ninja.zip -d /usr/local/bin \
+    && rm /tmp/ninja.zip
+ENV CMAKE_GENERATOR=Ninja
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
+RUN --mount=type=cache,target=/root/.ccache \
+    cmake -S llama/server --preset llama_cuda_jetpack5 \
+        && cmake --build build/llama-server-cuda_jetpack5 -- -l $(nproc) \
+        && cmake --install build/llama-server-cuda_jetpack5 --component llama-server --strip
+
+FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK6VERSION} AS jetpack-6
+ARG CMAKEVERSION
+ARG NINJAVERSION
+RUN apt-get update && apt-get install -y curl ccache git unzip \
+    && curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1 \
+    && curl -fsSL -o /tmp/ninja.zip https://github.com/ninja-build/ninja/releases/download/v${NINJAVERSION}/ninja-linux-aarch64.zip \
+    && unzip /tmp/ninja.zip -d /usr/local/bin \
+    && rm /tmp/ninja.zip
+ENV CMAKE_GENERATOR=Ninja
+COPY LLAMA_CPP_VERSION .
+COPY llama/server llama/server
+COPY llama/compat llama/compat
+RUN --mount=type=cache,target=/root/.ccache \
+    cmake -S llama/server --preset llama_cuda_jetpack6 \
+        && cmake --build build/llama-server-cuda_jetpack6 -- -l $(nproc) \
+        && cmake --install build/llama-server-cuda_jetpack6 --component llama-server --strip
+
+#
+# MLX stage
+#

 FROM base AS mlx
 ARG CUDA13VERSION=13.0
-#   OLLAMA_MLX_BUILD_JOBS  empty -> ninja gates by load average (-l $(nproc))
 ARG OLLAMA_MLX_BUILD_JOBS=
 ARG OLLAMA_MLX_NVCC_THREADS=2
+ARG MLX_CUDA_RAM_MB=
 RUN dnf install -y cuda-toolkit-${CUDA13VERSION//./-} \
    && dnf install -y openblas-devel lapack-devel \
    && dnf install -y libcudnn9-cuda-13 libcudnn9-devel-cuda-13 \
@ -157,7 +191,7 @@ ENV LAPACK_INCLUDE_DIRS=/usr/include/openblas
 ENV CGO_LDFLAGS="-L/usr/local/cuda-13/lib64 -L/usr/local/cuda-13/targets/x86_64-linux/lib/stubs"
 WORKDIR /go/src/github.com/ollama/ollama
 COPY CMakeLists.txt CMakePresets.json .
-COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
+COPY cmake cmake
 COPY x/imagegen/mlx x/imagegen/mlx
 COPY go.mod go.sum .
 COPY MLX_VERSION MLX_C_VERSION .
@ -173,10 +207,12 @@ RUN --mount=type=cache,target=/root/.ccache \
    && if [ -f /tmp/local-mlx-c/CMakeLists.txt ]; then \
        export OLLAMA_MLX_C_SOURCE=/tmp/local-mlx-c; \
    fi \
-    && cmake --preset 'MLX CUDA 13' -DBLAS_INCLUDE_DIRS=/usr/include/openblas -DLAPACK_INCLUDE_DIRS=/usr/include/openblas -DCMAKE_CUDA_FLAGS="-t ${OLLAMA_MLX_NVCC_THREADS}" \
-        && cmake --build --preset 'MLX CUDA 13' -- -l $(nproc) ${OLLAMA_MLX_BUILD_JOBS:+-j ${OLLAMA_MLX_BUILD_JOBS}} \
-        && cmake --install build --component MLX --strip \
-        && cmake --install build --component MLX_VENDOR
+    && cmake -S . -B build/mlx_cuda_v13 -DOLLAMA_MLX_BACKENDS=cuda_v13 -DBLAS_INCLUDE_DIRS=/usr/include/openblas -DLAPACK_INCLUDE_DIRS=/usr/include/openblas -DCMAKE_CUDA_FLAGS="-t ${OLLAMA_MLX_NVCC_THREADS}" ${MLX_CUDA_RAM_MB:+-DMLX_CUDA_RAM_MB=${MLX_CUDA_RAM_MB}} -DOLLAMA_PAYLOAD_INSTALL_PREFIX=/go/src/github.com/ollama/ollama/dist \
+        && cmake --build build/mlx_cuda_v13 --target ollama-mlx-cuda_v13 -- -l $(nproc) ${OLLAMA_MLX_BUILD_JOBS:+-j ${OLLAMA_MLX_BUILD_JOBS}}
+
+#
+# Go build
+#

 FROM base AS build
 WORKDIR /go/src/github.com/ollama/ollama
@ -194,38 +230,59 @@ ENV CGO_CXXFLAGS="${CGO_CXXFLAGS}"
 RUN --mount=type=cache,target=/root/.cache/go-build \
    go build -trimpath -buildmode=pie -o /bin/ollama .

+#
+# Assembly stages — combine llama-server variants + GPU runtime libs
+#
+
 FROM --platform=linux/amd64 scratch AS amd64
-# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
-COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
-COPY --from=cuda-13 dist/lib/ollama /lib/ollama/
-COPY --from=vulkan  dist/lib/ollama  /lib/ollama/
+COPY --from=llama-server-cpu      dist/lib/ollama /lib/ollama/
+COPY --from=llama-server-cuda_v12 dist/lib/ollama /lib/ollama/
+COPY --from=llama-server-cuda_v13 dist/lib/ollama /lib/ollama/
+COPY --from=llama-server-vulkan   dist/lib/ollama /lib/ollama/
 COPY --from=mlx     /go/src/github.com/ollama/ollama/dist/lib/ollama /lib/ollama/

 FROM --platform=linux/arm64 scratch AS arm64
-# COPY --from=cuda-11 dist/lib/ollama/ /lib/ollama/
-COPY --from=cuda-12 dist/lib/ollama /lib/ollama/
-COPY --from=cuda-13 dist/lib/ollama/ /lib/ollama/
+COPY --from=llama-server-cpu dist/lib/ollama /lib/ollama/
+COPY --from=llama-server-cuda_v12 dist/lib/ollama /lib/ollama/
+COPY --from=llama-server-cuda_v13 dist/lib/ollama /lib/ollama/
 COPY --from=jetpack-5 dist/lib/ollama/ /lib/ollama/
 COPY --from=jetpack-6 dist/lib/ollama/ /lib/ollama/

 FROM scratch AS rocm
-COPY --from=rocm-7 dist/lib/ollama /lib/ollama
+COPY --from=llama-server-cpu  dist/lib/ollama /lib/ollama
+COPY --from=llama-server-rocm_v7_2 dist/lib/ollama /lib/ollama

-FROM ${FLAVOR} AS archive
-COPY --from=cpu dist/lib/ollama /lib/ollama
+FROM --platform=linux/amd64 scratch AS amd64-archive
+COPY --from=amd64 /lib/ollama /lib/ollama/
+COPY --from=llama-server-rocm_v7_2 dist/lib/ollama /lib/ollama/
+
+FROM --platform=linux/arm64 scratch AS arm64-archive
+COPY --from=arm64 /lib/ollama /lib/ollama/
+
+FROM ${TARGETARCH}-archive AS archive
+COPY --from=build /bin/ollama /bin/ollama
+
+FROM ${FLAVOR} AS image-archive
 COPY --from=build /bin/ollama /bin/ollama

 FROM ubuntu:24.04
 ARG APT_MIRROR=http://archive.ubuntu.com/ubuntu
-RUN sed -i "s|http://archive.ubuntu.com/ubuntu|$APT_MIRROR|g" /etc/apt/sources.list.d/ubuntu.sources \
+ARG APT_PORTS_MIRROR=http://ports.ubuntu.com/ubuntu-ports
+RUN sed -i \
+        -e "s|http://archive.ubuntu.com/ubuntu|$APT_MIRROR|g" \
+        -e "s|http://ports.ubuntu.com/ubuntu-ports|$APT_PORTS_MIRROR|g" \
+        /etc/apt/sources.list.d/ubuntu.sources \
    && apt-get update \
    && apt-get install -y ca-certificates libvulkan1 libopenblas0 \
-    && sed -i "s|$APT_MIRROR|http://archive.ubuntu.com/ubuntu|g" /etc/apt/sources.list.d/ubuntu.sources \
+    && sed -i \
+        -e "s|$APT_MIRROR|http://archive.ubuntu.com/ubuntu|g" \
+        -e "s|$APT_PORTS_MIRROR|http://ports.ubuntu.com/ubuntu-ports|g" \
+        /etc/apt/sources.list.d/ubuntu.sources \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
-COPY --from=archive /bin /usr/bin
+COPY --from=image-archive /bin /usr/bin
 ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
-COPY --from=archive /lib/ollama /usr/lib/ollama
+COPY --from=image-archive /lib/ollama /usr/lib/ollama
 ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
 ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
 ENV NVIDIA_VISIBLE_DEVICES=all
--- a/1
+++ b/1
@ -0,0 +1 @@
+b9409
--- a/2
+++ b/2
@ -1 +1 @@
-e8ebdebeeb655feaa85a51f6b24ece5b6d5518d1
+2165dc08d7b33258260aa849d39f087d50e62962
--- a/Makefile.sync
+++ b/Makefile.sync
@ -1,76 +0,0 @@
-UPSTREAM=https://github.com/ggml-org/llama.cpp.git
-WORKDIR=llama/vendor
-FETCH_HEAD=ec98e2002
-
-.PHONY: help
-help:
-	@echo "Available targets:"
-	@echo "    sync                 Sync with upstream repositories"
-	@echo "    checkout             Checkout upstream repository"
-	@echo "    apply-patches        Apply patches to local repository"
-	@echo "    format-patches       Format patches from local repository"
-	@echo "    clean                Clean local repository"
-	@echo
-	@echo "Example:"
-	@echo "    make -f $(lastword $(MAKEFILE_LIST)) clean apply-patches sync"
-
-.PHONY: sync
-sync: llama/build-info.cpp ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal
-
-llama/build-info.cpp: llama/build-info.cpp.in llama/llama.cpp
-	sed -e 's|@FETCH_HEAD@|$(FETCH_HEAD)|' <$< >$@
-
-ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal: ml/backend/ggml/ggml
-	go generate ./$(@D)
-
-.PHONY: llama/llama.cpp
-llama/llama.cpp: llama/vendor
-	rsync -arvzc --delete -f "include LICENSE" -f "merge $@/.rsync-filter" $(addprefix $<,/LICENSE /) $@
-
-.PHONY: ml/backend/ggml/ggml
-ml/backend/ggml/ggml: llama/vendor
-	rsync -arvzc --delete -f "include LICENSE" -f "merge $@/.rsync-filter" $(addprefix $<,/LICENSE /ggml/) $@
-
-PATCHES=$(wildcard llama/patches/*.patch)
-PATCHED=$(join $(dir $(PATCHES)), $(addsuffix ed, $(addprefix ., $(notdir $(PATCHES)))))
-
-.PHONY: apply-patches
-.NOTPARALLEL:
-apply-patches: $(PATCHED)
-
-llama/patches/.%.patched: llama/patches/%.patch
-	@if git -c user.name=nobody -c 'user.email=<>' -C $(WORKDIR) am -3 $(realpath $<); then \
-		touch $@;                                                                           \
-	else                                                                                    \
-		echo "Patch failed. Resolve any conflicts then continue.";                          \
-		echo "1. Run 'git -C $(WORKDIR) am --continue'";                                    \
-		echo "2. Run 'make -f $(lastword $(MAKEFILE_LIST)) format-patches'";                \
-		echo "3. Run 'make -f $(lastword $(MAKEFILE_LIST)) clean apply-patches'";           \
-		exit 1;                                                                             \
-	fi
-
-.PHONY: checkout
-checkout: $(WORKDIR)
-	git -C $(WORKDIR) fetch
-	git -C $(WORKDIR) checkout -f $(FETCH_HEAD)
-
-$(WORKDIR):
-	git clone $(UPSTREAM) $(WORKDIR)
-
-.PHONY: format-patches
-format-patches: llama/patches
-	git -C $(WORKDIR) format-patch \
-		--no-signature \
-		--no-numbered \
-		--zero-commit \
-		-o $(realpath $<) \
-		$(FETCH_HEAD)
-
-.PHONY: clean
-clean: checkout
-	@git -C $(WORKDIR) am --abort || true
-	$(RM) llama/patches/.*.patched
-
-.PHONY: print-base
-print-base:
-	@echo $(FETCH_HEAD)
--- a/api/client.go
+++ b/api/client.go
@ -259,6 +259,10 @@ func (c *Client) stream(ctx context.Context, method, path string, data any, fn f
 		}
 	}

+	if err := scanner.Err(); err != nil {
+		return err
+	}
+
 	return nil
 }

--- a/api/client_test.go
+++ b/api/client_test.go
@ -3,6 +3,7 @@ package api
 import (
 	"encoding/json"
 	"fmt"
+	"io"
 	"net/http"
 	"net/http/httptest"
 	"net/url"
@ -192,6 +193,35 @@ func TestClientStream(t *testing.T) {
 	}
 }

+func TestClientStreamReportsReadErrors(t *testing.T) {
+	client := NewClient(
+		&url.URL{Scheme: "http", Host: "example.com"},
+		&http.Client{Transport: roundTripFunc(func(*http.Request) (*http.Response, error) {
+			body := failingReader{
+				data: []byte(`{"message":{"content":"partial"}}` + "\n"),
+				err:  io.ErrUnexpectedEOF,
+			}
+
+			return &http.Response{
+				StatusCode: http.StatusOK,
+				Status:     "200 OK",
+				Body:       io.NopCloser(&body),
+				Header:     make(http.Header),
+			}, nil
+		})},
+	)
+
+	err := client.stream(t.Context(), http.MethodPost, "/api/chat", nil, func([]byte) error {
+		return nil
+	})
+	if err == nil {
+		t.Fatal("expected stream read error")
+	}
+	if !strings.Contains(err.Error(), io.ErrUnexpectedEOF.Error()) {
+		t.Fatalf("expected unexpected EOF, got %v", err)
+	}
+}
+
 func TestClientDo(t *testing.T) {
 	testCases := []struct {
 		name           string
@ -320,3 +350,23 @@ func TestClientDo(t *testing.T) {
 		})
 	}
 }
+
+type roundTripFunc func(*http.Request) (*http.Response, error)
+
+func (f roundTripFunc) RoundTrip(req *http.Request) (*http.Response, error) {
+	return f(req)
+}
+
+type failingReader struct {
+	data []byte
+	err  error
+}
+
+func (r *failingReader) Read(p []byte) (int, error) {
+	if len(r.data) > 0 {
+		n := copy(p, r.data)
+		r.data = r.data[n:]
+		return n, nil
+	}
+	return 0, r.err
+}
--- a/api/types.go
+++ b/api/types.go
@ -600,12 +600,13 @@ type Options struct {

 // Runner options which must be set when the model is loaded into memory
 type Runner struct {
-	NumCtx    int   `json:"num_ctx,omitempty"`
-	NumBatch  int   `json:"num_batch,omitempty"`
-	NumGPU    int   `json:"num_gpu,omitempty"`
-	MainGPU   int   `json:"main_gpu,omitempty"`
-	UseMMap   *bool `json:"use_mmap,omitempty"`
-	NumThread int   `json:"num_thread,omitempty"`
+	NumCtx          int   `json:"num_ctx,omitempty"`
+	NumBatch        int   `json:"num_batch,omitempty"`
+	NumGPU          int   `json:"num_gpu,omitempty"`
+	MainGPU         *int  `json:"main_gpu,omitempty"`
+	UseMMap         *bool `json:"use_mmap,omitempty"`
+	NumThread       int   `json:"num_thread,omitempty"`
+	DraftNumPredict int   `json:"draft_num_predict,omitempty"`
 }

 // EmbedRequest is the request passed to [Client.Embed].
@ -672,6 +673,9 @@ type CreateRequest struct {
 	// Quantize is the quantization format for the model; leave blank to not change the quantization level.
 	Quantize string `json:"quantize,omitempty"`

+	// DraftQuantize is the quantization format for the draft model.
+	DraftQuantize string `json:"draft_quantize,omitempty"`
+
 	// From is the name of the model or file to use as the source.
 	From string `json:"from,omitempty"`

@ -681,6 +685,9 @@ type CreateRequest struct {
 	// Files is a map of files include when creating the model.
 	Files map[string]string `json:"files,omitempty"`

+	// DraftFiles is a map of draft model files to include when creating the model.
+	DraftFiles map[string]string `json:"draft_files,omitempty"`
+
 	// Adapters is a map of LoRA adapters to include when creating the model.
 	Adapters map[string]string `json:"adapters,omitempty"`

@ -1049,14 +1056,25 @@ func (opts *Options) FromMap(m map[string]any) error {
 				}
 				field.Set(reflect.ValueOf(slice))
 			case reflect.Pointer:
-				var b bool
-				if field.Type() == reflect.TypeOf(&b) {
+				switch field.Type().Elem().Kind() {
+				case reflect.Bool:
 					val, ok := val.(bool)
 					if !ok {
 						return fmt.Errorf("option %q must be of type boolean", key)
 					}
 					field.Set(reflect.ValueOf(&val))
-				} else {
+				case reflect.Int:
+					var i int
+					switch t := val.(type) {
+					case int64:
+						i = int(t)
+					case float64:
+						i = int(t)
+					default:
+						return fmt.Errorf("option %q must be of type integer", key)
+					}
+					field.Set(reflect.ValueOf(&i))
+				default:
 					return fmt.Errorf("unknown type loading config params: %v %v", field.Kind(), field.Type())
 				}
 			default:
@ -1089,11 +1107,12 @@ func DefaultOptions() Options {

 		Runner: Runner{
 			// options set when the model is loaded
-			NumCtx:    int(envconfig.ContextLength()),
-			NumBatch:  512,
-			NumGPU:    -1, // -1 here indicates that NumGPU should be set dynamically
-			NumThread: 0,  // let the runtime decide
-			UseMMap:   nil,
+			NumCtx:          int(envconfig.ContextLength()),
+			NumBatch:        512,
+			NumGPU:          -1, // -1 here indicates that NumGPU should be set dynamically
+			NumThread:       0,  // let the runtime decide
+			DraftNumPredict: 4,
+			UseMMap:         nil,
 		},
 	}
 }
@ -1297,14 +1316,20 @@ func FormatParams(params map[string][]string) (map[string]any, error) {
 					// TODO: only string slices are supported right now
 					out[key] = vals
 				case reflect.Pointer:
-					var b bool
-					if field.Type() == reflect.TypeOf(&b) {
+					switch field.Type().Elem().Kind() {
+					case reflect.Bool:
 						boolVal, err := strconv.ParseBool(vals[0])
 						if err != nil {
 							return nil, fmt.Errorf("invalid bool value %s", vals)
 						}
 						out[key] = &boolVal
-					} else {
+					case reflect.Int:
+						intVal, err := strconv.ParseInt(vals[0], 10, 64)
+						if err != nil {
+							return nil, fmt.Errorf("invalid int value %s", vals)
+						}
+						out[key] = intVal
+					default:
 						return nil, fmt.Errorf("unknown type %s for %s", field.Kind(), key)
 					}
 				default:
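Review note on the pointer handling in `FromMap` and `FormatParams`: both now switch on `field.Type().Elem().Kind()` rather than comparing against `*bool` directly, which is what lets `*int` fields such as `MainGPU` go through the same path. A minimal sketch of that dispatch follows, assuming a trimmed-down struct. `opts` and `parsePointerField` are hypothetical names for illustration and do not exist in the package.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// opts is a hypothetical stand-in holding only pointer-typed options.
type opts struct {
	UseMMap *bool
	MainGPU *int
}

// parsePointerField parses raw according to the element kind of the named
// pointer field. As in the diff, bools come back as *bool and ints as int64.
func parsePointerField(name, raw string) (any, error) {
	field, ok := reflect.TypeOf(opts{}).FieldByName(name)
	if !ok {
		return nil, fmt.Errorf("unknown field %s", name)
	}
	if field.Type.Kind() != reflect.Pointer {
		return nil, fmt.Errorf("field %s is not a pointer", name)
	}
	switch field.Type.Elem().Kind() {
	case reflect.Bool:
		v, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid bool value %q", raw)
		}
		return &v, nil
	case reflect.Int:
		v, err := strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid int value %q", raw)
		}
		return v, nil
	default:
		return nil, fmt.Errorf("unknown type %s for %s", field.Type.Elem().Kind(), name)
	}
}

func main() {
	gpu, err := parsePointerField("MainGPU", "0")
	fmt.Println(gpu, err)
	_, err = parsePointerField("UseMMap", "maybe")
	fmt.Println(err)
}
```

Switching on the element kind means a future pointer-typed option needs one new `case` instead of another type comparison.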
--- a/api/types_test.go
+++ b/api/types_test.go
@ -20,6 +20,10 @@ func testPropsMap(m map[string]ToolProperty) *ToolPropertiesMap {
 	return props
 }

+func testIntPtr(v int) *int {
+	return &v
+}
+
 // testArgs creates ToolCallFunctionArguments from a map (convenience function for tests, order not preserved)
 func testArgs(m map[string]any) ToolCallFunctionArguments {
 	args := NewToolCallFunctionArguments()
@ -168,6 +172,47 @@ func TestUseMmapParsingFromJSON(t *testing.T) {
 	}
 }

+func TestMainGPUParsingFromJSON(t *testing.T) {
+	tests := []struct {
+		name    string
+		req     string
+		wantGPU *int
+	}{
+		{
+			name: "Undefined",
+			req:  `{}`,
+		},
+		{
+			name:    "Zero",
+			req:     `{ "main_gpu": 0 }`,
+			wantGPU: testIntPtr(0),
+		},
+		{
+			name:    "Nonzero",
+			req:     `{ "main_gpu": 1 }`,
+			wantGPU: testIntPtr(1),
+		},
+	}
+
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			var oMap map[string]any
+			err := json.Unmarshal([]byte(test.req), &oMap)
+			require.NoError(t, err)
+
+			opts := DefaultOptions()
+			err = opts.FromMap(oMap)
+			require.NoError(t, err)
+
+			if test.wantGPU == nil {
+				assert.Nil(t, opts.MainGPU)
+			} else if assert.NotNil(t, opts.MainGPU) {
+				assert.Equal(t, *test.wantGPU, *opts.MainGPU)
+			}
+		})
+	}
+}
+
 func TestUseMmapFormatParams(t *testing.T) {
 	tr := true
 	fa := false
@ -232,6 +277,12 @@ func TestUseMmapFormatParams(t *testing.T) {
 	}
 }

+func TestMainGPUFormatParams(t *testing.T) {
+	resp, err := FormatParams(map[string][]string{"main_gpu": {"0"}})
+	require.NoError(t, err)
+	assert.Equal(t, int64(0), resp["main_gpu"])
+}
+
 func TestMessage_UnmarshalJSON(t *testing.T) {
 	tests := []struct {
 		input    string
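Review note on `MainGPU` changing from `int` to `*int`: with a plain `int`, a request that omits `main_gpu` and one that sets it to `0` decode to the same value, so the server cannot tell "use the default" from "use GPU 0". A minimal sketch of the distinction follows, assuming direct JSON decoding. The real code path goes through `Options.FromMap`, and `runnerOpts` and `parseMainGPU` are hypothetical names used only here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runnerOpts is a hypothetical, trimmed-down stand-in for api.Runner.
type runnerOpts struct {
	MainGPU *int `json:"main_gpu,omitempty"`
}

// parseMainGPU decodes a request body and reports whether main_gpu was
// present, along with its value when it was.
func parseMainGPU(body string) (value int, set bool, err error) {
	var o runnerOpts
	if err := json.Unmarshal([]byte(body), &o); err != nil {
		return 0, false, err
	}
	if o.MainGPU == nil {
		return 0, false, nil // field absent: caller should pick a default
	}
	return *o.MainGPU, true, nil
}

func main() {
	for _, body := range []string{`{}`, `{"main_gpu": 0}`, `{"main_gpu": 1}`} {
		v, set, err := parseMainGPU(body)
		fmt.Println(body, v, set, err)
	}
}
```

The three inputs correspond to the `Undefined`, `Zero`, and `Nonzero` cases in `TestMainGPUParsingFromJSON`.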
--- a/app/ollama.iss
+++ b/app/ollama.iss
@ -90,9 +90,8 @@ DialogFontSize=12
 [Files]
 #if FileExists("..\dist\windows-ollama-app-amd64.exe")
 Source: "..\dist\windows-ollama-app-amd64.exe"; DestDir: "{app}"; DestName: "{#MyAppExeName}" ;Check: not IsArm64();  Flags: ignoreversion 64bit; BeforeInstall: TaskKill('{#MyAppExeName}')
-Source: "..\dist\windows-amd64\vc_redist.x64.exe"; DestDir: "{tmp}"; Check: not IsArm64() and vc_redist_needed(); Flags: deleteafterinstall
 Source: "..\dist\windows-amd64\ollama.exe"; DestDir: "{app}"; Check: not IsArm64(); Flags: ignoreversion 64bit; BeforeInstall: TaskKill('ollama.exe')
-Source: "..\dist\windows-amd64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
+Source: "..\dist\windows-amd64\lib\ollama\*"; Excludes: "\mlx_*\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
 #endif

 ; For local development, rely on binary compatibility at runtime since we can't cross compile
@ -103,9 +102,11 @@ Source: "..\dist\windows-ollama-app-amd64.exe"; DestDir: "{app}"; DestName: "{#M
 #endif

 #if FileExists("..\dist\windows-arm64\ollama.exe")
-Source: "..\dist\windows-arm64\vc_redist.arm64.exe"; DestDir: "{tmp}"; Check: IsArm64() and vc_redist_needed(); Flags: deleteafterinstall
 Source: "..\dist\windows-arm64\ollama.exe"; DestDir: "{app}"; Check: IsArm64(); Flags: ignoreversion 64bit; BeforeInstall: TaskKill('ollama.exe')
 #endif
+#if DirExists("..\dist\windows-arm64\lib\ollama")
+Source: "..\dist\windows-arm64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: IsArm64(); Flags: ignoreversion 64bit recursesubdirs
+#endif

 Source: ".\assets\app.ico"; DestDir: "{app}"; Flags: ignoreversion

@ -118,12 +119,6 @@ Name: "{userprograms}\{#MyAppName}"; Filename: "{app}\{#MyAppExeName}"; IconFile
 Type: files; Name: "{%LOCALAPPDATA}\Ollama\updates"

 [Run]
-#if DirExists("..\dist\windows-arm64")
-Filename: "{tmp}\vc_redist.arm64.exe"; Parameters: "/install /passive /norestart"; Check: IsArm64() and vc_redist_needed(); StatusMsg: "Installing VC++ Redistributables..."; Flags: waituntilterminated
-#endif
-#if DirExists("..\dist\windows-amd64")
-Filename: "{tmp}\vc_redist.x64.exe"; Parameters: "/install /passive /norestart"; Check: not IsArm64() and vc_redist_needed(); StatusMsg: "Installing VC++ Redistributables..."; Flags: waituntilterminated
-#endif
 Filename: "{cmd}"; Parameters: "/C set PATH={app};%PATH% & ""{app}\{#MyAppExeName}"""; Flags: postinstall nowait runhidden

 [UninstallRun]
@@ -184,46 +179,6 @@ begin
  Result := Pos(';' + ExpandConstant(Param) + ';', ';' + OrigPath + ';') = 0;
 end;

-{ --- VC Runtime libraries discovery code - Only install vc_redist if it isn't already installed ----- }
-const VCRTL_MIN_V1 = 14;
-const VCRTL_MIN_V2 = 40;
-const VCRTL_MIN_V3 = 33807;
-const VCRTL_MIN_V4 = 0;
-
- // check if the minimum required vc redist is installed (by looking the registry)
-function vc_redist_needed (): Boolean;
-var
-  sRegKey: string;
-  v1: Cardinal;
-  v2: Cardinal;
-  v3: Cardinal;
-  v4: Cardinal;
-begin
-  if (IsArm64()) then begin
-    sRegKey := 'SOFTWARE\WOW6432Node\Microsoft\VisualStudio\14.0\VC\Runtimes\arm64';
-  end else begin
-    sRegKey := 'SOFTWARE\Microsoft\VisualStudio\14.0\VC\Runtimes\x64';
-  end;
-  if (RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Major', v1)  and
-      RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Minor', v2) and
-      RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Bld', v3) and
-      RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'RBld', v4)) then
-  begin
-    Log ('VC Redist version: ' + IntToStr (v1) +
-        '.' + IntToStr (v2) + '.' + IntToStr (v3) +
-        '.' + IntToStr (v4));
-    { Version info was found. Return true if later or equal to our
-       minimal required version RTL_MIN_Vx }
-    Result := not (
-        (v1 > VCRTL_MIN_V1) or ((v1 = VCRTL_MIN_V1) and
-         ((v2 > VCRTL_MIN_V2) or ((v2 = VCRTL_MIN_V2) and
-          ((v3 > VCRTL_MIN_V3) or ((v3 = VCRTL_MIN_V3) and
-           (v4 >= VCRTL_MIN_V4)))))));
-  end
-  else
-    Result := TRUE;
-end;
-
 function GetDirSize(Path: String): Int64;
 var
  FindRec: TFindRec;
--- a/cmake/local.cmake
+++ b/cmake/local.cmake
@@ -0,0 +1,691 @@
+# Local Ollama superbuild targets.
+#
+# This file keeps the repository-root CMake project focused on orchestration:
+# it builds a runnable local Ollama payload by delegating llama.cpp work to the
+# llama/server CMake project and building the Go binary into a matching layout.
+
+include(ExternalProject)
+
+set(OLLAMA_LLAMA_BACKENDS "" CACHE STRING
+    "Semicolon-separated llama-server GPU backends to build: cuda_v12;cuda_v13;rocm_v7_1;rocm_v7_2;vulkan;cuda_jetpack5;cuda_jetpack6")
+set(_ollama_mlx_backends_doc "Semicolon-separated MLX backends to build: cuda_v13;metal_v3;metal_v4")
+set(OLLAMA_VERSION "0.0.0" CACHE STRING "Ollama version embedded in the local Go binary")
+set(OLLAMA_PAYLOAD_INSTALL_PREFIX "${CMAKE_BINARY_DIR}" CACHE PATH
+    "Build-time staging prefix for nested Ollama native payloads")
+
+string(REGEX REPLACE "^v" "" OLLAMA_VERSION "${OLLAMA_VERSION}")
+
+set(OLLAMA_NATIVE_CONFIG_ARG)
+if(CMAKE_CONFIGURATION_TYPES)
+    set(OLLAMA_NATIVE_CONFIG_ARG --config Release)
+endif()
+
+set(OLLAMA_NATIVE_EXTERNAL_OPTIONS)
+if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.28)
+    list(APPEND OLLAMA_NATIVE_EXTERNAL_OPTIONS BUILD_JOB_SERVER_AWARE TRUE)
+endif()
+
+function(ollama_check_metal_toolchain output_version)
+    find_program(_ollama_xcrun xcrun)
+    if(NOT _ollama_xcrun)
+        message(FATAL_ERROR
+            "MLX Metal requires Xcode command line tools. Install Xcode, run "
+            "`sudo xcode-select -s /Applications/Xcode.app/Contents/Developer`, "
+            "then install the Metal toolchain with "
+            "`xcodebuild -downloadComponent MetalToolchain`.")
+    endif()
+
+    execute_process(
+        COMMAND zsh "-c"
+            "echo \"__METAL_VERSION__\" | \"${_ollama_xcrun}\" -sdk macosx metal -E -x metal -P - 2>/dev/null | tail -1 | tr -d '\n'"
+        OUTPUT_VARIABLE _metal_version
+        RESULT_VARIABLE _metal_result)
+    if(NOT _metal_result EQUAL 0 OR NOT _metal_version MATCHES "^[0-9]+$")
+        message(FATAL_ERROR
+            "MLX Metal requires Xcode's Metal toolchain. Install Xcode, run "
+            "`sudo xcode-select -s /Applications/Xcode.app/Contents/Developer`, "
+            "then install the Metal toolchain with "
+            "`xcodebuild -downloadComponent MetalToolchain`.")
+    endif()
+
+    set(${output_version} "${_metal_version}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_macos_major_version output)
+    execute_process(
+        COMMAND sw_vers -productVersion
+        OUTPUT_VARIABLE _macos_version
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        RESULT_VARIABLE _macos_result
+        ERROR_QUIET)
+    if(_macos_result EQUAL 0)
+        string(REGEX MATCH "^[0-9]+" _macos_major "${_macos_version}")
+    endif()
+    set(${output} "${_macos_major}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_macos_sdk_major_version output)
+    execute_process(
+        COMMAND xcrun --sdk macosx --show-sdk-version
+        OUTPUT_VARIABLE _sdk_version
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        RESULT_VARIABLE _sdk_result
+        ERROR_QUIET)
+    if(_sdk_result EQUAL 0)
+        string(REGEX MATCH "^[0-9]+" _sdk_major "${_sdk_version}")
+    endif()
+    set(${output} "${_sdk_major}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_default_mlx_backends output)
+    set(_backends "")
+    if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
+        ollama_check_metal_toolchain(_metal_version)
+        ollama_macos_major_version(_macos_major)
+        ollama_macos_sdk_major_version(_sdk_major)
+        if(_macos_major AND _sdk_major AND _macos_major GREATER_EQUAL 26 AND _sdk_major GREATER_EQUAL 26)
+            set(_backends "metal_v4")
+        else()
+            set(_backends "metal_v3")
+        endif()
+        message(STATUS "Defaulting OLLAMA_MLX_BACKENDS=${_backends} for macOS arm64")
+    endif()
+    set(${output} "${_backends}" PARENT_SCOPE)
+endfunction()
+
+if(NOT DEFINED OLLAMA_MLX_BACKENDS)
+    ollama_default_mlx_backends(_ollama_default_mlx_backends)
+    set(OLLAMA_MLX_BACKENDS "${_ollama_default_mlx_backends}" CACHE STRING "${_ollama_mlx_backends_doc}")
+else()
+    set(OLLAMA_MLX_BACKENDS "${OLLAMA_MLX_BACKENDS}" CACHE STRING "${_ollama_mlx_backends_doc}")
+endif()
+
+if(NOT OLLAMA_HAVE_LLAMA_SERVER)
+    if(OLLAMA_LLAMA_BACKENDS)
+        message(FATAL_ERROR "llama/server is required when OLLAMA_LLAMA_BACKENDS is set")
+    endif()
+    if(NOT OLLAMA_MLX_BACKENDS)
+        message(FATAL_ERROR "llama/server is required for local Ollama builds")
+    endif()
+else()
+    file(READ "${CMAKE_SOURCE_DIR}/LLAMA_CPP_VERSION" OLLAMA_LLAMA_CPP_GIT_TAG)
+    string(STRIP "${OLLAMA_LLAMA_CPP_GIT_TAG}" OLLAMA_LLAMA_CPP_GIT_TAG)
+    include(${CMAKE_SOURCE_DIR}/llama/compat/compat.cmake)
+    if(DEFINED FETCHCONTENT_SOURCE_DIR_LLAMA_CPP AND NOT "${FETCHCONTENT_SOURCE_DIR_LLAMA_CPP}" STREQUAL "")
+        get_filename_component(OLLAMA_LLAMA_CPP_SOURCE_DIR
+            "${FETCHCONTENT_SOURCE_DIR_LLAMA_CPP}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
+        message(STATUS "Using llama.cpp source override: ${OLLAMA_LLAMA_CPP_SOURCE_DIR}")
+        add_custom_target(ollama-llama-cpp-source)
+    elseif(DEFINED ENV{OLLAMA_LLAMA_CPP_SOURCE})
+        get_filename_component(OLLAMA_LLAMA_CPP_SOURCE_DIR
+            "$ENV{OLLAMA_LLAMA_CPP_SOURCE}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
+        message(STATUS "Using local llama.cpp source: ${OLLAMA_LLAMA_CPP_SOURCE_DIR}")
+        add_custom_target(ollama-llama-cpp-source)
+    else()
+        set(OLLAMA_LLAMA_CPP_SOURCE_DIR "${CMAKE_BINARY_DIR}/_deps/llama_cpp-src")
+        ExternalProject_Add(ollama-llama-cpp-source
+            GIT_REPOSITORY "https://github.com/ggml-org/llama.cpp.git"
+            GIT_TAG ${OLLAMA_LLAMA_CPP_GIT_TAG}
+            GIT_SHALLOW TRUE
+            SOURCE_DIR ${OLLAMA_LLAMA_CPP_SOURCE_DIR}
+            CONFIGURE_COMMAND ""
+            BUILD_COMMAND ""
+            INSTALL_COMMAND ""
+            PATCH_COMMAND ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND}
+            USES_TERMINAL_DOWNLOAD TRUE
+            USES_TERMINAL_PATCH TRUE)
+    endif()
+endif()
+
+set(_mlx_source_targets)
+if(OLLAMA_MLX_BACKENDS)
+    file(READ "${CMAKE_SOURCE_DIR}/MLX_VERSION" OLLAMA_MLX_GIT_TAG)
+    string(STRIP "${OLLAMA_MLX_GIT_TAG}" OLLAMA_MLX_GIT_TAG)
+    file(READ "${CMAKE_SOURCE_DIR}/MLX_C_VERSION" OLLAMA_MLX_C_GIT_TAG)
+    string(STRIP "${OLLAMA_MLX_C_GIT_TAG}" OLLAMA_MLX_C_GIT_TAG)
+
+    if(DEFINED FETCHCONTENT_SOURCE_DIR_MLX AND NOT "${FETCHCONTENT_SOURCE_DIR_MLX}" STREQUAL "")
+        get_filename_component(OLLAMA_MLX_SOURCE_DIR
+            "${FETCHCONTENT_SOURCE_DIR_MLX}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
+        message(STATUS "Using MLX source override: ${OLLAMA_MLX_SOURCE_DIR}")
+    elseif(DEFINED ENV{OLLAMA_MLX_SOURCE})
+        get_filename_component(OLLAMA_MLX_SOURCE_DIR
+            "$ENV{OLLAMA_MLX_SOURCE}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
+        message(STATUS "Using local MLX source: ${OLLAMA_MLX_SOURCE_DIR}")
+    else()
+        set(OLLAMA_MLX_SOURCE_DIR "${CMAKE_BINARY_DIR}/_deps/mlx-src")
+        ExternalProject_Add(ollama-mlx-source
+            GIT_REPOSITORY "https://github.com/ml-explore/mlx.git"
+            GIT_TAG ${OLLAMA_MLX_GIT_TAG}
+            # MLX uses commit hashes while we track closely; switch to shallow when MLX pins move to tags.
+            GIT_SHALLOW FALSE
+            SOURCE_DIR ${OLLAMA_MLX_SOURCE_DIR}
+            CONFIGURE_COMMAND ""
+            BUILD_COMMAND ""
+            INSTALL_COMMAND ""
+            USES_TERMINAL_DOWNLOAD TRUE)
+        list(APPEND _mlx_source_targets ollama-mlx-source)
+    endif()
+
+    if(DEFINED "FETCHCONTENT_SOURCE_DIR_MLX-C" AND NOT "${FETCHCONTENT_SOURCE_DIR_MLX-C}" STREQUAL "")
+        get_filename_component(OLLAMA_MLX_C_SOURCE_DIR
+            "${FETCHCONTENT_SOURCE_DIR_MLX-C}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
+        message(STATUS "Using MLX-C source override: ${OLLAMA_MLX_C_SOURCE_DIR}")
+    elseif(DEFINED ENV{OLLAMA_MLX_C_SOURCE})
+        get_filename_component(OLLAMA_MLX_C_SOURCE_DIR
+            "$ENV{OLLAMA_MLX_C_SOURCE}" ABSOLUTE BASE_DIR "${CMAKE_SOURCE_DIR}")
+        message(STATUS "Using local MLX-C source: ${OLLAMA_MLX_C_SOURCE_DIR}")
+    else()
+        set(OLLAMA_MLX_C_SOURCE_DIR "${CMAKE_BINARY_DIR}/_deps/mlx-c-src")
+        ExternalProject_Add(ollama-mlx-c-source
+            GIT_REPOSITORY "https://github.com/ml-explore/mlx-c.git"
+            GIT_TAG ${OLLAMA_MLX_C_GIT_TAG}
+            # MLX-C uses commit hashes while we track closely; switch to shallow when MLX-C pins move to tags.
+            GIT_SHALLOW FALSE
+            SOURCE_DIR ${OLLAMA_MLX_C_SOURCE_DIR}
+            CONFIGURE_COMMAND ""
+            BUILD_COMMAND ""
+            INSTALL_COMMAND ""
+            USES_TERMINAL_DOWNLOAD TRUE)
+        list(APPEND _mlx_source_targets ollama-mlx-c-source)
+    endif()
+    add_custom_target(ollama-mlx-sources DEPENDS ${_mlx_source_targets})
+endif()
+
+set(OLLAMA_NATIVE_BUILD_TOOL_COMMAND
+    ${CMAKE_COMMAND} --build <BINARY_DIR>)
+set(OLLAMA_NATIVE_BUILD_TARGET_ARG --target)
+if(CMAKE_GENERATOR MATCHES "Makefiles")
+    set(OLLAMA_NATIVE_BUILD_TOOL_COMMAND
+        "$(MAKE)" -C <BINARY_DIR>)
+    set(OLLAMA_NATIVE_BUILD_TARGET_ARG)
+endif()
+
+function(ollama_escape_cmake_list input output)
+    string(REPLACE ";" "|" _escaped "${input}")
+    set(${output} "${_escaped}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_collect_cache_args_with_prefix prefix output)
+    get_cmake_property(_cache_variables CACHE_VARIABLES)
+    list(SORT _cache_variables)
+
+    set(_args)
+    foreach(_var IN LISTS _cache_variables)
+        if(_var MATCHES "^${prefix}")
+            ollama_escape_cmake_list("${${_var}}" _value)
+            list(APPEND _args "-D${_var}=${_value}")
+        endif()
+    endforeach()
+
+    set(${output} "${_args}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_append_cache_arg_if_set output name)
+    if(DEFINED ${name} AND NOT "${${name}}" STREQUAL "")
+        ollama_escape_cmake_list("${${name}}" _value)
+        set(${output} ${${output}} "-D${name}=${_value}" PARENT_SCOPE)
+    endif()
+endfunction()
+
+function(ollama_cache_arg_is_set name output)
+    if(DEFINED ${name} AND NOT "${${name}}" STREQUAL "")
+        set(${output} TRUE PARENT_SCOPE)
+    else()
+        set(${output} FALSE PARENT_SCOPE)
+    endif()
+endfunction()
+
+function(ollama_llama_cuda_preset backend output)
+    ollama_cache_arg_is_set(CMAKE_CUDA_ARCHITECTURES _has_cuda_arch)
+    if(_has_cuda_arch)
+        set(_preset "llama_${backend}_user_arch")
+    elseif(WIN32)
+        set(_preset "llama_${backend}_windows")
+    else()
+        set(_preset "llama_${backend}_linux")
+    endif()
+    set(${output} "${_preset}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_mlx_cuda_preset output)
+    ollama_cache_arg_is_set(MLX_CUDA_ARCHITECTURES _has_mlx_arch)
+    ollama_cache_arg_is_set(CMAKE_CUDA_ARCHITECTURES _has_cuda_arch)
+    if(_has_mlx_arch OR _has_cuda_arch)
+        set(_preset "mlx_cuda_v13_user_arch")
+    elseif(WIN32)
+        set(_preset "mlx_cuda_v13_windows")
+    else()
+        set(_preset "mlx_cuda_v13_linux")
+    endif()
+    set(${output} "${_preset}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_rocm_preset backend output)
+    ollama_cache_arg_is_set(AMDGPU_TARGETS _has_amdgpu_targets)
+    ollama_cache_arg_is_set(CMAKE_HIP_ARCHITECTURES _has_hip_arch)
+    if(_has_amdgpu_targets OR _has_hip_arch)
+        if(backend STREQUAL "rocm_v7_1" AND NOT WIN32)
+            message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_1 is only supported for Windows ROCm builds")
+        elseif(backend STREQUAL "rocm_v7_2" AND WIN32)
+            message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_2 is only supported for Linux ROCm builds")
+        endif()
+    elseif(backend STREQUAL "rocm_v7_1")
+        if(NOT WIN32)
+            message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_1 is only supported for Windows ROCm builds")
+        endif()
+        set(_preset "${backend}_windows")
+    elseif(backend STREQUAL "rocm_v7_2")
+        if(WIN32)
+            message(FATAL_ERROR "OLLAMA_LLAMA_BACKENDS=rocm_v7_2 is only supported for Linux ROCm builds")
+        endif()
+        set(_preset "${backend}_linux")
+    else()
+        message(FATAL_ERROR "Unknown ROCm backend '${backend}'")
+    endif()
+    if(_has_amdgpu_targets OR _has_hip_arch)
+        set(_preset "${backend}_user_arch")
+    endif()
+    set(${output} "${_preset}" PARENT_SCOPE)
+endfunction()
+
+function(ollama_add_llama_server_build name)
+    cmake_parse_arguments(ARG "" "PRESET;RUNNER_DIR" "TARGETS;CMAKE_ARGS" ${ARGN})
+    if(NOT ARG_TARGETS)
+        message(FATAL_ERROR "ollama_add_llama_server_build(${name}) requires TARGETS")
+    endif()
+
+    if(WIN32 AND name STREQUAL "vulkan")
+        # The Vulkan shader generator nests deeply enough to hit Windows MAX_PATH.
+        set(_build_dir ${CMAKE_BINARY_DIR}/ls-vk)
+    else()
+        set(_build_dir ${CMAKE_BINARY_DIR}/llama-server-${name})
+    endif()
+    ollama_collect_cache_args_with_prefix("GGML_" _ggml_cache_args)
+    ollama_collect_cache_args_with_prefix("LLAMA_" _llama_cache_args)
+    set(_cmake_args
+        -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
+        -DCMAKE_INSTALL_PREFIX=${OLLAMA_PAYLOAD_INSTALL_PREFIX}
+        -DOLLAMA_LIB_DIR:STRING=${OLLAMA_LIB_DIR}
+        -DOLLAMA_RUNNER_DIR=${ARG_RUNNER_DIR}
+        -DFETCHCONTENT_SOURCE_DIR_LLAMA_CPP=${OLLAMA_LLAMA_CPP_SOURCE_DIR}
+        -DOLLAMA_LLAMA_CPP_SKIP_COMPAT_PATCH=ON
+        -DGGML_NATIVE=OFF
+        -DGGML_OPENMP=OFF
+        ${ARG_CMAKE_ARGS}
+        ${_ggml_cache_args}
+        ${_llama_cache_args}
+    )
+
+    if(APPLE)
+        if(CMAKE_OSX_ARCHITECTURES)
+            list(APPEND _cmake_args
+                -DCMAKE_OSX_ARCHITECTURES=${CMAKE_OSX_ARCHITECTURES})
+        endif()
+        if(CMAKE_OSX_DEPLOYMENT_TARGET)
+            list(APPEND _cmake_args
+                -DCMAKE_OSX_DEPLOYMENT_TARGET=${CMAKE_OSX_DEPLOYMENT_TARGET})
+        endif()
+    endif()
+    set(_configure_command ${CMAKE_COMMAND}
+        -S ${CMAKE_SOURCE_DIR}/llama/server
+        -B <BINARY_DIR>
+        ${_cmake_args})
+    if(ARG_PRESET)
+        set(_configure_command ${CMAKE_COMMAND}
+            -S ${CMAKE_SOURCE_DIR}/llama/server
+            --preset ${ARG_PRESET}
+            -B <BINARY_DIR>
+            ${_cmake_args})
+    endif()
+    ExternalProject_Add(ollama-llama-server-${name}
+        SOURCE_DIR ${CMAKE_SOURCE_DIR}/llama/server
+        BINARY_DIR ${_build_dir}
+        CONFIGURE_COMMAND ${_configure_command}
+        BUILD_COMMAND ${OLLAMA_NATIVE_BUILD_TOOL_COMMAND}
+            ${OLLAMA_NATIVE_CONFIG_ARG}
+            ${OLLAMA_NATIVE_BUILD_TARGET_ARG} ${ARG_TARGETS}
+        INSTALL_COMMAND ${CMAKE_COMMAND} --install <BINARY_DIR>
+            ${OLLAMA_NATIVE_CONFIG_ARG}
+            --component llama-server
+        DEPENDS ollama-llama-cpp-source
+        LIST_SEPARATOR |
+        # ExternalProject cannot reliably infer when nested FetchContent
+        # sources, compat patches, or forwarded GGML/LLAMA cache settings need
+        # a rebuild. Always entering the sub-build keeps direct `cmake --build`
+        # iteration correct; the nested generator still performs incremental
+        # compilation.
+        BUILD_ALWAYS TRUE
+        ${OLLAMA_NATIVE_EXTERNAL_OPTIONS}
+        USES_TERMINAL_CONFIGURE TRUE
+        USES_TERMINAL_BUILD TRUE
+        USES_TERMINAL_INSTALL TRUE)
+endfunction()
+
+function(ollama_add_mlx_build name)
+    cmake_parse_arguments(ARG "" "PRESET;RUNNER_DIR" "CMAKE_ARGS" ${ARGN})
+    if(NOT ARG_RUNNER_DIR)
+        message(FATAL_ERROR "ollama_add_mlx_build(${name}) requires RUNNER_DIR")
+    endif()
+
+    set(_build_dir ${CMAKE_BINARY_DIR}/mlx-${name})
+    ollama_collect_cache_args_with_prefix("MLX_" _mlx_cache_args)
+    set(_cmake_args
+        -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
+        -DCMAKE_INSTALL_PREFIX=${OLLAMA_PAYLOAD_INSTALL_PREFIX}
+        -DOLLAMA_LIB_DIR:STRING=${OLLAMA_LIB_DIR}
+        -DOLLAMA_RUNNER_DIR=${ARG_RUNNER_DIR}
+        -DOLLAMA_SOURCE_DIR=${CMAKE_SOURCE_DIR}
+        -DFETCHCONTENT_SOURCE_DIR_MLX=${OLLAMA_MLX_SOURCE_DIR}
+        -DFETCHCONTENT_SOURCE_DIR_MLX-C=${OLLAMA_MLX_C_SOURCE_DIR}
+        -DOLLAMA_MLX_GENERATE_WRAPPERS=OFF
+        ${ARG_CMAKE_ARGS}
+        ${_mlx_cache_args}
+    )
+    foreach(_arg IN ITEMS
+            BLAS_INCLUDE_DIRS
+            LAPACK_INCLUDE_DIRS
+            CUDAToolkit_ROOT
+            CUDNN_ROOT_DIR
+            CUDNN_INCLUDE_PATH
+            CUDNN_LIBRARY_PATH
+            CMAKE_CUDA_COMPILER
+            CMAKE_CUDA_HOST_COMPILER
+            CMAKE_INCLUDE_PATH
+            CMAKE_LIBRARY_PATH
+            CMAKE_PREFIX_PATH)
+        ollama_append_cache_arg_if_set(_cmake_args ${_arg})
+    endforeach()
+
+    if(APPLE)
+        if(CMAKE_OSX_ARCHITECTURES)
+            list(APPEND _cmake_args
+                -DCMAKE_OSX_ARCHITECTURES=${CMAKE_OSX_ARCHITECTURES})
+        endif()
+    endif()
+    set(_configure_command ${CMAKE_COMMAND}
+        -S ${CMAKE_SOURCE_DIR}/cmake/mlx
+        -B <BINARY_DIR>
+        ${_cmake_args})
+    if(ARG_PRESET)
+        set(_configure_command ${CMAKE_COMMAND}
+            -S ${CMAKE_SOURCE_DIR}/cmake/mlx
+            --preset ${ARG_PRESET}
+            -B <BINARY_DIR>
+            ${_cmake_args})
+    endif()
+
+    ExternalProject_Add(ollama-mlx-${name}
+        SOURCE_DIR ${CMAKE_SOURCE_DIR}/cmake/mlx
+        BINARY_DIR ${_build_dir}
+        CONFIGURE_COMMAND ${_configure_command}
+        BUILD_COMMAND ${OLLAMA_NATIVE_BUILD_TOOL_COMMAND}
+            ${OLLAMA_NATIVE_CONFIG_ARG}
+            ${OLLAMA_NATIVE_BUILD_TARGET_ARG} mlx
+            ${OLLAMA_NATIVE_BUILD_TARGET_ARG} mlxc
+        INSTALL_COMMAND ${CMAKE_COMMAND} --install <BINARY_DIR>
+            ${OLLAMA_NATIVE_CONFIG_ARG}
+            --component MLX
+            COMMAND ${CMAKE_COMMAND} --install <BINARY_DIR>
+            ${OLLAMA_NATIVE_CONFIG_ARG}
+            --component MLX_VENDOR
+        DEPENDS ollama-mlx-sources
+        LIST_SEPARATOR |
+        BUILD_ALWAYS TRUE
+        ${OLLAMA_NATIVE_EXTERNAL_OPTIONS}
+        USES_TERMINAL_CONFIGURE TRUE
+        USES_TERMINAL_BUILD TRUE
+        USES_TERMINAL_INSTALL TRUE)
+endfunction()
+
+find_program(GO_EXECUTABLE go)
+
+if(OLLAMA_MLX_BACKENDS)
+    set(_mlx_c_headers_dir "${OLLAMA_MLX_C_SOURCE_DIR}/mlx/c")
+    set(_mlx_c_headers_dest "${CMAKE_SOURCE_DIR}/x/mlxrunner/mlx/include/mlx/c")
+
+    if(GO_EXECUTABLE AND (NOT APPLE OR CMAKE_SYSTEM_PROCESSOR STREQUAL CMAKE_HOST_SYSTEM_PROCESSOR))
+        add_custom_target(ollama-mlx-generate-wrappers
+            COMMAND ${CMAKE_COMMAND}
+                -DMLX_C_HEADERS_DIR=${_mlx_c_headers_dir}
+                -DMLX_C_HEADERS_DEST=${_mlx_c_headers_dest}
+                -P "${CMAKE_SOURCE_DIR}/cmake/vendor-mlx-c-headers.cmake"
+            COMMAND ${CMAKE_COMMAND} -E env
+                CC= CGO_CFLAGS= CGO_CXXFLAGS=
+                ${GO_EXECUTABLE} generate ./x/...
+            WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
+            DEPENDS ollama-mlx-sources
+            COMMENT "Regenerating MLX Go wrappers"
+            VERBATIM)
+    else()
+        add_custom_target(ollama-mlx-generate-wrappers
+            COMMAND ${CMAKE_COMMAND} -E echo
+                "Cannot regenerate MLX wrappers while Go is unavailable or while cross-compiling"
+            COMMAND ${CMAKE_COMMAND} -E false
+            DEPENDS ollama-mlx-sources
+            VERBATIM)
+    endif()
+endif()
+
+if(OLLAMA_HAVE_LLAMA_SERVER)
+    if(NOT OLLAMA_GO_OUTPUT)
+        if(WIN32)
+            set(OLLAMA_GO_OUTPUT ${CMAKE_SOURCE_DIR}/ollama.exe)
+        else()
+            set(OLLAMA_GO_OUTPUT ${CMAKE_SOURCE_DIR}/ollama)
+        endif()
+    endif()
+    if(NOT IS_ABSOLUTE "${OLLAMA_GO_OUTPUT}")
+        set(OLLAMA_GO_OUTPUT "${CMAKE_SOURCE_DIR}/${OLLAMA_GO_OUTPUT}")
+    endif()
+    get_filename_component(OLLAMA_GO_OUTPUT "${OLLAMA_GO_OUTPUT}" ABSOLUTE)
+    set(OLLAMA_GO_OUTPUT "${OLLAMA_GO_OUTPUT}" CACHE FILEPATH "Output path for the local Ollama Go binary")
+    get_filename_component(OLLAMA_GO_OUTPUT_DIR "${OLLAMA_GO_OUTPUT}" DIRECTORY)
+
+    set(OLLAMA_GO_LDFLAGS
+        "-s -w -X=github.com/ollama/ollama/version.Version=${OLLAMA_VERSION} -X=github.com/ollama/ollama/server.mode=release")
+    if(GO_EXECUTABLE)
+        add_custom_target(ollama-go ALL
+            COMMAND ${CMAKE_COMMAND} -E make_directory "${OLLAMA_GO_OUTPUT_DIR}"
+            COMMAND ${CMAKE_COMMAND} -E env CGO_ENABLED=1
+                ${GO_EXECUTABLE} build -trimpath -ldflags "${OLLAMA_GO_LDFLAGS}" -o "${OLLAMA_GO_OUTPUT}" .
+            WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
+            BYPRODUCTS ${OLLAMA_GO_OUTPUT}
+            COMMENT "Building Ollama Go binary"
+            VERBATIM)
+    else()
+        add_custom_target(ollama-go ALL
+            COMMAND ${CMAKE_COMMAND} -E echo
+                "Go executable not found. Install Go or set GO_EXECUTABLE to build the local Ollama binary."
+            COMMAND ${CMAKE_COMMAND} -E false
+            COMMENT "Building Ollama Go binary"
+            VERBATIM)
+    endif()
+
+    set(_cpu_args)
+    if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
+        list(APPEND _cpu_args
+            -DBUILD_SHARED_LIBS=OFF
+            -DGGML_BACKEND_DL=OFF
+            -DGGML_METAL=ON
+            -DGGML_METAL_EMBED_LIBRARY=ON)
+    else()
+        list(APPEND _cpu_args
+            -DBUILD_SHARED_LIBS=ON
+            -DGGML_BACKEND_DL=ON
+            -DGGML_CPU_ALL_VARIANTS=ON)
+        if(WIN32)
+            list(APPEND _cpu_args -DGGML_OPENMP=ON)
+        endif()
+        if(APPLE)
+            list(APPEND _cpu_args -DGGML_METAL=OFF)
+        endif()
+    endif()
+
+    ollama_add_llama_server_build(local
+        RUNNER_DIR ""
+        TARGETS llama-server llama-quantize
+        CMAKE_ARGS ${_cpu_args})
+
+    add_custom_target(ollama-local ALL
+        DEPENDS ollama-go ollama-llama-server-local
+        COMMENT "Building local Ollama payload")
+
+    install(PROGRAMS "${OLLAMA_GO_OUTPUT}"
+        DESTINATION "${CMAKE_INSTALL_BINDIR}"
+        COMPONENT ollama-local)
+endif()
+
+set(_backend_targets)
+if(OLLAMA_HAVE_LLAMA_SERVER)
+    foreach(_backend IN LISTS OLLAMA_LLAMA_BACKENDS)
+        if(_backend STREQUAL "cuda_v12")
+            ollama_llama_cuda_preset(${_backend} _cuda_preset)
+            set(_cuda_args)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
+            ollama_add_llama_server_build(${_backend}
+                PRESET ${_cuda_preset}
+                RUNNER_DIR ${_backend}
+                TARGETS ggml-cuda
+                CMAKE_ARGS ${_cuda_args})
+            list(APPEND _backend_targets ollama-llama-server-${_backend})
+        elseif(_backend STREQUAL "cuda_v13")
+            ollama_llama_cuda_preset(${_backend} _cuda_preset)
+            set(_cuda_args)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
+            ollama_add_llama_server_build(${_backend}
+                PRESET ${_cuda_preset}
+                RUNNER_DIR ${_backend}
+                TARGETS ggml-cuda
+                CMAKE_ARGS ${_cuda_args})
+            list(APPEND _backend_targets ollama-llama-server-${_backend})
+        elseif(_backend STREQUAL "rocm_v7_1" OR _backend STREQUAL "rocm_v7_2")
+            # ROCm 7.1 and 7.2 currently share build settings. Keep the backend
+            # names versioned so future packaging can install side-by-side ROCm
+            # payloads without changing the superbuild interface.
+            ollama_rocm_preset(${_backend} _rocm_preset)
+            set(_rocm_args
+                -DBUILD_SHARED_LIBS=ON
+                -DGGML_BACKEND_DL=ON
+                -DGGML_HIP=ON
+                -DCMAKE_HIP_PLATFORM=amd
+                -DOLLAMA_GPU_BACKEND=hip)
+            ollama_append_cache_arg_if_set(_rocm_args AMDGPU_TARGETS)
+            ollama_append_cache_arg_if_set(_rocm_args CMAKE_HIP_ARCHITECTURES)
+            ollama_append_cache_arg_if_set(_rocm_args CMAKE_HIP_FLAGS)
+            ollama_append_cache_arg_if_set(_rocm_args CMAKE_PREFIX_PATH)
+            ollama_add_llama_server_build(${_backend}
+                PRESET ${_rocm_preset}
+                RUNNER_DIR ${_backend}
+                TARGETS ggml-hip
+                CMAKE_ARGS ${_rocm_args})
+            list(APPEND _backend_targets ollama-llama-server-${_backend})
+        elseif(_backend STREQUAL "vulkan")
+            ollama_add_llama_server_build(vulkan
+                RUNNER_DIR vulkan
+                TARGETS ggml-vulkan
+                CMAKE_ARGS
+                    -DBUILD_SHARED_LIBS=ON
+                    -DGGML_BACKEND_DL=ON
+                    -DGGML_VULKAN=ON
+                    -DOLLAMA_GPU_BACKEND=vulkan)
+            list(APPEND _backend_targets ollama-llama-server-vulkan)
+        elseif(_backend STREQUAL "cuda_jetpack5")
+            if(CMAKE_CUDA_ARCHITECTURES)
+                set(_cuda_preset llama_cuda_jetpack5_user_arch)
+            else()
+                set(_cuda_preset llama_cuda_jetpack5)
+            endif()
+            set(_cuda_args)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
+            ollama_add_llama_server_build(${_backend}
+                PRESET ${_cuda_preset}
+                RUNNER_DIR ${_backend}
+                TARGETS ggml-cuda
+                CMAKE_ARGS ${_cuda_args})
+            list(APPEND _backend_targets ollama-llama-server-${_backend})
+        elseif(_backend STREQUAL "cuda_jetpack6")
+            if(CMAKE_CUDA_ARCHITECTURES)
+                set(_cuda_preset llama_cuda_jetpack6_user_arch)
+            else()
+                set(_cuda_preset llama_cuda_jetpack6)
+            endif()
+            set(_cuda_args)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_ARCHITECTURES)
+            ollama_append_cache_arg_if_set(_cuda_args CMAKE_CUDA_FLAGS)
+            ollama_add_llama_server_build(${_backend}
+                PRESET ${_cuda_preset}
+                RUNNER_DIR ${_backend}
+                TARGETS ggml-cuda
+                CMAKE_ARGS ${_cuda_args})
+            list(APPEND _backend_targets ollama-llama-server-${_backend})
+        else()
+            message(FATAL_ERROR
+                "Unknown OLLAMA_LLAMA_BACKENDS entry '${_backend}'")
+        endif()
+    endforeach()
+endif()
+
+if(_backend_targets)
+    add_custom_target(ollama-llama-server-backends ALL
+        DEPENDS ${_backend_targets}
+        COMMENT "Building llama-server GPU backends")
+endif()
+
+set(_mlx_targets)
+foreach(_backend IN LISTS OLLAMA_MLX_BACKENDS)
+    if(_backend STREQUAL "cuda_v13")
+        ollama_mlx_cuda_preset(_mlx_cuda_preset)
+        set(_mlx_cuda_args)
+        ollama_append_cache_arg_if_set(_mlx_cuda_args CMAKE_CUDA_ARCHITECTURES)
+        ollama_append_cache_arg_if_set(_mlx_cuda_args MLX_CUDA_ARCHITECTURES)
+        ollama_append_cache_arg_if_set(_mlx_cuda_args CMAKE_CUDA_FLAGS)
+        ollama_add_mlx_build(cuda_v13
+            PRESET ${_mlx_cuda_preset}
+            RUNNER_DIR mlx_cuda_v13
+            CMAKE_ARGS ${_mlx_cuda_args})
+        list(APPEND _mlx_targets ollama-mlx-cuda_v13)
+    elseif(_backend STREQUAL "metal_v3")
+        if(NOT APPLE)
+            message(FATAL_ERROR "OLLAMA_MLX_BACKENDS=metal_v3 is only supported on macOS")
+        endif()
+        ollama_check_metal_toolchain(_metal_version)
+        ollama_add_mlx_build(metal_v3
+            PRESET mlx_metal_v3
+            RUNNER_DIR mlx_metal_v3)
+        list(APPEND _mlx_targets ollama-mlx-metal_v3)
+    elseif(_backend STREQUAL "metal_v4")
+        if(NOT APPLE)
+            message(FATAL_ERROR "OLLAMA_MLX_BACKENDS=metal_v4 is only supported on macOS")
+        endif()
+        ollama_check_metal_toolchain(_metal_version)
+        ollama_macos_sdk_major_version(_ollama_mlx_sdk_major)
+        if(_ollama_mlx_sdk_major AND _ollama_mlx_sdk_major GREATER_EQUAL 26)
+            ollama_add_mlx_build(metal_v4
+                PRESET mlx_metal_v4
+                RUNNER_DIR mlx_metal_v4)
+            list(APPEND _mlx_targets ollama-mlx-metal_v4)
+        else()
+            message(FATAL_ERROR
+                "OLLAMA_MLX_BACKENDS=metal_v4 requires the macOS 26 SDK. "
+                "Install a newer Xcode or use OLLAMA_MLX_BACKENDS=metal_v3.")
+        endif()
+    else()
+        message(FATAL_ERROR
+            "Unknown OLLAMA_MLX_BACKENDS entry '${_backend}'")
+    endif()
+endforeach()
+
+if(_mlx_targets)
+    add_custom_target(ollama-mlx-backends ALL
+        DEPENDS ${_mlx_targets}
+        COMMENT "Building MLX backends")
+endif()
+
+install(DIRECTORY "${OLLAMA_PAYLOAD_INSTALL_PREFIX}/${OLLAMA_LIB_DIR}/"
+    DESTINATION "${OLLAMA_LIB_DIR}"
+    COMPONENT ollama-local
+    USE_SOURCE_PERMISSIONS)
--- a/cmake/mlx/CMakeLists.txt
+++ b/cmake/mlx/CMakeLists.txt
@@ -0,0 +1,235 @@
+cmake_minimum_required(VERSION 3.24)
+
+project(OllamaMLX C CXX)
+
+include(CheckLanguage)
+include(GNUInstallDirs)
+
+find_package(Threads REQUIRED)
+
+if(NOT CMAKE_CONFIGURATION_TYPES AND NOT CMAKE_BUILD_TYPE)
+    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
+endif()
+
+if(NOT DEFINED BUILD_SHARED_LIBS)
+    set(BUILD_SHARED_LIBS ON)
+endif()
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_CXX_EXTENSIONS ON)
+
+if(APPLE)
+    set(CMAKE_BUILD_RPATH "@loader_path")
+    set(CMAKE_INSTALL_RPATH "@loader_path")
+    set(CMAKE_BUILD_WITH_INSTALL_RPATH ON)
+endif()
+
+if(NOT DEFINED OLLAMA_SOURCE_DIR OR "${OLLAMA_SOURCE_DIR}" STREQUAL "")
+    get_filename_component(OLLAMA_SOURCE_DIR "${CMAKE_CURRENT_LIST_DIR}/../.." ABSOLUTE)
+endif()
+get_filename_component(OLLAMA_SOURCE_DIR "${OLLAMA_SOURCE_DIR}" ABSOLUTE BASE_DIR "${CMAKE_CURRENT_LIST_DIR}")
+set(OLLAMA_SOURCE_DIR "${OLLAMA_SOURCE_DIR}" CACHE PATH "Ollama repository root")
+
+set(OLLAMA_LIB_DIR "lib/ollama" CACHE STRING "Install destination for Ollama runtime payloads")
+set(OLLAMA_RUNNER_DIR "" CACHE STRING "Ollama runtime payload subdirectory")
+set(OLLAMA_BUILD_DIR ${CMAKE_BINARY_DIR}/lib/ollama)
+set(OLLAMA_INSTALL_DIR ${OLLAMA_LIB_DIR}/${OLLAMA_RUNNER_DIR})
+
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY         ${OLLAMA_BUILD_DIR})
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_DEBUG   ${OLLAMA_BUILD_DIR})
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_RELEASE ${OLLAMA_BUILD_DIR})
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY         ${OLLAMA_BUILD_DIR})
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_DEBUG   ${OLLAMA_BUILD_DIR})
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE ${OLLAMA_BUILD_DIR})
+
+if(MLX_CUDA_ARCHITECTURES OR CMAKE_CUDA_ARCHITECTURES)
+    check_language(CUDA)
+endif()
+
+option(OLLAMA_MLX_GENERATE_WRAPPERS "Regenerate MLX Go wrappers" OFF)
+
+message(STATUS "Setting up MLX (this takes a while...)")
+add_subdirectory(${OLLAMA_SOURCE_DIR}/x/imagegen/mlx ${CMAKE_BINARY_DIR}/x/imagegen/mlx)
+
+# Look for the CUDA toolkit (optional; only needed when MLX is built with CUDA support).
+find_package(CUDAToolkit)
+
+# Build list of directories for runtime dependency resolution.
+set(MLX_RUNTIME_DIRS ${CUDAToolkit_BIN_DIR} ${CUDAToolkit_BIN_DIR}/x64 ${CUDAToolkit_LIBRARY_DIR})
+# Add cuDNN bin paths for DLLs (Windows MLX CUDA builds).
+# CUDNN_ROOT_DIR (CMake variable or environment variable) gives the cuDNN location.
+if(CUDNN_ROOT_DIR)
+    set(_cudnn_root "${CUDNN_ROOT_DIR}")
+elseif(DEFINED ENV{CUDNN_ROOT_DIR})
+    set(_cudnn_root "$ENV{CUDNN_ROOT_DIR}")
+endif()
+if(_cudnn_root)
+    # cuDNN 9.x has versioned subdirectories under bin/ (e.g., bin/13.0/).
+    file(GLOB CUDNN_BIN_SUBDIRS "${_cudnn_root}/bin/*")
+    list(APPEND MLX_RUNTIME_DIRS ${CUDNN_BIN_SUBDIRS})
+endif()
+# Add build output directory and MLX dependency build directories.
+list(APPEND MLX_RUNTIME_DIRS ${OLLAMA_BUILD_DIR})
+# OpenBLAS DLL location (pre-built zip extracts into openblas-src/bin/).
+list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/openblas-src/bin)
+# NCCL: on Linux, if a real NCCL is found, CMake bundles libnccl.so via the
+# regexes below. If NCCL is not found, MLX links a static stub (OBJECT lib),
+# so there is no runtime dependency. On Windows, this path covers the stub
+# build directory so that the stub DLL is included in our dependencies.
+list(APPEND MLX_RUNTIME_DIRS ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/distributed/nccl/nccl_stub-prefix/src/nccl_stub-build/Release)
+
+# Base regexes for runtime dependencies (cross-platform).
+set(MLX_INCLUDE_REGEXES cublas cublasLt cudart cufft nvrtc nvrtc-builtins cudnn nccl openblas gfortran)
+# On Windows, also include dl.dll (dlfcn-win32 POSIX emulation layer).
+if(WIN32)
+    list(APPEND MLX_INCLUDE_REGEXES "^dl\\.dll$")
+endif()
+
+# Keep mlx/mlxc targets separate from runtime dependencies so --strip only
+# applies to the binaries we build, not vendor DLLs/libs.
+install(TARGETS mlx mlxc
+    RUNTIME_DEPENDENCY_SET mlx_runtime_deps
+    RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
+    LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
+    FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
+)
+install(RUNTIME_DEPENDENCY_SET mlx_runtime_deps
+    DIRECTORIES ${MLX_RUNTIME_DIRS}
+    PRE_INCLUDE_REGEXES ${MLX_INCLUDE_REGEXES}
+    PRE_EXCLUDE_REGEXES ".*"
+    RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
+    LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX_VENDOR
+)
+
+if(TARGET jaccl)
+    install(TARGETS jaccl
+        RUNTIME DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
+        LIBRARY DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
+        FRAMEWORK DESTINATION ${OLLAMA_INSTALL_DIR} COMPONENT MLX
+    )
+endif()
+
+# Install the Metal library for macOS arm64 (must be colocated with the binary).
+# Metal backend is only built for arm64, not x86_64.
+if(APPLE AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
+    install(FILES ${CMAKE_BINARY_DIR}/_deps/mlx-build/mlx/backend/metal/kernels/mlx.metallib
+        DESTINATION ${OLLAMA_INSTALL_DIR}
+        COMPONENT MLX)
+endif()
+
+# Install headers for NVRTC JIT compilation at runtime.
+# MLX's own install rules use the default component so they get skipped by
+# --component MLX. Headers are installed alongside libmlx in OLLAMA_INSTALL_DIR.
+#
+# Layout:
+#   ${OLLAMA_INSTALL_DIR}/include/cccl/{cuda,nv}/  - CCCL headers
+#   ${OLLAMA_INSTALL_DIR}/include/*.h              - CUDA toolkit headers
+#
+# MLX's jit_module.cpp resolves CCCL via
+#   current_binary_dir()[.parent_path()] / "include" / "cccl"
+# On Linux the lookup uses the parent directory, i.e.
+#   current_binary_dir().parent_path() / "include" / "cccl", so we create a
+#   symlink from lib/ollama/include -> ${OLLAMA_RUNNER_DIR}/include.
+# This will need refinement if we add multiple CUDA versions for MLX in the future.
+# CUDA runtime headers are found via the CUDA_PATH env var (set by mlxrunner).
+if(EXISTS ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda)
+    install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/cuda
+        DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
+        COMPONENT MLX)
+    install(DIRECTORY ${CMAKE_BINARY_DIR}/_deps/cccl-src/include/nv
+        DESTINATION ${OLLAMA_INSTALL_DIR}/include/cccl
+        COMPONENT MLX)
+endif()
+
+# Install minimal CUDA toolkit headers needed by MLX JIT kernels.
+# These are the transitive closure of includes from mlx/backend/cuda/device/*.cuh.
+# The Go mlxrunner sets CUDA_PATH to OLLAMA_INSTALL_DIR so MLX finds them at
+# $CUDA_PATH/include/*.h via NVRTC --include-path.
+if(CUDAToolkit_FOUND)
+    # CUDAToolkit_INCLUDE_DIRS may be a semicolon-separated list
+    # (e.g. ".../include;.../include/cccl"). Find the entry that
+    # contains the CUDA runtime headers we need.
+    set(_cuda_inc "")
+    foreach(_dir ${CUDAToolkit_INCLUDE_DIRS})
+        if(EXISTS "${_dir}/cuda_runtime_api.h")
+            set(_cuda_inc "${_dir}")
+            break()
+        endif()
+    endforeach()
+    if(NOT _cuda_inc)
+        message(WARNING "Could not find cuda_runtime_api.h in CUDAToolkit_INCLUDE_DIRS: ${CUDAToolkit_INCLUDE_DIRS}")
+    else()
+        set(_dst "${OLLAMA_INSTALL_DIR}/include")
+        set(_MLX_JIT_CUDA_HEADERS
+            builtin_types.h
+            cooperative_groups.h
+            cuda_bf16.h
+            cuda_bf16.hpp
+            cuda_device_runtime_api.h
+            cuda_fp16.h
+            cuda_fp16.hpp
+            cuda_fp8.h
+            cuda_fp8.hpp
+            cuda_runtime_api.h
+            device_types.h
+            driver_types.h
+            math_constants.h
+            surface_types.h
+            texture_types.h
+            vector_functions.h
+            vector_functions.hpp
+            vector_types.h
+        )
+        foreach(_hdr ${_MLX_JIT_CUDA_HEADERS})
+            install(FILES "${_cuda_inc}/${_hdr}"
+                DESTINATION ${_dst}
+                COMPONENT MLX)
+        endforeach()
+        # Subdirectory headers.
+        install(DIRECTORY "${_cuda_inc}/cooperative_groups"
+            DESTINATION ${_dst}
+            COMPONENT MLX
+            FILES_MATCHING PATTERN "*.h")
+        install(FILES "${_cuda_inc}/crt/host_defines.h"
+            DESTINATION "${_dst}/crt"
+            COMPONENT MLX)
+        if(NOT WIN32 AND NOT APPLE)
+            install(CODE "
+                set(_link \"${CMAKE_INSTALL_PREFIX}/${OLLAMA_LIB_DIR}/include\")
+                set(_target \"${OLLAMA_RUNNER_DIR}/include\")
+                if(NOT EXISTS \${_link})
+                    execute_process(COMMAND \${CMAKE_COMMAND} -E create_symlink \${_target} \${_link})
+                endif()
+            " COMPONENT MLX)
+        endif()
+    endif()
+endif()
+
+# On Windows, explicitly install dl.dll (dlfcn-win32 POSIX dlopen emulation).
+# RUNTIME_DEPENDENCIES auto-excludes it via POST_EXCLUDE_FILES_STRICT because
+# dlfcn-win32 is a known CMake target with its own install rules (which install
+# to the wrong destination). We must install it explicitly here.
+if(WIN32)
+    install(FILES ${OLLAMA_BUILD_DIR}/dl.dll
+        DESTINATION ${OLLAMA_INSTALL_DIR}
+        COMPONENT MLX)
+endif()
+
+# Manually install CUDA runtime libraries that MLX loads via dlopen
+# (not detected by RUNTIME_DEPENDENCIES since they aren't link-time deps).
+if(CUDAToolkit_FOUND)
+    file(GLOB MLX_CUDA_LIBS
+        "${CUDAToolkit_LIBRARY_DIR}/libcudart.so*"
+        "${CUDAToolkit_LIBRARY_DIR}/libcublas.so*"
+        "${CUDAToolkit_LIBRARY_DIR}/libcublasLt.so*"
+        "${CUDAToolkit_LIBRARY_DIR}/libnvrtc.so*"
+        "${CUDAToolkit_LIBRARY_DIR}/libnvrtc-builtins.so*"
+        "${CUDAToolkit_LIBRARY_DIR}/libcufft.so*"
+        "${CUDAToolkit_LIBRARY_DIR}/libcudnn.so*")
+    if(MLX_CUDA_LIBS)
+        install(FILES ${MLX_CUDA_LIBS}
+            DESTINATION ${OLLAMA_INSTALL_DIR}
+            COMPONENT MLX_VENDOR)
+    endif()
+endif()
--- a/cmake/mlx/CMakePresets.json
+++ b/cmake/mlx/CMakePresets.json
@@ -0,0 +1,90 @@
+{
+  "version": 3,
+  "configurePresets": [
+    {
+      "name": "default",
+      "binaryDir": "${sourceDir}/../../build/mlx",
+      "installDir": "${sourceDir}/../../dist",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "Release",
+        "CMAKE_MSVC_RUNTIME_LIBRARY": "MultiThreaded",
+        "OLLAMA_SOURCE_DIR": "${sourceDir}/../.."
+      }
+    },
+    {
+      "name": "mlx_cuda_v13_base",
+      "hidden": true,
+      "inherits": [ "default" ],
+      "cacheVariables": {
+        "CMAKE_CUDA_FLAGS": "-t 2",
+        "OLLAMA_RUNNER_DIR": "mlx_cuda_v13"
+      }
+    },
+    {
+      "name": "mlx_cuda_v13_linux",
+      "inherits": [ "mlx_cuda_v13_base" ],
+      "binaryDir": "${sourceDir}/../../build/mlx_cuda_v13",
+      "cacheVariables": {
+        "MLX_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual"
+      }
+    },
+    {
+      "name": "mlx_cuda_v13_windows",
+      "inherits": [ "mlx_cuda_v13_base" ],
+      "binaryDir": "${sourceDir}/../../build/mlx_cuda_v13",
+      "cacheVariables": {
+        "MLX_CUDA_ARCHITECTURES": "75-virtual;80-virtual;86-virtual;89-virtual;90-virtual;90a-virtual;100-virtual;103-virtual;110-virtual;120-virtual;121-virtual"
+      }
+    },
+    {
+      "name": "mlx_cuda_v13_user_arch",
+      "inherits": [ "mlx_cuda_v13_base" ],
+      "binaryDir": "${sourceDir}/../../build/mlx_cuda_v13"
+    },
+    {
+      "name": "mlx_metal_v3",
+      "inherits": [ "default" ],
+      "binaryDir": "${sourceDir}/../../build/metal-v3",
+      "cacheVariables": {
+        "CMAKE_OSX_DEPLOYMENT_TARGET": "14.0",
+        "OLLAMA_RUNNER_DIR": "mlx_metal_v3"
+      }
+    },
+    {
+      "name": "mlx_metal_v4",
+      "inherits": [ "default" ],
+      "binaryDir": "${sourceDir}/../../build/metal-v4",
+      "cacheVariables": {
+        "CMAKE_OSX_DEPLOYMENT_TARGET": "26.0",
+        "OLLAMA_RUNNER_DIR": "mlx_metal_v4"
+      }
+    }
+  ],
+  "buildPresets": [
+    {
+      "name": "mlx_cuda_v13_linux",
+      "configurePreset": "mlx_cuda_v13_linux",
+      "targets": [ "mlx", "mlxc" ]
+    },
+    {
+      "name": "mlx_cuda_v13_windows",
+      "configurePreset": "mlx_cuda_v13_windows",
+      "targets": [ "mlx", "mlxc" ]
+    },
+    {
+      "name": "mlx_cuda_v13_user_arch",
+      "configurePreset": "mlx_cuda_v13_user_arch",
+      "targets": [ "mlx", "mlxc" ]
+    },
+    {
+      "name": "mlx_metal_v3",
+      "configurePreset": "mlx_metal_v3",
+      "targets": [ "mlx", "mlxc" ]
+    },
+    {
+      "name": "mlx_metal_v4",
+      "configurePreset": "mlx_metal_v4",
+      "targets": [ "mlx", "mlxc" ]
+    }
+  ]
+}
--- a/cmake/vendor-mlx-c-headers.cmake
+++ b/cmake/vendor-mlx-c-headers.cmake
@@ -0,0 +1,14 @@
+if(NOT DEFINED MLX_C_HEADERS_DIR OR NOT IS_DIRECTORY "${MLX_C_HEADERS_DIR}")
+    message(FATAL_ERROR "MLX_C_HEADERS_DIR does not exist: ${MLX_C_HEADERS_DIR}")
+endif()
+if(NOT DEFINED MLX_C_HEADERS_DEST OR "${MLX_C_HEADERS_DEST}" STREQUAL "")
+    message(FATAL_ERROR "MLX_C_HEADERS_DEST is required")
+endif()
+
+file(GLOB _mlx_c_headers LIST_DIRECTORIES false "${MLX_C_HEADERS_DIR}/*.h")
+if(NOT _mlx_c_headers)
+    message(FATAL_ERROR "No MLX-C headers found in ${MLX_C_HEADERS_DIR}")
+endif()
+
+file(MAKE_DIRECTORY "${MLX_C_HEADERS_DEST}")
+file(COPY ${_mlx_c_headers} DESTINATION "${MLX_C_HEADERS_DEST}")
--- a/cmake/windows-arm64-llvm-mingw.cmake
+++ b/cmake/windows-arm64-llvm-mingw.cmake
@@ -0,0 +1,69 @@
+set(CMAKE_SYSTEM_NAME Windows)
+set(CMAKE_SYSTEM_PROCESSOR ARM64)
+
+set(_ollama_llvm_mingw_hints)
+if(DEFINED ENV{ProgramFiles})
+    file(GLOB _ollama_program_files_llvm_mingw_bins
+        LIST_DIRECTORIES true
+        "$ENV{ProgramFiles}/llvm-mingw-*-x86_64*/bin")
+    list(SORT _ollama_program_files_llvm_mingw_bins COMPARE NATURAL ORDER DESCENDING)
+    list(APPEND _ollama_llvm_mingw_hints ${_ollama_program_files_llvm_mingw_bins})
+endif()
+if(DEFINED ENV{LOCALAPPDATA})
+    file(GLOB _ollama_winget_llvm_mingw_bins
+        LIST_DIRECTORIES true
+        "$ENV{LOCALAPPDATA}/Microsoft/WinGet/Packages/MartinStorsjo.LLVM-MinGW*/llvm-mingw-*-x86_64*/bin")
+    list(SORT _ollama_winget_llvm_mingw_bins COMPARE NATURAL ORDER DESCENDING)
+    list(APPEND _ollama_llvm_mingw_hints ${_ollama_winget_llvm_mingw_bins})
+endif()
+
+if(NOT CMAKE_C_COMPILER)
+    find_program(CMAKE_C_COMPILER
+        NAMES aarch64-w64-mingw32-gcc
+        HINTS ${_ollama_llvm_mingw_hints}
+        REQUIRED)
+endif()
+
+if(NOT CMAKE_CXX_COMPILER)
+    find_program(CMAKE_CXX_COMPILER
+        NAMES aarch64-w64-mingw32-g++
+        HINTS ${_ollama_llvm_mingw_hints}
+        REQUIRED)
+endif()
+
+get_filename_component(_ollama_llvm_mingw_bin_dir "${CMAKE_CXX_COMPILER}" DIRECTORY)
+
+if(NOT HOST_CXX_COMPILER)
+    find_program(_ollama_path_host_cxx
+        NAMES clang++ g++
+        NO_CMAKE_FIND_ROOT_PATH)
+    if(_ollama_path_host_cxx)
+        set(HOST_CXX_COMPILER "${_ollama_path_host_cxx}")
+    endif()
+endif()
+if(NOT HOST_CXX_COMPILER)
+    find_program(_ollama_mingw_host_cxx
+        NAMES x86_64-w64-mingw32-g++
+        HINTS "${_ollama_llvm_mingw_bin_dir}"
+        REQUIRED)
+    if(CMAKE_HOST_WIN32)
+        # llama.cpp builds a small host-only UI embedding tool during
+        # cross-compiles, but currently models HOST_CXX_COMPILER as only an
+        # executable path and has no companion host flags hook. When the host
+        # compiler is llvm-mingw, the generated host tool otherwise depends on
+        # llvm-mingw runtime DLLs being on PATH. Keep that workaround local and
+        # explicit: wrap the compiler only to add -static for this host tool.
+        set(_ollama_host_cxx_wrapper "${CMAKE_BINARY_DIR}/ollama-host-cxx.cmd")
+        file(TO_NATIVE_PATH "${_ollama_mingw_host_cxx}" _ollama_mingw_host_cxx_native)
+        file(WRITE "${_ollama_host_cxx_wrapper}"
+            "@echo off\r\n"
+            "\"${_ollama_mingw_host_cxx_native}\" -static %*\r\n")
+        set(HOST_CXX_COMPILER "${_ollama_host_cxx_wrapper}")
+    else()
+        set(HOST_CXX_COMPILER "${_ollama_mingw_host_cxx}")
+    endif()
+endif()
+set(HOST_CXX_COMPILER "${HOST_CXX_COMPILER}" CACHE FILEPATH "Host C++ compiler for build-time tools" FORCE)
+
+string(PREPEND CMAKE_C_FLAGS_INIT "-D_WIN32_WINNT=0x0A00 ")
+string(PREPEND CMAKE_CXX_FLAGS_INIT "-D_WIN32_WINNT=0x0A00 ")
--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -18,6 +18,7 @@ import (
 	"os"
 	"os/exec"
 	"os/signal"
+	"path"
 	"path/filepath"
 	"runtime"
 	"slices"
@@ -41,6 +42,7 @@ import (
 	"github.com/ollama/ollama/cmd/config"
 	"github.com/ollama/ollama/cmd/launch"
 	"github.com/ollama/ollama/cmd/tui"
+	"github.com/ollama/ollama/discover"
 	"github.com/ollama/ollama/envconfig"
 	"github.com/ollama/ollama/format"
 	"github.com/ollama/ollama/internal/modelref"
@@ -232,9 +234,6 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 	// This gates both safetensors LLM and imagegen model creation
 	experimental, _ := cmd.Flags().GetBool("experimental")
 	draftQuantize, _ := cmd.Flags().GetString("draft-quantize")
-	if draftQuantize != "" && !experimental {
-		return errors.New("--draft-quantize requires --experimental")
-	}
 	if experimental {
 		if !isLocalhost() {
 			return errors.New("remote safetensor model creation not yet supported")
@@ -329,6 +328,12 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 	if quantize != "" {
 		req.Quantize = quantize
 	}
+	if draftQuantize != "" {
+		if len(req.DraftFiles) == 0 {
+			return errors.New("--draft-quantize requires a DRAFT model")
+		}
+		req.DraftQuantize = draftQuantize
+	}

 	client, err := api.ClientFromEnvironment()
 	if err != nil {
@@ -339,29 +344,40 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 	g.SetLimit(max(runtime.GOMAXPROCS(0)-1, 1))

 	files := syncmap.NewSyncMap[string, string]()
+	fileNames := createRequestFileNames(req.Files)
 	for f, digest := range req.Files {
 		g.Go(func() error {
 			if _, err := createBlob(cmd, client, f, digest, p); err != nil {
 				return err
 			}

-			// TODO: this is incorrect since the file might be in a subdirectory
-			//       instead this should take the path relative to the model directory
-			//       but the current implementation does not allow this
-			files.Store(filepath.Base(f), digest)
+			files.Store(fileNames[f], digest)
 			return nil
 		})
 	}

 	adapters := syncmap.NewSyncMap[string, string]()
+	adapterNames := createRequestFileNames(req.Adapters)
 	for f, digest := range req.Adapters {
 		g.Go(func() error {
 			if _, err := createBlob(cmd, client, f, digest, p); err != nil {
 				return err
 			}

-			// TODO: same here
-			adapters.Store(filepath.Base(f), digest)
+			adapters.Store(adapterNames[f], digest)
+			return nil
+		})
+	}
+
+	draftFiles := syncmap.NewSyncMap[string, string]()
+	draftFileNames := createRequestFileNames(req.DraftFiles)
+	for f, digest := range req.DraftFiles {
+		g.Go(func() error {
+			if _, err := createBlob(cmd, client, f, digest, p); err != nil {
+				return err
+			}
+
+			draftFiles.Store(draftFileNames[f], digest)
 			return nil
 		})
 	}
@@ -372,6 +388,7 @@ func CreateHandler(cmd *cobra.Command, args []string) error {

 	req.Files = files.Items()
 	req.Adapters = adapters.Items()
+	req.DraftFiles = draftFiles.Items()

 	bars := make(map[string]*progress.Bar)
 	fn := func(resp api.ProgressResponse) error {
@@ -409,6 +426,65 @@ func CreateHandler(cmd *cobra.Command, args []string) error {
 	return nil
 }

+func createRequestFileNames(files map[string]string) map[string]string {
+	names := make(map[string]string, len(files))
+	root, ok := commonFileRoot(files)
+	for f := range files {
+		name := filepath.Base(f)
+		if ok {
+			abs, err := filepath.Abs(f)
+			if err == nil {
+				if rel, err := filepath.Rel(root, abs); err == nil && rel != "." && rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
+					name = rel
+				}
+			}
+		}
+		names[f] = path.Clean(filepath.ToSlash(name))
+	}
+	return names
+}
+
+func commonFileRoot(files map[string]string) (string, bool) {
+	if len(files) < 2 {
+		return "", false
+	}
+
+	var root string
+	var volume string
+	for f := range files {
+		abs, err := filepath.Abs(f)
+		if err != nil {
+			return "", false
+		}
+		if nextVolume := filepath.VolumeName(abs); volume == "" {
+			volume = nextVolume
+		} else if !strings.EqualFold(volume, nextVolume) {
+			return "", false
+		}
+
+		dir := filepath.Dir(abs)
+		if root == "" {
+			root = dir
+			continue
+		}
+
+		for {
+			rel, err := filepath.Rel(root, dir)
+			if err == nil && (rel == "." || (rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)))) {
+				break
+			}
+
+			parent := filepath.Dir(root)
+			if parent == root {
+				return "", false
+			}
+			root = parent
+		}
+	}
+
+	return root, root != ""
+}
+
 func createBlob(cmd *cobra.Command, client *api.Client, path string, digest string, p *progress.Progress) (string, error) {
 	realPath, err := filepath.EvalSymlinks(path)
 	if err != nil {
@@ -1277,11 +1353,28 @@ func showInfo(resp *api.ShowResponse, verbose bool, w io.Writer) error {

 	if resp.ProjectorInfo != nil {
 		tableRender("Projector", func() (rows [][]string) {
-			arch := resp.ProjectorInfo["general.architecture"].(string)
-			rows = append(rows, []string{"", "architecture", arch})
-			rows = append(rows, []string{"", "parameters", format.HumanNumber(uint64(resp.ProjectorInfo["general.parameter_count"].(float64)))})
-			rows = append(rows, []string{"", "embedding length", strconv.FormatFloat(resp.ProjectorInfo[fmt.Sprintf("%s.vision.embedding_length", arch)].(float64), 'f', -1, 64)})
-			rows = append(rows, []string{"", "dimensions", strconv.FormatFloat(resp.ProjectorInfo[fmt.Sprintf("%s.vision.projection_dim", arch)].(float64), 'f', -1, 64)})
+			arch, _ := resp.ProjectorInfo["general.architecture"].(string)
+			if arch != "" {
+				rows = append(rows, []string{"", "architecture", arch})
+			}
+			if v, ok := resp.ProjectorInfo["general.parameter_count"].(float64); ok {
+				rows = append(rows, []string{"", "parameters", format.HumanNumber(uint64(v))})
+			}
+
+			projectorValue := func(suffix string) (float64, bool) {
+				for _, modality := range []string{"vision", "audio"} {
+					if v, ok := resp.ProjectorInfo[fmt.Sprintf("%s.%s.%s", arch, modality, suffix)].(float64); ok {
+						return v, true
+					}
+				}
+				return 0, false
+			}
+			if v, ok := projectorValue("embedding_length"); ok {
+				rows = append(rows, []string{"", "embedding length", strconv.FormatFloat(v, 'f', -1, 64)})
+			}
+			if v, ok := projectorValue("projection_dim"); ok {
+				rows = append(rows, []string{"", "dimensions", strconv.FormatFloat(v, 'f', -1, 64)})
+			}
 			return
 		})
 	}
@@ -2277,9 +2370,6 @@ func NewCLI() *cobra.Command {
 			if experimental, _ := cmd.Flags().GetBool("experimental"); experimental {
 				return nil
 			}
-			if draftQuantize, _ := cmd.Flags().GetString("draft-quantize"); draftQuantize != "" {
-				return errors.New("--draft-quantize requires --experimental")
-			}
 			return checkServerHeartbeat(cmd, args)
 		},
 		RunE: CreateHandler,
@@ -2445,6 +2535,16 @@ func NewCLI() *cobra.Command {
 		_ = runner.Execute(args[1:])
 	})

+	var gpuDiscoverLibDirs []string
+	gpuDiscoverCmd := &cobra.Command{
+		Use:    "gpu-discover",
+		Hidden: true,
+		RunE: func(cmd *cobra.Command, _ []string) error {
+			return discover.RunNativeProbeCommand(cmd.Context(), gpuDiscoverLibDirs, os.Stdout)
+		},
+	}
+	gpuDiscoverCmd.Flags().StringArrayVar(&gpuDiscoverLibDirs, "lib-dir", nil, "Ollama runtime library directory")
+
 	envVars := envconfig.AsMap()

 	envs := []envconfig.EnvVar{envVars["OLLAMA_HOST"]}
@@ -2485,6 +2585,9 @@ func NewCLI() *cobra.Command {
 				envVars["OLLAMA_KV_CACHE_TYPE"],
 				envVars["OLLAMA_LLM_LIBRARY"],
 				envVars["OLLAMA_GPU_OVERHEAD"],
+				envVars["OLLAMA_IGPU_ENABLE"],
+				envVars["LLAMA_ARG_FIT"],
+				envVars["LLAMA_ARG_FIT_TARGET"],
 				envVars["OLLAMA_LOAD_TIMEOUT"],
 			})
 		default:
@@ -2509,6 +2612,7 @@ func NewCLI() *cobra.Command {
 		copyCmd,
 		deleteCmd,
 		runnerCmd,
+		gpuDiscoverCmd,
 		launch.LaunchCmd(checkServerHeartbeat, runInteractiveTUI),
 	)

--- a/cmd/cmd_test.go
+++ b/cmd/cmd_test.go
@@ -1525,34 +1525,65 @@ func TestCreateHandler(t *testing.T) {
 	}
 }

-func TestCreateHandlerDraftQuantizeRequiresExperimental(t *testing.T) {
-	cmd := &cobra.Command{}
-	cmd.Flags().Bool("experimental", false, "")
-	cmd.Flags().String("draft-quantize", "mxfp8", "")
-	cmd.SetContext(t.Context())
+func TestCreateRequestFileNamesPreservesModelDirectoryLayout(t *testing.T) {
+	root := t.TempDir()
+	files := map[string]string{
+		filepath.Join(root, "model.safetensors"):            "sha256:model",
+		filepath.Join(root, "config.json"):                  "sha256:config",
+		filepath.Join(root, "2_Dense", "config.json"):       "sha256:dense-config",
+		filepath.Join(root, "2_Dense", "model.safetensors"): "sha256:dense-model",
+	}

-	err := CreateHandler(cmd, []string{"test-model"})
-	if err == nil || !strings.Contains(err.Error(), "--draft-quantize requires --experimental") {
-		t.Fatalf("error = %v, want draft-quantize requires experimental", err)
+	got := createRequestFileNames(files)
+	want := map[string]string{
+		filepath.Join(root, "model.safetensors"):            "model.safetensors",
+		filepath.Join(root, "config.json"):                  "config.json",
+		filepath.Join(root, "2_Dense", "config.json"):       "2_Dense/config.json",
+		filepath.Join(root, "2_Dense", "model.safetensors"): "2_Dense/model.safetensors",
+	}
+
+	if diff := cmp.Diff(want, got); diff != "" {
+		t.Fatalf("mismatch (-want +got):\n%s", diff)
 	}
 }

-func TestCreateHandlerDraftRequiresExperimental(t *testing.T) {
+func TestCreateRequestFileNamesPreservesRelativeModelDirectoryLayout(t *testing.T) {
+	root := t.TempDir()
+	t.Chdir(root)
+
+	files := map[string]string{
+		"model.safetensors":         "sha256:model",
+		"config.json":               "sha256:config",
+		"2_Dense/config.json":       "sha256:dense-config",
+		"2_Dense/model.safetensors": "sha256:dense-model",
+		"3_Dense/config.json":       "sha256:dense-config",
+		"3_Dense/model.safetensors": "sha256:dense-model",
+	}
+
+	got := createRequestFileNames(files)
+	for file := range files {
+		if got[file] != filepath.ToSlash(file) {
+			t.Fatalf("%s = %q, want %q", file, got[file], filepath.ToSlash(file))
+		}
+	}
+}
+
+func TestCreateHandlerDraftQuantizeRequiresDraft(t *testing.T) {
 	dir := t.TempDir()
 	modelfile := filepath.Join(dir, "Modelfile")
-	if err := os.WriteFile(modelfile, []byte("FROM base\nDRAFT ./assistant\n"), 0o644); err != nil {
+	if err := os.WriteFile(modelfile, []byte("FROM base\n"), 0o644); err != nil {
 		t.Fatal(err)
 	}

 	cmd := &cobra.Command{}
 	cmd.Flags().Bool("experimental", false, "")
-	cmd.Flags().String("draft-quantize", "", "")
 	cmd.Flags().String("file", modelfile, "")
+	cmd.Flags().String("draft-quantize", "mxfp8", "")
 	cmd.SetContext(t.Context())

 	err := CreateHandler(cmd, []string{"test-model"})
-	if err == nil || !strings.Contains(err.Error(), "DRAFT requires --experimental") {
-		t.Fatalf("error = %v, want DRAFT requires --experimental", err)
+	if err == nil || !strings.Contains(err.Error(), "--draft-quantize requires a DRAFT model") {
+		t.Fatalf("error = %v, want draft-quantize requires DRAFT", err)
 	}
 }

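Note: the tests above expect createRequestFileNames to keep the subdirectory layout of a model directory (for example "2_Dense/config.json") instead of collapsing every file to its base name. The following is a minimal sketch of that idea only, not the code in this change: it assumes clean, absolute, slash-separated inputs and uses package path rather than filepath, and the helpers commonDir, relativeNames, and formatNames are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// commonDir returns the deepest directory containing every file.
// Assumption: inputs are clean, absolute, slash-separated paths.
func commonDir(files []string) string {
	if len(files) == 0 {
		return ""
	}
	root := path.Dir(files[0])
	for _, f := range files[1:] {
		dir := path.Dir(f)
		// Walk root upward until it equals dir or is an ancestor of dir.
		for root != "/" && dir != root && !strings.HasPrefix(dir, root+"/") {
			root = path.Dir(root)
		}
	}
	return root
}

// relativeNames maps each file to its path relative to the common directory.
func relativeNames(files []string) map[string]string {
	root := commonDir(files)
	names := make(map[string]string, len(files))
	for _, f := range files {
		names[f] = strings.TrimPrefix(strings.TrimPrefix(f, root), "/")
	}
	return names
}

// formatNames renders the mapping as sorted "file -> name" lines.
func formatNames(names map[string]string) []string {
	lines := make([]string, 0, len(names))
	for f, n := range names {
		lines = append(lines, fmt.Sprintf("%s -> %s", f, n))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	files := []string{
		"/m/config.json",
		"/m/2_Dense/config.json",
		"/m/2_Dense/model.safetensors",
	}
	for _, line := range formatNames(relativeNames(files)) {
		fmt.Println(line)
	}
}
```

The version in cmd/cmd.go additionally resolves relative paths with filepath.Abs, compares volume names, and falls back to the base name when fewer than two files are given; the sketch leaves those cases out.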
--- a/cmd/launch/models.go
+++ b/cmd/launch/models.go
@@ -496,17 +496,6 @@ func isCloudModelName(name string) bool {
 	return modelref.HasExplicitCloudSource(name)
 }

-// filterCloudModels drops remote-only models from the given inventory.
-func filterCloudModels(existing []modelInfo) []modelInfo {
-	filtered := existing[:0]
-	for _, m := range existing {
-		if !m.Remote {
-			filtered = append(filtered, m)
-		}
-	}
-	return filtered
-}
-
 // filterCloudItems removes cloud models from selection items.
 func filterCloudItems(items []ModelItem) []ModelItem {
 	filtered := items[:0]
--- a/convert/convert.go
+++ b/convert/convert.go
@@ -147,7 +147,9 @@ func (ModelParameters) KV(t *Tokenizer) KV {
 	}

 	for _, sv := range t.SpecialVocabulary {
-		kv[fmt.Sprintf("tokenizer.ggml.add_%s_token", sv.Key())] = sv.AddToken
+		if sv.AddTokenSet {
+			kv[fmt.Sprintf("tokenizer.ggml.add_%s_token", sv.Key())] = sv.AddToken
+		}
 		kv[fmt.Sprintf("tokenizer.ggml.%s_token_id", sv.Key())] = uint32(sv.ID)
 		if len(sv.IDs) > 0 {
 			kv[fmt.Sprintf("tokenizer.ggml.%s_token_ids", sv.Key())] = sv.IDs
@@ -200,10 +202,32 @@ type ModelConverter interface {
 	specialTokenTypes() []string
 }

+// MultimodalConverter splits checkpoints with embedded vision/projector
+// weights into a text model GGUF and a separate projector GGUF.
+type MultimodalConverter interface {
+	ModelConverter
+	TextKV(*Tokenizer) KV
+	TextTensors([]Tensor, *Tokenizer) []*ggml.Tensor
+	ProjectorKV(*Tokenizer) KV
+	ProjectorTensors([]Tensor) []*ggml.Tensor
+}
+
 type moreParser interface {
 	parseMore(fs.FS) error
 }

+type extraTensorParser interface {
+	extraTensors(fs.FS) ([]Tensor, error)
+}
+
+type tokenizerAdjuster interface {
+	adjustTokenizer(*Tokenizer)
+}
+
+type tokenizerAwareTensorConverter interface {
+	TensorsWithTokenizer([]Tensor, *Tokenizer) []*ggml.Tensor
+}
+
 type AdapterConverter interface {
 	// KV maps parameters to LLM key-values
 	KV(ofs.Config) KV
@@ -288,6 +312,8 @@ func LoadModelMetadata(fsys fs.FS) (ModelKV, *Tokenizer, error) {
 		conv = &gemma2Model{}
 	case "Gemma3ForCausalLM", "Gemma3ForConditionalGeneration":
 		conv = &gemma3Model{Architecture: p.Architectures[0]}
+	case "Gemma3TextModel":
+		conv = &embeddingGemmaModel{}
 	case "Gemma3nForConditionalGeneration":
 		conv = &gemma3nModel{}
 	case "Gemma4ForCausalLM", "Gemma4ForConditionalGeneration":
@@ -348,6 +374,9 @@ func LoadModelMetadata(fsys fs.FS) (ModelKV, *Tokenizer, error) {
 	if err != nil {
 		return nil, nil, err
 	}
+	if ta, ok := conv.(tokenizerAdjuster); ok {
+		ta.adjustTokenizer(t)
+	}

 	vocabSize := int(cmp.Or(p.VocabSize, p.TextModel.VocabSize))

@@ -375,7 +404,7 @@ func LoadModelMetadata(fsys fs.FS) (ModelKV, *Tokenizer, error) {
 // and files it finds in the input path.
 // Supported input model formats include safetensors.
 // Supported input tokenizers files include tokenizer.json (preferred) and tokenizer.model.
-func ConvertModel(fsys fs.FS, f *os.File) error {
+func ConvertModel(fsys fs.FS, f *os.File, projectorFiles ...*os.File) error {
 	kv, t, err := LoadModelMetadata(fsys)
 	if err != nil {
 		return err
@@ -387,7 +416,47 @@ func ConvertModel(fsys fs.FS, f *os.File) error {
 		return err
 	}

-	return writeFile(f, conv.KV(t), conv.Tensors(ts))
+	if tp, ok := conv.(extraTensorParser); ok {
+		extra, err := tp.extraTensors(fsys)
+		if err != nil {
+			return err
+		}
+		ts = append(ts, extra...)
+	}
+
+	if err := ensureUniqueTensorNames(ts); err != nil {
+		return err
+	}
+
+	if mc, ok := conv.(MultimodalConverter); ok && len(projectorFiles) > 0 && projectorFiles[0] != nil {
+		projectorTensors := mc.ProjectorTensors(ts)
+		if len(projectorTensors) > 0 {
+			if err := writeFile(f, mc.TextKV(t), mc.TextTensors(ts, t)); err != nil {
+				return err
+			}
+			return writeFile(projectorFiles[0], mc.ProjectorKV(t), projectorTensors)
+		}
+	}
+
+	var tensors []*ggml.Tensor
+	if tc, ok := conv.(tokenizerAwareTensorConverter); ok {
+		tensors = tc.TensorsWithTokenizer(ts, t)
+	} else {
+		tensors = conv.Tensors(ts)
+	}
+
+	return writeFile(f, conv.KV(t), tensors)
+}
+
+func ensureUniqueTensorNames(ts []Tensor) error {
+	names := make(map[string]struct{}, len(ts))
+	for _, t := range ts {
+		if _, ok := names[t.Name()]; ok {
+			return fmt.Errorf("duplicate tensor name '%s' was found for this model", t.Name())
+		}
+		names[t.Name()] = struct{}{}
+	}
+	return nil
 }

 func writeFile(f *os.File, kv KV, ts []*ggml.Tensor) error {
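Note: the ConvertModel changes above rely on optional interfaces (extraTensorParser, tokenizerAdjuster, tokenizerAwareTensorConverter) that a converter may or may not implement, detected with a type assertion. Below is a small, self-contained sketch of that pattern under simplified assumptions: tensors are plain strings, and baseConverter, tokenizerAware, plainModel, vocabModel, convert, and describe are hypothetical stand-ins, not types from the convert package.

```go
package main

import (
	"fmt"
	"strings"
)

// baseConverter stands in for the required converter interface.
type baseConverter interface {
	Tensors(names []string) []string
}

// tokenizerAware is an optional capability, detected via type assertion.
type tokenizerAware interface {
	TensorsWithTokenizer(names []string, vocab string) []string
}

// plainModel implements only the required method.
type plainModel struct{}

func (plainModel) Tensors(names []string) []string { return names }

// vocabModel embeds plainModel and also implements the optional method.
type vocabModel struct{ plainModel }

func (vocabModel) TensorsWithTokenizer(names []string, vocab string) []string {
	out := make([]string, 0, len(names))
	for _, n := range names {
		out = append(out, fmt.Sprintf("%s@%s", n, vocab))
	}
	return out
}

// convert prefers the optional method when the converter provides it
// and falls back to the required one otherwise.
func convert(c baseConverter, names []string, vocab string) []string {
	if tc, ok := c.(tokenizerAware); ok {
		return tc.TensorsWithTokenizer(names, vocab)
	}
	return c.Tensors(names)
}

// describe runs convert on a fixed input and joins the result for display.
func describe(c baseConverter) string {
	return strings.Join(convert(c, []string{"a", "b"}, "v1"), ",")
}

func main() {
	fmt.Println(describe(plainModel{}))
	fmt.Println(describe(vocabModel{}))
}
```

With this shape, a new model type opts in to the extra behavior by adding a method, and existing converters need no changes.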
--- a/convert/convert_embeddinggemma.go
+++ b/convert/convert_embeddinggemma.go
@@ -0,0 +1,280 @@
+package convert
+
+import (
+	"cmp"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"io/fs"
+	"path"
+	"slices"
+	"strings"
+
+	"github.com/ollama/ollama/fs/ggml"
+)
+
+type embeddingGemmaModel struct {
+	gemmaModel
+	RopeLocalTheta float32 `json:"rope_local_base_freq"`
+	RopeTheta      float32 `json:"rope_theta"`
+	SlidingWindow  uint32  `json:"sliding_window"`
+
+	poolingType  uint32
+	denseModules []embeddingGemmaDenseModule
+}
+
+type embeddingGemmaDenseModule struct {
+	path       string
+	tensorName string
+	in, out    uint32
+}
+
+var (
+	_ ModelConverter    = (*embeddingGemmaModel)(nil)
+	_ moreParser        = (*embeddingGemmaModel)(nil)
+	_ extraTensorParser = (*embeddingGemmaModel)(nil)
+	_ tokenizerAdjuster = (*embeddingGemmaModel)(nil)
+)
+
+func (m *embeddingGemmaModel) KV(t *Tokenizer) KV {
+	kv := m.ModelParameters.KV(t)
+	kv["general.architecture"] = "gemma-embedding"
+	kv["gemma-embedding.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(2048))
+	kv["gemma-embedding.embedding_length"] = m.HiddenSize
+	kv["gemma-embedding.block_count"] = m.HiddenLayers
+	kv["gemma-embedding.feed_forward_length"] = m.IntermediateSize
+	kv["gemma-embedding.attention.head_count"] = m.NumAttentionHeads
+	kv["gemma-embedding.attention.head_count_kv"] = m.NumKeyValueHeads
+	kv["gemma-embedding.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEPS, float32(1e-6))
+	kv["gemma-embedding.attention.key_length"] = m.HeadDim
+	kv["gemma-embedding.attention.value_length"] = m.HeadDim
+	kv["gemma-embedding.attention.sliding_window"] = m.SlidingWindow
+	kv["gemma-embedding.rope.freq_base"] = cmp.Or(m.RopeTheta, float32(1000000.0))
+	kv["gemma-embedding.rope.freq_base_swa"] = cmp.Or(m.RopeLocalTheta, float32(10000.0))
+	kv["gemma-embedding.pooling_type"] = cmp.Or(m.poolingType, uint32(1))
+
+	for _, dense := range m.denseModules {
+		kv["gemma-embedding."+dense.tensorName+"_feat_in"] = dense.in
+		kv["gemma-embedding."+dense.tensorName+"_feat_out"] = dense.out
+	}
+
+	return kv
+}
+
+func (m *embeddingGemmaModel) parseMore(fsys fs.FS) error {
+	bts, err := fs.ReadFile(fsys, "modules.json")
+	if err != nil {
+		if errors.Is(err, fs.ErrNotExist) {
+			return errors.New("embeddinggemma requires sentence-transformers modules.json")
+		}
+		return err
+	}
+
+	var modules []struct {
+		Type string `json:"type"`
+		Path string `json:"path"`
+	}
+
+	if err := json.Unmarshal(bts, &modules); err != nil {
+		return err
+	}
+
+	m.poolingType = 1
+	m.denseModules = nil
+	for _, module := range modules {
+		switch module.Type {
+		case "sentence_transformers.models.Pooling":
+			poolingType, err := embeddingGemmaPoolingType(fsys, module.Path)
+			if err != nil {
+				return err
+			}
+			if poolingType != 0 {
+				m.poolingType = poolingType
+			}
+		case "sentence_transformers.models.Dense":
+			dense, ok, err := embeddingGemmaDenseModuleConfig(fsys, module.Path)
+			if err != nil {
+				return err
+			}
+			if ok {
+				m.denseModules = append(m.denseModules, dense)
+			}
+		}
+	}
+
+	slices.SortFunc(m.denseModules, func(a, b embeddingGemmaDenseModule) int {
+		return strings.Compare(a.tensorName, b.tensorName)
+	})
+
+	if len(m.denseModules) != 2 ||
+		m.denseModules[0].tensorName != "dense_2" ||
+		m.denseModules[1].tensorName != "dense_3" {
+		return errors.New("embeddinggemma requires sentence-transformers 2_Dense and 3_Dense modules")
+	}
+
+	return nil
+}
+
+func (m *embeddingGemmaModel) adjustTokenizer(t *Tokenizer) {
+	n := int(m.VocabSize)
+	if n == 0 || len(t.Vocabulary.Tokens) <= n {
+		return
+	}
+
+	t.Vocabulary.Tokens = t.Vocabulary.Tokens[:n]
+	if len(t.Vocabulary.Scores) > n {
+		t.Vocabulary.Scores = t.Vocabulary.Scores[:n]
+	}
+	if len(t.Vocabulary.Types) > n {
+		t.Vocabulary.Types = t.Vocabulary.Types[:n]
+	}
+}
+
+func embeddingGemmaPoolingType(fsys fs.FS, modulePath string) (uint32, error) {
+	if modulePath == "" {
+		return 0, nil
+	}
+
+	bts, err := fs.ReadFile(fsys, path.Join(modulePath, "config.json"))
+	if err != nil {
+		if errors.Is(err, fs.ErrNotExist) {
+			return 0, nil
+		}
+		return 0, err
+	}
+
+	var cfg struct {
+		PoolingModeMeanTokens bool `json:"pooling_mode_mean_tokens"`
+		PoolingModeCLSToken   bool `json:"pooling_mode_cls_token"`
+	}
+	if err := json.Unmarshal(bts, &cfg); err != nil {
+		return 0, err
+	}
+
+	switch {
+	case cfg.PoolingModeMeanTokens:
+		return 1, nil // mean pooling
+	case cfg.PoolingModeCLSToken:
+		return 2, nil // CLS-token pooling
+	default:
+		return 0, nil
+	}
+}
+
+func embeddingGemmaDenseModuleConfig(fsys fs.FS, modulePath string) (embeddingGemmaDenseModule, bool, error) {
+	tensorName, ok := embeddingGemmaDenseTensorName(modulePath)
+	if !ok {
+		return embeddingGemmaDenseModule{}, false, nil
+	}
+
+	weightsPath := path.Join(modulePath, "model.safetensors")
+	if _, err := fs.Stat(fsys, weightsPath); err != nil {
+		if errors.Is(err, fs.ErrNotExist) {
+			return embeddingGemmaDenseModule{}, false, nil
+		}
+		return embeddingGemmaDenseModule{}, false, err
+	}
+
+	bts, err := fs.ReadFile(fsys, path.Join(modulePath, "config.json"))
+	if err != nil {
+		return embeddingGemmaDenseModule{}, false, err
+	}
+
+	var cfg struct {
+		InFeatures  uint32 `json:"in_features"`
+		OutFeatures uint32 `json:"out_features"`
+		Bias        bool   `json:"bias"`
+	}
+	if err := json.Unmarshal(bts, &cfg); err != nil {
+		return embeddingGemmaDenseModule{}, false, err
+	}
+	if cfg.InFeatures == 0 || cfg.OutFeatures == 0 {
+		return embeddingGemmaDenseModule{}, false, errors.New("embeddinggemma dense layer config missing in/out features")
+	}
+	if cfg.Bias {
+		return embeddingGemmaDenseModule{}, false, fmt.Errorf("embeddinggemma dense layer %s has unsupported bias", modulePath)
+	}
+
+	return embeddingGemmaDenseModule{
+		path:       weightsPath,
+		tensorName: tensorName,
+		in:         cfg.InFeatures,
+		out:        cfg.OutFeatures,
+	}, true, nil
+}
+
+func embeddingGemmaDenseTensorName(modulePath string) (string, bool) {
+	switch modulePath {
+	case "2_Dense":
+		return "dense_2", true
+	case "3_Dense":
+		return "dense_3", true
+	default:
+		return "", false
+	}
+}
+
+func (m *embeddingGemmaModel) extraTensors(fsys fs.FS) ([]Tensor, error) {
+	var extra []Tensor
+	for _, dense := range m.denseModules {
+		ts, err := parseSafetensors(fsys, strings.NewReplacer("linear.", dense.tensorName+"."), dense.path)
+		if err != nil {
+			return nil, err
+		}
+
+		foundWeight := false
+		for _, t := range ts {
+			if t.Name() == dense.tensorName+".weight" {
+				extra = append(extra, t)
+				foundWeight = true
+			}
+		}
+		if !foundWeight {
+			return nil, fmt.Errorf("embeddinggemma dense module %s missing linear.weight", dense.path)
+		}
+	}
+
+	return extra, nil
+}
+
+func (m *embeddingGemmaModel) Tensors(ts []Tensor) []*ggml.Tensor {
+	out := make([]*ggml.Tensor, 0, len(ts))
+	for _, t := range ts {
+		name := t.Name()
+		if name == "norm.weight" {
+			name = "output_norm.weight"
+		}
+		if strings.HasSuffix(name, "_norm.weight") {
+			t.SetRepacker(m.addOne)
+		}
+
+		out = append(out, &ggml.Tensor{
+			Name:     name,
+			Kind:     t.Kind(),
+			Shape:    t.Shape(),
+			WriterTo: t,
+		})
+	}
+
+	return out
+}
+
+func (m *embeddingGemmaModel) Replacements() []string {
+	return []string{
+		"embed_tokens.", "token_embd.",
+		"layers.", "blk.",
+		"input_layernorm", "attn_norm",
+		"self_attn.q_proj", "attn_q",
+		"self_attn.q_norm", "attn_q_norm",
+		"self_attn.k_proj", "attn_k",
+		"self_attn.k_norm", "attn_k_norm",
+		"self_attn.v_proj", "attn_v",
+		"self_attn.o_proj", "attn_output",
+		"mlp.gate_proj", "ffn_gate",
+		"mlp.down_proj", "ffn_down",
+		"mlp.up_proj", "ffn_up",
+		"post_attention_layernorm", "post_attention_norm",
+		"pre_feedforward_layernorm", "ffn_norm",
+		"post_feedforward_layernorm", "post_ffw_norm",
+	}
+}
--- a/convert/convert_embeddinggemma_test.go
+++ b/convert/convert_embeddinggemma_test.go
@@ -0,0 +1,229 @@
+package convert
+
+import (
+	"bytes"
+	"encoding/binary"
+	"encoding/json"
+	"io"
+	"math"
+	"os"
+	"path/filepath"
+	"slices"
+	"testing"
+
+	"github.com/ollama/ollama/fs/ggml"
+)
+
+func TestConvertEmbeddingGemmaSentenceTransformers(t *testing.T) {
+	tempDir := t.TempDir()
+
+	writeJSONFile(t, filepath.Join(tempDir, "config.json"), map[string]any{
+		"architectures":               []string{"Gemma3TextModel"},
+		"vocab_size":                  uint32(4),
+		"max_position_embeddings":     uint32(2048),
+		"hidden_size":                 uint32(8),
+		"num_hidden_layers":           uint32(1),
+		"intermediate_size":           uint32(12),
+		"num_attention_heads":         uint32(1),
+		"num_key_value_heads":         uint32(1),
+		"head_dim":                    uint32(8),
+		"rms_norm_eps":                float32(1e-6),
+		"rope_theta":                  float32(1000000),
+		"rope_local_base_freq":        float32(10000),
+		"sliding_window":              uint32(512),
+		"use_bidirectional_attention": true,
+	})
+	writeJSONFile(t, filepath.Join(tempDir, "tokenizer.json"), map[string]any{
+		"model": map[string]any{
+			"vocab": map[string]int{
+				"<pad>": 0,
+				"<eos>": 1,
+				"<bos>": 2,
+				"<unk>": 3,
+			},
+		},
+		"added_tokens": []map[string]any{
+			{"id": 4, "content": "<image_soft_token>", "special": true},
+		},
+	})
+	writeJSONFile(t, filepath.Join(tempDir, "modules.json"), []map[string]string{
+		{"type": "sentence_transformers.models.Transformer", "path": ""},
+		{"type": "sentence_transformers.models.Pooling", "path": "1_Pooling"},
+		{"type": "sentence_transformers.models.Dense", "path": "2_Dense"},
+		{"type": "sentence_transformers.models.Dense", "path": "3_Dense"},
+		{"type": "sentence_transformers.models.Normalize", "path": "4_Normalize"},
+	})
+	writeJSONFile(t, filepath.Join(tempDir, "1_Pooling", "config.json"), map[string]any{
+		"pooling_mode_mean_tokens": true,
+	})
+	writeJSONFile(t, filepath.Join(tempDir, "2_Dense", "config.json"), map[string]any{
+		"in_features":  uint32(8),
+		"out_features": uint32(16),
+		"bias":         false,
+	})
+	writeJSONFile(t, filepath.Join(tempDir, "3_Dense", "config.json"), map[string]any{
+		"in_features":  uint32(16),
+		"out_features": uint32(8),
+		"bias":         false,
+	})
+
+	writeSafetensorsFile(t, filepath.Join(tempDir, "model.safetensors"), []safetensorFixtureTensor{
+		{name: "embed_tokens.weight", shape: []int{4, 8}},
+		{name: "norm.weight", shape: []int{8}},
+		{name: "layers.0.input_layernorm.weight", shape: []int{8}},
+		{name: "layers.0.self_attn.q_proj.weight", shape: []int{8, 8}},
+	})
+	writeSafetensorsFile(t, filepath.Join(tempDir, "2_Dense", "model.safetensors"), []safetensorFixtureTensor{
+		{name: "linear.weight", shape: []int{16, 8}},
+	})
+	writeSafetensorsFile(t, filepath.Join(tempDir, "3_Dense", "model.safetensors"), []safetensorFixtureTensor{
+		{name: "linear.weight", shape: []int{8, 16}},
+	})
+
+	f, kv, tensors := convertFull(t, os.DirFS(tempDir))
+	defer f.Close()
+
+	if got := kv.Architecture(); got != "gemma-embedding" {
+		t.Fatalf("architecture = %q, want gemma-embedding", got)
+	}
+
+	for key, want := range map[string]uint32{
+		"dense_2_feat_in":          8,
+		"dense_2_feat_out":         16,
+		"dense_3_feat_in":          16,
+		"dense_3_feat_out":         8,
+		"pooling_type":             1,
+		"attention.sliding_window": 512,
+	} {
+		if got := kv.Uint(key); got != want {
+			t.Errorf("%s = %d, want %d", key, got, want)
+		}
+	}
+
+	if got := kv.Float("rope.freq_base_swa"); got != 10000 {
+		t.Errorf("rope.freq_base_swa = %v, want 10000", got)
+	}
+	if got := kv.Strings("tokenizer.ggml.tokens"); len(got) != 4 {
+		t.Errorf("token count = %d, want 4", len(got))
+	}
+
+	names := tensorNames(tensors)
+	for _, name := range []string{
+		"token_embd.weight",
+		"output_norm.weight",
+		"blk.0.attn_norm.weight",
+		"blk.0.attn_q.weight",
+		"dense_2.weight",
+		"dense_3.weight",
+	} {
+		if !slices.Contains(names, name) {
+			t.Errorf("missing tensor %s", name)
+		}
+	}
+
+	assertF32TensorValues(t, f, tensors, "output_norm.weight", 1)
+	assertF32TensorValues(t, f, tensors, "blk.0.attn_norm.weight", 1)
+}
+
+type safetensorFixtureTensor struct {
+	name  string
+	shape []int
+}
+
+func writeJSONFile(t *testing.T, path string, value any) {
+	t.Helper()
+
+	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
+		t.Fatal(err)
+	}
+
+	bts, err := json.Marshal(value)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	if err := os.WriteFile(path, bts, 0o644); err != nil {
+		t.Fatal(err)
+	}
+}
+
+func writeSafetensorsFile(t *testing.T, path string, tensors []safetensorFixtureTensor) {
+	t.Helper()
+
+	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
+		t.Fatal(err)
+	}
+
+	offset := 0
+	metadata := map[string]*tensorData{}
+	for _, tensor := range tensors {
+		size := 4
+		for _, dim := range tensor.shape {
+			size *= dim
+		}
+
+		metadata[tensor.name] = &tensorData{
+			Offsets: []int{offset, offset + size},
+			Type:    "F32",
+			Shape:   tensor.shape,
+		}
+		offset += size
+	}
+
+	header, err := json.Marshal(metadata)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	var buf bytes.Buffer
+	if err := binary.Write(&buf, binary.LittleEndian, int64(len(header))); err != nil {
+		t.Fatal(err)
+	}
+	if _, err := buf.Write(header); err != nil {
+		t.Fatal(err)
+	}
+	if _, err := buf.Write(make([]byte, offset)); err != nil {
+		t.Fatal(err)
+	}
+
+	if err := os.WriteFile(path, buf.Bytes(), 0o644); err != nil {
+		t.Fatal(err)
+	}
+}
+
+func tensorNames(tensors ggml.Tensors) []string {
+	names := make([]string, 0, len(tensors.Items()))
+	for _, tensor := range tensors.Items() {
+		names = append(names, tensor.Name)
+	}
+	return names
+}
+
+func assertF32TensorValues(t *testing.T, f *os.File, tensors ggml.Tensors, name string, want float32) {
+	t.Helper()
+
+	var tensor *ggml.Tensor
+	for _, item := range tensors.Items() {
+		if item.Name == name {
+			tensor = item
+			break
+		}
+	}
+	if tensor == nil {
+		t.Fatalf("missing tensor %s", name)
+	}
+	if tensor.Kind != uint32(ggml.TensorTypeF32) {
+		t.Fatalf("%s kind = %d, want F32", name, tensor.Kind)
+	}
+
+	bts := make([]byte, tensor.Size())
+	reader := io.NewSectionReader(f, int64(tensors.Offset+tensor.Offset), int64(tensor.Size()))
+	if _, err := io.ReadFull(reader, bts); err != nil {
+		t.Fatal(err)
+	}
+	for i := 0; i < len(bts); i += 4 {
+		if got := math.Float32frombits(binary.LittleEndian.Uint32(bts[i:])); got != want {
+			t.Fatalf("%s[%d] = %v, want %v", name, i/4, got, want)
+		}
+	}
+}
--- a/convert/convert_gemma3.go
+++ b/convert/convert_gemma3.go
@@ -2,7 +2,11 @@ package convert

 import (
 	"cmp"
+	"fmt"
 	"slices"
+	"strings"
+
+	"github.com/ollama/ollama/fs/ggml"
 )

 type gemma3Model struct {
@@ -178,3 +182,42 @@ func (p *gemma3Model) Replacements() []string {
 		"multi_modal_projector", "mm",
 	}
 }
+
+func (p *gemma3Model) TensorsWithTokenizer(ts []Tensor, t *Tokenizer) []*ggml.Tensor {
+	vocabSize := uint64(0)
+	if t != nil && t.Vocabulary != nil {
+		vocabSize = uint64(len(t.Vocabulary.Tokens))
+	}
+
+	var out []*ggml.Tensor
+	for _, tensor := range ts {
+		name := tensor.Name()
+		gt := &ggml.Tensor{
+			Name:     name,
+			Kind:     tensor.Kind(),
+			Shape:    tensor.Shape(),
+			WriterTo: tensor,
+		}
+
+		if !strings.HasPrefix(name, "v.") && strings.HasSuffix(name, "_norm.weight") {
+			tensor.SetRepacker(p.addOne)
+		}
+
+		if vocabSize > 0 && name == "token_embd.weight" && len(gt.Shape) >= 2 && gt.Shape[0] > vocabSize {
+			gt.Shape = slices.Clone(gt.Shape)
+			embdDim := gt.Shape[1]
+			gt.Shape[0] = vocabSize
+			tensor.SetRepacker(func(_ string, data []float32, _ []uint64) ([]float32, error) {
+				n := vocabSize * embdDim
+				if uint64(len(data)) < n {
+					return nil, fmt.Errorf("gemma3 token_embd.weight has %d values, need %d", len(data), n)
+				}
+				return data[:n], nil
+			})
+		}
+
+		out = append(out, gt)
+	}
+
+	return out
+}
--- a/convert/convert_gemma3_test.go
+++ b/convert/convert_gemma3_test.go
@@ -0,0 +1,34 @@
+package convert
+
+import (
+	"slices"
+	"testing"
+)
+
+func TestGemma3TensorsWithTokenizerTruncatesPaddedEmbedding(t *testing.T) {
+	p := gemma3Model{}
+	embedding := &fakeTensor{
+		name:  "token_embd.weight",
+		shape: []uint64{5, 2},
+		data:  []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
+	}
+
+	out := p.TensorsWithTokenizer([]Tensor{embedding}, &Tokenizer{
+		Vocabulary: &Vocabulary{Tokens: []string{"a", "b", "<image>"}},
+	})
+
+	if len(out) != 1 {
+		t.Fatalf("expected 1 tensor, got %d", len(out))
+	}
+	if got, want := out[0].Shape, []uint64{3, 2}; !slices.Equal(got, want) {
+		t.Fatalf("token_embd.weight shape = %v, want %v", got, want)
+	}
+
+	got, err := embedding.repacker(embedding.name, embedding.data, embedding.shape)
+	if err != nil {
+		t.Fatalf("unexpected repacker error: %v", err)
+	}
+	if want := embedding.data[:6]; !slices.Equal(got, want) {
+		t.Fatalf("truncated embedding = %v, want %v", got, want)
+	}
+}
--- a/convert/convert_gemma3n.go
+++ b/convert/convert_gemma3n.go
@@ -1,6 +1,8 @@
 package convert

 import (
+	"encoding/json"
+	"fmt"
 	"slices"
 	"strings"

@@ -14,30 +16,58 @@ type gemma3nModel struct {
 	ModelParameters

 	TextModel struct {
-		ActivationSparsityPattern []float32 `json:"activation_sparsity_pattern"`
-		AltupActiveIdx            uint32    `json:"altup_active_idx"`
-		AltupCoefClip             float32   `json:"altup_coef_clip"`
-		AltupCorrectScale         bool      `json:"altup_correct_scale"`
-		AltupLRMultiplier         float32   `json:"altup_lr_multiplier"`
-		AltupNumInputs            uint32    `json:"altup_num_inputs"`
-		HeadDim                   uint32    `json:"head_dim"`
-		HiddenSize                uint32    `json:"hidden_size"`
-		HiddenSizePerLayerInput   uint32    `json:"hidden_size_per_layer_input"`
-		IntermediateSize          uint32    `json:"intermediate_size"`
-		MaxPositionEmbeddings     uint32    `json:"max_position_embeddings"`
-		NumAttentionHeads         uint32    `json:"num_attention_heads"`
-		NumHiddenLayers           uint32    `json:"num_hidden_layers"`
-		NumKeyValueHeads          uint32    `json:"num_key_value_heads"`
-		NumKVSharedLayers         uint32    `json:"num_kv_shared_layers"`
-		RMSNormEPS                float32   `json:"rms_norm_eps"`
-		RopeLocalBaseFreq         float32   `json:"rope_local_base_freq"`
-		RopeTheta                 float32   `json:"rope_theta"`
-		SlidingWindow             uint32    `json:"sliding_window"`
-		LayerTypes                []string  `json:"layer_types"`
+		ActivationSparsityPattern []float32               `json:"activation_sparsity_pattern"`
+		AltupActiveIdx            uint32                  `json:"altup_active_idx"`
+		AltupCoefClip             float32                 `json:"altup_coef_clip"`
+		AltupCorrectScale         bool                    `json:"altup_correct_scale"`
+		AltupLRMultiplier         float32                 `json:"altup_lr_multiplier"`
+		AltupNumInputs            uint32                  `json:"altup_num_inputs"`
+		HeadDim                   uint32                  `json:"head_dim"`
+		HiddenSize                uint32                  `json:"hidden_size"`
+		HiddenSizePerLayerInput   uint32                  `json:"hidden_size_per_layer_input"`
+		IntermediateSize          gemma3nIntermediateSize `json:"intermediate_size"`
+		MaxPositionEmbeddings     uint32                  `json:"max_position_embeddings"`
+		NumAttentionHeads         uint32                  `json:"num_attention_heads"`
+		NumHiddenLayers           uint32                  `json:"num_hidden_layers"`
+		NumKeyValueHeads          uint32                  `json:"num_key_value_heads"`
+		NumKVSharedLayers         uint32                  `json:"num_kv_shared_layers"`
+		RMSNormEPS                float32                 `json:"rms_norm_eps"`
+		RopeLocalBaseFreq         float32                 `json:"rope_local_base_freq"`
+		RopeTheta                 float32                 `json:"rope_theta"`
+		SlidingWindow             uint32                  `json:"sliding_window"`
+		LayerTypes                []string                `json:"layer_types"`
 	} `json:"text_config"`
 	VisionModel struct{} `json:"vision_config"`
 }

+type gemma3nIntermediateSize uint32
+
+func (s *gemma3nIntermediateSize) UnmarshalJSON(data []byte) error {
+	var scalar uint32
+	if err := json.Unmarshal(data, &scalar); err == nil {
+		*s = gemma3nIntermediateSize(scalar)
+		return nil
+	}
+
+	var values []uint32
+	if err := json.Unmarshal(data, &values); err != nil {
+		return err
+	}
+	if len(values) == 0 {
+		return fmt.Errorf("intermediate_size must not be empty")
+	}
+
+	first := values[0]
+	for _, v := range values[1:] {
+		if v != first {
+			return fmt.Errorf("intermediate_size values must match")
+		}
+	}
+
+	*s = gemma3nIntermediateSize(first)
+	return nil
+}
+
 func (m *gemma3nModel) KV(t *Tokenizer) KV {
 	kv := m.ModelParameters.KV(t)
 	kv["general.architecture"] = "gemma3n"
@@ -69,7 +99,7 @@ func (m *gemma3nModel) KV(t *Tokenizer) KV {
 	kv["gemma3n.context_length"] = m.TextModel.MaxPositionEmbeddings
 	kv["gemma3n.embedding_length_per_layer_input"] = m.TextModel.HiddenSizePerLayerInput
 	kv["gemma3n.embedding_length"] = m.TextModel.HiddenSize
-	kv["gemma3n.feed_forward_length"] = m.TextModel.IntermediateSize
+	kv["gemma3n.feed_forward_length"] = uint32(m.TextModel.IntermediateSize)
 	kv["gemma3n.head_dim"] = m.TextModel.HeadDim
 	kv["gemma3n.rope.freq_base_local"] = m.TextModel.RopeLocalBaseFreq
 	kv["gemma3n.rope.freq_base"] = m.TextModel.RopeTheta
--- a/convert/convert_gemma3n_test.go
+++ b/convert/convert_gemma3n_test.go
@@ -0,0 +1,55 @@
+package convert
+
+import (
+	"encoding/json"
+	"testing"
+)
+
+func TestGemma3nIntermediateSize(t *testing.T) {
+	tests := []struct {
+		name    string
+		json    string
+		want    gemma3nIntermediateSize
+		wantErr bool
+	}{
+		{
+			name: "scalar",
+			json: `8192`,
+			want: 8192,
+		},
+		{
+			name: "uniform array",
+			json: `[8192,8192,8192]`,
+			want: 8192,
+		},
+		{
+			name:    "mixed array",
+			json:    `[8192,4096]`,
+			wantErr: true,
+		},
+		{
+			name:    "empty array",
+			json:    `[]`,
+			wantErr: true,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			var got gemma3nIntermediateSize
+			err := json.Unmarshal([]byte(tt.json), &got)
+			if tt.wantErr {
+				if err == nil {
+					t.Fatal("expected error")
+				}
+				return
+			}
+			if err != nil {
+				t.Fatal(err)
+			}
+			if got != tt.want {
+				t.Fatalf("got %d, want %d", got, tt.want)
+			}
+		})
+	}
+}
--- a/convert/convert_glm4moelite.go
+++ b/convert/convert_glm4moelite.go
@@ -39,48 +39,72 @@ type glm4MoeLiteModel struct {
 	ExpertWeightsScale     float32 `json:"routed_scaling_factor"`

 	LeadingDenseBlockCount uint32 `json:"first_k_dense_replace"`
+
+	ExpertGroupCount     uint32 `json:"n_group"`
+	ExpertGroupUsedCount uint32 `json:"topk_group"`
 }

 func (p *glm4MoeLiteModel) KV(t *Tokenizer) KV {
 	kv := p.ModelParameters.KV(t)
-	kv["general.architecture"] = "glm4moelite"
+	kv["general.architecture"] = "deepseek2"
 	kv["general.type"] = "model"
-	kv["glm4moelite.block_count"] = p.HiddenLayers
+	kv["deepseek2.block_count"] = p.HiddenLayers

 	numHeads := p.NumAttentionHeads
-	numKVHeads := p.NumKeyValueHeads

-	kv["glm4moelite.attention.head_count"] = numHeads
-	kv["glm4moelite.attention.head_count_kv"] = numKVHeads
-	kv["glm4moelite.attention.key_length"] = p.QKNopeHeadDim + p.QKRopeHeadDim
-	kv["glm4moelite.attention.kv_lora_rank"] = p.KVLoraRank
-	kv["glm4moelite.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
-	kv["glm4moelite.attention.q_lora_rank"] = p.QLoraRank
-	kv["glm4moelite.attention.value_length"] = p.VHeadDim
-	kv["glm4moelite.context_length"] = p.MaxPositionEmbeddings
-	kv["glm4moelite.embedding_length"] = p.HiddenSize
-	kv["glm4moelite.expert_count"] = p.ExpertCount
-	kv["glm4moelite.expert_feed_forward_length"] = p.ExpertIntermediateSize
-	kv["glm4moelite.expert_shared_count"] = p.ExpertSharedCount
+	kv["deepseek2.attention.head_count"] = numHeads
+	kv["deepseek2.attention.head_count_kv"] = uint32(1)
+	kv["deepseek2.attention.key_length"] = p.KVLoraRank + p.QKRopeHeadDim
+	kv["deepseek2.attention.kv_lora_rank"] = p.KVLoraRank
+	kv["deepseek2.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
+	kv["deepseek2.attention.q_lora_rank"] = p.QLoraRank
+	kv["deepseek2.attention.value_length"] = p.KVLoraRank
+	kv["deepseek2.context_length"] = p.MaxPositionEmbeddings
+	kv["deepseek2.embedding_length"] = p.HiddenSize
+	kv["deepseek2.expert_count"] = p.ExpertCount
+	kv["deepseek2.expert_feed_forward_length"] = p.ExpertIntermediateSize
+	kv["deepseek2.expert_shared_count"] = p.ExpertSharedCount

-	kv["glm4moelite.expert_gating_func"] = uint32(2)
-	kv["glm4moelite.expert_used_count"] = p.ExpertUsedCount
-	kv["glm4moelite.expert_weights_norm"] = p.ExpertWeightsNorm
-	kv["glm4moelite.expert_weights_scale"] = p.ExpertWeightsScale
-	kv["glm4moelite.feed_forward_length"] = p.IntermediateSize
-	kv["glm4moelite.leading_dense_block_count"] = p.LeadingDenseBlockCount
+	kv["deepseek2.expert_gating_func"] = uint32(2)
+	kv["deepseek2.expert_group_count"] = cmp.Or(p.ExpertGroupCount, uint32(1))
+	kv["deepseek2.expert_group_used_count"] = cmp.Or(p.ExpertGroupUsedCount, uint32(1))
+	kv["deepseek2.expert_used_count"] = p.ExpertUsedCount
+	kv["deepseek2.expert_weights_norm"] = p.ExpertWeightsNorm
+	kv["deepseek2.expert_weights_scale"] = p.ExpertWeightsScale
+	kv["deepseek2.feed_forward_length"] = p.IntermediateSize
+	kv["deepseek2.leading_dense_block_count"] = p.LeadingDenseBlockCount

-	kv["glm4moelite.rope.dimension_count"] = p.QKRopeHeadDim
-	kv["glm4moelite.rope.freq_base"] = cmp.Or(p.RopeTheta, float32(1000000.0))
+	kv["deepseek2.rope.dimension_count"] = p.QKRopeHeadDim
+	kv["deepseek2.rope.freq_base"] = cmp.Or(p.RopeTheta, float32(1000000.0))

-	kv["glm4moelite.attention.key_length_mla"] = p.KVLoraRank + p.QKRopeHeadDim
-	kv["glm4moelite.attention.value_length_mla"] = p.KVLoraRank
+	kv["deepseek2.attention.key_length_mla"] = p.QKNopeHeadDim + p.QKRopeHeadDim
+	kv["deepseek2.attention.value_length_mla"] = p.VHeadDim

 	kv["tokenizer.ggml.pre"] = "glm4"
+	setGLM4MoeLiteExtraEOGFromEOSIDs(kv)

 	return kv
 }

+func setGLM4MoeLiteExtraEOGFromEOSIDs(kv KV) {
+	switch ids := kv["tokenizer.ggml.eos_token_ids"].(type) {
+	case []int32:
+		if len(ids) >= 2 && ids[1] >= 0 {
+			kv["tokenizer.ggml.eot_token_id"] = uint32(ids[1])
+		}
+		if len(ids) >= 3 && ids[2] >= 0 {
+			kv["tokenizer.ggml.eom_token_id"] = uint32(ids[2])
+		}
+	case []uint32:
+		if len(ids) >= 2 {
+			kv["tokenizer.ggml.eot_token_id"] = ids[1]
+		}
+		if len(ids) >= 3 {
+			kv["tokenizer.ggml.eom_token_id"] = ids[2]
+		}
+	}
+}
+
 func (p *glm4MoeLiteModel) Replacements() []string {
 	return []string{
 		"lm_head", "output",
--- a/convert/convert_glm4moelite_test.go
+++ b/convert/convert_glm4moelite_test.go
@@ -0,0 +1,68 @@
+package convert
+
+import "testing"
+
+func TestGLM4MoeLiteKVUsesLlamaCppMetadata(t *testing.T) {
+	p := glm4MoeLiteModel{
+		ModelParameters:       ModelParameters{VocabSize: 151552},
+		MaxPositionEmbeddings: 202752,
+		HiddenSize:            2048,
+		HiddenLayers:          47,
+		IntermediateSize:      10240,
+		NumAttentionHeads:     20,
+		NumKeyValueHeads:      20,
+		RMSNormEPS:            1e-5,
+		RopeTheta:             1000000,
+		QKNopeHeadDim:         128,
+		QKRopeHeadDim:         64,
+		KVLoraRank:            512,
+		QLoraRank:             768,
+		VHeadDim:              128,
+		ExpertCount:           64,
+		ExpertSharedCount:     1,
+		ExpertUsedCount:       4,
+		ExpertWeightsNorm:     true,
+		ExpertWeightsScale:    1.8,
+	}
+
+	kv := p.KV(&Tokenizer{Vocabulary: &Vocabulary{Model: "gpt2", Tokens: []string{"a"}}})
+
+	if got := kv.Architecture(); got != "deepseek2" {
+		t.Fatalf("architecture = %q, want deepseek2", got)
+	}
+	for key, want := range map[string]uint32{
+		"attention.head_count":       20,
+		"attention.head_count_kv":    1,
+		"attention.key_length":       576,
+		"attention.value_length":     512,
+		"attention.key_length_mla":   192,
+		"attention.value_length_mla": 128,
+		"expert_group_count":         1,
+		"expert_group_used_count":    1,
+		"expert_gating_func":         2,
+		"rope.dimension_count":       64,
+	} {
+		if got := kv.Uint(key); got != want {
+			t.Errorf("%s = %d, want %d", key, got, want)
+		}
+	}
+	if got := kv.String("tokenizer.ggml.pre"); got != "glm4" {
+		t.Errorf("tokenizer.ggml.pre = %q, want glm4", got)
+	}
+}
+
+func TestGLM4MoeLiteKVPromotesExtraEOSIDs(t *testing.T) {
+	kv := KV{
+		"general.architecture":         "deepseek2",
+		"tokenizer.ggml.eos_token_ids": []int32{151329, 151330, 151336},
+	}
+
+	setGLM4MoeLiteExtraEOGFromEOSIDs(kv)
+
+	if got := kv.Uint("tokenizer.ggml.eot_token_id"); got != 151330 {
+		t.Errorf("eot token = %d, want 151330", got)
+	}
+	if got := kv.Uint("tokenizer.ggml.eom_token_id"); got != 151336 {
+		t.Errorf("eom token = %d, want 151336", got)
+	}
+}
--- a/convert/convert_glmocr.go
+++ b/convert/convert_glmocr.go
@@ -83,6 +83,7 @@ type glmOcrModel struct {
 		HiddenSize          uint32  `json:"hidden_size"`
 		IntermediateSize    uint32  `json:"intermediate_size"`
 		NumHiddenLayers     uint32  `json:"num_hidden_layers"`
+		NumNextNPredict     uint32  `json:"num_nextn_predict_layers"`
 		NumAttentionHeads   uint32  `json:"num_attention_heads"`
 		NumKeyValueHeads    uint32  `json:"num_key_value_heads"`
 		HeadDim             uint32  `json:"head_dim"`
@@ -131,7 +132,7 @@ type glmOcrModel struct {
 	} `json:"-"`
 }

-var _ ModelConverter = (*glmOcrModel)(nil)
+var _ MultimodalConverter = (*glmOcrModel)(nil)

 func (m *glmOcrModel) parseMore(fsys fs.FS) error {
 	bts, err := fs.ReadFile(fsys, "preprocessor_config.json")
@@ -145,9 +146,14 @@ func (m *glmOcrModel) parseMore(fsys fs.FS) error {
 func (m *glmOcrModel) KV(t *Tokenizer) KV {
 	kv := m.ModelParameters.KV(t)
 	kv["general.architecture"] = "glmocr"
+	applyGlmOcrTokenizerKV(kv, t)

 	// Text model parameters
-	kv["glmocr.block_count"] = cmp.Or(m.TextConfig.NumHiddenLayers, 16)
+	numHiddenLayers := cmp.Or(m.TextConfig.NumHiddenLayers, 16)
+	kv["glmocr.block_count"] = numHiddenLayers + m.TextConfig.NumNextNPredict
+	if m.TextConfig.NumNextNPredict > 0 {
+		kv["glmocr.nextn_predict_layers"] = m.TextConfig.NumNextNPredict
+	}
 	kv["glmocr.embedding_length"] = cmp.Or(m.TextConfig.HiddenSize, 1536)
 	kv["glmocr.attention.head_count"] = cmp.Or(m.TextConfig.NumAttentionHeads, 16)
 	kv["glmocr.attention.head_count_kv"] = cmp.Or(m.TextConfig.NumKeyValueHeads, 8)
@@ -175,8 +181,6 @@ func (m *glmOcrModel) KV(t *Tokenizer) KV {
 	kv["glmocr.vision.intermediate_size"] = cmp.Or(m.VisionConfig.IntermediateSize, 4096)
 	kv["glmocr.vision.attention.layer_norm_rms_epsilon"] = cmp.Or(m.VisionConfig.RMSNormEps, 1e-5)

-	// Preprocessor-derived image settings (min/max pixels and normalization)
-	// Note: fs.Config.keyValue() auto-prepends architecture prefix, so use full key
 	if m.Preprocessor.Size.ShortestEdge > 0 {
 		kv["glmocr.vision.min_pixels"] = m.Preprocessor.Size.ShortestEdge
 	}
@@ -190,7 +194,6 @@ func (m *glmOcrModel) KV(t *Tokenizer) KV {
 		kv["glmocr.vision.image_std"] = m.Preprocessor.ImageStd
 	}

-	// Special tokens
 	kv["glmocr.image_token_id"] = m.ImageTokenID
 	kv["glmocr.image_start_token_id"] = m.ImageStartTokenID
 	kv["glmocr.image_end_token_id"] = m.ImageEndTokenID
@@ -201,32 +204,249 @@ func (m *glmOcrModel) KV(t *Tokenizer) KV {
 	return kv
 }

+func applyGlmOcrTokenizerKV(kv KV, t *Tokenizer) {
+	kv["tokenizer.ggml.pre"] = "chatglm-bpe"
+	if id, ok := glmOcrTokenID(t, "<|endoftext|>"); ok {
+		kv["tokenizer.ggml.bos_token_id"] = uint32(id)
+		kv["tokenizer.ggml.unknown_token_id"] = uint32(id)
+	}
+	if id, ok := glmOcrTokenID(t, "<|user|>"); ok {
+		kv["tokenizer.ggml.eot_token_id"] = uint32(id)
+	}
+}
+
+func (m *glmOcrModel) TextKV(t *Tokenizer) KV {
+	kv := m.ModelParameters.KV(t)
+	kv["general.architecture"] = "glm4"
+	applyGlmOcrTokenizerKV(kv, t)
+
+	numHiddenLayers := cmp.Or(m.TextConfig.NumHiddenLayers, 16)
+	kv["block_count"] = numHiddenLayers + m.TextConfig.NumNextNPredict
+	if m.TextConfig.NumNextNPredict > 0 {
+		kv["nextn_predict_layers"] = m.TextConfig.NumNextNPredict
+	}
+	kv["embedding_length"] = cmp.Or(m.TextConfig.HiddenSize, 1536)
+	kv["attention.head_count"] = cmp.Or(m.TextConfig.NumAttentionHeads, 16)
+	kv["attention.head_count_kv"] = cmp.Or(m.TextConfig.NumKeyValueHeads, 8)
+	headDim := cmp.Or(m.TextConfig.HeadDim, cmp.Or(m.TextConfig.HiddenSize, 1536)/cmp.Or(m.TextConfig.NumAttentionHeads, 16))
+	kv["attention.key_length"] = headDim
+	kv["attention.value_length"] = headDim
+	kv["feed_forward_length"] = cmp.Or(m.TextConfig.IntermediateSize, 4608)
+	kv["attention.layer_norm_rms_epsilon"] = cmp.Or(m.TextConfig.RMSNormEps, 1e-5)
+	kv["context_length"] = cmp.Or(m.TextConfig.MaxPositionEmbed, 131072)
+	kv["rope.freq_base"] = cmp.Or(m.TextConfig.RopeParameters.RopeTheta, float32(10000))
+	partialRotaryFactor := cmp.Or(m.TextConfig.RopeParameters.PartialRotaryFactor, m.TextConfig.PartialRotaryFactor, float32(1.0))
+	kv["rope.dimension_count"] = uint32(float32(headDim) * partialRotaryFactor)
+	if len(m.TextConfig.RopeParameters.MRopeSection) > 0 {
+		sections := append([]int32(nil), m.TextConfig.RopeParameters.MRopeSection...)
+		for len(sections) < 4 {
+			sections = append(sections, 0)
+		}
+		kv["rope.dimension_sections"] = sections
+	}
+
+	return kv
+}
+
+func (m *glmOcrModel) ProjectorKV(*Tokenizer) KV {
+	kv := KV{
+		"general.architecture":                     "clip",
+		"general.type":                             "mmproj",
+		"general.file_type":                        uint32(1),
+		"general.quantization_version":             uint32(2),
+		"clip.has_vision_encoder":                  true,
+		"clip.projector_type":                      "glm4v",
+		"clip.use_silu":                            true,
+		"clip.vision.block_count":                  cmp.Or(m.VisionConfig.Depth, 24),
+		"clip.vision.embedding_length":             cmp.Or(m.VisionConfig.HiddenSize, 1024),
+		"clip.vision.attention.head_count":         cmp.Or(m.VisionConfig.NumHeads, 16),
+		"clip.vision.image_size":                   cmp.Or(m.VisionConfig.ImageSize, 336),
+		"clip.vision.patch_size":                   cmp.Or(m.VisionConfig.PatchSize, m.Preprocessor.PatchSize, 14),
+		"clip.vision.spatial_merge_size":           cmp.Or(m.VisionConfig.SpatialMergeSize, m.Preprocessor.MergeSize, 2),
+		"clip.vision.temporal_patch_size":          cmp.Or(m.VisionConfig.TemporalPatchSize, m.Preprocessor.TemporalPatchSize, 2),
+		"clip.vision.projection_dim":               cmp.Or(m.VisionConfig.OutHiddenSize, 1536),
+		"clip.vision.out_hidden_size":              cmp.Or(m.VisionConfig.OutHiddenSize, 1536),
+		"clip.vision.feed_forward_length":          cmp.Or(m.VisionConfig.IntermediateSize, 4096),
+		"clip.vision.intermediate_size":            cmp.Or(m.VisionConfig.IntermediateSize, 4096),
+		"clip.vision.attention.layer_norm_epsilon": cmp.Or(m.VisionConfig.RMSNormEps, 1e-5),
+		"clip.vision.image_token_id":               m.ImageTokenID,
+		"clip.vision.image_start_token_id":         m.ImageStartTokenID,
+		"clip.vision.image_end_token_id":           m.ImageEndTokenID,
+	}
+	if m.Preprocessor.Size.ShortestEdge > 0 {
+		kv["clip.vision.min_pixels"] = m.Preprocessor.Size.ShortestEdge
+	}
+	if m.Preprocessor.Size.LongestEdge > 0 {
+		kv["clip.vision.max_pixels"] = m.Preprocessor.Size.LongestEdge
+	}
+	if len(m.Preprocessor.ImageMean) == 3 {
+		kv["clip.vision.image_mean"] = m.Preprocessor.ImageMean
+	}
+	if len(m.Preprocessor.ImageStd) == 3 {
+		kv["clip.vision.image_std"] = m.Preprocessor.ImageStd
+	}
+
+	return kv
+}
+
+func glmOcrTokenID(t *Tokenizer, token string) (int, bool) {
+	if t == nil || t.Vocabulary == nil {
+		return 0, false
+	}
+	for i, candidate := range t.Vocabulary.Tokens {
+		if candidate == token {
+			return i, true
+		}
+	}
+	return 0, false
+}
+
+func isGlmOcrVisionTensor(name string) bool {
+	return strings.HasPrefix(name, "v.") || strings.HasPrefix(name, "mm.")
+}
+
+func (m *glmOcrModel) TextTensors(ts []Tensor, t *Tokenizer) []*ggml.Tensor {
+	textOnly := make([]Tensor, 0, len(ts))
+	for _, tensor := range ts {
+		if !isGlmOcrVisionTensor(tensor.Name()) {
+			textOnly = append(textOnly, tensor)
+		}
+	}
+	return m.Tensors(textOnly)
+}
+
+func (m *glmOcrModel) ProjectorTensors(ts []Tensor) []*ggml.Tensor {
+	var out []*ggml.Tensor
+	for _, t := range ts {
+		if !isGlmOcrVisionTensor(t.Name()) {
+			continue
+		}
+
+		name := t.Name()
+		switch {
+		case strings.HasSuffix(name, "patch_embd_0.weight"):
+			name = strings.Replace(name, "patch_embd_0.weight", "patch_embd.weight", 1)
+		case strings.HasSuffix(name, "patch_embd_1.weight"):
+			name = strings.Replace(name, "patch_embd_1.weight", "patch_embd.weight.1", 1)
+		case strings.HasSuffix(name, "patch_embd.weight.0"):
+			name = strings.Replace(name, "patch_embd.weight.0", "patch_embd.weight", 1)
+		}
+		if strings.HasSuffix(name, "patch_embd.weight") {
+			shape := t.Shape()
+			if len(shape) == 5 && shape[2] == 2 {
+				newShape := []uint64{shape[0], shape[1], shape[3], shape[4]}
+
+				t0 := t.Clone()
+				t0.SetRepacker(func(_ string, data []float32, shape []uint64) ([]float32, error) {
+					dims := make([]int, len(shape))
+					for i := range shape {
+						dims[i] = int(shape[i])
+					}
+					var tt tensor.Tensor = tensor.New(tensor.WithShape(dims...), tensor.WithBacking(data))
+					tt, err := tt.Slice(nil, nil, tensor.S(0, 1), nil, nil)
+					if err != nil {
+						return nil, err
+					}
+					tt = tensor.Materialize(tt)
+					newDims := []int{int(shape[0]), int(shape[1]), int(shape[3]), int(shape[4])}
+					if err := tt.Reshape(newDims...); err != nil {
+						return nil, err
+					}
+					if err := tt.Reshape(tt.Shape().TotalSize()); err != nil {
+						return nil, err
+					}
+					return native.VectorF32(tt.(*tensor.Dense))
+				})
+				out = append(out, &ggml.Tensor{
+					Name:     name,
+					Kind:     t.Kind(),
+					Shape:    newShape,
+					WriterTo: t0,
+				})
+
+				t1 := t.Clone()
+				t1.SetRepacker(func(_ string, data []float32, shape []uint64) ([]float32, error) {
+					dims := make([]int, len(shape))
+					for i := range shape {
+						dims[i] = int(shape[i])
+					}
+					var tt tensor.Tensor = tensor.New(tensor.WithShape(dims...), tensor.WithBacking(data))
+					tt, err := tt.Slice(nil, nil, tensor.S(1, 2), nil, nil)
+					if err != nil {
+						return nil, err
+					}
+					tt = tensor.Materialize(tt)
+					newDims := []int{int(shape[0]), int(shape[1]), int(shape[3]), int(shape[4])}
+					if err := tt.Reshape(newDims...); err != nil {
+						return nil, err
+					}
+					if err := tt.Reshape(tt.Shape().TotalSize()); err != nil {
+						return nil, err
+					}
+					return native.VectorF32(tt.(*tensor.Dense))
+				})
+				out = append(out, &ggml.Tensor{
+					Name:     strings.Replace(name, "patch_embd.weight", "patch_embd.weight.1", 1),
+					Kind:     t.Kind(),
+					Shape:    newShape,
+					WriterTo: t1,
+				})
+
+				continue
+			}
+		}
+
+		out = append(out, &ggml.Tensor{
+			Name:     name,
+			Kind:     t.Kind(),
+			Shape:    t.Shape(),
+			WriterTo: t,
+		})
+	}
+	return out
+}
+
 func (m *glmOcrModel) Tensors(ts []Tensor) []*ggml.Tensor {
 	var out []*ggml.Tensor

-	// Skip layers >= num_hidden_layers (Multi-Token Prediction layers not needed for basic inference)
 	numLayers := int(cmp.Or(m.TextConfig.NumHiddenLayers, 16))
-	skipLayer := func(name string) bool {
-		// Tensor names are already replaced to "blk.N.xxx" format
-		re := regexp.MustCompile(`^blk\.(\d+)`)
-		matches := re.FindStringSubmatch(name)
+	maxLayers := numLayers + int(m.TextConfig.NumNextNPredict)
+	layerRe := regexp.MustCompile(`^blk\.(\d+)`)
+	layerIndex := func(name string) (int, bool) {
+		matches := layerRe.FindStringSubmatch(name)
 		if matches == nil {
-			return false
+			return 0, false
 		}
 		blkNum, err := strconv.Atoi(matches[1])
 		if err != nil {
-			return false
+			return 0, false
 		}
-		return blkNum >= numLayers
+		return blkNum, true
 	}

 	for _, t := range ts {
 		name := t.Name()

-		// Skip next-n prediction layers (layers >= num_hidden_layers)
-		if skipLayer(name) {
+		blkNum, hasLayer := layerIndex(name)
+		if hasLayer && blkNum >= maxLayers {
 			continue
 		}
+		if hasLayer && blkNum >= numLayers {
+			switch {
+			case strings.HasSuffix(name, ".embed_tokens.weight"):
+				name = strings.Replace(name, ".embed_tokens.weight", ".nextn.embed_tokens.weight", 1)
+			case strings.HasSuffix(name, ".eh_proj.weight"):
+				name = strings.Replace(name, ".eh_proj.weight", ".nextn.eh_proj.weight", 1)
+			case strings.HasSuffix(name, ".enorm.weight"):
+				name = strings.Replace(name, ".enorm.weight", ".nextn.enorm.weight", 1)
+			case strings.HasSuffix(name, ".hnorm.weight"):
+				name = strings.Replace(name, ".hnorm.weight", ".nextn.hnorm.weight", 1)
+			case strings.HasSuffix(name, ".shared_head.head.weight"):
+				name = strings.Replace(name, ".shared_head.head.weight", ".nextn.shared_head_head.weight", 1)
+			case strings.HasSuffix(name, ".shared_head.norm.weight"):
+				name = strings.Replace(name, ".shared_head.norm.weight", ".nextn.shared_head_norm.weight", 1)
+			}
+		}

 		// Split ffn_gate_up into separate gate and up projections
 		if strings.Contains(name, "ffn_gate_up") {
@@ -440,16 +660,16 @@ func (m *glmOcrModel) Replacements() []string {
 		"self_attn.q_proj", "attn_q",
 		"self_attn.k_proj", "attn_k",
 		"self_attn.v_proj", "attn_v",
-		"self_attn.o_proj", "attn_out",
+		"self_attn.o_proj", "attn_output",

 		// Language model norms
 		"input_layernorm", "attn_norm",
 		"post_attention_layernorm", "ffn_norm",
-		"post_self_attn_layernorm", "post_attn_norm",
-		"post_mlp_layernorm", "post_ffn_norm",
+		"post_self_attn_layernorm", "post_attention_norm",
+		"post_mlp_layernorm", "post_ffw_norm",

-		// Language model MLP (remove mlp. prefix so ffn_* names work)
-		"mlp.gate_up_proj", "ffn_gate_up",
+		// Language model MLP
+		"mlp.gate_up_proj", "ffn_up",
 		"mlp.down_proj", "ffn_down",
 	}
 }
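Reviewer note on the ProjectorTensors hunk above: each repacker keeps one temporal slice of a 5-D patch_embd.weight tensor shaped [out, in, 2, h, w] and flattens it to [out, in, h, w]. A minimal sketch of the same slicing on a plain row-major []float32, without the tensor library, might look like this. The helper name sliceTemporal and the row-major layout are assumptions made for illustration; they are not part of this change.

```go
package main

import "fmt"

// sliceTemporal is a hypothetical helper (not in the diff). It takes a
// row-major buffer of shape [out, in, T, h, w] and returns the elements at
// temporal index t, laid out as [out, in, h, w].
func sliceTemporal(data []float32, shape [5]int, t int) []float32 {
	o, c, T, h, w := shape[0], shape[1], shape[2], shape[3], shape[4]
	if t < 0 || t >= T || len(data) != o*c*T*h*w {
		return nil
	}

	plane := h * w // elements in one [h, w] plane
	out := make([]float32, 0, o*c*plane)
	for i := 0; i < o*c; i++ {
		// Each (out, in) pair owns T consecutive planes; pick plane t.
		start := (i*T + t) * plane
		out = append(out, data[start:start+plane]...)
	}
	return out
}

func main() {
	data := []float32{0, 1, 2, 3, 4, 5, 6, 7}
	shape := [5]int{1, 2, 2, 1, 2}
	fmt.Println(sliceTemporal(data, shape, 0))
	fmt.Println(sliceTemporal(data, shape, 1))
}
```

The two calls correspond to the t0 and t1 repackers, which differ only in the slice bounds passed to tensor.S.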
--- a/convert/convert_gptoss.go
+++ b/convert/convert_gptoss.go
@@ -30,7 +30,11 @@ type gptossModel struct {
 	RopeTheta             float32 `json:"rope_theta"`
 	RopeScalingFactor     float32 `json:"rope_scaling_factor"`
 	RopeScaling           struct {
-		Factor float32 `json:"factor"`
+		Type                          string  `json:"rope_type"`
+		Factor                        float32 `json:"factor"`
+		OriginalMaxPositionEmbeddings uint32  `json:"original_max_position_embeddings"`
+		BetaFast                      float32 `json:"beta_fast"`
+		BetaSlow                      float32 `json:"beta_slow"`
 	} `json:"rope_scaling"`
 	SlidingWindow uint32 `json:"sliding_window"`
 }
@@ -39,23 +43,32 @@ var _ ModelConverter = (*gptossModel)(nil)

 func (m *gptossModel) KV(t *Tokenizer) KV {
 	kv := m.ModelParameters.KV(t)
-	kv["general.architecture"] = "gptoss"
+	kv["general.architecture"] = "gpt-oss"
 	kv["general.file_type"] = uint32(4)
-	kv["gptoss.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(m.RopeScalingFactor*float32(m.InitialContextLength)))
-	kv["gptoss.block_count"] = m.HiddenLayers
-	kv["gptoss.embedding_length"] = m.HiddenSize
-	kv["gptoss.feed_forward_length"] = m.IntermediateSize
-	kv["gptoss.expert_count"] = cmp.Or(m.Experts, m.LocalExperts)
-	kv["gptoss.expert_used_count"] = m.ExpertsPerToken
-	kv["gptoss.attention.head_count"] = m.AttentionHeads
-	kv["gptoss.attention.head_count_kv"] = m.KeyValueHeads
-	kv["gptoss.attention.key_length"] = m.HeadDim
-	kv["gptoss.attention.value_length"] = m.HeadDim
-	kv["gptoss.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEpsilon, 1e-5)
-	kv["gptoss.attention.sliding_window"] = m.SlidingWindow
-	kv["gptoss.rope.freq_base"] = m.RopeTheta
-	kv["gptoss.rope.scaling.factor"] = cmp.Or(m.RopeScalingFactor, m.RopeScaling.Factor)
-	kv["gptoss.rope.scaling.original_context_length"] = m.InitialContextLength
+	kv["gpt-oss.context_length"] = cmp.Or(m.MaxPositionEmbeddings, uint32(m.RopeScalingFactor*float32(m.InitialContextLength)))
+	kv["gpt-oss.block_count"] = m.HiddenLayers
+	kv["gpt-oss.embedding_length"] = m.HiddenSize
+	kv["gpt-oss.feed_forward_length"] = m.IntermediateSize
+	kv["gpt-oss.expert_feed_forward_length"] = m.IntermediateSize
+	kv["gpt-oss.expert_count"] = cmp.Or(m.Experts, m.LocalExperts)
+	kv["gpt-oss.expert_used_count"] = m.ExpertsPerToken
+	kv["gpt-oss.attention.head_count"] = m.AttentionHeads
+	kv["gpt-oss.attention.head_count_kv"] = m.KeyValueHeads
+	kv["gpt-oss.attention.key_length"] = m.HeadDim
+	kv["gpt-oss.attention.value_length"] = m.HeadDim
+	kv["gpt-oss.attention.layer_norm_rms_epsilon"] = cmp.Or(m.RMSNormEpsilon, 1e-5)
+	kv["gpt-oss.attention.sliding_window"] = m.SlidingWindow
+	kv["gpt-oss.rope.freq_base"] = m.RopeTheta
+	kv["gpt-oss.rope.scaling.type"] = cmp.Or(m.RopeScaling.Type, "yarn")
+	kv["gpt-oss.rope.scaling.factor"] = cmp.Or(m.RopeScalingFactor, m.RopeScaling.Factor)
+	kv["gpt-oss.rope.scaling.original_context_length"] = cmp.Or(m.RopeScaling.OriginalMaxPositionEmbeddings, m.InitialContextLength)
+	if m.RopeScaling.BetaFast != 0 {
+		kv["gpt-oss.rope.scaling.yarn_beta_fast"] = m.RopeScaling.BetaFast
+	}
+	if m.RopeScaling.BetaSlow != 0 {
+		kv["gpt-oss.rope.scaling.yarn_beta_slow"] = m.RopeScaling.BetaSlow
+	}
+	kv["tokenizer.ggml.pre"] = "gpt-4o"
 	kv["tokenizer.ggml.bos_token_id"] = uint32(199998) // <|startoftext|>
 	kv["tokenizer.ggml.add_bos_token"] = false
 	kv["tokenizer.ggml.eos_token_id"] = uint32(199999) // <|endoftext|>
@@ -152,9 +165,9 @@ func (m *gptossModel) Replacements() []string {
 			"self_attn.q_proj", "attn_q",
 			"self_attn.k_proj", "attn_k",
 			"self_attn.v_proj", "attn_v",
-			"self_attn.o_proj", "attn_out",
-			"self_attn.sinks", "attn_sinks",
-			"post_attention_layernorm", "ffn_norm",
+			"self_attn.o_proj", "attn_output",
+			"self_attn.sinks", "attn_sinks.weight",
+			"post_attention_layernorm", "post_attention_norm",
 			"mlp.router", "ffn_gate_inp",
 			"mlp.experts.gate_up_proj_", "ffn_gate_up_exps.",
 			"mlp.experts.down_proj_", "ffn_down_exps.",
@@ -169,9 +182,9 @@ func (m *gptossModel) Replacements() []string {
 			"block", "blk",
 			"attn.norm", "attn_norm",
 			"attn.qkv", "attn_qkv",
-			"attn.sinks", "attn_sinks",
-			"attn.out", "attn_out",
-			"mlp.norm", "ffn_norm",
+			"attn.sinks", "attn_sinks.weight",
+			"attn.out", "attn_output",
+			"mlp.norm", "post_attention_norm",
 			"mlp.gate", "ffn_gate_inp",
 			"mlp.mlp1_", "ffn_gate_up_exps.",
 			"mlp.mlp2_", "ffn_down_exps.",
--- a/convert/convert_gptoss_test.go
+++ b/convert/convert_gptoss_test.go
@@ -0,0 +1,73 @@
+package convert
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestGptOssCreatesLlamaCppMetadataAndNames(t *testing.T) {
+	m := &gptossModel{
+		HiddenLayers:          24,
+		MaxPositionEmbeddings: 131072,
+		HiddenSize:            2880,
+		IntermediateSize:      2880,
+		AttentionHeads:        64,
+		KeyValueHeads:         8,
+		HeadDim:               64,
+		LocalExperts:          32,
+		ExpertsPerToken:       4,
+		RopeTheta:             150000,
+		InitialContextLength:  4096,
+		SlidingWindow:         128,
+	}
+	m.RopeScaling.Type = "yarn"
+	m.RopeScaling.Factor = 32
+	m.RopeScaling.OriginalMaxPositionEmbeddings = 4096
+	m.RopeScaling.BetaFast = 32
+	m.RopeScaling.BetaSlow = 1
+
+	kv := m.KV(&Tokenizer{Vocabulary: &Vocabulary{Model: "gpt2"}, Pre: "default"})
+	for k, want := range map[string]any{
+		"general.architecture":                         "gpt-oss",
+		"tokenizer.ggml.pre":                           "gpt-4o",
+		"gpt-oss.context_length":                       uint32(131072),
+		"gpt-oss.expert_feed_forward_length":           uint32(2880),
+		"gpt-oss.rope.scaling.type":                    "yarn",
+		"gpt-oss.rope.scaling.factor":                  float32(32),
+		"gpt-oss.rope.scaling.original_context_length": uint32(4096),
+		"gpt-oss.rope.scaling.yarn_beta_fast":          float32(32),
+		"gpt-oss.rope.scaling.yarn_beta_slow":          float32(1),
+	} {
+		if got := kv[k]; got != want {
+			t.Fatalf("%s = %v (%T), want %v (%T)", k, got, got, want, want)
+		}
+	}
+	if _, ok := kv["gptoss.context_length"]; ok {
+		t.Fatal("unexpected Ollama-format gptoss metadata")
+	}
+
+	replacer := strings.NewReplacer(m.Replacements()...)
+	for name, want := range map[string]string{
+		"model.layers.0.self_attn.o_proj.weight":         "blk.0.attn_output.weight",
+		"model.layers.0.self_attn.sinks":                 "blk.0.attn_sinks.weight",
+		"model.layers.0.post_attention_layernorm.weight": "blk.0.post_attention_norm.weight",
+		"model.layers.0.mlp.experts.gate_up_proj_blocks": "blk.0.ffn_gate_up_exps.blocks",
+		"model.layers.0.mlp.experts.down_proj_scales":    "blk.0.ffn_down_exps.scales",
+	} {
+		if got := replacer.Replace(name); got != want {
+			t.Fatalf("Replace(%q) = %q, want %q", name, got, want)
+		}
+	}
+
+	m.MaxPositionEmbeddings = 0
+	replacer = strings.NewReplacer(m.Replacements()...)
+	for name, want := range map[string]string{
+		"block.0.attn.out.weight": "blk.0.attn_output.weight",
+		"block.0.attn.sinks":      "blk.0.attn_sinks.weight",
+		"block.0.mlp.norm.weight": "blk.0.post_attention_norm.weight",
+	} {
+		if got := replacer.Replace(name); got != want {
+			t.Fatalf("Replace(%q) = %q, want %q", name, got, want)
+		}
+	}
+}
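Reviewer note on the test above: the name checks rely on strings.NewReplacer rewriting every old/new pair in a single pass over the tensor name. A small sketch of that behavior, using a hand-picked subset of pairs, could look like the following. The function renameTensor is hypothetical, and the "model.layers" to "blk" pair is an assumption, since that pair does not appear in the hunks shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// renameTensor is a hypothetical helper (not in the diff) that applies a
// subset of old/new pairs similar to those returned by Replacements().
func renameTensor(name string) string {
	r := strings.NewReplacer(
		"model.layers", "blk", // assumed pair, not shown in this diff
		"self_attn.o_proj", "attn_output",
		"self_attn.sinks", "attn_sinks.weight",
		"post_attention_layernorm", "post_attention_norm",
	)
	return r.Replace(name)
}

func main() {
	for _, name := range []string{
		"model.layers.0.self_attn.o_proj.weight",
		"model.layers.0.self_attn.sinks",
		"model.layers.0.post_attention_layernorm.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Because the sinks tensor has no ".weight" suffix in the source checkpoint, the replacement string itself carries the suffix, which is what the "attn_sinks.weight" pair in the diff does.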
--- a/convert/convert_llama.go
+++ b/convert/convert_llama.go
@@ -34,8 +34,6 @@ type llamaModel struct {
 		LowFrequencyFactor            float32 `json:"low_freq_factor"`
 		HighFrequencyFactor           float32 `json:"high_freq_factor"`
 		OriginalMaxPositionEmbeddings uint32  `json:"original_max_position_embeddings"`
-
-		factors ropeFactor
 	} `json:"rope_scaling"`
 	RMSNormEPS       float32 `json:"rms_norm_eps"`
 	LayerNormEPS     float32 `json:"layer_norm_eps"`
@@ -83,27 +81,6 @@ func (p *llamaModel) KV(t *Tokenizer) KV {
 	if p.RopeScaling.Type == "linear" {
 		kv["llama.rope.scaling.type"] = p.RopeScaling.Type
 		kv["llama.rope.scaling.factor"] = p.RopeScaling.Factor
-	} else if p.RopeScaling.RopeType == "llama3" {
-		dim := p.HiddenSize / p.NumAttentionHeads
-		for i := uint32(0); i < dim; i += 2 {
-			factor := cmp.Or(p.RopeScaling.Factor, 8.0)
-			factorLow := cmp.Or(p.RopeScaling.LowFrequencyFactor, 1.0)
-			factorHigh := cmp.Or(p.RopeScaling.HighFrequencyFactor, 4.0)
-
-			original := cmp.Or(p.RopeScaling.OriginalMaxPositionEmbeddings, 8192)
-			lambdaLow := float32(original) / factorLow
-			lambdaHigh := float32(original) / factorHigh
-
-			lambda := 2 * math.Pi * math.Pow(float64(p.RopeTheta), float64(i)/float64(dim))
-			if lambda < float64(lambdaHigh) {
-				p.RopeScaling.factors = append(p.RopeScaling.factors, 1.0)
-			} else if lambda > float64(lambdaLow) {
-				p.RopeScaling.factors = append(p.RopeScaling.factors, factor)
-			} else {
-				smooth := (float32(original)/float32(lambda) - factorLow) / (factorHigh - factorLow)
-				p.RopeScaling.factors = append(p.RopeScaling.factors, 1.0/((1-smooth)/factor+smooth))
-			}
-		}
 	}

 	if p.NumKeyValueHeads > 0 {
@@ -129,12 +106,12 @@ func (p *llamaModel) KV(t *Tokenizer) KV {
 func (p *llamaModel) Tensors(ts []Tensor) []*ggml.Tensor {
 	var out []*ggml.Tensor

-	if p.RopeScaling.factors != nil {
+	if factors := p.ropeFactors(); factors != nil {
 		out = append(out, &ggml.Tensor{
 			Name:     "rope_freqs.weight",
 			Kind:     0,
-			Shape:    []uint64{uint64(len(p.RopeScaling.factors))},
-			WriterTo: p.RopeScaling.factors,
+			Shape:    []uint64{uint64(len(factors))},
+			WriterTo: factors,
 		})
 	}

@@ -157,6 +134,40 @@ func (p *llamaModel) Tensors(ts []Tensor) []*ggml.Tensor {
 	return out
 }

+func (p *llamaModel) ropeFactors() ropeFactor {
+	if p.RopeScaling.RopeType != "llama3" || p.HiddenSize == 0 || p.NumAttentionHeads == 0 || p.RopeTheta == 0 {
+		return nil
+	}
+
+	dim := p.HiddenSize / p.NumAttentionHeads
+	if dim == 0 {
+		return nil
+	}
+
+	factors := make(ropeFactor, 0, dim/2)
+	for i := uint32(0); i < dim; i += 2 {
+		factor := cmp.Or(p.RopeScaling.Factor, float32(8))
+		factorLow := cmp.Or(p.RopeScaling.LowFrequencyFactor, float32(1))
+		factorHigh := cmp.Or(p.RopeScaling.HighFrequencyFactor, float32(4))
+
+		original := cmp.Or(p.RopeScaling.OriginalMaxPositionEmbeddings, uint32(8192))
+		lambdaLow := float32(original) / factorLow
+		lambdaHigh := float32(original) / factorHigh
+
+		lambda := 2 * math.Pi * math.Pow(float64(p.RopeTheta), float64(i)/float64(dim))
+		if lambda < float64(lambdaHigh) {
+			factors = append(factors, 1)
+		} else if lambda > float64(lambdaLow) {
+			factors = append(factors, factor)
+		} else {
+			smooth := (float32(original)/float32(lambda) - factorLow) / (factorHigh - factorLow)
+			factors = append(factors, 1/((1-smooth)/factor+smooth))
+		}
+	}
+
+	return factors
+}
+
 func (p *llamaModel) Replacements() []string {
 	return []string{
 		"lm_head", "output",
--- a/convert/convert_llama_test.go
+++ b/convert/convert_llama_test.go
@@ -0,0 +1,34 @@
+package convert
+
+import "testing"
+
+func TestLlama3RopeFactorsTensorDoesNotDependOnKVOrder(t *testing.T) {
+	m := &llamaModel{
+		HiddenSize:        2048,
+		NumAttentionHeads: 32,
+		RopeTheta:         500000,
+	}
+	m.RopeScaling.RopeType = "llama3"
+	m.RopeScaling.Factor = 32
+	m.RopeScaling.LowFrequencyFactor = 1
+	m.RopeScaling.HighFrequencyFactor = 4
+	m.RopeScaling.OriginalMaxPositionEmbeddings = 8192
+
+	tensors := m.Tensors(nil)
+	if len(tensors) != 1 {
+		t.Fatalf("expected rope tensor only, got %d tensors", len(tensors))
+	}
+	if tensors[0].Name != "rope_freqs.weight" {
+		t.Fatalf("expected rope_freqs.weight, got %q", tensors[0].Name)
+	}
+	if len(tensors[0].Shape) != 1 || tensors[0].Shape[0] != 32 {
+		t.Fatalf("expected rope tensor shape [32], got %v", tensors[0].Shape)
+	}
+
+	_ = m.KV(&Tokenizer{Vocabulary: &Vocabulary{}})
+
+	afterKV := m.Tensors(nil)
+	if len(afterKV) != 1 || afterKV[0].Name != "rope_freqs.weight" {
+		t.Fatalf("expected one rope tensor after KV call, got %#v", afterKV)
+	}
+}
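Reviewer note on ropeFactors and its test: the llama3 scaling rule assigns each rotary dimension pair a factor based on its wavelength. Short wavelengths keep a factor of 1, long wavelengths get the full scaling factor, and the band in between is interpolated. A standalone sketch of that rule, mirroring the arithmetic in the diff but with plain parameters instead of the llamaModel struct, might look like this. The function name llama3RopeFactors and its signature are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// llama3RopeFactors is a hypothetical standalone version of the rule used by
// ropeFactors in the diff. dim is the per-head dimension; low and high are
// the low/high frequency factors; original is the original context length.
func llama3RopeFactors(theta float64, dim int, factor, low, high float32, original uint32) []float32 {
	if dim <= 0 || theta == 0 {
		return nil
	}

	lambdaLow := float64(original) / float64(low)
	lambdaHigh := float64(original) / float64(high)

	factors := make([]float32, 0, dim/2)
	for i := 0; i < dim; i += 2 {
		// Wavelength of the i-th rotary dimension pair.
		lambda := 2 * math.Pi * math.Pow(theta, float64(i)/float64(dim))
		switch {
		case lambda < lambdaHigh:
			// High-frequency band: leave unscaled.
			factors = append(factors, 1)
		case lambda > lambdaLow:
			// Low-frequency band: apply the full factor.
			factors = append(factors, factor)
		default:
			// Transition band: interpolate between the two.
			smooth := (float32(original)/float32(lambda) - low) / (high - low)
			factors = append(factors, 1/((1-smooth)/factor+smooth))
		}
	}
	return factors
}

func main() {
	// Same inputs as the test above: 2048 hidden / 32 heads gives dim 64.
	f := llama3RopeFactors(500000, 64, 32, 1, 4, 8192)
	fmt.Println(len(f), f[0], f[len(f)-1])
}
```

Computing the factors from the config on demand, as the diff does, means the tensor no longer depends on KV having been called first, which is the property the test checks.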
--- a/convert/convert_mistral.go
+++ b/convert/convert_mistral.go
@@ -79,20 +79,17 @@ func (p *mistral3Model) KV(t *Tokenizer) KV {
 	kv["mistral3.rope.freq_base"] = cmp.Or(p.TextModel.RopeTheta, p.TextModel.RopeParameters.RopeTheta)
 	kv["mistral3.rope.scaling.factor"] = p.TextModel.RopeParameters.Factor
 	kv["mistral3.rope.scaling.type"] = p.TextModel.RopeParameters.RopeType
-	kv["mistral3.rope.scaling.beta_fast"] = p.TextModel.RopeParameters.BetaFast
-	kv["mistral3.rope.scaling.beta_slow"] = p.TextModel.RopeParameters.BetaSlow
+	kv["mistral3.rope.scaling.yarn_beta_fast"] = p.TextModel.RopeParameters.BetaFast
+	kv["mistral3.rope.scaling.yarn_beta_slow"] = p.TextModel.RopeParameters.BetaSlow

-	if p.TextModel.RopeParameters.Mscale != nil {
-		kv["mistral3.rope.scaling.mscale"] = *p.TextModel.RopeParameters.Mscale
-	}
 	if p.TextModel.RopeParameters.MscaleAllDim != nil {
-		kv["mistral3.rope.scaling.mscale_all_dim"] = *p.TextModel.RopeParameters.MscaleAllDim
+		kv["mistral3.rope.scaling.yarn_log_multiplier"] = *p.TextModel.RopeParameters.MscaleAllDim
 	}
 	if p.TextModel.RopeParameters.OrigMaxPositionEmbeddings > 0 {
 		kv["mistral3.rope.scaling.original_context_length"] = p.TextModel.RopeParameters.OrigMaxPositionEmbeddings
 	}
 	if p.TextModel.RopeParameters.Llama4ScalingBeta != nil {
-		kv["mistral3.rope.scaling_beta"] = *p.TextModel.RopeParameters.Llama4ScalingBeta
+		kv["mistral3.attention.temperature_scale"] = *p.TextModel.RopeParameters.Llama4ScalingBeta
 	}

 	// Vision configuration
--- a/convert/convert_mistral_causal.go
+++ b/convert/convert_mistral_causal.go
@@ -58,24 +58,19 @@ func (p *mistral3CausalModel) KV(t *Tokenizer) KV {
 	kv["mistral3.rope.freq_base"] = cmp.Or(p.RopeTheta, p.RopeParameters.RopeTheta)
 	kv["mistral3.rope.scaling.factor"] = p.RopeParameters.Factor
 	kv["mistral3.rope.scaling.type"] = p.RopeParameters.RopeType
-	kv["mistral3.rope.scaling.beta_fast"] = p.RopeParameters.BetaFast
-	kv["mistral3.rope.scaling.beta_slow"] = p.RopeParameters.BetaSlow
-
-	if p.RopeParameters.Mscale != nil {
-		kv["mistral3.rope.scaling.mscale"] = *p.RopeParameters.Mscale
-	}
+	kv["mistral3.rope.scaling.yarn_beta_fast"] = p.RopeParameters.BetaFast
+	kv["mistral3.rope.scaling.yarn_beta_slow"] = p.RopeParameters.BetaSlow

 	if p.RopeParameters.MscaleAllDim != nil {
-		kv["mistral3.rope.scaling.mscale_all_dim"] = *p.RopeParameters.MscaleAllDim
+		kv["mistral3.rope.scaling.yarn_log_multiplier"] = *p.RopeParameters.MscaleAllDim
 	}

 	if p.RopeParameters.OrigMaxPositionEmbeddings > 0 {
 		kv["mistral3.rope.scaling.original_context_length"] = p.RopeParameters.OrigMaxPositionEmbeddings
-		kv["mistral3.rope.scaling_beta"] = *p.RopeParameters.Llama4ScalingBeta
 	}

 	if p.RopeParameters.Llama4ScalingBeta != nil {
-		kv["mistral3.rope.scaling_beta"] = *p.RopeParameters.Llama4ScalingBeta
+		kv["mistral3.attention.temperature_scale"] = *p.RopeParameters.Llama4ScalingBeta
 	}

 	return kv
--- a/convert/convert_mistral_test.go
+++ b/convert/convert_mistral_test.go
@@ -0,0 +1,70 @@
+package convert
+
+import "testing"
+
+func TestMistral3KVUsesLlamaCppRopeScalingKeys(t *testing.T) {
+	mscale := float32(0.75)
+	mscaleAllDim := float32(0)
+	temperatureScale := float32(0.125)
+
+	multimodal := &mistral3Model{}
+	multimodal.TextModel.NumAttentionHeads = 1
+	multimodal.TextModel.HeadDim = 64
+	multimodal.TextModel.RopeParameters.BetaFast = 32
+	multimodal.TextModel.RopeParameters.BetaSlow = 1
+	multimodal.TextModel.RopeParameters.Mscale = &mscale
+	multimodal.TextModel.RopeParameters.MscaleAllDim = &mscaleAllDim
+	multimodal.TextModel.RopeParameters.Llama4ScalingBeta = &temperatureScale
+
+	causal := &mistral3CausalModel{NumAttentionHeads: 1, HeadDim: 64}
+	causal.RopeParameters.BetaFast = 32
+	causal.RopeParameters.BetaSlow = 1
+	causal.RopeParameters.Mscale = &mscale
+	causal.RopeParameters.MscaleAllDim = &mscaleAllDim
+	causal.RopeParameters.Llama4ScalingBeta = &temperatureScale
+
+	tests := []struct {
+		name string
+		kv   KV
+	}{
+		{name: "multimodal", kv: multimodal.KV(mistralTestTokenizer())},
+		{name: "causal", kv: causal.KV(mistralTestTokenizer())},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			assertKVEquals(t, tt.kv, "mistral3.rope.scaling.yarn_beta_fast", float32(32))
+			assertKVEquals(t, tt.kv, "mistral3.rope.scaling.yarn_beta_slow", float32(1))
+			assertKVEquals(t, tt.kv, "mistral3.rope.scaling.yarn_log_multiplier", mscaleAllDim)
+			assertKVEquals(t, tt.kv, "mistral3.attention.temperature_scale", temperatureScale)
+
+			for _, key := range []string{
+				"mistral3.rope.scaling.beta_fast",
+				"mistral3.rope.scaling.beta_slow",
+				"mistral3.rope.scaling.mscale",
+				"mistral3.rope.scaling.mscale_all_dim",
+				"mistral3.rope.scaling_beta",
+			} {
+				if _, ok := tt.kv[key]; ok {
+					t.Fatalf("unexpected legacy key %q", key)
+				}
+			}
+		})
+	}
+}
+
+func mistralTestTokenizer() *Tokenizer {
+	return &Tokenizer{Vocabulary: &Vocabulary{}}
+}
+
+func assertKVEquals[T comparable](t *testing.T, kv KV, key string, want T) {
+	t.Helper()
+
+	got, ok := kv[key]
+	if !ok {
+		t.Fatalf("missing key %q", key)
+	}
+	if got != want {
+		t.Fatalf("%s = %v, want %v", key, got, want)
+	}
+}
--- a/convert/convert_nemotron_h.go
+++ b/convert/convert_nemotron_h.go
@@ -131,8 +131,10 @@ type radioConfig struct {
 	} `json:"args"`
 }

-var _ ModelConverter = (*nemotronHModel)(nil)
-var _ ModelConverter = (*nemotronHNanoVLModel)(nil)
+var (
+	_ ModelConverter = (*nemotronHModel)(nil)
+	_ ModelConverter = (*nemotronHNanoVLModel)(nil)
+)

 func (n *nemotronHNanoVLModel) parseMore(fsys fs.FS) error {
 	if n.MaxSequenceLength > 0 {
--- a/convert/convert_olmo.go
+++ b/convert/convert_olmo.go
@@ -36,39 +36,39 @@ var _ ModelConverter = (*olmoModel)(nil)

 func (p *olmoModel) KV(t *Tokenizer) KV {
 	kv := p.ModelParameters.KV(t)
-	kv["general.architecture"] = "olmo3"
-	kv["olmo3.block_count"] = p.NumHiddenLayers
-	kv["olmo3.context_length"] = p.MaxPositionEmbeddings
-	kv["olmo3.embedding_length"] = p.HiddenSize
-	kv["olmo3.feed_forward_length"] = p.IntermediateSize
-	kv["olmo3.attention.head_count"] = p.NumAttentionHeads
-	kv["olmo3.attention.head_count_kv"] = cmp.Or(p.NumKeyValueHeads, p.NumAttentionHeads)
+	kv["general.architecture"] = "olmo2"
+	kv["olmo2.block_count"] = p.NumHiddenLayers
+	kv["olmo2.context_length"] = p.MaxPositionEmbeddings
+	kv["olmo2.embedding_length"] = p.HiddenSize
+	kv["olmo2.feed_forward_length"] = p.IntermediateSize
+	kv["olmo2.attention.head_count"] = p.NumAttentionHeads
+	kv["olmo2.attention.head_count_kv"] = cmp.Or(p.NumKeyValueHeads, p.NumAttentionHeads)

 	if p.RopeTheta > 0 {
-		kv["olmo3.rope.freq_base"] = p.RopeTheta
+		kv["olmo2.rope.freq_base"] = p.RopeTheta
 	}

 	if p.RopeScaling != nil {
 		if p.RopeScaling.Factor > 0 {
-			kv["olmo3.rope.scaling.factor"] = p.RopeScaling.Factor
+			kv["olmo2.rope.scaling.factor"] = p.RopeScaling.Factor
 		}
 		if p.RopeScaling.OriginalMaxPositionEmbeds > 0 {
-			kv["olmo3.rope.scaling.original_context_length"] = p.RopeScaling.OriginalMaxPositionEmbeds
+			kv["olmo2.rope.scaling.original_context_length"] = p.RopeScaling.OriginalMaxPositionEmbeds
 		}
 		if p.RopeScaling.AttentionFactor > 0 {
-			kv["olmo3.rope.scaling.attn_factor"] = p.RopeScaling.AttentionFactor
+			kv["olmo2.rope.scaling.attn_factor"] = p.RopeScaling.AttentionFactor
 		}
 		if p.RopeScaling.RopeType != "" {
-			kv["olmo3.rope.scaling.type"] = p.RopeScaling.RopeType
+			kv["olmo2.rope.scaling.type"] = p.RopeScaling.RopeType
 		}
 	}

 	if p.RMSNormEPS > 0 {
-		kv["olmo3.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
+		kv["olmo2.attention.layer_norm_rms_epsilon"] = p.RMSNormEPS
 	}

 	if p.SlidingWindow > 0 {
-		kv["olmo3.attention.sliding_window"] = p.SlidingWindow
+		kv["olmo2.attention.sliding_window"] = p.SlidingWindow
 	}

 	if len(p.LayerTypes) > 0 {
@@ -76,7 +76,7 @@ func (p *olmoModel) KV(t *Tokenizer) KV {
 		for i, layerType := range p.LayerTypes {
 			slidingPattern[i] = (layerType == "sliding_attention")
 		}
-		kv["olmo3.attention.sliding_window_pattern"] = slidingPattern
+		kv["olmo2.attention.sliding_window_pattern"] = slidingPattern
 	}

 	return kv
--- a/convert/convert_qwen3next.go
+++ b/convert/convert_qwen3next.go
@@ -1,15 +1,24 @@
 package convert

 import (
+	"bufio"
+	"bytes"
+	"encoding/binary"
 	"encoding/json"
 	"fmt"
+	"io"
 	"io/fs"
+	"maps"
 	"math"
+	"os"
 	"slices"
+	"strconv"
 	"strings"

+	"github.com/d4l3k/go-bfloat16"
 	"github.com/pdevine/tensor"
 	"github.com/pdevine/tensor/native"
+	"github.com/x448/float16"

 	"github.com/ollama/ollama/fs/ggml"
 )
@@ -32,6 +41,8 @@ type qwen3NextTextConfig struct {
 	MaxPositionEmbeddings uint32  `json:"max_position_embeddings"`
 	HiddenSize            uint32  `json:"hidden_size"`
 	NumHiddenLayers       uint32  `json:"num_hidden_layers"`
+	NumNextNPredictLayers uint32  `json:"num_nextn_predict_layers"`
+	MTPNumHiddenLayers    uint32  `json:"mtp_num_hidden_layers"`
 	IntermediateSize      uint32  `json:"intermediate_size"`
 	NumAttentionHeads     uint32  `json:"num_attention_heads"`
 	NumKeyValueHeads      uint32  `json:"num_key_value_heads"`
@@ -66,8 +77,11 @@ type qwen3NextTextConfig struct {
 type qwen3NextVisionConfig struct {
 	Depth                  uint32  `json:"depth"`
 	HiddenSize             uint32  `json:"hidden_size"`
+	IntermediateSize       uint32  `json:"intermediate_size"`
 	NumHeads               uint32  `json:"num_heads"`
+	NumPositionEmbeddings  uint32  `json:"num_position_embeddings"`
 	InChannels             uint32  `json:"in_channels"`
+	OutHiddenSize          uint32  `json:"out_hidden_size"`
 	PatchSize              uint32  `json:"patch_size"`
 	SpatialMergeSize       uint32  `json:"spatial_merge_size"`
 	RMSNormEps             float32 `json:"layer_norm_epsilon"`
@@ -96,12 +110,25 @@ type qwen3NextModel struct {
 	VisionEndTokenID   uint32 `json:"vision_end_token_id"`
 }

-var _ ModelConverter = (*qwen3NextModel)(nil)
+var (
+	_ ModelConverter      = (*qwen3NextModel)(nil)
+	_ MultimodalConverter = (*qwen3NextModel)(nil)
+)

 func (q *qwen3NextModel) parseMore(fsys fs.FS) error {
 	if q.TextConfig != nil {
 		q.qwen3NextTextConfig = *q.TextConfig
 	}
+	if q.NumNextNPredictLayers == 0 {
+		q.NumNextNPredictLayers = q.MTPNumHiddenLayers
+	}
+	if q.NumNextNPredictLayers == 0 {
+		nextn, err := qwen3NextInferNextNPredictLayers(fsys)
+		if err != nil {
+			return err
+		}
+		q.NumNextNPredictLayers = nextn
+	}

 	if q.RopeTheta == 0 {
 		q.RopeTheta = q.RopeParameters.RopeTheta
@@ -182,6 +209,150 @@ func (q *qwen3NextModel) parseMore(fsys fs.FS) error {
 	return nil
 }

+func qwen3NextInferNextNPredictLayers(fsys fs.FS) (uint32, error) {
+	paths, err := fs.Glob(fsys, "*.safetensors")
+	if err != nil {
+		return 0, err
+	}
+
+	maxLayer := -1
+	hasMTP := false
+	for _, p := range paths {
+		f, err := fsys.Open(p)
+		if err != nil {
+			return 0, err
+		}
+
+		var n int64
+		if err := binary.Read(f, binary.LittleEndian, &n); err != nil {
+			f.Close()
+			return 0, err
+		}
+
+		b := bytes.NewBuffer(make([]byte, 0, n))
+		if _, err = io.CopyN(b, f, n); err != nil {
+			f.Close()
+			return 0, err
+		}
+		f.Close()
+
+		var headers map[string]safetensorMetadata
+		if err := json.NewDecoder(b).Decode(&headers); err != nil {
+			return 0, err
+		}
+
+		for name, value := range headers {
+			if value.Type == "" || !strings.HasPrefix(name, "mtp.") {
+				continue
+			}
+			hasMTP = true
+			rest := strings.TrimPrefix(name, "mtp.layers.")
+			layer, suffix, ok := strings.Cut(rest, ".")
+			if !ok {
+				continue
+			}
+			n, err := strconv.Atoi(layer)
+			if err == nil && n > maxLayer && suffix != "" {
+				maxLayer = n
+			}
+		}
+	}
+
+	if maxLayer >= 0 {
+		return uint32(maxLayer + 1), nil
+	}
+	if hasMTP {
+		return 1, nil
+	}
+	return 0, nil
+}
+
+func ConvertQwen35MTPDraft(fsys fs.FS, f *os.File, baseKV ggml.KV, baseTensors []*ggml.Tensor) error {
+	arch := baseKV.Architecture()
+	if arch != "qwen35" && arch != "qwen35moe" {
+		return fmt.Errorf("MTP draft safetensors require a qwen3.5 base model, got %q", arch)
+	}
+
+	baseBlocks := baseKV.Uint("block_count")
+	if baseBlocks == 0 {
+		return fmt.Errorf("MTP draft safetensors require a base model with block_count")
+	}
+	if baseKV.Uint("nextn_predict_layers") > 0 {
+		return fmt.Errorf("MTP draft safetensors require a base model without embedded MTP layers")
+	}
+
+	nextn, err := qwen3NextInferNextNPredictLayers(fsys)
+	if err != nil {
+		return err
+	}
+	if nextn == 0 {
+		return fmt.Errorf("MTP draft safetensors did not contain mtp tensors")
+	}
+
+	q := &qwen3NextModel{
+		qwen3NextTextConfig: qwen3NextTextConfig{
+			NumHiddenLayers:       baseBlocks,
+			NumNextNPredictLayers: nextn,
+		},
+	}
+	ts, err := parseTensors(fsys, strings.NewReplacer(q.Replacements()...))
+	if err != nil {
+		return err
+	}
+	if err := ensureUniqueTensorNames(ts); err != nil {
+		return err
+	}
+
+	mtpTensors := q.Tensors(ts)
+	if len(mtpTensors) == 0 {
+		return fmt.Errorf("MTP draft safetensors did not produce GGUF tensors")
+	}
+	for _, tensor := range mtpTensors {
+		if !qwen35MTPDraftTensorName(tensor.Name, baseBlocks, nextn) {
+			return fmt.Errorf("MTP draft safetensors produced unexpected tensor %q", tensor.Name)
+		}
+		tensor.Shape = slices.Clone(tensor.Shape)
+		slices.Reverse(tensor.Shape)
+	}
+
+	kv := maps.Clone(baseKV)
+	qwen35RemoveSplitMetadata(kv, arch)
+	kv[arch+".block_count"] = baseBlocks + nextn
+	kv[arch+".nextn_predict_layers"] = nextn
+
+	tensors := make([]*ggml.Tensor, 0, len(baseTensors)+len(mtpTensors))
+	tensors = append(tensors, baseTensors...)
+	tensors = append(tensors, mtpTensors...)
+
+	var parameters uint64
+	for _, tensor := range tensors {
+		parameters += tensor.Elements()
+	}
+	kv["general.parameter_count"] = parameters
+
+	return ggml.WriteGGUF(f, kv, tensors)
+}
+
+func qwen35RemoveSplitMetadata(kv ggml.KV, arch string) {
+	for _, key := range []string{
+		"split.no",
+		"split.count",
+		"split.tensors.count",
+	} {
+		delete(kv, key)
+		delete(kv, arch+"."+key)
+	}
+}
+
+func qwen35MTPDraftTensorName(name string, base, nextn uint32) bool {
+	for i := range nextn {
+		if strings.HasPrefix(name, fmt.Sprintf("blk.%d.", base+i)) {
+			return true
+		}
+	}
+	return false
+}
+
 func (q *qwen3NextModel) kvHeadCounts() ([]uint32, error) {
 	if len(q.LayerTypes) > 0 {
 		kv := make([]uint32, q.NumHiddenLayers)
@@ -259,7 +430,10 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
 	}
 	kv["general.architecture"] = arch
 	kv["tokenizer.ggml.pre"] = "qwen35"
-	kv["block_count"] = q.NumHiddenLayers
+	kv["block_count"] = q.NumHiddenLayers + q.NumNextNPredictLayers
+	if q.NumNextNPredictLayers > 0 {
+		kv["nextn_predict_layers"] = q.NumNextNPredictLayers
+	}
 	kv["context_length"] = q.MaxPositionEmbeddings
 	kv["embedding_length"] = q.HiddenSize
 	kv["feed_forward_length"] = q.IntermediateSize
@@ -282,7 +456,11 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
 	if sections := q.ropeSections(); len(sections) > 0 {
 		kv["mrope_sections"] = sections
 		kv["rope.mrope_section"] = sections
-		kv["rope.dimension_sections"] = sections
+		dimensionSections := append([]int32(nil), sections...)
+		if len(dimensionSections) == 3 {
+			dimensionSections = append(dimensionSections, 0)
+		}
+		kv["rope.dimension_sections"] = dimensionSections
 	}
 	if q.RopeParameters.MRopeInterleaved {
 		kv["rope.mrope_interleaved"] = true
@@ -321,12 +499,21 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
 	}

 	if headCounts, err := q.kvHeadCounts(); err == nil {
-		kv["attention.head_count_kv"] = headCounts
+		var maxKV uint32
+		for _, count := range headCounts {
+			if count > maxKV {
+				maxKV = count
+			}
+		}
+		kv["attention.head_count_kv"] = maxKV
 	}

 	if q.VisionModel.Depth > 0 {
 		kv["vision.block_count"] = q.VisionModel.Depth
 		kv["vision.embedding_length"] = q.VisionModel.HiddenSize
+		if q.VisionModel.IntermediateSize > 0 {
+			kv["vision.feed_forward_length"] = q.VisionModel.IntermediateSize
+		}
 		kv["vision.attention.head_count"] = q.VisionModel.NumHeads
 		kv["vision.num_channels"] = q.VisionModel.InChannels
 		if q.VisionModel.PatchSize > 0 {
@@ -372,6 +559,378 @@ func (q *qwen3NextModel) KV(t *Tokenizer) KV {
 	return kv
 }

+func (q *qwen3NextModel) TextKV(t *Tokenizer) KV {
+	kv := q.KV(t)
+
+	for _, key := range []string{
+		"vision.block_count",
+		"vision.embedding_length",
+		"vision.feed_forward_length",
+		"vision.attention.head_count",
+		"vision.num_channels",
+		"vision.patch_size",
+		"vision.spatial_merge_size",
+		"vision.attention.layer_norm_epsilon",
+		"vision.rope.freq_base",
+		"vision.temporal_patch_size",
+		"vision.deepstack_visual_indexes",
+		"vision.shortest_edge",
+		"vision.longest_edge",
+		"vision.image_mean",
+		"vision.image_std",
+		"image_token_id",
+		"vision_start_token_id",
+		"vision_end_token_id",
+		"mrope_sections",
+		"rope.mrope_section",
+		"rope.mrope_interleaved",
+		"ssm.v_head_reordered",
+	} {
+		delete(kv, key)
+	}
+
+	return kv
+}
+
+func (q *qwen3NextModel) ProjectorKV(*Tokenizer) KV {
+	depth := q.VisionModel.Depth
+	deepstack := make([]bool, depth)
+	for _, idx := range q.VisionModel.DeepstackVisualIndexes {
+		if idx >= 0 && uint32(idx) < depth {
+			deepstack[idx] = true
+		}
+	}
+
+	imageSize := uint32(768)
+	if q.VisionModel.NumPositionEmbeddings > 0 && q.VisionModel.PatchSize > 0 {
+		root := uint32(math.Sqrt(float64(q.VisionModel.NumPositionEmbeddings)))
+		if root*root == q.VisionModel.NumPositionEmbeddings {
+			imageSize = root * q.VisionModel.PatchSize
+		}
+	}
+
+	projectionDim := q.VisionModel.OutHiddenSize
+	if projectionDim == 0 {
+		projectionDim = q.HiddenSize
+	}
+	layerNormEps := q.VisionModel.RMSNormEps
+	if layerNormEps == 0 {
+		layerNormEps = 1e-6
+	}
+
+	kv := KV{
+		"general.architecture":                     "clip",
+		"general.type":                             "mmproj",
+		"general.file_type":                        uint32(1),
+		"general.quantization_version":             uint32(2),
+		"clip.has_vision_encoder":                  true,
+		"clip.projector_type":                      "qwen3vl_merger",
+		"clip.use_gelu":                            true,
+		"clip.vision.block_count":                  depth,
+		"clip.vision.embedding_length":             q.VisionModel.HiddenSize,
+		"clip.vision.feed_forward_length":          q.VisionModel.IntermediateSize,
+		"clip.vision.attention.head_count":         q.VisionModel.NumHeads,
+		"clip.vision.image_size":                   imageSize,
+		"clip.vision.patch_size":                   q.VisionModel.PatchSize,
+		"clip.vision.projection_dim":               projectionDim,
+		"clip.vision.spatial_merge_size":           q.VisionModel.SpatialMergeSize,
+		"clip.vision.attention.layer_norm_epsilon": layerNormEps,
+		"clip.vision.is_deepstack_layers":          deepstack,
+	}
+	if len(q.VisionModel.ImageMean) > 0 {
+		kv["clip.vision.image_mean"] = q.VisionModel.ImageMean
+	}
+	if len(q.VisionModel.ImageStd) > 0 {
+		kv["clip.vision.image_std"] = q.VisionModel.ImageStd
+	}
+
+	return kv
+}
+
+func (q *qwen3NextModel) TextTensors(ts []Tensor, _ *Tokenizer) []*ggml.Tensor {
+	var text []Tensor
+	for _, t := range ts {
+		if qwen3NextVisionTensor(t.Name()) {
+			continue
+		}
+		text = append(text, t)
+	}
+
+	return q.Tensors(text)
+}
+
+func (q *qwen3NextModel) ProjectorTensors(ts []Tensor) []*ggml.Tensor {
+	if q.VisionModel.Depth == 0 {
+		return nil
+	}
+
+	rename := strings.NewReplacer(
+		"v.pos_embed", "v.position_embd",
+		"v.patch_embed", "v.patch_embd",
+		"v.merger.norm", "v.post_ln",
+		"v.merger.linear_fc1", "mm.0",
+		"v.merger.linear_fc2", "mm.2",
+		".mlp.linear_fc1", ".ffn_up",
+		".mlp.linear_fc2", ".ffn_down",
+		".norm1", ".ln1",
+		".norm2", ".ln2",
+	)
+
+	var out []*ggml.Tensor
+	for _, t := range ts {
+		name := t.Name()
+		if !qwen3NextVisionTensor(name) {
+			continue
+		}
+
+		if name == "v.patch_embed.weight" {
+			out = append(out, q.qwen35PatchEmbedTensors(t)...)
+			continue
+		}
+
+		outName := rename.Replace(name)
+		kind := t.Kind()
+		writer := io.WriterTo(t)
+		if outName == "v.position_embd.weight" {
+			kind = tensorKindFP32
+			writer = tensorFloat32Writer{tensor: t}
+		} else if sourceDType(t) == "BF16" && kind == tensorKindFP16 {
+			kind = tensorKindBF16
+			writer = tensorBF16Writer{tensor: t}
+		}
+		out = append(out, &ggml.Tensor{
+			Name:     outName,
+			Kind:     kind,
+			Shape:    slices.Clone(t.Shape()),
+			WriterTo: writer,
+		})
+	}
+
+	return out
+}
+
+func qwen3NextVisionTensor(name string) bool {
+	return strings.HasPrefix(name, "v.")
+}
+
+func (q *qwen3NextModel) qwen35PatchEmbedTensors(t Tensor) []*ggml.Tensor {
+	shape := t.Shape()
+	if len(shape) != 5 || shape[2] != 2 {
+		return nil
+	}
+
+	outShape := []uint64{shape[0], shape[1], shape[3], shape[4]}
+	return []*ggml.Tensor{
+		{
+			Name:     "v.patch_embd.weight",
+			Kind:     tensorKindFP32,
+			Shape:    slices.Clone(outShape),
+			WriterTo: tensorFloat32Writer{tensor: t, repacker: q.qwen35PatchEmbedSlice(0)},
+		},
+		{
+			Name:     "v.patch_embd.weight.1",
+			Kind:     tensorKindFP32,
+			Shape:    slices.Clone(outShape),
+			WriterTo: tensorFloat32Writer{tensor: t, repacker: q.qwen35PatchEmbedSlice(1)},
+		},
+	}
+}
+
+func (q *qwen3NextModel) qwen35PatchEmbedSlice(slice int) Repacker {
+	return func(_ string, data []float32, shape []uint64) ([]float32, error) {
+		if len(shape) != 5 || shape[2] != 2 {
+			return nil, fmt.Errorf("qwen3next: unexpected patch_embed shape %v", shape)
+		}
+
+		outChannels := int(shape[0])
+		inChannels := int(shape[1])
+		frames := int(shape[2])
+		height := int(shape[3])
+		width := int(shape[4])
+		if slice < 0 || slice >= frames {
+			return nil, fmt.Errorf("qwen3next: patch_embed slice %d out of range", slice)
+		}
+
+		expected := outChannels * inChannels * frames * height * width
+		if len(data) != expected {
+			return nil, fmt.Errorf("qwen3next: patch_embed data size %d, expected %d", len(data), expected)
+		}
+
+		out := make([]float32, outChannels*inChannels*height*width)
+		for oc := range outChannels {
+			for ic := range inChannels {
+				for y := range height {
+					for x := range width {
+						src := ((((oc*inChannels+ic)*frames+slice)*height + y) * width) + x
+						dst := (((oc*inChannels+ic)*height + y) * width) + x
+						out[dst] = data[src]
+					}
+				}
+			}
+		}
+
+		return out, nil
+	}
+}
+
+type tensorBF16Writer struct {
+	tensor   Tensor
+	repacker Repacker
+}
+
+func (w tensorBF16Writer) WriteTo(dst io.Writer) (int64, error) {
+	data, err := tensorFloat32Data(w.tensor)
+	if err != nil {
+		return 0, err
+	}
+	if w.repacker != nil {
+		data, err = w.repacker(w.tensor.Name(), data, w.tensor.Shape())
+		if err != nil {
+			return 0, err
+		}
+	}
+
+	u8s := bfloat16.EncodeFloat32(data)
+	if _, err := dst.Write(u8s); err != nil {
+		return 0, err
+	}
+	return int64(len(u8s)), nil
+}
+
+type tensorFloat32Writer struct {
+	tensor   Tensor
+	repacker Repacker
+}
+
+func (w tensorFloat32Writer) WriteTo(dst io.Writer) (int64, error) {
+	data, err := tensorFloat32Data(w.tensor)
+	if err != nil {
+		return 0, err
+	}
+	if w.repacker != nil {
+		data, err = w.repacker(w.tensor.Name(), data, w.tensor.Shape())
+		if err != nil {
+			return 0, err
+		}
+	}
+	if err := binary.Write(dst, binary.LittleEndian, data); err != nil {
+		return 0, err
+	}
+	return int64(len(data) * 4), nil
+}
+
+func tensorFloat32Data(t Tensor) ([]float32, error) {
+	if st, ok := tensorSafetensor(t); ok {
+		return safetensorFloat32Data(st)
+	}
+
+	var buf bytes.Buffer
+	if _, err := t.WriteTo(&buf); err != nil {
+		return nil, err
+	}
+
+	switch t.Kind() {
+	case tensorKindFP32:
+		out := make([]float32, buf.Len()/4)
+		if err := binary.Read(bytes.NewReader(buf.Bytes()), binary.LittleEndian, out); err != nil {
+			return nil, err
+		}
+		return out, nil
+	case tensorKindFP16:
+		raw := make([]uint16, buf.Len()/2)
+		if err := binary.Read(bytes.NewReader(buf.Bytes()), binary.LittleEndian, raw); err != nil {
+			return nil, err
+		}
+		out := make([]float32, len(raw))
+		for i, v := range raw {
+			out[i] = float16.Frombits(v).Float32()
+		}
+		return out, nil
+	case tensorKindBF16:
+		return bfloat16.DecodeFloat32(buf.Bytes()), nil
+	default:
+		return nil, fmt.Errorf("unsupported tensor kind %d for F32 writer", t.Kind())
+	}
+}
+
+func tensorSafetensor(t Tensor) (safetensor, bool) {
+	switch t := t.(type) {
+	case safetensor:
+		return t, true
+	case *safetensor:
+		return *t, true
+	default:
+		return safetensor{}, false
+	}
+}
+
+func safetensorFloat32Data(st safetensor) ([]float32, error) {
+	f, err := st.fs.Open(st.path)
+	if err != nil {
+		return nil, err
+	}
+	defer f.Close()
+
+	var r io.Reader
+	if readerAt, ok := f.(io.ReaderAt); ok {
+		r = io.NewSectionReader(readerAt, st.offset, st.size)
+	} else if seeker, ok := f.(io.Seeker); ok {
+		if _, err := seeker.Seek(st.offset, io.SeekStart); err != nil {
+			return nil, err
+		}
+		r = f
+	} else {
+		if _, err := io.CopyN(io.Discard, f, st.offset); err != nil {
+			return nil, err
+		}
+		r = f
+	}
+
+	br := bufio.NewReaderSize(r, min(32<<10, int(st.size)))
+	var out []float32
+	switch st.dtype {
+	case "F32":
+		out = make([]float32, st.size/4)
+		if err := binary.Read(br, binary.LittleEndian, out); err != nil {
+			return nil, err
+		}
+	case "F16":
+		raw := make([]uint16, st.size/2)
+		if err := binary.Read(br, binary.LittleEndian, raw); err != nil {
+			return nil, err
+		}
+		out = make([]float32, len(raw))
+		for i, v := range raw {
+			out[i] = float16.Frombits(v).Float32()
+		}
+	case "BF16":
+		raw := make([]uint8, st.size)
+		if err := binary.Read(br, binary.LittleEndian, raw); err != nil {
+			return nil, err
+		}
+		out = bfloat16.DecodeFloat32(raw)
+	case "F8_E4M3":
+		raw := make([]uint8, st.size)
+		if err := binary.Read(br, binary.LittleEndian, raw); err != nil {
+			return nil, err
+		}
+		out, err = st.decodeFP8E4M3(raw)
+		if err != nil {
+			return nil, err
+		}
+	default:
+		return nil, fmt.Errorf("unsupported safetensor dtype %q", st.dtype)
+	}
+
+	if st.repacker != nil {
+		out, err = st.repacker(st.Name(), out, st.Shape())
+		if err != nil {
+			return nil, err
+		}
+	}
+	return out, nil
+}
+
 func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
 	var out []*ggml.Tensor

@@ -398,6 +957,13 @@ func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
 		name := t.Name()
 		shape := t.Shape()

+		if names := q.mtpTensorNames(name); len(names) > 0 {
+			for _, name := range names {
+				out = q.appendDirectTensor(out, t, name)
+			}
+			continue
+		}
+
 		if strings.HasSuffix(name, ".ssm_in.weight") {
 			if qkv, gate, ok := q.splitQKVZTensor(t); ok {
 				out = append(out, qkv, gate)
@@ -464,7 +1030,7 @@ func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
 			}
 			out = append(out, &ggml.Tensor{Name: name, Kind: t.Kind(), Shape: slices.Clone(shape), WriterTo: t})

-		case strings.HasSuffix(name, ".ssm_dt"):
+		case strings.HasSuffix(name, ".ssm_dt"), strings.HasSuffix(name, ".ssm_dt.bias"):
 			if q.shouldReorderVHeads() {
 				t.SetRepacker(q.repackReorderDim(0, 1))
 			}
@@ -499,6 +1065,73 @@ func (q *qwen3NextModel) Tensors(ts []Tensor) []*ggml.Tensor {
 	return out
 }

+func (q *qwen3NextModel) appendDirectTensor(out []*ggml.Tensor, t Tensor, name string) []*ggml.Tensor {
+	if qwen3NextShouldShiftNorm(name) {
+		t = t.Clone()
+		t.SetRepacker(q.addOne)
+	}
+	return append(out, &ggml.Tensor{Name: name, Kind: t.Kind(), Shape: slices.Clone(t.Shape()), WriterTo: t})
+}
+
+func qwen3NextShouldShiftNorm(name string) bool {
+	if strings.HasSuffix(name, ".ssm_norm.weight") {
+		return false
+	}
+	return strings.HasSuffix(name, "_norm.weight") ||
+		strings.HasSuffix(name, ".nextn.enorm.weight") ||
+		strings.HasSuffix(name, ".nextn.hnorm.weight")
+}
+
+func (q *qwen3NextModel) mtpTensorNames(name string) []string {
+	if !strings.HasPrefix(name, "mtp.") {
+		return nil
+	}
+
+	base := q.NumHiddenLayers
+	nextn := q.NumNextNPredictLayers
+	if nextn == 0 {
+		nextn = 1
+	}
+
+	if rest := strings.TrimPrefix(name, "mtp.layers."); rest != name {
+		layer, suffix, ok := strings.Cut(rest, ".")
+		if !ok {
+			return nil
+		}
+		idx, err := strconv.ParseUint(layer, 10, 32)
+		if err != nil {
+			return nil
+		}
+		return []string{fmt.Sprintf("blk.%d.%s", base+uint32(idx), suffix)}
+	}
+
+	var suffix string
+	switch name {
+	case "mtp.fc.weight":
+		suffix = "nextn.eh_proj.weight"
+	case "mtp.pre_fc_norm_embedding.weight":
+		suffix = "nextn.enorm.weight"
+	case "mtp.pre_fc_norm_hidden.weight":
+		suffix = "nextn.hnorm.weight"
+	case "mtp.norm.weight":
+		suffix = "nextn.shared_head_norm.weight"
+	case "mtp.embed_tokens.weight":
+		suffix = "nextn.embed_tokens.weight"
+	case "mtp.shared_head.head.weight":
+		suffix = "nextn.shared_head_head.weight"
+	case "mtp.shared_head.norm.weight":
+		suffix = "nextn.shared_head_norm.weight"
+	default:
+		return nil
+	}
+
+	names := make([]string, 0, nextn)
+	for i := range nextn {
+		names = append(names, fmt.Sprintf("blk.%d.%s", base+i, suffix))
+	}
+	return names
+}
+
 func (q *qwen3NextModel) repackReorderDim(dim, headDim int) Repacker {
 	return func(_ string, data []float32, shape []uint64) ([]float32, error) {
 		if !q.shouldReorderVHeads() {
@@ -925,7 +1558,7 @@ func (q *qwen3NextModel) Replacements() []string {
 		"linear_attn.in_proj_b", "ssm_beta",

 		"linear_attn.conv1d", "ssm_conv1d",
-		"linear_attn.dt_bias", "ssm_dt",
+		"linear_attn.dt_bias", "ssm_dt.bias",
 		"linear_attn.dt_proj", "ssm_dt",
 		"linear_attn.A_log", "ssm_a",
 		"linear_attn.norm", "ssm_norm",
--- a/convert/convert_qwen3next_test.go
+++ b/convert/convert_qwen3next_test.go
@@ -4,10 +4,12 @@ import (
 	"bytes"
 	"encoding/binary"
 	"os"
+	"path/filepath"
 	"slices"
 	"strings"
 	"testing"

+	"github.com/d4l3k/go-bfloat16"
 	"github.com/ollama/ollama/fs/ggml"
 )

@@ -106,11 +108,7 @@ func TestQwen3NextKVLegacyConfig(t *testing.T) {
 		t.Fatalf("unexpected tokenizer pre: got %v want %v", got, want)
 	}

-	headCountKV, ok := kv["attention.head_count_kv"].([]uint32)
-	if !ok {
-		t.Fatalf("attention.head_count_kv has unexpected type: %T", kv["attention.head_count_kv"])
-	}
-	if got, want := headCountKV, []uint32{0, 2, 0, 2}; !slices.Equal(got, want) {
+	if got, want := kv["attention.head_count_kv"], uint32(2); got != want {
 		t.Fatalf("unexpected attention.head_count_kv: got %v want %v", got, want)
 	}

@@ -198,6 +196,7 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
 		VisionModel: qwen3NextVisionConfig{
 			Depth:                  2,
 			HiddenSize:             128,
+			IntermediateSize:       512,
 			NumHeads:               4,
 			InChannels:             3,
 			PatchSize:              16,
@@ -225,11 +224,7 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
 		t.Fatalf("unexpected architecture: got %v want %v", got, want)
 	}

-	headCountKV, ok := kv["attention.head_count_kv"].([]uint32)
-	if !ok {
-		t.Fatalf("attention.head_count_kv has unexpected type: %T", kv["attention.head_count_kv"])
-	}
-	if got, want := headCountKV, []uint32{0, 4, 0, 4}; !slices.Equal(got, want) {
+	if got, want := kv["attention.head_count_kv"], uint32(4); got != want {
 		t.Fatalf("unexpected attention.head_count_kv: got %v want %v", got, want)
 	}

@@ -248,7 +243,7 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
 	if !ok {
 		t.Fatalf("rope.dimension_sections has unexpected type: %T", kv["rope.dimension_sections"])
 	}
-	if got, want := ropeSections, []int32{11, 11, 10}; !slices.Equal(got, want) {
+	if got, want := ropeSections, []int32{11, 11, 10, 0}; !slices.Equal(got, want) {
 		t.Fatalf("unexpected rope.dimension_sections: got %v want %v", got, want)
 	}

@@ -259,6 +254,254 @@ func TestQwen35KVFromTextConfig(t *testing.T) {
 	if got, want := kv["vision.block_count"], uint32(2); got != want {
 		t.Fatalf("unexpected vision.block_count: got %v want %v", got, want)
 	}
+	if got, want := kv["vision.feed_forward_length"], uint32(512); got != want {
+		t.Fatalf("unexpected vision.feed_forward_length: got %v want %v", got, want)
+	}
+}
+
+func TestQwen35MTPTensors(t *testing.T) {
+	m := &qwen3NextModel{
+		ModelParameters: ModelParameters{
+			ModelType: "qwen3_5",
+		},
+		qwen3NextTextConfig: qwen3NextTextConfig{
+			NumHiddenLayers:       32,
+			NumNextNPredictLayers: 1,
+		},
+	}
+
+	kv := m.KV(&Tokenizer{Vocabulary: &Vocabulary{}})
+	if got, want := kv["block_count"], uint32(33); got != want {
+		t.Fatalf("unexpected block_count: got %v want %v", got, want)
+	}
+	if got, want := kv["nextn_predict_layers"], uint32(1); got != want {
+		t.Fatalf("unexpected nextn_predict_layers: got %v want %v", got, want)
+	}
+
+	tensors := m.Tensors([]Tensor{
+		&fakeTensor{name: "mtp.fc.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
+		&fakeTensor{name: "mtp.pre_fc_norm_embedding.weight", shape: []uint64{2}, data: []float32{0, 1}},
+		&fakeTensor{name: "mtp.pre_fc_norm_hidden.weight", shape: []uint64{2}, data: []float32{0, 1}},
+		&fakeTensor{name: "mtp.norm.weight", shape: []uint64{2}, data: []float32{0, 1}},
+		&fakeTensor{name: "mtp.layers.0.attn_q.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
+		&fakeTensor{name: "mtp.layers.0.ffn_down.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
+	})
+
+	byName := map[string]*ggml.Tensor{}
+	for _, tensor := range tensors {
+		byName[tensor.Name] = tensor
+	}
+
+	for _, name := range []string{
+		"blk.32.nextn.eh_proj.weight",
+		"blk.32.nextn.enorm.weight",
+		"blk.32.nextn.hnorm.weight",
+		"blk.32.nextn.shared_head_norm.weight",
+		"blk.32.attn_q.weight",
+		"blk.32.ffn_down.weight",
+	} {
+		if _, ok := byName[name]; !ok {
+			t.Fatalf("missing MTP tensor %q", name)
+		}
+	}
+
+	for _, name := range []string{
+		"blk.32.nextn.enorm.weight",
+		"blk.32.nextn.hnorm.weight",
+		"blk.32.nextn.shared_head_norm.weight",
+	} {
+		if got, want := readTensorData(t, byName[name]), []float32{1, 2}; !slices.Equal(got, want) {
+			t.Fatalf("unexpected shifted norm values for %s: got %v want %v", name, got, want)
+		}
+	}
+}
+
+func TestQwen35NativeSplitKV(t *testing.T) {
+	m := &qwen3NextModel{
+		ModelParameters: ModelParameters{
+			ModelType: "qwen3_5",
+		},
+		TextConfig: &qwen3NextTextConfig{
+			MaxPositionEmbeddings: 16384,
+			HiddenSize:            2560,
+			NumHiddenLayers:       4,
+			IntermediateSize:      9216,
+			NumAttentionHeads:     16,
+			NumKeyValueHeads:      4,
+			HeadDim:               256,
+			RMSNormEPS:            1e-6,
+			FullAttentionInterval: 2,
+			LinearConvKernelDim:   4,
+			LinearKeyHeadDim:      128,
+			LinearNumKeyHeads:     16,
+			LinearNumValueHeads:   32,
+			LinearValueHeadDim:    128,
+			RopeParameters: qwen3NextRopeParams{
+				MRopeInterleaved:    true,
+				MropeSection:        []int32{11, 11, 10},
+				RopeTheta:           10_000_000,
+				PartialRotaryFactor: 0.25,
+			},
+		},
+		VisionModel: qwen3NextVisionConfig{
+			Depth:                 24,
+			HiddenSize:            1024,
+			IntermediateSize:      4096,
+			NumHeads:              16,
+			NumPositionEmbeddings: 2304,
+			InChannels:            3,
+			OutHiddenSize:         2560,
+			PatchSize:             16,
+			SpatialMergeSize:      2,
+		},
+		ImageTokenID:       248056,
+		VisionStartTokenID: 248053,
+		VisionEndTokenID:   248054,
+	}
+	m.VisionModel.ImageMean = []float32{0.5, 0.5, 0.5}
+	m.VisionModel.ImageStd = []float32{0.5, 0.5, 0.5}
+
+	if err := m.parseMore(os.DirFS(t.TempDir())); err != nil {
+		t.Fatal(err)
+	}
+
+	textKV := m.TextKV(&Tokenizer{Vocabulary: &Vocabulary{}})
+	for _, key := range []string{
+		"vision.block_count",
+		"image_token_id",
+		"vision_start_token_id",
+		"vision_end_token_id",
+		"mrope_sections",
+		"rope.mrope_section",
+		"rope.mrope_interleaved",
+		"ssm.v_head_reordered",
+	} {
+		if _, ok := textKV[key]; ok {
+			t.Fatalf("TextKV retained %q", key)
+		}
+	}
+	if got, want := textKV["rope.dimension_sections"], []int32{11, 11, 10, 0}; !slices.Equal(got.([]int32), want) {
+		t.Fatalf("unexpected rope.dimension_sections: got %v want %v", got, want)
+	}
+
+	projectorKV := m.ProjectorKV(&Tokenizer{Vocabulary: &Vocabulary{}})
+	if got, want := projectorKV["general.architecture"], "clip"; got != want {
+		t.Fatalf("unexpected projector architecture: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.projector_type"], "qwen3vl_merger"; got != want {
+		t.Fatalf("unexpected projector type: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.vision.feed_forward_length"], uint32(4096); got != want {
+		t.Fatalf("unexpected projector feed_forward_length: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.vision.image_size"], uint32(768); got != want {
+		t.Fatalf("unexpected projector image_size: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.vision.projection_dim"], uint32(2560); got != want {
+		t.Fatalf("unexpected projector projection_dim: got %v want %v", got, want)
+	}
+}
+
+func TestQwen35ProjectorTensors(t *testing.T) {
+	m := &qwen3NextModel{
+		VisionModel: qwen3NextVisionConfig{Depth: 1},
+	}
+
+	patch := &fakeTensor{
+		name:  "v.patch_embed.weight",
+		shape: []uint64{2, 2, 2, 1, 2},
+		data:  []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
+	}
+	tensors := m.ProjectorTensors([]Tensor{
+		patch,
+		&fakeTensor{name: "v.pos_embed.weight", shape: []uint64{4, 2}, data: []float32{0, 1, 2, 3, 4, 5, 6, 7}},
+		&fakeTensor{name: "v.blk.0.attn_qkv.weight", shape: []uint64{6, 2}, data: make([]float32, 12), sourceDType: "BF16", kind: tensorKindFP16},
+		&fakeTensor{name: "v.blk.0.mlp.linear_fc1.weight", shape: []uint64{8, 2}, data: make([]float32, 16), sourceDType: "BF16", kind: tensorKindFP16},
+		&fakeTensor{name: "token_embd.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
+		&fakeTensor{name: "mtp.fc.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
+	})
+
+	byName := map[string]*ggml.Tensor{}
+	for _, tensor := range tensors {
+		byName[tensor.Name] = tensor
+	}
+
+	if _, ok := byName["token_embd.weight"]; ok {
+		t.Fatalf("projector tensors included text tensor")
+	}
+	if _, ok := byName["mtp.fc.weight"]; ok {
+		t.Fatalf("projector tensors included MTP tensor")
+	}
+	if got := byName["v.position_embd.weight"]; got == nil || got.Kind != tensorKindFP32 {
+		t.Fatalf("position embedding was not promoted to F32: %#v", got)
+	}
+	if got := byName["v.blk.0.attn_qkv.weight"]; got == nil {
+		t.Fatalf("attn_qkv tensor missing")
+	} else if got.Kind != tensorKindBF16 {
+		t.Fatalf("attn_qkv tensor was not preserved as BF16: %#v", got)
+	}
+	if got := byName["v.blk.0.ffn_up.weight"]; got == nil {
+		t.Fatalf("ffn_up tensor missing")
+	} else if got.Kind != tensorKindBF16 {
+		t.Fatalf("ffn_up tensor was not preserved as BF16: %#v", got)
+	}
+
+	first := byName["v.patch_embd.weight"]
+	if first == nil {
+		t.Fatalf("first patch embedding slice missing")
+	}
+	if got, want := first.Shape, []uint64{2, 2, 1, 2}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected first patch shape: got %v want %v", got, want)
+	}
+	if got, want := readTensorData(t, first), []float32{0, 1, 4, 5, 8, 9, 12, 13}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected first patch data: got %v want %v", got, want)
+	}
+
+	second := byName["v.patch_embd.weight.1"]
+	if second == nil {
+		t.Fatalf("second patch embedding slice missing")
+	}
+	if got, want := readTensorData(t, second), []float32{2, 3, 6, 7, 10, 11, 14, 15}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected second patch data: got %v want %v", got, want)
+	}
+}
+
+func TestQwen35BF16ProjectorWriterPreservesSource(t *testing.T) {
+	dir := t.TempDir()
+	path := filepath.Join(dir, "tensor.bin")
+	values := []float32{1, -2, 3.5, 4.25}
+	raw := bfloat16.EncodeFloat32(values)
+	if err := os.WriteFile(path, raw, 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	st := safetensor{
+		fs:     os.DirFS(dir),
+		path:   "tensor.bin",
+		dtype:  "BF16",
+		offset: 0,
+		size:   int64(len(raw)),
+		tensorBase: &tensorBase{
+			name:  "v.blk.0.attn_qkv.weight",
+			shape: []uint64{2, 2},
+		},
+	}
+	tensor := &ggml.Tensor{
+		Name:     "v.blk.0.attn_qkv.weight",
+		Kind:     tensorKindBF16,
+		Shape:    []uint64{2, 2},
+		WriterTo: tensorBF16Writer{tensor: st},
+	}
+
+	var got bytes.Buffer
+	if n, err := tensor.WriteTo(&got); err != nil {
+		t.Fatal(err)
+	} else if n != int64(len(raw)) {
+		t.Fatalf("unexpected byte count: got %d want %d", n, len(raw))
+	}
+	if !bytes.Equal(got.Bytes(), raw) {
+		t.Fatalf("BF16 writer changed source bytes: got %x want %x", got.Bytes(), raw)
+	}
 }

 func TestQwen3NextReplacements(t *testing.T) {
@@ -273,6 +516,12 @@ func TestQwen3NextReplacements(t *testing.T) {
 	if got, want := r.Replace("model.layers.1.linear_attn.in_proj_qkvz.weight"), "blk.1.ssm_in.weight"; got != want {
 		t.Fatalf("unexpected legacy replacement: got %q want %q", got, want)
 	}
+	if got, want := r.Replace("model.layers.1.linear_attn.dt_bias"), "blk.1.ssm_dt.bias"; got != want {
+		t.Fatalf("unexpected dt bias replacement: got %q want %q", got, want)
+	}
+	if got, want := r.Replace("model.layers.1.linear_attn.dt_proj.weight"), "blk.1.ssm_dt.weight"; got != want {
+		t.Fatalf("unexpected dt projection replacement: got %q want %q", got, want)
+	}
 }

 func TestQwen35ReordersVHeads(t *testing.T) {
@@ -399,6 +648,33 @@ func TestQwen35ReordersSsmBetaRows(t *testing.T) {
 	}
 }

+func TestQwen35ReordersSsmDtBias(t *testing.T) {
+	m := &qwen3NextModel{
+		ModelParameters: ModelParameters{
+			ModelType: "qwen3_5",
+		},
+		qwen3NextTextConfig: qwen3NextTextConfig{
+			LinearNumKeyHeads:   2,
+			LinearNumValueHeads: 4,
+		},
+	}
+
+	out := m.Tensors([]Tensor{
+		&fakeTensor{
+			name:  "blk.0.ssm_dt.bias",
+			shape: []uint64{4},
+			data:  []float32{0, 1, 2, 3},
+		},
+	})
+	if len(out) != 1 {
+		t.Fatalf("unexpected output tensor count: got %d want 1", len(out))
+	}
+
+	if got, want := readTensorData(t, out[0]), []float32{0, 2, 1, 3}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected ssm_dt.bias data: got %v want %v", got, want)
+	}
+}
+
 func TestQwen35ReordersConv1DChannelDim(t *testing.T) {
 	m := &qwen3NextModel{
 		ModelParameters: ModelParameters{
--- a/convert/convert_qwen3vl.go
+++ b/convert/convert_qwen3vl.go
@@ -3,8 +3,13 @@ package convert
 import (
 	"cmp"
 	"encoding/json"
+	"fmt"
+	"io"
 	"io/fs"
+	"math"
+	"regexp"
 	"slices"
+	"strconv"
 	"strings"

 	"github.com/ollama/ollama/fs/ggml"
@@ -25,6 +30,9 @@ type qwen3VLModel struct {
 		RopeTheta              float32 `json:"rope_theta"`
 		TemporalPatchSize      uint32  `json:"temporal_patch_size"`
 		DeepstackVisualIndexes []int32 `json:"deepstack_visual_indexes"`
+		IntermediateSize       uint32  `json:"intermediate_size"`
+		OutHiddenSize          uint32  `json:"out_hidden_size"`
+		NumPositionEmbeddings  uint32  `json:"num_position_embeddings"`

 		Size struct {
 			ShortestEdge uint32 `json:"shortest_edge"`
@@ -36,6 +44,8 @@ type qwen3VLModel struct {
 	} `json:"vision_config"`
 }

+var _ MultimodalConverter = (*qwen3VLModel)(nil)
+
 func (m *qwen3VLModel) parseMore(fsys fs.FS) error {
 	bts, err := fs.ReadFile(fsys, "preprocessor_config.json")
 	if err != nil {
@@ -55,8 +65,20 @@ func (m *qwen3VLModel) KV(t *Tokenizer) KV {
 	// override architecture
 	kv["general.architecture"] = arch

+	if sections := m.RopeScaling.MropeSection; len(sections) > 0 {
+		dimensionSections := append([]int32(nil), sections...)
+		if len(dimensionSections) == 3 {
+			dimensionSections = append(dimensionSections, 0)
+		}
+		kv["rope.dimension_sections"] = dimensionSections
+	}
+	kv["n_deepstack_layers"] = uint32(len(m.VisionModel.DeepstackVisualIndexes))
+
 	kv["vision.block_count"] = cmp.Or(m.VisionModel.Depth, 32)
 	kv["vision.embedding_length"] = m.VisionModel.HiddenSize
+	if m.VisionModel.IntermediateSize > 0 {
+		kv["vision.feed_forward_length"] = m.VisionModel.IntermediateSize
+	}
 	kv["vision.attention.head_count"] = cmp.Or(m.VisionModel.NumHeads, 16)
 	kv["vision.num_channels"] = m.VisionModel.InChannels
 	kv["vision.patch_size"] = cmp.Or(m.VisionModel.PatchSize, 14)
@@ -75,6 +97,234 @@ func (m *qwen3VLModel) KV(t *Tokenizer) KV {
 	return kv
 }

+func (m *qwen3VLModel) TextKV(t *Tokenizer) KV {
+	kv := m.KV(t)
+	for _, key := range []string{
+		"vision.block_count",
+		"vision.embedding_length",
+		"vision.feed_forward_length",
+		"vision.attention.head_count",
+		"vision.num_channels",
+		"vision.patch_size",
+		"vision.spatial_merge_size",
+		"vision.attention.layer_norm_epsilon",
+		"vision.rope.freq_base",
+		"vision.temporal_patch_size",
+		"vision.deepstack_visual_indexes",
+		"vision.shortest_edge",
+		"vision.longest_edge",
+		"vision.image_mean",
+		"vision.image_std",
+		"rope.mrope_section",
+	} {
+		delete(kv, key)
+	}
+
+	return kv
+}
+
+func (m *qwen3VLModel) ProjectorKV(*Tokenizer) KV {
+	depth := cmp.Or(m.VisionModel.Depth, uint32(32))
+	deepstack := make([]bool, depth)
+	for _, idx := range m.VisionModel.DeepstackVisualIndexes {
+		if idx >= 0 && uint32(idx) < depth {
+			deepstack[idx] = true
+		}
+	}
+
+	projectionDim := m.VisionModel.OutHiddenSize
+	if projectionDim == 0 {
+		projectionDim = m.HiddenSize
+	}
+	layerNormEps := m.VisionModel.RMSNormEps
+	if layerNormEps == 0 {
+		layerNormEps = 1e-6
+	}
+
+	kv := KV{
+		"general.architecture":                     "clip",
+		"general.type":                             "mmproj",
+		"general.file_type":                        uint32(1),
+		"general.quantization_version":             uint32(2),
+		"clip.has_vision_encoder":                  true,
+		"clip.projector_type":                      "qwen3vl_merger",
+		"clip.use_gelu":                            true,
+		"clip.vision.block_count":                  depth,
+		"clip.vision.embedding_length":             m.VisionModel.HiddenSize,
+		"clip.vision.feed_forward_length":          cmp.Or(m.VisionModel.IntermediateSize, m.VisionModel.HiddenSize*4),
+		"clip.vision.attention.head_count":         cmp.Or(m.VisionModel.NumHeads, uint32(16)),
+		"clip.vision.attention.layer_norm_epsilon": layerNormEps,
+		"clip.vision.num_channels":                 m.VisionModel.InChannels,
+		"clip.vision.patch_size":                   cmp.Or(m.VisionModel.PatchSize, uint32(14)),
+		"clip.vision.spatial_merge_size":           cmp.Or(m.VisionModel.SpatialMergeSize, uint32(2)),
+		"clip.vision.image_size":                   m.projectorImageSize(),
+		"clip.vision.projection_dim":               projectionDim,
+		"clip.vision.temporal_patch_size":          cmp.Or(m.VisionModel.TemporalPatchSize, uint32(2)),
+		"clip.vision.rope.freq_base":               cmp.Or(m.VisionModel.RopeTheta, float32(1e4)),
+		"clip.vision.is_deepstack_layers":          deepstack,
+	}
+	if m.VisionModel.Size.ShortestEdge > 0 {
+		kv["clip.vision.image_min_pixels"] = m.VisionModel.Size.ShortestEdge
+	}
+	if m.VisionModel.Size.LongestEdge > 0 {
+		kv["clip.vision.image_max_pixels"] = m.VisionModel.Size.LongestEdge
+	}
+	if len(m.VisionModel.ImageMean) == 3 {
+		kv["clip.vision.image_mean"] = m.VisionModel.ImageMean
+	}
+	if len(m.VisionModel.ImageStd) == 3 {
+		kv["clip.vision.image_std"] = m.VisionModel.ImageStd
+	}
+
+	return kv
+}
+
+func (m *qwen3VLModel) projectorImageSize() uint32 {
+	if m.VisionModel.NumPositionEmbeddings > 0 && m.VisionModel.PatchSize > 0 {
+		root := uint32(math.Sqrt(float64(m.VisionModel.NumPositionEmbeddings)))
+		if root*root == m.VisionModel.NumPositionEmbeddings {
+			return root * m.VisionModel.PatchSize
+		}
+	}
+	return uint32(768)
+}
+
+func qwen3VLVisionTensor(name string) bool {
+	return strings.HasPrefix(name, "v.") || strings.HasPrefix(name, "mm.")
+}
+
+func (m *qwen3VLModel) TextTensors(ts []Tensor, _ *Tokenizer) []*ggml.Tensor {
+	var textOnly []Tensor
+	for _, t := range ts {
+		if qwen3VLVisionTensor(t.Name()) {
+			continue
+		}
+		textOnly = append(textOnly, t)
+	}
+
+	return m.qwen3Model.Tensors(textOnly)
+}
+
+func (m *qwen3VLModel) qwen3VLProjectorRename(name string) string {
+	if strings.HasPrefix(name, "v.merger.") {
+		name = strings.Replace(name, "v.merger.linear_fc1", "mm.0", 1)
+		name = strings.Replace(name, "v.merger.linear_fc2", "mm.2", 1)
+		name = strings.Replace(name, "v.merger.norm", "v.post_ln", 1)
+		return name
+	}
+
+	if strings.HasPrefix(name, "v.deepstack.") {
+		re := regexp.MustCompile(`^v\.deepstack\.(\d+)\.(.+)$`)
+		if matches := re.FindStringSubmatch(name); matches != nil {
+			seqIdx, err := strconv.Atoi(matches[1])
+			if err == nil && seqIdx < len(m.VisionModel.DeepstackVisualIndexes) {
+				blockIdx := m.VisionModel.DeepstackVisualIndexes[seqIdx]
+				suffix := matches[2]
+				suffix = strings.Replace(suffix, "linear_fc1", "fc1", 1)
+				suffix = strings.Replace(suffix, "linear_fc2", "fc2", 1)
+				return fmt.Sprintf("v.deepstack.%d.%s", blockIdx, suffix)
+			}
+		}
+	}
+
+	return name
+}
+
+func (m *qwen3VLModel) ProjectorTensors(ts []Tensor) []*ggml.Tensor {
+	var out []*ggml.Tensor
+
+	for _, t := range ts {
+		if !qwen3VLVisionTensor(t.Name()) {
+			continue
+		}
+
+		name := m.qwen3VLProjectorRename(t.Name())
+		if name == "v.patch_embd.weight" {
+			out = append(out, m.qwen3VLPatchEmbedTensors(t)...)
+			continue
+		}
+
+		kind := t.Kind()
+		var writer io.WriterTo = t
+		if name == "v.position_embd.weight" {
+			kind = tensorKindFP32
+			writer = tensorFloat32Writer{tensor: t}
+		} else if sourceDType(t) == "BF16" && kind == tensorKindFP16 {
+			kind = tensorKindBF16
+			writer = tensorBF16Writer{tensor: t}
+		}
+
+		out = append(out, &ggml.Tensor{
+			Name:     name,
+			Kind:     kind,
+			Shape:    slices.Clone(t.Shape()),
+			WriterTo: writer,
+		})
+	}
+
+	return out
+}
+
+func (m *qwen3VLModel) qwen3VLPatchEmbedTensors(t Tensor) []*ggml.Tensor {
+	shape := t.Shape()
+	if len(shape) != 5 || shape[2] != 2 {
+		return nil
+	}
+
+	outShape := []uint64{shape[0], shape[1], shape[3], shape[4]}
+	return []*ggml.Tensor{
+		{
+			Name:     "v.patch_embd.weight",
+			Kind:     tensorKindFP32,
+			Shape:    slices.Clone(outShape),
+			WriterTo: tensorFloat32Writer{tensor: t, repacker: qwenTemporalPatchEmbedSlice(0)},
+		},
+		{
+			Name:     "v.patch_embd.weight.1",
+			Kind:     tensorKindFP32,
+			Shape:    slices.Clone(outShape),
+			WriterTo: tensorFloat32Writer{tensor: t, repacker: qwenTemporalPatchEmbedSlice(1)},
+		},
+	}
+}
+
+func qwenTemporalPatchEmbedSlice(slice int) Repacker {
+	return func(_ string, data []float32, shape []uint64) ([]float32, error) {
+		if len(shape) != 5 || shape[2] != 2 {
+			return nil, fmt.Errorf("qwen temporal patch embedding shape %v", shape)
+		}
+
+		outChannels := int(shape[0])
+		inChannels := int(shape[1])
+		frames := int(shape[2])
+		height := int(shape[3])
+		width := int(shape[4])
+		if slice < 0 || slice >= frames {
+			return nil, fmt.Errorf("qwen temporal patch embedding slice %d out of range", slice)
+		}
+
+		expected := outChannels * inChannels * frames * height * width
+		if len(data) != expected {
+			return nil, fmt.Errorf("qwen temporal patch embedding data size %d, expected %d", len(data), expected)
+		}
+
+		out := make([]float32, outChannels*inChannels*height*width)
+		for oc := range outChannels {
+			for ic := range inChannels {
+				for y := range height {
+					for x := range width {
+						src := ((((oc*inChannels+ic)*frames+slice)*height + y) * width) + x
+						dst := (((oc*inChannels+ic)*height + y) * width) + x
+						out[dst] = data[src]
+					}
+				}
+			}
+		}
+
+		return out, nil
+	}
+}
+
 func (m *qwen3VLModel) Tensors(ts []Tensor) []*ggml.Tensor {
 	var rest []Tensor
 	var out []*ggml.Tensor
@@ -107,10 +357,15 @@ func (m *qwen3VLModel) Replacements() []string {
 		m.qwen3Model.Replacements(),
 		"model.language_", "",
 		"model.visual", "v",
-		"patch_embed.proj", "patch_embed",
+		"patch_embed.proj", "patch_embd",
+		"pos_embed", "position_embd",
 		"blocks", "blk",
 		"attn.qkv", "attn_qkv",
 		"attn.proj", "attn_out",
-		"deepstack_merger_list", "deepstack_merger",
+		"norm1", "ln1",
+		"norm2", "ln2",
+		"mlp.linear_fc1", "ffn_up",
+		"mlp.linear_fc2", "ffn_down",
+		"deepstack_merger_list", "deepstack",
 	)
 }
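
Note on the temporal patch split in convert_qwen3vl.go above: the repacker takes a row-major [out, in, frames, height, width] buffer and keeps one frame, giving a dense [out, in, height, width] buffer. The following is a minimal standalone sketch of that indexing, not the code from this change; the name `sliceTemporalFrame` and the fixed-size shape array are assumptions made for illustration. It copies whole height*width planes rather than single elements, which should be equivalent for contiguous row-major data.

```go
package main

import "fmt"

// sliceTemporalFrame is a hypothetical helper: it copies one temporal frame
// out of a row-major [out, in, frames, height, width] buffer and returns a
// dense [out, in, height, width] buffer.
func sliceTemporalFrame(data []float32, shape [5]int, frame int) ([]float32, error) {
	out, in, frames, h, w := shape[0], shape[1], shape[2], shape[3], shape[4]
	if frame < 0 || frame >= frames {
		return nil, fmt.Errorf("frame %d out of range [0,%d)", frame, frames)
	}
	if len(data) != out*in*frames*h*w {
		return nil, fmt.Errorf("data size %d does not match shape %v", len(data), shape)
	}

	plane := h * w
	dst := make([]float32, 0, out*in*plane)
	// Each (out, in) channel pair owns `frames` consecutive planes in the
	// source; pick the requested plane from each pair.
	for c := 0; c < out*in; c++ {
		start := (c*frames + frame) * plane
		dst = append(dst, data[start:start+plane]...)
	}
	return dst, nil
}

func main() {
	data := make([]float32, 16)
	for i := 0; i < len(data); i++ {
		data[i] = float32(i)
	}
	shape := [5]int{2, 2, 2, 1, 2}

	for frame := 0; frame < 2; frame++ {
		got, err := sliceTemporalFrame(data, shape, frame)
		if err != nil {
			panic(err)
		}
		fmt.Println(frame, got)
	}
	// prints:
	// 0 [0 1 4 5 8 9 12 13]
	// 1 [2 3 6 7 10 11 14 15]
}
```

The two printed slices match the values asserted for `v.patch_embd.weight` and `v.patch_embd.weight.1` in TestQwen3VLProjectorTensors.
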
--- a/convert/convert_qwen3vl_test.go
+++ b/convert/convert_qwen3vl_test.go
@@ -0,0 +1,147 @@
+package convert
+
+import (
+	"slices"
+	"testing"
+
+	"github.com/ollama/ollama/fs/ggml"
+)
+
+func TestQwen3VLTextAndProjectorKV(t *testing.T) {
+	m := &qwen3VLModel{
+		qwen3Model: qwen3Model{
+			HiddenSize: 2048,
+		},
+	}
+	m.RopeScaling.Type = "mrope"
+	m.RopeScaling.MropeSection = []int32{24, 20, 20}
+	m.VisionModel.Depth = 24
+	m.VisionModel.HiddenSize = 1024
+	m.VisionModel.IntermediateSize = 4096
+	m.VisionModel.OutHiddenSize = 2048
+	m.VisionModel.NumHeads = 16
+	m.VisionModel.InChannels = 3
+	m.VisionModel.PatchSize = 16
+	m.VisionModel.SpatialMergeSize = 2
+	m.VisionModel.NumPositionEmbeddings = 2304
+	m.VisionModel.TemporalPatchSize = 2
+	m.VisionModel.RMSNormEps = 1e-6
+	m.VisionModel.RopeTheta = 10000
+	m.VisionModel.DeepstackVisualIndexes = []int32{5, 11, 17}
+	m.VisionModel.ImageMean = []float32{0.5, 0.5, 0.5}
+	m.VisionModel.ImageStd = []float32{0.5, 0.5, 0.5}
+
+	textKV := m.TextKV(&Tokenizer{Vocabulary: &Vocabulary{}})
+	if got, want := textKV["general.architecture"], "qwen3vl"; got != want {
+		t.Fatalf("unexpected text architecture: got %v want %v", got, want)
+	}
+	if got, want := textKV["rope.dimension_sections"], []int32{24, 20, 20, 0}; !slices.Equal(got.([]int32), want) {
+		t.Fatalf("unexpected rope.dimension_sections: got %v want %v", got, want)
+	}
+	if got, want := textKV["n_deepstack_layers"], uint32(3); got != want {
+		t.Fatalf("unexpected n_deepstack_layers: got %v want %v", got, want)
+	}
+	for _, key := range []string{"vision.block_count", "vision.deepstack_visual_indexes", "rope.mrope_section"} {
+		if _, ok := textKV[key]; ok {
+			t.Fatalf("TextKV retained %q", key)
+		}
+	}
+
+	projectorKV := m.ProjectorKV(&Tokenizer{Vocabulary: &Vocabulary{}})
+	if got, want := projectorKV["general.architecture"], "clip"; got != want {
+		t.Fatalf("unexpected projector architecture: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["general.type"], "mmproj"; got != want {
+		t.Fatalf("unexpected general.type: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.projector_type"], "qwen3vl_merger"; got != want {
+		t.Fatalf("unexpected projector type: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.vision.feed_forward_length"], uint32(4096); got != want {
+		t.Fatalf("unexpected feed_forward_length: got %v want %v", got, want)
+	}
+	if got, want := projectorKV["clip.vision.image_size"], uint32(768); got != want {
+		t.Fatalf("unexpected image_size: got %v want %v", got, want)
+	}
+	mask, ok := projectorKV["clip.vision.is_deepstack_layers"].([]bool)
+	if !ok {
+		t.Fatalf("deepstack mask has unexpected type: %T", projectorKV["clip.vision.is_deepstack_layers"])
+	}
+	if len(mask) != 24 || !mask[5] || !mask[11] || !mask[17] {
+		t.Fatalf("unexpected deepstack mask: %v", mask)
+	}
+}
+
+func TestQwen3VLProjectorTensors(t *testing.T) {
+	m := &qwen3VLModel{}
+	m.VisionModel.DeepstackVisualIndexes = []int32{5, 11, 17}
+
+	tensors := m.ProjectorTensors([]Tensor{
+		&fakeTensor{
+			name:  "v.patch_embd.weight",
+			shape: []uint64{2, 2, 2, 1, 2},
+			data:  []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
+		},
+		&fakeTensor{name: "v.position_embd.weight", shape: []uint64{4, 2}, data: []float32{0, 1, 2, 3, 4, 5, 6, 7}},
+		&fakeTensor{name: "v.merger.linear_fc1.weight", shape: []uint64{4, 2}, data: make([]float32, 8)},
+		&fakeTensor{name: "v.merger.linear_fc2.bias", shape: []uint64{4}, data: make([]float32, 4)},
+		&fakeTensor{name: "v.merger.norm.weight", shape: []uint64{2}, data: make([]float32, 2)},
+		&fakeTensor{name: "v.deepstack.0.linear_fc1.weight", shape: []uint64{4, 2}, data: make([]float32, 8)},
+		&fakeTensor{name: "v.deepstack.1.norm.bias", shape: []uint64{2}, data: make([]float32, 2)},
+		&fakeTensor{name: "v.blk.0.attn_qkv.weight", shape: []uint64{6, 2}, data: make([]float32, 12), sourceDType: "BF16", kind: tensorKindFP16},
+		&fakeTensor{name: "token_embd.weight", shape: []uint64{2, 2}, data: make([]float32, 4)},
+	})
+
+	byName := map[string]uint32{}
+	for _, tensor := range tensors {
+		byName[tensor.Name] = tensor.Kind
+	}
+
+	if _, ok := byName["token_embd.weight"]; ok {
+		t.Fatalf("projector tensors included text tensor")
+	}
+	if got := byName["v.position_embd.weight"]; got != tensorKindFP32 {
+		t.Fatalf("position embedding was not promoted to F32: %d", got)
+	}
+	if got := byName["v.blk.0.attn_qkv.weight"]; got != tensorKindBF16 {
+		t.Fatalf("BF16 projector tensor was not preserved: %d", got)
+	}
+	for _, name := range []string{
+		"mm.0.weight",
+		"mm.2.bias",
+		"v.post_ln.weight",
+		"v.deepstack.5.fc1.weight",
+		"v.deepstack.11.norm.bias",
+	} {
+		if _, ok := byName[name]; !ok {
+			t.Fatalf("missing projector tensor %q", name)
+		}
+	}
+
+	firstTensor := tensorsByName(tensors)["v.patch_embd.weight"]
+	if firstTensor == nil {
+		t.Fatalf("first patch embedding slice missing")
+	}
+	if got, want := firstTensor.Shape, []uint64{2, 2, 1, 2}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected first patch shape: got %v want %v", got, want)
+	}
+	if got, want := readTensorData(t, firstTensor), []float32{0, 1, 4, 5, 8, 9, 12, 13}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected first patch data: got %v want %v", got, want)
+	}
+
+	secondTensor := tensorsByName(tensors)["v.patch_embd.weight.1"]
+	if secondTensor == nil {
+		t.Fatalf("second patch embedding slice missing")
+	}
+	if got, want := readTensorData(t, secondTensor), []float32{2, 3, 6, 7, 10, 11, 14, 15}; !slices.Equal(got, want) {
+		t.Fatalf("unexpected second patch data: got %v want %v", got, want)
+	}
+}
+
+func tensorsByName(tensors []*ggml.Tensor) map[string]*ggml.Tensor {
+	byName := map[string]*ggml.Tensor{}
+	for _, tensor := range tensors {
+		byName[tensor.Name] = tensor
+	}
+	return byName
+}
--- a/convert/tensor_test.go
+++ b/convert/tensor_test.go
@@ -22,6 +22,7 @@ type fakeTensor struct {
 	data  []float32

 	sourceDType string
+	kind        uint32
 	repacker    Repacker
 }

@@ -34,6 +35,9 @@ func (f fakeTensor) Shape() []uint64 {
 }

 func (f fakeTensor) Kind() uint32 {
+	if f.kind != 0 {
+		return f.kind
+	}
 	return 0
 }

@@ -51,6 +55,7 @@ func (f fakeTensor) Clone() Tensor {
 		shape:       slices.Clone(f.shape),
 		data:        slices.Clone(f.data),
 		sourceDType: f.sourceDType,
+		kind:        f.kind,
 		repacker:    f.repacker,
 	}
 }
--- a/convert/tokenizer.go
+++ b/convert/tokenizer.go
@@ -149,6 +149,7 @@ func parseTokenizer(fsys fs.FS, specialTokenTypes []string) (*Tokenizer, error)
 				if err := json.Unmarshal(bts, &sv.AddToken); err != nil {
 					return nil, err
 				}
+				sv.AddTokenSet = true
 			}

 			if bts, ok := p[fmt.Sprintf("%s_token", st)]; ok {
@@ -314,6 +315,10 @@ type SpecialVocabulary struct {
 	ID       int
 	Content  string
 	AddToken bool
+	// AddTokenSet tracks whether tokenizer_config.json explicitly defined the
+	// add_*_token setting. Missing and explicit false have different GGUF
+	// semantics for some tokenizers.
+	AddTokenSet bool

 	// IDs is populated by generation_config.json
 	IDs []int32
--- a/convert/tokenizer_test.go
+++ b/convert/tokenizer_test.go
@@ -184,8 +184,8 @@ func TestParseTokenizer(t *testing.T) {
 				},
 				SpecialVocabulary: []*SpecialVocabulary{
 					{Type: "pad", Content: "<pad>", ID: 0, AddToken: false},
-					{Type: "eos", Content: "<eos>", ID: 1, AddToken: false},
-					{Type: "bos", Content: "<bos>", ID: 2, AddToken: true},
+					{Type: "eos", Content: "<eos>", ID: 1, AddToken: false, AddTokenSet: true},
+					{Type: "bos", Content: "<bos>", ID: 2, AddToken: true, AddTokenSet: true},
 					{Type: "unk", Content: "<unk>", ID: 3, AddToken: false},
 				},
 				Pre: "default",
@@ -380,8 +380,8 @@ func TestParseTokenizer(t *testing.T) {
 					Types:  []int32{3, 3, 3, 3},
 				},
 				SpecialVocabulary: []*SpecialVocabulary{
-					{Type: "eos", Content: "<eos>", ID: 1, IDs: []int32{1, 2, 3}, AddToken: false},
-					{Type: "bos", Content: "<bos>", ID: 0, AddToken: true},
+					{Type: "eos", Content: "<eos>", ID: 1, IDs: []int32{1, 2, 3}, AddToken: false, AddTokenSet: true},
+					{Type: "bos", Content: "<bos>", ID: 0, AddToken: true, AddTokenSet: true},
 				},
 				Pre: "default",
 			},
@@ -423,3 +423,23 @@ func TestParseTokenizer(t *testing.T) {
 		})
 	}
 }
+
+func TestModelParametersKVOmitsMissingAddToken(t *testing.T) {
+	kv := ModelParameters{}.KV(&Tokenizer{
+		Vocabulary: &Vocabulary{Model: "gpt2"},
+		SpecialVocabulary: []*SpecialVocabulary{
+			{Type: "bos", Content: "<bos>", ID: 1},
+			{Type: "eos", Content: "<eos>", ID: 2, AddToken: false, AddTokenSet: true},
+		},
+	})
+
+	if _, ok := kv["tokenizer.ggml.add_bos_token"]; ok {
+		t.Errorf("tokenizer.ggml.add_bos_token should be omitted when add_bos_token is absent")
+	}
+	if got := kv["tokenizer.ggml.bos_token_id"]; got != uint32(1) {
+		t.Errorf("tokenizer.ggml.bos_token_id = %v, want 1", got)
+	}
+	if got, ok := kv["tokenizer.ggml.add_eos_token"]; !ok || got != false {
+		t.Errorf("tokenizer.ggml.add_eos_token = %v, %v; want explicit false", got, ok)
+	}
+}
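
Note on AddTokenSet: the test above expects a missing add_*_token setting to produce no GGUF key, while an explicit false is written out. The ModelParameters.KV implementation is not part of this excerpt, so the following is only a sketch of one way such a tri-state could be emitted. The type `specialToken` and the function `tokenKV` are hypothetical stand-ins, not the converter's real types.

```go
package main

import "fmt"

// specialToken is a hypothetical stand-in for the converter's
// SpecialVocabulary entry; only the fields needed here are modelled.
type specialToken struct {
	Type        string
	ID          int
	AddToken    bool
	AddTokenSet bool // true only when the config explicitly set add_*_token
}

// tokenKV always writes the token id, but writes the add_*_token flag only
// when it was explicitly configured, so "missing" and "false" stay distinct.
func tokenKV(tokens []specialToken) map[string]any {
	kv := map[string]any{}
	for _, tok := range tokens {
		kv["tokenizer.ggml."+tok.Type+"_token_id"] = uint32(tok.ID)
		if tok.AddTokenSet {
			kv["tokenizer.ggml.add_"+tok.Type+"_token"] = tok.AddToken
		}
	}
	return kv
}

func main() {
	kv := tokenKV([]specialToken{
		{Type: "bos", ID: 1},
		{Type: "eos", ID: 2, AddToken: false, AddTokenSet: true},
	})

	_, hasBOS := kv["tokenizer.ggml.add_bos_token"]
	eos, hasEOS := kv["tokenizer.ggml.add_eos_token"]
	fmt.Println(hasBOS, hasEOS, eos) // prints: false true false
}
```

A separate boolean keeps the zero value of the struct meaning "not configured", which is why the existing TestParseTokenizer expectations only gain `AddTokenSet: true` on entries whose config defined the setting.
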
--- a/discover/amd.go
+++ b/discover/amd.go
@@ -0,0 +1,487 @@
+// AMD discovery needs a small amount of backend-specific handling beyond the
+// generic llama-server device list. ROCm devices expose their real capability
+// as gfx targets, and the shipped rocBLAS kernels define which of those
+// targets are actually usable. On Linux, KFD topology and DRM sysfs attributes
+// provide the integrated-vs-discrete signal needed for scheduler decisions. On
+// Windows, older HIP driver installs can also leave ROCm libraries present but
+// too old to support GPU inference. These helpers keep that extra validation
+// and warning logic in one place.
+package discover
+
+import (
+	"bufio"
+	"log/slog"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"regexp"
+	"runtime"
+	"sort"
+	"strconv"
+	"strings"
+
+	"github.com/ollama/ollama/ml"
+)
+
+// gfxTargetRegex matches ROCm stderr lines like:
+//
+//	Device 0: AMD Radeon RX 6700 XT, gfx1031 (0x1031), VMM: no, Wave Size: 32, VRAM: 12272 MiB
+//	Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
+var gfxTargetRegex = regexp.MustCompile(
+	`Device\s+(\d+):.*,\s+(gfx[0-9a-f]+)[\s:(]`,
+)
+
+var pciIDRegex = regexp.MustCompile(`^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$`)
+
+func parseROCmGFXTargets(output string) map[int]string {
+	gfxByIndex := make(map[int]string)
+
+	scanner := bufio.NewScanner(strings.NewReader(output))
+	for scanner.Scan() {
+		if matches := gfxTargetRegex.FindStringSubmatch(scanner.Text()); matches != nil {
+			idx, _ := strconv.Atoi(matches[1])
+			gfxByIndex[idx] = matches[2]
+		}
+	}
+
+	return gfxByIndex
+}
+
+func parseGFXTarget(gfx string) (int, int) {
+	gfx, ok := strings.CutPrefix(gfx, "gfx")
+	if !ok || len(gfx) < 3 {
+		return 0, 0
+	}
+
+	major, err := strconv.ParseInt(gfx[:len(gfx)-2], 16, 32)
+	if err != nil {
+		return 0, 0
+	}
+	minor, err := strconv.ParseInt(gfx[len(gfx)-2:], 16, 32)
+	if err != nil {
+		return 0, 0
+	}
+
+	return int(major), int(minor)
+}
+
+// HSA_OVERRIDE_GFX_VERSION changes the effective HIP/rocBLAS target even
+// though KFD/sysfs still reports the physical ASIC.
+func hsaOverrideGFXTarget() string {
+	return rocmGFXTargetOverride(os.Getenv("HSA_OVERRIDE_GFX_VERSION"))
+}
+
+func rocmGFXTargetOverride(value string) string {
+	value = strings.TrimSpace(value)
+	if value == "" {
+		return ""
+	}
+	if strings.HasPrefix(value, "gfx") {
+		if major, minor := parseGFXTarget(value); major != 0 || minor != 0 {
+			return value
+		}
+		return ""
+	}
+
+	parts := strings.Split(value, ".")
+	if len(parts) != 3 {
+		return ""
+	}
+
+	var digits [3]uint64
+	for i, part := range parts {
+		digit, err := strconv.ParseUint(part, 10, 8)
+		if err != nil || digit > 0xf {
+			return ""
+		}
+		digits[i] = digit
+	}
+
+	return "gfx" +
+		strconv.FormatUint(digits[0], 10) +
+		strconv.FormatUint(digits[1], 16) +
+		strconv.FormatUint(digits[2], 16)
+}
+
+func setROCmGFXTarget(device *ml.DeviceInfo, gfx string) {
+	if gfx == "" || device.Library != "ROCm" {
+		return
+	}
+	device.GFXTarget = gfx
+	device.ComputeMajor, device.ComputeMinor = parseGFXTarget(gfx)
+}
+
+// rocblasGFXTargets scans the rocblas library directory for supported gfx targets
+// by looking for TensileLibrary_lazy_gfxNNNN.dat files.
+func rocblasGFXTargets(libDirs []string) map[string]bool {
+	targets := make(map[string]bool)
+	for _, dir := range libDirs {
+		files, _ := filepath.Glob(filepath.Join(dir, "rocblas", "library", "TensileLibrary_lazy_gfx*.dat"))
+		for _, f := range files {
+			base := filepath.Base(f)
+			if t, ok := strings.CutPrefix(base, "TensileLibrary_lazy_"); ok {
+				if t, ok = strings.CutSuffix(t, ".dat"); ok {
+					targets[t] = true
+				}
+			}
+		}
+	}
+	return targets
+}
+
+type rocmLinuxSysfsDevice struct {
+	pciID      string
+	gfxTarget  string
+	integrated bool
+	known      bool
+}
+
+func refineLinuxROCmDevices(devices []ml.DeviceInfo) []ml.DeviceInfo {
+	if runtime.GOOS != "linux" {
+		return devices
+	}
+	applyLinuxROCmRefinement(devices, "/sys")
+	return devices
+}
+
+func applyLinuxROCmRefinement(devices []ml.DeviceInfo, sysfsRoot string) bool {
+	var rocmIndexes []int
+	for i, device := range devices {
+		if device.Library == "ROCm" {
+			rocmIndexes = append(rocmIndexes, i)
+		}
+	}
+	if len(rocmIndexes) == 0 {
+		return false
+	}
+
+	sysfsDevices, err := readROCmLinuxSysfsDevices(sysfsRoot)
+	if err != nil {
+		slog.Debug("linux rocm device refinement unavailable", "error", err)
+		return false
+	}
+	if len(sysfsDevices) != len(rocmIndexes) {
+		slog.Debug("linux rocm device refinement skipped: device count mismatch",
+			"llama_server_count", len(rocmIndexes), "kfd_count", len(sysfsDevices))
+		return false
+	}
+
+	byPCI := map[string]rocmLinuxSysfsDevice{}
+	byGFX := uniqueROCmSysfsDevicesByGFX(sysfsDevices)
+	for _, sysfsDevice := range sysfsDevices {
+		if sysfsDevice.pciID != "" {
+			byPCI[strings.ToLower(sysfsDevice.pciID)] = sysfsDevice
+		}
+	}
+
+	refined := 0
+	for i, rocmIndex := range rocmIndexes {
+		device := &devices[rocmIndex]
+		sysfsDevice, ok := matchROCmLinuxSysfsDevice(*device, i, sysfsDevices, byPCI, byGFX)
+		if !ok {
+			slog.Debug("linux rocm device refinement skipped: no stable match",
+				"device", device.Name, "pci_id", device.PCIID, "gfx", device.GFXTarget)
+			continue
+		}
+		applyROCmLinuxSysfsDevice(device, sysfsDevice)
+		refined++
+	}
+
+	if refined == 0 {
+		return false
+	}
+
+	slog.Debug("linux rocm device refinement applied", "devices", refined)
+	return true
+}
+
+func uniqueROCmSysfsDevicesByGFX(sysfsDevices []rocmLinuxSysfsDevice) map[string]rocmLinuxSysfsDevice {
+	byGFX := map[string]rocmLinuxSysfsDevice{}
+	duplicates := map[string]bool{}
+	for _, sysfsDevice := range sysfsDevices {
+		if sysfsDevice.gfxTarget == "" {
+			continue
+		}
+		if _, ok := byGFX[sysfsDevice.gfxTarget]; ok {
+			duplicates[sysfsDevice.gfxTarget] = true
+			continue
+		}
+		byGFX[sysfsDevice.gfxTarget] = sysfsDevice
+	}
+	for gfx := range duplicates {
+		delete(byGFX, gfx)
+	}
+	return byGFX
+}
+
+func matchROCmLinuxSysfsDevice(device ml.DeviceInfo, index int, sysfsDevices []rocmLinuxSysfsDevice, byPCI, byGFX map[string]rocmLinuxSysfsDevice) (rocmLinuxSysfsDevice, bool) {
+	// ROCm visibility envs can remap backend ordinals while sysfs stays in
+	// physical KFD order, so prefer stable identity before index fallback.
+	if device.PCIID != "" {
+		if sysfsDevice, ok := byPCI[strings.ToLower(device.PCIID)]; ok {
+			return sysfsDevice, true
+		}
+	}
+
+	if device.GFXTarget != "" {
+		if sysfsDevice, ok := byGFX[device.GFXTarget]; ok {
+			return sysfsDevice, true
+		}
+	}
+
+	if index >= len(sysfsDevices) {
+		return rocmLinuxSysfsDevice{}, false
+	}
+	sysfsDevice := sysfsDevices[index]
+	if sysfsDevice.gfxTarget != "" && device.GFXTarget != "" && sysfsDevice.gfxTarget != device.GFXTarget {
+		slog.Debug("linux rocm device refinement index mismatch",
+			"device", device.Name, "llama_server_gfx", device.GFXTarget, "kfd_gfx", sysfsDevice.gfxTarget)
+		return rocmLinuxSysfsDevice{}, false
+	}
+	return sysfsDevice, true
+}
+
+func applyROCmLinuxSysfsDevice(device *ml.DeviceInfo, sysfsDevice rocmLinuxSysfsDevice) {
+	if sysfsDevice.pciID != "" {
+		device.PCIID = sysfsDevice.pciID
+	}
+	if sysfsDevice.known {
+		device.Integrated = sysfsDevice.integrated
+	}
+}
+
+func readROCmLinuxSysfsDevices(sysfsRoot string) ([]rocmLinuxSysfsDevice, error) {
+	nodeRoot := filepath.Join(sysfsRoot, "class", "kfd", "kfd", "topology", "nodes")
+	entries, err := os.ReadDir(nodeRoot)
+	if err != nil {
+		return nil, err
+	}
+
+	sort.Slice(entries, func(i, j int) bool {
+		left, _ := strconv.Atoi(entries[i].Name())
+		right, _ := strconv.Atoi(entries[j].Name())
+		return left < right
+	})
+
+	var devices []rocmLinuxSysfsDevice
+	for _, entry := range entries {
+		if !entry.IsDir() {
+			continue
+		}
+		properties, err := readKFDNodeProperties(filepath.Join(nodeRoot, entry.Name(), "properties"))
+		if err != nil || !properties.isGPU() {
+			continue
+		}
+
+		device, err := readROCmDRMDevice(sysfsRoot, properties.drmRenderMinor)
+		if err != nil {
+			slog.Debug("linux rocm sysfs device skipped", "node", entry.Name(), "error", err)
+			continue
+		}
+		device.gfxTarget = gfxTargetFromKFDVersion(properties.gfxTargetVersion)
+		devices = append(devices, device)
+	}
+
+	return devices, nil
+}
+
+type kfdNodeProperties struct {
+	vendorID         uint64
+	deviceID         uint64
+	drmRenderMinor   int
+	gfxTargetVersion uint64
+}
+
+func (p kfdNodeProperties) isGPU() bool {
+	return p.vendorID != 0 && p.deviceID != 0 && p.drmRenderMinor != 0
+}
+
+func readKFDNodeProperties(path string) (kfdNodeProperties, error) {
+	file, err := os.Open(path)
+	if err != nil {
+		return kfdNodeProperties{}, err
+	}
+	defer file.Close()
+
+	values := make(map[string]string)
+	scanner := bufio.NewScanner(file)
+	for scanner.Scan() {
+		fields := strings.Fields(scanner.Text())
+		if len(fields) >= 2 {
+			values[fields[0]] = fields[1]
+		}
+	}
+	if err := scanner.Err(); err != nil {
+		return kfdNodeProperties{}, err
+	}
+
+	vendorID, _ := parseSysfsUint(values["vendor_id"])
+	deviceID, _ := parseSysfsUint(values["device_id"])
+	renderMinor, _ := parseSysfsUint(values["drm_render_minor"])
+	gfxVersion, _ := parseSysfsUint(values["gfx_target_version"])
+
+	return kfdNodeProperties{
+		vendorID:         vendorID,
+		deviceID:         deviceID,
+		drmRenderMinor:   int(renderMinor),
+		gfxTargetVersion: gfxVersion,
+	}, nil
+}
+
+func readROCmDRMDevice(sysfsRoot string, renderMinor int) (rocmLinuxSysfsDevice, error) {
+	devicePath := filepath.Join(sysfsRoot, "class", "drm", "renderD"+strconv.Itoa(renderMinor), "device")
+	resolvedDevicePath, err := filepath.EvalSymlinks(devicePath)
+	if err != nil {
+		return rocmLinuxSysfsDevice{}, err
+	}
+
+	vendor, err := readSysfsString(filepath.Join(resolvedDevicePath, "vendor"))
+	if err != nil {
+		return rocmLinuxSysfsDevice{}, err
+	}
+	if !strings.EqualFold(vendor, "0x1002") {
+		return rocmLinuxSysfsDevice{}, nil
+	}
+
+	driver, err := readSysfsDriverName(filepath.Join(resolvedDevicePath, "driver"))
+	if err != nil {
+		return rocmLinuxSysfsDevice{}, err
+	}
+	if driver != "amdgpu" {
+		return rocmLinuxSysfsDevice{}, nil
+	}
+
+	device := rocmLinuxSysfsDevice{pciID: pciIDFromPath(resolvedDevicePath)}
+	if sysfsFileExists(filepath.Join(resolvedDevicePath, "mem_info_vram_vendor")) ||
+		sysfsFileExists(filepath.Join(resolvedDevicePath, "board_info")) {
+		device.known = true
+		return device, nil
+	}
+
+	vramTotal, ok := readROCmLinuxMemoryInfo(resolvedDevicePath, "mem_info_vram_total")
+	if !ok {
+		return device, nil
+	}
+	gttTotal, ok := readROCmLinuxMemoryInfo(resolvedDevicePath, "mem_info_gtt_total")
+	if !ok {
+		return device, nil
+	}
+
+	const (
+		maxIntegratedVRAM = 4 << 30
+		minSharedGTT      = 8 << 30
+	)
+	if vramTotal > 0 && vramTotal <= maxIntegratedVRAM && gttTotal >= minSharedGTT && gttTotal >= 4*vramTotal {
+		device.integrated = true
+		device.known = true
+	}
+
+	return device, nil
+}
+
+func readROCmLinuxMemoryInfo(devicePath, name string) (uint64, bool) {
+	value, err := readSysfsUint(filepath.Join(devicePath, name))
+	return value, err == nil
+}
+
+func readSysfsString(path string) (string, error) {
+	data, err := os.ReadFile(path)
+	if err != nil {
+		return "", err
+	}
+	return strings.TrimSpace(string(data)), nil
+}
+
+func readSysfsDriverName(path string) (string, error) {
+	driver, readErr := readSysfsString(path)
+	if readErr == nil {
+		return driver, nil
+	}
+	driverPath, err := filepath.EvalSymlinks(path)
+	if err == nil {
+		return filepath.Base(driverPath), nil
+	}
+	return "", readErr
+}
+
+func readSysfsUint(path string) (uint64, error) {
+	value, err := readSysfsString(path)
+	if err != nil {
+		return 0, err
+	}
+	return parseSysfsUint(value)
+}
+
+func parseSysfsUint(value string) (uint64, error) {
+	return strconv.ParseUint(strings.TrimSpace(value), 0, 64)
+}
+
+func sysfsFileExists(path string) bool {
+	_, err := os.Stat(path)
+	return err == nil
+}
+
+func pciIDFromPath(path string) string {
+	base := filepath.Base(path)
+	if pciIDRegex.MatchString(base) {
+		return base
+	}
+	return ""
+}
+
+func gfxTargetFromKFDVersion(version uint64) string {
+	if version == 0 {
+		return ""
+	}
+	major := version / 10000
+	minor := (version / 100) % 100
+	stepping := version % 100
+	if minor > 0xf || stepping > 0xf {
+		return ""
+	}
+	return "gfx" + strconv.FormatUint(major, 10) + strconv.FormatUint(minor, 16) + strconv.FormatUint(stepping, 16)
+}
+
+// filterUnsupportedROCmDevices removes ROCm devices whose gfx target doesn't have
+// matching rocblas kernels bundled.
+func filterUnsupportedROCmDevices(devices []ml.DeviceInfo, libDirs []string) []ml.DeviceInfo {
+	supported := rocblasGFXTargets(libDirs)
+	if len(supported) == 0 {
+		return devices
+	}
+
+	override := hsaOverrideGFXTarget()
+	var filtered []ml.DeviceInfo
+	for _, dev := range devices {
+		if dev.Library != "ROCm" {
+			filtered = append(filtered, dev)
+			continue
+		}
+
+		setROCmGFXTarget(&dev, override)
+		gfx := dev.GFXTarget
+		if gfx == "" {
+			filtered = append(filtered, dev)
+			continue
+		}
+		if supported[gfx] {
+			filtered = append(filtered, dev)
+		} else {
+			slog.Warn("dropping ROCm device — no rocblas support for gfx target",
+				"device", dev.Name, "gfx_target", gfx, "supported", supported,
+				"hint", "set HSA_OVERRIDE_GFX_VERSION to map to a supported target")
+		}
+	}
+	return filtered
+}
+
+func detectOldAMDDriverWindows() {
+	if runtime.GOOS != "windows" {
+		return
+	}
+	_, errV6 := exec.LookPath("amdhip64_6.dll")
+	_, errV7 := exec.LookPath("amdhip64_7.dll")
+	if errV6 == nil && errV7 != nil {
+		slog.Warn("AMD driver is too old. Update your AMD driver to enable GPU inference.")
+	}
+}
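
The decimal-to-gfx mapping in `gfxTargetFromKFDVersion` above can be checked in isolation. The sketch below is a standalone copy of that arithmetic; the name `gfxTarget` and the `main` driver are illustrative and not part of this change. It assumes, as the function in this diff does, that KFD's `gfx_target_version` packs the value as `major*10000 + minor*100 + stepping`, with minor and stepping rendered as single hex digits.

```go
package main

import (
	"fmt"
	"strconv"
)

// gfxTarget mirrors the arithmetic of gfxTargetFromKFDVersion in this diff.
// It is a hypothetical standalone helper for experimentation only.
func gfxTarget(version uint64) string {
	if version == 0 {
		return ""
	}
	major := version / 10000
	minor := (version / 100) % 100
	stepping := version % 100
	// Minor and stepping must each fit in one hex digit.
	if minor > 0xf || stepping > 0xf {
		return ""
	}
	return "gfx" + strconv.FormatUint(major, 10) +
		strconv.FormatUint(minor, 16) +
		strconv.FormatUint(stepping, 16)
}

func main() {
	// Values taken from the fixtures in discover/amd_test.go.
	for _, v := range []uint64{90012, 100601, 110003, 0} {
		fmt.Printf("%d -> %q\n", v, gfxTarget(v))
	}
}
```

The non-zero inputs are the `gfxVersion` fixtures used in `discover/amd_test.go`, where they are paired with the targets `gfx90c`, `gfx1061`, and `gfx1103`.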
--- a/discover/amd_test.go
+++ b/discover/amd_test.go
@@ -0,0 +1,289 @@
+package discover
+
+import (
+	"os"
+	"path/filepath"
+	"runtime"
+	"strconv"
+	"testing"
+
+	"github.com/ollama/ollama/ml"
+)
+
+func TestApplyLinuxROCmRefinement(t *testing.T) {
+	if runtime.GOOS == "windows" {
+		t.Skip("fake Linux PCI sysfs paths use ':' which is not valid in Windows filenames")
+	}
+
+	tests := []struct {
+		name           string
+		nodes          []fakeROCmNode
+		devices        []ml.DeviceInfo
+		applied        bool
+		wantIntegrated []bool
+		wantPCIIDs     []string
+	}{
+		{
+			name: "apu is integrated",
+			nodes: []fakeROCmNode{{
+				node:        1,
+				renderMinor: 128,
+				gfxVersion:  "90012",
+				vramTotal:   2 << 30,
+				gttTotal:    32 << 30,
+			}},
+			devices: []ml.DeviceInfo{{
+				DeviceID:  ml.DeviceID{ID: "0", Library: "ROCm"},
+				Name:      "ROCm0",
+				GFXTarget: "gfx90c",
+			}},
+			applied:        true,
+			wantIntegrated: []bool{true},
+		},
+		{
+			name: "low vram dgpu is not integrated",
+			nodes: []fakeROCmNode{{
+				node:        1,
+				renderMinor: 128,
+				gfxVersion:  "100601",
+				vramTotal:   4 << 30,
+				gttTotal:    32 << 30,
+				vramVendor:  true,
+				boardInfo:   true,
+			}},
+			devices: []ml.DeviceInfo{{
+				DeviceID:  ml.DeviceID{ID: "0", Library: "ROCm"},
+				Name:      "ROCm0",
+				GFXTarget: "gfx1061",
+			}},
+			applied:        true,
+			wantIntegrated: []bool{false},
+		},
+		{
+			name: "mixed system follows kfd order not drm order",
+			nodes: []fakeROCmNode{
+				{
+					node:        1,
+					renderMinor: 129,
+					gfxVersion:  "110000",
+					vramTotal:   48 << 30,
+					gttTotal:    64 << 30,
+					vramVendor:  true,
+					boardInfo:   true,
+				},
+				{
+					node:        2,
+					renderMinor: 128,
+					gfxVersion:  "110003",
+					vramTotal:   512 << 20,
+					gttTotal:    32 << 30,
+				},
+			},
+			devices: []ml.DeviceInfo{
+				{DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"}, Name: "ROCm0", GFXTarget: "gfx1100"},
+				{DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"}, Name: "ROCm1", GFXTarget: "gfx1103"},
+			},
+			applied:        true,
+			wantIntegrated: []bool{false, true},
+		},
+		{
+			name: "remapped visible order matches existing pci identity",
+			nodes: []fakeROCmNode{
+				{
+					node:        1,
+					renderMinor: 128,
+					pciID:       "0000:e3:00.0",
+					gfxVersion:  "110000",
+					vramTotal:   48 << 30,
+					gttTotal:    64 << 30,
+					vramVendor:  true,
+					boardInfo:   true,
+				},
+				{
+					node:        2,
+					renderMinor: 129,
+					pciID:       "0000:c3:00.0",
+					gfxVersion:  "120000",
+					vramTotal:   2 << 30,
+					gttTotal:    32 << 30,
+				},
+			},
+			devices: []ml.DeviceInfo{
+				{DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"}, Name: "ROCm0", GFXTarget: "gfx1200", PCIID: "0000:c3:00.0"},
+				{DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"}, Name: "ROCm1", GFXTarget: "gfx1100", PCIID: "0000:e3:00.0"},
+			},
+			applied:        true,
+			wantIntegrated: []bool{true, false},
+			wantPCIIDs:     []string{"0000:c3:00.0", "0000:e3:00.0"},
+		},
+		{
+			name: "remapped visible order matches unique gfx when pci is absent",
+			nodes: []fakeROCmNode{
+				{
+					node:        1,
+					renderMinor: 128,
+					pciID:       "0000:e3:00.0",
+					gfxVersion:  "110000",
+					vramTotal:   48 << 30,
+					gttTotal:    64 << 30,
+					vramVendor:  true,
+					boardInfo:   true,
+				},
+				{
+					node:        2,
+					renderMinor: 129,
+					pciID:       "0000:c3:00.0",
+					gfxVersion:  "120000",
+					vramTotal:   2 << 30,
+					gttTotal:    32 << 30,
+				},
+			},
+			devices: []ml.DeviceInfo{
+				{DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"}, Name: "ROCm0", GFXTarget: "gfx1200"},
+				{DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"}, Name: "ROCm1", GFXTarget: "gfx1100"},
+			},
+			applied:        true,
+			wantIntegrated: []bool{true, false},
+			wantPCIIDs:     []string{"0000:c3:00.0", "0000:e3:00.0"},
+		},
+		{
+			name: "missing kfd data leaves devices unchanged",
+			devices: []ml.DeviceInfo{{
+				DeviceID:   ml.DeviceID{ID: "0", Library: "ROCm"},
+				Name:       "ROCm0",
+				Integrated: true,
+			}},
+			wantIntegrated: []bool{true},
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			sysfsRoot := t.TempDir()
+			for _, node := range tt.nodes {
+				writeFakeROCmNode(t, sysfsRoot, node)
+			}
+
+			devices := append([]ml.DeviceInfo(nil), tt.devices...)
+			applied := applyLinuxROCmRefinement(devices, sysfsRoot)
+			if applied != tt.applied {
+				t.Fatalf("applied = %v, want %v", applied, tt.applied)
+			}
+			for i, want := range tt.wantIntegrated {
+				if devices[i].Integrated != want {
+					t.Fatalf("device %d integrated = %v, want %v", i, devices[i].Integrated, want)
+				}
+			}
+			for i, want := range tt.wantPCIIDs {
+				if devices[i].PCIID != want {
+					t.Fatalf("device %d PCIID = %q, want %q", i, devices[i].PCIID, want)
+				}
+			}
+		})
+	}
+}
+
+func TestSameRefreshDeviceMatchesROCmByPCI(t *testing.T) {
+	updated := ml.DeviceInfo{
+		DeviceID: ml.DeviceID{ID: "0", Library: "ROCm"},
+		PCIID:    "0000:c3:00.0",
+	}
+	existing := ml.DeviceInfo{
+		DeviceID: ml.DeviceID{ID: "1", Library: "ROCm"},
+		PCIID:    "0000:C3:00.0",
+	}
+	if !sameRefreshDevice(updated, existing) {
+		t.Fatal("sameRefreshDevice did not match remapped ROCm device by PCI ID")
+	}
+}
+
+func TestFilterUnsupportedROCmDevicesRespectsHSAOverride(t *testing.T) {
+	t.Setenv("HSA_OVERRIDE_GFX_VERSION", "10.3.0")
+
+	libDir := t.TempDir()
+	rocblasDir := filepath.Join(libDir, "rocblas", "library")
+	if err := os.MkdirAll(rocblasDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(rocblasDir, "TensileLibrary_lazy_gfx1030.dat"), nil, 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	devices := filterUnsupportedROCmDevices([]ml.DeviceInfo{{
+		DeviceID:     ml.DeviceID{ID: "0", Library: "ROCm"},
+		Name:         "ROCm0",
+		GFXTarget:    "gfx1031",
+		ComputeMajor: 0x10,
+		ComputeMinor: 0x31,
+	}}, []string{libDir})
+	if len(devices) != 1 {
+		t.Fatalf("got %d devices, want 1", len(devices))
+	}
+	if got := devices[0].GFXTarget; got != "gfx1030" {
+		t.Fatalf("GFXTarget = %q, want gfx1030", got)
+	}
+	if got := devices[0].Compute(); got != "gfx1030" {
+		t.Fatalf("Compute() = %q, want gfx1030", got)
+	}
+}
+
+type fakeROCmNode struct {
+	node        int
+	renderMinor int
+	pciID       string
+	gfxVersion  string
+	vramTotal   uint64
+	gttTotal    uint64
+	vramVendor  bool
+	boardInfo   bool
+}
+
+func writeFakeROCmNode(t *testing.T, sysfsRoot string, node fakeROCmNode) {
+	t.Helper()
+
+	nodeDir := filepath.Join(sysfsRoot, "class", "kfd", "kfd", "topology", "nodes", strconv.Itoa(node.node))
+	if err := os.MkdirAll(nodeDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	properties := "vendor_id 4098\n" +
+		"device_id 1234\n" +
+		"drm_render_minor " + strconv.Itoa(node.renderMinor) + "\n" +
+		"gfx_target_version " + node.gfxVersion + "\n"
+	if err := os.WriteFile(filepath.Join(nodeDir, "properties"), []byte(properties), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	deviceDir := filepath.Join(sysfsRoot, "class", "drm", "renderD"+strconv.Itoa(node.renderMinor), "device")
+	if node.pciID != "" {
+		targetDir := filepath.Join(sysfsRoot, "devices", node.pciID)
+		if err := os.MkdirAll(targetDir, 0o755); err != nil {
+			t.Fatal(err)
+		}
+		if err := os.MkdirAll(filepath.Dir(deviceDir), 0o755); err != nil {
+			t.Fatal(err)
+		}
+		if err := os.Symlink(targetDir, deviceDir); err != nil {
+			t.Skipf("symlink unavailable for fake sysfs PCI path: %v", err)
+		}
+		deviceDir = targetDir
+	} else if err := os.MkdirAll(deviceDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	writeFakeSysfsFile(t, deviceDir, "vendor", "0x1002\n")
+	writeFakeSysfsFile(t, deviceDir, "driver", "amdgpu\n")
+	writeFakeSysfsFile(t, deviceDir, "mem_info_vram_total", strconv.FormatUint(node.vramTotal, 10)+"\n")
+	writeFakeSysfsFile(t, deviceDir, "mem_info_gtt_total", strconv.FormatUint(node.gttTotal, 10)+"\n")
+	if node.vramVendor {
+		writeFakeSysfsFile(t, deviceDir, "mem_info_vram_vendor", "samsung\n")
+	}
+	if node.boardInfo {
+		writeFakeSysfsFile(t, deviceDir, "board_info", "type : cem\n")
+	}
+}
+
+func writeFakeSysfsFile(t *testing.T, dir, name, content string) {
+	t.Helper()
+	if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
+		t.Fatal(err)
+	}
+}
--- a/discover/cpu_linux.go
+++ b/discover/cpu_linux.go
@@ -4,13 +4,8 @@ import (
 	"bufio"
 	"errors"
 	"fmt"
-	"io"
-	"log/slog"
 	"os"
 	"path/filepath"
-	"reflect"
-	"regexp"
-	"sort"
 	"strconv"
 	"strings"

@@ -92,143 +87,6 @@ func getUint64ValueFromFile(path string) (uint64, error) {
 	return 0, errors.New("empty file content")
 }

-const CpuInfoFilename = "/proc/cpuinfo"
-
-type linuxCpuInfo struct {
-	ID         string `cpuinfo:"processor"`
-	VendorID   string `cpuinfo:"vendor_id"`
-	ModelName  string `cpuinfo:"model name"`
-	PhysicalID string `cpuinfo:"physical id"`
-	Siblings   string `cpuinfo:"siblings"`
-	CoreID     string `cpuinfo:"core id"`
-}
-
-func GetCPUDetails() []CPU {
-	file, err := os.Open(CpuInfoFilename)
-	if err != nil {
-		slog.Warn("failed to get CPU details", "error", err)
-		return nil
-	}
-	defer file.Close()
-	cpus := linuxCPUDetails(file)
-	return overwriteThreadCountByLinuxCgroups(cpus)
-}
-
-func overwriteThreadCountByLinuxCgroups(cpus []CPU) []CPU {
-	file, err := os.Open("/sys/fs/cgroup/cpu.max")
-	if err != nil {
-		return cpus
-	}
-	defer file.Close()
-
-	scanner := bufio.NewScanner(file)
-	for scanner.Scan() {
-		line := scanner.Text()
-		if sl := strings.Split(line, " "); len(sl) == 2 {
-			allowdUs, err := strconv.ParseInt(sl[0], 10, 64)
-			if err != nil {
-				slog.Warn("failed to parse CPU allowed micro secs", "error", err)
-				return cpus
-			}
-			unitUs, err := strconv.ParseInt(sl[1], 10, 64)
-			if err != nil {
-				slog.Warn("failed to parse CPU unit micro secs", "error", err)
-				return cpus
-			}
-
-			threads := int(max(allowdUs/unitUs, 1))
-
-			cpu := cpus[0]
-			cpu.CoreCount = threads
-			cpu.ThreadCount = threads
-			return []CPU{cpu}
-		}
-	}
-	return cpus
-}
-
-func linuxCPUDetails(file io.Reader) []CPU {
-	reColumns := regexp.MustCompile("\t+: ")
-	scanner := bufio.NewScanner(file)
-	cpuInfos := []linuxCpuInfo{}
-	cpu := &linuxCpuInfo{}
-	for scanner.Scan() {
-		line := scanner.Text()
-		if sl := reColumns.Split(line, 2); len(sl) > 1 {
-			t := reflect.TypeOf(cpu).Elem()
-			s := reflect.ValueOf(cpu).Elem()
-			for i := range t.NumField() {
-				field := t.Field(i)
-				tag := field.Tag.Get("cpuinfo")
-				if tag == sl[0] {
-					s.FieldByName(field.Name).SetString(sl[1])
-					break
-				}
-			}
-		} else if strings.TrimSpace(line) == "" && cpu.ID != "" {
-			cpuInfos = append(cpuInfos, *cpu)
-			cpu = &linuxCpuInfo{}
-		}
-	}
-	if cpu.ID != "" {
-		cpuInfos = append(cpuInfos, *cpu)
-	}
-
-	// Process the sockets/cores/threads
-	socketByID := map[string]*CPU{}
-	coreBySocket := map[string]map[string]struct{}{}
-	threadsByCoreBySocket := map[string]map[string]int{}
-	for _, c := range cpuInfos {
-		if _, found := socketByID[c.PhysicalID]; !found {
-			socketByID[c.PhysicalID] = &CPU{
-				ID:        c.PhysicalID,
-				VendorID:  c.VendorID,
-				ModelName: c.ModelName,
-			}
-			coreBySocket[c.PhysicalID] = map[string]struct{}{}
-			threadsByCoreBySocket[c.PhysicalID] = map[string]int{}
-		}
-		if c.CoreID != "" {
-			coreBySocket[c.PhysicalID][c.PhysicalID+":"+c.CoreID] = struct{}{}
-			threadsByCoreBySocket[c.PhysicalID][c.PhysicalID+":"+c.CoreID]++
-		} else {
-			coreBySocket[c.PhysicalID][c.PhysicalID+":"+c.ID] = struct{}{}
-			threadsByCoreBySocket[c.PhysicalID][c.PhysicalID+":"+c.ID]++
-		}
-	}
-
-	// Tally up the values from the tracking maps
-	for id, s := range socketByID {
-		s.CoreCount = len(coreBySocket[id])
-		s.ThreadCount = 0
-
-		// This only works if HT is enabled, consider a more reliable model, maybe cache size comparisons?
-		efficiencyCoreCount := 0
-		for _, threads := range threadsByCoreBySocket[id] {
-			s.ThreadCount += threads
-			if threads == 1 {
-				efficiencyCoreCount++
-			}
-		}
-		if efficiencyCoreCount == s.CoreCount {
-			// 1:1 mapping means they're not actually efficiency cores, but regular cores
-			s.EfficiencyCoreCount = 0
-		} else {
-			s.EfficiencyCoreCount = efficiencyCoreCount
-		}
-	}
-	keys := make([]string, 0, len(socketByID))
-	result := make([]CPU, 0, len(socketByID))
-	for k := range socketByID {
-		keys = append(keys, k)
-	}
-	sort.Strings(keys)
-	for _, k := range keys {
-		result = append(result, *socketByID[k])
-	}
-	return result
-}
-
 func IsNUMA() bool {
 	ids := map[string]any{}
 	packageIds, _ := filepath.Glob("/sys/devices/system/cpu/cpu*/topology/physical_package_id")
--- a/discover/cpu_linux_test.go
+++ b/discover/cpu_linux_test.go
--- a/discover/cpu_windows.go
+++ b/discover/cpu_windows.go
@@ -2,11 +2,8 @@ package discover

 import (
 	"fmt"
-	"log/slog"
 	"syscall"
 	"unsafe"
-
-	"github.com/ollama/ollama/logutil"
 )

 type MEMORYSTATUSEX struct {
@@ -22,10 +19,9 @@ type MEMORYSTATUSEX struct {
 }

 var (
-	k32                              = syscall.NewLazyDLL("kernel32.dll")
-	globalMemoryStatusExProc         = k32.NewProc("GlobalMemoryStatusEx")
-	sizeofMemoryStatusEx             = uint32(unsafe.Sizeof(MEMORYSTATUSEX{}))
-	GetLogicalProcessorInformationEx = k32.NewProc("GetLogicalProcessorInformationEx")
+	k32                      = syscall.NewLazyDLL("kernel32.dll")
+	globalMemoryStatusExProc = k32.NewProc("GlobalMemoryStatusEx")
+	sizeofMemoryStatusEx     = uint32(unsafe.Sizeof(MEMORYSTATUSEX{}))
 )

 func GetCPUMem() (memInfo, error) {
@@ -37,184 +33,6 @@ func GetCPUMem() (memInfo, error) {
 	return memInfo{TotalMemory: memStatus.TotalPhys, FreeMemory: memStatus.AvailPhys, FreeSwap: memStatus.AvailPageFile}, nil
 }

-type LOGICAL_PROCESSOR_RELATIONSHIP uint32
-
-const (
-	RelationProcessorCore LOGICAL_PROCESSOR_RELATIONSHIP = iota
-	RelationNumaNode
-	RelationCache
-	RelationProcessorPackage
-	RelationGroup
-	RelationProcessorDie
-	RelationNumaNodeEx
-	RelationProcessorModule
-)
-const RelationAll LOGICAL_PROCESSOR_RELATIONSHIP = 0xffff
-
-type GROUP_AFFINITY struct {
-	Mask     uintptr // KAFFINITY
-	Group    uint16
-	Reserved [3]uint16
-}
-
-type PROCESSOR_RELATIONSHIP struct {
-	Flags           byte
-	EfficiencyClass byte
-	Reserved        [20]byte
-	GroupCount      uint16
-	GroupMask       [1]GROUP_AFFINITY // len GroupCount
-}
-
-// Omitted unused structs: NUMA_NODE_RELATIONSHIP CACHE_RELATIONSHIP GROUP_RELATIONSHIP
-
-type SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX struct {
-	Relationship LOGICAL_PROCESSOR_RELATIONSHIP
-	Size         uint32
-	U            [1]byte // Union len Size
-	// PROCESSOR_RELATIONSHIP
-	// NUMA_NODE_RELATIONSHIP
-	// CACHE_RELATIONSHIP
-	// GROUP_RELATIONSHIP
-}
-
-func (group *GROUP_AFFINITY) IsMember(target *GROUP_AFFINITY) bool {
-	if group == nil || target == nil {
-		return false
-	}
-	return group.Mask&target.Mask != 0
-}
-
-type winPackage struct {
-	groups              []*GROUP_AFFINITY
-	coreCount           int // performance cores = coreCount - efficiencyCoreCount
-	efficiencyCoreCount int
-	threadCount         int
-}
-
-func (pkg *winPackage) IsMember(target *GROUP_AFFINITY) bool {
-	for _, group := range pkg.groups {
-		if group.IsMember(target) {
-			return true
-		}
-	}
-	return false
-}
-
-func getLogicalProcessorInformationEx() ([]byte, error) {
-	buf := make([]byte, 1)
-	bufSize := len(buf)
-	ret, _, err := GetLogicalProcessorInformationEx.Call(
-		uintptr(RelationAll),
-		uintptr(unsafe.Pointer(&buf[0])),
-		uintptr(unsafe.Pointer(&bufSize)),
-	)
-	if ret != 0 {
-		logutil.Trace("failed to retrieve CPU payload size", "ret", ret, "size", bufSize, "error", err)
-		return nil, fmt.Errorf("failed to determine size info ret:%d %w", ret, err)
-	}
-
-	buf = make([]byte, bufSize)
-	ret, _, err = GetLogicalProcessorInformationEx.Call(
-		uintptr(RelationAll),
-		uintptr(unsafe.Pointer(&buf[0])),
-		uintptr(unsafe.Pointer(&bufSize)),
-	)
-	if ret == 0 {
-		logutil.Trace("failed to retrieve CPU information", "ret", ret, "size", len(buf), "new_size", bufSize, "error", err)
-		return nil, fmt.Errorf("failed to gather processor information ret:%d buflen:%d %w", ret, bufSize, err)
-	}
-	return buf, nil
-}
-
-func processSystemLogicalProcessorInforationList(buf []byte) []*winPackage {
-	var slpi *SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
-	// Find all the packages first
-	packages := []*winPackage{}
-	for bufOffset := 0; bufOffset < len(buf); bufOffset += int(slpi.Size) {
-		slpi = (*SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(unsafe.Pointer(&buf[bufOffset]))
-		if slpi.Relationship != RelationProcessorPackage {
-			continue
-		}
-		pr := (*PROCESSOR_RELATIONSHIP)(unsafe.Pointer(&slpi.U[0]))
-		pkg := &winPackage{}
-		ga0 := unsafe.Pointer(&pr.GroupMask[0])
-		for j := range pr.GroupCount {
-			gm := (*GROUP_AFFINITY)(unsafe.Pointer(uintptr(ga0) + uintptr(j)*unsafe.Sizeof(GROUP_AFFINITY{})))
-			pkg.groups = append(pkg.groups, gm)
-		}
-		packages = append(packages, pkg)
-	}
-
-	slog.Info("packages", "count", len(packages))
-
-	// To identify efficiency cores we have to compare the relative values
-	// Larger values are "less efficient" (aka, more performant)
-	var maxEfficiencyClass byte
-	for bufOffset := 0; bufOffset < len(buf); bufOffset += int(slpi.Size) {
-		slpi = (*SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(unsafe.Pointer(&buf[bufOffset]))
-		if slpi.Relationship != RelationProcessorCore {
-			continue
-		}
-		pr := (*PROCESSOR_RELATIONSHIP)(unsafe.Pointer(&slpi.U[0]))
-		if pr.EfficiencyClass > maxEfficiencyClass {
-			maxEfficiencyClass = pr.EfficiencyClass
-		}
-	}
-	if maxEfficiencyClass > 0 {
-		slog.Info("efficiency cores detected", "maxEfficiencyClass", maxEfficiencyClass)
-	}
-
-	// then match up the Cores to the Packages, count up cores, threads and efficiency cores
-	for bufOffset := 0; bufOffset < len(buf); bufOffset += int(slpi.Size) {
-		slpi = (*SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(unsafe.Pointer(&buf[bufOffset]))
-		if slpi.Relationship != RelationProcessorCore {
-			continue
-		}
-		pr := (*PROCESSOR_RELATIONSHIP)(unsafe.Pointer(&slpi.U[0]))
-		ga0 := unsafe.Pointer(&pr.GroupMask[0])
-		for j := range pr.GroupCount {
-			gm := (*GROUP_AFFINITY)(unsafe.Pointer(uintptr(ga0) + uintptr(j)*unsafe.Sizeof(GROUP_AFFINITY{})))
-			for _, pkg := range packages {
-				if pkg.IsMember(gm) {
-					pkg.coreCount++
-					if pr.Flags == 0 {
-						pkg.threadCount++
-					} else {
-						pkg.threadCount += 2
-					}
-					if pr.EfficiencyClass < maxEfficiencyClass {
-						pkg.efficiencyCoreCount++
-					}
-				}
-			}
-		}
-	}
-
-	// Summarize the results
-	for i, pkg := range packages {
-		slog.Info("", "package", i, "cores", pkg.coreCount, "efficiency", pkg.efficiencyCoreCount, "threads", pkg.threadCount)
-	}
-
-	return packages
-}
-
-func GetCPUDetails() []CPU {
-	buf, err := getLogicalProcessorInformationEx()
-	if err != nil {
-		slog.Warn("failed to get CPU details", "error", err)
-		return nil
-	}
-	packages := processSystemLogicalProcessorInforationList(buf)
-	cpus := make([]CPU, len(packages))
-
-	for i, pkg := range packages {
-		cpus[i].CoreCount = pkg.coreCount
-		cpus[i].EfficiencyCoreCount = pkg.efficiencyCoreCount
-		cpus[i].ThreadCount = pkg.threadCount
-	}
-	return cpus
-}
-
 func IsNUMA() bool {
 	// numa support in ggml is linux only
 	return false
--- a/discover/cpu_windows_test.go
+++ b/discover/cpu_windows_test.go
--- a/discover/cuda_compat.go
+++ b/discover/cuda_compat.go
@@ -0,0 +1,54 @@
+package discover
+
+import (
+	"context"
+	"log/slog"
+
+	"github.com/ollama/ollama/ml"
+)
+
+func filterOldCUDADriver(_ context.Context, devices []ml.DeviceInfo) []ml.DeviceInfo {
+	oldCUDA := func(dev ml.DeviceInfo) bool {
+		return dev.Library == "CUDA" && dev.ComputeMajor > 0 && dev.ComputeMajor < 7
+	}
+
+	needsCheck := false
+	for _, dev := range devices {
+		if oldCUDA(dev) {
+			needsCheck = true
+			break
+		}
+	}
+	if !needsCheck {
+		return devices
+	}
+
+	driver := nvidiaDriverMajorFromDevices(devices)
+	if driver == 0 {
+		slog.Warn("could not verify NVIDIA driver compatibility for an older NVIDIA GPU")
+		return devices
+	}
+	if driver >= 570 {
+		return devices
+	}
+
+	filtered := devices[:0]
+	for _, dev := range devices {
+		if oldCUDA(dev) {
+			slog.Warn("NVIDIA driver too old",
+				"device", dev.Description, "compute", dev.Compute(), "driver", driver, "required_driver", "570 or newer")
+			continue
+		}
+		filtered = append(filtered, dev)
+	}
+	return filtered
+}
+
+func nvidiaDriverMajorFromDevices(devices []ml.DeviceInfo) int {
+	for _, dev := range devices {
+		if dev.Library == "CUDA" && dev.NVIDIADriverMajor > 0 {
+			return dev.NVIDIADriverMajor
+		}
+	}
+	return 0
+}
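
`filterOldCUDADriver` above builds its result with `filtered := devices[:0]`, which reuses the input slice's backing array instead of allocating. A minimal sketch of that idiom follows; the `device` type and `filterInPlace` helper are hypothetical stand-ins for `ml.DeviceInfo` and the loop in the diff. The point to notice is that the caller's original slice is overwritten, so callers should use only the returned slice.

```go
package main

import "fmt"

// device is a hypothetical, reduced stand-in for ml.DeviceInfo.
type device struct {
	Library      string
	ComputeMajor int
}

// filterInPlace keeps the devices for which keep returns true. It reuses the
// backing array of the input, the same idiom as `filtered := devices[:0]`.
func filterInPlace(devices []device, keep func(device) bool) []device {
	filtered := devices[:0]
	for _, d := range devices {
		if keep(d) {
			filtered = append(filtered, d)
		}
	}
	return filtered
}

func main() {
	devices := []device{{"CUDA", 6}, {"CUDA", 8}, {"ROCm", 0}}
	// Same predicate shape as oldCUDA in the diff: CUDA with compute major in 1..6.
	kept := filterInPlace(devices, func(d device) bool {
		return !(d.Library == "CUDA" && d.ComputeMajor > 0 && d.ComputeMajor < 7)
	})
	fmt.Println(len(kept), kept[0].ComputeMajor)
	// The input shares storage with the result, so its first element changed too.
	fmt.Println(devices[0].ComputeMajor)
}
```

If a caller needs the unfiltered list afterwards, copying first (for example with `append([]device(nil), devices...)`, as the test in `amd_test.go` does for its fixtures) avoids the aliasing.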
--- a/discover/gpu.go
+++ b/discover/gpu.go
@@ -17,31 +17,20 @@ import (
 // Included to drive logic for reducing Ollama-allocated overhead on L4T/Jetson devices.
 var CudaTegra string = os.Getenv("JETSON_JETPACK")

-// GetSystemInfo returns the last cached state of the GPUs on the system
+// GetSystemInfo returns host memory information used by scheduling.
 func GetSystemInfo() ml.SystemInfo {
-	logutil.Trace("performing CPU discovery")
+	logutil.Trace("performing system memory discovery")
 	startDiscovery := time.Now()
 	defer func() {
-		logutil.Trace("CPU discovery completed", "duration", time.Since(startDiscovery))
+		logutil.Trace("system memory discovery completed", "duration", time.Since(startDiscovery))
 	}()

 	memInfo, err := GetCPUMem()
 	if err != nil {
 		slog.Warn("error looking up system memory", "error", err)
 	}
-	var threadCount int
-	cpus := GetCPUDetails()
-	for _, c := range cpus {
-		threadCount += c.CoreCount - c.EfficiencyCoreCount
-	}
-
-	if threadCount == 0 {
-		// Fall back to Go's num CPU
-		threadCount = runtime.NumCPU()
-	}

 	return ml.SystemInfo{
-		ThreadCount: threadCount,
 		TotalMemory: memInfo.TotalMemory,
 		FreeMemory:  memInfo.FreeMemory,
 		FreeSwap:    memInfo.FreeSwap,
--- a/discover/gpu_darwin.go
+++ b/discover/gpu_darwin.go
@@ -8,9 +8,6 @@ package discover
 import "C"

 import (
-	"log/slog"
-	"syscall"
-
 	"github.com/ollama/ollama/format"
 )

@@ -26,28 +23,6 @@ func GetCPUMem() (memInfo, error) {
 	}, nil
 }

-func GetCPUDetails() []CPU {
-	query := "hw.perflevel0.physicalcpu"
-	perfCores, err := syscall.SysctlUint32(query)
-	if err != nil {
-		slog.Warn("failed to discover physical CPU details", "query", query, "error", err)
-	}
-	query = "hw.perflevel1.physicalcpu"
-	efficiencyCores, _ := syscall.SysctlUint32(query) // On x86 xeon this wont return data
-
-	// Determine thread count
-	query = "hw.logicalcpu"
-	logicalCores, _ := syscall.SysctlUint32(query)
-
-	return []CPU{
-		{
-			CoreCount:           int(perfCores + efficiencyCores),
-			EfficiencyCoreCount: int(efficiencyCores),
-			ThreadCount:         int(logicalCores),
-		},
-	}
-}
-
 func IsNUMA() bool {
 	// numa support in ggml is linux only
 	return false
--- a/discover/gpu_info_darwin.m
+++ b/discover/gpu_info_darwin.m
@@ -27,9 +27,17 @@ uint64_t getFreeMemory() {
    return 0;
  }

-  uint64_t free_memory = (uint64_t)vm_stat.free_count * pagesize;
-  free_memory += (uint64_t)vm_stat.speculative_count * pagesize;
-  free_memory += (uint64_t)vm_stat.inactive_count * pagesize;
+  uint64_t used = (uint64_t)vm_stat.active_count * pagesize
+    + (uint64_t)vm_stat.inactive_count * pagesize
+    + (uint64_t)vm_stat.speculative_count * pagesize
+    + (uint64_t)vm_stat.wire_count * pagesize
+    + (uint64_t)vm_stat.compressor_page_count * pagesize
+    - (uint64_t)vm_stat.purgeable_count * pagesize
+    - (uint64_t)vm_stat.external_page_count * pagesize;

-  return free_memory;
+  uint64_t total_memory = [NSProcessInfo processInfo].physicalMemory;
+  if (used >= total_memory) {
+    return 0;
+  }
+  return total_memory - used;
 }
--- a/discover/llama_server.go
+++ b/discover/llama_server.go
@@ -0,0 +1,513 @@
+package discover
+
+import (
+	"bufio"
+	"context"
+	"fmt"
+	"io"
+	"log/slog"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"regexp"
+	"runtime"
+	"strconv"
+	"strings"
+	"time"
+
+	"github.com/ollama/ollama/llm"
+	"github.com/ollama/ollama/logutil"
+	"github.com/ollama/ollama/ml"
+)
+
+// llamaServerDiscoveryWaitDelay bounds how long Wait can hang after we stop
+// the short-lived discovery subprocess.
+const llamaServerDiscoveryWaitDelay = 5 * time.Second
+
+// llamaServerDiscoverDevices spawns llama-server briefly (without a model) to
+// discover GPU devices and their capabilities. The server prints device info
+// and system_info (including compiled CUDA architectures) on startup before
+// any model load, then we kill it.
+//
+// Captured from combined stderr output:
+//
+//	Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
+//	Device 0: AMD Radeon RX 6700 XT, gfx1031 (0x1031), VMM: no, Wave Size: 32, VRAM: 12272 MiB
+//
+// Captured from stdout device list:
+//
+//	CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
+//	Metal: Apple M3 Max (98304 MiB, 98303 MiB free)
+
+func llamaServerDiscoverDevices(ctx context.Context, libDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
+	status := llm.NewStatusWriter(llamaServerDiscoveryOutput(ctx))
+	llamaServer, err := llm.FindLlamaServer()
+	if err != nil {
+		slog.Debug("llama-server not available for device discovery", "error", err)
+		return nil, status, err
+	}
+
+	start := time.Now()
+	defer func() {
+		slog.Debug("llama-server device discovery took", "duration", time.Since(start), "libDirs", libDirs)
+	}()
+
+	// Pick a time-derived port in the dynamic range (49152-65534) to make
+	// conflicts unlikely. The server may start listening before it emits
+	// system_info, but we stop it as soon as we have the discovery output.
+	port := 49152 + time.Now().UnixNano()%16383
+	cmd := exec.CommandContext(ctx, llamaServer,
+		"--port", strconv.FormatInt(port, 10),
+		"--host", "127.0.0.1",
+		"--no-webui",
+		"--offline",
+		"--verbose",
+	)
+	cmd.WaitDelay = llamaServerDiscoveryWaitDelay
+	cmd.Env = os.Environ()
+
+	llm.SetupLlamaServerCommandEnv(cmd, llamaServer, libDirs, extraEnvs)
+
+	logutil.Trace("running llama-server for discovery", "cmd", cmd.Path, "libDirs", libDirs)
+
+	// Capture stderr (device info + system_info) via pipe so we can
+	// read it line-by-line and kill the server as soon as we have what we need.
+	stderrPipe, err := cmd.StderrPipe()
+	if err != nil {
+		slog.Debug("llama-server discovery: failed to create stderr pipe", "error", err)
+		return nil, status, err
+	}
+	// Forward stdout through the same status writer so trace logging captures
+	// all llama-server discovery output.
+	cmd.Stdout = status
+
+	if err := cmd.Start(); err != nil {
+		slog.Debug("llama-server discovery: failed to start", "error", err)
+		return nil, status, err
+	}
+
+	// Read stderr until we see system_info or timeout
+	var stderrLines []string
+	gotSystemInfo := false
+	done := make(chan struct{})
+	go func() {
+		scanner := bufio.NewScanner(stderrPipe)
+		for scanner.Scan() {
+			line := scanner.Text()
+			_, _ = status.Write([]byte(line + "\n"))
+			stderrLines = append(stderrLines, line)
+			if strings.Contains(line, "system_info:") {
+				gotSystemInfo = true
+				break
+			}
+		}
+		close(done)
+	}()
+
+	select {
+	case <-done:
+	case <-ctx.Done():
+	}
+
+	// Kill the server - we have what we need, or timed out.
+	stoppedForDiscovery := false
+	if cmd.Process != nil {
+		stoppedForDiscovery = cmd.Process.Kill() == nil
+	}
+	waitErr := cmd.Wait()
+	if waitErr != nil {
+		exit := llm.ExitStatusFromError(waitErr)
+		if stoppedForDiscovery {
+			slog.Debug("llama-server discovery: stopped subprocess after collecting GPU info", "exit", exit, "libDirs", libDirs)
+		}
+		if !stoppedForDiscovery {
+			slog.Debug("llama-server discovery: server startup exited", "error", waitErr, "exit", exit, "libDirs", libDirs)
+		}
+	}
+	<-done
+
+	if ctx.Err() != nil {
+		slog.Warn("llama-server discovery: timed out waiting for server startup", "error", ctx.Err(), "libDirs", libDirs, "lines_captured", len(stderrLines))
+		return nil, status, ctx.Err()
+	}
+
+	if !gotSystemInfo {
+		slog.Warn("llama-server discovery: system_info line not found in output - "+
+			"CUDA architecture filtering will be disabled. If GPU inference fails, "+
+			"this may indicate an incompatible llama-server version.",
+			"libDirs", libDirs, "lines_captured", len(stderrLines))
+	}
+
+	// Also run --list-devices to get the stdout device list with free memory
+	// (the brief server startup doesn't print that)
+	cmd2 := exec.CommandContext(ctx, llamaServer, "--list-devices", "--offline", "--verbose")
+	cmd2.WaitDelay = llamaServerDiscoveryWaitDelay
+	cmd2.Env = cmd.Env // reuse same environment
+	listOutput, err := cmd2.CombinedOutput()
+	_, _ = status.Write(listOutput)
+	if err != nil {
+		exit := llm.ExitStatusFromError(err)
+		slog.Debug("llama-server --list-devices failed", "error", err, "exit", exit)
+		if exit.Known() {
+			return nil, status, fmt.Errorf("llama-server --list-devices failed: %s", exit)
+		}
+		return nil, status, fmt.Errorf("llama-server --list-devices failed: %w", err)
+	}
+
+	nativeDevices, nativeStderr, nativeErr := discoverNativeDevices(ctx, llamaServer, libDirs, extraEnvs)
+	_, _ = status.Write([]byte(nativeStderr))
+	if nativeErr != nil {
+		logNativeProbeFailure(nativeErr, nativeStderr, libDirs)
+	}
+
+	combined := string(listOutput) + "\n" + strings.Join(stderrLines, "\n") + "\n" + nativeStderr
+	return parseLlamaServerDevicesWithNative(combined, libDirs, nativeDevices), status, nil
+}
+
+func llamaServerDiscoveryOutput(ctx context.Context) io.Writer {
+	if slog.Default().Enabled(ctx, logutil.LevelTrace) {
+		return os.Stderr
+	}
+	return io.Discard
+}
+
+// deviceLineRegex matches stdout lines like:
+//
+//	CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
+//	Metal: Apple M3 Max (98304 MiB, 98303 MiB free)
+var deviceLineRegex = regexp.MustCompile(
+	`^\s+(.+?):\s+(.+?)\s+\((\d+)\s+MiB,\s+(\d+)\s+MiB\s+free\)`,
+)
+
+// cudaCCRegex matches CUDA stderr lines like:
+//
+//	Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
+var cudaCCRegex = regexp.MustCompile(
+	`Device\s+(\d+):.*compute capability\s+(\d+)\.(\d+)`,
+)
+
+// cudaArchsRegex matches the CUDA architecture list from system_info like:
+//
+//	CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210
+var cudaArchsRegex = regexp.MustCompile(
+	`CUDA\s*:\s*ARCHS\s*=\s*([\d,]+)`,
+)
+
+var (
+	cudaRuntimeSORegex  = regexp.MustCompile(`^libcudart\.so\.(\d+)(?:\.(\d+))?`)
+	cudaRuntimeDLLRegex = regexp.MustCompile(`^cudart64_(\d{2})(\d)\.dll$`)
+	cudaRuntimeDirRegex = regexp.MustCompile(`^cuda_v(\d+)$`)
+)
+
+// parseLlamaServerDevices parses the combined output of llama-server discovery.
+// It extracts device info, ROCm gfx targets, CUDA compute capabilities, and
+// CUDA compiled architecture lists.
+func parseLlamaServerDevices(output string, libDirs []string) []ml.DeviceInfo {
+	return parseLlamaServerDevicesWithNative(output, libDirs, nil)
+}
+
+func parseLlamaServerDevicesWithNative(output string, libDirs []string, nativeDevices []nativeProbeDevice) []ml.DeviceInfo {
+	// Extract per-device metadata from stderr
+	gfxByIndex := parseROCmGFXTargets(output)
+	rocmGFXOverride := hsaOverrideGFXTarget()
+	integratedByIndex := parseVulkanUMA(output)
+	ccByIndex := make(map[int]cudaComputeCapability)
+	var cudaArchs []string // compiled architectures for this variant
+	nativeByIndex := nativeProbeByLibraryIndex(nativeDevices)
+	for idx, dev := range nativeByIndex["ROCm"] {
+		if rocmGFXOverride != "" {
+			gfxByIndex[idx] = rocmGFXOverride
+		} else if dev.GFXTarget != "" {
+			gfxByIndex[idx] = dev.GFXTarget
+		}
+	}
+
+	scanner := bufio.NewScanner(strings.NewReader(output))
+	for scanner.Scan() {
+		line := scanner.Text()
+		if matches := cudaCCRegex.FindStringSubmatch(line); matches != nil {
+			idx, _ := strconv.Atoi(matches[1])
+			major, _ := strconv.Atoi(matches[2])
+			minor, _ := strconv.Atoi(matches[3])
+			ccByIndex[idx] = cudaComputeCapability{
+				major: major,
+				minor: minor,
+				arch:  fmt.Sprintf("%d%d0", major, minor),
+			}
+		}
+		if matches := cudaArchsRegex.FindStringSubmatch(line); matches != nil {
+			cudaArchs = strings.Split(matches[1], ",")
+		}
+	}
+	if cudaDevices := nativeByIndex["CUDA"]; len(cudaDevices) > 0 {
+		for idx, dev := range cudaDevices {
+			if dev.ComputeMajor <= 0 {
+				continue
+			}
+			ccByIndex[idx] = cudaComputeCapability{
+				major: dev.ComputeMajor,
+				minor: dev.ComputeMinor,
+				arch:  fmt.Sprintf("%d%d0", dev.ComputeMajor, dev.ComputeMinor),
+			}
+		}
+	}
+
+	// Validate CUDA devices against compiled architectures
+	cudaArchSet := make(map[string]bool, len(cudaArchs))
+	for _, arch := range cudaArchs {
+		cudaArchSet[strings.TrimSpace(arch)] = true
+	}
+	cudaRuntimeMajor, cudaRuntimeMinor, hasCUDARuntime := cudaRuntimeVersion(libDirs)
+
+	// Parse stdout device lines
+	var devices []ml.DeviceInfo
+	deviceIndex := 0
+	scanner = bufio.NewScanner(strings.NewReader(output))
+	for scanner.Scan() {
+		matches := deviceLineRegex.FindStringSubmatch(scanner.Text())
+		if matches == nil {
+			continue
+		}
+
+		name := matches[1]
+		description := matches[2]
+		totalMiB, _ := strconv.ParseUint(matches[3], 10, 64)
+		freeMiB, _ := strconv.ParseUint(matches[4], 10, 64)
+		library := inferLibrary(name, description)
+
+		// Skip pseudo-devices like BLAS/Accelerate that report zero memory.
+		// These are CPU math libraries, not real GPUs — they shouldn't appear
+		// as inference compute devices or inflate the scheduler's GPU count.
+		if totalMiB == 0 {
+			slog.Debug("skipping pseudo-device with zero memory", "name", name, "description", description)
+			deviceIndex++
+			continue
+		}
+
+		// For CUDA devices, check if this variant supports the device's CC
+		if library == "CUDA" {
+			cc, ok := ccByIndex[deviceIndex]
+			if ok && len(cudaArchSet) > 0 {
+				if !cudaArchSet[cc.arch] {
+					slog.Info("skipping CUDA device — compute capability not in compiled architectures",
+						"device", description, "cc", cc.arch, "archs", cudaArchs,
+						"libDirs", libDirs)
+					deviceIndex++
+					continue
+				}
+			} else if !ok {
+				slog.Warn("llama-server discovery: could not determine compute capability for CUDA device — "+
+					"architecture filtering disabled for this device. If inference crashes, "+
+					"check that the CUDA backend supports this GPU.",
+					"device", description, "libDirs", libDirs)
+			} else if len(cudaArchSet) == 0 {
+				slog.Warn("llama-server discovery: could not determine compiled CUDA architectures — "+
+					"architecture filtering disabled. If inference crashes on older GPUs, "+
+					"check llama-server system_info output for ARCHS.",
+					"device", description, "libDirs", libDirs)
+			}
+		}
+
+		nativeDevice, hasNativeDevice := nativeByIndex[library][deviceIndex]
+		totalBytes := totalMiB * 1024 * 1024
+		if hasNativeDevice && !nativeProbeMatchesLlamaServerDevice(library, description, totalBytes, nativeDevice) {
+			hasNativeDevice = false
+		}
+		computeMajor, computeMinor := computeVersion(library, deviceIndex, gfxByIndex, ccByIndex)
+		dev := ml.DeviceInfo{
+			DeviceID: ml.DeviceID{
+				ID:      strconv.Itoa(deviceIndex),
+				Library: library,
+			},
+			Name:         name,
+			Description:  description,
+			TotalMemory:  totalBytes,
+			FreeMemory:   freeMiB * 1024 * 1024,
+			ComputeMajor: computeMajor,
+			ComputeMinor: computeMinor,
+			LibraryPath:  libDirs,
+			GFXTarget:    gfxByIndex[deviceIndex],
+			Integrated:   isIntegratedLlamaServerDevice(library, deviceIndex, integratedByIndex),
+		}
+		if hasNativeDevice {
+			if nativeDevice.DeviceID != "" {
+				dev.PCIID = nativeDevice.DeviceID
+			}
+			if nativeDevice.IntegratedKnown {
+				dev.Integrated = nativeDevice.Integrated
+			} else {
+				dev.Integrated = dev.Integrated || nativeDevice.Integrated
+			}
+			if dev.ComputeMajor == 0 && nativeDevice.ComputeMajor > 0 {
+				dev.ComputeMajor = nativeDevice.ComputeMajor
+				dev.ComputeMinor = nativeDevice.ComputeMinor
+			}
+			if nativeDevice.CUDADriverMajor > 0 {
+				dev.DriverMajor = nativeDevice.CUDADriverMajor
+				dev.DriverMinor = nativeDevice.CUDADriverMinor
+			}
+			if nativeDevice.NVIDIADriverMajor > 0 {
+				dev.NVIDIADriverMajor = nativeDevice.NVIDIADriverMajor
+			}
+			setROCmGFXTarget(&dev, nativeDevice.GFXTarget)
+		}
+		setROCmGFXTarget(&dev, rocmGFXOverride)
+		if library == "CUDA" && dev.DriverMajor == 0 && hasCUDARuntime {
+			dev.DriverMajor = cudaRuntimeMajor
+			dev.DriverMinor = cudaRuntimeMinor
+		}
+
+		devices = append(devices, dev)
+		deviceIndex++
+	}
+
+	return refineLlamaServerDevices(devices, libDirs)
+}
+
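+// nativeProbeMatchesLlamaServerDevice reports whether native probe metadata may
+// be applied to the llama-server device found at the same index. Only Vulkan
+// is cross-checked: the descriptions must be similar and, when the native
+// total memory is known, the memory sizes must be similar too. Other libraries
+// are trusted by index.
+func nativeProbeMatchesLlamaServerDevice(library, description string, totalBytes uint64, nativeDevice nativeProbeDevice) bool {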
+	if library != "Vulkan" {
+		return true
+	}
+
+	nativeDescription := nativeDevice.Description
+	if nativeDescription == "" {
+		nativeDescription = nativeDevice.Name
+	}
+	if nativeDescription == "" || !ml.SimilarDeviceDescription(description, nativeDescription) {
+		slog.Debug("skipping Vulkan native metadata with mismatched device name",
+			"llama_server_name", description,
+			"native_name", nativeDescription)
+		return false
+	}
+	if nativeDevice.TotalMemory != 0 && !ml.SimilarDeviceMemory(totalBytes, nativeDevice.TotalMemory) {
+		slog.Debug("skipping Vulkan native metadata with mismatched memory",
+			"llama_server_name", description,
+			"llama_server_total", totalBytes,
+			"native_total", nativeDevice.TotalMemory)
+		return false
+	}
+
+	return true
+}
+
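+// cudaRuntimeVersion returns the highest CUDA runtime version it can infer from
+// libDirs. Candidates come from libcudart.so.* and cudart64_*.dll file names
+// and from a directory named cuda_vN, which counts as version N.0. The final
+// result is false when no candidate is found.
+func cudaRuntimeVersion(libDirs []string) (int, int, bool) {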
+	bestMajor, bestMinor := -1, -1
+	update := func(major, minor int) {
+		if major > bestMajor || (major == bestMajor && minor > bestMinor) {
+			bestMajor, bestMinor = major, minor
+		}
+	}
+
+	for _, dir := range libDirs {
+		for _, entry := range readDirNames(dir) {
+			if matches := cudaRuntimeSORegex.FindStringSubmatch(entry); matches != nil {
+				major, _ := strconv.Atoi(matches[1])
+				minor := 0
+				if matches[2] != "" {
+					minor, _ = strconv.Atoi(matches[2])
+				}
+				update(major, minor)
+			}
+			if matches := cudaRuntimeDLLRegex.FindStringSubmatch(entry); matches != nil {
+				major, _ := strconv.Atoi(matches[1])
+				minor, _ := strconv.Atoi(matches[2])
+				update(major, minor)
+			}
+		}
+
+		if matches := cudaRuntimeDirRegex.FindStringSubmatch(filepath.Base(dir)); matches != nil {
+			major, _ := strconv.Atoi(matches[1])
+			update(major, 0)
+		}
+	}
+
+	if bestMajor < 0 {
+		return 0, 0, false
+	}
+	return bestMajor, bestMinor, true
+}
+
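+// readDirNames returns the names of the entries in dir, or nil if the
+// directory cannot be read.
+func readDirNames(dir string) []string {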
+	entries, err := os.ReadDir(dir)
+	if err != nil {
+		return nil
+	}
+	names := make([]string, 0, len(entries))
+	for _, entry := range entries {
+		names = append(names, entry.Name())
+	}
+	return names
+}
+
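+// cudaComputeCapability is a CUDA compute capability. arch is the major and
+// minor digits followed by 0, the form used in the ARCHS list (8.9 is "890").
+type cudaComputeCapability struct {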
+	major int
+	minor int
+	arch  string
+}
+
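+// computeVersion returns the compute major and minor for a device: the parsed
+// compute capability for CUDA, or the values parseGFXTarget derives from the
+// gfx target for ROCm. It returns 0, 0 for other libraries and for CUDA
+// devices with no known capability.
+func computeVersion(library string, deviceIndex int, gfxByIndex map[int]string, ccByIndex map[int]cudaComputeCapability) (int, int) {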
+	switch library {
+	case "CUDA":
+		if cc, ok := ccByIndex[deviceIndex]; ok {
+			return cc.major, cc.minor
+		}
+	case "ROCm":
+		return parseGFXTarget(gfxByIndex[deviceIndex])
+	}
+	return 0, 0
+}
+
+// inferLibrary determines the GPU library type from the llama-server device name and description.
+func inferLibrary(name, description string) string {
+	combined := strings.ToLower(name + " " + description)
+	switch {
+	case strings.Contains(combined, "cuda"):
+		return "CUDA"
+	case strings.Contains(combined, "rocm") || strings.Contains(combined, "hip"):
+		return "ROCm"
+	case strings.Contains(combined, "metal") || strings.Contains(combined, "apple"):
+		return "Metal"
+	case strings.Contains(combined, "vulkan"):
+		return "Vulkan"
+	default:
+		return description
+	}
+}
+
+func isIntegratedLlamaServerDevice(library string, deviceIndex int, integratedByIndex map[int]bool) bool {
+	if library == "Vulkan" && integratedByIndex[deviceIndex] {
+		return true
+	}
+
+	// llama-server discovery does not expose a stable backend device-type field,
+	// so we only infer "integrated" here for cases where the contract is strong:
+	// explicit Vulkan UMA metadata, or the single Apple Silicon Metal device.
+	//
+	// Other backends stay unclassified unless discovery provides a stronger
+	// signal. That keeps scheduling conservative instead of guessing from
+	// device names or backend-specific heuristics.
+	return library == "Metal" && runtime.GOOS == "darwin" && runtime.GOARCH == "arm64"
+}
+
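+// llamaServerBootstrapDevicesWithStatus runs llama-server discovery and, when
+// at least one ROCm device is present, passes the result through
+// filterUnsupportedROCmDevices. Discovery errors are returned unchanged.
+func llamaServerBootstrapDevicesWithStatus(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {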
+	devices, status, err := llamaServerDiscoverDevices(ctx, ollamaLibDirs, extraEnvs)
+	if err != nil {
+		return devices, status, err
+	}
+
+	hasROCm := false
+	for _, d := range devices {
+		if d.Library == "ROCm" {
+			hasROCm = true
+			break
+		}
+	}
+	if !hasROCm {
+		return devices, status, nil
+	}
+
+	return filterUnsupportedROCmDevices(devices, ollamaLibDirs), status, nil
+}
+
+// Blank declaration that keeps the io package referenced. It has no runtime
+// effect and does not read from any pipe.
+var _ io.Reader
--- a/discover/llama_server_test.go
+++ b/discover/llama_server_test.go
@@ -0,0 +1,515 @@
+package discover
+
+import (
+	"io"
+	"log/slog"
+	"os"
+	"path/filepath"
+	"testing"
+
+	"github.com/ollama/ollama/logutil"
+	"github.com/ollama/ollama/ml"
+)
+
+func TestLlamaServerDiscovery(t *testing.T) {
+	originalProbe := probeLlamaServerVulkanDevices
+	probeLlamaServerVulkanDevices = func(_ []string) ([]vulkanPhysicalDevice, error) {
+		return nil, errWindowsVulkanProbeUnsupported
+	}
+	t.Cleanup(func() {
+		probeLlamaServerVulkanDevices = originalProbe
+	})
+
+	t.Run("output only trace", func(t *testing.T) {
+		original := slog.Default()
+		t.Cleanup(func() {
+			slog.SetDefault(original)
+		})
+
+		slog.SetDefault(logutil.NewLogger(io.Discard, slog.LevelDebug))
+		if got := llamaServerDiscoveryOutput(t.Context()); got != io.Discard {
+			t.Fatal("debug logging should discard raw llama-server discovery output")
+		}
+
+		slog.SetDefault(logutil.NewLogger(io.Discard, logutil.LevelTrace))
+		if got := llamaServerDiscoveryOutput(t.Context()); got == io.Discard {
+			t.Fatal("trace logging should emit raw llama-server discovery output")
+		}
+	})
+
+	t.Run("parse devices", func(t *testing.T) {
+		type wantDevice struct {
+			name            string
+			library         string
+			totalMiB        uint64
+			compute         string
+			driver          string
+			gfxTarget       string
+			checkIntegrated bool
+			integrated      bool
+		}
+
+		tests := []struct {
+			name    string
+			output  string
+			libDirs []string
+			want    []wantDevice
+		}{
+			{
+				name: "NVIDIA CUDA",
+				output: `load_backend: loaded CUDA backend from /lib/ollama/cuda_v12/libggml-cuda.so
+Available devices:
+  NVIDIA GeForce RTX 4090: NVIDIA CUDA (24564 MiB, 23592 MiB free)
+`,
+				libDirs: []string{"/lib/ollama", "/lib/ollama/cuda_v12"},
+				want: []wantDevice{{
+					name:     "NVIDIA GeForce RTX 4090",
+					library:  "CUDA",
+					totalMiB: 24564,
+					driver:   "12.0",
+				}},
+			},
+			{
+				name: "Metal",
+				output: `Available devices:
+  Metal: Apple M3 Max (98304 MiB, 98303 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "Metal",
+					library:  "Metal",
+					totalMiB: 98304,
+				}},
+			},
+			{
+				name: "ROCm with gfx target",
+				output: `  Device 0: AMD Radeon RX 6700 XT, gfx1031 (0x1031), VMM: no, Wave Size: 32, VRAM: 12272 MiB
+Available devices:
+  ROCm0: AMD Radeon RX 6700 XT (12272 MiB, 12248 MiB free)
+`,
+				libDirs: []string{"/lib/ollama", "/lib/ollama/rocm_v7_2"},
+				want: []wantDevice{{
+					name:      "ROCm0",
+					library:   "ROCm",
+					totalMiB:  12272,
+					compute:   "gfx1031",
+					gfxTarget: "gfx1031",
+				}},
+			},
+			{
+				name: "multi GPU",
+				output: `Available devices:
+  CUDA0: NVIDIA GeForce RTX 4090 (24564 MiB, 23592 MiB free)
+  CUDA1: NVIDIA GeForce RTX 3060 (12288 MiB, 11500 MiB free)
+`,
+				libDirs: []string{"/lib/ollama", "/lib/ollama/cuda_v12"},
+				want: []wantDevice{
+					{name: "CUDA0", library: "CUDA", totalMiB: 24564},
+					{name: "CUDA1", library: "CUDA", totalMiB: 12288},
+				},
+			},
+			{
+				name: "Vulkan UMA",
+				output: `ggml_vulkan: 0 = Intel(R) Graphics (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
+Available devices:
+  Vulkan0: Intel(R) Graphics (16384 MiB, 12288 MiB free)
+`,
+				libDirs: []string{"/lib/ollama", "/lib/ollama/vulkan"},
+				want: []wantDevice{{
+					name:            "Vulkan0",
+					library:         "Vulkan",
+					totalMiB:        16384,
+					checkIntegrated: true,
+					integrated:      true,
+				}},
+			},
+			{
+				name: "Vulkan without UMA metadata",
+				output: `Available devices:
+  Vulkan0: AMD Radeon(TM) Graphics (32768 MiB, 31000 MiB free)
+`,
+				libDirs: []string{"/lib/ollama", "/lib/ollama/vulkan"},
+				want: []wantDevice{{
+					name:            "Vulkan0",
+					library:         "Vulkan",
+					totalMiB:        32768,
+					checkIntegrated: true,
+				}},
+			},
+			{
+				name: "CUDA device filtered by compiled archs",
+				output: `ggml_cuda_init: found 1 CUDA devices (Total VRAM: 6063 MiB):
+  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
+load_backend: loaded CUDA backend from /lib/ollama/cuda_v13/libggml-cuda.so
+system_info: n_threads = 4 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 |
+Available devices:
+  CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
+`,
+				libDirs: []string{"/lib/ollama", "/lib/ollama/cuda_v13"},
+			},
+			{
+				name: "CUDA device kept by compiled archs",
+				output: `ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16379 MiB):
+  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
+system_info: n_threads = 16 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 |
+Available devices:
+  CUDA0: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "CUDA0",
+					library:  "CUDA",
+					totalMiB: 16379,
+					compute:  "8.9",
+				}},
+			},
+			{
+				name: "CUDA without compiled archs fails open",
+				output: `ggml_cuda_init: found 1 CUDA devices (Total VRAM: 6063 MiB):
+  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
+Available devices:
+  CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "CUDA0",
+					library:  "CUDA",
+					totalMiB: 6063,
+					compute:  "6.1",
+				}},
+			},
+			{
+				name: "CUDA without compute capability fails open",
+				output: `system_info: n_threads = 4 | CUDA : ARCHS = 750,800 |
+Available devices:
+  CUDA0: Some Future GPU (8192 MiB, 8000 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "CUDA0",
+					library:  "CUDA",
+					totalMiB: 8192,
+				}},
+			},
+			{
+				name: "CUDA mixed arch support",
+				output: `ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
+  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
+system_info: n_threads = 8 | CUDA : ARCHS = 750,800,860,890 |
+Available devices:
+  CUDA0: NVIDIA GeForce GTX 1060 (6063 MiB, 5900 MiB free)
+  CUDA1: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "CUDA1",
+					library:  "CUDA",
+					totalMiB: 16379,
+					compute:  "8.9",
+				}},
+			},
+			{
+				name: "ROCm gfx target with xnack suffix",
+				output: `ggml_cuda_init: found 2 ROCm devices (Total VRAM: 32736 MiB):
+  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 16368 MiB
+  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
+Available devices:
+  ROCm0: AMD Radeon RX 6800 (16368 MiB, 16342 MiB free)
+  ROCm1: AMD Radeon Pro VII (16368 MiB, 16348 MiB free)
+`,
+				want: []wantDevice{
+					{name: "ROCm0", library: "ROCm", totalMiB: 16368, compute: "gfx1030", gfxTarget: "gfx1030"},
+					{name: "ROCm1", library: "ROCm", totalMiB: 16368, compute: "gfx906", gfxTarget: "gfx906"},
+				},
+			},
+			{
+				name: "unknown library",
+				output: `Available devices:
+  Future0: Mystery Accelerator (8192 MiB, 8000 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "Future0",
+					library:  "Mystery Accelerator",
+					totalMiB: 8192,
+				}},
+			},
+			{
+				name:   "no devices",
+				output: "Available devices:\n",
+			},
+			{
+				name: "empty output",
+			},
+		}
+
+		for _, tt := range tests {
+			t.Run(tt.name, func(t *testing.T) {
+				if tt.libDirs == nil {
+					tt.libDirs = []string{"/lib/ollama"}
+				}
+				devices := parseLlamaServerDevices(tt.output, tt.libDirs)
+				if len(devices) != len(tt.want) {
+					t.Fatalf("got %d devices, want %d", len(devices), len(tt.want))
+				}
+				for i, want := range tt.want {
+					got := devices[i]
+					if want.name != "" && got.Name != want.name {
+						t.Errorf("device %d name = %q, want %q", i, got.Name, want.name)
+					}
+					if want.library != "" && got.Library != want.library {
+						t.Errorf("device %d library = %q, want %q", i, got.Library, want.library)
+					}
+					if want.totalMiB > 0 && got.TotalMemory != want.totalMiB*1024*1024 {
+						t.Errorf("device %d total memory = %d, want %d MiB", i, got.TotalMemory, want.totalMiB)
+					}
+					if want.compute != "" && got.Compute() != want.compute {
+						t.Errorf("device %d compute = %q, want %q", i, got.Compute(), want.compute)
+					}
+					if want.driver != "" && got.Driver() != want.driver {
+						t.Errorf("device %d driver = %q, want %q", i, got.Driver(), want.driver)
+					}
+					if want.gfxTarget != "" && got.GFXTarget != want.gfxTarget {
+						t.Errorf("device %d gfx target = %q, want %q", i, got.GFXTarget, want.gfxTarget)
+					}
+					if want.checkIntegrated && got.Integrated != want.integrated {
+						t.Errorf("device %d integrated = %v, want %v", i, got.Integrated, want.integrated)
+					}
+				}
+			})
+		}
+	})
+
+	t.Run("parse fixtures", func(t *testing.T) {
+		type wantDevice struct {
+			name       string
+			library    string
+			totalMiB   uint64
+			compute    string
+			gfxTarget  string
+			integrated bool
+		}
+
+		tests := []struct {
+			name     string
+			output   string
+			libDirs  []string
+			want     []wantDevice
+			wantSkip string
+		}{
+			{
+				name: "cuda mixed archs filters unsupported device",
+				output: `ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1, VMM: yes, VRAM: 6063 MiB
+  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 16379 MiB
+system_info: n_threads = 8 | CUDA : ARCHS = 750,800,860,890 |
+Available devices:
+  CUDA0: NVIDIA GeForce GTX 1060 (6063 MiB, 5900 MiB free)
+  CUDA1: NVIDIA GeForce RTX 4060 Ti (16379 MiB, 14900 MiB free)
+`,
+				want: []wantDevice{{
+					name:     "CUDA1",
+					library:  "CUDA",
+					totalMiB: 16379,
+					compute:  "8.9",
+				}},
+			},
+			{
+				name: "rocm gfx targets preserve suffix-free compute",
+				output: `ggml_cuda_init: found 2 ROCm devices (Total VRAM: 32736 MiB):
+  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 16368 MiB
+  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 16368 MiB
+Available devices:
+  ROCm0: AMD Radeon RX 6800 (16368 MiB, 16342 MiB free)
+  ROCm1: AMD Radeon Pro VII (16368 MiB, 16348 MiB free)
+`,
+				want: []wantDevice{
+					{name: "ROCm0", library: "ROCm", totalMiB: 16368, compute: "gfx1030", gfxTarget: "gfx1030"},
+					{name: "ROCm1", library: "ROCm", totalMiB: 16368, compute: "gfx906", gfxTarget: "gfx906"},
+				},
+			},
+			{
+				name: "vulkan uma marks integrated",
+				output: `ggml_vulkan: 0 = Intel(R) Graphics (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
+Available devices:
+  Vulkan0: Intel(R) Graphics (16384 MiB, 12288 MiB free)
+`,
+				want: []wantDevice{{
+					name:       "Vulkan0",
+					library:    "Vulkan",
+					totalMiB:   16384,
+					integrated: true,
+				}},
+			},
+			{
+				name: "windows vulkan without uma stays unclassified",
+				output: `load_backend: loaded Vulkan backend from C:\ollama\lib\ollama\vulkan\ggml-vulkan.dll
+Available devices:
+  Vulkan0: AMD Radeon(TM) Graphics (32768 MiB, 31000 MiB free)
+  Vulkan1: AMD Radeon RX 7900 XTX (24564 MiB, 23000 MiB free)
+`,
+				want: []wantDevice{
+					{name: "Vulkan0", library: "Vulkan", totalMiB: 32768},
+					{name: "Vulkan1", library: "Vulkan", totalMiB: 24564},
+				},
+			},
+		}
+
+		for _, tt := range tests {
+			t.Run(tt.name, func(t *testing.T) {
+				libDirs := tt.libDirs
+				if libDirs == nil {
+					libDirs = []string{"/lib/ollama"}
+				}
+
+				got := parseLlamaServerDevices(tt.output, libDirs)
+				if len(got) != len(tt.want) {
+					t.Fatalf("got %d devices, want %d", len(got), len(tt.want))
+				}
+				for i, want := range tt.want {
+					if got[i].Name != want.name {
+						t.Fatalf("device %d name = %q, want %q", i, got[i].Name, want.name)
+					}
+					if got[i].Library != want.library {
+						t.Fatalf("device %d library = %q, want %q", i, got[i].Library, want.library)
+					}
+					if got[i].TotalMemory != want.totalMiB*1024*1024 {
+						t.Fatalf("device %d total memory = %d, want %d MiB", i, got[i].TotalMemory, want.totalMiB)
+					}
+					if want.compute != "" && got[i].Compute() != want.compute {
+						t.Fatalf("device %d compute = %q, want %q", i, got[i].Compute(), want.compute)
+					}
+					if want.gfxTarget != "" && got[i].GFXTarget != want.gfxTarget {
+						t.Fatalf("device %d gfx target = %q, want %q", i, got[i].GFXTarget, want.gfxTarget)
+					}
+					if got[i].Integrated != want.integrated {
+						t.Fatalf("device %d integrated = %v, want %v", i, got[i].Integrated, want.integrated)
+					}
+				}
+			})
+		}
+	})
+
+	t.Run("skips mismatched Vulkan native metadata", func(t *testing.T) {
+		output := `Available devices:
+  Vulkan0: Intel(R) UHD Graphics 770 (32768 MiB, 31000 MiB free)
+`
+		nativeDevices := []nativeProbeDevice{{
+			Library:             "Vulkan",
+			Index:               0,
+			IndexMatchesBackend: true,
+			Description:         "NVIDIA GeForce RTX 4060 Ti",
+			DeviceID:            "0000:05:00.0",
+			IntegratedKnown:     true,
+			TotalMemory:         16107 * 1024 * 1024,
+		}}
+
+		devices := parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/vulkan"}, nativeDevices)
+		if len(devices) != 1 {
+			t.Fatalf("got %d devices, want 1", len(devices))
+		}
+		if devices[0].PCIID != "" {
+			t.Fatalf("PCIID = %q, want empty", devices[0].PCIID)
+		}
+		if devices[0].Integrated {
+			t.Fatal("Integrated = true, want false")
+		}
+	})
+
+	t.Run("cuda runtime version", func(t *testing.T) {
+		dir := t.TempDir()
+		if err := os.WriteFile(filepath.Join(dir, "libcudart.so.12.8.90"), nil, 0o644); err != nil {
+			t.Fatal(err)
+		}
+
+		major, minor, ok := cudaRuntimeVersion([]string{dir})
+		if !ok || major != 12 || minor != 8 {
+			t.Fatalf("cudaRuntimeVersion = %d.%d, %v, want 12.8, true", major, minor, ok)
+		}
+
+		major, minor, ok = cudaRuntimeVersion([]string{filepath.Join(t.TempDir(), "cuda_v13")})
+		if !ok || major != 13 || minor != 0 {
+			t.Fatalf("cudaRuntimeVersion fallback = %d.%d, %v, want 13.0, true", major, minor, ok)
+		}
+	})
+
+	t.Run("refine windows vulkan devices", func(t *testing.T) {
+		makeDevices := func() []ml.DeviceInfo {
+			return []ml.DeviceInfo{
+				{DeviceID: ml.DeviceID{ID: "0", Library: "Vulkan"}, Description: "AMD Radeon(TM) Graphics"},
+				{DeviceID: ml.DeviceID{ID: "1", Library: "Vulkan"}, Description: "AMD Radeon RX 7900 XTX"},
+				{DeviceID: ml.DeviceID{ID: "0", Library: "CUDA"}, Description: "NVIDIA GeForce RTX 4090"},
+			}
+		}
+
+		tests := []struct {
+			name    string
+			devices []ml.DeviceInfo
+			probed  []vulkanPhysicalDevice
+			want    []bool
+			applied bool
+		}{
+			{
+				name: "fills missing integrated bit",
+				probed: []vulkanPhysicalDevice{
+					{Name: "AMD Radeon(TM) Graphics", Integrated: true},
+					{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
+				},
+				want:    []bool{true, false, false},
+				applied: true,
+			},
+			{
+				name: "matches names when order differs",
+				probed: []vulkanPhysicalDevice{
+					{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
+					{Name: "AMD Radeon(TM) Graphics", Integrated: true},
+				},
+				want:    []bool{true, false, false},
+				applied: true,
+			},
+			{
+				name: "skips when names do not line up",
+				probed: []vulkanPhysicalDevice{
+					{Name: "Wrong GPU", Integrated: true},
+					{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
+				},
+				want: []bool{false, false, false},
+			},
+			{
+				name:   "skips when counts do not line up",
+				probed: []vulkanPhysicalDevice{{Name: "AMD Radeon(TM) Graphics", Integrated: true}},
+				want:   []bool{false, false, false},
+			},
+			{
+				name: "overwrites stale classification",
+				devices: []ml.DeviceInfo{
+					{DeviceID: ml.DeviceID{ID: "0", Library: "Vulkan"}, Description: "AMD Radeon(TM) Graphics", Integrated: true},
+					{DeviceID: ml.DeviceID{ID: "1", Library: "Vulkan"}, Description: "AMD Radeon RX 7900 XTX"},
+				},
+				probed: []vulkanPhysicalDevice{
+					{Name: "AMD Radeon(TM) Graphics", Integrated: false},
+					{Name: "AMD Radeon RX 7900 XTX", Integrated: false},
+				},
+				want:    []bool{false, false},
+				applied: true,
+			},
+		}
+
+		for _, tt := range tests {
+			t.Run(tt.name, func(t *testing.T) {
+				devices := tt.devices
+				if devices == nil {
+					devices = makeDevices()
+				}
+				applied := applyWindowsVulkanRefinement(devices, tt.probed)
+				if applied != tt.applied {
+					t.Fatalf("applied = %v, want %v", applied, tt.applied)
+				}
+				got := devices
+				if len(got) != len(tt.want) {
+					t.Fatalf("got %d devices, want %d", len(got), len(tt.want))
+				}
+				for i, want := range tt.want {
+					if got[i].Integrated != want {
+						t.Fatalf("device %d integrated = %v, want %v", i, got[i].Integrated, want)
+					}
+				}
+			})
+		}
+	})
+}
--- a/discover/native_probe.go
+++ b/discover/native_probe.go
@@ -0,0 +1,263 @@
+package discover
+
+import (
+	"bytes"
+	"context"
+	"encoding/json"
+	"io"
+	"log/slog"
+	"os"
+	"os/exec"
+	"runtime"
+	"strings"
+	"time"
+
+	"github.com/ollama/ollama/llm"
+	"github.com/ollama/ollama/ml"
+)
+
+// Native GPU discovery runs in a short-lived Ollama subprocess so that loading
+// GGML and driver libraries cannot crash the main server process. The
+// subprocess reserves stdout for JSON and lets GGML's default logger write to
+// stderr; the parent captures that stderr for trace/debug diagnostics.
+
+// nativeProbeTimeout bounds how long the discovery subprocess may run.
+const nativeProbeTimeout = 15 * time.Second
+
+type nativeProbeDevice struct {
+	Library string `json:"library"`
+	Index   int    `json:"index"`
+	// IndexMatchesBackend means Index is in the same visible-device order that
+	// llama-server reports, so it is safe to correlate when PCI ID is missing.
+	IndexMatchesBackend bool   `json:"index_matches_backend,omitempty"`
+	Name                string `json:"name,omitempty"`
+	Description         string `json:"description,omitempty"`
+	DeviceID            string `json:"device_id,omitempty"`
+	Integrated          bool   `json:"integrated,omitempty"`
+	IntegratedKnown     bool   `json:"integrated_known"`
+	TotalMemory         uint64 `json:"total_memory,omitempty"`
+	FreeMemory          uint64 `json:"free_memory,omitempty"`
+	ComputeMajor        int    `json:"compute_major,omitempty"`
+	ComputeMinor        int    `json:"compute_minor,omitempty"`
+	CUDADriverMajor     int    `json:"cuda_driver_major,omitempty"`
+	CUDADriverMinor     int    `json:"cuda_driver_minor,omitempty"`
+	NVIDIADriverMajor   int    `json:"nvidia_driver_major,omitempty"`
+	GFXTarget           string `json:"gfx_target,omitempty"`
+}
+
+type nativeProbeResult struct {
+	Devices []nativeProbeDevice `json:"devices"`
+}
+
+type ggmlBackendDevCaps struct {
+	Async             uint8
+	HostBuffer        uint8
+	BufferFromHostPtr uint8
+	Events            uint8
+}
+
+type ggmlBackendDevProps struct {
+	Name        uintptr
+	Description uintptr
+	MemoryFree  uintptr
+	MemoryTotal uintptr
+	Type        int32
+	_           [4]byte
+	DeviceID    uintptr
+	Caps        ggmlBackendDevCaps
+	_           [4]byte
+}
+
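+// discoverNativeDevices runs the current executable with the gpu-discover
+// subcommand and decodes the JSON it prints on stdout. The captured stderr is
+// returned alongside the devices, including on failure. On platforms other
+// than Linux and Windows it returns no devices and no error.
+func discoverNativeDevices(ctx context.Context, llamaServer string, libDirs []string, extraEnvs map[string]string) ([]nativeProbeDevice, string, error) {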
+	if runtime.GOOS != "linux" && runtime.GOOS != "windows" {
+		return nil, "", nil
+	}
+
+	exe, err := os.Executable()
+	if err != nil {
+		return nil, "", err
+	}
+
+	ctx, cancel := context.WithTimeout(ctx, nativeProbeTimeout)
+	defer cancel()
+
+	args := []string{"gpu-discover"}
+	for _, dir := range libDirs {
+		args = append(args, "--lib-dir", dir)
+	}
+	cmd := exec.CommandContext(ctx, exe, args...)
+	cmd.WaitDelay = llamaServerDiscoveryWaitDelay
+	llm.SetupLlamaServerCommandEnv(cmd, llamaServer, libDirs, extraEnvs)
+
+	var stderr bytes.Buffer
+	cmd.Stderr = &stderr
+	stdout, err := cmd.Output()
+	if err != nil {
+		if ctx.Err() != nil {
+			return nil, stderr.String(), ctx.Err()
+		}
+		return nil, stderr.String(), err
+	}
+
+	var result nativeProbeResult
+	if err := json.Unmarshal(stdout, &result); err != nil {
+		return nil, stderr.String(), err
+	}
+
+	return result.Devices, stderr.String(), nil
+}
+
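+// RunNativeProbeCommand runs the native GPU probe and writes the result to out
+// as JSON. When libDirs is empty it defaults to ml.LibOllamaPath.
+func RunNativeProbeCommand(ctx context.Context, libDirs []string, out io.Writer) error {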
+	if len(libDirs) == 0 {
+		libDirs = []string{ml.LibOllamaPath}
+	}
+
+	devices, err := runNativeProbe(ctx, libDirs)
+	if err != nil {
+		return err
+	}
+
+	return json.NewEncoder(out).Encode(nativeProbeResult{Devices: devices})
+}
+
+func runNativeProbe(ctx context.Context, libDirs []string) ([]nativeProbeDevice, error) {
+	return runPlatformNativeProbe(ctx, libDirs)
+}
+
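+// mergeNativeProbeDevices merges supplement into a copy of base. Entries match
+// when the libraries are equal and either both PCI device IDs are set and
+// equal, or both indexes are in backend order and equal. An unmatched
+// supplement entry is appended only if its index is in backend order and no
+// existing entry already uses that library and index.
+func mergeNativeProbeDevices(base, supplement []nativeProbeDevice) []nativeProbeDevice {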
+	if len(base) == 0 {
+		var out []nativeProbeDevice
+		for _, extra := range supplement {
+			if extra.IndexMatchesBackend {
+				out = append(out, extra)
+			}
+		}
+		return out
+	}
+
+	out := append([]nativeProbeDevice(nil), base...)
+	for _, extra := range supplement {
+		idx := -1
+		for i := range out {
+			if sameNativeProbeDevice(out[i], extra) {
+				idx = i
+				break
+			}
+		}
+		if idx < 0 {
+			if !extra.IndexMatchesBackend || nativeProbeLibraryIndexExists(out, extra) {
+				continue
+			}
+			out = append(out, extra)
+			continue
+		}
+		mergeNativeProbeDevice(&out[idx], extra)
+	}
+	return out
+}
+
+func sameNativeProbeDevice(a, b nativeProbeDevice) bool {
+	if !strings.EqualFold(a.Library, b.Library) {
+		return false
+	}
+	if a.DeviceID != "" && b.DeviceID != "" {
+		return strings.EqualFold(a.DeviceID, b.DeviceID)
+	}
+	if !a.IndexMatchesBackend || !b.IndexMatchesBackend {
+		return false
+	}
+	return a.Index == b.Index
+}
+
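+// mergeNativeProbeDevice fills empty fields in dst from src. The integrated
+// flag is the exception: a src value marked IntegratedKnown overrides dst.
+func mergeNativeProbeDevice(dst *nativeProbeDevice, src nativeProbeDevice) {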
+	dst.IndexMatchesBackend = dst.IndexMatchesBackend || src.IndexMatchesBackend
+	if dst.Name == "" {
+		dst.Name = src.Name
+	}
+	if dst.Description == "" {
+		dst.Description = src.Description
+	}
+	if dst.DeviceID == "" {
+		dst.DeviceID = src.DeviceID
+	}
+	if src.IntegratedKnown {
+		dst.Integrated = src.Integrated
+		dst.IntegratedKnown = true
+	} else if !dst.IntegratedKnown && src.Integrated {
+		dst.Integrated = true
+	}
+	if dst.TotalMemory == 0 {
+		dst.TotalMemory = src.TotalMemory
+	}
+	if dst.FreeMemory == 0 {
+		dst.FreeMemory = src.FreeMemory
+	}
+	if dst.ComputeMajor == 0 && src.ComputeMajor != 0 {
+		dst.ComputeMajor = src.ComputeMajor
+		dst.ComputeMinor = src.ComputeMinor
+	}
+	if dst.CUDADriverMajor == 0 && src.CUDADriverMajor != 0 {
+		dst.CUDADriverMajor = src.CUDADriverMajor
+		dst.CUDADriverMinor = src.CUDADriverMinor
+	}
+	if dst.NVIDIADriverMajor == 0 && src.NVIDIADriverMajor != 0 {
+		dst.NVIDIADriverMajor = src.NVIDIADriverMajor
+	}
+	if dst.GFXTarget == "" {
+		dst.GFXTarget = src.GFXTarget
+	}
+}
+
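+// nativeProbeLibraryIndexExists reports whether devices already holds an entry
+// with target's library and index. It always returns false when target's
+// index is not backend-aligned.
+func nativeProbeLibraryIndexExists(devices []nativeProbeDevice, target nativeProbeDevice) bool {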
+	if !target.IndexMatchesBackend {
+		return false
+	}
+	for _, dev := range devices {
+		if strings.EqualFold(dev.Library, target.Library) && dev.Index == target.Index {
+			return true
+		}
+	}
+	return false
+}
+
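+// nativeProbeByLibraryIndex indexes backend-aligned devices by normalized
+// library name and then by device index. Devices without a backend-aligned
+// index or a library name are skipped; a later entry overwrites an earlier
+// one with the same key.
+func nativeProbeByLibraryIndex(devices []nativeProbeDevice) map[string]map[int]nativeProbeDevice {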
+	out := map[string]map[int]nativeProbeDevice{}
+	for _, dev := range devices {
+		if !dev.IndexMatchesBackend {
+			continue
+		}
+		lib := normalizeNativeProbeLibrary(dev.Library)
+		if lib == "" {
+			continue
+		}
+		if _, ok := out[lib]; !ok {
+			out[lib] = map[int]nativeProbeDevice{}
+		}
+		out[lib][dev.Index] = dev
+	}
+	return out
+}
+
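+// normalizeNativeProbeLibrary maps a backend name to its canonical spelling
+// ("hip" and "rocm" both become "ROCm"). Unknown names are returned unchanged.
+func normalizeNativeProbeLibrary(library string) string {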
+	switch strings.ToLower(library) {
+	case "cuda":
+		return "CUDA"
+	case "hip", "rocm":
+		return "ROCm"
+	case "vulkan":
+		return "Vulkan"
+	case "metal":
+		return "Metal"
+	default:
+		return library
+	}
+}
+
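+// logNativeProbeFailure logs a failed probe at debug level, including the
+// probe's stderr when there is any. It is a no-op when err is nil.
+func logNativeProbeFailure(err error, stderr string, libDirs []string) {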
+	if err == nil {
+		return
+	}
+	if stderr != "" {
+		slog.Debug("native GPU discovery failed", "error", err, "stderr", stderr, "libDirs", libDirs)
+		return
+	}
+	slog.Debug("native GPU discovery failed", "error", err, "libDirs", libDirs)
+}
--- a/discover/native_probe_linux.go
+++ b/discover/native_probe_linux.go
@ -0,0 +1,510 @@
+//go:build linux && cgo
+
+package discover
+
+/*
+#cgo linux LDFLAGS: -ldl
+
+#include <dlfcn.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+static void * ollama_dlopen(const char * path, int global) {
+	return dlopen(path, RTLD_NOW | (global ? RTLD_GLOBAL : RTLD_LOCAL));
+}
+
+static void * ollama_dlsym(void * handle, const char * name) {
+	return dlsym(handle, name);
+}
+
+static const char * ollama_dlerror(void) {
+	const char * err = dlerror();
+	return err ? err : "";
+}
+
+typedef void * (*ollama_ggml_backend_load_fn)(const char *);
+typedef size_t (*ollama_ggml_backend_reg_dev_count_fn)(void *);
+typedef void * (*ollama_ggml_backend_reg_dev_get_fn)(void *, size_t);
+typedef const char * (*ollama_ggml_backend_reg_name_fn)(void *);
+typedef void (*ollama_ggml_backend_dev_get_props_fn)(void *, void *);
+
+static void * ollama_call_ggml_backend_load(void * fn, const char * path) {
+	return ((ollama_ggml_backend_load_fn) fn)(path);
+}
+
+static size_t ollama_call_ggml_backend_reg_dev_count(void * fn, void * reg) {
+	return ((ollama_ggml_backend_reg_dev_count_fn) fn)(reg);
+}
+
+static void * ollama_call_ggml_backend_reg_dev_get(void * fn, void * reg, size_t index) {
+	return ((ollama_ggml_backend_reg_dev_get_fn) fn)(reg, index);
+}
+
+static const char * ollama_call_ggml_backend_reg_name(void * fn, void * reg) {
+	return ((ollama_ggml_backend_reg_name_fn) fn)(reg);
+}
+
+static void ollama_call_ggml_backend_dev_get_props(void * fn, void * dev, void * props) {
+	((ollama_ggml_backend_dev_get_props_fn) fn)(dev, props);
+}
+
+static const char * ollama_cstr_from_uintptr(uintptr_t ptr) {
+	return (const char *) ptr;
+}
+
+typedef int (*ollama_cu_init_fn)(unsigned int);
+typedef int (*ollama_cu_driver_get_version_fn)(int *);
+typedef int (*ollama_cu_device_get_count_fn)(int *);
+typedef int (*ollama_cu_device_get_fn)(int *, int);
+typedef int (*ollama_cu_device_get_attribute_fn)(int *, int, int);
+typedef int (*ollama_cu_device_get_name_fn)(char *, int, int);
+typedef int (*ollama_cu_device_total_mem_fn)(size_t *, int);
+typedef int (*ollama_cu_device_get_pci_bus_id_fn)(char *, int, int);
+
+static int ollama_call_cu_init(void * fn) {
+	return ((ollama_cu_init_fn) fn)(0);
+}
+
+static int ollama_call_cu_driver_get_version(void * fn, int * version) {
+	return ((ollama_cu_driver_get_version_fn) fn)(version);
+}
+
+static int ollama_call_cu_device_get_count(void * fn, int * count) {
+	return ((ollama_cu_device_get_count_fn) fn)(count);
+}
+
+static int ollama_call_cu_device_get(void * fn, int * device, int index) {
+	return ((ollama_cu_device_get_fn) fn)(device, index);
+}
+
+static int ollama_call_cu_device_get_attribute(void * fn, int * value, int attr, int device) {
+	return ((ollama_cu_device_get_attribute_fn) fn)(value, attr, device);
+}
+
+static int ollama_call_cu_device_get_name(void * fn, char * name, int len, int device) {
+	return ((ollama_cu_device_get_name_fn) fn)(name, len, device);
+}
+
+static int ollama_call_cu_device_total_mem(void * fn, size_t * total, int device) {
+	return ((ollama_cu_device_total_mem_fn) fn)(total, device);
+}
+
+static int ollama_call_cu_device_get_pci_bus_id(void * fn, char * pci, int len, int device) {
+	return ((ollama_cu_device_get_pci_bus_id_fn) fn)(pci, len, device);
+}
+
+typedef int (*ollama_nvml_init_fn)(void);
+typedef int (*ollama_nvml_shutdown_fn)(void);
+typedef int (*ollama_nvml_system_get_driver_version_fn)(char *, unsigned int);
+
+static int ollama_call_nvml_init(void * fn) {
+	return ((ollama_nvml_init_fn) fn)();
+}
+
+static int ollama_call_nvml_shutdown(void * fn) {
+	return ((ollama_nvml_shutdown_fn) fn)();
+}
+
+static int ollama_call_nvml_system_get_driver_version(void * fn, char * version, unsigned int len) {
+	return ((ollama_nvml_system_get_driver_version_fn) fn)(version, len);
+}
+
+*/
+import "C"
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"log/slog"
+	"os"
+	"strings"
+	"unsafe"
+)
+
+// CUDA driver API constants, mirroring CUresult and CUdevice_attribute in
+// cuda.h.
+const (
+	cuSuccess                               = 0
+	cuDeviceAttributeComputeCapabilityMajor = 75
+	cuDeviceAttributeComputeCapabilityMinor = 76
+	cuDeviceAttributeIntegrated             = 18
+)
+
+type dlHandle struct {
+	ptr unsafe.Pointer
+}
+
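+// runPlatformNativeProbe enumerates devices through the GGML backends, then
+// supplements them with CUDA driver and ROCm sysfs data when the matching
+// backend is present in libDirs. An error is returned only when no devices
+// were found: the GGML error takes precedence, then ROCm, then CUDA. ctx is
+// checked once, before probing starts.
+func runPlatformNativeProbe(ctx context.Context, libDirs []string) ([]nativeProbeDevice, error) {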
+	select {
+	case <-ctx.Done():
+		return nil, ctx.Err()
+	default:
+	}
+
+	ggmlDevices, ggmlErr := probeGGMLDevicesLinux(libDirs)
+	var cudaDevices []nativeProbeDevice
+	var cudaErr error
+	if nativeProbeHasCUDA(libDirs) {
+		cudaDevices, cudaErr = probeCUDADriverLinux()
+	}
+	var rocmDevices []nativeProbeDevice
+	var rocmErr error
+	if nativeProbeHasROCm(libDirs) {
+		rocmDevices, rocmErr = probeROCmSysfsLinux()
+	}
+
+	devices := mergeNativeProbeDevices(mergeNativeProbeDevices(ggmlDevices, cudaDevices), rocmDevices)
+	if len(devices) > 0 {
+		return devices, nil
+	}
+
+	if ggmlErr != nil {
+		return nil, ggmlErr
+	}
+	if rocmErr != nil {
+		return nil, rocmErr
+	}
+	return nil, cudaErr
+}
+
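+// probeGGMLDevicesLinux loads libggml-base and libggml from libDirs[0], then
+// loads every GPU backend found in libDirs and lists its devices. Devices
+// that report zero total memory are skipped. The libraries stay loaded for
+// the lifetime of the process; nothing here calls dlclose.
+func probeGGMLDevicesLinux(libDirs []string) ([]nativeProbeDevice, error) {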
+	if len(libDirs) == 0 {
+		return nil, errors.New("no library directories provided")
+	}
+
+	baseDir := libDirs[0]
+	if baseDir == "" {
+		return nil, errors.New("empty GGML library directory")
+	}
+
+	base, err := dlopen(ggmlLibraryFile(baseDir, "ggml-base"), true)
+	if err != nil {
+		return nil, err
+	}
+
+	ggml, err := dlopen(ggmlLibraryFile(baseDir, "ggml"), true)
+	if err != nil {
+		return nil, err
+	}
+
+	backendLoad, err := dlsym(ggml, "ggml_backend_load")
+	if err != nil {
+		return nil, err
+	}
+	regDevCount, err := dlsym(base, "ggml_backend_reg_dev_count")
+	if err != nil {
+		return nil, err
+	}
+	regDevGet, err := dlsym(base, "ggml_backend_reg_dev_get")
+	if err != nil {
+		return nil, err
+	}
+	regName, err := dlsym(base, "ggml_backend_reg_name")
+	if err != nil {
+		return nil, err
+	}
+	devGetProps, err := dlsym(base, "ggml_backend_dev_get_props")
+	if err != nil {
+		return nil, err
+	}
+
+	var devices []nativeProbeDevice
+	for _, backendPath := range nativeProbeBackendFiles(libDirs) {
+		reg := callGGMLBackendLoad(backendLoad, backendPath)
+		if reg == nil {
+			continue
+		}
+
+		library := ggmlProbeLibraryName(callGGMLRegName(regName, reg))
+		count := int(callGGMLRegDevCount(regDevCount, reg))
+		for i := range count {
+			dev := callGGMLRegDevGet(regDevGet, reg, i)
+			if dev == nil {
+				continue
+			}
+			props := callGGMLDeviceProps(devGetProps, dev)
+			if props.MemoryTotal == 0 {
+				continue
+			}
+			devices = append(devices, nativeProbeDevice{
+				Library:             library,
+				Index:               i,
+				IndexMatchesBackend: true,
+				Name:                cString(props.Name),
+				Description:         cString(props.Description),
+				DeviceID:            cString(props.DeviceID),
+				Integrated:          ggmlDeviceTypeIntegrated(props.Type),
+				IntegratedKnown: props.Type == ggmlBackendDeviceTypeGPU ||
+					props.Type == ggmlBackendDeviceTypeIGPU,
+				TotalMemory: uint64(props.MemoryTotal),
+				FreeMemory:  uint64(props.MemoryFree),
+			})
+			slog.Debug("GGML GPU device type", "library", library, "index", i, "ggml_type", props.Type, "integrated", ggmlDeviceTypeIntegrated(props.Type))
+		}
+	}
+
+	return devices, nil
+}
+
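+// probeCUDADriverLinux queries libcuda directly for each device's compute
+// capability, PCI bus ID, total memory, and the CUDA driver version.
+// cuDriverGetVersion encodes the version as 1000*major + 10*minor, so 12040
+// means CUDA 12.4. cuDeviceGetPCIBusId is treated as optional; without it
+// DeviceID stays empty. The NVIDIA driver major version comes from NVML and
+// is left at 0 when NVML is unavailable.
+func probeCUDADriverLinux() ([]nativeProbeDevice, error) {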
+	cuda, err := dlopenFirst([]string{"libcuda.so.1", "libcuda.so"}, false)
+	if err != nil {
+		return nil, err
+	}
+
+	cuInit, err := dlsym(cuda, "cuInit")
+	if err != nil {
+		return nil, err
+	}
+	cuDriverGetVersion, err := dlsym(cuda, "cuDriverGetVersion")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetCount, err := dlsym(cuda, "cuDeviceGetCount")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGet, err := dlsym(cuda, "cuDeviceGet")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetAttribute, err := dlsym(cuda, "cuDeviceGetAttribute")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetName, err := dlsym(cuda, "cuDeviceGetName")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceTotalMem, err := dlsymAny(cuda, "cuDeviceTotalMem_v2", "cuDeviceTotalMem")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetPCIBusID, _ := dlsym(cuda, "cuDeviceGetPCIBusId")
+
+	if ret := C.ollama_call_cu_init(cuInit); ret != cuSuccess {
+		return nil, fmt.Errorf("cuInit failed: %d", int(ret))
+	}
+
+	var driverVersion C.int
+	driverMajor, driverMinor := 0, 0
+	if ret := C.ollama_call_cu_driver_get_version(cuDriverGetVersion, &driverVersion); ret == cuSuccess {
+		version := int(driverVersion)
+		driverMajor = version / 1000
+		driverMinor = (version - driverMajor*1000) / 10
+	}
+
+	nvidiaDriverMajor := 0
+	if driver, err := probeNVIDIADriverMajorLinux(); err == nil {
+		nvidiaDriverMajor = driver
+	}
+
+	var count C.int
+	if ret := C.ollama_call_cu_device_get_count(cuDeviceGetCount, &count); ret != cuSuccess {
+		return nil, fmt.Errorf("cuDeviceGetCount failed: %d", int(ret))
+	}
+
+	deviceCount := int(count)
+	devices := make([]nativeProbeDevice, 0, deviceCount)
+	for i := range deviceCount {
+		var device C.int
+		if ret := C.ollama_call_cu_device_get(cuDeviceGet, &device, C.int(i)); ret != cuSuccess {
+			continue
+		}
+
+		major := cudaDeviceAttribute(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMajor, device)
+		minor := cudaDeviceAttribute(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMinor, device)
+		integrated := cudaDeviceAttribute(cuDeviceGetAttribute, cuDeviceAttributeIntegrated, device) == 1
+
+		var name [128]C.char
+		_ = C.ollama_call_cu_device_get_name(cuDeviceGetName, &name[0], C.int(len(name)), device)
+
+		var total C.size_t
+		_ = C.ollama_call_cu_device_total_mem(cuDeviceTotalMem, &total, device)
+
+		pci := ""
+		if cuDeviceGetPCIBusID != nil {
+			var pciBuf [32]C.char
+			if ret := C.ollama_call_cu_device_get_pci_bus_id(cuDeviceGetPCIBusID, &pciBuf[0], C.int(len(pciBuf)), device); ret == cuSuccess {
+				pci = strings.ToLower(C.GoString(&pciBuf[0]))
+			}
+		}
+
+		devices = append(devices, nativeProbeDevice{
+			Library:             "CUDA",
+			Index:               i,
+			IndexMatchesBackend: true,
+			Description:         C.GoString(&name[0]),
+			DeviceID:            pci,
+			Integrated:          integrated,
+			IntegratedKnown:     true,
+			TotalMemory:         uint64(total),
+			ComputeMajor:        major,
+			ComputeMinor:        minor,
+			CUDADriverMajor:     driverMajor,
+			CUDADriverMinor:     driverMinor,
+			NVIDIADriverMajor:   nvidiaDriverMajor,
+		})
+	}
+
+	return devices, nil
+}
+
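+// probeROCmSysfsLinux lists AMD GPUs from sysfs and reports their PCI ID, gfx
+// target, and integrated status. If hsaOverrideGFXTarget returns a target, it
+// replaces the gfx target of every device.
+func probeROCmSysfsLinux() ([]nativeProbeDevice, error) {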
+	sysfsDevices, err := readROCmLinuxSysfsDevices("/sys")
+	if err != nil {
+		return nil, err
+	}
+
+	override := hsaOverrideGFXTarget()
+	// Sysfs stays in physical KFD order; ROCm visibility envs can reindex the
+	// backend device list, so filtered sysfs data must merge by PCI ID only.
+	backendIndex := !rocmVisibleDevicesEnvSet()
+	devices := make([]nativeProbeDevice, 0, len(sysfsDevices))
+	for i, sysfsDevice := range sysfsDevices {
+		gfxTarget := sysfsDevice.gfxTarget
+		if override != "" {
+			gfxTarget = override
+		}
+		devices = append(devices, nativeProbeDevice{
+			Library:             "ROCm",
+			Index:               i,
+			IndexMatchesBackend: backendIndex,
+			DeviceID:            sysfsDevice.pciID,
+			Integrated:          sysfsDevice.integrated,
+			IntegratedKnown:     sysfsDevice.known,
+			GFXTarget:           gfxTarget,
+		})
+	}
+	return devices, nil
+}
+
+func rocmVisibleDevicesEnvSet() bool {
+	for _, name := range []string{"HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL"} {
+		if os.Getenv(name) != "" {
+			return true
+		}
+	}
+	return false
+}
+
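+// probeNVIDIADriverMajorLinux returns the major version of the installed
+// NVIDIA driver as reported by NVML.
+func probeNVIDIADriverMajorLinux() (int, error) {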
+	nvml, err := dlopenFirst([]string{"libnvidia-ml.so.1", "libnvidia-ml.so"}, false)
+	if err != nil {
+		return 0, err
+	}
+	initFn, err := dlsym(nvml, "nvmlInit_v2")
+	if err != nil {
+		return 0, err
+	}
+	shutdownFn, err := dlsym(nvml, "nvmlShutdown")
+	if err != nil {
+		return 0, err
+	}
+	driverFn, err := dlsym(nvml, "nvmlSystemGetDriverVersion")
+	if err != nil {
+		return 0, err
+	}
+	if ret := C.ollama_call_nvml_init(initFn); ret != 0 {
+		return 0, fmt.Errorf("nvmlInit_v2 failed: %d", int(ret))
+	}
+	defer C.ollama_call_nvml_shutdown(shutdownFn)
+
+	var version [80]C.char
+	if ret := C.ollama_call_nvml_system_get_driver_version(driverFn, &version[0], C.uint(len(version))); ret != 0 {
+		return 0, fmt.Errorf("nvmlSystemGetDriverVersion failed: %d", int(ret))
+	}
+	return parseNVIDIADriverMajor(C.GoString(&version[0]))
+}
+
+func cudaDeviceAttribute(fn unsafe.Pointer, attr int, device C.int) int {
+	var value C.int
+	if ret := C.ollama_call_cu_device_get_attribute(fn, &value, C.int(attr), device); ret != cuSuccess {
+		return 0
+	}
+	return int(value)
+}
+
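+// dlopenFirst tries each name in order and returns the first library that
+// loads. If none load, the error lists every failed attempt.
+func dlopenFirst(names []string, global bool) (dlHandle, error) {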
+	var errs []string
+	for _, name := range names {
+		handle, err := dlopen(name, global)
+		if err == nil {
+			return handle, nil
+		}
+		errs = append(errs, err.Error())
+	}
+	return dlHandle{}, errors.New(strings.Join(errs, "; "))
+}
+
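+// dlopen loads a shared library with RTLD_NOW, using RTLD_GLOBAL when global
+// is set and RTLD_LOCAL otherwise. The returned handle is never closed.
+func dlopen(path string, global bool) (dlHandle, error) {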
+	cpath := C.CString(path)
+	defer C.free(unsafe.Pointer(cpath))
+
+	handle := C.ollama_dlopen(cpath, boolToCInt(global))
+	if handle == nil {
+		return dlHandle{}, fmt.Errorf("dlopen %s: %s", path, C.GoString(C.ollama_dlerror()))
+	}
+	return dlHandle{ptr: handle}, nil
+}
+
+func dlsym(handle dlHandle, name string) (unsafe.Pointer, error) {
+	cname := C.CString(name)
+	defer C.free(unsafe.Pointer(cname))
+
+	sym := C.ollama_dlsym(handle.ptr, cname)
+	if sym == nil {
+		return nil, fmt.Errorf("dlsym %s: %s", name, C.GoString(C.ollama_dlerror()))
+	}
+	return sym, nil
+}
+
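+// dlsymAny returns the first of names that resolves in handle, which lets
+// callers prefer a versioned symbol and fall back to the legacy one.
+func dlsymAny(handle dlHandle, names ...string) (unsafe.Pointer, error) {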
+	var errs []string
+	for _, name := range names {
+		sym, err := dlsym(handle, name)
+		if err == nil {
+			return sym, nil
+		}
+		errs = append(errs, err.Error())
+	}
+	return nil, errors.New(strings.Join(errs, "; "))
+}
+
+func callGGMLBackendLoad(fn unsafe.Pointer, path string) unsafe.Pointer {
+	cpath := C.CString(path)
+	defer C.free(unsafe.Pointer(cpath))
+	return C.ollama_call_ggml_backend_load(fn, cpath)
+}
+
+func callGGMLRegDevCount(fn unsafe.Pointer, reg unsafe.Pointer) uintptr {
+	return uintptr(C.ollama_call_ggml_backend_reg_dev_count(fn, reg))
+}
+
+func callGGMLRegDevGet(fn unsafe.Pointer, reg unsafe.Pointer, index int) unsafe.Pointer {
+	return C.ollama_call_ggml_backend_reg_dev_get(fn, reg, C.size_t(index))
+}
+
+func callGGMLRegName(fn unsafe.Pointer, reg unsafe.Pointer) string {
+	return C.GoString(C.ollama_call_ggml_backend_reg_name(fn, reg))
+}
+
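+// callGGMLDeviceProps has ggml_backend_dev_get_props write into a Go mirror of
+// the C struct. TestGGMLBackendDevPropsLayout pins the layout this relies on.
+func callGGMLDeviceProps(fn unsafe.Pointer, dev unsafe.Pointer) ggmlBackendDevProps {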
+	var props ggmlBackendDevProps
+	C.ollama_call_ggml_backend_dev_get_props(fn, dev, unsafe.Pointer(&props))
+	return props
+}
+
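+// cString converts a C string held as a uintptr (as in ggmlBackendDevProps)
+// to a Go string, treating a null pointer as the empty string.
+func cString(ptr uintptr) string {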
+	if ptr == 0 {
+		return ""
+	}
+	return C.GoString(C.ollama_cstr_from_uintptr(C.uintptr_t(ptr)))
+}
+
+func boolToCInt(v bool) C.int {
+	if v {
+		return 1
+	}
+	return 0
+}
--- a/discover/native_probe_linux_nocgo.go
+++ b/discover/native_probe_linux_nocgo.go
@ -0,0 +1,12 @@
+//go:build linux && !cgo
+
+package discover
+
+import (
+	"context"
+	"errors"
+)
+
+func runPlatformNativeProbe(context.Context, []string) ([]nativeProbeDevice, error) {
+	return nil, errors.New("native GPU discovery requires cgo on Linux")
+}
--- a/discover/native_probe_platform.go
+++ b/discover/native_probe_platform.go
@ -0,0 +1,132 @@
+//go:build (linux && cgo) || windows
+
+package discover
+
+import (
+	"errors"
+	"fmt"
+	"os"
+	"path/filepath"
+	"runtime"
+	"sort"
+	"strconv"
+	"strings"
+)
+
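+// GGML device type values used to classify devices, mirroring enum
+// ggml_backend_dev_type in ggml-backend.h.
+const (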
+	ggmlBackendDeviceTypeGPU  = 1
+	ggmlBackendDeviceTypeIGPU = 2
+)
+
+func ggmlDeviceTypeIntegrated(deviceType int32) bool {
+	return deviceType == ggmlBackendDeviceTypeIGPU
+}
+
+// ggmlProbeLibraryName maps a GGML backend registry name to the canonical
+// library name. It shares normalizeNativeProbeLibrary's mapping so the two
+// cannot drift apart.
+func ggmlProbeLibraryName(name string) string {
+	return normalizeNativeProbeLibrary(name)
+}
+
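+// ggmlLibraryFile returns the path of the named GGML shared library in dir.
+// On Linux it prefers the unversioned lib<name>.so and otherwise falls back
+// to the lexically greatest lib<name>.so.* match. That is a string
+// comparison, not a numeric version comparison.
+func ggmlLibraryFile(dir, name string) string {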
+	if runtime.GOOS == "windows" {
+		return filepath.Join(dir, name+".dll")
+	}
+
+	exact := filepath.Join(dir, "lib"+name+".so")
+	if _, err := os.Stat(exact); err == nil {
+		return exact
+	}
+	matches, _ := filepath.Glob(exact + ".*")
+	if len(matches) > 0 {
+		sort.Strings(matches)
+		return matches[len(matches)-1]
+	}
+	return exact
+}
+
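+// nativeProbeBackendFiles returns the GPU backend libraries present in
+// libDirs, in directory order and without duplicates.
+func nativeProbeBackendFiles(libDirs []string) []string {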
+	var files []string
+	seen := map[string]bool{}
+	for _, dir := range libDirs {
+		for _, pattern := range nativeProbeBackendPatterns(dir) {
+			matches, _ := filepath.Glob(pattern)
+			for _, match := range matches {
+				if seen[match] {
+					continue
+				}
+				seen[match] = true
+				files = append(files, match)
+			}
+		}
+	}
+	return files
+}
+
+func nativeProbeBackendPatterns(dir string) []string {
+	if runtime.GOOS == "windows" {
+		return []string{
+			filepath.Join(dir, "ggml-cuda.dll"),
+			filepath.Join(dir, "ggml-hip.dll"),
+			filepath.Join(dir, "ggml-vulkan.dll"),
+		}
+	}
+
+	return []string{
+		filepath.Join(dir, "libggml-cuda.so"),
+		filepath.Join(dir, "libggml-hip.so"),
+		filepath.Join(dir, "libggml-vulkan.so"),
+	}
+}
+
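+// nativeProbeHasCUDA reports whether libDirs includes a CUDA backend, judged
+// by a directory name containing "cuda" or a CUDA backend library on disk.
+func nativeProbeHasCUDA(libDirs []string) bool {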
+	for _, dir := range libDirs {
+		if strings.Contains(strings.ToLower(filepath.Base(dir)), "cuda") {
+			return true
+		}
+	}
+	for _, file := range nativeProbeBackendFiles(libDirs) {
+		if strings.Contains(strings.ToLower(filepath.Base(file)), "cuda") {
+			return true
+		}
+	}
+	return false
+}
+
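+// nativeProbeHasROCm reports whether libDirs includes a ROCm backend, judged
+// by a directory name containing "rocm" or "hip" or a HIP backend library on
+// disk.
+func nativeProbeHasROCm(libDirs []string) bool {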
+	for _, dir := range libDirs {
+		base := strings.ToLower(filepath.Base(dir))
+		if strings.Contains(base, "rocm") || strings.Contains(base, "hip") {
+			return true
+		}
+	}
+	for _, file := range nativeProbeBackendFiles(libDirs) {
+		base := strings.ToLower(filepath.Base(file))
+		if strings.Contains(base, "hip") {
+			return true
+		}
+	}
+	return false
+}
+
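+// parseNVIDIADriverMajor extracts the leading major component from a
+// dot-separated NVIDIA driver version string (for example, 570 from
+// "570.86.10").
+func parseNVIDIADriverMajor(version string) (int, error) {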
+	version = strings.TrimSpace(version)
+	if version == "" {
+		return 0, errors.New("empty NVIDIA driver version")
+	}
+	major, _, _ := strings.Cut(version, ".")
+	driver, err := strconv.Atoi(major)
+	if err != nil {
+		return 0, fmt.Errorf("parse NVIDIA driver version %q: %w", version, err)
+	}
+	return driver, nil
+}
--- a/discover/native_probe_stub.go
+++ b/discover/native_probe_stub.go
@ -0,0 +1,12 @@
+//go:build !linux && !windows
+
+package discover
+
+import (
+	"context"
+	"errors"
+)
+
+func runPlatformNativeProbe(context.Context, []string) ([]nativeProbeDevice, error) {
+	return nil, errors.New("native GPU discovery is not implemented on this platform")
+}
--- a/discover/native_probe_test.go
+++ b/discover/native_probe_test.go
@ -0,0 +1,203 @@
+package discover
+
+import (
+	"testing"
+	"unsafe"
+
+	"github.com/ollama/ollama/ml"
+)
+
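+// TestGGMLBackendDevPropsLayout guards the Go mirror of the GGML device
+// properties struct against layout drift, since native code writes into it
+// directly.
+func TestGGMLBackendDevPropsLayout(t *testing.T) {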
+	if unsafe.Sizeof(uintptr(0)) != 8 {
+		t.Skip("GGML probe layout assertions are for 64-bit builds")
+	}
+
+	var props ggmlBackendDevProps
+	if got, want := unsafe.Sizeof(props), uintptr(56); got != want {
+		t.Fatalf("ggmlBackendDevProps size = %d, want %d", got, want)
+	}
+	checks := []struct {
+		name string
+		got  uintptr
+		want uintptr
+	}{
+		{"Name", unsafe.Offsetof(props.Name), 0},
+		{"Description", unsafe.Offsetof(props.Description), 8},
+		{"MemoryFree", unsafe.Offsetof(props.MemoryFree), 16},
+		{"MemoryTotal", unsafe.Offsetof(props.MemoryTotal), 24},
+		{"Type", unsafe.Offsetof(props.Type), 32},
+		{"DeviceID", unsafe.Offsetof(props.DeviceID), 40},
+		{"Caps", unsafe.Offsetof(props.Caps), 48},
+	}
+	for _, tt := range checks {
+		t.Run(tt.name, func(t *testing.T) {
+			if tt.got != tt.want {
+				t.Fatalf("offset = %d, want %d", tt.got, tt.want)
+			}
+		})
+	}
+	if got, want := unsafe.Sizeof(ggmlBackendDevCaps{}), uintptr(4); got != want {
+		t.Fatalf("ggmlBackendDevCaps size = %d, want %d", got, want)
+	}
+}
+
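+// A device whose native compute capability (6.1) is missing from the ARCHS
+// list in the llama-server output is filtered out. Once 610 is listed, the
+// device is kept and enriched with the native PCI ID and driver versions.
+func TestParseLlamaServerDevicesUsesNativeCUDAComputeCapability(t *testing.T) {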
+	output := `system_info: n_threads = 4 | CUDA : ARCHS = 750,800 |
+Available devices:
+  CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
+`
+	devices := parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/cuda_v13"}, []nativeProbeDevice{{
+		Library:             "CUDA",
+		Index:               0,
+		IndexMatchesBackend: true,
+		DeviceID:            "0000:01:00.0",
+		ComputeMajor:        6,
+		ComputeMinor:        1,
+		CUDADriverMajor:     13,
+		NVIDIADriverMajor:   570,
+	}})
+	if len(devices) != 0 {
+		t.Fatalf("got %d devices, want unsupported CUDA device filtered", len(devices))
+	}
+
+	output = `system_info: n_threads = 4 | CUDA : ARCHS = 610,750,800 |
+Available devices:
+  CUDA0: NVIDIA GeForce GTX 1060 6GB (6063 MiB, 5900 MiB free)
+`
+	devices = parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/cuda_v12"}, []nativeProbeDevice{{
+		Library:             "CUDA",
+		Index:               0,
+		IndexMatchesBackend: true,
+		DeviceID:            "0000:01:00.0",
+		ComputeMajor:        6,
+		ComputeMinor:        1,
+		CUDADriverMajor:     12,
+		NVIDIADriverMajor:   570,
+	}})
+	if len(devices) != 1 {
+		t.Fatalf("got %d devices, want 1", len(devices))
+	}
+	got := devices[0]
+	if got.Compute() != "6.1" {
+		t.Fatalf("compute = %q, want 6.1", got.Compute())
+	}
+	if got.PCIID != "0000:01:00.0" {
+		t.Fatalf("PCIID = %q, want 0000:01:00.0", got.PCIID)
+	}
+	if got.Driver() != "12.0" {
+		t.Fatalf("driver = %q, want 12.0", got.Driver())
+	}
+	if got.NVIDIADriverMajor != 570 {
+		t.Fatalf("NVIDIADriverMajor = %d, want 570", got.NVIDIADriverMajor)
+	}
+}
+
+func TestParseLlamaServerDevicesUsesNativeROCmMetadata(t *testing.T) {
+	output := `ggml_vulkan: 0 = AMD Radeon RX 7600 | uma: 1 | fp16: 1 |
+Available devices:
+  ROCm0: AMD Radeon RX 7600 (8176 MiB, 7900 MiB free)
+`
+	devices := parseLlamaServerDevicesWithNative(output, []string{"/lib/ollama", "/lib/ollama/rocm_v7_2"}, []nativeProbeDevice{{
+		Library:             "ROCm",
+		Index:               0,
+		IndexMatchesBackend: true,
+		DeviceID:            "0000:03:00.0",
+		GFXTarget:           "gfx1102",
+		Integrated:          false,
+		IntegratedKnown:     true,
+	}})
+	if len(devices) != 1 {
+		t.Fatalf("got %d devices, want 1", len(devices))
+	}
+	got := devices[0]
+	if got.PCIID != "0000:03:00.0" {
+		t.Fatalf("PCIID = %q, want 0000:03:00.0", got.PCIID)
+	}
+	if got.GFXTarget != "gfx1102" {
+		t.Fatalf("GFXTarget = %q, want gfx1102", got.GFXTarget)
+	}
+	if got.Compute() != "gfx1102" {
+		t.Fatalf("compute = %q, want gfx1102", got.Compute())
+	}
+	if got.Integrated {
+		t.Fatalf("Integrated = true, want false")
+	}
+}
+
+func TestNVIDIADriverMajorFromDevices(t *testing.T) {
+	devices := []ml.DeviceInfo{
+		{DeviceID: ml.DeviceID{Library: "CUDA"}, NVIDIADriverMajor: 565},
+	}
+	if got := nvidiaDriverMajorFromDevices(devices); got != 565 {
+		t.Fatalf("driver = %d, want 565", got)
+	}
+}
+
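+// Supplement entries whose index is not backend-aligned must never match by
+// index, but they can still match by PCI ID.
+func TestMergeNativeProbeDevicesAvoidsUnreliableIndexMatch(t *testing.T) {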
+	tests := []struct {
+		name       string
+		base       []nativeProbeDevice
+		supplement []nativeProbeDevice
+		wantLen    int
+		wantPCI    string
+		wantKnown  bool
+		wantIGPU   bool
+	}{
+		{
+			name: "filtered sysfs cannot overwrite a different backend device by index",
+			base: []nativeProbeDevice{{
+				Library:             "ROCm",
+				Index:               0,
+				IndexMatchesBackend: true,
+				DeviceID:            "0000:03:00.0",
+			}},
+			supplement: []nativeProbeDevice{{
+				Library:         "ROCm",
+				Index:           0,
+				DeviceID:        "0000:04:00.0",
+				Integrated:      true,
+				IntegratedKnown: true,
+			}},
+			wantLen: 1,
+			wantPCI: "0000:03:00.0",
+		},
+		{
+			name: "filtered sysfs can still merge by PCI ID",
+			base: []nativeProbeDevice{{
+				Library:             "ROCm",
+				Index:               0,
+				IndexMatchesBackend: true,
+				DeviceID:            "0000:04:00.0",
+			}},
+			supplement: []nativeProbeDevice{{
+				Library:         "ROCm",
+				Index:           1,
+				DeviceID:        "0000:04:00.0",
+				Integrated:      true,
+				IntegratedKnown: true,
+			}},
+			wantLen:   1,
+			wantPCI:   "0000:04:00.0",
+			wantKnown: true,
+			wantIGPU:  true,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			got := mergeNativeProbeDevices(tt.base, tt.supplement)
+			if len(got) != tt.wantLen {
+				t.Fatalf("got %d devices, want %d: %#v", len(got), tt.wantLen, got)
+			}
+			if got[0].DeviceID != tt.wantPCI {
+				t.Fatalf("DeviceID = %q, want %q", got[0].DeviceID, tt.wantPCI)
+			}
+			if got[0].IntegratedKnown != tt.wantKnown {
+				t.Fatalf("IntegratedKnown = %v, want %v", got[0].IntegratedKnown, tt.wantKnown)
+			}
+			if got[0].Integrated != tt.wantIGPU {
+				t.Fatalf("Integrated = %v, want %v", got[0].Integrated, tt.wantIGPU)
+			}
+		})
+	}
+}
--- a/discover/native_probe_windows.go
+++ b/discover/native_probe_windows.go
@ -0,0 +1,466 @@
+//go:build windows
+
+package discover
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"log/slog"
+	"os"
+	"path/filepath"
+	"strings"
+	"unsafe"
+
+	"github.com/ollama/ollama/llm"
+	"golang.org/x/sys/windows"
+)
+
+const (
+	cuSuccessWindows                         = 0
+	cuDeviceAttributeComputeCapabilityMajorW = 75
+	cuDeviceAttributeComputeCapabilityMinorW = 76
+	cuDeviceAttributeIntegratedW             = 18
+	hipSuccessWindows                        = 0
+	hipDeviceAttributeIntegratedWindows      = 16
+)
+
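+// runPlatformNativeProbe enumerates devices through the GGML backends, then
+// supplements them with CUDA driver and HIP runtime data when the matching
+// backend is present in libDirs. An error is returned only when no devices
+// were found: the GGML error takes precedence, then ROCm, then CUDA. ctx is
+// checked once, before probing starts.
+func runPlatformNativeProbe(ctx context.Context, libDirs []string) ([]nativeProbeDevice, error) {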
+	select {
+	case <-ctx.Done():
+		return nil, ctx.Err()
+	default:
+	}
+
+	ggmlDevices, ggmlErr := probeGGMLDevicesWindows(libDirs)
+	var cudaDevices []nativeProbeDevice
+	var cudaErr error
+	if nativeProbeHasCUDA(libDirs) {
+		cudaDevices, cudaErr = probeCUDADriverWindows()
+	}
+	var rocmDevices []nativeProbeDevice
+	var rocmErr error
+	if nativeProbeHasROCm(libDirs) {
+		rocmDevices, rocmErr = probeHIPRuntimeWindows(libDirs)
+	}
+
+	devices := mergeNativeProbeDevices(mergeNativeProbeDevices(ggmlDevices, cudaDevices), rocmDevices)
+	if len(devices) > 0 {
+		return devices, nil
+	}
+
+	if ggmlErr != nil {
+		return nil, ggmlErr
+	}
+	if rocmErr != nil {
+		return nil, rocmErr
+	}
+	return nil, cudaErr
+}
+
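+// probeGGMLDevicesWindows loads ggml-base.dll and ggml.dll from libDirs[0],
+// then loads every GPU backend found in libDirs and lists its devices.
+// Devices that report zero total memory are skipped.
+func probeGGMLDevicesWindows(libDirs []string) ([]nativeProbeDevice, error) {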
+	if len(libDirs) == 0 || libDirs[0] == "" {
+		return nil, errors.New("empty GGML library directory")
+	}
+
+	base, err := loadDLLFromPath(ggmlLibraryFile(libDirs[0], "ggml-base"))
+	if err != nil {
+		return nil, err
+	}
+	ggml, err := loadDLLFromPath(ggmlLibraryFile(libDirs[0], "ggml"))
+	if err != nil {
+		return nil, err
+	}
+
+	backendLoad, err := findProc(ggml, "ggml_backend_load")
+	if err != nil {
+		return nil, err
+	}
+	regDevCount, err := findProc(base, "ggml_backend_reg_dev_count")
+	if err != nil {
+		return nil, err
+	}
+	regDevGet, err := findProc(base, "ggml_backend_reg_dev_get")
+	if err != nil {
+		return nil, err
+	}
+	regName, err := findProc(base, "ggml_backend_reg_name")
+	if err != nil {
+		return nil, err
+	}
+	devGetProps, err := findProc(base, "ggml_backend_dev_get_props")
+	if err != nil {
+		return nil, err
+	}
+
+	var devices []nativeProbeDevice
+	for _, backendPath := range nativeProbeBackendFiles(libDirs) {
+		cpath, err := windows.BytePtrFromString(backendPath)
+		if err != nil {
+			return nil, err
+		}
+		reg, _, _ := backendLoad.Call(uintptr(unsafe.Pointer(cpath)))
+		if reg == 0 {
+			continue
+		}
+
+		regNamePtr, _, _ := regName.Call(reg)
+		library := ggmlProbeLibraryName(windowsCString(regNamePtr))
+		count, _, _ := regDevCount.Call(reg)
+		for i := uintptr(0); i < count; i++ {
+			dev, _, _ := regDevGet.Call(reg, i)
+			if dev == 0 {
+				continue
+			}
+			var props ggmlBackendDevProps
+			devGetProps.Call(dev, uintptr(unsafe.Pointer(&props)))
+			if props.MemoryTotal == 0 {
+				continue
+			}
+			devices = append(devices, nativeProbeDevice{
+				Library:             library,
+				Index:               int(i),
+				IndexMatchesBackend: true,
+				Name:                windowsCString(props.Name),
+				Description:         windowsCString(props.Description),
+				DeviceID:            windowsCString(props.DeviceID),
+				Integrated:          ggmlDeviceTypeIntegrated(props.Type),
+				IntegratedKnown: props.Type == ggmlBackendDeviceTypeGPU ||
+					props.Type == ggmlBackendDeviceTypeIGPU,
+				TotalMemory: uint64(props.MemoryTotal),
+				FreeMemory:  uint64(props.MemoryFree),
+			})
+			slog.Debug("GGML GPU device type", "library", library, "index", i, "ggml_type", props.Type, "integrated", ggmlDeviceTypeIntegrated(props.Type))
+		}
+	}
+
+	return devices, nil
+}
+
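+// probeCUDADriverWindows queries nvcuda.dll directly for each device's compute
+// capability, PCI bus ID, total memory, and the CUDA driver version.
+// cuDriverGetVersion encodes the version as 1000*major + 10*minor, so 12040
+// means CUDA 12.4.
+func probeCUDADriverWindows() ([]nativeProbeDevice, error) {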
+	cuda, err := loadDLLFromSystem32("nvcuda.dll")
+	if err != nil {
+		return nil, err
+	}
+	cuInit, err := findProc(cuda, "cuInit")
+	if err != nil {
+		return nil, err
+	}
+	cuDriverGetVersion, err := findProc(cuda, "cuDriverGetVersion")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetCount, err := findProc(cuda, "cuDeviceGetCount")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGet, err := findProc(cuda, "cuDeviceGet")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetAttribute, err := findProc(cuda, "cuDeviceGetAttribute")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetName, err := findProc(cuda, "cuDeviceGetName")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceTotalMem, err := procAny(cuda, "cuDeviceTotalMem_v2", "cuDeviceTotalMem")
+	if err != nil {
+		return nil, err
+	}
+	cuDeviceGetPCIBusID, _ := findProc(cuda, "cuDeviceGetPCIBusId")
+
+	if ret, _, _ := cuInit.Call(0); ret != cuSuccessWindows {
+		return nil, fmt.Errorf("cuInit failed: %d", ret)
+	}
+
+	driverMajor, driverMinor := 0, 0
+	var driverVersion int32
+	if ret, _, _ := cuDriverGetVersion.Call(uintptr(unsafe.Pointer(&driverVersion))); ret == cuSuccessWindows {
+		version := int(driverVersion)
+		driverMajor = version / 1000
+		driverMinor = (version - driverMajor*1000) / 10
+	}
+
+	nvidiaDriverMajor := 0
+	if driver, err := probeNVIDIADriverMajorWindows(); err == nil {
+		nvidiaDriverMajor = driver
+	}
+
+	var count int32
+	if ret, _, _ := cuDeviceGetCount.Call(uintptr(unsafe.Pointer(&count))); ret != cuSuccessWindows {
+		return nil, fmt.Errorf("cuDeviceGetCount failed: %d", ret)
+	}
+
+	devices := make([]nativeProbeDevice, 0, int(count))
+	for i := range int(count) {
+		var device int32
+		if ret, _, _ := cuDeviceGet.Call(uintptr(unsafe.Pointer(&device)), uintptr(i)); ret != cuSuccessWindows {
+			continue
+		}
+
+		major := cudaDeviceAttributeWindows(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMajorW, device)
+		minor := cudaDeviceAttributeWindows(cuDeviceGetAttribute, cuDeviceAttributeComputeCapabilityMinorW, device)
+		integrated := cudaDeviceAttributeWindows(cuDeviceGetAttribute, cuDeviceAttributeIntegratedW, device) == 1
+
+		name := make([]byte, 128)
+		cuDeviceGetName.Call(uintptr(unsafe.Pointer(&name[0])), uintptr(len(name)), uintptr(device))
+
+		var total uintptr
+		cuDeviceTotalMem.Call(uintptr(unsafe.Pointer(&total)), uintptr(device))
+
+		pci := ""
+		if cuDeviceGetPCIBusID != nil {
+			pciBuf := make([]byte, 32)
+			if ret, _, _ := cuDeviceGetPCIBusID.Call(uintptr(unsafe.Pointer(&pciBuf[0])), uintptr(len(pciBuf)), uintptr(device)); ret == cuSuccessWindows {
+				pci = strings.ToLower(byteCString(pciBuf))
+			}
+		}
+
+		devices = append(devices, nativeProbeDevice{
+			Library:             "CUDA",
+			Index:               i,
+			IndexMatchesBackend: true,
+			Description:         byteCString(name),
+			DeviceID:            pci,
+			Integrated:          integrated,
+			IntegratedKnown:     true,
+			TotalMemory:         uint64(total),
+			ComputeMajor:        major,
+			ComputeMinor:        minor,
+			CUDADriverMajor:     driverMajor,
+			CUDADriverMinor:     driverMinor,
+			NVIDIADriverMajor:   nvidiaDriverMajor,
+		})
+	}
+
+	return devices, nil
+}
+
+func probeHIPRuntimeWindows(libDirs []string) ([]nativeProbeDevice, error) {
+	hipPath, err := llm.WindowsROCmRuntimeDLLPath(libDirs)
+	if err != nil {
+		return nil, err
+	}
+	hip, err := loadDLLFromPath(hipPath)
+	if err != nil {
+		return nil, err
+	}
+	hipGetDeviceCount, err := findProc(hip, "hipGetDeviceCount")
+	if err != nil {
+		return nil, err
+	}
+	hipDeviceGetName, err := findProc(hip, "hipDeviceGetName")
+	if err != nil {
+		return nil, err
+	}
+	hipDeviceTotalMem, err := findProc(hip, "hipDeviceTotalMem")
+	if err != nil {
+		return nil, err
+	}
+	hipDeviceGetPCIBusID, _ := findProc(hip, "hipDeviceGetPCIBusId")
+	hipDeviceGetAttribute, _ := findProc(hip, "hipDeviceGetAttribute")
+
+	var count int32
+	if ret, _, _ := hipGetDeviceCount.Call(uintptr(unsafe.Pointer(&count))); ret != hipSuccessWindows {
+		return nil, fmt.Errorf("hipGetDeviceCount failed: %d", ret)
+	}
+
+	devices := make([]nativeProbeDevice, 0, int(count))
+	for i := range int(count) {
+		name := make([]byte, 128)
+		hipDeviceGetName.Call(uintptr(unsafe.Pointer(&name[0])), uintptr(len(name)), uintptr(i))
+
+		var total uintptr
+		hipDeviceTotalMem.Call(uintptr(unsafe.Pointer(&total)), uintptr(i))
+
+		pci := ""
+		if hipDeviceGetPCIBusID != nil {
+			pciBuf := make([]byte, 32)
+			if ret, _, _ := hipDeviceGetPCIBusID.Call(uintptr(unsafe.Pointer(&pciBuf[0])), uintptr(len(pciBuf)), uintptr(i)); ret == hipSuccessWindows {
+				pci = strings.ToLower(byteCString(pciBuf))
+			}
+		}
+
+		integrated, integratedKnown := false, false
+		if hipDeviceGetAttribute != nil {
+			integrated = hipDeviceAttributeWindows(hipDeviceGetAttribute, hipDeviceAttributeIntegratedWindows, int32(i)) == 1
+			integratedKnown = true
+		}
+
+		devices = append(devices, nativeProbeDevice{
+			Library:             "ROCm",
+			Index:               i,
+			IndexMatchesBackend: true,
+			Description:         byteCString(name),
+			DeviceID:            pci,
+			Integrated:          integrated,
+			IntegratedKnown:     integratedKnown,
+			TotalMemory:         uint64(total),
+		})
+	}
+
+	return devices, nil
+}
+
+func probeNVIDIADriverMajorWindows() (int, error) {
+	nvml, err := loadDLLFromSystem32("nvml.dll")
+	if err != nil {
+		nvml, err = loadDLLFromDirs([]string{"nvml.dll"}, nvidiaNVMLDirsWindows())
+	}
+	if err != nil {
+		return 0, err
+	}
+	initFn, err := findProc(nvml, "nvmlInit_v2")
+	if err != nil {
+		return 0, err
+	}
+	shutdownFn, err := findProc(nvml, "nvmlShutdown")
+	if err != nil {
+		return 0, err
+	}
+	driverFn, err := findProc(nvml, "nvmlSystemGetDriverVersion")
+	if err != nil {
+		return 0, err
+	}
+	if ret, _, _ := initFn.Call(); ret != 0 {
+		return 0, fmt.Errorf("nvmlInit_v2 failed: %d", ret)
+	}
+	defer shutdownFn.Call()
+
+	version := make([]byte, 80)
+	if ret, _, _ := driverFn.Call(uintptr(unsafe.Pointer(&version[0])), uintptr(len(version))); ret != 0 {
+		return 0, fmt.Errorf("nvmlSystemGetDriverVersion failed: %d", ret)
+	}
+	return parseNVIDIADriverMajor(byteCString(version))
+}
+
+func cudaDeviceAttributeWindows(fn *windows.Proc, attr int, device int32) int {
+	var value int32
+	if ret, _, _ := fn.Call(uintptr(unsafe.Pointer(&value)), uintptr(attr), uintptr(device)); ret != cuSuccessWindows {
+		return 0
+	}
+	return int(value)
+}
+
+func hipDeviceAttributeWindows(fn *windows.Proc, attr int, device int32) int {
+	var value int32
+	if ret, _, _ := fn.Call(uintptr(unsafe.Pointer(&value)), uintptr(attr), uintptr(device)); ret != hipSuccessWindows {
+		return 0
+	}
+	return int(value)
+}
+
+func findProc(dll *windows.DLL, name string) (*windows.Proc, error) {
+	return dll.FindProc(name)
+}
+
+// Use LoadLibraryEx so GPU discovery does not honor the current directory or PATH for DLL resolution.
+func loadDLLFromSystem32(name string) (*windows.DLL, error) {
+	return loadDLLWithFlags(name, windows.LOAD_LIBRARY_SEARCH_SYSTEM32)
+}
+
+func loadDLLFromPath(path string) (*windows.DLL, error) {
+	absPath, err := filepath.Abs(path)
+	if err != nil {
+		return nil, err
+	}
+	return loadDLLWithFlags(absPath, windows.LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR|windows.LOAD_LIBRARY_SEARCH_DEFAULT_DIRS)
+}
+
+func loadDLLWithFlags(name string, flags uintptr) (*windows.DLL, error) {
+	handle, err := windows.LoadLibraryEx(name, 0, flags)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load %s: %w", name, err)
+	}
+	return &windows.DLL{Name: name, Handle: handle}, nil
+}
+
+func loadDLLFromDirs(names, dirs []string) (*windows.DLL, error) {
+	var errs []string
+	for _, name := range names {
+		for _, dir := range dirs {
+			path := filepath.Join(dir, name)
+			if _, err := os.Stat(path); err != nil {
+				continue
+			}
+			dll, err := loadDLLFromPath(path)
+			if err == nil {
+				return dll, nil
+			}
+			errs = append(errs, err.Error())
+		}
+	}
+	if len(errs) == 0 {
+		return nil, fmt.Errorf("no matching DLL found: %s", strings.Join(names, ", "))
+	}
+	return nil, errors.New(strings.Join(errs, "; "))
+}
+
+func nvidiaNVMLDirsWindows() []string {
+	var dirs []string
+	for _, root := range windowsProgramFilesDirs() {
+		dirs = append(dirs, filepath.Join(root, "NVIDIA Corporation", "NVSMI"))
+	}
+	return uniqueAbsDirs(dirs)
+}
+
+func windowsProgramFilesDirs() []string {
+	return uniqueAbsDirs([]string{
+		os.Getenv("ProgramW6432"),
+		os.Getenv("ProgramFiles"),
+	})
+}
+
+func uniqueAbsDirs(dirs []string) []string {
+	seen := map[string]bool{}
+	var out []string
+	for _, dir := range dirs {
+		if dir == "" {
+			continue
+		}
+		absDir, err := filepath.Abs(dir)
+		if err != nil {
+			continue
+		}
+		absDir = filepath.Clean(absDir)
+		key := strings.ToLower(absDir)
+		if seen[key] {
+			continue
+		}
+		seen[key] = true
+		out = append(out, absDir)
+	}
+	return out
+}
+
+func procAny(dll *windows.DLL, names ...string) (*windows.Proc, error) {
+	var errs []string
+	for _, name := range names {
+		proc, err := dll.FindProc(name)
+		if err == nil {
+			return proc, nil
+		}
+		errs = append(errs, err.Error())
+	}
+	return nil, errors.New(strings.Join(errs, "; "))
+}
+
+//nolint:govet // Windows Proc.Call returns C string pointers as uintptr.
+func windowsCString(ptr uintptr) string {
+	if ptr == 0 {
+		return ""
+	}
+	return windows.BytePtrToString((*byte)(unsafe.Pointer(ptr)))
+}
+
+func byteCString(data []byte) string {
+	for i, b := range data {
+		if b == 0 {
+			return string(data[:i])
+		}
+	}
+	return string(data)
+}
--- a/discover/runner.go
+++ b/discover/runner.go
@@ -4,9 +4,7 @@ package discover

 import (
 	"context"
-	"io"
 	"log/slog"
-	"os"
 	"os/exec"
 	"path/filepath"
 	"runtime"
@@ -27,7 +25,6 @@ var (
 	deviceMu     sync.Mutex
 	devices      []ml.DeviceInfo
 	libDirs      map[string]struct{}
-	exe          string
 	bootstrapped bool
 )

@@ -43,15 +40,6 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
 	if !bootstrapped {
 		msg = "GPU bootstrap discovery took"
 		libDirs = make(map[string]struct{})
-		var err error
-		exe, err = os.Executable()
-		if err != nil {
-			slog.Error("unable to lookup executable path", "error", err)
-			return nil
-		}
-		if eval, err := filepath.EvalSymlinks(exe); err == nil {
-			exe = eval
-		}
 		files, err := filepath.Glob(filepath.Join(ml.LibOllamaPath, "*", "*ggml-*"))
 		if err != nil {
 			slog.Debug("unable to lookup runner library directories", "error", err)
@@ -66,6 +54,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.

 		slog.Info("discovering available GPUs...")
 		detectIncompatibleLibraries()
+		detectOldAMDDriverWindows()

 		// Warn if any user-overrides are set which could lead to incorrect GPU discovery
 		overrideWarnings()
@@ -102,8 +91,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
 				} else if jetpack == "" && strings.Contains(filepath.Base(dir), "cuda_jetpack") {
 					slog.Debug("jetpack not detected (set JETSON_JETPACK or OLLAMA_LLM_LIBRARY to override), skipping", "libDir", dir)
 					continue
-				} else if !envconfig.EnableVulkan() && strings.Contains(filepath.Base(dir), "vulkan") {
-					slog.Info("experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1")
+				} else if !envconfig.EnableVulkan(true) && strings.Contains(filepath.Base(dir), "vulkan") {
 					continue
 				}
 				dirs = []string{ml.LibOllamaPath, dir}
@@ -113,10 +101,16 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.

 			ctx1stPass, cancel := context.WithTimeout(ctx, bootstrapTimeout)
 			// For this pass, we retain duplicates in case any are incompatible with some libraries
-			devices = append(devices, bootstrapDevicesWithMetalRetry(ctx1stPass, ctx, bootstrapTimeout, dirs, nil)...)
+			discovered := bootstrapDevicesWithMetalRetry(ctx1stPass, ctx, bootstrapTimeout, dirs, nil)
+			if filepath.Base(dirs[len(dirs)-1]) == "cuda_v12" {
+				discovered = filterOldCUDADriver(ctx, discovered)
+			}
+			devices = append(devices, discovered...)
 			cancel()
 		}

+		devices = filterIntegratedGPUs(devices)
+
 		// In the second pass, we more deeply initialize the GPUs to weed out devices that
 		// aren't supported by a given library.  We run this phase in parallel to speed up discovery.
 		// Only devices that need verification are included in this pass
@@ -146,7 +140,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
 			wg.Add(1)
 			go func(i int) {
 				defer wg.Done()
-				extraEnvs := ml.GetDevicesEnv(devices[i:i+1], true)
+				extraEnvs := ml.GetDevicesEnv(devices[i : i+1])
 				devices[i].AddInitValidation(extraEnvs)
 				if len(bootstrapDevicesWithMetalRetry(ctx2ndPass, ctx, 30*time.Second, devices[i].LibraryPath, extraEnvs)) == 0 {
 					slog.Debug("filtering device which didn't fully initialize",
@@ -193,6 +187,7 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
 					devices[i].FilterID = devices[i].ID
 					devices[i].ID = strconv.Itoa(postFilteredID[devices[i].Library])
 				}
+				remapFilterIDForUserVisibleDevices(&devices[i])
 				postFilteredID[devices[i].Library]++
 			}
 		}
@@ -328,18 +323,18 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.

 			// Bootstrapping may take longer in some cases (AMD windows), but we
 			// would rather use stale free data to get the model running sooner
-			ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
+			rctx, cancel := context.WithTimeout(ctx, 3*time.Second)
 			defer cancel()

-			// Apply any dev filters to avoid re-discovering unsupported devices, and get IDs correct
-			// We avoid CUDA filters here to keep ROCm from failing to discover GPUs in a mixed environment
-			devFilter := ml.GetDevicesEnv(devices, false)
+			// Apply any device filters to avoid re-discovering unsupported devices
+			// and keep remapped IDs aligned.
+			devFilter := ml.GetDevicesEnv(devices)

 			for dir := range libDirs {
-				updatedDevices := bootstrapDevices(ctx, []string{ml.LibOllamaPath, dir}, devFilter)
+				updatedDevices := bootstrapDevicesWithMetalRetry(rctx, ctx, 3*time.Second, []string{ml.LibOllamaPath, dir}, devFilter)
 				for _, u := range updatedDevices {
 					for i := range devices {
-						if u.DeviceID == devices[i].DeviceID && u.PCIID == devices[i].PCIID {
+						if sameRefreshDevice(u, devices[i]) {
 							updated[i] = true
 							devices[i].FreeMemory = u.FreeMemory
 							break
@@ -360,6 +355,64 @@ func GPUDevices(ctx context.Context, runners []ml.FilteredRunnerDiscovery) []ml.
 	return append([]ml.DeviceInfo{}, devices...)
 }

+func sameRefreshDevice(updated, existing ml.DeviceInfo) bool {
+	if updated.Library != existing.Library {
+		return false
+	}
+	if updated.PCIID != "" && existing.PCIID != "" {
+		return strings.EqualFold(updated.PCIID, existing.PCIID)
+	}
+	return updated.DeviceID == existing.DeviceID
+}
+
+func filterIntegratedGPUs(devices []ml.DeviceInfo) []ml.DeviceInfo {
+	if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
+		return devices
+	}
+
+	allow, explicit := integratedGPUAdmission()
+	filtered := devices[:0]
+	for _, device := range devices {
+		if !device.Integrated {
+			filtered = append(filtered, device)
+			continue
+		}
+
+		if explicit {
+			if allow {
+				filtered = append(filtered, device)
+				continue
+			}
+		} else if integratedGPUAllowedByDefault(device) {
+			filtered = append(filtered, device)
+			continue
+		}
+
+		slog.Info("dropping integrated GPU",
+			"id", device.ID,
+			"library", device.Library,
+			"compute", device.Compute(),
+			"name", device.Name,
+			"description", device.Description,
+			"pci_id", device.PCIID)
+	}
+
+	return filtered
+}
+
+func integratedGPUAdmission() (allow, explicit bool) {
+	enabledWithTrueDefault := envconfig.EnableIntegratedGPU(true)
+	enabledWithFalseDefault := envconfig.EnableIntegratedGPU(false)
+	if enabledWithTrueDefault == enabledWithFalseDefault {
+		return enabledWithTrueDefault, true
+	}
+	return false, false
+}
+
+func integratedGPUAllowedByDefault(device ml.DeviceInfo) bool {
+	return device.Library == "CUDA"
+}
+
 func filterOverlapByLibrary(supported map[string]map[string]map[string]int, needsDelete []bool) {
 	// For multi-GPU systems, use the newest version that supports all the GPUs
 	for _, byLibDirs := range supported {
@@ -410,67 +463,216 @@ func filterOverlapByLibrary(supported map[string]map[string]map[string]int, need
 	}
 }

-type bootstrapRunner struct {
-	port int
-	cmd  *exec.Cmd
-}
-
-func (r *bootstrapRunner) GetPort() int {
-	return r.port
-}
-
-func (r *bootstrapRunner) HasExited() bool {
-	if r.cmd != nil && r.cmd.ProcessState != nil {
-		return true
-	}
-	return false
-}
-
 func bootstrapDevicesWithMetalRetry(firstAttemptCtx, retryParentCtx context.Context, timeout time.Duration, ollamaLibDirs []string, extraEnvs map[string]string) []ml.DeviceInfo {
-	runDiscovery := func(ctx context.Context, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, int, error) {
+	extraEnvs = normalizeDiscoveryEnv(ollamaLibDirs, extraEnvs)
+
+	runDiscovery := func(ctx context.Context, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
 		start := time.Now()
 		defer func() {
 			slog.Debug("bootstrap discovery took", "duration", time.Since(start), "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs)
 		}()
-		return bootstrapDevicesWithStatus(ctx, ollamaLibDirs, extraEnvs)
+		return bootstrapDevicesWithStatusWatchdog(ctx, ollamaLibDirs, extraEnvs)
 	}

-	devices, status, exitCode, err := runDiscovery(firstAttemptCtx, extraEnvs)
+	devices, status, err := runDiscovery(firstAttemptCtx, extraEnvs)
 	if err == nil {
 		recordPersistentRunnerEnv(devices, extraEnvs)
+		return devices
 	}
-	if err != nil && llm.ShouldRetryWithMetalTensorDisabled(err, status) && (extraEnvs == nil || extraEnvs["GGML_METAL_TENSOR_DISABLE"] != "1") {
+
+	if llm.ShouldRetryWithMetalTensorDisabled(err, status) && (extraEnvs == nil || extraEnvs["GGML_METAL_TENSOR_DISABLE"] != "1") {
 		retryEnvs := map[string]string{}
 		for k, v := range extraEnvs {
 			retryEnvs[k] = v
 		}
 		retryEnvs["GGML_METAL_TENSOR_DISABLE"] = "1"
-		slog.Warn("retrying GPU discovery with Metal tensor API disabled", "error", err)
+		slog.Warn("retrying llama-server GPU discovery with Metal tensor API disabled", "error", err, "detail", lastDiscoveryStatusError(status))
+
 		retryCtx, cancel := context.WithTimeout(retryParentCtx, timeout)
 		defer cancel()
-		devices, status, exitCode, err = runDiscovery(retryCtx, retryEnvs)
+		devices, status, err = runDiscovery(retryCtx, retryEnvs)
 		if err == nil {
 			recordPersistentRunnerEnv(devices, retryEnvs)
+			return devices
 		}
 	}

-	if err != nil {
-		if exitCode >= 0 {
-			// Expected during bootstrapping while we filter out unsupported GPUs.
-			logutil.Trace("runner exited", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "code", exitCode, "detail", status.LastError())
-		} else {
-			slog.Info("failure during GPU discovery", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", err, "detail", status.LastError())
-		}
-	}
-
+	slog.Info("failure during llama-server GPU discovery", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", err, "detail", lastDiscoveryStatusError(status))
 	return devices
 }

+func normalizeDiscoveryEnv(ollamaLibDirs []string, extraEnvs map[string]string) map[string]string {
+	return normalizeDiscoveryEnvForGOOS(runtime.GOOS, ollamaLibDirs, extraEnvs)
+}
+
+func normalizeDiscoveryEnvForGOOS(goos string, ollamaLibDirs []string, extraEnvs map[string]string) map[string]string {
+	if goos != "linux" || len(ollamaLibDirs) == 0 || !isROCmLibraryDir(filepath.Base(ollamaLibDirs[len(ollamaLibDirs)-1])) {
+		return extraEnvs
+	}
+
+	if extraEnvs["ROCR_VISIBLE_DEVICES"] != "" || envconfig.RocrVisibleDevices() != "" {
+		return extraEnvs
+	}
+
+	source, tokens := rocmNumericVisibleDeviceSource(extraEnvs)
+	if len(tokens) == 0 {
+		return extraEnvs
+	}
+
+	env := make(map[string]string, len(extraEnvs)+1)
+	for k, v := range extraEnvs {
+		env[k] = v
+	}
+	env["ROCR_VISIBLE_DEVICES"] = strings.Join(tokens, ",")
+	env[source] = visibleDeviceOrdinals(len(tokens))
+	slog.Debug("normalizing AMD visible devices for ROCm discovery", "from_env", source, "ROCR_VISIBLE_DEVICES", env["ROCR_VISIBLE_DEVICES"], "visible_ordinals", env[source])
+	return env
+}
+
+func isROCmLibraryDir(name string) bool {
+	return strings.HasPrefix(name, "rocm")
+}
+
+type bootstrapDevicesResult struct {
+	devices []ml.DeviceInfo
+	status  *llm.StatusWriter
+	err     error
+}
+
+func bootstrapDevicesWithStatusWatchdog(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
+	return runBootstrapDevicesWithStatusWatchdog(ctx, ollamaLibDirs, extraEnvs, llamaServerBootstrapDevicesWithStatus)
+}
+
+func runBootstrapDevicesWithStatusWatchdog(
+	ctx context.Context,
+	ollamaLibDirs []string,
+	extraEnvs map[string]string,
+	discover func(context.Context, []string, map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error),
+) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
+	resultCh := make(chan bootstrapDevicesResult, 1)
+	go func() {
+		devices, status, err := discover(ctx, ollamaLibDirs, extraEnvs)
+		resultCh <- bootstrapDevicesResult{devices: devices, status: status, err: err}
+	}()
+
+	select {
+	case result := <-resultCh:
+		return result.devices, result.status, result.err
+	case <-ctx.Done():
+		slog.Warn("llama-server GPU discovery watchdog timed out", "OLLAMA_LIBRARY_PATH", ollamaLibDirs, "extra_envs", extraEnvs, "error", ctx.Err())
+		return nil, nil, ctx.Err()
+	}
+}
+
+func remapFilterIDForUserVisibleDevices(device *ml.DeviceInfo) {
+	tokens := visibleDeviceFilterTokens(runtime.GOOS, device.Library)
+	if len(tokens) == 0 {
+		return
+	}
+
+	id := device.FilterID
+	if id == "" {
+		id = device.ID
+	}
+	index, err := strconv.Atoi(id)
+	if err != nil || index < 0 || index >= len(tokens) {
+		return
+	}
+
+	device.FilterID = tokens[index]
+}
+
+func visibleDeviceFilterTokens(goos, library string) []string {
+	switch library {
+	case "CUDA":
+		return splitVisibleDeviceList(envconfig.CudaVisibleDevices())
+	case "ROCm":
+		if goos == "linux" {
+			if tokens := splitVisibleDeviceList(envconfig.RocrVisibleDevices()); len(tokens) > 0 {
+				return tokens
+			}
+			if _, tokens := rocmNumericVisibleDeviceSource(nil); len(tokens) > 0 {
+				return tokens
+			}
+			return nil
+		}
+		for _, value := range []string{envconfig.HipVisibleDevices(), envconfig.GpuDeviceOrdinal(), envconfig.CudaVisibleDevices()} {
+			if tokens := splitNumericVisibleDeviceList(value); len(tokens) > 0 {
+				return tokens
+			}
+		}
+	case "Vulkan":
+		return splitVisibleDeviceList(envconfig.VkVisibleDevices())
+	}
+
+	return nil
+}
+
+func rocmNumericVisibleDeviceSource(extraEnvs map[string]string) (string, []string) {
+	for _, name := range []string{"HIP_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "CUDA_VISIBLE_DEVICES"} {
+		value := extraEnvs[name]
+		if value == "" {
+			switch name {
+			case "HIP_VISIBLE_DEVICES":
+				value = envconfig.HipVisibleDevices()
+			case "GPU_DEVICE_ORDINAL":
+				value = envconfig.GpuDeviceOrdinal()
+			case "CUDA_VISIBLE_DEVICES":
+				value = envconfig.CudaVisibleDevices()
+			}
+		}
+		if tokens := splitNumericVisibleDeviceList(value); len(tokens) > 0 {
+			return name, tokens
+		}
+	}
+	return "", nil
+}
+
+func splitVisibleDeviceList(value string) []string {
+	fields := strings.Split(value, ",")
+	tokens := make([]string, 0, len(fields))
+	for _, field := range fields {
+		field = strings.TrimSpace(field)
+		if field != "" {
+			tokens = append(tokens, field)
+		}
+	}
+	return tokens
+}
+
+func splitNumericVisibleDeviceList(value string) []string {
+	tokens := splitVisibleDeviceList(value)
+	if len(tokens) == 0 {
+		return nil
+	}
+	for _, token := range tokens {
+		index, err := strconv.Atoi(token)
+		if err != nil || index < 0 {
+			return nil
+		}
+	}
+	return tokens
+}
+
+func visibleDeviceOrdinals(count int) string {
+	ordinals := make([]string, count)
+	for i := range ordinals {
+		ordinals[i] = strconv.Itoa(i)
+	}
+	return strings.Join(ordinals, ",")
+}
+
+func lastDiscoveryStatusError(status *llm.StatusWriter) string {
+	if status == nil {
+		return ""
+	}
+	return status.LastError()
+}
+
 func recordPersistentRunnerEnv(devices []ml.DeviceInfo, extraEnvs map[string]string) {
 	if extraEnvs["GGML_METAL_TENSOR_DISABLE"] != "1" {
 		return
 	}
-
 	for i := range devices {
 		if devices[i].Library != "Metal" {
 			continue
@@ -482,44 +684,6 @@ func recordPersistentRunnerEnv(devices []ml.DeviceInfo, extraEnvs map[string]str
 	}
 }

-func bootstrapDevices(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) []ml.DeviceInfo {
-	devices, _, _, _ := bootstrapDevicesWithStatus(ctx, ollamaLibDirs, extraEnvs)
-	return devices
-}
-
-func bootstrapDevicesWithStatus(ctx context.Context, ollamaLibDirs []string, extraEnvs map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, int, error) {
-	var baseOut io.Writer = io.Discard
-	if envconfig.LogLevel() == logutil.LevelTrace {
-		baseOut = os.Stderr
-	}
-
-	status := llm.NewStatusWriter(baseOut)
-	cmd, port, err := llm.StartRunner(
-		true, // ollama engine
-		"",   // no model
-		ollamaLibDirs,
-		status,
-		extraEnvs,
-	)
-	if err != nil {
-		slog.Debug("failed to start runner to discovery GPUs", "error", err)
-		return nil, status, -1, err
-	}
-
-	go func() {
-		cmd.Wait() // exit status ignored
-	}()
-
-	defer cmd.Process.Kill()
-
-	devices, err := ml.GetDevicesFromRunner(ctx, &bootstrapRunner{port: port, cmd: cmd})
-	exitCode := -1
-	if cmd.ProcessState != nil {
-		exitCode = cmd.ProcessState.ExitCode()
-	}
-	return devices, status, exitCode, err
-}
-
 func overrideWarnings() {
 	anyFound := false
 	m := envconfig.AsMap()
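
Reviewer note: the Linux ROCm normalization in `normalizeDiscoveryEnvForGOOS` moves a numeric device list into `ROCR_VISIBLE_DEVICES` and rewrites the original variable to dense ordinals. The following is a minimal standalone sketch of that idea, assuming a single source variable and no pre-existing `ROCR_VISIBLE_DEVICES`; `splitNumeric`, `ordinals`, and `normalizeVisibleDevices` are hypothetical names, not functions from this patch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitNumeric returns the trimmed, non-empty tokens of a comma-separated
// list, or nil if any token is not a non-negative integer (e.g. a GPU UUID).
func splitNumeric(value string) []string {
	var tokens []string
	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		if n, err := strconv.Atoi(field); err != nil || n < 0 {
			return nil
		}
		tokens = append(tokens, field)
	}
	return tokens
}

// ordinals returns "0,1,...,count-1".
func ordinals(count int) string {
	out := make([]string, count)
	for i := range out {
		out[i] = strconv.Itoa(i)
	}
	return strings.Join(out, ",")
}

// normalizeVisibleDevices moves the numeric selection from source into
// ROCR_VISIBLE_DEVICES and rewrites source to ordinals over that subset.
// It returns nil when the value is not a purely numeric list.
func normalizeVisibleDevices(source, value string) map[string]string {
	tokens := splitNumeric(value)
	if len(tokens) == 0 {
		return nil
	}
	return map[string]string{
		"ROCR_VISIBLE_DEVICES": strings.Join(tokens, ","),
		source:                 ordinals(len(tokens)),
	}
}

func main() {
	env := normalizeVisibleDevices("HIP_VISIBLE_DEVICES", "2, 0")
	fmt.Println("ROCR_VISIBLE_DEVICES =", env["ROCR_VISIBLE_DEVICES"]) // 2,0
	fmt.Println("HIP_VISIBLE_DEVICES  =", env["HIP_VISIBLE_DEVICES"])  // 0,1

	// Non-numeric selectors such as UUIDs are left alone.
	fmt.Println(normalizeVisibleDevices("CUDA_VISIBLE_DEVICES", "GPU-abc") == nil) // true
}
```

The rewrite to ordinals matters because once the lower-level filter has reduced the visible set, the higher-level variable indexes into that reduced set rather than the full device list.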
--- a/discover/runner_test.go
+++ b/discover/runner_test.go
@@ -1,18 +1,16 @@
 package discover

 import (
-	"log/slog"
-	"os"
+	"context"
+	"errors"
+	"runtime"
 	"testing"
+	"time"

+	"github.com/ollama/ollama/llm"
 	"github.com/ollama/ollama/ml"
 )

-func init() {
-	logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelDebug}))
-	slog.SetDefault(logger)
-}
-
 func TestFilterOverlapByLibrary(t *testing.T) {
 	type testcase struct {
 		name string
@@ -133,3 +131,266 @@ func TestRecordPersistentRunnerEnv(t *testing.T) {
 		t.Fatalf("unexpected RunnerEnvOverrides recorded for non-Metal device: %#v", devices[1].RunnerEnvOverrides)
 	}
 }
+
+func TestFilterIntegratedGPUs(t *testing.T) {
+	devices := []ml.DeviceInfo{
+		{DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"}, Description: "NVIDIA integrated", Integrated: true},
+		{DeviceID: ml.DeviceID{Library: "Metal", ID: "0"}, Description: "Apple GPU", Integrated: true},
+		{DeviceID: ml.DeviceID{Library: "Vulkan", ID: "0"}, Description: "AMD Radeon(TM) Graphics", Integrated: true},
+		{DeviceID: ml.DeviceID{Library: "ROCm", ID: "0"}, Description: "AMD Radeon(TM) Graphics", Integrated: true},
+		{DeviceID: ml.DeviceID{Library: "Vulkan", ID: "1"}, Description: "AMD Radeon RX 6800"},
+	}
+
+	if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
+		t.Setenv("OLLAMA_IGPU_ENABLE", "false")
+		got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
+		want := []ml.DeviceID{
+			{Library: "CUDA", ID: "0"},
+			{Library: "Metal", ID: "0"},
+			{Library: "Vulkan", ID: "0"},
+			{Library: "ROCm", ID: "0"},
+			{Library: "Vulkan", ID: "1"},
+		}
+		assertDeviceIDs(t, got, want)
+		return
+	}
+
+	t.Run("auto admits only allowlisted integrated GPUs", func(t *testing.T) {
+		got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
+		want := []ml.DeviceID{
+			{Library: "CUDA", ID: "0"},
+			{Library: "Vulkan", ID: "1"},
+		}
+		assertDeviceIDs(t, got, want)
+	})
+
+	t.Run("explicit true admits all integrated GPUs", func(t *testing.T) {
+		t.Setenv("OLLAMA_IGPU_ENABLE", "true")
+		got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
+		want := []ml.DeviceID{
+			{Library: "CUDA", ID: "0"},
+			{Library: "Metal", ID: "0"},
+			{Library: "Vulkan", ID: "0"},
+			{Library: "ROCm", ID: "0"},
+			{Library: "Vulkan", ID: "1"},
+		}
+		assertDeviceIDs(t, got, want)
+	})
+
+	t.Run("explicit false drops integrated GPUs", func(t *testing.T) {
+		t.Setenv("OLLAMA_IGPU_ENABLE", "false")
+		got := filterIntegratedGPUs(append([]ml.DeviceInfo{}, devices...))
+		want := []ml.DeviceID{{Library: "Vulkan", ID: "1"}}
+		assertDeviceIDs(t, got, want)
+	})
+}
+
+func assertDeviceIDs(t *testing.T, got []ml.DeviceInfo, want []ml.DeviceID) {
+	t.Helper()
+	if len(got) != len(want) {
+		t.Fatalf("got %d devices, want %d: %#v", len(got), len(want), got)
+	}
+	for i := range want {
+		if got[i].DeviceID != want[i] {
+			t.Fatalf("device %d = %#v, want %#v", i, got[i].DeviceID, want[i])
+		}
+	}
+}
+
+func TestRemapFilterIDForUserVisibleDevices(t *testing.T) {
+	tests := []struct {
+		name       string
+		env        map[string]string
+		device     ml.DeviceInfo
+		wantID     string
+		wantFilter string
+	}{
+		{
+			name: "cuda numeric parent filter",
+			env:  map[string]string{"CUDA_VISIBLE_DEVICES": "1"},
+			device: ml.DeviceInfo{
+				DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"},
+				FilterID: "0",
+			},
+			wantID:     "0",
+			wantFilter: "1",
+		},
+		{
+			name: "cuda uuid parent filter",
+			env:  map[string]string{"CUDA_VISIBLE_DEVICES": "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95"},
+			device: ml.DeviceInfo{
+				DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"},
+				FilterID: "0",
+			},
+			wantID:     "0",
+			wantFilter: "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95",
+		},
+		{
+			name: "rocm hip parent filter",
+			env:  map[string]string{"HIP_VISIBLE_DEVICES": "2,0"},
+			device: ml.DeviceInfo{
+				DeviceID: ml.DeviceID{Library: "ROCm", ID: "1"},
+				FilterID: "1",
+			},
+			wantID:     "1",
+			wantFilter: "0",
+		},
+		{
+			name: "vulkan parent filter",
+			env:  map[string]string{"GGML_VK_VISIBLE_DEVICES": "1"},
+			device: ml.DeviceInfo{
+				DeviceID: ml.DeviceID{Library: "Vulkan", ID: "0"},
+				FilterID: "0",
+			},
+			wantID:     "0",
+			wantFilter: "1",
+		},
+		{
+			name: "no parent filter keeps internal filter id",
+			device: ml.DeviceInfo{
+				DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"},
+				FilterID: "3",
+			},
+			wantID:     "0",
+			wantFilter: "3",
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			for key, value := range tt.env {
+				t.Setenv(key, value)
+			}
+
+			remapFilterIDForUserVisibleDevices(&tt.device)
+
+			if tt.device.ID != tt.wantID {
+				t.Fatalf("ID = %q, want %q", tt.device.ID, tt.wantID)
+			}
+			if tt.device.FilterID != tt.wantFilter {
+				t.Fatalf("FilterID = %q, want %q", tt.device.FilterID, tt.wantFilter)
+			}
+		})
+	}
+}
+
+func TestNormalizeROCmDiscoveryEnv(t *testing.T) {
+	tests := []struct {
+		name        string
+		env         map[string]string
+		extra       map[string]string
+		wantROCR    string
+		wantSource  string
+		wantOrdinal string
+		wantSame    bool
+	}{
+		{
+			name:        "hip becomes rocr",
+			env:         map[string]string{"HIP_VISIBLE_DEVICES": "2"},
+			wantROCR:    "2",
+			wantSource:  "HIP_VISIBLE_DEVICES",
+			wantOrdinal: "0",
+		},
+		{
+			name:        "gpu ordinal becomes rocr",
+			env:         map[string]string{"GPU_DEVICE_ORDINAL": "3"},
+			wantROCR:    "3",
+			wantSource:  "GPU_DEVICE_ORDINAL",
+			wantOrdinal: "0",
+		},
+		{
+			name:        "cuda numeric becomes rocr",
+			env:         map[string]string{"CUDA_VISIBLE_DEVICES": "2,0"},
+			wantROCR:    "2,0",
+			wantSource:  "CUDA_VISIBLE_DEVICES",
+			wantOrdinal: "0,1",
+		},
+		{
+			name:     "rocr wins",
+			env:      map[string]string{"ROCR_VISIBLE_DEVICES": "1", "HIP_VISIBLE_DEVICES": "2"},
+			wantSame: true,
+		},
+		{
+			name:     "cuda uuid does not become rocr",
+			env:      map[string]string{"CUDA_VISIBLE_DEVICES": "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95"},
+			wantSame: true,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			for key, value := range tt.env {
+				t.Setenv(key, value)
+			}
+
+			got := normalizeDiscoveryEnvForGOOS("linux", []string{"/lib/ollama", "/lib/ollama/rocm_v7_2"}, tt.extra)
+
+			if tt.wantSame {
+				if got != nil && got["ROCR_VISIBLE_DEVICES"] != "" {
+					t.Fatalf("ROCR_VISIBLE_DEVICES = %q, want unset", got["ROCR_VISIBLE_DEVICES"])
+				}
+				return
+			}
+			if got["ROCR_VISIBLE_DEVICES"] != tt.wantROCR {
+				t.Fatalf("ROCR_VISIBLE_DEVICES = %q, want %q", got["ROCR_VISIBLE_DEVICES"], tt.wantROCR)
+			}
+			if got[tt.wantSource] != tt.wantOrdinal {
+				t.Fatalf("%s = %q, want %q", tt.wantSource, got[tt.wantSource], tt.wantOrdinal)
+			}
+		})
+	}
+}
+
+func TestBootstrapDevicesWithStatusWatchdogReturnsResult(t *testing.T) {
+	want := []ml.DeviceInfo{{DeviceID: ml.DeviceID{Library: "CUDA", ID: "0"}}}
+	devices, _, err := runBootstrapDevicesWithStatusWatchdog(
+		t.Context(),
+		[]string{"/lib/ollama", "/lib/ollama/cuda_v12"},
+		nil,
+		func(context.Context, []string, map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
+			return want, nil, nil
+		},
+	)
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	if len(devices) != 1 || devices[0].DeviceID != want[0].DeviceID {
+		t.Fatalf("devices = %#v, want %#v", devices, want)
+	}
+}
+
+func TestBootstrapDevicesWithStatusWatchdogReturnsOnDeadline(t *testing.T) {
+	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Millisecond)
+	defer cancel()
+
+	started := make(chan struct{})
+	release := make(chan struct{})
+	finished := make(chan struct{})
+
+	_, _, err := runBootstrapDevicesWithStatusWatchdog(
+		ctx,
+		[]string{"/lib/ollama", "/lib/ollama/rocm_v7_2"},
+		nil,
+		func(context.Context, []string, map[string]string) ([]ml.DeviceInfo, *llm.StatusWriter, error) {
+			close(started)
+			defer close(finished)
+			<-release
+			return nil, nil, nil
+		},
+	)
+	if !errors.Is(err, context.DeadlineExceeded) {
+		t.Fatalf("err = %v, want context deadline exceeded", err)
+	}
+	close(release)
+
+	select {
+	case <-started:
+	case <-time.After(time.Second):
+		t.Fatal("discovery function was not called")
+	}
+	select {
+	case <-finished:
+	case <-time.After(time.Second):
+		t.Fatal("discovery function did not finish after release")
+	}
+}
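Reviewer note: the `TestNormalizeROCmDiscoveryEnv` cases above imply a simple rule. A numeric selector is copied verbatim into `ROCR_VISIBLE_DEVICES`, and the original variable is rewritten to positional ordinals (`"2,0"` becomes `"0,1"`). Non-numeric selectors such as CUDA UUIDs are left alone. The sketch below illustrates that rule only. `renumberSelector` is a hypothetical helper, not the body of `normalizeDiscoveryEnvForGOOS`, which this hunk does not show.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// renumberSelector is a hypothetical helper, not the PR's implementation.
// For a comma-separated numeric selector it returns the value to place in
// ROCR_VISIBLE_DEVICES (the selector unchanged) and the positional ordinals
// to write back to the source variable. It reports ok=false when any entry
// is not an integer, for example a CUDA UUID.
func renumberSelector(selector string) (rocr, ordinals string, ok bool) {
	parts := strings.Split(selector, ",")
	renumbered := make([]string, len(parts))
	for i, part := range parts {
		if _, err := strconv.Atoi(strings.TrimSpace(part)); err != nil {
			return "", "", false
		}
		renumbered[i] = strconv.Itoa(i)
	}
	return selector, strings.Join(renumbered, ","), true
}

func main() {
	selectors := []string{"2", "2,0", "GPU-f3a94ab8-b31d-61ff-9fbb-ce91ac1cdd95"}
	for _, s := range selectors {
		rocr, ordinals, ok := renumberSelector(s)
		fmt.Printf("%q -> rocr=%q ordinals=%q ok=%v\n", s, rocr, ordinals, ok)
	}
}
```

The "rocr wins" case is outside this sketch: when `ROCR_VISIBLE_DEVICES` is already set, the tests expect no rewrite at all.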
--- a/discover/types.go
+++ b/discover/types.go
@@ -16,16 +16,6 @@ type memInfo struct {
 	FreeSwap    uint64 `json:"free_swap,omitempty"` // TODO split this out for system only
 }

-// CPU type represents a CPU Package occupying a socket
-type CPU struct {
-	ID                  string `cpuinfo:"processor"`
-	VendorID            string `cpuinfo:"vendor_id"`
-	ModelName           string `cpuinfo:"model name"`
-	CoreCount           int
-	EfficiencyCoreCount int // Performance = CoreCount - Efficiency
-	ThreadCount         int
-}
-
 func LogDetails(devices []ml.DeviceInfo) {
 	sort.Sort(sort.Reverse(ml.ByFreeMemory(devices))) // Report devices in order of scheduling preference
 	for _, dev := range devices {
--- a/discover/vulkan.go
+++ b/discover/vulkan.go
@@ -0,0 +1,136 @@
+// Vulkan discovery needs a small amount of normalization around device type.
+// llama-server discovery output does not currently expose a stable structured
+// backend type field, so we use explicit Vulkan UMA metadata when it is
+// present and, on Windows, refine the result with a direct Vulkan API query.
+// The goal is to preserve correct integrated-vs-discrete scheduling decisions
+// without relying on device-name heuristics.
+package discover
+
+import (
+	"bufio"
+	"errors"
+	"log/slog"
+	"regexp"
+	"runtime"
+	"strconv"
+	"strings"
+
+	"github.com/ollama/ollama/ml"
+)
+
+// vulkanUMARegex matches Vulkan debug lines like:
+//
+//	ggml_vulkan: 0 = Intel(R) Graphics (...) | uma: 1 | fp16: 1 |
+var vulkanUMARegex = regexp.MustCompile(
+	`ggml_vulkan:\s+(\d+)\s+=.*\|\s+uma:\s+([01])\s+\|`,
+)
+
+func parseVulkanUMA(output string) map[int]bool {
+	integratedByIndex := make(map[int]bool)
+
+	scanner := bufio.NewScanner(strings.NewReader(output))
+	for scanner.Scan() {
+		if matches := vulkanUMARegex.FindStringSubmatch(scanner.Text()); matches != nil {
+			idx, _ := strconv.Atoi(matches[1])
+			integratedByIndex[idx] = matches[2] == "1"
+		}
+	}
+
+	return integratedByIndex
+}
+
+var errWindowsVulkanProbeUnsupported = errors.New("windows vulkan probe unsupported")
+
+type vulkanPhysicalDevice struct {
+	Name       string
+	Integrated bool
+}
+
+var probeLlamaServerVulkanDevices = func(_ []string) ([]vulkanPhysicalDevice, error) {
+	return nil, errWindowsVulkanProbeUnsupported
+}
+
+func refineLlamaServerDevices(devices []ml.DeviceInfo, libDirs []string) []ml.DeviceInfo {
+	devices = refineLinuxROCmDevices(devices)
+	return refineWindowsVulkanDevices(devices, libDirs)
+}
+
+func refineWindowsVulkanDevices(devices []ml.DeviceInfo, libDirs []string) []ml.DeviceInfo {
+	if runtime.GOOS != "windows" {
+		return devices
+	}
+
+	var vulkanIndexes []int
+	for i, device := range devices {
+		if device.Library != "Vulkan" {
+			continue
+		}
+		vulkanIndexes = append(vulkanIndexes, i)
+	}
+
+	if len(vulkanIndexes) == 0 {
+		return devices
+	}
+
+	probed, err := probeLlamaServerVulkanDevices(libDirs)
+	if err != nil {
+		if !errors.Is(err, errWindowsVulkanProbeUnsupported) {
+			slog.Debug("windows vulkan device refinement unavailable", "error", err)
+		}
+		return devices
+	}
+
+	if !applyWindowsVulkanRefinement(devices, probed) {
+		return devices
+	}
+
+	return devices
+}
+
+func applyWindowsVulkanRefinement(devices []ml.DeviceInfo, probed []vulkanPhysicalDevice) bool {
+	var vulkanIndexes []int
+	for i, device := range devices {
+		if device.Library == "Vulkan" {
+			vulkanIndexes = append(vulkanIndexes, i)
+		}
+	}
+
+	if len(probed) != len(vulkanIndexes) {
+		slog.Debug("windows vulkan device refinement skipped: device count mismatch",
+			"llama_server_count", len(vulkanIndexes), "vulkan_count", len(probed))
+		return false
+	}
+
+	matches := make([]int, len(vulkanIndexes))
+	for i := range matches {
+		matches[i] = -1
+	}
+	used := make([]bool, len(probed))
+	for i, deviceIndex := range vulkanIndexes {
+		description := devices[deviceIndex].Description
+		for j, probedDevice := range probed {
+			if used[j] || !sameVulkanDeviceName(description, probedDevice.Name) {
+				continue
+			}
+			matches[i] = j
+			used[j] = true
+			break
+		}
+		if matches[i] < 0 {
+			slog.Debug("windows vulkan device refinement skipped: device name mismatch",
+				"index", i, "llama_server_name", description)
+			return false
+		}
+	}
+
+	for i, probedIndex := range matches {
+		devices[vulkanIndexes[i]].Integrated = probed[probedIndex].Integrated
+	}
+
+	slog.Debug("windows vulkan device refinement applied", "devices", len(vulkanIndexes))
+	return true
+}
+
+func sameVulkanDeviceName(a, b string) bool {
+	return ml.SimilarDeviceDescription(a, b)
+}
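Reviewer note: a standalone sketch of the UMA parsing step, useful for checking the regular expression against sample output. The pattern is copied from `vulkanUMARegex` above. The log lines and the names `umaRegex` and `parseUMA` are made up for illustration, and the sketch skips lines whose index fails to parse rather than ignoring the error.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// umaRegex uses the same pattern as vulkanUMARegex in the diff.
var umaRegex = regexp.MustCompile(`ggml_vulkan:\s+(\d+)\s+=.*\|\s+uma:\s+([01])\s+\|`)

// parseUMA maps each Vulkan device index to true when the line reports uma: 1.
func parseUMA(output string) map[int]bool {
	integrated := make(map[int]bool)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		m := umaRegex.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		idx, err := strconv.Atoi(m[1])
		if err != nil {
			continue // index too large to represent; skip the line
		}
		integrated[idx] = m[2] == "1"
	}
	return integrated
}

func main() {
	// Hypothetical sample lines in the shape the regex comment describes.
	sample := "ggml_vulkan: 0 = Intel(R) Graphics (example driver) | uma: 1 | fp16: 1 |\n" +
		"ggml_vulkan: 1 = Example Discrete GPU (example driver) | uma: 0 | fp16: 1 |\n" +
		"unrelated log line\n"
	fmt.Println(parseUMA(sample))
}
```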
--- a/discover/vulkan_refine_stub.go
+++ b/discover/vulkan_refine_stub.go
@@ -0,0 +1,3 @@
+//go:build !windows
+
+package discover
--- a/discover/vulkan_refine_windows.go
+++ b/discover/vulkan_refine_windows.go
@@ -0,0 +1,119 @@
+package discover
+
+import (
+	"fmt"
+	"unsafe"
+
+	"github.com/ollama/ollama/llm"
+)
+
+const (
+	vkSuccess                           = 0
+	vkStructureTypeInstanceCreateInfo   = 1
+	vkPhysicalDeviceTypeIntegratedGPU   = 1
+	vkMaxPhysicalDeviceNameSize         = 256
+	vkPhysicalDevicePropertiesByteCount = 4096
+)
+
+type vkInstanceCreateInfo struct {
+	SType                   uint32
+	PNext                   uintptr
+	Flags                   uint32
+	PApplicationInfo        uintptr
+	EnabledLayerCount       uint32
+	PpEnabledLayerNames     uintptr
+	EnabledExtensionCount   uint32
+	PpEnabledExtensionNames uintptr
+}
+
+func init() {
+	probeLlamaServerVulkanDevices = windowsVulkanPhysicalDevices
+}
+
+func windowsVulkanPhysicalDevices(libDirs []string) ([]vulkanPhysicalDevice, error) {
+	vulkanPath, err := llm.WindowsVulkanRuntimeDLLPath(libDirs)
+	if err != nil {
+		return nil, err
+	}
+	vulkanDLL, err := loadDLLFromPath(vulkanPath)
+	if err != nil {
+		return nil, err
+	}
+	vkCreateInstanceProc, err := findProc(vulkanDLL, "vkCreateInstance")
+	if err != nil {
+		return nil, fmt.Errorf("vkCreateInstance unavailable: %w", err)
+	}
+	vkDestroyInstanceProc, err := findProc(vulkanDLL, "vkDestroyInstance")
+	if err != nil {
+		return nil, fmt.Errorf("vkDestroyInstance unavailable: %w", err)
+	}
+	vkEnumeratePhysicalDevices, err := findProc(vulkanDLL, "vkEnumeratePhysicalDevices")
+	if err != nil {
+		return nil, fmt.Errorf("vkEnumeratePhysicalDevices unavailable: %w", err)
+	}
+	vkGetPhysicalDeviceProperties, err := findProc(vulkanDLL, "vkGetPhysicalDeviceProperties")
+	if err != nil {
+		return nil, fmt.Errorf("vkGetPhysicalDeviceProperties unavailable: %w", err)
+	}
+
+	createInfo := vkInstanceCreateInfo{SType: vkStructureTypeInstanceCreateInfo}
+	var instance uintptr
+	result, _, callErr := vkCreateInstanceProc.Call(
+		uintptr(unsafe.Pointer(&createInfo)),
+		0,
+		uintptr(unsafe.Pointer(&instance)),
+	)
+	if result != vkSuccess {
+		return nil, fmt.Errorf("vkCreateInstance failed: result=%d error=%w", result, callErr)
+	}
+	defer vkDestroyInstanceProc.Call(instance, 0)
+
+	var count uint32
+	result, _, callErr = vkEnumeratePhysicalDevices.Call(
+		instance,
+		uintptr(unsafe.Pointer(&count)),
+		0,
+	)
+	if result != vkSuccess {
+		return nil, fmt.Errorf("vkEnumeratePhysicalDevices count failed: result=%d error=%w", result, callErr)
+	}
+	if count == 0 {
+		return nil, nil
+	}
+
+	physicalDevices := make([]uintptr, int(count))
+	result, _, callErr = vkEnumeratePhysicalDevices.Call(
+		instance,
+		uintptr(unsafe.Pointer(&count)),
+		uintptr(unsafe.Pointer(&physicalDevices[0])),
+	)
+	if result != vkSuccess {
+		return nil, fmt.Errorf("vkEnumeratePhysicalDevices failed: result=%d error=%w", result, callErr)
+	}
+
+	devices := make([]vulkanPhysicalDevice, 0, count)
+	for _, physicalDevice := range physicalDevices[:int(count)] {
+		properties := make([]byte, vkPhysicalDevicePropertiesByteCount)
+		vkGetPhysicalDeviceProperties.Call(
+			physicalDevice,
+			uintptr(unsafe.Pointer(&properties[0])),
+		)
+		deviceType := *(*uint32)(unsafe.Pointer(&properties[16]))
+		deviceNameBytes := properties[20 : 20+vkMaxPhysicalDeviceNameSize]
+		devices = append(devices, vulkanPhysicalDevice{
+			Name:       nulTerminatedString(deviceNameBytes),
+			Integrated: deviceType == vkPhysicalDeviceTypeIntegratedGPU,
+		})
+	}
+
+	return devices, nil
+}
+
+func nulTerminatedString(data []byte) string {
+	for i, b := range data {
+		if b == 0 {
+			return string(data[:i])
+		}
+	}
+	return string(data)
+}
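Reviewer note: the probe above reads `deviceType` at byte offset 16 and `deviceName` at offset 20 of the raw `VkPhysicalDeviceProperties` buffer, which follows from the four leading `uint32` fields (`apiVersion`, `driverVersion`, `vendorID`, `deviceID`). The sketch below decodes the same two fields without `unsafe`, so the offsets can be exercised on any platform. It assumes a little-endian layout, which holds for the Windows targets in question. `decodeProperties` and the constant names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const (
	deviceTypeOffset  = 16  // after four uint32 fields of 4 bytes each
	deviceNameOffset  = 20  // deviceType is itself a 4-byte enum
	maxDeviceNameSize = 256 // VK_MAX_PHYSICAL_DEVICE_NAME_SIZE
	integratedGPU     = 1   // VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
)

// decodeProperties extracts the device name and integrated flag from a raw
// properties buffer. Assumption: the buffer is little-endian.
func decodeProperties(properties []byte) (name string, integrated bool, err error) {
	if len(properties) < deviceNameOffset+maxDeviceNameSize {
		return "", false, fmt.Errorf("properties buffer too short: %d bytes", len(properties))
	}
	deviceType := binary.LittleEndian.Uint32(properties[deviceTypeOffset:])
	raw := properties[deviceNameOffset : deviceNameOffset+maxDeviceNameSize]
	if i := bytes.IndexByte(raw, 0); i >= 0 {
		raw = raw[:i] // the name is NUL-terminated inside the fixed array
	}
	return string(raw), deviceType == integratedGPU, nil
}

func main() {
	// Build a fake buffer the same size the probe allocates.
	buf := make([]byte, 4096)
	binary.LittleEndian.PutUint32(buf[deviceTypeOffset:], integratedGPU)
	copy(buf[deviceNameOffset:], "Example iGPU")

	name, integrated, err := decodeProperties(buf)
	fmt.Println(name, integrated, err)
}
```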
--- a/docs/api.md
+++ b/docs/api.md
@@ -398,6 +398,7 @@ curl http://localhost:11434/api/generate -d '{
    "num_keep": 5,
    "seed": 42,
    "num_predict": 100,
+    "draft_num_predict": 4,
    "top_k": 20,
    "top_p": 0.9,
    "min_p": 0.0,
--- a/docs/development.md
+++ b/docs/development.md
@@ -3,9 +3,11 @@
 Install prerequisites:

 - [Go](https://go.dev/doc/install)
- C/C++ Compiler e.g. Clang on macOS, [TDM-GCC](https://github.com/jmeubank/tdm-gcc/releases/latest) (Windows amd64) or [llvm-mingw](https://github.com/mstorsjo/llvm-mingw) (Windows arm64), GCC/Clang on Linux.
+- [CMake](https://cmake.org/download/) 3.24 or newer
+- C/C++ compiler: Clang on macOS, Visual Studio 2022 C++ tools on Windows, or GCC/Clang on Linux
+- [Ninja](https://github.com/ninja-build/ninja/releases) in `PATH` is recommended, especially on Windows

-Then build and run Ollama from the root directory of the repository:
+For pure Go iteration against an existing native payload, run Ollama from the repository root:

 ```shell
 go run . serve
@@ -14,53 +16,73 @@ go run . serve
 > [!NOTE]
 > Ollama includes native code compiled with CGO.  From time to time these data structures can change and CGO can get out of sync resulting in unexpected crashes.  You can force a full build of the native code by running `go clean -cache` first. 

+## Native build model
+
+For a fresh checkout, or after changing native code, build from the repository root. On macOS arm64, this builds Metal inference; on all other platforms, it builds CPU-only inference by default. The build places the Go binary at the repository root and installs the native runtime payload under `build/lib/ollama`.
+
+```shell
+cmake -B build .
+cmake --build build --parallel 8
+./ollama serve
+```
+
+To install into a standard prefix layout:
+
+```shell
+cmake --install build --prefix /path/to/install
+```
+
+To build GPU backends on any platform other than macOS arm64, select them explicitly:
+
+```shell
+cmake -B build . -DOLLAMA_LLAMA_BACKENDS="cuda_v13;vulkan"
+cmake --build build --parallel 8
+```
+
+Supported backend values are `cuda_v12`, `cuda_v13`, `rocm_v7_1`, `rocm_v7_2`, `vulkan`, `cuda_jetpack5`, and `cuda_jetpack6`.
+
+Use standard CMake architecture overrides to narrow GPU builds for local hardware:
+
+```shell
+# CUDA
+cmake -B build . -DOLLAMA_LLAMA_BACKENDS=cuda_v13 -DCMAKE_CUDA_ARCHITECTURES=native
+
+# ROCm / HIP
+cmake -B build . -DOLLAMA_LLAMA_BACKENDS=rocm_v7_2 -DCMAKE_HIP_ARCHITECTURES=gfx1100
+```
+
+You can tune GGML build options by setting `GGML_*` values during configure. For example, to build CUDA v12 for Pascal without flash attention kernels:
+
+```shell
+cmake -B build . -DOLLAMA_LLAMA_BACKENDS=cuda_v12 -DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FA=OFF
+```

 ## macOS (Apple Silicon)

-macOS Apple Silicon supports Metal which is built-in to the Ollama binary. No additional steps are required.
+Additional prerequisites:

-## macOS (Intel)
-
-Install prerequisites:
-
- [CMake](https://cmake.org/download/) or `brew install cmake`
-
-Then, configure and build the project:
+MLX Metal requires the Metal toolchain. Install [Xcode](https://developer.apple.com/xcode/) first, then:

 ```shell
-cmake -B build
-cmake --build build
-```
-
-Lastly, run Ollama:
-
-```shell
-go run . serve
+xcodebuild -downloadComponent MetalToolchain
 ```

 ## Windows

-Install prerequisites:
+Additional prerequisites:

- [CMake](https://cmake.org/download/)
 - [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/) including the Native Desktop Workload
 - (Optional) AMD GPU support
    - [ROCm](https://rocm.docs.amd.com/en/latest/)
-    - [Ninja](https://github.com/ninja-build/ninja/releases)
 - (Optional) NVIDIA GPU support
-    - [CUDA SDK](https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=11&target_type=exe_network)
- (Optional) VULKAN GPU support
-    - [VULKAN SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
+    - [CUDA SDK](https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_type=exe_network)
+- (Optional) Vulkan GPU support
+    - [Vulkan SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
 - (Optional) MLX engine support
    - [CUDA 13+ SDK](https://developer.nvidia.com/cuda-downloads)
    - [cuDNN 9+](https://developer.nvidia.com/cudnn)

-Then, configure and build the project:
-
-```shell
-cmake -B build
-cmake --build build --config Release
-```
+For Ninja builds, run CMake from a Developer PowerShell/Command Prompt or another shell where the Visual Studio compiler is available.

 > Building for Vulkan requires VULKAN_SDK environment variable:
 > 
@@ -73,36 +95,20 @@ cmake --build build --config Release
 > set VULKAN_SDK=C:\VulkanSDK\<version>
 > ```

-> [!IMPORTANT]
-> Building for ROCm requires additional flags:
-> ```
-> cmake -B build -G Ninja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
-> cmake --build build --config Release
-> ```
-
-
-
-Lastly, run Ollama:
-
-```shell
-go run . serve
-```
-
 ## Windows (ARM)

-Windows ARM does not support additional acceleration libraries at this time.  Do not use cmake, simply `go run` or `go build`.
+Windows ARM does not support additional acceleration libraries at this time.

 ## Linux

-Install prerequisites:
+Additional prerequisites:

- [CMake](https://cmake.org/download/) or `sudo apt install cmake` or `sudo dnf install cmake`
 - (Optional) AMD GPU support
    - [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
 - (Optional) NVIDIA GPU support
    - [CUDA SDK](https://developer.nvidia.com/cuda-downloads)
- (Optional) VULKAN GPU support
-    - [VULKAN SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
+- (Optional) Vulkan GPU support
+    - [Vulkan SDK](https://vulkan.lunarg.com/sdk/home) - useful for AMD/Intel GPUs
    - Or install via package manager: `sudo apt install vulkan-sdk` (Ubuntu/Debian) or `sudo dnf install vulkan-sdk` (Fedora/CentOS)
 - (Optional) MLX engine support
    - [CUDA 13+ SDK](https://developer.nvidia.com/cuda-downloads)
@@ -111,57 +117,17 @@ Install prerequisites:
 > [!IMPORTANT]
 > Ensure prerequisites are in `PATH` before running CMake.

-
-Then, configure and build the project:
-
-```shell
-cmake -B build
-cmake --build build
-```
-
-Lastly, run Ollama:
-
-```shell
-go run . serve
-```
-
 ## MLX Engine (Optional)

-The MLX engine enables running safetensor based models. It requires building the [MLX](https://github.com/ml-explore/mlx) and [MLX-C](https://github.com/ml-explore/mlx-c) shared libraries separately via CMake.  On MacOS, MLX leverages the Metal library to run on the GPU, and on Windows and Linux, runs on NVIDIA GPUs via CUDA v13.
+The MLX engine enables running safetensor based models. On macOS arm64, MLX is enabled by default. On other platforms, MLX backends are selected with `OLLAMA_MLX_BACKENDS`.

-### macOS (Apple Silicon)
-
-Requires the Metal toolchain. Install [Xcode](https://developer.apple.com/xcode/) first, then:
-
-```shell
-xcodebuild -downloadComponent MetalToolchain
-```
-
-Verify it's installed correctly (should print "no input files"):
-
-```shell
-xcrun metal
-```
-
-Then build:
-
-```shell
-cmake -B build --preset MLX
-cmake --build build --preset MLX --parallel
-cmake --install build --component MLX
-```
-
-> [!NOTE]
-> Without the Metal toolchain, cmake will silently complete with Metal disabled. Check the cmake output for `Setting MLX_BUILD_METAL=OFF` which indicates the toolchain is missing.
-
-### Windows / Linux (CUDA)
+### CUDA

 Requires CUDA 13+ and [cuDNN](https://developer.nvidia.com/cudnn) 9+.

 ```shell
-cmake -B build --preset "MLX CUDA 13"
-cmake --build build --target mlx --target mlxc --config Release --parallel
-cmake --install build --component MLX --strip
+cmake -B build . -DOLLAMA_MLX_BACKENDS=cuda_v13
+cmake --build build --parallel 8
 ```

 ### Local MLX source overrides
@@ -173,17 +139,20 @@ export OLLAMA_MLX_SOURCE=/path/to/mlx
 export OLLAMA_MLX_C_SOURCE=/path/to/mlx-c
 ```

-For example, using the helper scripts with local mlx and mlx-c repos:
-```shell
-OLLAMA_MLX_SOURCE=../mlx OLLAMA_MLX_C_SOURCE=../mlx-c ./scripts/build_linux.sh
+On macOS arm64:

-OLLAMA_MLX_SOURCE=../mlx OLLAMA_MLX_C_SOURCE=../mlx-c ./scripts/build_darwin.sh
+```shell
+OLLAMA_MLX_SOURCE=../mlx OLLAMA_MLX_C_SOURCE=../mlx-c cmake -B build .
+cmake --build build --parallel 8
 ```

+For CUDA:
+
 ```powershell
 $env:OLLAMA_MLX_SOURCE="../mlx"
 $env:OLLAMA_MLX_C_SOURCE="../mlx-c"
-./scripts/build_darwin.ps1
+cmake -B build . -DOLLAMA_MLX_BACKENDS=cuda_v13
+cmake --build build --parallel 8
 ```

 ## Docker
@@ -208,11 +177,11 @@ go test ./...

 ## Library detection

-Ollama looks for acceleration libraries in the following paths relative to the `ollama` executable:
+Ollama looks for native helper binaries and acceleration libraries in installed and local development layouts:

-* `./lib/ollama` (Windows)
-* `../lib/ollama` (Linux)
-* `.` (macOS)
-* `build/lib/ollama` (for development)
+* `../lib/ollama` for standard installs where `ollama` is under `bin/`
+* `./lib/ollama` for Windows release-style payloads and local dist output
+* `.` for macOS release artifacts that colocate helpers with `ollama`
+* `build/lib/ollama` and `dist/<platform>/lib/ollama` for local development builds

 If the libraries are not found, Ollama will not run with any acceleration libraries.
--- a/docs/docker.mdx
+++ b/docs/docker.mdx
@@ -70,12 +70,16 @@ docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 114

 ## Vulkan Support

-Vulkan is bundled into the `ollama/ollama` image.  
+Vulkan is bundled into the `ollama/ollama` image and is enabled by default when
+the container can access the GPU devices.

 ```shell
-docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_VULKAN=1 --name ollama ollama/ollama
+docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
 ```

+Use `OLLAMA_VULKAN=0` to disable Vulkan, or `GGML_VK_VISIBLE_DEVICES=<ids>` to
+select specific Vulkan devices.
+

 ## Run model locally

@@ -88,4 +92,3 @@ docker exec -it ollama ollama run llama3.2
 ## Try different models

 More models can be found on the [Ollama library](https://ollama.com/library).
-
--- a/docs/gpu.mdx
+++ b/docs/gpu.mdx
@@ -4,6 +4,7 @@ title: Hardware support

 ## Nvidia
 Ollama supports Nvidia GPUs with compute capability 5.0+ and driver version 531 and newer.
+Nvidia GPUs with compute capability 5.0 through 6.2 require driver version 570 or newer.

 Check your compute compatibility to see if your card is supported:
 [https://developer.nvidia.com/cuda-gpus](https://developer.nvidia.com/cuda-gpus)
@@ -75,7 +76,7 @@ using the `amdgpu-install` utility from

 ### Windows Support

-With ROCm v6.1, the following GPUs are supported on Windows.
+Ollama requires an AMD ROCm v7 / HIP7-capable driver stack on Windows.

 | Family         | Cards and accelerators                                                                                               |
 | -------------- | -------------------------------------------------------------------------------------------------------------------- |
@@ -142,12 +143,9 @@ Ollama supports GPU acceleration on Apple devices via the Metal API.

 ## Vulkan GPU Support

-> **NOTE:**
-> Vulkan is currently an Experimental feature.  To enable, you must set OLLAMA_VULKAN=1 for the Ollama server as
-described in the [FAQ](faq#how-do-i-configure-ollama-server)
-
 Additional GPU support on Windows and Linux is provided via
-[Vulkan](https://www.vulkan.org/). On Windows most GPU vendors drivers come
+[Vulkan](https://www.vulkan.org/). Vulkan is enabled by default when the
+backend is installed. On Windows, most GPU vendors' drivers come
 bundled with Vulkan support and require no additional setup steps. Most Linux
 distributions require installing additional components, and you may have
 multiple options for Vulkan drivers between Mesa and GPU Vendor specific packages
@@ -173,4 +171,9 @@ To select specific Vulkan GPU(s), you can set the environment variable
 `GGML_VK_VISIBLE_DEVICES` to one or more numeric IDs on the Ollama server as
 described in the [FAQ](faq#how-do-i-configure-ollama-server). If you
 encounter any problems with Vulkan based GPUs, you can disable all Vulkan GPUs
-by setting `GGML_VK_VISIBLE_DEVICES=-1` 
+by setting `OLLAMA_VULKAN=0` or `GGML_VK_VISIBLE_DEVICES=-1`.
+
+On mixed iGPU/dGPU systems where the Vulkan iGPU is unstable, keep Vulkan
+enabled and set `GGML_VK_VISIBLE_DEVICES` to the discrete GPU index. For
+example, use `GGML_VK_VISIBLE_DEVICES=1` when `Vulkan1` is the discrete
+GPU.
--- a/docs/modelfile.mdx
+++ b/docs/modelfile.mdx
@@ -157,6 +157,7 @@ PARAMETER <parameter> <parametervalue>
 | seed           | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0)                                                                                                                                                                                                               | int        | seed 42              |
 | stop           | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile.                                                                                                                                                              | string     | stop "AI assistant:" |
 | num_predict    | Maximum number of tokens to predict when generating text. (Default: -1, infinite generation)                                                                                                                                                                                                                                                                                    | int        | num_predict 42       |
+| draft_num_predict | Maximum number of speculative draft tokens to predict per step when a draft model is available. Separate draft models default to 4; embedded MTP tensors require setting this parameter. Set to 0 to disable speculative drafting.                                                                                                                                               | int        | draft_num_predict 4  |
 | top_k          | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)                                                                                                                                                                                                | int        | top_k 40             |
 | top_p          | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)                                                                                                                                                                                         | float      | top_p 0.9            |
 | min_p          | Alternative to the top*p, and aims to ensure a balance of quality and variety. The parameter \_p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with _p_=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. (Default: 0.0) | float      | min_p 0.05           |
--- a/docs/windows.mdx
+++ b/docs/windows.mdx
@@ -12,10 +12,18 @@ terminal application. As usual the Ollama [API](/api) will be served on

 - Windows 10 22H2 or newer, Home or Pro
 - NVIDIA 452.39 or newer Drivers if you have an NVIDIA card
- AMD Radeon Driver https://www.amd.com/en/support if you have a Radeon card
+- AMD ROCm v7 / HIP7-capable driver stack for ROCm acceleration, or a Vulkan-capable AMD Radeon driver for Vulkan acceleration

 Ollama uses unicode characters for progress indication, which may render as unknown squares in some older terminal fonts in Windows 10. If you see this, try changing your terminal font settings.

+<Note>
+  Some RDNA2 / Radeon RX 6000 systems, including RX 6800-class cards, may not
+  expose ROCm v7 on current Windows AMD drivers. Vulkan is enabled by default
+  and is the recommended fallback for those systems. If a mixed iGPU/dGPU
+  system selects an unstable Vulkan iGPU, set `GGML_VK_VISIBLE_DEVICES` to the
+  discrete GPU index.
+</Note>
+
 ## Filesystem Requirements

 The Ollama install does not require Administrator, and installs in your home directory by default. You'll need at least 4GB of space for the binary install. Once you've installed Ollama, you'll need additional space for storing the Large Language models, which can be tens to hundreds of GB in size. If your home directory doesn't have enough space, you can change where the binaries are installed, and where the models are stored.
--- a/envconfig/config.go
+++ b/envconfig/config.go
@@ -214,6 +214,8 @@ func LogLevel() slog.Level {
 var (
 	// FlashAttention enables the experimental flash attention feature.
 	FlashAttention = BoolWithDefault("OLLAMA_FLASH_ATTENTION")
+	// GoTemplate enables Modelfile TEMPLATE rendering when a model has one.
+	GoTemplate = BoolWithDefault("OLLAMA_GO_TEMPLATE")
 	// DebugLogRequests logs inference requests to disk for replay/debugging.
 	DebugLogRequests = Bool("OLLAMA_DEBUG_LOG_REQUESTS")
 	// KvCacheType is the quantization type for the K/V cache.
@@ -224,16 +226,14 @@ var (
 	NoPrune = Bool("OLLAMA_NOPRUNE")
 	// SchedSpread allows scheduling models across all GPUs.
 	SchedSpread = Bool("OLLAMA_SCHED_SPREAD")
-	// MultiUserCache optimizes prompt caching for multi-user scenarios
-	MultiUserCache = Bool("OLLAMA_MULTIUSER_CACHE")
-	// Enable the new Ollama engine
-	NewEngine = Bool("OLLAMA_NEW_ENGINE")
 	// ContextLength sets the default context length
 	ContextLength = Uint("OLLAMA_CONTEXT_LENGTH", 0)
 	// Auth enables authentication between the Ollama client and server
 	UseAuth = Bool("OLLAMA_AUTH")
-	// Enable Vulkan backend
-	EnableVulkan = Bool("OLLAMA_VULKAN")
+	// EnableVulkan controls Vulkan backend discovery.
+	EnableVulkan = BoolWithDefault("OLLAMA_VULKAN")
+	// EnableIntegratedGPU controls whether integrated GPUs may be selected.
+	EnableIntegratedGPU = BoolWithDefault("OLLAMA_IGPU_ENABLE")
 	// NoCloudEnv checks the OLLAMA_NO_CLOUD environment variable.
 	NoCloudEnv = Bool("OLLAMA_NO_CLOUD")
 )
@@ -312,9 +312,13 @@ func AsMap() map[string]EnvVar {
 	ret := map[string]EnvVar{
 		"OLLAMA_DEBUG":                {"OLLAMA_DEBUG", LogLevel(), "Show additional debug information (e.g. OLLAMA_DEBUG=1)"},
 		"OLLAMA_DEBUG_LOG_REQUESTS":   {"OLLAMA_DEBUG_LOG_REQUESTS", DebugLogRequests(), "Log inference request bodies and replay curl commands to a temp directory"},
+		"OLLAMA_GO_TEMPLATE":          {"OLLAMA_GO_TEMPLATE", GoTemplate(true), "Enable Modelfile TEMPLATE based rendering when available"},
 		"OLLAMA_FLASH_ATTENTION":      {"OLLAMA_FLASH_ATTENTION", FlashAttention(false), "Enabled flash attention"},
 		"OLLAMA_KV_CACHE_TYPE":        {"OLLAMA_KV_CACHE_TYPE", KvCacheType(), "Quantization type for the K/V cache (default: f16)"},
 		"OLLAMA_GPU_OVERHEAD":         {"OLLAMA_GPU_OVERHEAD", GpuOverhead(), "Reserve a portion of VRAM per GPU (bytes)"},
+		"OLLAMA_IGPU_ENABLE":          {"OLLAMA_IGPU_ENABLE", String("OLLAMA_IGPU_ENABLE")(), "Enable integrated GPUs"},
+		"LLAMA_ARG_FIT":               {"LLAMA_ARG_FIT", String("LLAMA_ARG_FIT")(), "Enable llama.cpp automatic fit of unset memory options (default \"on\")"},
+		"LLAMA_ARG_FIT_TARGET":        {"LLAMA_ARG_FIT_TARGET", String("LLAMA_ARG_FIT_TARGET")(), "Target free VRAM margin per device for llama.cpp fit (MiB)"},
 		"OLLAMA_HOST":                 {"OLLAMA_HOST", Host(), "IP Address for the ollama server (default 127.0.0.1:11434)"},
 		"OLLAMA_KEEP_ALIVE":           {"OLLAMA_KEEP_ALIVE", KeepAlive(), "The duration that models stay loaded in memory (default \"5m\")"},
 		"OLLAMA_LLM_LIBRARY":          {"OLLAMA_LLM_LIBRARY", LLMLibrary(), "Set LLM library to bypass autodetection"},
@@ -329,10 +333,8 @@ func AsMap() map[string]EnvVar {
 		"OLLAMA_NUM_PARALLEL":         {"OLLAMA_NUM_PARALLEL", NumParallel(), "Maximum number of parallel requests"},
 		"OLLAMA_ORIGINS":              {"OLLAMA_ORIGINS", AllowedOrigins(), "A comma separated list of allowed origins"},
 		"OLLAMA_SCHED_SPREAD":         {"OLLAMA_SCHED_SPREAD", SchedSpread(), "Always schedule model across all GPUs"},
-		"OLLAMA_MULTIUSER_CACHE":      {"OLLAMA_MULTIUSER_CACHE", MultiUserCache(), "Optimize prompt caching for multi-user scenarios"},
 		"OLLAMA_CONTEXT_LENGTH":       {"OLLAMA_CONTEXT_LENGTH", ContextLength(), "Context length to use unless otherwise specified (default: 4k/32k/256k based on VRAM)"},
 		"OLLAMA_EDITOR":               {"OLLAMA_EDITOR", Editor(), "Path to editor for interactive prompt editing (Ctrl+G)"},
-		"OLLAMA_NEW_ENGINE":           {"OLLAMA_NEW_ENGINE", NewEngine(), "Enable the new Ollama engine"},
 		"OLLAMA_REMOTES":              {"OLLAMA_REMOTES", Remotes(), "Allowed hosts for remote models (default \"ollama.com\")"},

 		// Informational
@@ -355,7 +357,7 @@ func AsMap() map[string]EnvVar {
 		ret["GGML_VK_VISIBLE_DEVICES"] = EnvVar{"GGML_VK_VISIBLE_DEVICES", VkVisibleDevices(), "Set which Vulkan devices are visible by numeric ID"}
 		ret["GPU_DEVICE_ORDINAL"] = EnvVar{"GPU_DEVICE_ORDINAL", GpuDeviceOrdinal(), "Set which AMD devices are visible by numeric ID"}
 		ret["HSA_OVERRIDE_GFX_VERSION"] = EnvVar{"HSA_OVERRIDE_GFX_VERSION", HsaOverrideGfxVersion(), "Override the gfx used for all detected AMD GPUs"}
-		ret["OLLAMA_VULKAN"] = EnvVar{"OLLAMA_VULKAN", EnableVulkan(), "Enable experimental Vulkan support"}
+		ret["OLLAMA_VULKAN"] = EnvVar{"OLLAMA_VULKAN", EnableVulkan(true), "Enable Vulkan support"}
 	}

 	return ret
--- a/fs/ggml/ggml.go
+++ b/fs/ggml/ggml.go
@ -424,8 +424,12 @@ func (t TensorType) BlockSize() uint64 {
 		TensorTypeQ8_0,
 		TensorTypeQ8_1,
 		tensorTypeIQ4_NL,
-		4, TensorTypeMXFP4:
+		TensorTypeMXFP4:
 		return 32
+	case TensorTypeNVFP4:
+		return 64
+	case TensorTypeQ1_0:
+		return 128
 	default:
 		return 256
 	}
@ -497,8 +501,12 @@ func (t TensorType) TypeSize() uint64 {
 		return blockSize/8 + blockSize/16 + blockSize/32
 	case TensorTypeBF16:
 		return 2
-	case 4, TensorTypeMXFP4:
+	case TensorTypeMXFP4:
 		return 1 + blockSize/2
+	case TensorTypeNVFP4:
+		return 4 + blockSize/2
+	case TensorTypeQ1_0:
+		return 2 + blockSize/8
 	default:
 		return 0
 	}
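Reviewer note: the new `BlockSize`/`TypeSize` cases above can be sanity-checked with a little arithmetic. The sketch below is a minimal, standalone illustration and not code from this change: `quantLayout`, `layouts`, `tensorBytes`, and `bitsPerWeight` are hypothetical names, and only the block sizes and byte formulas are copied from the hunks. Reading the constant term in each formula as per-block overhead is an assumption.

```go
package main

import "fmt"

// quantLayout is a hypothetical helper type; it only restates the values
// returned by BlockSize and TypeSize in the diff above.
type quantLayout struct {
	name      string
	blockSize uint64 // elements per block
	typeSize  uint64 // bytes per block
}

// layouts copies the three formulas touched by the hunks:
// MXFP4 = 1 + blockSize/2, NVFP4 = 4 + blockSize/2, Q1_0 = 2 + blockSize/8.
func layouts() []quantLayout {
	return []quantLayout{
		{"MXFP4", 32, 1 + 32/2},
		{"NVFP4", 64, 4 + 64/2},
		{"Q1_0", 128, 2 + 128/8},
	}
}

// tensorBytes returns the storage needed for n elements, assuming n is a
// multiple of the block size.
func tensorBytes(l quantLayout, n uint64) uint64 {
	return n / l.blockSize * l.typeSize
}

// bitsPerWeight is the effective storage cost per element, including the
// constant per-block term.
func bitsPerWeight(l quantLayout) float64 {
	return float64(l.typeSize*8) / float64(l.blockSize)
}

func main() {
	for _, l := range layouts() {
		fmt.Printf("%-6s block=%-3d bytes/block=%-2d bits/weight=%.3f bytes(4096)=%d\n",
			l.name, l.blockSize, l.typeSize, bitsPerWeight(l), tensorBytes(l, 4096))
	}
}
```

Under these formulas a block costs 17, 36, and 18 bytes respectively, which works out to 4.25, 4.5, and 1.125 bits per weight.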
--- a/fs/ggml/type.go
+++ b/fs/ggml/type.go
@ -14,9 +14,9 @@ const (
 	FileTypeF16
 	fileTypeQ4_0
 	fileTypeQ4_1
-	fileTypeMXFP4 // originally fileTypeQ4_1_F16 // unused by GGML
-	fileTypeQ4_2  // unused by GGML
-	fileTypeQ4_3  // unused by GGML
+	fileTypeQ4_1_F16 // removed from GGUF files
+	fileTypeQ4_2     // removed from GGUF files
+	fileTypeQ4_3     // removed from GGUF files
 	FileTypeQ8_0
 	fileTypeQ5_0
 	fileTypeQ5_1
@ -48,6 +48,9 @@ const (
 	fileTypeQ4_0_8_8 // unused by GGML
 	fileTypeTQ1_0
 	fileTypeTQ2_0
+	fileTypeMXFP4_MOE
+	fileTypeNVFP4
+	fileTypeQ1_0

 	FileTypeUnknown = 1024
 )
@ -97,8 +100,6 @@ func (t FileType) String() string {
 		return "Q4_0"
 	case fileTypeQ4_1:
 		return "Q4_1"
-	case fileTypeMXFP4:
-		return "MXFP4"
 	case FileTypeQ8_0:
 		return "Q8_0"
 	case fileTypeQ5_0:
@ -123,10 +124,44 @@ func (t FileType) String() string {
 		return "Q5_K_M"
 	case fileTypeQ6_K:
 		return "Q6_K"
+	case fileTypeIQ2_XXS:
+		return "IQ2_XXS"
+	case fileTypeIQ2_XS:
+		return "IQ2_XS"
 	case fileTypeQ2_K_S:
 		return "Q2_K_S"
+	case fileTypeIQ3_XS:
+		return "IQ3_XS"
+	case fileTypeIQ3_XXS:
+		return "IQ3_XXS"
+	case fileTypeIQ1_S:
+		return "IQ1_S"
+	case fileTypeIQ4_NL:
+		return "IQ4_NL"
+	case fileTypeIQ3_S:
+		return "IQ3_S"
+	case fileTypeIQ3_M:
+		return "IQ3_M"
+	case fileTypeIQ2_S:
+		return "IQ2_S"
+	case fileTypeIQ2_M:
+		return "IQ2_M"
+	case fileTypeIQ4_XS:
+		return "IQ4_XS"
+	case fileTypeIQ1_M:
+		return "IQ1_M"
 	case FileTypeBF16:
 		return "BF16"
+	case fileTypeTQ1_0:
+		return "TQ1_0"
+	case fileTypeTQ2_0:
+		return "TQ2_0"
+	case fileTypeMXFP4_MOE:
+		return "MXFP4_MOE"
+	case fileTypeNVFP4:
+		return "NVFP4"
+	case fileTypeQ1_0:
+		return "Q1_0"
 	default:
 		return "unknown"
 	}
@ -170,12 +205,40 @@ func (ftype FileType) ToTensorType() TensorType {
 		return TensorTypeQ5_K
 	case fileTypeQ6_K:
 		return TensorTypeQ6_K
+	case fileTypeIQ2_XXS:
+		return tensorTypeIQ2_XXS
+	case fileTypeIQ2_XS:
+		return tensorTypeIQ2_XS
 	case fileTypeQ2_K_S:
 		return TensorTypeQ2_K
+	case fileTypeIQ3_XS:
+		return tensorTypeIQ3_S
+	case fileTypeIQ3_XXS:
+		return tensorTypeIQ3_XXS
+	case fileTypeIQ1_S:
+		return tensorTypeIQ1_S
+	case fileTypeIQ4_NL:
+		return tensorTypeIQ4_NL
+	case fileTypeIQ3_S, fileTypeIQ3_M:
+		return tensorTypeIQ3_S
+	case fileTypeIQ2_S, fileTypeIQ2_M:
+		return tensorTypeIQ2_S
+	case fileTypeIQ4_XS:
+		return tensorTypeIQ4_XS
+	case fileTypeIQ1_M:
+		return tensorTypeIQ1_M
 	case FileTypeBF16:
 		return TensorTypeBF16
-	case fileTypeMXFP4:
+	case fileTypeTQ1_0:
+		return tensorTypeTQ1_0
+	case fileTypeTQ2_0:
+		return tensorTypeTQ2_0
+	case fileTypeMXFP4_MOE:
 		return TensorTypeMXFP4
+	case fileTypeNVFP4:
+		return TensorTypeNVFP4
+	case fileTypeQ1_0:
+		return TensorTypeQ1_0
 	default:
 		slog.Warn("unsupported file type", "type", ftype)
 		return 0 // F32
@ -227,6 +290,8 @@ const (
 	tensorTypeIQ4_NL_4_8 // unused by GGML
 	tensorTypeIQ4_NL_8_8 // unused by GGML
 	TensorTypeMXFP4
+	TensorTypeNVFP4
+	TensorTypeQ1_0
 )

 // ParseTensorType parses the provided GGUF tensor type
@ -315,12 +380,46 @@ func (t TensorType) String() string {
 		return "Q6_K"
 	case TensorTypeQ8_K:
 		return "Q8_K"
+	case tensorTypeIQ2_XXS:
+		return "IQ2_XXS"
+	case tensorTypeIQ2_XS:
+		return "IQ2_XS"
+	case tensorTypeIQ3_XXS:
+		return "IQ3_XXS"
+	case tensorTypeIQ1_S:
+		return "IQ1_S"
+	case tensorTypeIQ4_NL:
+		return "IQ4_NL"
+	case tensorTypeIQ3_S:
+		return "IQ3_S"
+	case tensorTypeIQ2_S:
+		return "IQ2_S"
+	case tensorTypeIQ4_XS:
+		return "IQ4_XS"
+	case TensorTypeI8:
+		return "I8"
+	case TensorTypeI16:
+		return "I16"
+	case TensorTypeI32:
+		return "I32"
+	case TensorTypeI64:
+		return "I64"
 	case TensorTypeF64:
 		return "F64"
+	case tensorTypeIQ1_M:
+		return "IQ1_M"
 	case TensorTypeBF16:
 		return "BF16"
-	case 4, TensorTypeMXFP4:
+	case tensorTypeTQ1_0:
+		return "TQ1_0"
+	case tensorTypeTQ2_0:
+		return "TQ2_0"
+	case TensorTypeMXFP4:
 		return "MXFP4"
+	case TensorTypeNVFP4:
+		return "NVFP4"
+	case TensorTypeQ1_0:
+		return "Q1_0"
 	default:
 		return "unknown"
 	}
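Reviewer note: the `type.go` hunks keep `fileTypeQ4_1_F16`, `fileTypeQ4_2`, and `fileTypeQ4_3` as placeholders even though they no longer have a string form. A minimal sketch of why that matters, assuming the constants are declared with `iota` as the surrounding block suggests: dropping a placeholder would shift every later value. The `fileType` type and `ft*` names below are hypothetical stand-ins for the real declarations.

```go
package main

import "fmt"

// fileType is a hypothetical stand-in for ggml.FileType.
type fileType uint32

const (
	ftF32      fileType = iota // 0
	ftF16                      // 1
	ftQ4_0                     // 2
	ftQ4_1                     // 3
	ftQ4_1_F16                 // 4, placeholder: no String() case
	ftQ4_2                     // 5, placeholder: no String() case
	ftQ4_3                     // 6, placeholder: no String() case
	ftQ8_0                     // 7, only correct because 4..6 are still declared
)

// String names the live file types; placeholders fall through to "unknown",
// which is the behavior TestRemovedFileTypesAreUnknown expects in the diff.
func (t fileType) String() string {
	switch t {
	case ftF32:
		return "F32"
	case ftF16:
		return "F16"
	case ftQ4_0:
		return "Q4_0"
	case ftQ4_1:
		return "Q4_1"
	case ftQ8_0:
		return "Q8_0"
	default:
		return "unknown"
	}
}

func main() {
	for t := ftF32; t <= ftQ8_0; t++ {
		fmt.Printf("%d -> %s\n", t, t)
	}
}
```

The same reasoning applies to the new `fileTypeMXFP4_MOE`, `fileTypeNVFP4`, and `fileTypeQ1_0` entries, which the new test pins to 38, 39, and 40.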
--- a/fs/ggml/type_test.go
+++ b/fs/ggml/type_test.go
@ -0,0 +1,115 @@
+package ggml
+
+import "testing"
+
+func TestFileTypeStringMatchesLlamaFType(t *testing.T) {
+	tests := []struct {
+		ftype FileType
+		want  string
+	}{
+		{0, "F32"},
+		{1, "F16"},
+		{2, "Q4_0"},
+		{3, "Q4_1"},
+		{7, "Q8_0"},
+		{8, "Q5_0"},
+		{9, "Q5_1"},
+		{10, "Q2_K"},
+		{11, "Q3_K_S"},
+		{12, "Q3_K_M"},
+		{13, "Q3_K_L"},
+		{14, "Q4_K_S"},
+		{15, "Q4_K_M"},
+		{16, "Q5_K_S"},
+		{17, "Q5_K_M"},
+		{18, "Q6_K"},
+		{19, "IQ2_XXS"},
+		{20, "IQ2_XS"},
+		{21, "Q2_K_S"},
+		{22, "IQ3_XS"},
+		{23, "IQ3_XXS"},
+		{24, "IQ1_S"},
+		{25, "IQ4_NL"},
+		{26, "IQ3_S"},
+		{27, "IQ3_M"},
+		{28, "IQ2_S"},
+		{29, "IQ2_M"},
+		{30, "IQ4_XS"},
+		{31, "IQ1_M"},
+		{32, "BF16"},
+		{36, "TQ1_0"},
+		{37, "TQ2_0"},
+		{38, "MXFP4_MOE"},
+		{39, "NVFP4"},
+		{40, "Q1_0"},
+		{FileTypeUnknown, "unknown"},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.want, func(t *testing.T) {
+			if got := tt.ftype.String(); got != tt.want {
+				t.Fatalf("FileType(%d).String() = %q, want %q", tt.ftype, got, tt.want)
+			}
+		})
+	}
+}
+
+func TestRemovedFileTypesAreUnknown(t *testing.T) {
+	for _, ftype := range []FileType{4, 5, 6, 33, 34, 35} {
+		t.Run(ftype.String(), func(t *testing.T) {
+			if got := ftype.String(); got != "unknown" {
+				t.Fatalf("FileType(%d).String() = %q, want unknown", ftype, got)
+			}
+		})
+	}
+}
+
+func TestTensorTypeStringMatchesGGMLType(t *testing.T) {
+	tests := []struct {
+		tt   TensorType
+		want string
+	}{
+		{0, "F32"},
+		{1, "F16"},
+		{2, "Q4_0"},
+		{3, "Q4_1"},
+		{6, "Q5_0"},
+		{7, "Q5_1"},
+		{8, "Q8_0"},
+		{9, "Q8_1"},
+		{10, "Q2_K"},
+		{11, "Q3_K"},
+		{12, "Q4_K"},
+		{13, "Q5_K"},
+		{14, "Q6_K"},
+		{15, "Q8_K"},
+		{16, "IQ2_XXS"},
+		{17, "IQ2_XS"},
+		{18, "IQ3_XXS"},
+		{19, "IQ1_S"},
+		{20, "IQ4_NL"},
+		{21, "IQ3_S"},
+		{22, "IQ2_S"},
+		{23, "IQ4_XS"},
+		{24, "I8"},
+		{25, "I16"},
+		{26, "I32"},
+		{27, "I64"},
+		{28, "F64"},
+		{29, "IQ1_M"},
+		{30, "BF16"},
+		{34, "TQ1_0"},
+		{35, "TQ2_0"},
+		{39, "MXFP4"},
+		{40, "NVFP4"},
+		{41, "Q1_0"},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.want, func(t *testing.T) {
+			if got := tt.tt.String(); got != tt.want {
+				t.Fatalf("TensorType(%d).String() = %q, want %q", tt.tt, got, tt.want)
+			}
+		})
+	}
+}
--- a/go.mod
+++ b/go.mod
@ -24,7 +24,7 @@ require (
 	github.com/charmbracelet/bubbletea v1.3.10
 	github.com/charmbracelet/lipgloss v1.1.0
 	github.com/d4l3k/go-bfloat16 v0.0.0-20211005043715-690c3bdd05f1
-	github.com/dlclark/regexp2 v1.11.4
+	github.com/dlclark/regexp2 v1.11.5
 	github.com/emirpasic/gods/v2 v2.0.0-alpha
 	github.com/klauspost/compress v1.18.3
 	github.com/mattn/go-runewidth v0.0.16
--- a/go.sum
+++ b/go.sum
@ -62,8 +62,8 @@ github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/dgryski/trifles v0.0.0-20200323201526-dd97f9abfb48 h1:fRzb/w+pyskVMQ+UbP35JkH8yB7MYb4q/qhBarqZE6g=
 github.com/dgryski/trifles v0.0.0-20200323201526-dd97f9abfb48/go.mod h1:if7Fbed8SFyPtHLHbg49SI7NAdJiC5WIA09pe59rfAA=
-github.com/dlclark/regexp2 v1.11.4 h1:rPYF9/LECdNymJufQKmri9gV604RvvABwgOA8un7yAo=
-github.com/dlclark/regexp2 v1.11.4/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
+github.com/dlclark/regexp2 v1.11.5 h1:Q/sSnsKerHeCkc/jSTNq1oCm7KiVgUMZRDUoRu0JQZQ=
+github.com/dlclark/regexp2 v1.11.5/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
 github.com/emirpasic/gods/v2 v2.0.0-alpha h1:dwFlh8pBg1VMOXWGipNMRt8v96dKAIvBehtCt6OtunU=
 github.com/emirpasic/gods/v2 v2.0.0-alpha/go.mod h1:W0y4M2dtBB9U5z3YlghmpuUhiaZT2h6yoeE+C1sCp6A=
 github.com/envoyproxy/go-control-plane v0.9.0/go.mod h1:YTl/9mNaCwkRvm6d1a2C3ymFceY/DCBVvsKhRF0iEA4=
--- a/harmony/harmonyparser.go
+++ b/harmony/harmonyparser.go
@ -461,6 +461,18 @@ func (h *HarmonyMessageHandler) HasThinkingSupport() bool {
 	return true
 }

+func (h *HarmonyMessageHandler) PreservedTokens() []string {
+	// <|call|> is an EOG marker for tool calls. Preserve structural tokens
+	// used by the parser, but let llama-server stop on the call terminator.
+	return []string{
+		"<|start|>",
+		"<|end|>",
+		"<|message|>",
+		"<|channel|>",
+		"<|constrain|>",
+	}
+}
+
 func (m *FunctionNameMap) ConvertAndAdd(userFunctionName string) string {
 	harmonyFunctionName := m.deriveName(userFunctionName)
 	// built-in functions should not be renamed
--- a/integration/README.md
+++ b/integration/README.md
@ -10,6 +10,10 @@ The integration tests have 2 modes of operating.
 1. By default, on Unix systems, they will start the server on a random port, run the tests, and then shutdown the server.  On Windows you must ALWAYS run the server on OLLAMA_HOST for the tests to work.
 2. If `OLLAMA_TEST_EXISTING` is set to a non-empty string, the tests will run against an existing running server, which can be remote based on your `OLLAMA_HOST` environment variable

+Set `OLLAMA_TEST_LOG_SERVER` to any non-empty value (for example `1`) to print
+the managed server log after each test run, even when the tests pass. This only
+applies when the integration test harness starts the server.
+
 > [!IMPORTANT]
 > Before running the tests locally without the "test existing" setting, compile ollama from the top of the source tree  `go build .` in addition to GPU support with cmake if applicable on your platform.  The integration tests expect to find an ollama binary at the top of the tree.

--- a/integration/audio_test.go
+++ b/integration/audio_test.go
@ -67,11 +67,11 @@ func TestAudioTranscription(t *testing.T) {
 				Messages: []api.Message{
 					{
 						Role:    "system",
-						Content: "Transcribe the audio exactly as spoken. Output only the transcription.",
+						Content: "Transcribe the audio exactly as spoken. Output only the spoken words. Do not answer any question in the audio.",
 					},
 					{
 						Role:    "user",
-						Content: "Transcribe this audio.",
+						Content: "What exact words are spoken in this audio?",
 						Images:  []api.ImageData{audio},
 					},
 				},
--- a/integration/imagegen_test.go
+++ b/integration/imagegen_test.go
@ -29,7 +29,7 @@ func TestImageGeneration(t *testing.T) {
 	testCases := []testCase{
 		{
 			imageGenModel: "jmorgan/z-image-turbo",
-			visionModel:   "llama3.2-vision",
+			visionModel:   "qwen2.5vl:3b",
 			prompt:        "A cartoon style llama flying like a superhero through the air with clouds in the background",
 			expectedWords: []string{"llama", "flying", "cartoon", "cloud", "sky", "superhero", "air", "animal", "camelid"},
 		},
--- a/integration/llm_image_test.go
+++ b/integration/llm_image_test.go
@ -17,7 +17,7 @@ func TestVisionModels(t *testing.T) {
 	defaultVisionModels := []string{
 		"gemma4",
 		"qwen2.5vl",
-		"llama3.2-vision",
+		// "llama3.2-vision", // TODO: re-enable when llama.cpp supports mllama.
 		"gemma3",
 		"qwen3-vl:8b",
 		"qwen3-vl:30b",
--- a/integration/model_arch_test.go
+++ b/integration/model_arch_test.go
@ -39,13 +39,8 @@ func TestModelsChat(t *testing.T) {
 		slog.Warn("No VRAM info available, testing all models, so larger ones might timeout...")
 	}

-	var chatModels []string
-	if s := os.Getenv("OLLAMA_NEW_ENGINE"); s != "" {
-		chatModels = append(ollamaEngineChatModels, mlxEngineChatModels...)
-	} else {
-		chatModels = append(ollamaEngineChatModels, llamaRunnerChatModels...)
-		chatModels = append(chatModels, mlxEngineChatModels...)
-	}
+	chatModels := append(ollamaEngineChatModels, llamaRunnerChatModels...)
+	chatModels = append(chatModels, mlxEngineChatModels...)

 	for _, model := range testModels(chatModels) {
 		t.Run(model, func(t *testing.T) {
--- a/integration/model_perf_test.go
+++ b/integration/model_perf_test.go
@ -28,7 +28,6 @@ var (
 		"falcon2:latest", // 2k model
 		"minicpm-v:latest",
 		"qwen:latest",
-		"solar-pro:latest",
 	}
 )

@ -40,11 +39,7 @@ var (
 // cat int.log | grep MODEL_PERF_HEADER | head -1| cut -f2- -d: > perf.csv
 // cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
 func TestModelsPerf(t *testing.T) {
-	if s := os.Getenv("OLLAMA_NEW_ENGINE"); s != "" {
-		doModelPerfTest(t, ollamaEngineChatModels)
-	} else {
-		doModelPerfTest(t, append(ollamaEngineChatModels, llamaRunnerChatModels...))
-	}
+	doModelPerfTest(t, append(ollamaEngineChatModels, llamaRunnerChatModels...))
 }

 func TestLibraryModelsPerf(t *testing.T) {
--- a/integration/utils_test.go
+++ b/integration/utils_test.go
@ -46,7 +46,7 @@ var (
 	// Note: add newer models at the top of the list to test them first
 	ollamaEngineChatModels = []string{
 		"nemotron3:33b",
-		"laguna-xs.2:q4_K_M",
+		// "laguna-xs.2:q4_K_M", // TODO: re-enable when llama.cpp supports laguna.
 		"gemma4",
 		"lfm2.5-thinking",
 		"ministral-3",
@ -55,7 +55,7 @@ var (
 		"gemma3n:e2b",
 		"mistral-small3.2:latest",
 		"deepseek-r1:1.5b",
-		"llama3.2-vision:latest",
+		// "llama3.2-vision:latest", // TODO: re-enable when llama.cpp supports mllama.
 		"qwen2.5-coder:latest",
 		"qwen2.5vl:3b",
 		"qwen3:0.6b", // dense
@ -74,8 +74,8 @@ var (
 	// failure into a test skip.
 	mlxEngineChatModels = []string{
 		"laguna-xs.2:nvfp4",
-		"qwen3.5:2b-nvfp4",  // ~2.5GB, Qwen3_5 arch
-		"gemma4:e2b-nvfp4",  // ~7.1GB, Gemma4 arch (skipped under low VRAM)
+		"qwen3.5:2b-nvfp4", // ~2.5GB, Qwen3_5 arch
+		"gemma4:e2b-nvfp4", // ~7.1GB, Gemma4 arch (skipped under low VRAM)
 	}
 	llamaRunnerChatModels = []string{
 		"mistral:latest",
@ -84,7 +84,6 @@ var (
 		"command-r:latest",
 		"nemotron-mini:latest",
 		"phi3.5:latest",
-		"solar-pro:latest",
 		"internlm2:latest",
 		"codellama:latest", // arch=llama
 		"phi3:latest",
@ -166,7 +165,7 @@ var (
 		"llama3-gradient",
 		"llama3-groq-tool-use",
 		"llama3.1",
-		"llama3.2-vision",
+		// "llama3.2-vision", // TODO: re-enable when llama.cpp supports mllama.
 		"llama3.2",
 		"llama3.3",
 		"llama3",
@ -236,7 +235,6 @@ var (
 		"smallthinker",
 		"smollm",
 		"smollm2",
-		"solar-pro",
 		"solar",
 		"sqlcoder",
 		"stable-beluga",
@ -278,7 +276,7 @@ var (
 	}
 	libraryToolsModels = []string{
 		"nemotron3:33b",
-		"laguna-xs.2",
+		// "laguna-xs.2", // TODO: re-enable when llama.cpp supports laguna.
 		"gemma4",
 		"lfm2.5-thinking",
 		"qwen3-vl",
@ -555,7 +553,7 @@ func InitServerConnection(ctx context.Context, t *testing.T) (*api.Client, strin
 			<-serverDone
 			slog.Info("terminate complete")

-			if t.Failed() {
+			if t.Failed() || os.Getenv("OLLAMA_TEST_LOG_SERVER") != "" {
 				slog.Warn("SERVER LOG FOLLOWS")
 				io.Copy(os.Stderr, &serverLog)
 				slog.Warn("END OF SERVER")
--- a/integration/vision_test.go
+++ b/integration/vision_test.go
@ -19,7 +19,7 @@ var defaultVisionModels = []string{
 	"nemotron3:33b",
 	"gemma4",
 	"gemma3",
-	"llama3.2-vision",
+	// "llama3.2-vision", // TODO: re-enable when llama.cpp supports mllama.
 	"qwen2.5vl",
 	"qwen3-vl:8b",
 }
@ -116,7 +116,7 @@ func TestVisionMultiTurn(t *testing.T) {
 						Images:  []api.ImageData{abbeyRoad},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
@ -182,7 +182,7 @@ func TestVisionObjectCounting(t *testing.T) {
 						Images:  []api.ImageData{docs},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
@ -225,7 +225,7 @@ func TestVisionSceneUnderstanding(t *testing.T) {
 						Images:  []api.ImageData{abbeyRoad},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
@ -263,7 +263,7 @@ func TestVisionSpatialReasoning(t *testing.T) {
 						Images:  []api.ImageData{docs},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
@ -299,7 +299,7 @@ func TestVisionDetailRecognition(t *testing.T) {
 						Images:  []api.ImageData{docs},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
@ -344,7 +344,7 @@ func TestVisionMultiImage(t *testing.T) {
 						Images:  []api.ImageData{abbeyRoad, docs},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
@ -383,7 +383,7 @@ func TestVisionImageDescription(t *testing.T) {
 						Images:  []api.ImageData{ollamaHome},
 					},
 				},
-				Stream: &stream,
+				Stream:    &stream,
 				KeepAlive: &api.Duration{Duration: 10 * time.Second},
 				Options:   map[string]any{"temperature": 0.0, "seed": 42},
 			}
--- a/llama/README.md
+++ b/llama/README.md
@ -1,55 +0,0 @@
-# `llama`
-
-This package provides Go bindings to [llama.cpp](https://github.com/ggerganov/llama.cpp).
-
-## Vendoring
-
-Ollama vendors [llama.cpp](https://github.com/ggerganov/llama.cpp/) and [ggml](https://github.com/ggerganov/llama.cpp/tree/master/ggml/src). While we generally strive to contribute changes back upstream to avoid drift, we carry a small set of patches which are applied to the tracking commit.
-
-If you update the vendoring code, start by running the following command to establish the tracking llama.cpp repo in the `./vendor/` directory.
-
-```shell
-make -f Makefile.sync apply-patches
-```
-
-### Updating Base Commit
-
-**Pin to new base commit**
-
-To change the base commit, update `FETCH_HEAD` in Makefile.sync.
-
-When updating to a newer base commit, the existing patches may not apply cleanly and require manual merge resolution.
-
-Start by applying the patches. If any of the patches have conflicts, the `git am` will stop at the first failure.
-
-```shell
-make -f Makefile.sync apply-patches
-```
-
-If there are conflicts, you will see an error message. Resolve the conflicts in `./vendor/`, and continue the patch series with `git am --continue` and rerun `make -f Makefile.sync apply-patches`. Repeat until all patches are successfully applied.
-
-Once all patches are applied, commit the changes to the tracking repository.
-
-```shell
-make -f Makefile.sync format-patches sync
-```
-
-### Generating Patches
-
-When working on new fixes or features that impact vendored code, use the following model. First get a clean tracking repo with all current patches applied:
-
-```shell
-make -f Makefile.sync clean apply-patches
-```
-
-Iterate until you're ready to submit PRs. Once your code is ready, commit a change in the `./vendor/` directory, then generate the patches for ollama with
-
-```shell
-make -f Makefile.sync format-patches
-```
-
-In your `./vendor/` directory, create a branch, and cherry-pick the new commit to that branch, then submit a PR upstream to llama.cpp.
-
-Commit the changes in the ollama repo and submit a PR to Ollama, which will include the vendored code update with your change, along with the patches.
-
-After your PR upstream is merged, follow the **Updating Base Commit** instructions above, however first remove your patch before running `apply-patches` since the new base commit contains your change already.
--- a/llama/build-info.cpp
+++ b/llama/build-info.cpp
@ -1,4 +0,0 @@
-int LLAMA_BUILD_NUMBER = 0;
-char const *LLAMA_COMMIT = "ec98e2002";
-char const *LLAMA_COMPILER = "";
-char const *LLAMA_BUILD_TARGET = "";
--- a/llama/build-info.cpp.in
+++ b/llama/build-info.cpp.in
@ -1,4 +0,0 @@
-int LLAMA_BUILD_NUMBER = 0;
-char const *LLAMA_COMMIT = "@FETCH_HEAD@";
-char const *LLAMA_COMPILER = "";
-char const *LLAMA_BUILD_TARGET = "";
--- a/llama/compat/README.md
+++ b/llama/compat/README.md
@ -0,0 +1,137 @@
+# llama.cpp compatibility layer
+
+This directory holds a temporary in-process compatibility layer for existing
+published Ollama GGUFs whose metadata or tensor layout does not yet match what
+llama.cpp expects directly. The layer translates those files in memory at load
+time so users do not need to re-pull or re-create models during the transition
+to llama-server.
+
+This patching approach is intended to be short-lived. The target end state is that
+published models and newly created models use llama.cpp-compatible metadata and
+tensor layouts on disk, and this directory can be removed.
+
+The layer is applied automatically at build time via CMake `FetchContent`'s
+`PATCH_COMMAND` for normal fetched builds. If CMake is pointed at a source
+override through `FETCHCONTENT_SOURCE_DIR_LLAMA_CPP`, the same patch is applied
+during configure. If `OLLAMA_LLAMA_CPP_SOURCE` is set, the patch is
+intentionally skipped so a developer can iterate on a local llama.cpp tree.
+
+## Files
+
+- `llama-ollama-compat.h`, `llama-ollama-compat.cpp` - the compatibility
+  entry points and per-architecture handlers.
+- `llama-ollama-compat-util.h`, `llama-ollama-compat-util.cpp` - helpers for
+  KV edits, tensor renames, skip-prefix tracking, tensor load operations, and
+  small tensor repacking primitives.
+- `llama-cpp-hooks.patch` - small additive call-site edits in llama.cpp files.
+  It currently touches `src/llama-model-loader.cpp` and `tools/mtmd/clip.cpp`.
+- `compat.cmake`, `apply-patch.cmake` - CMake glue and an idempotent patch
+  applier used by `llama/server/CMakeLists.txt`.
+
+The compatibility source files stay in this directory and are linked into the
+fetched llama.cpp targets. The patch file only adds call sites.
+
+## Load-Time Hooks
+
+The layer runs at a small set of loader hook points:
+
+1. Main model constructor: `translate_metadata` inspects the parsed metadata
+   and mutates the in-memory `gguf_context` and `ggml_context` when a handler
+   recognizes an existing published model format. It can also request mmap
+   disablement when a handler needs writable backend buffers for transformed
+   tensor data.
+2. Main model tensor indexing: `should_skip_tensor` hides embedded projector,
+   vision, audio, MTP, or other tensors that the text loader should not claim.
+3. Main model tensor reads: `maybe_load_text_tensor` applies registered
+   text-side load operations, such as FFN concat or dtype promotion, before
+   the normal llama.cpp file read. This is wired into both full model loading
+   and single-tensor reads used by tools such as `llama-quantize`.
+4. `mtmd/clip` constructor: `translate_clip_metadata` rewrites a clip-facing
+   view of monolithic GGUFs into the mmproj form expected by llama.cpp.
+5. `mtmd/clip` tensor load loop: `maybe_load_tensor` applies clip-side load
+   operations, such as F16 to F32 promotion, QKV merge, tensor repack, tensor
+   split, or zero-fill.
+
+Files that do not match a supported published-model marker are left unchanged.
+Setting `OLLAMA_LLAMA_CPP_COMPAT=0` disables the hook bodies for internal
+create-time validation and for models that are already known to be
+llama.cpp-compatible on disk.
+
+## Supported Transformations
+
+This table tracks the dispatch surface. Keep it brief; the handler comments in
+`llama-ollama-compat.cpp` are the source of truth for exact KV and tensor maps.
+
+| Internal arch / marker | Text handling | Clip/mmproj handling |
+|---|---|---|
+| `gemma3` | Normalizes Gemma 3 metadata, tokenizer fields, and embedded vision/projector tensors. | Gemma 3 projector translation. |
+| `gemma3` + embedding markers (`embeddinggemma`) | Maps to `gemma-embedding` metadata and fixes embedding dense/norm tensors. | n/a |
+| `bert` + Snowflake markers (`snowflake-arctic-embed2`) | Fixes Snowflake Arctic Embed 2 tokenizer metadata. | n/a |
+| `gemma3n` | Normalizes tokenizer/EOS metadata, truncates vocab-shaped tensors, and hides unused embedded vision/audio/projector tensors. | n/a |
+| `gemma4` | Normalizes tokenizer metadata and hides embedded audio/vision/projector tensors from the text loader. | Gemma 4 projector translation; audio remains disabled. |
+| `gptoss` | Maps to `gpt-oss`, copies KVs, injects missing expert FFN metadata, and renames tensors. | n/a |
+| `lfm2` | Renames norm tensors and fixes feed-forward metadata. | n/a |
+| `olmo3` | Maps to the OLMo2-compatible loader path. | n/a |
+| `mistral3` | Fixes RoPE/YaRN metadata and hides embedded vision/projector tensors. | Pixtral-style projector translation. |
+| `qwen35`, `qwen35moe` | Fixes Qwen3.5/Qwen3-VL-style text metadata, translates embedded MTP tensors, and hides embedded vision/projector tensors. | Qwen3-VL merger-style projector translation. |
+| `qwen3next` | Normalizes hybrid attention KV-head metadata and renames SSM dt tensors to the names expected by llama.cpp. | n/a |
+| `qwen25vl` | Maps to `qwen2vl` metadata conventions. | Qwen2.5-VL projector translation. |
+| `qwen3vl`, `qwen3vlmoe` | Adds missing Qwen3-VL metadata and hides embedded vision/projector tensors. | Qwen3-VL projector translation, including QKV merge and patch-embedding split/repack. |
+| `deepseekocr` | Maps to `deepseek2-ocr`, injects missing OCR/MoE metadata, and hides embedded SAM/vision/projector tensors. | DeepSeek OCR projector translation. |
+| `glmocr` | Maps GLM OCR metadata/tensors to the llama.cpp-compatible view. | GLM OCR projector translation. |
+| `glm4moelite` | Maps GLM-4.7 Flash MLA metadata to the `deepseek2` path and fixes special-token metadata. | n/a |
+| `nemotron_h_moe` | Fixes latent-FFN variants and hides MTP tensors. | n/a |
+| `nemotron_h_omni` | Selects the Nemotron text loader and hides audio/vision/projector tensors from the text loader. | Nemotron V2 VL projector translation; audio remains disabled. |
+| `llama` with Llama 3 markers | Fixes Llama 3 tokenizer metadata. | n/a |
+| `llama4` | Hides embedded vision/projector tensors from the text loader. | Llama 4 projector translation. |
+| `clip` projector without `clip.projector_type` | n/a | Defaults LLaVA/BakLLaVA projectors to `clip.projector_type=mlp`. |
+
+Usage:
+
+```sh
+llama-server --model /path/to/ollama-blob --mmproj /path/to/ollama-blob
+```
+
+Passing the same monolithic GGUF as both `--model` and `--mmproj` works because
+each loader applies its own translation.
+
+Additional architectures are added by implementing a `handle_<arch>()` and,
+for vision models, `handle_<arch>_clip()` in `llama-ollama-compat.cpp`, then
+dispatching them from `translate_metadata` / `translate_clip_metadata`. For
+monolithic vision models, also update the `compatClipArches` allowlist in
+`llm/llama_server.go` so Ollama passes the main GGUF as `--mmproj`.
+
+## Regenerating the Patch File
+
+After a llama.cpp bump moves the insertion points, re-apply the edits to a
+fresh checkout and run:
+
+```sh
+cd /path/to/llama.cpp
+git diff -- \
+    src/llama-model-loader.cpp \
+    tools/mtmd/clip.cpp \
+    > /path/to/ollama/llama/compat/llama-cpp-hooks.patch
+```
+
+## Implementation Notes
+
+The compatibility code is mostly written against public APIs (`gguf.h`,
+`ggml.h`, `ggml-backend.h`). A few operations rely on implementation details
+because the public API does not expose equivalent mutators:
+
+| Dependency | Use | Replacement if needed |
+|---|---|---|
+| Direct writes to `ggml_tensor::type` / `ne[]` / `nb[]` | Post-creation tensor reshape/retype for in-memory translation. | Add public tensor shape/type mutators. |
+| `const_cast<char *>(gguf_get_tensor_name(...))` in `rename_tensor` | Renames gguf tensors in place. | Add a public `gguf_rename_tensor` helper. |
+| `llama_model_loader` forward declaration from `src/llama-model-loader.h` | Opaque key for per-loader registries. The pointer is never dereferenced. | Replace registry keys with `const void *`. |
+
+Two helpers need extra context:
+
+- `reclaim_slot_as` repurposes an orphaned tensor slot as a synthesized tensor
+  when a clip handler splits one source tensor into multiple destination
+  tensors. This is needed because clip metadata loading allocates exactly enough
+  tensor slots for the source file.
+- Load-op registry overrides ignore the caller-provided `file_offset` when a
+  registered operation exists. The operations capture their own source offsets
+  at translation time, before renames change tensor names.
--- a/llama/compat/apply-patch.cmake
+++ b/llama/compat/apply-patch.cmake
@ -0,0 +1,50 @@
+# Idempotent patch applier used by compat.cmake.
+#
+# Invocation (from a CMake PATCH_COMMAND):
+#   cmake -DPATCH_FILE=<abs path> -P apply-patch.cmake
+#
+# The patch is applied in the current working directory (which ExternalProject
+# / FetchContent sets to the fetched source's SOURCE_DIR). If the patch is
+# already applied — detected via `git apply --reverse --check` — this script
+# is a no-op. This makes re-configuring and re-building the project safe.
+
+if(NOT DEFINED PATCH_FILE)
+    message(FATAL_ERROR "apply-patch.cmake: PATCH_FILE not set")
+endif()
+if(NOT EXISTS "${PATCH_FILE}")
+    message(FATAL_ERROR "apply-patch.cmake: PATCH_FILE does not exist: ${PATCH_FILE}")
+endif()
+
+find_package(Git QUIET REQUIRED)
+
+get_filename_component(_patch_workdir "." ABSOLUTE)
+get_filename_component(_git_ceiling "${_patch_workdir}" DIRECTORY)
+set(_git_apply_env GIT_CEILING_DIRECTORIES=${_git_ceiling})
+
+# If the patch can be REVERSED cleanly, it's already applied. Skip.
+execute_process(
+    COMMAND ${CMAKE_COMMAND} -E env ${_git_apply_env}
+        ${GIT_EXECUTABLE} apply --reverse --check "${PATCH_FILE}"
+    RESULT_VARIABLE _reverse_check
+    OUTPUT_QUIET ERROR_QUIET
+)
+if(_reverse_check EQUAL 0)
+    message(STATUS "llama/compat: patch already applied, skipping")
+    return()
+endif()
+
+# Otherwise, apply forward.
+execute_process(
+    COMMAND ${CMAKE_COMMAND} -E env ${_git_apply_env}
+        ${GIT_EXECUTABLE} apply --whitespace=nowarn "${PATCH_FILE}"
+    RESULT_VARIABLE _apply_result
+)
+if(NOT _apply_result EQUAL 0)
+    message(FATAL_ERROR
+        "llama/compat: failed to apply ${PATCH_FILE}\n"
+        "This usually means the pinned llama.cpp source has changed. "
+        "Regenerate the patch (see llama/compat/README.md) against the "
+        "pinned LLAMA_CPP_VERSION and retry.")
+endif()
+
+message(STATUS "llama/compat: applied patch")
--- a/llama/compat/compat.cmake
+++ b/llama/compat/compat.cmake
@ -0,0 +1,57 @@
+# llama.cpp compatibility shim — CMake integration
+#
+# Include this file BEFORE calling FetchContent_Declare(llama_cpp ...) to
+# patch the fetched llama.cpp with Ollama's in-process compatibility
+# layer. Example usage:
+#
+#     include(${CMAKE_CURRENT_SOURCE_DIR}/../compat/compat.cmake)
+#
+#     FetchContent_Declare(
+#         llama_cpp
+#         GIT_REPOSITORY ...
+#         GIT_TAG        ${LLAMA_CPP_GIT_TAG}
+#         GIT_SHALLOW    TRUE
+#         PATCH_COMMAND  ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND}
+#         UPDATE_DISCONNECTED TRUE
+#     )
+#
+# The compat layer consists of:
+#   1. Ollama-owned compat source files linked into the fetched llama.cpp
+#      targets from this directory.
+#   2. A small patch file that adds call-sites in llama.cpp loaders.
+
+set(_compat_dir ${CMAKE_CURRENT_LIST_DIR})
+
+# Expose a single variable the main CMakeLists passes into FetchContent's
+# PATCH_COMMAND. The patch is applied via a small CMake script so the step
+# is idempotent — re-configuring or rebuilding won't fail with "already
+# applied".
+#
+# The compat source files are NOT copied into the fetched tree.
+# Instead, llama/server/CMakeLists.txt does target_sources() on the llama
+# target after FetchContent_MakeAvailable. That keeps Ollama's code in
+# Ollama's tree and makes the patch pure call-site insertions.
+set(OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND
+    ${CMAKE_COMMAND}
+        -DPATCH_FILE=${_compat_dir}/llama-cpp-hooks.patch
+        -P ${_compat_dir}/apply-patch.cmake
+    CACHE INTERNAL "llama.cpp compat patch command for FetchContent")
+
+# Where the compat source files live, so the main CMakeLists can wire them
+# into the llama.cpp targets that need the hooks.
+set(OLLAMA_LLAMA_CPP_COMPAT_DIR
+    "${_compat_dir}"
+    CACHE INTERNAL "Directory holding llama.cpp compat sources")
+
+# Also export the individual paths in case callers want to do something
+# custom (e.g. emit a dependency on the patch so reconfigures re-apply).
+set(OLLAMA_LLAMA_CPP_COMPAT_PATCH_FILE
+    "${_compat_dir}/llama-cpp-hooks.patch"
+    CACHE INTERNAL "Path to the llama.cpp compat patch")
+
+set(OLLAMA_LLAMA_CPP_COMPAT_SOURCES
+    "${_compat_dir}/llama-ollama-compat.h"
+    "${_compat_dir}/llama-ollama-compat.cpp"
+    "${_compat_dir}/llama-ollama-compat-util.h"
+    "${_compat_dir}/llama-ollama-compat-util.cpp"
+    CACHE INTERNAL "Source files linked into llama.cpp targets")
--- a/llama/compat/llama-cpp-hooks.patch
+++ b/llama/compat/llama-cpp-hooks.patch
@@ -0,0 +1,89 @@
+diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
+index 4e65a45..a6e4fe2 100644
+--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
+@@ -4,6 +4,7 @@
+ #include "ggml.h"
+ #include "gguf.h"
+ #include "llama-hparams.h"
+#include "llama-ollama-compat.h"
+ 
+ #include <algorithm>
+ #include <array>
+@@ -549,6 +550,7 @@ llama_model_loader::llama_model_loader(
+         }
+ 
+         get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
+        if (llama_ollama_compat::translate_metadata(this, metadata, ctx, arch_name, fname.c_str())) use_mmap = false;
+         llm_kv = LLM_KV(llm_arch_from_string(arch_name));
+ 
+         files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
+@@ -573,6 +575,9 @@ llama_model_loader::llama_model_loader(
+         // so we build a unified tensors index for weights.
+         for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
+             std::string tensor_name = std::string(cur->name);
+            if (llama_ollama_compat::should_skip_tensor(this, tensor_name.c_str())) {
+                continue;
+            }
+             // make sure there is no duplicated tensor names
+             if (weights_map.find(tensor_name) != weights_map.end()) {
+                 throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
+@@ -683,6 +688,9 @@ llama_model_loader::llama_model_loader(
+         // Save tensors data offset info of the main file.
+         for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
+             std::string tensor_name = std::string(cur->name);
+            if (llama_ollama_compat::should_skip_tensor(this, tensor_name.c_str())) {
+                continue;
+            }
+             // make sure there is no duplicated tensor names
+             if (weights_map.find(tensor_name) != weights_map.end()) {
+                 throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
+@@ -1375,6 +1383,7 @@ void llama_model_loader::get_mapping_range(size_t * first, size_t * last, void *
+ 
+ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
+     const auto & w = require_weight(ggml_get_name(cur));
+    if (llama_ollama_compat::maybe_load_text_tensor(this, cur, w.offs)) return;
+ 
+     if (use_mmap) {
+         const auto & mapping = mappings.at(w.idx);
+@@ -1525,6 +1534,7 @@ bool llama_model_loader::load_all_data(
+         }
+ 
+         size_t n_size = ggml_nbytes(cur);
+        if (llama_ollama_compat::maybe_load_text_tensor(this, cur, weight->offs)) continue;
+ 
+         if (use_mmap) {
+             const auto & mapping = mappings.at(weight->idx);
+diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
+index 2e0cfa6..a0f2955 100644
+--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
+@@ -10,6 +10,8 @@
+ #include "ggml-backend.h"
+ #include "gguf.h"
+ 
+#include "llama-ollama-compat.h"
+
+ #include <algorithm>
+ #include <cassert>
+ #include <cmath>
+@@ -1009,6 +1011,11 @@ struct clip_model_loader {
+ 
+         ctx_meta.reset(meta);
+ 
+        // If this is an Ollama-format monolithic GGUF (text + embedded
+        // vision), translate its metadata and tensor names into the
+        // upstream mmproj shape so the rest of this loader runs unchanged.
+        llama_ollama_compat::translate_clip_metadata(ctx_gguf.get(), meta);
+
+         const int n_tensors = gguf_get_n_tensors(ctx_gguf.get());
+ 
+         // print gguf info
+@@ -2611,6 +2618,7 @@ struct clip_model_loader {
+                 auto it_off = tensor_offset.find(t->name);
+                 GGML_ASSERT(it_off != tensor_offset.end() && "no offset for tensor");
+                 const size_t offset = it_off->second;
+                if (llama_ollama_compat::maybe_load_tensor(cur, fname.c_str(), offset, buft)) continue;
+                 fin.seekg(offset, std::ios::beg);
+                 if (!fin) {
+                     throw std::runtime_error(string_format("%s: failed to seek for tensor %s\n", __func__, t->name));
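Review note: the call sites added by this patch share one convention. Each hook returns a boolean, and a true result makes the upstream code `continue` or `return` instead of running its default path. A minimal sketch of that pattern follows. Everything in the `compat_sketch` namespace is hypothetical, including the rule that names prefixed with `v.` are skipped; the real predicates live in llama-ollama-compat.cpp and are not shown in this diff.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

namespace compat_sketch {
// Hypothetical skip rule, for illustration only: treat names starting
// with "v." as tensors the text loader should ignore.
inline bool should_skip_tensor(const std::string & name) {
    return name.rfind("v.", 0) == 0;
}
}

// Mirrors the shape of the patched loop in llama_model_loader: consult the
// hook first, then fall through to the unchanged duplicate check.
std::map<std::string, int> build_index(const std::vector<std::string> & names) {
    std::map<std::string, int> index;
    int next = 0;
    for (const std::string & name : names) {
        if (compat_sketch::should_skip_tensor(name)) {
            continue; // hook claimed the tensor; the default path never sees it
        }
        if (index.find(name) != index.end()) {
            throw std::runtime_error("tensor '" + name + "' is duplicated");
        }
        index[name] = next++;
    }
    return index;
}
```

Because the hook runs ahead of the duplicate check in this sketch, skipped names never reach the index, and the surrounding loop body stays byte-for-byte identical to what it was before the hook was inserted.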