mirror of https://github.com/ollama/ollama.git synced 2026-07-03 03:38:52 +00:00


Llama

Updating llama.cpp

LLAMA_CPP_VERSION pins Ollama's llama.cpp source. An update can change more than compilation: it can affect model loading, GPU discovery, scheduler inputs, runtime logs, streaming, and compatibility patches. Validate the upstream diff, the patched source Ollama actually builds, and the affected local paths.

Workflow

Record the old ref from the base branch and choose an explicit new llama.cpp tag or commit. After updating LLAMA_CPP_VERSION, materialize the source through Ollama's normal build path:

cmake -S llama/server --preset cpu

This configure step fetches the pinned source and applies llama/compat/ patches. Confirm the resulting checkout, usually build/llama-server-cpu/_deps/llama_cpp-src, resolves to the intended new ref. Do not trust an old or dirty _deps/ checkout as validation. This is only a source and patch-application check; it is not runtime validation.
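The ref confirmation described above can be scripted. This is a minimal sketch, not part of Ollama's tooling: `sha_matches` and `check_checkout` are hypothetical helper names, and the check assumes the compat patches modify the working tree without creating commits, so that `HEAD` still points at the pinned upstream commit.

```shell
# Succeeds when the full commit SHA in $1 begins with the expected
# (possibly abbreviated) SHA in $2. An empty expectation never matches.
sha_matches() {
    actual=$1
    expected=$2
    [ -n "$expected" ] || return 1
    case "$actual" in
        "$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: compare the fetched checkout's HEAD with a commit SHA.
# The default path is the usual location named above; pass another if yours differs.
check_checkout() {
    want=$1
    src=${2:-build/llama-server-cpu/_deps/llama_cpp-src}
    head=$(git -C "$src" rev-parse HEAD) || return 1
    if sha_matches "$head" "$want"; then
        echo "checkout at $head"
    else
        echo "checkout at $head, expected $want" >&2
        return 1
    fi
}
```

If the new ref is a tag rather than a commit, resolve it first with `git rev-parse "<new-ref>^{commit}"` inside the checkout and pass the resulting SHA to `check_checkout`.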

Review the upstream diff using Git refs from the llama.cpp checkout:

git diff <old-ref> <new-ref> -- <path>
git show <new-ref>:<path>

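Avoid treating patched working-tree files as pristine upstream source: the configure step has already applied llama/compat/ patches to them, so read upstream content through Git refs as shown above.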

For build prerequisites, platform notes, and backend selection, see the developer guide.

What to review

Build option and dependency drift: changed GGML_* or LLAMA_* options, new find_package calls, generated assets, shader tools, or backend dependencies. Compare against llama/server/CMakeLists.txt, llama/server/CMakePresets.json, cmake/local.cmake, Dockerfiles, CI, and build scripts as needed.
Backend discovery contracts: GGML symbols used by discover/native_probe*.go, ggml_backend_dev_props, backend device type enums, backend registry loading, device ordering, visible-device filtering, and CUDA/ROCm/Vulkan/Metal runtime library behavior.
llama-server contracts: launch args and defaults, status and error payloads, memory/offload log lines, system_info:, flash-attention logging, --main-gpu, split-mode behavior, and scheduler-sensitive flags consumed by llm/llama_server.go or server/sched.go.
Streaming: any new SSE frame shape, heartbeat, keepalive ping, completion marker, or response cadence on paths Ollama parses directly.
Model and conversion surfaces: new architectures, tensor names, GGUF metadata, tokenizer behavior, speculative/MTP paths, sampler defaults, and server capabilities that may require updates under convert/, model/, x/create/, llm/, or llama/compat/. A model load alone is not enough; affected paths should run a real request and assert the expected result.
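One way to start the review is to bucket the files changed upstream by the areas listed above. The sketch below is illustrative only: `classify_path` and `triage_diff` are hypothetical helpers, and the path patterns are assumptions about the upstream llama.cpp layout that should be adjusted to whatever the new ref actually contains.

```shell
# Map an upstream path to a rough review area.
# The patterns are assumptions about llama.cpp's tree, not a fixed contract.
classify_path() {
    case "$1" in
        CMakeLists.txt|*/CMakeLists.txt|cmake/*|*.cmake)
            echo build ;;
        ggml/src/ggml-backend*|ggml/src/ggml-cuda/*|ggml/src/ggml-metal/*|ggml/src/ggml-vulkan/*)
            echo backend ;;
        tools/server/*)
            echo server ;;
        src/llama-arch*|src/llama-model*|src/llama-vocab*|gguf-py/*|convert_*)
            echo model ;;
        *)
            echo other ;;
    esac
}

# Run from the llama.cpp checkout: list changed files between two refs,
# each prefixed with its review area, grouped by area.
triage_diff() {
    git diff --name-only "$1" "$2" | while IFS= read -r path; do
        printf '%s\t%s\n' "$(classify_path "$path")" "$path"
    done | sort
}
```

A call such as `triage_diff <old-ref> <new-ref>` gives a reading order only. Files that land in `other` still need a look, since the patterns cannot anticipate upstream renames.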

Compatibility patches

Patches under llama/compat/ are applied during configure. If a patch insertion point moved, regenerate the patch against a fresh checkout of the new ref rather than editing an already-patched _deps/ tree.
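Before regenerating anything, a dry run can show which patches no longer apply to the new ref. This sketch assumes the patches are stored as unified-diff files that `git apply` can read, which the text above does not state, and `check_patches` is a hypothetical helper. `git apply --check` only tests applicability and leaves the tree untouched.

```shell
# check_patches <fresh-checkout-dir> <patch-file>...
# Prints "ok" or "FAIL" per patch; returns 1 if any patch fails to apply.
# Pass absolute patch paths, because git -C changes directory before reading them.
check_patches() {
    src=$1
    shift
    failed=0
    for patch in "$@"; do
        if git -C "$src" apply --check "$patch" 2>/dev/null; then
            echo "ok   $patch"
        else
            echo "FAIL $patch"
            failed=1
        fi
    done
    return "$failed"
}
```

Point it at a fresh, unpatched checkout of the new ref, not at the `_deps/` tree that configure has already patched, where every patch would be reported as failing.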

If compatibility sources, model patches, llama/server/CMakeLists.txt, or cmake/local.cmake changed, build the CPU target:

cmake --build build/llama-server-cpu --target llama-server --parallel 12

Configure-only validation can miss missing sources, template instantiation problems, and link errors. Also check whether upstream now supports a locally patched model natively; if it does, the local patch may need removal or rebase.
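The fixed job count in the build command may not suit every machine. A minimal sketch for deriving it from the host, where `build_jobs` and `build_cpu_target` are hypothetical helpers and the fallback of 4 is an arbitrary assumption for systems where `getconf` is unavailable:

```shell
# Print the number of online processors, or 4 when it cannot be determined.
build_jobs() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case "$n" in
        ''|0|*[!0-9]*) n=4 ;;   # empty, zero, or non-numeric: use the fallback
    esac
    echo "$n"
}

# Same build as above, with the job count taken from the host.
build_cpu_target() {
    cmake --build build/llama-server-cpu --target llama-server --parallel "$(build_jobs)"
}
```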

Local checks

Run the Go tests:

go test ./...

Passing Go tests is not runtime validation. Next, build the full Ollama release and verify it as described under End-to-end testing.

End-to-end testing

For runtime validation, build the full applicable native payload for the platform using the developer guide: Metal on macOS arm64, and the available CUDA, ROCm, and Vulkan backends on Linux and Windows.

Then run the integration tests on the platforms being validated. Use them to exercise real Ollama requests and inspect logs for device discovery, offload, memory accounting, flash attention, and request/response behavior. Behavior on macOS, Windows, and Linux must each be validated on that platform; a result from one does not carry over to another.
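Log inspection can be partly automated by checking a captured log for strings you expect to see. In this sketch `missing_markers` is a hypothetical helper. `system_info:` is the only marker taken from this document, and any other marker you pass should be copied from real logs of the version under test, not guessed.

```shell
# missing_markers <logfile> <marker>...
# Prints one "missing: ..." line per marker absent from the log and
# returns 1 if any marker is absent. Markers are matched as fixed strings.
missing_markers() {
    log=$1
    shift
    rc=0
    for marker in "$@"; do
        if ! grep -F -q -- "$marker" "$log"; then
            echo "missing: $marker"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_markers server.log 'system_info:'` succeeds only when that line was logged. A present marker shows that the line exists, not that its contents are correct, so the values still need to be read.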