Recent upstream Pascal kernel fixes let us compile native SM60/SM61 kernels again instead of relying on PTX JIT, so allow Flash Attention auto at runtime for CC 6.x devices.
Fixes #16591
Fixes #16754
Use ggml_fopen for compat tensor reads so Windows paths with Unicode characters are converted through the same UTF-8-to-wide path as llama.cpp model loading.
Fixes #16493
The presets and docs had fallen out of sync with what our current ROCm versions on Linux and Windows actually support. We rely on Vulkan now to cover these older unsupported devices.
* discover: use the SBSA CUDA build on JetPack 7 (L4T r38+)
JetPack 7 supports SBSA-based CUDA, so the standard cuda_v13 build — shipped
in the base linux-arm64 package, and given the Orin arch (CC 8.7) in #16628
— runs on these devices.
JETSON_JETPACK=7 previously selected a nonexistent jetpack7 runner, so
runner.go skipped every CUDA library and discovery fell back to CPU. The L4T
releases JetPack 7 uses (r38 on Thor, r39 on Orin) also hit the unrecognized
branch, and install.sh warned the version was unsupported. Map JetPack 7+
(L4T r38 and newer) to cuda_v13 (returned as "" from cudaJetpack); no
Jetson-specific download is needed, so install.sh no longer warns.
Fixes #16602
* discover: fall back to standard CUDA when the JetPack runner is absent
Per review, drop the L4T-version mapping (in cudaJetpack and install.sh) and
instead clear the jetpack override in runner.go when the detected cuda_jetpack
runner isn't installed. Normal discovery then selects the standard cuda_v13
build, which supports Orin (CC 8.7) on JetPack 7.
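The fallback described above can be sketched as follows. This is a minimal illustration only: effectiveOverride, the "cuda_" prefix convention, and the installed-runner map are hypothetical names, not the actual runner.go code.

```go
package main

import "fmt"

// effectiveOverride returns the JetPack override to keep, or "" when the
// matching runner is not installed. All names here are hypothetical.
func effectiveOverride(override string, installed map[string]bool) string {
	if override == "" {
		return ""
	}
	if !installed["cuda_"+override] {
		// Clear the override so normal discovery picks the standard build.
		return ""
	}
	return override
}

func main() {
	installed := map[string]bool{"cuda_v13": true, "cuda_jetpack6": true}
	fmt.Printf("%q\n", effectiveOverride("jetpack7", installed))
	fmt.Printf("%q\n", effectiveOverride("jetpack6", installed))
}
```

Clearing the override instead of mapping L4T versions keeps the decision tied to what is actually on disk.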
This change allows .experts.gate_proj / .up_proj / .down_proj tensor names to each
be used for both quantized (i.e. nvfp4 and mxfp8) and non-quantized (bf16) models.
Previously, only non-quantized models used that tensor naming scheme.
Parser.done() counted the tag's open/close characters ({}, []) without
tracking JSON string context, so a streamed tool call whose string
argument value contained a closing brace or bracket (e.g.
{"code": "if (x) { y }"}) was treated as complete too early and flushed to
the user as plain text instead of being parsed as a tool call.
findArguments() in the same file already tracks string context; apply the
same handling in done() so open/close characters inside string values are
ignored.
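A minimal sketch of delimiter counting that tracks JSON string context. The helper name balanced and its signature are assumptions for illustration; this is not the actual Parser.done() implementation.

```go
package main

import "fmt"

// balanced reports whether s has opened at least one delimiter and closed
// every one it opened, ignoring delimiters inside JSON string literals.
// Hypothetical helper; not the real parser code.
func balanced(s string, open, close byte) bool {
	depth := 0
	seen := false
	inString := false
	escaped := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if inString {
			switch {
			case escaped:
				escaped = false // this byte was escaped; skip it
			case c == '\\':
				escaped = true
			case c == '"':
				inString = false
			}
			continue
		}
		switch c {
		case '"':
			inString = true
		case open:
			depth++
			seen = true
		case close:
			depth--
		}
	}
	return seen && depth == 0 && !inString
}

func main() {
	for _, s := range []string{`{"code": "x }`, `{"code": "if (x) { y }"}`} {
		fmt.Println(balanced(s, '{', '}'), s)
	}
}
```

A partial stream that ends inside a string is never reported as complete, even when the raw brace counts happen to match.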
In the PS output, expose the scheduler-selected context size (clamped by the model context size) instead of always reporting the model's maximum context. This gives clients a hint to keep the context size below this value to avoid paging and poor performance on smaller VRAM systems.
Our cuda_v12 build requires nvcc fatbin compression, which in turn requires driver 550 or newer. This change filters incompatible CUDA devices based on the runtime and driver version. This allows users to build from source with older toolkits to support older drivers.
Fixes #16449
* server: align generate with native chat templates
/api/generate rebuilt chat-like prompts through the Go template path even when the model selected its native GGUF Jinja chat template, so the same model rendered differently between generate and chat.
Route chat-like generate requests through the shared native chat preparation path, keep deprecated context and image handling working there, and keep explicit OLLAMA_GO_TEMPLATE overrides intact.
Fixes #16792
* review comments
Fall back to "{{ .Prompt }}" when lacking templates
* llm: fix ollama ps double-counting mmap'd weights on partial offload
With mmap enabled, llama-server reports each CPU_Mapped model buffer as the
file-offset span of its CPU-resident tensors. During partial offload that span
covers nearly the whole file because the first and last tensors stay on CPU, so
the parsed buffer sizes count the offloaded weights twice and ollama ps shows
roughly 2x the real size with a false CPU/GPU split. Model weights can never
exceed the model file on disk, so trim the excess over the file size from the
mmap-backed portion when computing MemorySize. This makes the reported size
independent of use_mmap; VRAM accounting and scheduler placement are unchanged.
* llm: exclude repacked model buffers from the mmap overlap trim
The trim that corrects mmap double-counting computed the overlap from all
model buffers, including real copies such as CPU_REPACK. On a CPU-only
repacked model that inflated the excess and trimmed the repack out,
undercounting by the repack size (llama3.2 reported ~1918 MiB instead of
~3218 MiB).
Compute the overlap from file-backed buffers only: mmap views and direct
device copies, whose spans can overlap the file on partial offload.
Repacked or host-pinned CPU copies are separate allocations that never
overlap the on-disk weights, so leave them intact. Adds a CPU_Mapped +
CPU_REPACK regression test and corrects the Metal case to the real total.
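The trim can be sketched as below. The buffer struct, its flags, and the sizes are illustrative assumptions rather than the real MemorySize code; only the rule (take the excess from file-backed buffers and leave repacked copies intact) comes from the description above.

```go
package main

import "fmt"

// buffer is a hypothetical view of one reported model buffer.
type buffer struct {
	name       string
	size       uint64
	mapped     bool // mmap view of the model file
	fileBacked bool // mmap view or direct device copy of file contents
}

// memorySize sums all buffers, then trims whatever the file-backed buffers
// report beyond the on-disk file size, taking it from the mmap'd portion.
func memorySize(bufs []buffer, fileSize uint64) uint64 {
	var total, fileBacked, mapped uint64
	for _, b := range bufs {
		total += b.size
		if b.fileBacked {
			fileBacked += b.size
		}
		if b.mapped {
			mapped += b.size
		}
	}
	if fileBacked <= fileSize {
		return total // nothing double-counted
	}
	excess := fileBacked - fileSize
	if excess > mapped {
		excess = mapped // never trim more than the mmap'd span
	}
	return total - excess
}

func main() {
	partial := []buffer{
		{"CPU_Mapped", 950, true, true},
		{"CUDA0", 700, false, true},
	}
	repacked := []buffer{
		{"CPU_Mapped", 1000, true, true},
		{"CPU_REPACK", 400, false, false},
	}
	fmt.Println(memorySize(partial, 1000), memorySize(repacked, 1000))
}
```

With these made-up sizes, the partial-offload case collapses to the file size, while the repacked copy is still counted in full.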
Bump MLX to the latest selected upstream ref and update the MLX/imagegen
wrappers and tests for the new API behavior.
Fix the CUDA MLX archive so runtime NVRTC kernels work after deployment:
package CUTE/CUTLASS headers, include the CUDA runtime header closure, and
stage a coherent CUDA-toolkit-matched CCCL tree instead of MLX's fetched CCCL
for CUDA payloads. The previous archive could build successfully but crash at
runtime due to missing or incompatible JIT headers.
Existing qwen2.5vl GGUFs can contain an empty qwen25vl.vision.fullatt_block_indexes array. The compat layer translated the projector metadata but left clip.vision.n_wa_pattern unset, causing llama-server to fail loading the CLIP model.
Default the runtime compat value to the standard Qwen2.5-VL pattern when the key cannot be derived, and make the converter emit the same default for nil or empty fullatt block metadata.
Fixes #16540
* llm: size mmproj offload by projector memory
Replace the blanket 10 GiB VRAM cutoff with a projector tensor-size estimate plus backend headroom, while preserving the existing CPU-only, partial text offload, shared-memory GPU, and startup OOM retry gates.
This is a stopgap until fit accounts for mmproj memory directly.
The same limited-vram path appears in the qwen3.5 vision hang report: the logs show --no-mmproj-offload on a 7.5 GiB RTX 5050 with about 6.4 GiB free while llama-server estimates the inline mmproj at about 962 MiB.
Fixes #16496
Fixes #16570
* review comments
The llama_cuda_v13_windows preset in llama/server/CMakePresets.json was missing sm_86 and sm_80 architectures, causing RTX 3060 laptop and similar mobile RTX 30-series GPUs to be skipped during runtime GPU detection on Windows with CUDA 13. The Linux preset (llama_cuda_v13_linux) included these architectures as "86-virtual" and "80-virtual", but the Windows preset only had "75-virtual;89-virtual;100-virtual;120-virtual", excluding Ampere mobile GPUs.
Signed-off-by: anish <anishesg@users.noreply.github.com>
Co-authored-by: anish <anishesg@users.noreply.github.com>
ollama launch codex-app sets root-level model_provider = "ollama-launch-codex-app"
in ~/.codex/config.toml to route requests through the local Ollama server.
In Codex, model_provider is a global config key: there is no per-model provider
in the catalog schema (ModelInfo has no model_provider field), so it applies to
every model, not just Ollama ones.
When a user switches to a built-in OpenAI model (e.g. gpt-5.5) in the Codex App
UI, the UI writes model = "gpt-5.5" to config.toml but does NOT update
model_provider. The root model_provider stays "ollama-launch-codex-app", so the
OpenAI model request goes to http://localhost:11434/v1/responses instead of
the OpenAI API, resulting in a 404 ("model gpt-5.5 not found"). The user is
stuck: OpenAI models silently route to localhost until they know to run
"ollama launch codex-app --restore".
Fix: CurrentModel() now verifies the configured model appears as a slug in the
Ollama-managed catalog before reporting the integration as active. When the
model has drifted (user selected a non-Ollama model in the UI), CurrentModel()
returns empty, so the launcher accurately shows the integration as inactive and
the user is directed to restore or re-launch.
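A minimal sketch of the drift check, assuming a hypothetical catalogEntry type and a free function in place of the real CurrentModel() method:

```go
package main

import "fmt"

// catalogEntry is a hypothetical stand-in for one Ollama-managed catalog row.
type catalogEntry struct {
	Slug string
}

// currentModel returns the configured model only when it appears as a slug in
// the Ollama-managed catalog; otherwise it returns "" to signal drift.
func currentModel(configured string, catalog []catalogEntry) string {
	for _, e := range catalog {
		if e.Slug == configured {
			return configured
		}
	}
	return ""
}

func main() {
	catalog := []catalogEntry{{Slug: "llama3.2"}, {Slug: "qwen3"}}
	fmt.Printf("%q\n", currentModel("llama3.2", catalog))
	fmt.Printf("%q\n", currentModel("gpt-5.5", catalog))
}
```

Returning the empty string reuses the existing "inactive" state, so the launcher needs no separate drift flag.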
The heuristic schedule grew the draft toward a fixed cap on acceptance alone,
maximizing accepted-tokens-per-step rather than throughput, and on a
steep-forward target it regressed below no speculation. Replace it with an
engine-level controller that drafts the depth maximizing
committed-tokens-per-wallclock from live per-position acceptance and persisted
per-width forward cost, with no draft-length cap; the heuristic schedule and
the OLLAMA_MLX_MTP_* env vars go with it.
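A sketch of the depth selection under simplifying assumptions: accept[i] is the measured probability that draft position i is accepted given that all earlier positions were, cost[k] is the measured wall-clock time of one round that drafts k tokens, and every round commits one extra bonus or residual token. The function name and data layout are hypothetical, not the engine's API.

```go
package main

import "fmt"

// bestDepth picks the draft depth k that maximizes expected committed tokens
// per unit of wall-clock time. Depth 0 means plain decoding.
// accept[i]: conditional acceptance rate at draft position i (assumed input).
// cost[k]:   time for one round drafting k tokens (assumed input).
func bestDepth(accept, cost []float64) int {
	if len(cost) == 0 {
		return 0
	}
	best := 0
	bestRate := 1.0 / cost[0] // one token per plain decode step
	expected := 1.0           // the bonus/residual token is always committed
	chain := 1.0              // probability that drafts 0..k-1 were all accepted
	for k := 1; k < len(cost) && k <= len(accept); k++ {
		chain *= accept[k-1]
		expected += chain
		if rate := expected / cost[k]; rate > bestRate {
			best, bestRate = k, rate
		}
	}
	return best
}

func main() {
	accept := []float64{0.9, 0.8, 0.5}
	fmt.Println(bestDepth(accept, []float64{1.0, 1.1, 1.2, 1.6}))
	fmt.Println(bestDepth(accept, []float64{1.0, 2.5, 4.0, 5.5}))
}
```

With a flat cost curve the controller drafts deeper, and with a steep one it falls back to depth 0 rather than regressing below plain decoding.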
Acceptance took two blocking evals per round: one to read the accepted mask,
then a second for the bonus or residual token whose graph needed the
host-known rejection point. Sample the residual at every rejection point in
one batched draw alongside the bonus row, so a single eval covers acceptance
and the next token.
Each speculative round ran the target stack twice — once for the current
token's hidden and base logits, once to validate the drafts — capping
throughput below plain decode. Fuse them into one forward over [current,
draft_0..draft_{N-1}], whose hidden rows already line up with the acceptance
math, so the separate base-logits unembed disappears from the drafted path.
Sampler.Distribution built row i as if draftTokens[:i] were appended, leaving
a single-row proposal call with no draft history, so a drafter skipped the
repeat/presence penalties the target's validation applies and re-proposed
penalized tokens. Align rows with the end of the draft chain instead: the
final row sees every draft token, each earlier row one fewer.
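The end-aligned row history can be sketched as follows. historyFor is a hypothetical helper, not the Sampler.Distribution signature; it only shows which slice of the draft chain each row would see.

```go
package main

import "fmt"

// historyFor returns the draft tokens visible to the given row when rows are
// aligned with the end of the draft chain: the final row sees every draft
// token and each earlier row sees one fewer. Hypothetical helper.
func historyFor(row, rows int, draft []int) []int {
	n := len(draft) - (rows - 1 - row)
	if n < 0 {
		n = 0
	}
	return draft[:n]
}

func main() {
	draft := []int{7, 8, 9}
	// Single-row proposal call: the row sees the whole draft chain.
	fmt.Println(historyFor(0, 1, draft))
	// Validation with len(draft)+1 rows: row i sees the first i draft tokens.
	for row := 0; row < 4; row++ {
		fmt.Println(row, historyFor(row, 4, draft))
	}
}
```

For a validation call with one more row than draft tokens this matches the old start-aligned behavior, so only the single-row proposal case changes.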
Generalize the draft path so a head that maintains a KV cache (EAGLE-style)
and Gemma's read-only single-position assistant both fit one drafter
interface with no per-model branches, and make the committed stream the
drafter's maintenance mechanism — every committed run is reported, the
drafter pairs each draft slot with its look-ahead token and flushes completed
pairs to the draft caches. The draft KV thus stays prefix-cached alongside
the target in every session, drafting or not.
The pipeline and the MTP decoder each owned a decode loop with duplicated
prefill, budget, and emission handling. Split the pipeline into prefill and
decode phases behind a decoder interface, with the decode loop the sole
emitter enforcing the NumPredict budget, and split speculation into a generic
engine that returns the accepted run and a drafter interface that owns only
how proposals are made.
Greedy is a special case of sampled decoding — at temperature 0 the sampler
yields a point mass, so rejection-sampling acceptance reduces to argmax-match
— so collapse the separate greedy, sampled, and serial paths into one. MTP
now honors any temperature, penalty, and top-k/p/min-p setting; logprobs
remain the only gated feature.
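To illustrate why greedy needs no separate path, here is a sketch using the standard speculative-sampling acceptance rule min(1, p/q), where p and q are the target and draft probabilities of the drafted token. The helper names are hypothetical, and the point-mass construction stands in for what a temperature-0 sampler would yield.

```go
package main

import (
	"fmt"
	"math"
)

// pointMass returns the distribution a temperature-0 sampler yields: all
// probability on the argmax token. Hypothetical helper.
func pointMass(logits []float64) []float64 {
	if len(logits) == 0 {
		return nil
	}
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	out := make([]float64, len(logits))
	out[best] = 1
	return out
}

// acceptProb is the rejection-sampling acceptance probability for a drafted
// token with target probability p and draft probability q.
func acceptProb(p, q float64) float64 {
	if q <= 0 {
		return 0
	}
	return math.Min(1, p/q)
}

func main() {
	target := pointMass([]float64{0.1, 2.0, -1.0}) // target argmax is token 1
	agree := pointMass([]float64{0.3, 1.5, 0.0})   // drafter proposes token 1
	differ := pointMass([]float64{3.0, 1.0, 0.0})  // drafter proposes token 0
	fmt.Println(acceptProb(target[1], agree[1]))
	fmt.Println(acceptProb(target[0], differ[0]))
}
```

With point masses the acceptance probability is either 1 (tokens match) or 0 (they differ), which is exactly the argmax-match test.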
On Windows hybrid-graphics systems (Intel iGPU + NVIDIA dGPU), discovery
could classify the integrated GPU as discrete and the discrete GPU as
integrated, dropping the dGPU's Vulkan device and scheduling models onto
the iGPU's shared system RAM (#16667). Two index-keyed correlations
between independently-ordered device enumerations caused this:
1. The native probe's stderr was concatenated into the output passed to
parseVulkanUMA. The probe enumerates Vulkan devices in its own order,
so its ggml_vulkan uma lines overwrote llama-server's index-keyed UMA
map with inverted values. Parse UMA metadata only from llama-server's
own output.
2. applyWindowsVulkanRefinement required the raw vkEnumeratePhysicalDevices
count to equal llama-server's Vulkan device count. The raw enumeration
is a superset on real systems (D3D12 mapping-layer devices, Microsoft
Basic Render Driver), so the refinement that reads the authoritative
VkPhysicalDeviceType was always skipped. Match devices by name against
the probed superset instead, bailing only when a device has no match or
matches conflicting device types.
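The name-based matching in item 2 can be sketched as below. The types, the function name, and the device names are illustrative assumptions; the real applyWindowsVulkanRefinement code may differ.

```go
package main

import "fmt"

type deviceType int

const (
	typeOther deviceType = iota
	typeIntegrated
	typeDiscrete
)

// probed is a hypothetical record from the raw Vulkan enumeration.
type probed struct {
	Name string
	Type deviceType
}

// resolveTypes matches each discovered device name against the probed
// superset. It reports false when a name has no match or when same-named
// entries disagree on device type.
func resolveTypes(names []string, superset []probed) (map[string]deviceType, bool) {
	out := make(map[string]deviceType, len(names))
	for _, name := range names {
		found := false
		var t deviceType
		for _, p := range superset {
			if p.Name != name {
				continue
			}
			if found && p.Type != t {
				return nil, false // conflicting types for one name
			}
			found, t = true, p.Type
		}
		if !found {
			return nil, false // device missing from the probe
		}
		out[name] = t
	}
	return out, true
}

func main() {
	superset := []probed{
		{"iGPU", typeIntegrated}, {"dGPU", typeDiscrete},
		{"iGPU", typeIntegrated}, {"dGPU", typeDiscrete},
		{"Basic Render", typeOther},
	}
	types, ok := resolveTypes([]string{"iGPU", "dGPU"}, superset)
	fmt.Println(types, ok)
}
```

Matching by name tolerates extra entries in the superset, which an equal-count check cannot.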
Verified on the hardware from #16667 (Intel RaptorLake-S + RTX 4080
Laptop): the raw probe returns 5 devices vs llama-server's 2; with this
change the iGPU is dropped as integrated, the dGPU's Vulkan device
dedupes against CUDA0, and the model loads on the dGPU with no
environment overrides.
Fixes #16667