Commit graph

5508 commits

Author SHA1 Message Date
Daniel Hiltgen
dba1e27fa8
llama: enable FA on CUDA CC 6.x GPUs (#16994)
Recent upstream Pascal kernel fixes let us compile native SM60/SM61 kernels again instead of relying on PTX JIT, so allow Flash Attention auto at runtime for CC 6.x devices.

Fixes #16591

Fixes #16754
2026-07-02 17:11:39 -07:00
Daniel Hiltgen
e436db25ff
compat: use UTF-8-safe file open (#16999)
Use ggml_fopen for compat tensor reads so Windows paths with Unicode characters are converted through the same UTF-8-to-wide path as llama.cpp model loading.

Fixes #16493
2026-07-02 16:59:23 -07:00
Daniel Hiltgen
26acfa42b5
rocm: remove no longer supported devices (#17010)
The presets and docs had fallen out of sync with what our current ROCm versions on Linux and Windows actually support. We now rely on Vulkan to cover these older, unsupported devices.
2026-07-02 16:59:01 -07:00
Daniel Hiltgen
7b22ac9683
llama: clean up dead code from llama-server work (#17007)
These pieces were missed in the final merge of llama-server and are dead code.
2026-07-02 12:51:54 -07:00
Parth Sareen
a2b3a5e9a3
agent: harness core (#16963)
2026-07-02 11:44:31 -07:00
Kevin Park
624cada952
discover: fall back to standard CUDA when the JetPack runner is absent (#16949)
* discover: use the SBSA CUDA build on JetPack 7 (L4T r38+)

JetPack 7 supports SBSA-based CUDA, so the standard cuda_v13 build — shipped
in the base linux-arm64 package, and given the Orin arch (CC 8.7) in #16628
— runs on these devices.

JETSON_JETPACK=7 previously selected a nonexistent jetpack7 runner, so
runner.go skipped every CUDA library and discovery fell back to CPU. The L4T
releases JetPack 7 uses (r38 on Thor, r39 on Orin) also hit the unrecognized
branch, and install.sh warned the version was unsupported. Map JetPack 7+
(L4T r38 and newer) to cuda_v13 (returned as "" from cudaJetpack); no
Jetson-specific download is needed, so install.sh no longer warns.

Fixes #16602

* discover: fall back to standard CUDA when the JetPack runner is absent

Per review, drop the L4T-version mapping (in cudaJetpack and install.sh) and
instead clear the jetpack override in runner.go when the detected cuda_jetpack
runner isn't installed. Normal discovery then selects the standard cuda_v13
build, which supports Orin (CC 8.7) on JetPack 7.
2026-07-02 08:34:35 -07:00
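The fallback described in 624cada952 above can be sketched roughly as follows. This is a minimal illustration, not the code in runner.go: the helper name resolveOverride, the installed-runner map, and the runner name "cuda_jetpack7" are assumptions made for the example.

```go
package main

import "fmt"

// resolveOverride returns the runner override to use, or "" when the
// requested runner is not installed, so that normal discovery can pick
// a standard build instead. Hypothetical helper, not the real runner.go.
func resolveOverride(override string, installed map[string]bool) string {
	if override != "" && !installed[override] {
		return "" // clear the override; discovery continues as usual
	}
	return override
}

func main() {
	// Assumed layout: only the standard CUDA v13 build is present.
	installed := map[string]bool{"cuda_v13": true}
	fmt.Printf("%q\n", resolveOverride("cuda_jetpack7", installed))
	fmt.Printf("%q\n", resolveOverride("cuda_v13", installed))
}
```

Clearing the override rather than mapping L4T versions keeps the decision tied to what is actually installed, which is the trade-off the review comment in the commit message points at.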
Michael Yang
cecd265d3a
docs(cloud): update retirement list (#17000)
2026-07-01 19:43:14 -07:00
Mark Ward
2ea95fb059
fix cuda toolkit lookup and parallel (#16613)
* fix cuda toolkit lookup and parallel

* support user override first

* enable control over the nested parallel count
2026-06-30 10:56:54 -07:00
Daniel Hiltgen
8e7be3aed1
ci: avoid unbounded parallelism (#16966)
build-darwin has gotten very slow in the past few releases, most likely because unbounded parallelism in the MLX build causes the builder to thrash.
2026-06-30 10:49:55 -07:00
Patrick Devine
710292ff4f
mlx: tighten up gemma4 moe loading code (#16964)
This change allows .experts.gate_proj / .up_proj / .down_proj tensor names to each
be used for both quantized (i.e. nvfp4 and mxfp8) and non-quantized (bf16) models.
Previously, only non-quantized models used that tensor naming scheme.
2026-06-29 21:15:08 -07:00
Bruce MacDonald
ada1eb5163
launch: check for min version for hermes desktop (#16912)
2026-06-29 11:50:11 -07:00
Daniel Hiltgen
1c5ebbf5f4
llama.cpp update (#16960)
2026-06-29 09:43:41 -07:00
Daniel Hiltgen
7926b99e0e
mlx: bump dependency (#16935)
Update MLX to 548dd80.

Fix direct MLX tests to run on pinned MLX threads so test execution matches the runner's MLX thread-affinity model.
2026-06-29 09:39:11 -07:00
Aditya Aggarwal
32a97b7493
tools: ignore braces inside JSON strings when detecting tool call end (#16937)
Parser.done() counted the tag's open/close characters ({}, []) without
tracking JSON string context, so a streamed tool call whose string
argument value contained a closing brace or bracket (e.g.
{"code": "if (x) { y }"}) was treated as complete too early and flushed to
the user as plain text instead of being parsed as a tool call.

findArguments() in the same file already tracks string context; apply the
same handling in done() so open/close characters inside string values are
ignored.
2026-06-27 12:00:55 -07:00
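The string-context tracking that 32a97b7493 describes can be sketched as below. This is an illustrative stand-in, not the actual Parser.done() implementation; the function name balanced and its exact return rule are assumptions.

```go
package main

import "fmt"

// balanced reports whether every '{' and '[' in s has been closed,
// ignoring braces and brackets that appear inside JSON string values.
// Hypothetical helper written for illustration only.
func balanced(s string) bool {
	depth := 0
	seen := false     // at least one opening character was found
	inString := false // currently inside a JSON string
	escaped := false  // previous character was a backslash in a string
	for _, r := range s {
		if inString {
			switch {
			case escaped:
				escaped = false
			case r == '\\':
				escaped = true
			case r == '"':
				inString = false
			}
			continue
		}
		switch r {
		case '"':
			inString = true
		case '{', '[':
			depth++
			seen = true
		case '}', ']':
			depth--
		}
	}
	return seen && depth == 0 && !inString
}

func main() {
	// A naive counter would treat the first input as complete because
	// the '}' inside the string brings its depth back to zero.
	fmt.Println(balanced(`{"text": "a } b`))
	fmt.Println(balanced(`{"text": "a } b"}`))
}
```

Tracking the escape state separately matters for values such as "\"}", where an escaped quote must not end the string.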
Daniel Hiltgen
d26a58557d
MLX: wire up scheduler selected context size for ps (#16918)
In the PS output, expose the scheduler-selected size (clamped by the model context size) instead of always reporting the model's maximum context. This gives clients a hint to keep the context size below this value, avoiding paging and poor performance on systems with less VRAM.
2026-06-26 08:47:03 -07:00
Parth Sareen
2e474c98f9
parser/renderer: add Ornith 9B renderer/parser support (#16920)
2026-06-25 23:18:47 -07:00
Bruce MacDonald
2cb2c5381f
launch: update hermes install urls to official (#16913)
2026-06-25 16:22:19 -07:00
Eva H
2a6b50421a
fix capability grid dark mode style (#16907)
2026-06-25 13:55:39 -04:00
Daniel Hiltgen
f22ec2ec49
CUDA: require driver 550 or newer for v12 (#16895)
Our cuda_v12 build requires nvcc fatbin compression, which in turn requires driver 550 or newer. This change filters out incompatible CUDA devices based on the runtime and driver version. Users can still build from source with older toolkits to support older drivers.

Fixes #16449
2026-06-25 08:46:00 -07:00
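The device filtering in f22ec2ec49 can be pictured with a sketch like the one below. The struct fields, the function name compatible, and representing the driver version as a single major number are all assumptions for illustration; only the cuda_v12 name and the 550 threshold come from the commit message.

```go
package main

import "fmt"

// device is a hypothetical, simplified view of a discovered GPU.
type device struct {
	Name        string
	Library     string // runtime build, e.g. "cuda_v12" or "cuda_v13"
	DriverMajor int    // major driver version, e.g. 550
}

// minDriverV12 is the minimum driver stated in the commit message.
const minDriverV12 = 550

// compatible drops cuda_v12 devices whose driver is older than 550 and
// keeps everything else unchanged.
func compatible(devs []device) []device {
	var out []device
	for _, d := range devs {
		if d.Library == "cuda_v12" && d.DriverMajor < minDriverV12 {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	devs := []device{
		{Name: "gpu-a", Library: "cuda_v12", DriverMajor: 535},
		{Name: "gpu-b", Library: "cuda_v12", DriverMajor: 550},
	}
	for _, d := range compatible(devs) {
		fmt.Println(d.Name)
	}
}
```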
Eva H
d9075caf1a
docs: redesign coding integration docs (#16808)
2026-06-25 10:03:59 -04:00
Daniel Hiltgen
e11eeb3ba0
llama.cpp version update (#16548)
2026-06-24 14:03:12 -07:00
Daniel Hiltgen
0a408b2225
jetson: add CC 87 for CUDA v13 (#16628)
The new JetPack 7.2 supports SBSA-based CUDA, so we can add the architecture now.
2026-06-24 14:02:41 -07:00
Daniel Hiltgen
16739dee60
server: align generate with native chat templates (#16878)
* server: align generate with native chat templates

/api/generate rebuilt chat-like prompts through the Go template path even when the model selected its native GGUF Jinja chat template, so the same model rendered differently between generate and chat.

Route chat-like generate requests through the shared native chat preparation path, keep deprecated context and image handling working there, and keep explicit OLLAMA_GO_TEMPLATE overrides intact.

Fixes #16792

* review comments

Fall back to "{{ .Prompt }}" when lacking templates
2026-06-24 13:43:56 -07:00
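The "{{ .Prompt }}" fallback mentioned in 16739dee60 can be illustrated with Go's standard text/template package. The render helper and its empty-string check are assumptions for the example and do not reflect how the server actually selects templates.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render executes tmplText against the prompt, falling back to the bare
// "{{ .Prompt }}" template when no template text is available.
// Hypothetical helper written for illustration only.
func render(tmplText, prompt string) (string, error) {
	if strings.TrimSpace(tmplText) == "" {
		tmplText = "{{ .Prompt }}"
	}
	t, err := template.New("prompt").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, struct{ Prompt string }{prompt}); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render("", "Why is the sky blue?")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With the fallback, a model that ships no template still receives the prompt text unchanged rather than an empty rendering.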
Eva H
d48d790baf
docs: redesign docs landing and integrations overview (#16807)
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2026-06-24 16:28:28 -04:00
Philip Sinitsin
0463940334
llm: fix ollama ps double-counting mmap'd weights on partial offload (#16709)
* llm: fix ollama ps double-counting mmap'd weights on partial offload

With mmap enabled, llama-server reports each CPU_Mapped model buffer as the
file-offset span of its CPU-resident tensors. During partial offload that span
covers nearly the whole file because the first and last tensors stay on CPU, so
the parsed buffer sizes count the offloaded weights twice and ollama ps shows
roughly 2x the real size with a false CPU/GPU split. Model weights can never
exceed the model file on disk, so trim the excess over the file size from the
mmap-backed portion when computing MemorySize. This makes the reported size
independent of use_mmap; VRAM accounting and scheduler placement are unchanged.

* llm: exclude repacked model buffers from the mmap overlap trim

The trim that corrects mmap double-counting computed the overlap from all
model buffers, including real copies such as CPU_REPACK. On a CPU-only
repacked model that inflated the excess and trimmed the repack out,
undercounting by the repack size (llama3.2 reported ~1918 MiB instead of
~3218 MiB).

Compute the overlap from file-backed buffers only: mmap views and direct
device copies, whose spans can overlap the file on partial offload.
Repacked or host-pinned CPU copies are separate allocations that never
overlap the on-disk weights, so leave them intact. Adds a CPU_Mapped +
CPU_REPACK regression test and corrects the Metal case to the real total.
2026-06-24 11:43:20 -07:00
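The accounting rule from 0463940334 can be sketched as below. It is a simplification: the commit trims the excess from the mmap-backed portion, while this sketch simply caps the file-backed total at the file size, which gives the same overall number. The buffer type, the memorySize name, and the sizes in main are illustrative assumptions, not measured values.

```go
package main

import "fmt"

// buffer is a hypothetical summary of one reported model buffer.
type buffer struct {
	Name       string
	Bytes      uint64
	FileBacked bool // mmap view or direct device copy of file contents
}

// memorySize sums the buffers, capping the file-backed portion at the
// size of the model file because weights can never exceed it. Separate
// allocations such as repacked copies are always counted in full.
func memorySize(bufs []buffer, fileBytes uint64) uint64 {
	var backed, separate uint64
	for _, b := range bufs {
		if b.FileBacked {
			backed += b.Bytes
		} else {
			separate += b.Bytes
		}
	}
	if backed > fileBytes {
		backed = fileBytes
	}
	return backed + separate
}

func main() {
	const miB = 1 << 20
	// Partial offload: the mmap span covers nearly the whole file.
	partial := []buffer{
		{Name: "CPU_Mapped", Bytes: 950 * miB, FileBacked: true},
		{Name: "CUDA0", Bytes: 600 * miB, FileBacked: true},
	}
	fmt.Println(memorySize(partial, 1000*miB) / miB)

	// CPU-only with a repacked copy: the repack must not be trimmed.
	repacked := []buffer{
		{Name: "CPU_Mapped", Bytes: 1000 * miB, FileBacked: true},
		{Name: "CPU_REPACK", Bytes: 400 * miB, FileBacked: false},
	}
	fmt.Println(memorySize(repacked, 1000*miB) / miB)
}
```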
Daniel Hiltgen
570679c9e0
mlx: update and fix CUDA JIT packaging (#16871)
Bump MLX to the latest selected upstream ref and update the MLX/imagegen
wrappers and tests for the new API behavior.

Fix the CUDA MLX archive so runtime NVRTC kernels work after deployment:
package CUTE/CUTLASS headers, include the CUDA runtime header closure, and
stage a coherent CUDA-toolkit-matched CCCL tree instead of MLX's fetched CCCL
for CUDA payloads. The previous archive could build successfully but crash at
runtime due to missing or incompatible JIT headers.
2026-06-24 10:36:02 -07:00
Daniel Hiltgen
89a171cc70
llm: use host Vulkan loader on Windows (#16869)
Stop bundling the Vulkan loader and resolve the host runtime for Windows Vulkan discovery and backend dependency loading.

Fixes #16677
2026-06-24 10:35:48 -07:00
Daniel Hiltgen
33878e671a
llama: default qwen2.5vl window attention metadata (#16868)
Existing qwen2.5vl GGUFs can contain an empty qwen25vl.vision.fullatt_block_indexes array. The compat layer translated the projector metadata but left clip.vision.n_wa_pattern unset, causing llama-server to fail loading the CLIP model.

Default the runtime compat value to the standard Qwen2.5-VL pattern when the key cannot be derived, and make the converter emit the same default for nil or empty fullatt block metadata.

Fixes #16540
2026-06-24 10:35:29 -07:00
Parth Sareen
c191a145bb
llm: preserve generation headroom for shifted prompts (#16856)
---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2026-06-23 15:29:40 -07:00
Parth Sareen
479e1cf94e
docs: document max think level (#16877)
2026-06-23 15:29:15 -07:00
Daniel Hiltgen
836507378b
llm: size mmproj offload by projector memory (#16866)
* llm: size mmproj offload by projector memory

Replace the blanket 10 GiB VRAM cutoff with a projector tensor-size estimate plus backend headroom, while preserving the existing CPU-only, partial text offload, shared-memory GPU, and startup OOM retry gates.

This is a stopgap until fit accounts for mmproj memory directly.

The same limited-vram path appears in the qwen3.5 vision hang report: the logs show --no-mmproj-offload on a 7.5 GiB RTX 5050 with about 6.4 GiB free while llama-server estimates the inline mmproj at about 962 MiB.

Fixes #16496

Fixes #16570

* review comments
2026-06-23 13:04:02 -07:00
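The sizing rule in 836507378b, a projector estimate plus backend headroom compared against free VRAM, can be sketched as follows. The function name and the 512 MiB headroom are assumptions chosen for illustration; the other gates the commit preserves (CPU-only, partial text offload, shared-memory GPU, OOM retry) are not modeled here.

```go
package main

import "fmt"

const miB = 1 << 20

// offloadProjector reports whether the projector plus some backend
// headroom fits into the free VRAM. Hypothetical helper; the real
// decision involves additional gates.
func offloadProjector(projectorBytes, headroomBytes, freeBytes uint64) bool {
	return projectorBytes+headroomBytes <= freeBytes
}

func main() {
	// Roughly the figures from the commit message: a projector of about
	// 962 MiB and about 6.4 GiB free. The headroom is an assumed value.
	fmt.Println(offloadProjector(962*miB, 512*miB, 6553*miB))
}
```

Under a blanket 10 GiB cutoff a 7.5 GiB card would never offload the projector, even though the estimate above fits comfortably.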
anish
46bc1bcb4c
llama: add sm_86 architecture to cuda_v13_windows preset (#16834)
The llama_cuda_v13_windows preset in llama/server/CMakePresets.json was missing sm_86 and sm_80 architectures, causing RTX 3060 laptop and similar mobile RTX 30-series GPUs to be skipped during runtime GPU detection on Windows with CUDA 13. The Linux preset (llama_cuda_v13_linux) included these architectures as "86-virtual" and "80-virtual", but the Windows preset only had "75-virtual;89-virtual;100-virtual;120-virtual", excluding Ampere mobile GPUs.

Signed-off-by: anish <anishesg@users.noreply.github.com>
Co-authored-by: anish <anishesg@users.noreply.github.com>
2026-06-23 07:35:21 -07:00
Bruce MacDonald
2a8b31531e
launch/codex: detect model drift when Codex App UI switches away from Ollama (#16864)
ollama launch codex-app sets root-level model_provider = "ollama-launch-codex-app"
in ~/.codex/config.toml to route requests through the local Ollama server.
In Codex, model_provider is a global config key: there is no per-model provider
in the catalog schema (ModelInfo has no model_provider field), so it applies to
every model, not just Ollama ones.

When a user switches to a built-in OpenAI model (e.g. gpt-5.5) in the Codex App
UI, the UI writes model = "gpt-5.5" to config.toml but does NOT update
model_provider. The root model_provider stays "ollama-launch-codex-app", so the
OpenAI model request goes to http://localhost:11434/v1/responses instead of
the OpenAI API, resulting in a 404 ("model gpt-5.5 not found"). The user is
stuck: OpenAI models silently route to localhost until they know to run
"ollama launch codex-app --restore".

Fix: CurrentModel() now verifies the configured model appears as a slug in the
Ollama-managed catalog before reporting the integration as active. When the
model has drifted (user selected a non-Ollama model in the UI), CurrentModel()
returns empty, so the launcher accurately shows the integration as inactive and
the user is directed to restore or re-launch.
2026-06-22 15:38:19 -07:00
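The drift check described in 2a8b31531e amounts to a membership test against the Ollama-managed catalog. The sketch below is an assumption-laden illustration: the real CurrentModel() reads config.toml and the catalog itself, whereas this hypothetical currentModel takes both as arguments, and the slug "qwen3:8b" is only an example value.

```go
package main

import "fmt"

// currentModel returns the configured model only when it appears as a
// slug in the managed catalog; otherwise it returns "" to signal that
// the integration should be shown as inactive. Hypothetical helper.
func currentModel(configured string, catalogSlugs []string) string {
	if configured == "" {
		return ""
	}
	for _, slug := range catalogSlugs {
		if slug == configured {
			return configured
		}
	}
	return "" // the model drifted to something outside the catalog
}

func main() {
	catalog := []string{"qwen3:8b"}
	fmt.Printf("%q\n", currentModel("qwen3:8b", catalog))
	fmt.Printf("%q\n", currentModel("gpt-5.5", catalog))
}
```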
Jesse Gross
505e35f2b9
mlxrunner: choose the speculative draft length to maximize throughput
The heuristic schedule grew the draft toward a fixed cap on acceptance alone,
maximizing accepted-tokens-per-step rather than throughput, and on a
steep-forward target it regressed below no speculation. Replace it with an
engine-level controller that drafts the depth maximizing
committed-tokens-per-wallclock from live per-position acceptance and persisted
per-width forward cost, with no draft-length cap; the heuristic schedule and
the OLLAMA_MLX_MTP_* env vars go with it.
2026-06-22 15:25:45 -07:00
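The idea in 505e35f2b9 of picking the depth that maximizes committed tokens per wallclock can be sketched as below. This is not the engine's controller: the expected-token formula (one token always committed, plus each draft position weighted by the probability that all drafts up to it were accepted) and the per-depth cost table are assumptions for illustration, and the search here is bounded by the length of the input slices.

```go
package main

import "fmt"

// bestDepth returns the draft depth with the highest expected committed
// tokens per unit of wallclock time.
//
// accept[i] is the assumed conditional acceptance rate of draft i given
// that all earlier drafts were accepted. cost[k] is the assumed time of
// one round at depth k, with cost[0] being plain decode.
func bestDepth(accept, cost []float64) int {
	if len(cost) == 0 {
		return 0
	}
	best, bestRate := 0, 1.0/cost[0]
	expected, survive := 1.0, 1.0 // one token is committed every round
	for k := 1; k <= len(accept) && k < len(cost); k++ {
		survive *= accept[k-1] // chance the first k drafts all pass
		expected += survive
		if rate := expected / cost[k]; rate > bestRate {
			best, bestRate = k, rate
		}
	}
	return best
}

func main() {
	accept := []float64{0.9, 0.8, 0.5, 0.2}
	fmt.Println(bestDepth(accept, []float64{10, 11, 12, 13, 14})) // cheap extra width
	fmt.Println(bestDepth(accept, []float64{10, 25, 40, 55, 70})) // steep forward cost
}
```

With the cheap cost table the sketch settles on depth 3, and with the steep one it returns 0, meaning no speculation at all, which matches the regression case the commit message describes.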
Jesse Gross
114875133b
mlxrunner: resolve each speculative round in one host sync
Acceptance took two blocking evals per round: one to read the accepted mask,
then a second for the bonus or residual token whose graph needed the
host-known rejection point. Sample the residual at every rejection point in
one batched draw alongside the bonus row, so a single eval covers acceptance
and the next token.
2026-06-22 15:25:45 -07:00
Jesse Gross
42c330283b
mlxrunner: run one target forward per MTP decode step
Each speculative round ran the target stack twice — once for the current
token's hidden and base logits, once to validate the drafts — capping
throughput below plain decode. Fuse them into one forward over [current,
draft_0..draft_{N-1}], whose hidden rows already line up with the acceptance
math, so the separate base-logits unembed disappears from the drafted path.
2026-06-22 15:25:45 -07:00
Jesse Gross
f93efe2809
mlxrunner: apply in-flight drafts to proposal penalty history
Sampler.Distribution built row i as if draftTokens[:i] were appended, leaving
a single-row proposal call with no draft history, so a drafter skipped the
repeat/presence penalties the target's validation applies and re-proposed
penalized tokens. Align rows with the end of the draft chain instead: the
final row sees every draft token, each earlier row one fewer.
2026-06-22 15:25:45 -07:00
Jesse Gross
28fbbb06d5
mlxrunner: support draft heads that maintain draft caches
Generalize the draft path so a head that maintains a KV cache (EAGLE-style)
and Gemma's read-only single-position assistant both fit one drafter
interface with no per-model branches, and make the committed stream the
drafter's maintenance mechanism — every committed run is reported, the
drafter pairs each draft slot with its look-ahead token and flushes completed
pairs to the draft caches. The draft KV thus stays prefix-cached alongside
the target in every session, drafting or not.
2026-06-22 15:25:45 -07:00
Jesse Gross
340c51bbb7
mlxrunner: host speculative decoding in the text generation pipeline
The pipeline and the MTP decoder each owned a decode loop with duplicated
prefill, budget, and emission handling. Split the pipeline into prefill and
decode phases behind a decoder interface, with the decode loop the sole
emitter enforcing the NumPredict budget, and split speculation into a generic
engine that returns the accepted run and a drafter interface that owns only
how proposals are made.
2026-06-22 15:25:45 -07:00
Jesse Gross
2e9d68dc38
mlxrunner: unify the MTP decode paths
Greedy is a special case of sampled decoding — at temperature 0 the sampler
yields a point mass, so rejection-sampling acceptance reduces to argmax-match
— so collapse the separate greedy, sampled, and serial paths into one. MTP
now honors any temperature, penalty, and top-k/p/min-p setting; logprobs
remain the only gated feature.
2026-06-22 15:25:45 -07:00
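The claim in 2e9d68dc38 that greedy decoding is a special case of sampled decoding can be checked with a small sketch. It uses the standard speculative-sampling acceptance probability min(1, p(x)/q(x)) for target distribution p and draft distribution q; the helper names are assumptions and this is not the mlxrunner code.

```go
package main

import "fmt"

// argmax returns the index of the largest logit (first one on ties).
func argmax(logits []float64) int {
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	return best
}

// pointMass returns the temperature-0 distribution for non-empty
// logits: all probability on the argmax.
func pointMass(logits []float64) []float64 {
	dist := make([]float64, len(logits))
	dist[argmax(logits)] = 1
	return dist
}

// acceptProb is min(1, p[tok]/q[tok]) for target p and draft q.
func acceptProb(p, q []float64, tok int) float64 {
	if q[tok] == 0 {
		return 0 // the drafter could not have proposed this token
	}
	if r := p[tok] / q[tok]; r < 1 {
		return r
	}
	return 1
}

func main() {
	target := pointMass([]float64{1, 3, 2})
	agree := []float64{0, 5, 1}    // draft argmax matches the target
	disagree := []float64{4, 0, 1} // draft argmax differs
	fmt.Println(acceptProb(target, pointMass(agree), argmax(agree)))
	fmt.Println(acceptProb(target, pointMass(disagree), argmax(disagree)))
}
```

With both distributions collapsed to point masses the acceptance probability is exactly 1 when the argmaxes match and 0 otherwise, so no separate greedy path is needed.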
Sahil Kadadekar
fc58544422
discover: fix inverted iGPU/dGPU Vulkan classification on Windows hybrid graphics (#16669)
On Windows hybrid-graphics systems (Intel iGPU + NVIDIA dGPU), discovery
could classify the integrated GPU as discrete and the discrete GPU as
integrated, dropping the dGPU's Vulkan device and scheduling models onto
the iGPU's shared system RAM (#16667). Two index-keyed correlations
between independently-ordered device enumerations caused this:

1. The native probe's stderr was concatenated into the output passed to
   parseVulkanUMA. The probe enumerates Vulkan devices in its own order,
   so its ggml_vulkan uma lines overwrote llama-server's index-keyed UMA
   map with inverted values. Parse UMA metadata only from llama-server's
   own output.

2. applyWindowsVulkanRefinement required the raw vkEnumeratePhysicalDevices
   count to equal llama-server's Vulkan device count. The raw enumeration
   is a superset on real systems (D3D12 mapping-layer devices, Microsoft
   Basic Render Driver), so the refinement that reads the authoritative
   VkPhysicalDeviceType was always skipped. Match devices by name against
   the probed superset instead, bailing only when a device has no match or
   matches conflicting device types.

Verified on the hardware from #16667 (Intel RaptorLake-S + RTX 4080
Laptop): the raw probe returns 5 devices vs llama-server's 2; with this
change the iGPU is dropped as integrated, the dGPU's Vulkan device
dedupes against CUDA0, and the model loads on the dGPU with no
environment overrides.

Fixes #16667
2026-06-22 14:52:03 -07:00
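The name-based matching from point 2 of fc58544422 can be sketched as follows. The types, the classify function, and the device names in main are illustrative assumptions; the real applyWindowsVulkanRefinement reads VkPhysicalDeviceType from a native probe.

```go
package main

import "fmt"

// probedDevice is a hypothetical entry from the raw device enumeration.
type probedDevice struct {
	Name       string
	Integrated bool
}

// classify matches each backend device name against the probed superset
// and returns name -> integrated. It reports ok=false when a name has
// no match or matches entries with conflicting device types.
func classify(names []string, probed []probedDevice) (map[string]bool, bool) {
	byName := make(map[string][]bool)
	for _, p := range probed {
		byName[p.Name] = append(byName[p.Name], p.Integrated)
	}
	result := make(map[string]bool, len(names))
	for _, n := range names {
		kinds := byName[n]
		if len(kinds) == 0 {
			return nil, false // no match: bail out
		}
		for _, k := range kinds[1:] {
			if k != kinds[0] {
				return nil, false // conflicting types: bail out
			}
		}
		result[n] = kinds[0]
	}
	return result, true
}

func main() {
	// The probe is a superset: duplicates and software devices appear.
	probed := []probedDevice{
		{"Intel(R) UHD Graphics", true},
		{"NVIDIA GeForce RTX 4080 Laptop GPU", false},
		{"Intel(R) UHD Graphics", true},
		{"NVIDIA GeForce RTX 4080 Laptop GPU", false},
		{"Microsoft Basic Render Driver", false},
	}
	names := []string{"Intel(R) UHD Graphics", "NVIDIA GeForce RTX 4080 Laptop GPU"}
	kinds, ok := classify(names, probed)
	fmt.Println(ok, kinds[names[0]], kinds[names[1]])
}
```

Matching by name removes the dependence on the two enumerations sharing an order or a count, which is what made the index-keyed version skip the refinement.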
Eva H
e434a93884
launch: auto-install opencode when missing (#16806)
2026-06-19 10:12:11 -07:00
Eva H
9c02d8e69d
launch: auto-install Claude Code (#16802)
2026-06-19 10:11:50 -07:00
Eva H
07ed752353
launch: add thinking capability detection to opencode (#15434)
2026-06-18 13:45:16 -04:00
Parth Sareen
e1f7f9cbdb
ci: pin darwin release xcode (#16788)
2026-06-17 13:01:10 -07:00
Patrick Devine
8c432fc88a
llama: update llama.cpp to b9672 (#16775)
2026-06-16 23:15:52 -07:00
Jeffrey Morgan
acfb50d9af
models: add cohere2_moe (Command A / North) to the MLX engine (#16670)
Implements Cohere2MoeForCausalLM (e.g. CohereLabs/North-Mini-Code-1.0)
2026-06-16 23:15:21 -07:00
Jeffrey Morgan
0f047feef5
llm: context shift allow shiftable prompts (#16764)
2026-06-16 12:55:52 -07:00
Patrick Devine
9e4ed74efe
integration: look for the "hf" tool in integration tests (#16765)
The "huggingface-cli" tool is deprecated, so only try to use the "hf" tool.
2026-06-16 11:04:54 -07:00
Jeffrey Morgan
bbb40a0a6c
server: context shift for context windows larger than 8k, add error when hitting context limit (#16712)
2026-06-15 11:36:50 -07:00