kenaug74/ollama

mirror of https://github.com/ollama/ollama.git synced 2026-07-03 03:38:52 +00:00

Author	SHA1	Message	Date
Daniel Hiltgen	dba1e27fa8	llama: enable FA on CUDA CC 6.x GPUs (#16994) Recent upstream Pascal kernel fixes let us compile native SM60/SM61 kernels again instead of relying on PTX JIT, so allow Flash Attention auto at runtime for CC 6.x devices. Fixes #16591 Fixes #16754	2026-07-02 17:11:39 -07:00
Daniel Hiltgen	e436db25ff	compat: use UTF-8-safe file open (#16999) Use ggml_fopen for compat tensor reads so Windows paths with Unicode characters are converted through the same UTF-8-to-wide path as llama.cpp model loading. Fixes #16493	2026-07-02 16:59:23 -07:00
Daniel Hiltgen	26acfa42b5	rocm: remove no longer supported devices (#17010) The presets and docs had fallen out of sync with what our current ROCm versions on Linux and Windows actually support. We rely on Vulkan now to cover these older unsupported devices.	2026-07-02 16:59:01 -07:00
Daniel Hiltgen	7b22ac9683	llama: clean up dead code from llama-server work (#17007) These pieces were missed in the final merge of llama-server and are dead code.	2026-07-02 12:51:54 -07:00
Parth Sareen	a2b3a5e9a3	agent: harness core (#16963)	2026-07-02 11:44:31 -07:00
Kevin Park	624cada952	discover: fall back to standard CUDA when the JetPack runner is absent (#16949 ) * discover: use the SBSA CUDA build on JetPack 7 (L4T r38+) JetPack 7 supports SBSA-based CUDA, so the standard cuda_v13 build — shipped in the base linux-arm64 package, and given the Orin arch (CC 8.7) in #16628 — runs on these devices. JETSON_JETPACK=7 previously selected a nonexistent jetpack7 runner, so runner.go skipped every CUDA library and discovery fell back to CPU. The L4T releases JetPack 7 uses (r38 on Thor, r39 on Orin) also hit the unrecognized branch, and install.sh warned the version was unsupported. Map JetPack 7+ (L4T r38 and newer) to cuda_v13 (returned as "" from cudaJetpack); no Jetson-specific download is needed, so install.sh no longer warns. Fixes #16602 * discover: fall back to standard CUDA when the JetPack runner is absent Per review, drop the L4T-version mapping (in cudaJetpack and install.sh) and instead clear the jetpack override in runner.go when the detected cuda_jetpack runner isn't installed. Normal discovery then selects the standard cuda_v13 build, which supports Orin (CC 8.7) on JetPack 7.	2026-07-02 08:34:35 -07:00
Michael Yang	cecd265d3a	docs(cloud): update retirement list (#17000)	2026-07-01 19:43:14 -07:00
Mark Ward	2ea95fb059	fix cuda toolkit lookup and parallel (#16613) * fix cuda toolkit lookup and parallel * support user override first * enable control over the nested parallel count	2026-06-30 10:56:54 -07:00
Daniel Hiltgen	8e7be3aed1	ci: avoid unbounded parallelism (#16966) build-darwin has gotten very slow in the past few releases, most likely due to unbounded parallelism in the MLX build causing the builder to thrash	2026-06-30 10:49:55 -07:00
Patrick Devine	710292ff4f	mlx: tighten up gemma4 moe loading code (#16964) This change allows .experts.gate_proj / .up_proj / .down_proj tensor names to each be used for both quantized (i.e. nvfp4 and mxfp8) and non-quantized (bf16) models. Previously, only non-quantized models used that tensor naming scheme.	2026-06-29 21:15:08 -07:00
Bruce MacDonald	ada1eb5163	launch: check for min version for hermes desktop (#16912)	2026-06-29 11:50:11 -07:00
Daniel Hiltgen	1c5ebbf5f4	llama.cpp update (#16960)	2026-06-29 09:43:41 -07:00
Daniel Hiltgen	7926b99e0e	mlx: bump dependency (#16935) Update MLX to 548dd80. Fix direct MLX tests to run on pinned MLX threads so test execution matches the runner's MLX thread-affinity model.	2026-06-29 09:39:11 -07:00
Aditya Aggarwal	32a97b7493	tools: ignore braces inside JSON strings when detecting tool call end (#16937) Parser.done() counted the tag's open/close characters ({}, []) without tracking JSON string context, so a streamed tool call whose string argument value contained a closing brace or bracket (e.g. {"code": "if (x) { y }"}) was treated as complete too early and flushed to the user as plain text instead of being parsed as a tool call. findArguments() in the same file already tracks string context; apply the same handling in done() so open/close characters inside string values are ignored.	2026-06-27 12:00:55 -07:00
Daniel Hiltgen	d26a58557d	MLX: wire up scheduler selected context size for ps (#16918) In the PS output, expose the scheduler selected size (clamped by model context size) instead of always reporting the model max context. This will help provide a hint to clients to keep the context size below this value to avoid paging and poor performance on smaller VRAM systems.	2026-06-26 08:47:03 -07:00
Parth Sareen	2e474c98f9	parser/renderer: add Ornith 9B renderer/parser support (#16920)	2026-06-25 23:18:47 -07:00
Bruce MacDonald	2cb2c5381f	launch: update hermes install urls to official (#16913)	2026-06-25 16:22:19 -07:00
Eva H	2a6b50421a	fix capability grid dark mode style (#16907)	2026-06-25 13:55:39 -04:00
Daniel Hiltgen	f22ec2ec49	CUDA: require driver 550 or newer for v12 (#16895) Our cuda_v12 build requires nvcc fatbin compression, which in turn requires driver 550 or newer. This change filters incompatible CUDA devices based on the runtime and driver version. This allows users to build from source with older toolkits to support older drivers. Fixes #16449	2026-06-25 08:46:00 -07:00
Eva H	d9075caf1a	docs: redesign coding integration docs (#16808)	2026-06-25 10:03:59 -04:00
Daniel Hiltgen	e11eeb3ba0	llama.cpp version update (#16548)	2026-06-24 14:03:12 -07:00
Daniel Hiltgen	0a408b2225	jetson: add CC 87 for CUDA v13 (#16628) The new Jetpack 7.2 supports SBSA based CUDA, so we can add the architecture now.	2026-06-24 14:02:41 -07:00
Daniel Hiltgen	16739dee60	server: align generate with native chat templates (#16878) * server: align generate with native chat templates /api/generate rebuilt chat-like prompts through the Go template path even when the model selected its native GGUF Jinja chat template, so the same model rendered differently between generate and chat. Route chat-like generate requests through the shared native chat preparation path, keep deprecated context and image handling working there, and keep explicit OLLAMA_GO_TEMPLATE overrides intact. Fixes #16792 * review comments Fall back to "{{ .Prompt }}" when lacking templates	2026-06-24 13:43:56 -07:00
Eva H	d48d790baf	docs: redesign docs landing and integrations overview (#16807) Co-authored-by: Parth Sareen <parth.sareen@ollama.com>	2026-06-24 16:28:28 -04:00
Philip Sinitsin	0463940334	llm: fix ollama ps double-counting mmap'd weights on partial offload (#16709 ) * llm: fix ollama ps double-counting mmap'd weights on partial offload With mmap enabled, llama-server reports each CPU_Mapped model buffer as the file-offset span of its CPU-resident tensors. During partial offload that span covers nearly the whole file because the first and last tensors stay on CPU, so the parsed buffer sizes count the offloaded weights twice and ollama ps shows roughly 2x the real size with a false CPU/GPU split. Model weights can never exceed the model file on disk, so trim the excess over the file size from the mmap-backed portion when computing MemorySize. This makes the reported size independent of use_mmap; VRAM accounting and scheduler placement are unchanged. * llm: exclude repacked model buffers from the mmap overlap trim The trim that corrects mmap double-counting computed the overlap from all model buffers, including real copies such as CPU_REPACK. On a CPU-only repacked model that inflated the excess and trimmed the repack out, undercounting by the repack size (llama3.2 reported ~1918 MiB instead of ~3218 MiB). Compute the overlap from file-backed buffers only: mmap views and direct device copies, whose spans can overlap the file on partial offload. Repacked or host-pinned CPU copies are separate allocations that never overlap the on-disk weights, so leave them intact. Adds a CPU_Mapped + CPU_REPACK regression test and corrects the Metal case to the real total.	2026-06-24 11:43:20 -07:00
Daniel Hiltgen	570679c9e0	mlx: update and fix CUDA JIT packaging (#16871) Bump MLX to the latest selected upstream ref and update the MLX/imagegen wrappers and tests for the new API behavior. Fix the CUDA MLX archive so runtime NVRTC kernels work after deployment: package CUTE/CUTLASS headers, include the CUDA runtime header closure, and stage a coherent CUDA-toolkit-matched CCCL tree instead of MLX's fetched CCCL for CUDA payloads. The previous archive could build successfully but crash at runtime due to missing or incompatible JIT headers.	2026-06-24 10:36:02 -07:00
Daniel Hiltgen	89a171cc70	llm: use host Vulkan loader on Windows (#16869) Stop bundling the Vulkan loader and resolve the host runtime for Windows Vulkan discovery and backend dependency loading. Fixes #16677	2026-06-24 10:35:48 -07:00
Daniel Hiltgen	33878e671a	llama: default qwen2.5vl window attention metadata (#16868) Existing qwen2.5vl GGUFs can contain an empty qwen25vl.vision.fullatt_block_indexes array. The compat layer translated the projector metadata but left clip.vision.n_wa_pattern unset, causing llama-server to fail loading the CLIP model. Default the runtime compat value to the standard Qwen2.5-VL pattern when the key cannot be derived, and make the converter emit the same default for nil or empty fullatt block metadata. Fixes #16540	2026-06-24 10:35:29 -07:00
Parth Sareen	c191a145bb	llm: preserve generation headroom for shifted prompts (#16856) Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2026-06-23 15:29:40 -07:00
Parth Sareen	479e1cf94e	docs: document max think level (#16877)	2026-06-23 15:29:15 -07:00
Daniel Hiltgen	836507378b	llm: size mmproj offload by projector memory (#16866) * llm: size mmproj offload by projector memory Replace the blanket 10 GiB VRAM cutoff with a projector tensor-size estimate plus backend headroom, while preserving the existing CPU-only, partial text offload, shared-memory GPU, and startup OOM retry gates. This is a stopgap until fit accounts for mmproj memory directly. The same limited-vram path appears in the qwen3.5 vision hang report: the logs show --no-mmproj-offload on a 7.5 GiB RTX 5050 with about 6.4 GiB free while llama-server estimates the inline mmproj at about 962 MiB. Fixes #16496 Fixes #16570 * review comments	2026-06-23 13:04:02 -07:00
anish	46bc1bcb4c	llama: add sm_86 architecture to cuda_v13_windows preset (#16834) The llama_cuda_v13_windows preset in llama/server/CMakePresets.json was missing sm_86 and sm_80 architectures, causing RTX 3060 laptop and similar mobile RTX 30-series GPUs to be skipped during runtime GPU detection on Windows with CUDA 13. The Linux preset (llama_cuda_v13_linux) included these architectures as "86-virtual" and "80-virtual", but the Windows preset only had "75-virtual;89-virtual;100-virtual;120-virtual", excluding Ampere mobile GPUs. Signed-off-by: anish <anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com>	2026-06-23 07:35:21 -07:00
Bruce MacDonald	2a8b31531e	launch/codex: detect model drift when Codex App UI switches away from Ollama (#16864) ollama launch codex-app sets root-level model_provider = "ollama-launch-codex-app" in ~/.codex/config.toml to route requests through the local Ollama server. In Codex, model_provider is a global config key; there is no per-model provider in the catalog schema (ModelInfo has no model_provider field), so it applies to every model, not just Ollama ones. When a user switches to a built-in OpenAI model (e.g. gpt-5.5) in the Codex App UI, the UI writes model = "gpt-5.5" to config.toml but does NOT update model_provider. The root model_provider stays "ollama-launch-codex-app", so the OpenAI model request goes to http://localhost:11434/v1/responses instead of the OpenAI API, resulting in a 404 ("model gpt-5.5 not found"). The user is stuck: OpenAI models silently route to localhost until they know to run "ollama launch codex-app --restore". Fix: CurrentModel() now verifies the configured model appears as a slug in the Ollama-managed catalog before reporting the integration as active. When the model has drifted (user selected a non-Ollama model in the UI), CurrentModel() returns empty, so the launcher accurately shows the integration as inactive and the user is directed to restore or re-launch.	2026-06-22 15:38:19 -07:00
Jesse Gross	505e35f2b9	mlxrunner: choose the speculative draft length to maximize throughput The heuristic schedule grew the draft toward a fixed cap on acceptance alone, maximizing accepted-tokens-per-step rather than throughput, and on a steep-forward target it regressed below no speculation. Replace it with an engine-level controller that drafts the depth maximizing committed-tokens-per-wallclock from live per-position acceptance and persisted per-width forward cost, with no draft-length cap; the heuristic schedule and the OLLAMA_MLX_MTP_* env vars go with it.	2026-06-22 15:25:45 -07:00
Jesse Gross	114875133b	mlxrunner: resolve each speculative round in one host sync Acceptance took two blocking evals per round: one to read the accepted mask, then a second for the bonus or residual token whose graph needed the host-known rejection point. Sample the residual at every rejection point in one batched draw alongside the bonus row, so a single eval covers acceptance and the next token.	2026-06-22 15:25:45 -07:00
Jesse Gross	42c330283b	mlxrunner: run one target forward per MTP decode step Each speculative round ran the target stack twice — once for the current token's hidden and base logits, once to validate the drafts — capping throughput below plain decode. Fuse them into one forward over [current, draft_0..draft_{N-1}], whose hidden rows already line up with the acceptance math, so the separate base-logits unembed disappears from the drafted path.	2026-06-22 15:25:45 -07:00
Jesse Gross	f93efe2809	mlxrunner: apply in-flight drafts to proposal penalty history Sampler.Distribution built row i as if draftTokens[:i] were appended, leaving a single-row proposal call with no draft history, so a drafter skipped the repeat/presence penalties the target's validation applies and re-proposed penalized tokens. Align rows with the end of the draft chain instead: the final row sees every draft token, each earlier row one fewer.	2026-06-22 15:25:45 -07:00
Jesse Gross	28fbbb06d5	mlxrunner: support draft heads that maintain draft caches Generalize the draft path so a head that maintains a KV cache (EAGLE-style) and Gemma's read-only single-position assistant both fit one drafter interface with no per-model branches, and make the committed stream the drafter's maintenance mechanism — every committed run is reported, the drafter pairs each draft slot with its look-ahead token and flushes completed pairs to the draft caches. The draft KV thus stays prefix-cached alongside the target in every session, drafting or not.	2026-06-22 15:25:45 -07:00
Jesse Gross	340c51bbb7	mlxrunner: host speculative decoding in the text generation pipeline The pipeline and the MTP decoder each owned a decode loop with duplicated prefill, budget, and emission handling. Split the pipeline into prefill and decode phases behind a decoder interface, with the decode loop the sole emitter enforcing the NumPredict budget, and split speculation into a generic engine that returns the accepted run and a drafter interface that owns only how proposals are made.	2026-06-22 15:25:45 -07:00
Jesse Gross	2e9d68dc38	mlxrunner: unify the MTP decode paths Greedy is a special case of sampled decoding — at temperature 0 the sampler yields a point mass, so rejection-sampling acceptance reduces to argmax-match — so collapse the separate greedy, sampled, and serial paths into one. MTP now honors any temperature, penalty, and top-k/p/min-p setting; logprobs remain the only gated feature.	2026-06-22 15:25:45 -07:00
Sahil Kadadekar	fc58544422	discover: fix inverted iGPU/dGPU Vulkan classification on Windows hybrid graphics (#16669 ) On Windows hybrid-graphics systems (Intel iGPU + NVIDIA dGPU), discovery could classify the integrated GPU as discrete and the discrete GPU as integrated, dropping the dGPU's Vulkan device and scheduling models onto the iGPU's shared system RAM (#16667). Two index-keyed correlations between independently-ordered device enumerations caused this: 1. The native probe's stderr was concatenated into the output passed to parseVulkanUMA. The probe enumerates Vulkan devices in its own order, so its ggml_vulkan uma lines overwrote llama-server's index-keyed UMA map with inverted values. Parse UMA metadata only from llama-server's own output. 2. applyWindowsVulkanRefinement required the raw vkEnumeratePhysicalDevices count to equal llama-server's Vulkan device count. The raw enumeration is a superset on real systems (D3D12 mapping-layer devices, Microsoft Basic Render Driver), so the refinement that reads the authoritative VkPhysicalDeviceType was always skipped. Match devices by name against the probed superset instead, bailing only when a device has no match or matches conflicting device types. Verified on the hardware from #16667 (Intel RaptorLake-S + RTX 4080 Laptop): the raw probe returns 5 devices vs llama-server's 2; with this change the iGPU is dropped as integrated, the dGPU's Vulkan device dedupes against CUDA0, and the model loads on the dGPU with no environment overrides. Fixes #16667	2026-06-22 14:52:03 -07:00
Eva H	e434a93884	launch: auto-install opencode when missing (#16806)	2026-06-19 10:12:11 -07:00
Eva H	9c02d8e69d	launch: auto-install Claude Code (#16802)	2026-06-19 10:11:50 -07:00
Eva H	07ed752353	launch: add thinking capability detection to opencode (#15434)	2026-06-18 13:45:16 -04:00
Parth Sareen	e1f7f9cbdb	ci: pin darwin release xcode (#16788)	2026-06-17 13:01:10 -07:00
Patrick Devine	8c432fc88a	llama: update llama.cpp to b9672 (#16775)	2026-06-16 23:15:52 -07:00
Jeffrey Morgan	acfb50d9af	models: add cohere2_moe (Command A / North) to the MLX engine (#16670) Implements Cohere2MoeForCausalLM (e.g. CohereLabs/North-Mini-Code-1.0)	2026-06-16 23:15:21 -07:00
Jeffrey Morgan	0f047feef5	llm: context shift allow shiftable prompts (#16764)	2026-06-16 12:55:52 -07:00
Patrick Devine	9e4ed74efe	integration: look for the "hf" tool in integration tests (#16765) The "huggingface-cli" tool is deprecated, so only try to use the "hf" tool.	2026-06-16 11:04:54 -07:00
Jeffrey Morgan	bbb40a0a6c	server: context shift for context windows larger than 8k, add error when hitting context limit (#16712)	2026-06-15 11:36:50 -07:00
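
Note on 32a97b7493 (#16937): the message describes counting open/close characters while ignoring those inside JSON string values. A minimal sketch of that idea follows. It is not the actual Parser.done() code; the function name objectClosed is hypothetical, and for brevity it does not check that a '}' pairs with a '{' rather than a '['.

```go
package main

import "fmt"

// objectClosed reports whether the first top-level JSON object or array in
// buf has been closed, skipping any braces or brackets that appear inside
// string literals. Hypothetical helper for illustration only.
func objectClosed(buf string) bool {
	depth := 0        // nesting depth outside of strings
	inString := false // currently inside a "..." literal
	escaped := false  // previous byte was a backslash inside a string

	for i := 0; i < len(buf); i++ {
		c := buf[i]
		if inString {
			switch {
			case escaped:
				escaped = false // this byte is escaped; ignore it
			case c == '\\':
				escaped = true
			case c == '"':
				inString = false
			}
			continue
		}
		switch c {
		case '"':
			inString = true
		case '{', '[':
			depth++
		case '}', ']':
			if depth > 0 {
				depth--
				if depth == 0 {
					return true // outermost delimiter closed
				}
			}
		}
	}
	return false
}

func main() {
	// Still streaming: the only '}' seen so far is inside a string value.
	fmt.Println(objectClosed(`{"code": "}`))
	// Complete call whose string argument contains braces.
	fmt.Println(objectClosed(`{"code": "if (x) { y }"}`))
}
```

A counter that ignores string context would report the first input as complete, which is the early flush the commit message describes.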

5508 commits
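
Note on 0463940334 (#16709): the message describes trimming the excess of file-backed buffer sizes over the on-disk model size, while leaving separate copies such as CPU_REPACK untouched. The arithmetic can be sketched as below. The buffer type, the fileBacked flag, and the sizes are assumptions made up for illustration, not the real llm package types or measured values.

```go
package main

import "fmt"

// buffer is a hypothetical stand-in for one reported model buffer.
type buffer struct {
	name       string
	sizeMiB    uint64
	fileBacked bool // mmap view or direct device copy of on-disk weights
}

// memorySize sums all buffers, then trims whatever the file-backed buffers
// report beyond the model file size, since weights cannot exceed the file.
// Buffers that are not file-backed (e.g. repacked copies) are never trimmed.
func memorySize(bufs []buffer, fileMiB uint64) uint64 {
	var total, backed uint64
	for _, b := range bufs {
		total += b.sizeMiB
		if b.fileBacked {
			backed += b.sizeMiB
		}
	}
	if backed > fileMiB {
		total -= backed - fileMiB
	}
	return total
}

func main() {
	// Partial offload: the mmap span covers most of the file, so the
	// offloaded weights would otherwise be counted twice.
	partial := []buffer{
		{"CPU_Mapped", 1900, true},
		{"CUDA0", 1500, true},
	}
	fmt.Println(memorySize(partial, 2000))

	// CPU-only with a repacked copy: the repack is a real allocation and
	// stays in the total.
	repacked := []buffer{
		{"CPU_Mapped", 2000, true},
		{"CPU_REPACK", 1000, false},
	}
	fmt.Println(memorySize(repacked, 2000))
}
```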