NemoClaw/tools
Aaron Erickson 🦞 3a767fb809
fix(gpu): prefer native OpenShell injection (#6142)
## Summary

Routes ordinary native CDI Linux through OpenShell 0.0.71's native
`--gpu` path instead of the legacy Docker container swap. The legacy
path remains available for WSL, Jetson, and explicit
`NEMOCLAW_DOCKER_GPU_PATCH=1`, with its OpenShell supervisor command
boundary and rollback diagnostics hardened.

## Related Issue

Related to #6110

## Changes

- Use native OpenShell GPU injection by default on ordinary native CDI
Linux.
- Document the `NEMOCLAW_DOCKER_GPU_PATCH` auto, forced-legacy, and
native-routing behavior for ordinary native Linux, Docker Desktop WSL,
and Jetson/Tegra.
- Keep `NEMOCLAW_DOCKER_GPU_PATCH=1` as the explicit legacy-swap force
control and `=0` as the existing native opt-out; Docker Desktop WSL
still ignores `=0`, and Jetson keeps its compatibility default.
- Preserve OpenShell's supervisor entrypoint on the legacy swap: Docker
receives no command tail, while the workload stays in
`OPENSHELL_SANDBOX_COMMAND`.
- Validate legacy startup tokens before stopping or renaming the
original container and serialize extra-placeholder keys as one
comma-delimited token.
- Defer every legacy recreate through the same supervisor-wait/finalize
boundary so failed clones are captured before rollback on both create
timing paths.
- Persist only allowlisted/redacted failed-clone topology, state,
process, network, and log evidence, with a 10-second total / 2-second
per-call budget so diagnostics cannot materially delay rollback.
- Permit only the canonical OpenShell Docker/Podman TLS key path in the
Hermes runtime environment; arbitrary values and persisted `.env`
entries remain rejected.
- Refresh the Dockerfile integrity pin for the changed validator so
production Hermes images fail closed on any later digest drift.
- Prove native and legacy Docker command boundaries separately,
including Ready/CUDA status, supervisor PID 1, placeholder transport,
config hashes, no backup-container leak, and inference requests.
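
The routing rules in the list above can be sketched as a small decision function. This is a minimal illustration, not the code in this PR. `Platform`, `GpuRoute`, and `selectGpuRoute` are hypothetical names, and `native` is an assumed route label (only `legacy-patch` appears in the proof artifacts). Honoring `=0` on Jetson is also an assumption, because the description only states that Jetson keeps its compatibility default.

```typescript
// Hypothetical types: only NEMOCLAW_DOCKER_GPU_PATCH and "legacy-patch" come from the PR text.
type Platform = "native-linux" | "docker-desktop-wsl" | "jetson";
type GpuRoute = "native" | "legacy-patch";

// patchEnv is the raw value of NEMOCLAW_DOCKER_GPU_PATCH, or undefined when unset.
function selectGpuRoute(platform: Platform, patchEnv: string | undefined): GpuRoute {
  // "=1" is the explicit legacy-swap force control on every platform.
  if (patchEnv === "1") return "legacy-patch";
  // Docker Desktop WSL has no usable native CDI route, so it ignores "=0".
  if (platform === "docker-desktop-wsl") return "legacy-patch";
  // "=0" is the existing native opt-out (assumed to be honored on Jetson as well).
  if (patchEnv === "0") return "native";
  // Jetson keeps its compatibility default when nothing is set.
  if (platform === "jetson") return "legacy-patch";
  // Ordinary native CDI Linux now defaults to OpenShell's native --gpu path.
  return "native";
}
```

Checking the force control first keeps `=1` meaningful on every platform, and checking WSL before `=0` is what makes the opt-out a no-op there.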

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [x] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Quality Gates

- [x] Tests added or updated for changed behavior
- [ ] Existing tests cover changed behavior — justification:
- [ ] Tests not applicable — justification:
- [x] Docs updated for user-facing behavior changes
- [ ] Docs not applicable — justification:
- [x] Sensitive paths changed (security, policy, credentials, preflight,
onboarding, inference, runner, sandbox, or messaging)
- [x] Sensitive-path review completed or maintainer-approved waiver
recorded — reviewer/approval link/justification: independent exact-diff
review found no remaining substantive issue after startup-envelope
secret redaction, bounded pre-rollback capture, unified finalize
ordering, and route-specific live assertions
- [ ] Non-success, skipped, or missing CI check accepted by maintainer —
check name, approval link, and follow-up issue:

## Verification

- [x] PR description includes the DCO sign-off declaration and every
commit appears as `Verified` in GitHub
- [ ] Git hooks passed during commit and push, or `npx prek run
--from-ref main --to-ref HEAD` passes
- [x] Targeted tests pass for changed behavior
- [ ] Full `npm test` passes (broad runtime changes only)
- [x] Quality Gates section completed with required justifications or
waivers
- [x] No secrets, API keys, or credentials committed
- [ ] `npm run docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Local exact-candidate verification at
`d76f1647a5f354ba04737a6a049b82bfbf6d5454`:

- CLI build and typecheck passed.
- Hermes GPU support/client/workflow coverage passed 48/48; the built
gateway-cleanup module resolves through Node, and runtime cleanup,
registration removal, and bind availability remain fail-closed.
- The shared Docker GPU diagnostic collector owns redaction for every
text/JSON artifact and returned summary; direct conventional `*_KEY` and
custom-placeholder canaries, JSON-validity, inspect-before-write,
exhausted-budget, and collector-owned top regressions pass.
- All 12 Docker GPU suites pass 126/126; the exact-head focused
E2E-support set passes 39/39, including six process-token self-match
regressions, the scrubbed integrity-proof environment, and total
forbidden-marker count.
- Conditional scan, source-shape, test-size, Biome, repository checks,
commit hooks, and push typecheck passed for the final harness
correction; only the documented macOS-invalid full CLI hook lane was
excluded.
- The forbidden-marker request sensor support suite passed 14/14 and
records counts only, never raw request bodies or marker values.
- OpenShell transport-boundary coverage passed 4/4; Docker GPU
command-envelope coverage passed 6/6; extra-placeholder parsing coverage
passed 16/16.
- Hermes validator and wrapper integrity pins match their source SHA256
digests; hadolint and diff checks pass.
- Hermes startup/boundary coverage passed 43/43 locally; the Linux-only
wrapper cases are delegated to exact-head CI.
- The full local CLI hook is not a valid gate on this macOS Node 22
host: unchanged `a1fc52c7` TypeScript entrypoints fail through the
CommonJS preload with `ERR_UNKNOWN_FILE_EXTENSION`; exact-head Linux CI
remains required.
- Hermes runtime-guard plus current-main docs regression tests: 24/24
passed.
- Hermes workflow-boundary test passed.
- Project-boundary, project-membership, test-title, source-shape,
test-size, targeted Biome, and diff checks passed.
- `npm run docs:sync-agent-variants`, `npm run
docs:check-agent-variants`, and `npm run docs` pass; Fern reports 0
errors and 2 existing warnings.

Live A/B evidence:

- Forced legacy swap at `97e3e7e1`: [run
28554699811](https://github.com/NVIDIA/NemoClaw/actions/runs/28554699811)
reproduced sustained OpenShell `Error` followed by safe rollback. It
also exposed a lifecycle instrumentation gap that this PR closes: the
post-create `ensureApplied()` branch bypassed pre-rollback capture.
- Native OpenShell route at signed diagnostic SHA `15f50182`: [run
28555110558](https://github.com/NVIDIA/NemoClaw/actions/runs/28555110558)
completed onboarding with exit 0, reached `Phase: Ready`, reported `CUDA
verified`, and sent authenticated Hermes chat-completions requests to
the hermetic inference endpoint. The job's only failure was the
now-fixed test regex not stripping ANSI around `Ready`.
- Prior native evidence at `7cb219d9`: [full run
28559814959](https://github.com/NVIDIA/NemoClaw/actions/runs/28559814959)
and [second pass
28559816026](https://github.com/NVIDIA/NemoClaw/actions/runs/28559816026)
both reached Ready/CUDA with clean runtime and teardown; their Hermes
proof stopped on the now-fixed ANSI matcher before downstream
assertions.
- Prior forced-legacy diagnostic on production parent `a1fc52c7` plus
workflow-only child `6d0cf6a5`: [run
28561607207](https://github.com/NVIDIA/NemoClaw/actions/runs/28561607207)
selected the legacy swap and rolled back cleanly, but failed because the
Hermes boundary rejected the driver-owned OpenShell `OPENSHELL_TLS_KEY`
path. Candidate `54cf259d` adds an exact runtime-only allowance with
negative boundary tests.
- Prior exact-head set at `a1fc52c7`: [run
28561548749](https://github.com/NVIDIA/NemoClaw/actions/runs/28561548749)
proved the GPU and security companion jobs, while Hermes GPU stopped at
a sandbox-user `/proc` permission probe after Ready/CUDA. Candidate
`54cf259d` keeps the same proof but runs it as root and restricts the
match to the exact `nemoclaw-start` process.
- Prior second native pass at `a1fc52c7`: [run
28561555945](https://github.com/NVIDIA/NemoClaw/actions/runs/28561555945)
reproduced only the same harness permission failure after Ready/CUDA,
correct PID 1 topology, authenticated inference, zero forbidden-marker
matches, and clean teardown.
- Superseded six-job set at `54cf259d`: [run
28564960504](https://github.com/NVIDIA/NemoClaw/actions/runs/28564960504)
exposed the stale Dockerfile validator digest and was canceled before
runtime proof. Candidate `970803a4` updates the integrity pin, retained
by final head `c5a67c4c`.
- Superseded second native pass at `54cf259d`: [run
28564973806](https://github.com/NVIDIA/NemoClaw/actions/runs/28564973806)
was canceled during pre-cleanup and supplies no acceptance evidence.
- Superseded forced-legacy proof on production parent `54cf259d` plus
child `69f4e1b2`: [run
28564983760](https://github.com/NVIDIA/NemoClaw/actions/runs/28564983760)
was canceled during pre-cleanup and supplies no acceptance evidence.
- Superseded six-job set at `970803a4`: [run
28565197328](https://github.com/NVIDIA/NemoClaw/actions/runs/28565197328)
was canceled before acceptance execution when the canonical
placeholder-format advisor fix advanced the head.
- Superseded second native pass at `970803a4`: [run
28565207911](https://github.com/NVIDIA/NemoClaw/actions/runs/28565207911)
was canceled before runner assignment and supplies no acceptance
evidence.
- Superseded forced-legacy proof on production parent `970803a4` plus
child `b4d5679e`: [run
28565222881](https://github.com/NVIDIA/NemoClaw/actions/runs/28565222881)
was canceled before runner assignment and supplies no acceptance
evidence.
- Superseded six-job set at `7335903b`: [run
28565576066](https://github.com/NVIDIA/NemoClaw/actions/runs/28565576066)
was intentionally canceled when the documentation gap advanced the
candidate; the root-entrypoint smoke passed, but the remaining lanes
provide no complete acceptance proof.
- Superseded second native pass at `7335903b`: [run
28565587094](https://github.com/NVIDIA/NemoClaw/actions/runs/28565587094)
was canceled before acceptance execution and supplies no acceptance
evidence.
- Superseded forced-legacy proof on production parent `7335903b` plus
child `091e16fd`: [run
28565603460](https://github.com/NVIDIA/NemoClaw/actions/runs/28565603460)
was canceled before acceptance execution and supplies no acceptance
evidence.
- Superseded six-job set at `c5a67c4c`: [run
28566083673](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083673)
reached the native GPU/runtime proofs before the obsolete raw
strict-hash assertion failed; messaging independently hit the
process-probe self-match fixed by merged #6167. GPU, root-entrypoint,
secret-boundary, and credential companion lanes passed.
- Superseded second native pass at `c5a67c4c`: [run
28566083641](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083641)
proved native routing, Ready/CUDA, `nvidia-smi`, `/proc`, `cuInit(0)=0`,
PID 1, authenticated inference, and cleanup, then failed only the
obsolete raw strict-hash assertion.
- Superseded forced-legacy proof on production parent `c5a67c4c` plus
child `48c46a7f`: [run
28566083589](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083589)
proved the same runtime boundary on the legacy route, then failed only
the obsolete raw strict-hash assertion.
- Superseded six-job set at `a04a70ac`: [run
28568069499](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069499)
exposed a pre-onboarding harness defect: direct Vitest import of the
production cleanup helper could not resolve its lazy CommonJS TypeScript
dependencies. Companion results do not count as final-head evidence.
- Superseded second native pass at `a04a70ac`: [run
28568069558](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069558)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded forced-legacy proof on production parent `a04a70ac` plus
child `085a3b7d`: [run
28568069530](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069530)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded six-job set at `6ac4ebc8`: [run
28568490864](https://github.com/NVIDIA/NemoClaw/actions/runs/28568490864)
exposed a clean-runner preinstall edge: the compiled cleanup child was
invoked before OpenShell existed and failed before onboarding. Companion
results do not count as final-head evidence.
- Superseded second native pass at `6ac4ebc8`: [run
28568494928](https://github.com/NVIDIA/NemoClaw/actions/runs/28568494928)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded forced-legacy proof on production parent `6ac4ebc8` plus
child `5d3742a7`: [run
28568501430](https://github.com/NVIDIA/NemoClaw/actions/runs/28568501430)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded six-job set at `65b06d64`: [run
28568954862](https://github.com/NVIDIA/NemoClaw/actions/runs/28568954862)
passed all six jobs, but the candidate advanced to close the
advisor-confirmed shared diagnostic-redaction boundary and two
proof-hardening review threads.
- Superseded second native pass at `65b06d64`: [run
28568959028](https://github.com/NVIDIA/NemoClaw/actions/runs/28568959028)
passed the full native runtime proof but is not final-head evidence.
- Superseded forced-legacy proof on production parent `65b06d64` plus
child `2c6dca1b`: [run
28568966557](https://github.com/NVIDIA/NemoClaw/actions/runs/28568966557)
passed but is not final-parent evidence.
- Final six-job exact-head set at `d76f1647`: [run
28601346031](https://github.com/NVIDIA/NemoClaw/actions/runs/28601346031)
passed all six requested jobs. Native GPU, Hermes startup,
root-entrypoint, secret-boundary, credential-sanitization, and messaging
proofs are green; all 21 messaging raw-token surface probes are
`ABSENT`, and every cleanup record has zero failures.
- Final second native Hermes GPU pass at `d76f1647`: [run
28601348023](https://github.com/NVIDIA/NemoClaw/actions/runs/28601348023)
passed 9/9 assertions. Artifact `8043702467`
(`sha256:0ef929fa2478f5c9579ad2f282c46de8fe4d6eefb04acb062de1af29b4eb002c`)
proves native routing, Ready/CUDA, `nvidia-smi`, `/proc`, successful
`cuInit(0)`, OpenShell PID 1, one container/no backup, authenticated
inference with zero forbidden-marker matches, and clean teardown.
- Final failed-clone rollback proof checks out exact production SHA
`d76f1647` from signed workflow-only child `4c49b5bc`: [run
28602166456](https://github.com/NVIDIA/NemoClaw/actions/runs/28602166456)
passed. Artifact `8044114375`
(`sha256:b5cff9cc7e7cf361f55e074a265c8cfefd3e6dc21ad0d300b140d2a749cde00b`)
records clone exit 137 with `failure_kind=patched_container_failed` and
`rolled_back=no` before finalize, then `rolled_back=yes`, exactly one
running original container, no backup leak, guard-observed clone
removal, clean canary scans, and clean fixture teardown.
- Final forced-legacy success proof on exact production parent
`d76f1647` plus signed workflow-only child `078a372d`: [run
28603335692](https://github.com/NVIDIA/NemoClaw/actions/runs/28603335692)
passed 9/9 assertions. Artifact `8044550700`
(`sha256:56d97c5caa8536ebff735ff600399ac7fafa2fed71206de1f5470e7d15549f6f`)
proves `gpuRoute=legacy-patch`, `--device nvidia.com/gpu=all`,
Ready/CUDA with all three GPU probes, correct OpenShell PID 1/command
envelope, one container/no backup, integrity and negative guard checks,
two authenticated inference requests with zero forbidden-marker matches,
a clean artifact canary scan, and clean teardown.

Source-of-truth review for the retained compatibility path:

- **Invalid state:** the legacy swap temporarily leaves a stopped backup
and running clone with the same OpenShell sandbox ID.
- **Source boundary:** OpenShell's Docker driver reconciles container
summaries into a map keyed only by sandbox ID; 0.0.71 can let the
stopped backup overwrite the running clone and drive the gateway into
terminal `Error`.
- **Source-fix constraint:** the NemoClaw-supported OpenShell release
does not contain deterministic active-container selection. The focused
source fix is open as
[NVIDIA/OpenShell#2116](https://github.com/NVIDIA/OpenShell/pull/2116)
and passes 96/96 Docker-driver tests, strict clippy, and formatting, but
is not yet released or pinned here.
- **Regression coverage:** routing tests pin native auto / forced legacy
/ WSL / Jetson behavior; recreate tests pin capture-before-finalize on
both create timing paths; the secret canary uses the actual single
`OPENSHELL_SANDBOX_COMMAND=env ...` envelope; the live test pins native
and legacy runtime topology separately.
- **Removal condition:** remove the legacy swap and its
rollback/diagnostic modules after WSL and Jetson are proven on native
OpenShell GPU injection and the supported OpenShell floor contains
deterministic duplicate-container reconciliation.
- **WSL boundary:** Docker Desktop WSL does not expose a usable native
CDI route to this flow, so WSL retains the compatibility path and
ignores `NEMOCLAW_DOCKER_GPU_PATCH=0`; routing tests lock that behavior.
Remove it when Docker Desktop exposes usable `nvidia.com/gpu` CDI
devices to the WSL distro.
- **Jetson boundary:** Tegra `/dev/nvmap` and `/dev/nvhost-*` device
ownership requires host group propagation for the non-root sandbox user;
group-add tests lock that behavior. Remove it only when the
platform/runtime supplies equivalent access without the compatibility
recreate.
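
The invalid state and source boundary described above can be illustrated with a short sketch. It is written in TypeScript for consistency with this repository, whereas the OpenShell driver itself is Rust, so it is not the driver's code. `ContainerSummary`, `reconcileNaive`, and `reconcilePreferRunning` are hypothetical names, and the "running wins" rule is only one plausible form of deterministic active-container selection; the fix proposed in NVIDIA/OpenShell#2116 may choose differently.

```typescript
// Hypothetical shape of a container summary; field names are assumptions.
interface ContainerSummary {
  id: string;
  sandboxId: string;
  running: boolean;
}

// Last write wins when keyed only by sandbox ID: a stopped backup that is
// listed after the running clone silently replaces it.
function reconcileNaive(summaries: ContainerSummary[]): Map<string, ContainerSummary> {
  const bySandbox = new Map<string, ContainerSummary>();
  for (const summary of summaries) {
    bySandbox.set(summary.sandboxId, summary);
  }
  return bySandbox;
}

// Deterministic alternative: a running container always beats a stopped one,
// and the first entry seen wins among containers in the same state.
function reconcilePreferRunning(summaries: ContainerSummary[]): Map<string, ContainerSummary> {
  const bySandbox = new Map<string, ContainerSummary>();
  for (const summary of summaries) {
    const current = bySandbox.get(summary.sandboxId);
    if (current === undefined || (!current.running && summary.running)) {
      bySandbox.set(summary.sandboxId, summary);
    }
  }
  return bySandbox;
}
```

With a running clone and a stopped backup sharing one sandbox ID, the naive map depends on listing order, while the second version reports the running clone regardless of order.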

Diagnostic redaction boundary:

- **Source-of-truth invariant:** `collectDockerGpuPatchDiagnostics()`
constructs the trusted per-bundle redactor, discovers conventional and
custom-placeholder values from every known/discovered full inspect
before writing, recursively redacts JSON values, and publishes every
summary, Docker, OpenShell, and pre-rollback top artifact through that
boundary.
- **Bounded pre-rollback path:** the caller contributes only additive
values discovered from the failed clone before snapshot capture so the
shared 10-second budget cannot hide opaque values; the collector still
performs its own discovery and owns every write. Direct raw-caller and
exhausted-budget regressions scan all artifacts and returned summaries.
- **Removal condition:** remove the additive pre-rollback discovery only
when the shared collector can own snapshot capture inside the same
budget without delaying rollback.
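
As a rough illustration of the discover-then-redact invariant, the sketch below finds secret values first and then rewrites every string in a JSON value. It is not `collectDockerGpuPatchDiagnostics()` itself. The function names, the `[REDACTED]` marker, and the rule that a "conventional" secret is any environment entry whose name ends in `_KEY` are all assumptions made for the example.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const REDACTED = "[REDACTED]"; // assumed marker, not taken from the PR

// Assumption: conventional secrets are NAME=value entries whose name ends in _KEY.
function discoverSecretValues(envEntries: readonly string[]): string[] {
  const values: string[] = [];
  for (const entry of envEntries) {
    const eq = entry.indexOf("=");
    if (eq <= 0) continue; // skip malformed entries and empty names
    const name = entry.slice(0, eq);
    const value = entry.slice(eq + 1);
    if (name.slice(-4) === "_KEY" && value.length > 0) values.push(value);
  }
  return values;
}

// Replace every occurrence of each secret, longest first, so that a secret
// containing a shorter secret is never left partially visible.
function redactString(text: string, secrets: readonly string[]): string {
  const ordered = secrets.slice().sort((a, b) => b.length - a.length);
  let out = text;
  for (const secret of ordered) {
    if (secret.length > 0) out = out.split(secret).join(REDACTED);
  }
  return out;
}

// Walk arrays and objects recursively; only string values are rewritten,
// object keys are copied unchanged.
function redactJson(value: Json, secrets: readonly string[]): Json {
  if (typeof value === "string") return redactString(value, secrets);
  if (Array.isArray(value)) return value.map((item) => redactJson(item, secrets));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) out[key] = redactJson(value[key], secrets);
    return out;
  }
  return value;
}
```

Running discovery over the full inspect before any write matters because a value found in one field (for example an environment entry) can reappear verbatim in another, such as a process argument list.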

Advisor architecture and follow-up rationale:

- `docker-gpu-patch.ts` grows 58 lines to keep token validation before
container mutation and bounded failed-clone capture before rollback.
Splitting this security-critical ordering during the release-blocker fix
would add cross-module state transfer; extract it when the legacy swap
is retired after WSL and Jetson native proof.
- `docker-gpu-local-inference.test.ts` grows 32 lines so bridge-probe
routing assertions stay beside the behavior under test. Extract a
bridge-probe module and focused test file if that surface grows again.
- `docker-gpu-local-inference.ts` grows 15 lines to keep the
bridge-probe/host-network decision beside its caller-facing contract;
extract it with the tests if that surface grows again.
- `docker-gpu-patch.test.ts` grows 13 lines and remains below 1,350
lines; split the mode-routing cases on the next growth.
- The live fixture now supplies the canonical comma-delimited
placeholder transport. Whitespace compatibility remains covered by
parser unit tests and the messaging-provider scenario; the live proof
intentionally matches the exact canonical startup environ token.
- Dedicated `OPENSHELL_TLS_KEY` tests prove exact runtime acceptance,
arbitrary/PEM/relative/near-miss rejection, persisted `.env` rejection,
and continued rejection of supervisor identity tokens. The exact allowed
path is sourced from `NVIDIA/OpenShell@v0.0.71`
(`a242f84bb367d6df7d4d133e95a93857406c67f7`), where
`driver_utils.rs::TLS_KEY_MOUNT_PATH` defines
`/etc/openshell/tls/client/tls.key` and the Docker/Podman drivers inject
it.
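
The exact-path allowance described in the last item can be sketched as below. Only the path `/etc/openshell/tls/client/tls.key` is taken from the text above; `EnvSource` and `isAllowedOpenShellTlsKey` are hypothetical names, and the real boundary check in this PR may be structured differently.

```typescript
// Path quoted in the PR description (TLS_KEY_MOUNT_PATH in OpenShell v0.0.71).
const OPENSHELL_TLS_KEY_PATH = "/etc/openshell/tls/client/tls.key";

// Hypothetical distinction between the live runtime environment and a persisted .env file.
type EnvSource = "runtime" | "persisted-dotenv";

// Accept OPENSHELL_TLS_KEY only when it is the exact canonical path and comes
// from the runtime environment. Strict equality rejects relative paths,
// near-misses, and inline PEM material without any normalization step.
function isAllowedOpenShellTlsKey(value: string, source: EnvSource): boolean {
  return source === "runtime" && value === OPENSHELL_TLS_KEY_PATH;
}
```

Strict string equality is deliberate in this sketch: normalizing the value first (trimming, resolving `..`) would widen the set of accepted inputs.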

This PR does not claim #6110 resolved until the reporter-class DGX Spark
aarch64 or DGX Station GB300 NVIDIA Endpoints path passes. No such
runner is declared in this repository, and organization runner inventory
is not visible with the current permissions. Missing reporter hardware
is an external acceptance blocker, not a passing result.

The #6155 docs regression fix and current `main` through `9fe45362` are
integrated. A refreshed pairwise merge-tree audit at `d76f1647` is clean
with #5595 and #6153. #5876 directly conflicts in `e2e.yaml` and related
Hermes/docs/workflow-boundary files; its resolution must union
`hermes-gpu-startup` and `mcp-bridge-dev` selectors/result summaries and
recompute uploader validation. #6020 already conflicts with current
`main` and also overlaps #6142 outside `e2e.yaml`; #6053 is mergeable
with `main` but conflicts pairwise in the uploader boundary. Those later
branches must preserve #6142's explicit-only inventory and artifact
contracts during retargeting; #6142 itself remains mergeable/CLEAN.

---
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Strengthened Hermes GPU startup proof and managed startup integrity
    assertions, plus added a skipped-by-default GPU live E2E run for startup
    readiness.
* **Bug Fixes**
  * Tightened PID 1 identity validation to block runtime mutation under a
    foreign PID 1.
  * Improved Docker GPU/OpenShell sandbox command and placeholder
    handling; enhanced GPU failure diagnostics with safer redaction.
* **Documentation**
  * Refined GPU passthrough and `NEMOCLAW_DOCKER_GPU_PATCH` guidance
    across native Linux, Docker Desktop WSL, and Jetson/Tegra.
* **Tests**
  * Expanded unit/E2E coverage for placeholder parsing, readiness refusal,
    env/secret boundary enforcement, and diagnostic redaction.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
2026-07-02 09:50:03 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| advisors | ci(pr-advisor): add Nemotron Ultra lane (#5873) | 2026-06-26 11:47:48 -07:00 |
| e2e | fix(gpu): prefer native OpenShell injection (#6142) | 2026-07-02 09:50:03 -07:00 |
| e2e-advisor | test(e2e): retire legacy shell lanes (#5756) | 2026-06-29 22:32:24 -05:00 |
| pr-review-advisor | test(e2e): retire legacy shell lanes (#5756) | 2026-06-29 22:32:24 -05:00 |