NemoClaw/scripts/checks
Aaron Erickson 🦞 0cfaf8fbf3
fix(sandbox): add host-mediated gateway restart (#5874)
## Summary
Adds a supported host-mediated `gateway restart` command for
NemoClaw-managed OpenClaw and Hermes gateways and routes automatic
recovery through the same topology-specific authenticated controller or
supervisor path. The lifecycle acts only on proven process and listener
identity, applies configuration and state changes transactionally, and
fails closed when privileged execution or built-in health probes are
ambiguous.
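
The fail-closed rule in the summary can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the code in this change: the `ProbeResult` and `RecoveryDecision` types and the `decideRecovery` name are hypothetical, and mapping an unhealthy probe to a restart is an assumption made for the example. The point it shows is that a probe that could not run or timed out is never treated as healthy and never triggers an action on its own.

```typescript
// Hypothetical types for illustration only; not NemoClaw's actual API.
type ProbeResult =
  | { kind: "healthy" }
  | { kind: "unhealthy"; detail: string }
  | { kind: "exec-failed"; detail: string }
  | { kind: "timeout"; afterMs: number };

type RecoveryDecision =
  | { action: "none" }
  | { action: "restart" }
  | { action: "refuse"; reason: string };

// Only a definite probe answer leads to "none" or "restart".
// Ambiguous outcomes (the probe could not run, or timed out) fail closed.
function decideRecovery(probe: ProbeResult): RecoveryDecision {
  switch (probe.kind) {
    case "healthy":
      return { action: "none" };
    case "unhealthy":
      return { action: "restart" };
    case "exec-failed":
      return { action: "refuse", reason: `probe could not run: ${probe.detail}` };
    case "timeout":
      return { action: "refuse", reason: `probe timed out after ${probe.afterMs} ms` };
  }
}

// Usage: a timed-out probe is refused rather than assumed healthy.
const decision = decideRecovery({ kind: "timeout", afterMs: 5000 });
console.log(decision.action);
```

Modelling the ambiguous outcomes as their own variants, rather than folding them into a boolean, lets the compiler check that every case is handled explicitly.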

## Related Issue
Fixes #2426
Fixes #5253
Supersedes #5416

## Changes
- Added `sandbox:gateway:restart` plus public `nemoclaw <name> gateway
restart` and `nemohermes <name> gateway restart` routing while keeping
healthy `recover` idempotent.
- Added managed gateway control for both supported topologies: a root
PID 1 request channel for direct-root entrypoints and an authenticated
root-owned controller paired with the nonroot supervisor in
OpenShell-managed sandboxes.
- Enforced exact PID and process-start identity checks, listener
ownership validation, post-stop absence proof, exact reaping or respawn
proof, health waits, and port-forward recovery.
- Added descriptor-based, no-follow, atomic configuration and state
guards, including bounded Hermes secret-boundary and configuration-hash
validation.
- Serialized shields mutations and recovery timers, and ordered
snapshot, destroy, and inference transitions so stale callbacks cannot
reapply state.
- Made built-in OpenClaw and Hermes probes fail closed after trusted
execution failure or timeout; unsupported privileged-exec drivers and
ambiguous container selection are rejected while explicit custom gateway
agents retain their compatibility path.
- Wired runtime helpers into optimized build contexts with root-owned,
non-writable permissions and preserved root PID 1 access to mutable
sandbox-group state when hardened runtimes drop `CAP_DAC_OVERRIDE`.
- Published a Trusted Computing Base page in both guide variants
covering the direct-root and OpenShell-managed boundaries, shared-UID
and mutable-config limits, JSON5 read compatibility, root group
membership, and compatibility-removal conditions.
- Restricted managed-controller procfs and filesystem overrides to
source execution with the explicit test flag, and verified the pinned
Node/JSON5 installation is root-owned and non-writable.
- Added final-image proof for both built-in images covering root-only
helper modes, required supplementary groups, root probe access, and
sandbox-user refusal; made sandbox-operations assertions topology-aware
for direct-root and OpenShell-managed runs.
- Excluded authenticated, exact-identity Hermes controller replacements
from crash quarantine with a root-only `0600` lifecycle lock and
root-owned `0444` authorization bound to the live root controller; all
orphaned, mismatched, unexpected, and failed replacement exits still
count.
- Tightened the OpenClaw image health fallback to read the tracked PID's
start identity before and after matching the exact installed gateway
command line, while retaining known rewritten process-title compatibility.
- Preserved the injected managed supervisor through the post-restart
settle probe and locked the built-in completion protocol to
`GATEWAY_PID=` rather than accepting the legacy custom-script
`ALREADY_RUNNING` marker.
- Migrated the branch's live recovery coverage onto the typed E2E
workflow and removed the legacy shell lanes deleted on `main`.
- Updated lifecycle, runtime-control, command, troubleshooting,
security, and Hermes documentation.
- Review scope: the existing OpenShell bridge-host allowance is
preserved while adjacent private or internal endpoint shapes are
DNS-validated and rejected. Broad module splitting and consumer-wide
HTTPS binding remain separate architecture work rather than widening
this security-sensitive recovery fix.
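
The "exact PID and process-start identity" check in the list above guards against PID reuse: a PID alone can be recycled by an unrelated process, but a PID paired with its kernel start time cannot. The sketch below is an illustration under assumptions, not the implementation in this change. It assumes Linux, where field 22 of `/proc/<pid>/stat` is the process start time in clock ticks, and the names `ProcessIdentity`, `parseProcStat`, and `isSameProcess` are hypothetical.

```typescript
// Hypothetical helper names for illustration; not NemoClaw's actual API.
interface ProcessIdentity {
  pid: number;
  startTicks: string; // kept as a string to avoid precision loss
}

// Parse one line of /proc/<pid>/stat. The command name (field 2) is wrapped
// in parentheses and may itself contain spaces or ")", so the remaining
// fields are split only after the LAST ")".
function parseProcStat(statLine: string): ProcessIdentity | null {
  const open = statLine.indexOf("(");
  const close = statLine.lastIndexOf(")");
  if (open < 0 || close < open) return null;

  const pid = Number(statLine.slice(0, open).trim());
  // rest[0] is field 3 (state), so field 22 (starttime) is rest[19].
  const rest = statLine.slice(close + 1).trim().split(/\s+/);
  const startTicks = rest[19];

  if (!Number.isInteger(pid) || pid <= 0) return null;
  if (startTicks === undefined || !/^\d+$/.test(startTicks)) return null;
  return { pid, startTicks };
}

// A process counts as "the same" only if both PID and start time match.
// A missing or unparseable observation is treated as a mismatch.
function isSameProcess(
  recorded: ProcessIdentity,
  observed: ProcessIdentity | null,
): boolean {
  return (
    observed !== null &&
    observed.pid === recorded.pid &&
    observed.startTicks === recorded.startTicks
  );
}

// Usage with a made-up stat line whose command name contains ")" and a space.
const sampleStat =
  "1234 (gate) way) S 1 1234 1234 0 -1 4194304 100 0 0 0 5 3 0 0 20 0 1 0 987654 1000000";
const recorded: ProcessIdentity = { pid: 1234, startTicks: "987654" };
console.log(isSameProcess(recorded, parseProcStat(sampleStat)));
```

Re-reading the identity immediately before acting, and again afterwards, narrows the window in which a recycled PID could be signalled by mistake.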

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [x] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Quality Gates
- [x] Tests added or updated for changed behavior
- [ ] Existing tests cover changed behavior — justification:
- [ ] Tests not applicable — justification:
- [x] Docs updated for user-facing behavior changes
- [ ] Docs not applicable — justification:
- [x] Sensitive paths changed (security, policy, credentials, preflight,
onboarding, inference, runner, sandbox, or messaging)
- [ ] Sensitive-path review completed or maintainer-approved waiver
recorded — reviewer/approval link/justification: Independent security
review found no blocking findings across privilege, process identity,
filesystem, network, fail-closed, and regression-coverage boundaries;
human security acceptance remains pending and required before merge.
- [ ] Non-success, skipped, or missing CI check accepted by maintainer —
check name, approval link, and follow-up issue:

## Verification
- [x] PR description includes the DCO sign-off declaration and every
commit appears as `Verified` in GitHub
- [ ] Git hooks passed during commit and push, or `npx prek run
--from-ref main --to-ref HEAD` passes
- [x] Targeted tests pass for changed behavior
- [ ] Full `npm test` passes (broad runtime changes only)
- [x] Quality Gates section completed with required justifications or
waivers
- [x] No secrets, API keys, or credentials committed
- [ ] `npm run docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [x] New doc pages include SPDX header and frontmatter (new pages only)

Exact-head evidence for `8dbe420d3d835546683cbfa67fec0537791ad7c1`:

- Signed, DCO-compliant commit
`d86c60c325a3dbc74fc42f74447bc00f89be4400` addresses the current
CodeRabbit correctness and documentation findings. Signed merge commit
`c9061c68fd82b1b73044383bc4e8962808644744` incorporates `main` at
`e10462ff3e2e0727350a2532fc7bb7edc64116b2` without rebasing or rewriting
the verified history. Signed cleanup commit
`8dbe420d3d835546683cbfa67fec0537791ad7c1` addresses the resulting
CodeQL/code-quality diagnostics.
- Post-merge controller, trust-contract, provisioning, managed-exit
authorization, OpenClaw/Hermes config-guard, state-dir guard,
endpoint-security, restart/boundary, and workflow-boundary suites pass:
15 files, 258 tests.
- `npm run build:cli`, `npm run typecheck:cli`, Biome, ShellCheck,
Python compilation, branch diff checks, test-size/project/title guards,
and `npm run docs:check-agent-variants` pass.
- `npm run docs` completes with 0 errors and two existing Fern warnings;
the warning-free checkbox remains unchecked.
- All 80 PR commits appear as GitHub `Verified`; all three new commits
contain the DCO sign-off declaration.
- The review-fix commit's normal formatting, lint, repository,
secret-scan, commitlint, source-shape, and test-size hooks pass. The
normal pre-push CLI typecheck and version-sync hooks pass. The full
local `test-cli` hook and full-`npm test` boxes remain unchecked;
exact-head CI is required for those broader platform lanes.
- Standard CI, automated current-head review, human security acceptance,
and exact-head E2E evidence are pending on this head and must settle
before merge.

---
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>


## Summary by CodeRabbit

* **New Features**
  * Added a `gateway restart` command for sandboxes with forced reload and
    post-restart health/forward verification.

* **Bug Fixes**
  * Improved recovery so host forwards and supervised gateway processes
    are repaired more reliably.
  * Strengthened failure handling to be fail-closed when required Hermes
    secret-boundary validation or supervisor control is unavailable.
  * Corrected forward-health classification so stopped/occupied forwards
    are handled consistently.

* **Documentation**
  * Expanded lifecycle, runtime-controls, command reference, and
    troubleshooting guidance for `recover`, `gateway restart`, shields
    windows, and Hermes recovery constraints.

2026-06-30 14:46:13 -04:00

| File | Last commit | Date |
| --- | --- | --- |
| `direct-credential-env.ts` | fix: address issue #5667 (#5672) | 2026-06-24 16:21:52 -07:00 |
| `layer-import-boundaries.ts` | test(scanner): catch source-shape assertions (#4138) | 2026-05-24 16:54:41 -07:00 |
| `no-coverage-ignore.ts` | chore(tooling): enforce honest coverage reporting (#3154) | 2026-05-08 08:48:02 -07:00 |
| `no-test-dist-imports.ts` | fix(sandbox): add host-mediated gateway restart (#5874) | 2026-06-30 14:46:13 -04:00 |
| `run.ts` | test: make test titles behavior-oriented (#5918) | 2026-06-27 19:34:28 -07:00 |
| `test-title-style.ts` | test: make test titles behavior-oriented (#5918) | 2026-06-27 19:34:28 -07:00 |
| `vitest-project-overlap.ts` | test(e2e): retire legacy shell lanes (#5756) | 2026-06-29 22:32:24 -05:00 |