mirror of
https://github.com/NVIDIA/NemoClaw.git
synced 2026-07-03 03:37:16 +00:00
<!-- markdownlint-disable MD041 --> ## Summary Adds a supported host-mediated `gateway restart` command for NemoClaw-managed OpenClaw and Hermes gateways and routes automatic recovery through the same topology-specific authenticated controller or supervisor path. The lifecycle acts only on proven process and listener identity, applies configuration and state changes transactionally, and fails closed when privileged execution or built-in health probes are ambiguous. ## Related Issue Fixes #2426 Fixes #5253 Supersedes #5416 ## Changes - Added `sandbox:gateway:restart` plus public `nemoclaw <name> gateway restart` and `nemohermes <name> gateway restart` routing while keeping healthy `recover` idempotent. - Added managed gateway control for both supported topologies: a root PID 1 request channel for direct-root entrypoints and an authenticated root-owned controller paired with the nonroot supervisor in OpenShell-managed sandboxes. - Enforced exact PID and process-start identity checks, listener ownership validation, post-stop absence proof, exact reaping or respawn proof, health waits, and port-forward recovery. - Added descriptor-based, no-follow, atomic configuration and state guards, including bounded Hermes secret-boundary and configuration-hash validation. - Serialized shields mutations and recovery timers, and ordered snapshot, destroy, and inference transitions so stale callbacks cannot reapply state. - Made built-in OpenClaw and Hermes probes fail closed after trusted execution failure or timeout; unsupported privileged-exec drivers and ambiguous container selection are rejected while explicit custom gateway agents retain their compatibility path. - Wired runtime helpers into optimized build contexts with root-owned, non-writable permissions and preserved root PID 1 access to mutable sandbox-group state when hardened runtimes drop `CAP_DAC_OVERRIDE`. - Published a Trusted Computing Base page in both guide variants covering the direct-root and OpenShell-managed boundaries, shared-UID and mutable-config limits, JSON5 read compatibility, root group membership, and compatibility-removal conditions. - Restricted managed-controller procfs and filesystem overrides to source execution with the explicit test flag, and verified the pinned Node/JSON5 installation is root-owned and non-writable. - Added final-image proof for both built-in images covering root-only helper modes, required supplementary groups, root probe access, and sandbox-user refusal; made sandbox-operations assertions topology-aware for direct-root and OpenShell-managed runs. - Excluded authenticated, exact-identity Hermes controller replacements from crash quarantine with a root-only `0600` lifecycle lock and root-owned `0444` authorization bound to the live root controller; all orphaned, mismatched, unexpected, and failed replacement exits still count. - Tightened the OpenClaw image health fallback to read the tracked PID start identity before and after an exact installed gateway command line, while retaining known rewritten process-title compatibility. - Preserved the injected managed supervisor through the post-restart settle probe and locked the built-in completion protocol to `GATEWAY_PID=` rather than accepting the legacy custom-script `ALREADY_RUNNING` marker. - Migrated the branch's live recovery coverage onto the typed E2E workflow and removed the legacy shell lanes deleted on `main`. - Updated lifecycle, runtime-control, command, troubleshooting, security, and Hermes documentation. - Review scope: the existing OpenShell bridge-host allowance is preserved while adjacent private or internal endpoint shapes are DNS-validated and rejected. Broad module splitting and consumer-wide HTTPS binding remain separate architecture work rather than widening this security-sensitive recovery fix. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [x] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Quality Gates - [x] Tests added or updated for changed behavior - [ ] Existing tests cover changed behavior — justification: - [ ] Tests not applicable — justification: - [x] Docs updated for user-facing behavior changes - [ ] Docs not applicable — justification: - [x] Sensitive paths changed (security, policy, credentials, preflight, onboarding, inference, runner, sandbox, or messaging) - [ ] Sensitive-path review completed or maintainer-approved waiver recorded — reviewer/approval link/justification: Independent security review found no blocking findings across privilege, process identity, filesystem, network, fail-closed, and regression-coverage boundaries; human security acceptance remains pending and required before merge. - [ ] Non-success, skipped, or missing CI check accepted by maintainer — check name, approval link, and follow-up issue: ## Verification - [x] PR description includes the DCO sign-off declaration and every commit appears as `Verified` in GitHub - [ ] Git hooks passed during commit and push, or `npx prek run --from-ref main --to-ref HEAD` passes - [x] Targeted tests pass for changed behavior - [ ] Full `npm test` passes (broad runtime changes only) - [x] Quality Gates section completed with required justifications or waivers - [x] No secrets, API keys, or credentials committed - [ ] `npm run docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [x] New doc pages include SPDX header and frontmatter (new pages only) Exact-head evidence for `8dbe420d3d835546683cbfa67fec0537791ad7c1`: - Signed, DCO-compliant commit `d86c60c325a3dbc74fc42f74447bc00f89be4400` addresses the current CodeRabbit correctness and documentation findings. Signed merge commit `c9061c68fd82b1b73044383bc4e8962808644744` incorporates `main` at `e10462ff3e2e0727350a2532fc7bb7edc64116b2` without rebasing or rewriting the verified history. Signed cleanup commit `8dbe420d3d835546683cbfa67fec0537791ad7c1` addresses the resulting CodeQL/code-quality diagnostics. - Post-merge controller, trust-contract, provisioning, managed-exit authorization, OpenClaw/Hermes config-guard, state-dir guard, endpoint-security, restart/boundary, and workflow-boundary suites pass: 15 files, 258 tests. - `npm run build:cli`, `npm run typecheck:cli`, Biome, ShellCheck, Python compilation, branch diff checks, test-size/project/title guards, and `npm run docs:check-agent-variants` pass. - `npm run docs` completes with 0 errors and two existing Fern warnings; the warning-free checkbox remains unchecked. - All 80 PR commits appear as GitHub `Verified`; all three new commits contain the DCO sign-off declaration. - The review-fix commit's normal formatting, lint, repository, secret-scan, commitlint, source-shape, and test-size hooks pass. The normal pre-push CLI typecheck and version-sync hooks pass. The full local `test-cli` hook and full-`npm test` boxes remain unchecked; exact-head CI is required for those broader platform lanes. - Standard CI, automated current-head review, human security acceptance, and exact-head E2E evidence are pending on this head and must settle before merge. --- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a `gateway restart` command for sandboxes with forced reload and post-restart health/forward verification. * **Bug Fixes** * Improved recovery so host forwards and supervised gateway processes are repaired more reliably. * Strengthened failure handling to be fail-closed when required Hermes secret-boundary validation or supervisor control is unavailable. * Corrected forward-health classification so stopped/occupied forwards are handled consistently. * **Documentation** * Expanded lifecycle, runtime-controls, command reference, and troubleshooting guidance for `recover`, `gateway restart`, shields windows, and Hermes recovery constraints. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> |
||
|---|---|---|
| .. | ||
| direct-credential-env.ts | ||
| layer-import-boundaries.ts | ||
| no-coverage-ignore.ts | ||
| no-test-dist-imports.ts | ||
| run.ts | ||
| test-title-style.ts | ||
| vitest-project-overlap.ts | ||