Runtime isolation
Denia runs workloads under its own Linux runtime — not Docker, containerd, or
runc. The mechanics live in src/runtime/ (the LinuxRuntime) and src/syscall/
(thin rustix wrappers). See ADR-003,
ADR-005, ADR-019, and
ADR-026.
Namespaces
Each replica is launched via unshare into private namespaces:
| Namespace | Effect |
|---|---|
| user | Workload uid 0 maps to host userns_base (default 100000), an unprivileged uid |
| pid | Workload sees only its own process tree |
| mount | Private rootfs; no host filesystem bleed |
| uts | Own hostname |
| ipc | Private SysV IPC |
The network namespace is intentionally not unshared in v1 — workloads share the host network and are reached over a private Unix socket, not a port.
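The namespace set above can be sketched as a flag mask plus a uid map line. This is only an illustration: the constants are the Linux `CLONE_NEW*` values from `<linux/sched.h>`, while the function names and the 65536-uid range length are assumptions, not names taken from src/runtime/. Whether the runtime passes all five flags in one unshare call is not stated here.

```rust
// Namespace flags accepted by unshare(2)/clone(2), values from <linux/sched.h>.
pub const CLONE_NEWNS: u64 = 0x0002_0000; // mount
pub const CLONE_NEWUTS: u64 = 0x0400_0000; // hostname
pub const CLONE_NEWIPC: u64 = 0x0800_0000; // SysV IPC
pub const CLONE_NEWUSER: u64 = 0x1000_0000; // uid/gid mapping
pub const CLONE_NEWPID: u64 = 0x2000_0000; // process tree
pub const CLONE_NEWNET: u64 = 0x4000_0000; // network (deliberately unused in v1)

/// Mask for the five namespaces in the table; the network namespace stays shared.
/// Hypothetical helper, not the runtime's actual API.
pub fn replica_unshare_flags() -> u64 {
    CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
}

/// One line of /proc/<pid>/uid_map: "<inside-start> <outside-start> <length>".
/// Maps workload uid 0 to `userns_base` on the host. `range_len` is an assumed
/// parameter; the real range size is not given in this section.
pub fn uid_map_line(userns_base: u32, range_len: u32) -> String {
    format!("0 {} {}\n", userns_base, range_len)
}
```

With the default userns_base, `uid_map_line(100000, 65536)` yields `0 100000 65536`, so a process that is uid 0 inside the workload appears as uid 100000 on the host.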
Overlay rootfs (per replica)
Each replica boots a private OverlayFS rootfs:
- lower — the immutable base image rootfs (from an OCI pull or a BuildKit build).
- upper — a per-replica mutable layer (writes stay isolated per replica).
- work / merged — overlay scratch + the guest-visible unified root.
The child pivot_roots into the merged root. Because mounting overlay inside an
unprivileged userns is fragile, the privileged overlay mount is performed
before the user-namespace unshare (ADR-026). The
rootfs is chown-mapped to userns_base so uid 0 inside maps correctly.
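As a rough sketch of how the four overlay directories fit together, the helpers below lay out a per-replica directory tree and build the option string an overlay mount expects. The on-disk layout, the `ReplicaDirs` type, and both function names are hypothetical; only the `lowerdir`/`upperdir`/`workdir` option keys come from OverlayFS itself.

```rust
use std::path::{Path, PathBuf};

/// Per-replica overlay directories (hypothetical layout for illustration).
pub struct ReplicaDirs {
    pub upper: PathBuf,  // mutable layer, private to this replica
    pub work: PathBuf,   // overlay scratch space
    pub merged: PathBuf, // unified root the child pivot_roots into
}

/// Derive the directories for one replica under an assumed state root.
pub fn replica_dirs(state_root: &Path, service_id: &str, replica_index: u32) -> ReplicaDirs {
    let base = state_root.join(service_id).join(replica_index.to_string());
    ReplicaDirs {
        upper: base.join("upper"),
        work: base.join("work"),
        merged: base.join("merged"),
    }
}

/// Build the data string for an overlay mount with a single lower layer.
/// Paths containing ',' or ':' would need escaping, which this sketch skips.
pub fn overlay_mount_options(lower: &Path, dirs: &ReplicaDirs) -> String {
    format!(
        "lowerdir={},upperdir={},workdir={}",
        lower.display(),
        dirs.upper.display(),
        dirs.work.display()
    )
}
```

The resulting string would be passed as the mount data with `dirs.merged` as the target, while the process still holds the privilege to mount, that is, before the user-namespace unshare described above.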
Capabilities & no_new_privs
- The daemon runs as the unprivileged denia user with a tight set: CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, CAP_SETUID, CAP_SETGID.
- CAP_SYS_ADMIN is needed for unshare/mount/cgroup delegation during setup, then the capability bounding set is dropped in the child.
- PR_SET_NO_NEW_PRIVS is set before exec, preventing setuid/file-capability escalation.
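One way to audit the capability set is to compare it with the hex masks the kernel reports in /proc/<pid>/status. The sketch below computes the mask for the four capabilities listed above and parses a field such as `CapEff`. The capability numbers are the standard Linux values; the helper names are illustrative and not part of src/syscall/.

```rust
// Capability numbers from <linux/capability.h>.
pub const CAP_SETGID: u32 = 6;
pub const CAP_SETUID: u32 = 7;
pub const CAP_NET_BIND_SERVICE: u32 = 10;
pub const CAP_SYS_ADMIN: u32 = 21;

/// The daemon's capability set as listed in this section.
pub const DAEMON_CAPS: [u32; 4] = [
    CAP_NET_BIND_SERVICE,
    CAP_SYS_ADMIN,
    CAP_SETUID,
    CAP_SETGID,
];

/// Fold capability numbers into the bitmask format used by /proc/<pid>/status.
pub fn cap_mask(caps: &[u32]) -> u64 {
    caps.iter().fold(0u64, |mask, &cap| mask | (1u64 << cap))
}

/// Extract a hex capability field (for example "CapEff" or "CapBnd") from the
/// text of /proc/<pid>/status. Returns None if the field is absent or malformed.
pub fn parse_cap_field(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let rest = line.strip_prefix(field)?.strip_prefix(':')?;
        u64::from_str_radix(rest.trim(), 16).ok()
    })
}
```

If all four capabilities are effective, the daemon's `CapEff` field would read `00000000002004c0`. The same file exposes a `NoNewPrivs` field that can be checked for a workload process.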
Cgroup v2 limits
Each replica gets a cgroup under DENIA_CGROUP_ROOT
(/sys/fs/cgroup/denia/<service_id>/<replica_index>) with controllers driven from
the service's resource limits:
- cpu.max from cpu_millis
- memory.max from memory_bytes (hard limit; OOM-kill on breach)
- pids.max / io as applicable
These same controllers feed metrics.
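To make the limit mapping concrete, here is a minimal sketch of how cpu_millis and memory_bytes could translate into cgroup v2 file contents. It assumes that 1000 cpu_millis equals one full CPU and that the kernel's default 100000 µs period is used; the actual conversion in the runtime may differ, and all function names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Default cpu.max period in microseconds (the kernel default).
pub const CPU_PERIOD_US: u64 = 100_000;

/// Cgroup directory for one replica: <root>/<service_id>/<replica_index>.
pub fn replica_cgroup(root: &Path, service_id: &str, replica_index: u32) -> PathBuf {
    root.join(service_id).join(replica_index.to_string())
}

/// cpu.max content "<quota> <period>", assuming 1000 millis = one CPU.
pub fn cpu_max(cpu_millis: u64) -> String {
    let quota_us = cpu_millis * CPU_PERIOD_US / 1000;
    format!("{} {}", quota_us, CPU_PERIOD_US)
}

/// (file, content) pairs a runtime would write to apply the two main limits.
pub fn limit_writes(cgroup: &Path, cpu_millis: u64, memory_bytes: u64) -> Vec<(PathBuf, String)> {
    vec![
        (cgroup.join("cpu.max"), cpu_max(cpu_millis)),
        (cgroup.join("memory.max"), memory_bytes.to_string()),
    ]
}
```

Under these assumptions a service limited to 500 cpu_millis gets `50000 100000` in cpu.max, meaning half a CPU per 100 ms period.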
Lifecycle
Workload lifecycle is bound to the daemon (ADR-027): the daemon stops all workloads on shutdown and autostarts the promoted deployments on boot. The parent reaps children (SIGTERM, then SIGKILL after a timeout).
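The SIGTERM-then-SIGKILL sequence can be sketched with the standard library alone. This is an approximation under stated assumptions: the real runtime presumably signals through its rustix wrappers, whereas this example shells out to the `kill` builtin of `sh` because std has no API for sending SIGTERM. `stop_child` and `StopOutcome` are hypothetical names.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// How the child ended up stopping.
#[derive(Debug)]
pub enum StopOutcome {
    /// Exited within the grace period after SIGTERM.
    Graceful(ExitStatus),
    /// Still alive after the grace period, so SIGKILL was sent.
    Killed(ExitStatus),
}

/// Send SIGTERM, wait up to `grace`, then fall back to SIGKILL and reap.
pub fn stop_child(child: &mut Child, grace: Duration) -> io::Result<StopOutcome> {
    // Use the shell's builtin kill; its exit status is ignored because the
    // child may already have exited on its own.
    let _ = Command::new("sh")
        .arg("-c")
        .arg(format!("kill -TERM {}", child.id()))
        .status()?;

    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        // try_wait reaps the child if it has exited, without blocking.
        if let Some(status) = child.try_wait()? {
            return Ok(StopOutcome::Graceful(status));
        }
        sleep(Duration::from_millis(20));
    }

    // Grace period expired: Child::kill sends SIGKILL on Unix.
    child.kill()?;
    Ok(StopOutcome::Killed(child.wait()?))
}
```

Polling `try_wait` keeps the sketch dependency-free; a production reaper would more likely block on a pidfd or SIGCHLD instead of sleeping in a loop.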
:::danger Not a multi-tenant adversarial sandbox
v1 isolation is process + filesystem isolation, not raw-syscall hardening. Treat
CAP_SYS_ADMIN as host-root-equivalent — a daemon RCE escalates to host root, the
same class as dockerd/containerd/kubelet. Run untrusted code on its own host
or VM. See Security.
:::