diff --git a/README.md b/README.md
index ee6822a..9eab8f2 100644
--- a/README.md
+++ b/README.md
@@ -66,6 +66,14 @@ sh -c "printf 'install esp4 /bin/false\ninstall esp6 /bin/false\ninstall rxrpc /
 2. Once each distribution backports a patch, update accordingly.
 
+## Kubernetes
+
+For Kubernetes clusters, [`k8s/dirtyfrag-mitigation.yaml`](k8s/dirtyfrag-mitigation.yaml) deploys a DaemonSet that applies the same mitigation on every Linux node and re-applies it automatically on any new node that joins the cluster. See [`k8s/README.md`](k8s/README.md) for details, compatibility notes, and the revert procedure.
+
+```bash
+kubectl apply -f https://raw.githubusercontent.com/V4bel/dirtyfrag/master/k8s/dirtyfrag-mitigation.yaml
+```
+
 # FAQ
 
 ## Why did you chain two vulnerabilities?
 
diff --git a/k8s/README.md b/k8s/README.md
new file mode 100644
index 0000000..39a3ab6
--- /dev/null
+++ b/k8s/README.md
@@ -0,0 +1,54 @@
+# Kubernetes mitigation
+
+A self-contained Kubernetes manifest that applies the [Dirty Frag mitigation](../README.md#mitigation) to every Linux node in a cluster.
+
+## What it does
+
+Deploys a DaemonSet (`dirtyfrag-mitigation` in `kube-system`) whose init container, running on every Linux node including system pools, performs the steps from the disclosure README inside the host's namespaces via `nsenter`:
+
+1. Writes `/etc/modprobe.d/disable-dirtyfrag.conf` blacklisting `esp4`, `esp6` and `rxrpc` so they cannot be loaded on demand.
+2. For each of these modules currently loaded with `refcnt=0`, runs `modprobe -r` to unload it from the live kernel.
+3. Runs `sync; echo 3 > /proc/sys/vm/drop_caches` to clear any contaminated cached pages.
+4. 
+If any of these modules is loaded with `refcnt > 0` (in active use), emits a single aggregated Warning [Kubernetes Event](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/) (`reason=DirtyFragModulesInUse`) on the affected `Node` listing the in-use modules, so operators can drain and reboot/replace the node. **No auto-cordon.**
+
+A long-running [`pause`](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#understanding-init-containers) container keeps the pod in the `Running` state, so the init container is only re-executed on pod recreation, i.e. on each new node that joins the cluster (autoscaling, node-image upgrade, scale-set rolling update).
+
+## Apply
+
+```bash
+kubectl apply -f https://raw.githubusercontent.com/V4bel/dirtyfrag/master/k8s/dirtyfrag-mitigation.yaml
+kubectl -n kube-system rollout status ds/dirtyfrag-mitigation
+```
+
+Check for nodes that need a drain+reboot to complete the mitigation (modules that were already in use):
+
+```bash
+kubectl -n default get events --field-selector reason=DirtyFragModulesInUse
+```
+
+## Compatibility
+
+`esp4` and `esp6` provide IPsec ESP transforms; `rxrpc` provides the RxRPC socket family used by AFS. **None of these are required by a typical workload-only Kubernetes cluster.**
+
+If your cluster does require one of these modules (e.g. a node-level IPsec tunnel, or an AFS client running on the host or in a privileged pod), edit the `MODULES` env var in the manifest and remove the affected module(s) before applying, or label-exclude the affected node pool.
+
+## Revert (once upstream kernel patches roll out)
+
+The modprobe drop-in persists for the lifetime of each node. To clean it up from live nodes before deleting the DaemonSet:
+
+```bash
+# 1. 
+#    Flip the init container into cleanup mode and roll the fleet
+kubectl -n kube-system set env ds/dirtyfrag-mitigation CLEANUP_MODE=true
+kubectl -n kube-system rollout restart ds/dirtyfrag-mitigation
+kubectl -n kube-system rollout status ds/dirtyfrag-mitigation
+
+# 2. Delete the DaemonSet, ServiceAccount and ClusterRole/Binding
+kubectl delete -f https://raw.githubusercontent.com/V4bel/dirtyfrag/master/k8s/dirtyfrag-mitigation.yaml
+```
+
+If you skip step 1, the `/etc/modprobe.d/disable-dirtyfrag.conf` drop-in remains on existing nodes until each is recycled (node-image upgrade, scale-down, or manual `kubectl drain && kubectl delete node`).
+
+## Tested with
+
+- Kubernetes 1.30 on AKS (Azure), across staging and production clusters
+- Linux nodes only (the DaemonSet has `nodeSelector: kubernetes.io/os: linux`, so Windows nodes are skipped automatically)
diff --git a/k8s/dirtyfrag-mitigation.yaml b/k8s/dirtyfrag-mitigation.yaml
new file mode 100644
index 0000000..a03bd00
--- /dev/null
+++ b/k8s/dirtyfrag-mitigation.yaml
@@ -0,0 +1,300 @@
+# Dirty Frag Kubernetes mitigation
+#
+# Disclosure: https://github.com/V4bel/dirtyfrag
+#
+# This manifest applies the Dirty Frag mitigation recommended in the disclosure
+# README to every Linux node in a Kubernetes cluster:
+#
+#     printf 'install esp4 /bin/false\ninstall esp6 /bin/false\ninstall rxrpc /bin/false\n' \
+#       > /etc/modprobe.d/dirtyfrag.conf
+#     rmmod esp4 esp6 rxrpc 2>/dev/null
+#     echo 3 > /proc/sys/vm/drop_caches
+#
+# It runs as a DaemonSet so that:
+# - The mitigation is applied on every existing node, and
+# - It is automatically re-applied to any new node that joins the cluster
+#   (autoscaling, node-image upgrade, scale-set rolling update, etc.) before
+#   workloads schedule onto it.
+#
+# How it works:
+# - An init container enters the host's PID, mount, network, IPC and UTS
+#   namespaces with `nsenter -t 1 -m -u -i -n -p` and:
+#   1. 
+#      Writes /etc/modprobe.d/disable-dirtyfrag.conf so esp4, esp6 and
+#      rxrpc cannot be loaded on demand.
+#   2. For each module currently loaded with refcnt=0, runs `modprobe -r`
+#      to unload it from the live kernel.
+#   3. Runs `sync; echo 3 > /proc/sys/vm/drop_caches` to clear any
+#      contaminated cached pages (gated on DROP_CACHES, default true).
+#   4. If any module remains loaded with refcnt > 0, emits a single
+#      aggregated Warning Kubernetes Event (reason=DirtyFragModulesInUse)
+#      on the Node listing the in-use modules so operators can drain and
+#      reboot/replace the node. This DaemonSet does NOT auto-cordon.
+# - A long-running `pause` container keeps the pod in Running state so the
+#   init container is only re-executed on pod recreation (i.e. on each new
+#   node).
+#
+# Compatibility note:
+#   esp4 and esp6 provide IPsec ESP transforms; rxrpc provides the RxRPC
+#   socket family used by AFS. If any of your workloads (or the host network)
+#   require these modules, do NOT apply this manifest as-is: either remove
+#   the affected module(s) from the MODULES env var below, or label-exclude
+#   the affected node pool. On a typical workload-only Kubernetes cluster
+#   none of these modules are in use.
+#
+# Reverting once upstream kernel patches roll out:
+#   1. Run a cleanup pass first to remove the modprobe drop-in from live
+#      nodes (the init container's CLEANUP_MODE branch removes the file
+#      and reloads modprobe state):
+#
+#        kubectl -n kube-system set env ds/dirtyfrag-mitigation CLEANUP_MODE=true
+#        kubectl -n kube-system rollout restart ds/dirtyfrag-mitigation
+#        kubectl -n kube-system rollout status ds/dirtyfrag-mitigation
+#
+#   2. Then delete the resources:
+#
+#        kubectl delete -f dirtyfrag-mitigation.yaml
+#
+#   If you skip step 1, the modprobe drop-in remains on existing nodes until
+#   each is recycled (node-image upgrade, scale-down, or manual drain+delete).
+#
+# Tested with Kubernetes 1.27+ on AKS, EKS, and GKE (Linux nodes only).
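+#
+# Label-exclusion sketch (illustrative, not part of the manifest): the
+# `dirtyfrag.io/skip` label name below is an example of this document's
+# authors' choosing, not a standard Kubernetes label. To exempt a node pool
+# that needs one of these modules, label its nodes and add a matching
+# nodeAffinity to the pod template below:
+#
+#     kubectl label node <node> dirtyfrag.io/skip=true
+#
+#     affinity:
+#       nodeAffinity:
+#         requiredDuringSchedulingIgnoredDuringExecution:
+#           nodeSelectorTerms:
+#             - matchExpressions:
+#                 - key: dirtyfrag.io/skip
+#                   operator: DoesNotExist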
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: dirtyfrag-mitigation
+  namespace: kube-system
+  labels:
+    app.kubernetes.io/name: dirtyfrag-mitigation
+    app.kubernetes.io/component: cve-mitigation
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: dirtyfrag-mitigation
+  labels:
+    app.kubernetes.io/name: dirtyfrag-mitigation
+    app.kubernetes.io/component: cve-mitigation
+rules:
+  # Read node metadata so we can address Events to the running node.
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["get"]
+  # Emit Warning Events when any module is in use (refcount > 0).
+  - apiGroups: [""]
+    resources: ["events"]
+    verbs: ["create", "patch"]
+  - apiGroups: ["events.k8s.io"]
+    resources: ["events"]
+    verbs: ["create", "patch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: dirtyfrag-mitigation
+  labels:
+    app.kubernetes.io/name: dirtyfrag-mitigation
+    app.kubernetes.io/component: cve-mitigation
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: dirtyfrag-mitigation
+subjects:
+  - kind: ServiceAccount
+    name: dirtyfrag-mitigation
+    namespace: kube-system
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: dirtyfrag-mitigation
+  namespace: kube-system
+  labels:
+    app.kubernetes.io/name: dirtyfrag-mitigation
+    app.kubernetes.io/component: cve-mitigation
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: dirtyfrag-mitigation
+  updateStrategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 100%  # init container is fast; roll the whole fleet at once
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: dirtyfrag-mitigation
+        app.kubernetes.io/component: cve-mitigation
+    spec:
+      hostPID: true
+      priorityClassName: system-node-critical
+      serviceAccountName: dirtyfrag-mitigation
+      automountServiceAccountToken: true
+      # Run on every Linux node, including system/critical pools.
+      nodeSelector:
+        kubernetes.io/os: linux
+      tolerations:
+        - operator: Exists
+      terminationGracePeriodSeconds: 5
+      initContainers:
+        - name: apply-mitigation
+          image: busybox:1.36.1
+          imagePullPolicy: IfNotPresent
+          securityContext:
+            privileged: true
+            runAsUser: 0
+          env:
+            - name: NODE_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+            # Node Events follow the kubelet convention of being created in
+            # the `default` namespace; cluster-scoped objects like Nodes
+            # cannot have a namespaced involvedObject reference.
+            - name: EVENT_NAMESPACE
+              value: "default"
+            # Set CLEANUP_MODE=true (e.g. via `kubectl set env`) to flip the
+            # init container into removing the modprobe drop-in instead of
+            # writing it. Use this for a full rollout pass before deleting
+            # the DaemonSet, to clean up live nodes.
+            - name: CLEANUP_MODE
+              value: "false"
+            # Set DROP_CACHES=false to skip `echo 3 > /proc/sys/vm/drop_caches`
+            # (the page-cache flush after unloading modules). Default true,
+            # matching the disclosure's recommended mitigation.
+            - name: DROP_CACHES
+              value: "true"
+            # Space-separated list of modules to blacklist and unload. Edit
+            # this if you need to keep one of these modules available (e.g.
+            # IPsec via esp4/esp6, AFS via rxrpc).
+            - name: MODULES
+              value: "esp4 esp6 rxrpc"
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              set -eu
+
+              MODPROBE_FILE=/etc/modprobe.d/disable-dirtyfrag.conf
+
+              if [ "${CLEANUP_MODE}" = "true" ]; then
+                echo "[dirtyfrag] CLEANUP mode on node ${NODE_NAME}: removing mitigation"
+                nsenter -t 1 -m -u -i -n -p -- sh -c "rm -f ${MODPROBE_FILE}; depmod -a 2>/dev/null || true; for m in ${MODULES}; do modprobe -r \$m 2>/dev/null || true; done; true"
+                echo "[dirtyfrag] cleanup complete on ${NODE_NAME}"
+                exit 0
+              fi
+
+              echo "[dirtyfrag] applying mitigation on node ${NODE_NAME} for modules: ${MODULES}"
+
+              # 1. Persist modprobe blacklist so the modules cannot be loaded on demand.
+              #    Rewrite the file from scratch (idempotent) to keep ordering
+              #    stable and match the disclosure's recommended single-file form.
+              nsenter -t 1 -m -u -i -n -p -- sh -c "
+                set -eu
+                TMP=\$(mktemp ${MODPROBE_FILE}.XXXXXX)
+                for m in ${MODULES}; do
+                  printf 'install %s /bin/false\n' \"\$m\" >> \"\$TMP\"
+                done
+                if [ -f ${MODPROBE_FILE} ] && cmp -s \"\$TMP\" ${MODPROBE_FILE}; then
+                  rm -f \"\$TMP\"
+                  echo '[dirtyfrag] ${MODPROBE_FILE} already up to date'
+                else
+                  mv \"\$TMP\" ${MODPROBE_FILE}
+                  chmod 0644 ${MODPROBE_FILE}
+                  echo '[dirtyfrag] wrote ${MODPROBE_FILE}'
+                fi
+                depmod -a 2>/dev/null || true
+              "
+
+              # 2. For each module: if currently loaded, try to unload. Track in-use
+              #    modules so we can emit a single aggregated Warning Event.
+              IN_USE=""
+              for m in ${MODULES}; do
+                REFCNT_PATH=/sys/module/${m}/refcnt
+                if nsenter -t 1 -m -u -i -n -p -- test -f "${REFCNT_PATH}"; then
+                  REFCNT=$(nsenter -t 1 -m -u -i -n -p -- cat "${REFCNT_PATH}")
+                  echo "[dirtyfrag] ${m} is loaded with refcnt=${REFCNT}"
+
+                  if [ "${REFCNT}" = "0" ]; then
+                    if nsenter -t 1 -m -u -i -n -p -- modprobe -r ${m} 2>&1; then
+                      echo "[dirtyfrag] successfully unloaded ${m}"
+                    else
+                      echo "[dirtyfrag] WARNING: modprobe -r ${m} failed despite refcnt=0"
+                      IN_USE="${IN_USE}${IN_USE:+,}${m}(rmmod-failed)"
+                    fi
+                  else
+                    echo "[dirtyfrag] WARNING: ${m} in use (refcnt=${REFCNT}); node ${NODE_NAME} requires drain+reboot for full mitigation"
+                    IN_USE="${IN_USE}${IN_USE:+,}${m}(refcnt=${REFCNT})"
+                  fi
+                else
+                  echo "[dirtyfrag] ${m} is not loaded; modprobe blacklist will prevent future loads"
+                fi
+              done
+
+              # 3. Drop page caches to clear any contaminated cached pages, per the
+              #    disclosure's mitigation guidance. Best-effort.
+              if [ "${DROP_CACHES}" = "true" ]; then
+                if nsenter -t 1 -m -u -i -n -p -- sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null; then
+                  echo "[dirtyfrag] dropped page caches"
+                else
+                  echo "[dirtyfrag] WARNING: failed to drop page caches"
+                fi
+              fi
+
+              # 4. 
+              #    If any module was in use, emit a single aggregated Warning Event
+              #    on the Node so operators get an actionable signal.
+              #    Best-effort: do not fail the init container if the API call fails.
+              #    BusyBox `wget --no-check-certificate` is used because BusyBox wget
+              #    does not support `--ca-certificate`; the bearer token still
+              #    authenticates us to the API server, and the endpoint is the
+              #    in-cluster `kubernetes.default.svc` ClusterIP, so skipping TLS
+              #    chain validation is an accepted trade-off for a best-effort emitter.
+              if [ -n "${IN_USE}" ]; then
+                TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
+                APISERVER=https://kubernetes.default.svc
+                NODE_UID=$(wget -qO- --no-check-certificate \
+                  --header="Authorization: Bearer ${TOKEN}" \
+                  "${APISERVER}/api/v1/nodes/${NODE_NAME}" 2>/dev/null | \
+                  sed -n 's/.*"uid":[[:space:]]*"\([^"]*\)".*/\1/p' | head -1 || true)
+                TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+                EVENT_NAME="dirtyfrag-mitigation.${NODE_NAME}.$(date +%s)"
+                EVENT_BODY=$(cat <<EOF
+              {
+                "apiVersion": "v1",
+                "kind": "Event",
+                "metadata": {"name": "${EVENT_NAME}", "namespace": "${EVENT_NAMESPACE}"},
+                "involvedObject": {"apiVersion": "v1", "kind": "Node", "name": "${NODE_NAME}", "uid": "${NODE_UID}"},
+                "reason": "DirtyFragModulesInUse",
+                "type": "Warning",
+                "message": "Dirty Frag mitigation incomplete on node ${NODE_NAME}: module(s) still in use: ${IN_USE}. Drain and reboot/replace the node to complete the mitigation.",
+                "source": {"component": "dirtyfrag-mitigation"},
+                "firstTimestamp": "${TS}",
+                "lastTimestamp": "${TS}",
+                "count": 1
+              }
+              EOF
+                )
+                if wget -qO- --no-check-certificate \
+                  --header="Authorization: Bearer ${TOKEN}" \
+                  --header="Content-Type: application/json" \
+                  --post-data="${EVENT_BODY}" \
+                  "${APISERVER}/api/v1/namespaces/${EVENT_NAMESPACE}/events" >/dev/null 2>&1; then
+                  echo "[dirtyfrag] emitted Warning Event ${EVENT_NAME} (in-use: ${IN_USE})"
+                else
+                  echo "[dirtyfrag] WARNING: failed to emit Kubernetes Event"
+                fi
+              fi
+
+              echo "[dirtyfrag] mitigation complete on ${NODE_NAME}"
+          resources:
+            requests:
+              cpu: 10m
+              memory: 16Mi
+            limits:
+              cpu: 100m
+              memory: 64Mi
+      containers:
+        # Long-running placeholder so the pod stays Running and the init
+        # container is re-executed only on pod recreate (i.e. on each new node).
+        - name: pause
+          image: registry.k8s.io/pause:3.10.1
+          imagePullPolicy: IfNotPresent
+          resources:
+            requests:
+              cpu: 1m
+              memory: 8Mi
+            limits:
+              cpu: 10m
+              memory: 16Mi
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop: ["ALL"]
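+#
+# Post-rollout spot check (illustrative; run in a shell on a Linux node or via
+# a privileged debug pod). The path and module names are the ones this
+# manifest configures:
+#
+#     cat /etc/modprobe.d/disable-dirtyfrag.conf  # expect three install/false lines
+#     lsmod | grep -E '^(esp4|esp6|rxrpc) '       # expect no output once unloaded
+#     modprobe -n -v esp4                         # dry run; should show the /bin/false install rule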