Cloud Native
Running Linkerd on OpenShift: SCCs, CNI, and the Edges That Bite
A practitioner's guide to installing Linkerd on OpenShift: why the default install fails, how Security Context Constraints change the picture, and which of the two fixes is actually worth operating.
Todea Engineering
Installing Linkerd on a vanilla Kubernetes cluster is a ten-minute job. Installing it on OpenShift is not. Same Helm chart, same control plane, same proxy, but run your first helm install against OpenShift and you get a linkerd namespace full of deployments that never scale above zero:
oc get deployment -n linkerd
NAME READY UP-TO-DATE AVAILABLE AGE
linkerd-destination 0/1 0 0 2m40s
linkerd-identity 0/1 0 0 2m40s
linkerd-proxy-injector 0/1 0 0 2m40s

None of this is a bug. OpenShift's security model is simply working as designed. The trick is knowing which parts of Linkerd collide with it, and which of the available fixes you are willing to live with long term.
Why the Default Install Fails
OpenShift gates every pod through Security Context Constraints (SCCs). An SCC is a cluster-scoped policy that decides which securityContext settings a pod is allowed to use: which UIDs it may run as, which Linux capabilities it may request, which volume types it may mount, whether it may touch the host network.
oc get scc
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP SUPGROUP PRIORITY READONLYROOTFS VOLUMES
anyuid false <no value> MustRunAs RunAsAny RunAsAny RunAsAny 10 false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
hostaccess false <no value> MustRunAs MustRunAsRange MustRunAs RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","hostPath","persistentVolumeClaim","projected","secret"]
hostmount-anyuid false <no value> MustRunAs RunAsAny RunAsAny RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","hostPath","nfs","persistentVolumeClaim","projected","secret"]
hostmount-anyuid-v2 false <no value> RunAsAny RunAsAny RunAsAny RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","hostPath","nfs","persistentVolumeClaim","projected","secret"]
hostnetwork false <no value> MustRunAs MustRunAsRange MustRunAs MustRunAs <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
hostnetwork-v2 false ["NET_BIND_SERVICE"] MustRunAs MustRunAsRange MustRunAs MustRunAs <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
insights-runtime-extractor-scc true ["CAP_SYS_ADMIN"] RunAsAny RunAsAny RunAsAny RunAsAny <no value> false ["*"]
machine-api-termination-handler false <no value> MustRunAs RunAsAny MustRunAs MustRunAs <no value> false ["downwardAPI","hostPath"]
node-exporter true <no value> RunAsAny RunAsAny RunAsAny RunAsAny <no value> false ["*"]
nonroot false <no value> MustRunAs MustRunAsNonRoot RunAsAny RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
nonroot-v2 false ["NET_BIND_SERVICE"] MustRunAs MustRunAsNonRoot RunAsAny RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
privileged true ["*"] RunAsAny RunAsAny RunAsAny RunAsAny <no value> false ["*"]
privileged-genevalogging true ["*"] RunAsAny RunAsAny RunAsAny RunAsAny <no value> false ["*"]
restricted false <no value> MustRunAs MustRunAsRange MustRunAs RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
restricted-v2 false ["NET_BIND_SERVICE"] MustRunAs MustRunAsRange MustRunAs RunAsAny <no value> false ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]

At admission time, the SCC admission plugin enumerates every SCC on the cluster, sorts them by priority, and checks two things for each: whether the pod's ServiceAccount has RBAC access to it, and whether the pod spec can satisfy its rules. The pod is admitted under the first SCC that passes both checks. If none does, the pod is rejected, and the event log lists every SCC that was tried and why each one said no.
When a pod is rejected, the event log records exactly what the plugin tried: one entry per SCC, either Forbidden: not usable by user or serviceaccount (the RBAC check failed) or an Invalid value: … detail (RBAC passed, validation failed). Reading that list is how you debug.
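Conversely, when a pod is admitted, OpenShift records the winning SCC on the pod itself in the openshift.io/scc annotation, which is the quickest way to confirm what a healthy pod actually landed on. A sketch, using a canned pod JSON fragment in place of a live oc call so the extraction is visible end to end:

```shell
# On a live cluster you would run:
#   oc get pod <pod> -n <ns> -o jsonpath='{.metadata.annotations.openshift\.io/scc}'
# Simulated here with a minimal pod JSON fragment:
pod='{"metadata":{"annotations":{"openshift.io/scc":"restricted-v2"}}}'
echo "$pod" | sed -n 's/.*"openshift\.io\/scc":"\([^"]*\)".*/\1/p'
# → restricted-v2
```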
oc get events -n linkerd
LAST SEEN TYPE REASON OBJECT MESSAGE
7m47s Warning FailedCreate replicaset/linkerd-destination-5fd5f7b7f7 Error creating: pods "linkerd-destination-5fd5f7b7f7-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .initContainers[0].runAsUser: Invalid value: 65534: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .initContainers[0].capabilities.add: Invalid value: "NET_ADMIN": capability may not be added, provider restricted-v2: .initContainers[0].capabilities.add: Invalid value: "NET_RAW": capability may not be added, provider restricted-v2: .containers[0].runAsUser: Invalid value: 2102: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 2103: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .containers[2].runAsUser: Invalid value: 2103: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .containers[3].runAsUser: Invalid value: 2103: must be in the ranges: [1000750000, 1000759999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "insights-runtime-extractor-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount, provider "privileged-genevalogging": Forbidden: not usable by user or serviceaccount]
106s Warning FailedCreate job/linkerd-heartbeat-29446087 Error creating: pods "linkerd-heartbeat-29446087-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .containers[0].runAsUser: Invalid value: 2103: must be in the ranges: [1000750000, 1000759999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "insights-runtime-extractor-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount, provider "privileged-genevalogging": Forbidden: not usable by user or serviceaccount]
7m47s Warning FailedCreate replicaset/linkerd-identity-688fff88b4 Error creating: pods "linkerd-identity-688fff88b4-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .initContainers[0].runAsUser: Invalid value: 65534: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .initContainers[0].capabilities.add: Invalid value: "NET_ADMIN": capability may not be added, provider restricted-v2: .initContainers[0].capabilities.add: Invalid value: "NET_RAW": capability may not be added, provider restricted-v2: .containers[0].runAsUser: Invalid value: 2103: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 2102: must be in the ranges: [1000750000, 1000759999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "insights-runtime-extractor-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount, provider "privileged-genevalogging": Forbidden: not usable by user or serviceaccount]
7m47s Warning FailedCreate replicaset/linkerd-proxy-injector-5f654db4db Error creating: pods "linkerd-proxy-injector-5f654db4db-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .initContainers[0].runAsUser: Invalid value: 65534: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .initContainers[0].capabilities.add: Invalid value: "NET_ADMIN": capability may not be added, provider restricted-v2: .initContainers[0].capabilities.add: Invalid value: "NET_RAW": capability may not be added, provider restricted-v2: .containers[0].runAsUser: Invalid value: 2102: must be in the ranges: [1000750000, 1000759999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 2103: must be in the ranges: [1000750000, 1000759999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "insights-runtime-extractor-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount, provider "privileged-genevalogging": Forbidden: not usable by user or serviceaccount]

Linkerd installs no custom SCC bindings of its own, so its ServiceAccounts inherit only the one piece of cluster-wide RBAC that OpenShift 4.11+ grants every authenticated ServiceAccount: access to restricted-v2, and nothing else. The event log confirms it: every other provider returns Forbidden: not usable by user or serviceaccount. Linkerd's pods are validated against restricted-v2 alone, and they fail.
As the name suggests, restricted-v2 is strict:
- No Linux capabilities may be added.
- Pods must run as a non-root UID inside the project's assigned UID range. Each OpenShift project gets its own range via the openshift.io/sa.scc.uid-range namespace annotation, and every pod UID must fall inside it.
- No host paths, no host networking, no privilege escalation.
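The range comes straight from that namespace annotation, whose value encodes a start UID and a block size. A quick local sketch of how the annotation value maps to the range seen in admission errors (the value here is a sample; read the real one with oc get ns <ns> -o jsonpath='{.metadata.annotations.openshift\.io/sa.scc.uid-range}'):

```shell
# "start/size" is the format of the openshift.io/sa.scc.uid-range annotation;
# this sample value reproduces the range from the event log above.
range="1000750000/10000"
start=${range%/*}
size=${range#*/}
echo "UIDs $start through $((start + size - 1))"
# → UIDs 1000750000 through 1000759999
```

That output matches the "must be in the ranges: [1000750000, 1000759999]" message from the admission plugin.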
Linkerd's pods break these rules in two ways. The first affects every container in the install; the second is specific to linkerd-init:
- UIDs outside the project range. restricted-v2 requires every UID to fall within the project's assigned range: [1000750000, 1000759999] in this example, and a different window in every project. linkerd-init runs as 65534 (nobody), and the control-plane containers and the proxy sidecar default to 2102 and 2103. None of them is in range, and chasing the values through Helm does not fix it: the namespace range is reassigned every time the namespace is recreated, and meshed application pods live in other projects with ranges of their own. No single UID works everywhere.
- linkerd-init needs capabilities restricted-v2 will not grant. Its entire job is rewriting iptables to redirect the pod's traffic through the sidecar, which requires NET_ADMIN and NET_RAW. Neither is on the allowlist.
Two Ways Out
There are two supported paths. They attack the same problem from opposite directions, and the choice carries real operational consequences.
Path 1: Linkerd CNI (Recommended)
The Linkerd CNI plugin moves the iptables setup out of the pod and into the node-level CNI chain. The linkerd-cni DaemonSet drops a binary and a config file into the node's CNI directories, and Multus chains the plugin into every pod's CNI setup. For pods that are not meshed, the plugin is a no-op. The pod itself no longer needs NET_ADMIN or NET_RAW.
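The chaining uses the standard CNI conflist format: the plugin appends an entry to the plugins array of the node's existing network configuration. A heavily simplified sketch of the result (the network name and the primary plugin's type are illustrative, and real entries carry many more fields):

```json
{
  "cniVersion": "0.4.0",
  "name": "ovn-kubernetes",
  "plugins": [
    { "type": "ovn-k8s-cni-overlay" },
    { "type": "linkerd-cni" }
  ]
}
```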
The CNI DaemonSet still does privileged work on the node, so its ServiceAccount needs an SCC that allows it:
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: linkerd-cni-scc
allowPrivilegedContainer: true
allowPrivilegeEscalation: true
defaultAllowPrivilegeEscalation: true
allowHostNetwork: false
allowHostPorts: false
allowHostPID: false
allowHostIPC: false
allowHostDirVolumePlugin: true
volumes:
  - hostPath
  - configMap
  - projected
  - downwardAPI
  - emptyDir
seccompProfiles:
  - '*'
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users:
  - system:serviceaccount:linkerd-cni:linkerd-cni

There is one more OpenShift-specific detail: the CNI paths are not the upstream Kubernetes defaults. OpenShift keeps CNI binaries under /var/lib/cni/bin and CNI config under /etc/kubernetes/cni/net.d. Linkerd CNI has to drop its files in the right place, or Multus will never see it. Set the Helm values at install time:
helm install linkerd2-cni linkerd2-edge/linkerd2-cni \
--namespace linkerd-cni \
--set destCNIBinDir=/var/lib/cni/bin \
--set destCNINetDir=/etc/kubernetes/cni/net.d \
--set privileged=true

Once the DaemonSet is healthy, install the control plane with --set cniEnabled=true. The linkerd-init container is then omitted from every meshed pod, and the proxy sidecar runs without elevated capabilities.
The CNI moves the privileged work from pods to nodes, but it does not solve the control plane's own UID problem. Bind the Linkerd control plane's ServiceAccounts to a minimal, unprivileged SCC:
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: linkerd-scc
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
defaultAllowPrivilegeEscalation: false
allowHostNetwork: false
allowHostPorts: false
allowHostPID: false
allowHostIPC: false
allowHostDirVolumePlugin: false
volumes:
  - configMap
  - projected
  - downwardAPI
  - emptyDir
  - secret
seccompProfiles:
  - '*'
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users:
  - system:serviceaccount:linkerd:linkerd-destination
  - system:serviceaccount:linkerd:linkerd-identity
  - system:serviceaccount:linkerd:linkerd-proxy-injector
  - system:serviceaccount:linkerd:linkerd-heartbeat

Meshed application workloads also carry a linkerd-proxy sidecar running as UID 2102, so their ServiceAccounts need the same grant. Add each one to the users: list, or run oc adm policy add-scc-to-user linkerd-scc -z <sa> -n <namespace>.
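Granting one service account at a time gets tedious; here is a small sketch that emits the oc adm policy commands for a list of meshed workloads (the namespace:serviceaccount pairs are hypothetical):

```shell
# Emit (rather than run) the grant commands so they can be reviewed first;
# pipe the output to sh to apply them.
for entry in "my-app:default" "my-api:api-sa"; do
  ns=${entry%%:*}   # text before the first colon: the namespace
  sa=${entry#*:}    # text after the first colon: the ServiceAccount
  echo "oc adm policy add-scc-to-user linkerd-scc -z ${sa} -n ${ns}"
done
```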
Path 2: A Custom SCC for proxy-init
If you want to keep linkerd-init, deploy this custom SCC instead:
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: linkerd-scc
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
defaultAllowPrivilegeEscalation: false
allowHostNetwork: false
allowHostPorts: false
allowHostPID: false
allowHostIPC: false
allowHostDirVolumePlugin: false
requiredDropCapabilities:
  - ALL
allowedCapabilities:
  - NET_ADMIN
  - NET_RAW
volumes:
  - configMap
  - projected
  - downwardAPI
  - emptyDir
  - secret
seccompProfiles:
  - '*'
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users:
  - system:serviceaccount:linkerd:linkerd-destination
  - system:serviceaccount:linkerd:linkerd-identity
  - system:serviceaccount:linkerd:linkerd-proxy-injector
  - system:serviceaccount:linkerd:linkerd-heartbeat

The difference from the previous path is that this SCC grants NET_ADMIN and NET_RAW to the control plane's ServiceAccounts. The grant applies only to the users listed in the SCC, so every new application ServiceAccount you onboard has to be added too.
That looks contained, but the blast radius is bigger than it appears. linkerd-init runs not just in the handful of control-plane pods but in every meshed pod, so every meshed namespace needs its ServiceAccounts to hold those capabilities. In practice you end up binding this SCC (or a sibling of it) to system:serviceaccounts:<app-namespace> for every application namespace you onboard. That operational tax never really goes away.
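One way to make the tax at least automatic is to grant at the namespace's service-account group rather than per account, so new ServiceAccounts in a meshed namespace are covered without another change. A sketch for namespace-provisioning automation (the namespace name is hypothetical, and the command is printed for review rather than executed):

```shell
ns="new-team-app"
# system:serviceaccounts:<ns> is the built-in group containing every SA in the namespace.
echo "oc adm policy add-scc-to-group linkerd-scc system:serviceaccounts:${ns}"
```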
One More Trap: the Policy Controller's Lease
Whichever path you pick, one more OpenShift-specific detail trips people up. OpenShift enables the OwnerReferencesPermissionEnforcement admission plugin by default, which vanilla Kubernetes clusters do not. The plugin requires that anyone setting an ownerReference on an object also hold delete permission on that object, so garbage collection can clean it up later. When the policy controller tries to claim the policy-controller-write lease in leases.coordination.k8s.io and attach an ownerRef, the call fails: the default linkerd-policy ClusterRole grants neither update nor delete on leases. Patch in the missing verbs:
oc get clusterrole linkerd-policy -o json \
| jq '(.rules[] | select(.apiGroups==["coordination.k8s.io"] and .resources==["leases"]) | .verbs) |= (. + ["update","delete"] | unique)' \
| oc apply -f -

When the xtables Modules Are Missing
On some OpenShift clusters, the RHCOS kernel does not automatically load every xtables compatibility module that linkerd-init depends on. When that happens, linkerd-init fails while inserting its iptables rules, and the control-plane components end up stuck in Init:CrashLoopBackOff.
oc logs -n linkerd deploy/linkerd-destination -c linkerd-init --previous
time="2025-12-27T04:59:58Z" level=info msg="/usr/sbin/iptables-nft-save -t nat"
time="2025-12-27T04:59:58Z" level=info msg="# Generated by iptables-nft-save v1.8.11 (nf_tables) on Sat Dec 27 04:59:58 2025\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\nCOMMIT\n# Completed on Sat Dec 27 04:59:58 2025\n"
time="2025-12-27T04:59:58Z" level=info msg="/usr/sbin/iptables-nft -t nat -F PROXY_INIT_REDIRECT"
time="2025-12-27T04:59:58Z" level=info msg="/usr/sbin/iptables-nft -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"
time="2025-12-27T04:59:58Z" level=info msg="Warning: Extension multiport revision 0 not supported, missing kernel module?\niptables v1.8.11 (nf_tables): RULE_APPEND failed (No such file or directory): rule in chain PROXY_INIT_REDIRECT\n"
Error: exit status 4

iptables-nft is not a pure nftables translator: matchers like -m multiport, -m owner, and -m comment, and the REDIRECT target, all go through nft_compat, which needs the legacy xt_* modules to be loadable. If any one of them is unavailable, rule insertion fails. To guarantee they are loaded at boot, use this MachineConfig:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: load-linkerd-xt-modules
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modules-load.d/linkerd-xt.conf
          mode: 0644
          overwrite: true
          contents:
            source: data:,xt_multiport%0Axt_comment%0Axt_REDIRECT%0Axt_owner

The data:, URL is just the four module names joined with URL-encoded newlines, exactly the format systemd-modules-load expects. Applying it drains and reboots every node in the worker MachineConfigPool.
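Sanity-checking the data: URL locally is cheap before triggering a rolling reboot of the worker pool. Decoding it should yield exactly the four module names, one per line:

```shell
# %0A is a URL-encoded newline. sed turns each %0A into a literal \n escape,
# and printf '%b' expands those escapes into real newlines.
encoded='xt_multiport%0Axt_comment%0Axt_REDIRECT%0Axt_owner'
printf '%b\n' "$(printf '%s' "$encoded" | sed 's/%0A/\\n/g')"
```

The output is exactly what systemd-modules-load will read from /etc/modules-load.d/linkerd-xt.conf.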
Which One to Choose
The real choice is where the privilege lives and how many SCCs you have to maintain.
The CNI path concentrates the privilege in one place: a DaemonSet in a dedicated linkerd-cni project, governed by one linkerd-cni-scc, bound to one ServiceAccount. The control plane gets a minimal, unprivileged linkerd-scc. Application workloads get the same minimal SCC for their proxy sidecars, but nothing in the cluster needs NET_ADMIN or NET_RAW inside an application pod.
The proxy-init path spreads the privilege everywhere: every meshed application namespace needs its ServiceAccounts bound to an SCC that grants NET_ADMIN and NET_RAW. Every new application team you onboard extends that grant, and you operate it forever.
Unless your cluster admins flatly forbid DaemonSets that write to the host's CNI paths, choose the CNI. If that prohibition does exist, proxy-init with a custom SCC is the honest fallback. Just build the per-namespace SCC grant into your namespace-provisioning automation from day one, or it will bite you six months from now.