Traditionally, servers are monitored with Icinga, Nagios, PRTG, etc. Kubernetes, however, is more convenient to monitor with Prometheus. All that is needed is a handful of metric collectors and exporters, namely metrics-server, cAdvisor, kube-state-metrics, prometheus-node-exporter, and prometheus-pushgateway.
Once these components are installed and ACLs are in place, their endpoints can be handed to Prometheus as scrape targets.
This page focuses on installing the above components in a k8s cluster. As with most k8s installations, it is almost as simple as running “setup.exe”, except the command is
kubectl apply -f <manifest.yaml>
1. Review the metrics-server.yaml manifest:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: system:aggregated-metrics-reader
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
rules:
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:metrics-server
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - nodes
      - nodes/stats
      - namespaces
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
---
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
        - name: tmp-dir
          emptyDir: {}
        - name: ca-cert
          hostPath:
            path: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
            type: File
      containers:
        - name: metrics-server
          image: gcr.io/google_containers/metrics-server-amd64:v0.3.4
          imagePullPolicy: IfNotPresent
          command:
            - /metrics-server
            - --kubelet-insecure-tls
          volumeMounts:
            - name: tmp-dir
              mountPath: /tmp
            - name: ca-cert
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
---
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    kubernetes.io/name: "Metrics-server"
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    k8s-app: metrics-server
  ports:
    - port: 443
      protocol: TCP
      targetPort: 443
2. Apply the manifest
kubectl apply -f metrics-server.yaml
3. Verify that metrics are available
kubectl top nodes
kubectl top pods
curl --cacert .kube/ca.crt --cert .kube/client.pem https://cluster:6443/apis/metrics.k8s.io/v1beta1/nodes
curl --cacert .kube/ca.crt --cert .kube/client.pem https://cluster:6443/apis/metrics.k8s.io/v1beta1/pods
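If you would rather not extract certificates by hand, kubectl get --raw queries the same endpoints using the credentials already in your kubeconfig:
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods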
1. Review the cadvisor.yaml manifest:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cadvisor
rules:
  - apiGroups:
      - ''
    resources:
      - 'pods'
      - 'resourcequotas'
    verbs:
      - 'get'
      - 'list'
  - apiGroups:
      - 'metrics.k8s.io'
    resources:
      - 'pods'
    verbs:
      - 'get'
      - 'list'
    resourceNames:
      - cadvisor
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cadvisor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
  - kind: ServiceAccount
    name: cadvisor
    namespace: default
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: cadvisor
spec:
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - '*'
  allowedHostPaths:
    - pathPrefix: "/"
    - pathPrefix: "/var/run"
    - pathPrefix: "/sys"
    - pathPrefix: "/var/lib/docker"
    - pathPrefix: "/dev/disk"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      labels:
        name: cadvisor
    spec:
      serviceAccountName: cadvisor
      containers:
        - name: cadvisor
          image: gcr.io/google-containers/cadvisor:v0.30.2
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          resources:
            requests:
              memory: 2000Mi
              cpu: 1000m
            limits:
              memory: 4000Mi
              cpu: 1000m
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: var-run
              mountPath: /var/run
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: docker
              mountPath: /var/lib/docker
              readOnly: true
            - name: disk
              mountPath: /dev/disk
              readOnly: true
          ports:
            - name: http
              containerPort: 8080
              hostPort: 8080
              protocol: TCP
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 30
      volumes:
        - name: rootfs
          hostPath:
            path: /
        - name: var-run
          hostPath:
            path: /var/run
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/docker
        - name: disk
          hostPath:
            path: /dev/disk
      tolerations:
        - effect: NoSchedule
          operator: Exists
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
2. Apply the manifest
kubectl apply -f cadvisor.yaml
3. Open port 8080 on all k8s nodes
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 8080 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Request firewall ACLs to be opened from the Prometheus servers to the k8s nodes
5. Create a config file for Prometheus (just an example)
vi cadvisor.yaml
- job_name: 'cadvisor'
  scrape_interval: 30s
  metrics_path: '/metrics'
  honor_labels: true
  static_configs:
    - targets:
        - 'kube-server-1:8080'
        - 'kube-server-2:8080'
        - 'kube-server-3:8080'
        - 'kube-minion-1:8080'
        - 'kube-minion-2:8080'
        - 'kube-minion-3:8080'
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: ([a-z].*)
      target_label: channel
      replacement: new_channel
      action: replace
    - source_labels: [__name__]
      regex: ([a-z].*)
      target_label: environment
      replacement: env
      action: replace
    - source_labels: [__name__]
      regex: go_(.*)
      action: drop
6. Check that metrics are available
curl http://kube-server-1:8080/metrics
curl http://kube-server-2:8080/metrics
curl http://kube-server-3:8080/metrics
curl http://kube-minion-1:8080/metrics
curl http://kube-minion-2:8080/metrics
curl http://kube-minion-3:8080/metrics
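Once Prometheus has reloaded this configuration, you can also confirm the scrape status from the Prometheus side with the HTTP query API (assuming the server is reachable as prometheus:9090; adjust the host to your setup, and note that -g disables curl's brace globbing):
curl -g 'http://prometheus:9090/api/v1/query?query=up{job="cadvisor"}'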
1. Review the kube-state-metrics.yaml manifest:
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
rules:
  - apiGroups:
      - certificates.k8s.io
    resources:
      - certificatesigningrequests
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - cronjobs
    verbs:
      - list
      - watch
  - apiGroups:
      - extensions
      - apps
    resources:
      - daemonsets
    verbs:
      - list
      - watch
  - apiGroups:
      - extensions
      - apps
    resources:
      - deployments
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - list
      - watch
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - list
      - watch
  - apiGroups:
      - extensions
      - networking.k8s.io
    resources:
      - ingresses
    verbs:
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - limitranges
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - namespaces
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - persistentvolumes
    verbs:
      - list
      - watch
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - list
      - watch
  - apiGroups:
      - extensions
      - apps
    resources:
      - replicasets
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - replicationcontrollers
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - resourcequotas
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - services
    verbs:
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - statefulsets
    verbs:
      - list
      - watch
  - apiGroups:
      - storage.k8s.io
    resources:
      - storageclasses
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: default
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: default
spec:
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 30800
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: kube-state-metrics
    helm.sh/chart: kube-state-metrics-2.3.1
  name: kube-state-metrics
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: kube-state-metrics
        app.kubernetes.io/name: kube-state-metrics
    spec:
      containers:
        - args:
            - --collectors=certificatesigningrequests
            - --collectors=configmaps
            - --collectors=cronjobs
            - --collectors=daemonsets
            - --collectors=deployments
            - --collectors=endpoints
            - --collectors=horizontalpodautoscalers
            - --collectors=ingresses
            - --collectors=jobs
            - --collectors=limitranges
            - --collectors=namespaces
            - --collectors=nodes
            - --collectors=persistentvolumeclaims
            - --collectors=persistentvolumes
            - --collectors=poddisruptionbudgets
            - --collectors=pods
            - --collectors=replicasets
            - --collectors=replicationcontrollers
            - --collectors=resourcequotas
            - --collectors=secrets
            - --collectors=services
            - --collectors=statefulsets
            - --collectors=storageclasses
          image: quay.io/coreos/kube-state-metrics:v1.9.5
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: kube-state-metrics
          ports:
            - containerPort: 8080
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeName: ds-kube-minion-1
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsUser: 65534
      serviceAccount: kube-state-metrics
      serviceAccountName: kube-state-metrics
      terminationGracePeriodSeconds: 30
2. Apply the manifest
kubectl apply -f kube-state-metrics.yaml
3. Open port 30800 on kube-minion-1
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 30800 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Create an ACL request to allow the Prometheus servers to scrape kube-minion-1:30800
5. Create a YAML config file for Prometheus following the same approach as for cAdvisor; a sketch follows below.
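A minimal scrape job might look like this (host and port follow this page's examples; adjust to your environment):
- job_name: 'kube-state-metrics'
  scrape_interval: 30s
  metrics_path: '/metrics'
  static_configs:
    - targets:
        - 'kube-minion-1:30800'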
6. Verify that kube-state-metrics are available
curl http://kube-minion-1:30800/metrics
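If everything works, well-known kube-state-metrics series should be present, for example kube_pod_status_phase:
curl -s http://kube-minion-1:30800/metrics | grep kube_pod_status_phase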
1. Review the manifest prometheus-node-exporter.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
rules:
  - apiGroups:
      - extensions
    resourceNames:
      - prometheus-node-exporter
    resources:
      - podsecuritypolicies
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-node-exporter
subjects:
  - kind: ServiceAccount
    name: prometheus-node-exporter
    namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
spec:
  selector:
    matchLabels:
      app: prometheus-node-exporter
  template:
    metadata:
      labels:
        app: prometheus-node-exporter
    spec:
      containers:
        - args:
            - --no-collector.arp
            - --no-collector.bcache
            - --no-collector.bonding
            - --no-collector.conntrack
            - --no-collector.cpufreq
            - --no-collector.entropy
            - --no-collector.filefd
            - --no-collector.hwmon
            - --no-collector.infiniband
            - --no-collector.ipvs
            - --no-collector.netclass
            - --no-collector.netdev
            - --no-collector.nfsd
            - --no-collector.pressure
            - --no-collector.sockstat
            - --no-collector.stat
            - --no-collector.textfile
            - --no-collector.time
            - --no-collector.timex
            - --no-collector.xfs
            - --no-collector.zfs
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --web.listen-address=0.0.0.0:9100
          image: quay.io/prometheus/node-exporter:v0.18.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /host/proc
              name: proc
              readOnly: true
            - mountPath: /host/sys
              name: sys
              readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccount: prometheus-node-exporter
      serviceAccountName: prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - hostPath:
            path: /proc
            type: ""
          name: proc
        - hostPath:
            path: /sys
            type: ""
          name: sys
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
2. Apply the manifest prometheus-node-exporter.yaml
kubectl apply -f prometheus-node-exporter.yaml
3. Open port 9100 on all k8s nodes
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 9100 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Create an ACL request to allow the Prometheus servers to scrape all k8s nodes on port 9100
5. Create a YAML config file for Prometheus following the same approach as for cAdvisor; a sketch follows below.
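A minimal sketch reusing this page's example hosts:
- job_name: 'node-exporter'
  scrape_interval: 30s
  metrics_path: '/metrics'
  static_configs:
    - targets:
        - 'kube-server-1:9100'
        - 'kube-server-2:9100'
        - 'kube-server-3:9100'
        - 'kube-minion-1:9100'
        - 'kube-minion-2:9100'
        - 'kube-minion-3:9100'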
6. Verify that metrics are available
curl http://kube-server-1:9100/metrics
curl http://kube-server-2:9100/metrics
curl http://kube-server-3:9100/metrics
curl http://kube-minion-1:9100/metrics
curl http://kube-minion-2:9100/metrics
curl http://kube-minion-3:9100/metrics
1. Review the prometheus-push-gateway.yaml manifest:
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: prometheus-pushgateway
  name: prometheus-pushgateway
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus-pushgateway
  name: prometheus-pushgateway
spec:
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 30091
      port: 9091
      protocol: TCP
      targetPort: 9091
  selector:
    app: prometheus-pushgateway
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-pushgateway
  name: prometheus-pushgateway
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-pushgateway
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus-pushgateway
    spec:
      containers:
        - image: prom/pushgateway:v1.2.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /#/status
              port: 9091
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 10
          name: pushgateway
          ports:
            - containerPort: 9091
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /#/status
              port: 9091
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 10
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeName: ds-kube-minion-1
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: prometheus-pushgateway
      serviceAccountName: prometheus-pushgateway
      terminationGracePeriodSeconds: 30
2. Apply it
kubectl apply -f prometheus-push-gateway.yaml
3. Open port 30091 on kube-minion-1
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 30091 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Create an ACL request to allow the Prometheus servers to scrape kube-minion-1:30091
5. Create a YAML config file for Prometheus following the same approach as for cAdvisor; a sketch follows below.
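A minimal sketch; honor_labels: true matters for the pushgateway so that the job and instance labels of pushed metrics survive the scrape (host and port follow this page's examples):
- job_name: 'pushgateway'
  scrape_interval: 30s
  metrics_path: '/metrics'
  honor_labels: true
  static_configs:
    - targets:
        - 'kube-minion-1:30091'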
6. Verify that metrics are available
curl http://kube-minion-1:30091/metrics
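As a quick end-to-end test, push an arbitrary metric and look for it on the metrics endpoint; the metric and job names below are made-up examples, while /metrics/job/<name> is the standard pushgateway push API:
echo "smoke_test_metric 42" | curl --data-binary @- http://kube-minion-1:30091/metrics/job/smoke_test
curl -s http://kube-minion-1:30091/metrics | grep smoke_test_metric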
In case you wonder where client.pem and ca.crt come from, the answer is, as one can guess, ~/.kube/config:
kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://kube-cluster:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
As you can see, there are certificate-authority-data and client-certificate-data fields. They are base64-encoded and can easily be decoded with
grep certificate-authority-data .kube/config | cut -d ' ' -f6 | base64 -d > ca.crt
grep client-certificate-data .kube/config | cut -d ' ' -f6 | base64 -d > client.pem
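Note that for the earlier curl commands to authenticate, client.pem typically needs to contain the private key as well as the certificate; appending the decoded client-key-data to the same file achieves that (an extra step, assuming the same kubeconfig layout):
grep client-key-data .kube/config | cut -d ' ' -f6 | base64 -d >> client.pem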
The only way to figure out the permissions of the kubernetes-admin user is to look at the organization field (‘O’) in the subject of its certificate. One then has to search through the clusterrolebindings to find the cluster-admin one, which binds the cluster-admin role to the system:masters group. Group membership itself is not visible anywhere, since groups and users are not Kubernetes objects. In other words, any user whose certificate carries ‘O=system:masters’ in its subject and is signed by the cluster CA holds cluster-admin privileges.
openssl x509 -in .kube/client.pem -noout -text | grep Subject:
        Subject: O=system:masters, CN=kubernetes-admin
kubectl get clusterrolebindings cluster-admin -o yaml
...
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:masters
kubectl get clusterrole cluster-admin -o yaml
...
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
- nonResourceURLs:
  - '*'
  verbs:
  - '*'
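As a quicker sanity check, kubectl auth can-i shows what the current kubeconfig user may do; for a holder of the cluster-admin role it should simply answer "yes":
kubectl auth can-i '*' '*'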