Traditionally, servers are monitored with Icinga, Nagios, PRTG, etc. Kubernetes, however, is more convenient to monitor with Prometheus. All that is needed is a handful of metric collectors and exporters, namely metrics-server, cAdvisor, kube-state-metrics, node-exporter and pushgateway.
Once these components are installed and ACLs are in place, their endpoints can be handed to Prometheus as scrape targets.
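To set expectations for where this all ends up: each component below becomes one scrape job in the Prometheus configuration. A minimal sketch (the hostnames are placeholders for your own nodes; the actual job definitions are built step by step further down):

global:
  scrape_interval: 30s
scrape_configs:
- job_name: 'cadvisor'
  static_configs:
  - targets: ['kube-server-1:8080']
- job_name: 'node-exporter'
  static_configs:
  - targets: ['kube-server-1:9100']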
This page focuses on the installation of the above components in a k8s cluster. As with most k8s installations, each one is almost as simple as running “setup.exe”, except the command is
kubectl apply -f <manifest.yaml>
1. Review the metrics-server.yaml manifest:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: system:aggregated-metrics-reader
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
rules:
- apiGroups: ["metrics.k8s.io"]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - nodes/stats
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
      - name: tmp-dir
        emptyDir: {}
      - name: ca-cert
        hostPath:
          path: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
          type: File
      containers:
      - name: metrics-server
        image: gcr.io/google_containers/metrics-server-amd64:v0.3.4
        imagePullPolicy: IfNotPresent
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
        - name: ca-cert
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
---
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    kubernetes.io/name: "Metrics-server"
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    k8s-app: metrics-server
  ports:
  - port: 443
    protocol: TCP
    targetPort: 443
2. Apply the manifest
kubectl apply -f metrics-server.yaml
3. Verify that metrics are available
kubectl top nodes
kubectl top pods
curl --cacert .kube/ca.crt --cert .kube/client.pem https://cluster:6443/apis/metrics.k8s.io/v1beta1/nodes
curl --cacert .kube/ca.crt --cert .kube/client.pem https://cluster:6443/apis/metrics.k8s.io/v1beta1/pods
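If you would rather not juggle certificates, kubectl can issue the same API requests and handle authentication itself; this should return the same JSON as the curl calls above:

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods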
1. Review the cadvisor.yaml manifest:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cadvisor
rules:
- apiGroups:
  - ''
  resources:
  - 'pods'
  - 'resourcequotas'
  verbs:
  - 'get'
  - 'list'
- apiGroups:
  - 'metrics.k8s.io'
  resources:
  - 'pods'
  verbs:
  - 'get'
  - 'list'
  resourceNames:
  - cadvisor
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cadvisor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
- kind: ServiceAccount
  name: cadvisor
  namespace: default
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: cadvisor
spec:
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
  allowedHostPaths:
  - pathPrefix: "/"
  - pathPrefix: "/var/run"
  - pathPrefix: "/sys"
  - pathPrefix: "/var/lib/docker"
  - pathPrefix: "/dev/disk"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      labels:
        name: cadvisor
    spec:
      serviceAccountName: cadvisor
      containers:
      - name: cadvisor
        image: gcr.io/google-containers/cadvisor:v0.30.2
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        resources:
          requests:
            memory: 2000Mi
            cpu: 1000m
          limits:
            memory: 4000Mi
            cpu: 1000m
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
          readOnly: true
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: disk
          mountPath: /dev/disk
          readOnly: true
        ports:
        - name: http
          containerPort: 8080
          hostPort: 8080
          protocol: TCP
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 30
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
      - name: disk
        hostPath:
          path: /dev/disk
      tolerations:
      - effect: NoSchedule
        operator: Exists
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
2. Apply the manifest
kubectl apply -f cadvisor.yaml
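Before opening firewall ports, it is worth confirming that the DaemonSet actually rolled a pod out to every node:

kubectl rollout status daemonset/cadvisor
kubectl get pods -l name=cadvisor -o wide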
3. Open port 8080 on all k8s nodes
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 8080 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Request ACLs to be opened from the Prometheus servers to the k8s nodes
5. Create a config file for Prometheus (just an example)
vi cadvisor.yaml

- job_name: 'cadvisor'
  scrape_interval: 30s
  metrics_path: '/metrics'
  honor_labels: true
  static_configs:
  - targets:
    - 'kube-server-1:8080'
    - 'kube-server-2:8080'
    - 'kube-server-3:8080'
    - 'kube-minion-1:8080'
    - 'kube-minion-2:8080'
    - 'kube-minion-3:8080'
  metric_relabel_configs:
  # attach a static "channel" label to every scraped metric
  - source_labels: [__name__]
    regex: ([a-z].*)
    target_label: channel
    replacement: new_channel
    action: replace
  # attach a static "environment" label to every scraped metric
  - source_labels: [__name__]
    regex: ([a-z].*)
    target_label: environment
    replacement: env
    action: replace
  # drop the exporter's own Go runtime metrics
  - source_labels: [ __name__ ]
    regex: go_(.*)
    action: drop
6. Check that metrics are available
curl http://kube-server-1:8080/metrics
curl http://kube-server-2:8080/metrics
curl http://kube-server-3:8080/metrics
curl http://kube-minion-1:8080/metrics
curl http://kube-minion-2:8080/metrics
curl http://kube-minion-3:8080/metrics
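Or, to avoid repeating yourself, a quick bash loop that prints the HTTP status code for each node:

for h in kube-server-{1..3} kube-minion-{1..3}; do
  curl -s -o /dev/null -w "%{http_code} $h\n" "http://$h:8080/metrics"
done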
1. Review the kube-state-metrics.yaml manifest:
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  - apps
  resources:
  - daemonsets
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  - apps
  resources:
  - deployments
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - limitranges
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  verbs:
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  - apps
  resources:
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - replicationcontrollers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - resourcequotas
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: default
spec:
  clusterIP: 10.97.255.27
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 30800
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/name: kube-state-metrics
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  generation: 3
  labels:
    app.kubernetes.io/instance: kube-state-metrics
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: kube-state-metrics
    helm.sh/chart: kube-state-metrics-2.3.1
  name: kube-state-metrics
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: kube-state-metrics
        app.kubernetes.io/name: kube-state-metrics
    spec:
      containers:
      - args:
        - --collectors=certificatesigningrequests
        - --collectors=configmaps
        - --collectors=cronjobs
        - --collectors=daemonsets
        - --collectors=deployments
        - --collectors=endpoints
        - --collectors=horizontalpodautoscalers
        - --collectors=ingresses
        - --collectors=jobs
        - --collectors=limitranges
        - --collectors=namespaces
        - --collectors=nodes
        - --collectors=persistentvolumeclaims
        - --collectors=persistentvolumes
        - --collectors=poddisruptionbudgets
        - --collectors=pods
        - --collectors=replicasets
        - --collectors=replicationcontrollers
        - --collectors=resourcequotas
        - --collectors=secrets
        - --collectors=services
        - --collectors=statefulsets
        - --collectors=storageclasses
        image: quay.io/coreos/kube-state-metrics:v1.9.5
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeName: ds-kube-minion-1
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsUser: 65534
      serviceAccount: kube-state-metrics
      serviceAccountName: kube-state-metrics
      terminationGracePeriodSeconds: 30
2. Apply the manifest
kubectl apply -f kube-state-metrics.yaml
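A quick way to confirm the deployment is up, and to double-check which NodePort was allocated before touching the firewall:

kubectl get deployment kube-state-metrics
kubectl get service kube-state-metrics
# the PORT(S) column should show 8080:30800/TCP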
3. Open port 30800 on kube-minion-1
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 30800 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Create an ACL request to allow the Prometheus servers to scrape kube-minion-1:30800
5. Create a yaml config file for Prometheus following the same approach as for cadvisor (see the sketch after step 6 below).
6. Verify the kube-state-metrics are available
curl http://kube-minion-1:30800/metrics
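For reference, the kube-state-metrics scrape job could look like this (a sketch following the cadvisor example, with the relabel rules omitted):

- job_name: 'kube-state-metrics'
  scrape_interval: 30s
  metrics_path: '/metrics'
  honor_labels: true
  static_configs:
  - targets:
    - 'kube-minion-1:30800'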
1. Review the manifest prometheus-node-exporter.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
rules:
- apiGroups:
  - extensions
  resourceNames:
  - prometheus-node-exporter
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-node-exporter
subjects:
- kind: ServiceAccount
  name: prometheus-node-exporter
  namespace: default  # ServiceAccount subjects must carry a namespace
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: prometheus-node-exporter
  name: prometheus-node-exporter
spec:
  selector:
    matchLabels:
      app: prometheus-node-exporter
  template:
    metadata:
      labels:
        app: prometheus-node-exporter
    spec:
      containers:
      - args:
        - --no-collector.arp
        - --no-collector.bcache
        - --no-collector.bonding
        - --no-collector.conntrack
        - --no-collector.cpufreq
        - --no-collector.entropy
        - --no-collector.filefd
        - --no-collector.hwmon
        - --no-collector.infiniband
        - --no-collector.ipvs
        - --no-collector.netclass
        - --no-collector.netdev
        - --no-collector.nfsd
        - --no-collector.pressure
        - --no-collector.sockstat
        - --no-collector.stat
        - --no-collector.textfile
        - --no-collector.time
        - --no-collector.timex
        - --no-collector.xfs
        - --no-collector.zfs
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --web.listen-address=0.0.0.0:9100
        image: quay.io/prometheus/node-exporter:v0.18.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 9100
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 9100
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccount: prometheus-node-exporter  # must match the ServiceAccount above
      serviceAccountName: prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
2. Apply the manifest prometheus-node-exporter.yaml
kubectl apply -f prometheus-node-exporter.yaml
3. Open port 9100 on all k8s nodes
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 9100 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Create an ACL request to allow the Prometheus servers to scrape all k8s nodes on port 9100
5. Create a yaml config file for Prometheus following the same approach as for cadvisor (see the sketch after step 6 below).
6. Verify that metrics are available
curl http://kube-server-1:9100/metrics
curl http://kube-server-2:9100/metrics
curl http://kube-server-3:9100/metrics
curl http://kube-minion-1:9100/metrics
curl http://kube-minion-2:9100/metrics
curl http://kube-minion-3:9100/metrics
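And the corresponding scrape job, again a sketch following the cadvisor example:

- job_name: 'node-exporter'
  scrape_interval: 30s
  metrics_path: '/metrics'
  static_configs:
  - targets:
    - 'kube-server-1:9100'
    - 'kube-server-2:9100'
    - 'kube-server-3:9100'
    - 'kube-minion-1:9100'
    - 'kube-minion-2:9100'
    - 'kube-minion-3:9100'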
1. Review the prometheus-push-gateway.yaml manifest
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: prometheus-pushgateway
  name: prometheus-pushgateway
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus-pushgateway
  name: prometheus-pushgateway
spec:
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 30091
    port: 9091
    protocol: TCP
    targetPort: 9091
  selector:
    app: prometheus-pushgateway
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-pushgateway
  name: prometheus-pushgateway
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-pushgateway
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus-pushgateway
    spec:
      containers:
      - image: prom/pushgateway:v1.2.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /#/status
            port: 9091
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: pushgateway
        ports:
        - containerPort: 9091
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /#/status
            port: 9091
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeName: ds-kube-minion-1
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: prometheus-pushgateway
      serviceAccountName: prometheus-pushgateway
      terminationGracePeriodSeconds: 30
2. Apply it
kubectl apply -f prometheus-push-gateway.yaml
3. Open port 30091 on kube-minion-1
sudo firewall-cmd --permanent --direct --add-rule ipv4 filter IN_public_allow 0 -p tcp --dport 30091 -m conntrack --ctstate NEW -j ACCEPT
sudo firewall-cmd --reload
4. Create an ACL request to allow the Prometheus servers to scrape kube-minion-1:30091
5. Create a yaml config file for Prometheus following the same approach as for cadvisor; keep honor_labels: true so that the job and instance labels of pushed metrics survive scraping.
6. Verify that metrics are available
curl http://kube-minion-1:30091/metrics
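Unlike the other exporters, the pushgateway only has something to serve after a client pushes to it, so a /metrics page containing nothing but the gateway's own metrics is normal at first. A quick smoke test, with a made-up job name and metric:

echo "demo_job_duration_seconds 42" | curl --data-binary @- http://kube-minion-1:30091/metrics/job/demo_job
curl -s http://kube-minion-1:30091/metrics | grep demo_job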
In case you wonder where client.pem and ca.crt come from: as one can guess, they come from ~/.kube/config
kubectl config view

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://kube-cluster:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
As you can see, there are certificate-authority-data and client-certificate-data fields. They are base64-encoded and can easily be decoded with
grep certificate-authority-data .kube/config | cut -d ' ' -f6 | base64 -d > ca.crt
grep client-certificate-data .kube/config | cut -d ' ' -f6 | base64 -d > client.pem
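A slightly more robust alternative that does not depend on field positions (--raw makes kubectl print the actual base64 data instead of REDACTED), plus a sanity check that the results parse as PEM certificates:

kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > ca.crt
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d > client.pem
openssl x509 -in ca.crt -noout -subject -dates
openssl x509 -in client.pem -noout -subject -dates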
And the only way to figure out the permissions of the kubernetes-admin user is to look at the organization field (‘O’) in the subject of its certificate. One then needs to search through the clusterrolebindings to find the cluster-admin one, which binds the cluster-admin role to the system:masters group. Group membership cannot be inspected, though, as groups and users are not Kubernetes objects. In other words, it can be any user, as long as its certificate has ‘O=system:masters’ in its subject and is signed by the cluster CA; that ‘any user’ holds the cluster-admin role privileges.
openssl x509 -in .kube/client.pem -text -noout | grep Subject:
        Subject: O=system:masters, CN=kubernetes-admin
kubectl get clusterrolebindings cluster-admin -o yaml
...
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:masters
kubectl get clusterrole cluster-admin -o yaml
...
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
- nonResourceURLs:
  - '*'
  verbs:
  - '*'
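To make the point concrete: with access to the cluster CA key (under kubeadm it typically sits in /etc/kubernetes/pki; an assumption about your setup), one can mint a certificate for an arbitrary, hypothetical user and it will carry cluster-admin privileges purely because of the O= field:

# hypothetical user "jane"; only O=system:masters matters for authorization
openssl req -new -newkey rsa:2048 -nodes -keyout jane.key \
  -subj "/O=system:masters/CN=jane" -out jane.csr
sudo openssl x509 -req -in jane.csr -CA /etc/kubernetes/pki/ca.crt \
  -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -days 1 -out jane.crt
kubectl --server=https://kube-cluster:6443 --certificate-authority=ca.crt \
  --client-certificate=jane.crt --client-key=jane.key get nodes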