Configuring Alert Notifications in OpenShift

Introduction

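I seem to write posts only at this time of year, but this is the December 5 entry for the OpenShift Advent Calendar 2024. It is about time to write a letter to Santa Claus, or the present you want may not arrive. If the letter is not delivered correctly, it could be a sad Christmas. In an OpenShift cluster, too, where alerts are delivered matters. If they are not delivered correctly, you may end up giving up your holidays.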

This post walks through how Alertmanager is used in OpenShift's Cluster Monitoring.

The OpenShift monitoring stack

The figure below shows the layout of Cluster Monitoring and User Workload Monitoring. The parts marked "optional" with a red dotted line are components that can be added through user configuration.

First, a quick look at what Cluster Monitoring and User Workload Monitoring are.

Cluster Monitoring

Quoting the product documentation verbatim:

A set of platform monitoring components are installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.

These components are illustrated in the Installed by default section in the following diagram.

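In other words, this is the set of platform monitoring components installed by default into the openshift-monitoring project when OpenShift is installed. It monitors the core OpenShift components, including Kubernetes services, and the default stack also supports monitoring the health of the cluster remotely.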

User Workload Monitoring

Likewise, for User Workload Monitoring:

After optionally enabling monitoring for user-defined projects, additional monitoring components are installed in the openshift-user-workload-monitoring project. This provides monitoring for user-defined projects. These components are illustrated in the User section in the following diagram.

That is, once monitoring for user-defined projects is enabled, additional monitoring components are installed into the openshift-user-workload-monitoring project to monitor those projects.

The quoted text does not mention it, but Alertmanager also appears among the User Workload Monitoring components. OpenShift can therefore run two Alertmanagers, one for each stack.

To make the two roles easier to talk about, the rest of this post refers to Cluster Monitoring as the administrator's monitoring and to User Workload Monitoring as the user's monitoring.

Controlling alert notifications

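Administrators monitor the state of the cluster, and users likewise monitor the applications they operate. Users may want to send alert notifications, or to manage the notification destinations themselves. OpenShift can handle the following cases.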

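Users want to send alert notifications, but the destinations are managed by the administrator
Users want to send alert notifications and also manage the destinations themselves
The administrator also adds custom alerts and wants to separate that load from the users' (users then have to manage their own destination settings)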

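The sections below go through each case with its configuration and a check on a real cluster. The checks use OpenShift 4.17, with Slack as the destination for the administrator's Alertmanager. The user's destination is also Slack. Sensitive information is masked, so substitute the values for your own environment if you try this.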

Case 1: Users only want to configure alerts

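In OpenShift, enabling the user's monitoring lets users define their own alerts. In this case there is normally only the administrator's Alertmanager, so notification destinations are managed by the administrator alone. The alert used for verification is named ExampleUserAlert and is defined in the alert-sample project. The administrator's and user's monitoring settings are as follows.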

Administrator's monitoring configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

No user monitoring configuration is needed for alert notifications.
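The post does not show the definition of ExampleUserAlert itself. As a minimal sketch, assuming it is created as a PrometheusRule in the alert-sample project, it might look like the following. The resource name, the always-true expression, and the severity label are assumptions for illustration; only the alert name, the project, and the summary text come from the results shown in this post.

```yaml
# Hypothetical definition of the sample alert (not taken from the original post).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-user-alert        # assumed resource name
  namespace: alert-sample
spec:
  groups:
  - name: example
    rules:
    - alert: ExampleUserAlert
      # Assumed expression: always returns a value, so the alert keeps firing.
      expr: vector(1)
      labels:
        severity: warning         # assumed severity
      annotations:
        summary: This is sample summary
```

A real alert would use an expression over the application's own metrics instead of a constant.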

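With this configuration, when a user defines an alert and it fires, it is delivered to the administrator's Alertmanager. It is crude, but the following script is used to check the state.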

echo "User: alerts"
oc exec -it alertmanager-user-workload-0 -n openshift-user-workload-monitoring -- amtool alert query --alertmanager.url http://localhost:9093

echo "Cluster: alerts"
oc exec -it alertmanager-main-1 -n openshift-monitoring -- amtool alert query --alertmanager.url http://localhost:9093

echo "User workload: alertmanager.yaml"
oc exec -it alertmanager-user-workload-0 -n openshift-user-workload-monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml; echo ""

echo "Cluster: alertmanager.yaml"
oc exec -it alertmanager-main-1 -n openshift-monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml; echo ""

The results are below. Partly because only the administrator's Alertmanager exists, the alert shows up in the administrator's Alertmanager. The only destination configured is the Slack channel set by the administrator.

User: alerts
Error from server (NotFound): pods "alertmanager-user-workload-0" not found

Cluster: alerts
Alertname                            Starts At                Summary                                                                                                State
Watchdog                             2024-12-05 23:06:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.                active
UpdateAvailable                      2024-12-05 23:07:36 UTC  Your upstream update recommendation service recommends you update your cluster.                        active
PrometheusOperatorRejectedResources  2024-12-05 23:12:32 UTC  Resources rejected by Prometheus operator                                                              active
InsightsRecommendationActive         2024-12-05 23:15:03 UTC  An Insights recommendation is active for this cluster.                                                 active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
ClusterNotUpgradeable                2024-12-06 00:07:40 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                              active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                              active
PodDisruptionBudgetAtLimit           2024-12-06 00:09:08 UTC  The pod disruption budget is preventing further disruption to pods.                                    active
ExampleUserAlert                     2024-12-06 12:19:23 UTC  This is sample summary                                                                                 active

User workload: alertmanager.yaml
Error from server (NotFound): pods "alertmanager-user-workload-0" not found

Cluster: alertmanager.yaml
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - channel: '#openshift-on-kvm'
        api_url: >-
          https://hooks.slack.com/services/XXXXXX
  - name: Watchdog
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: Default
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: Watchdog
    - matchers:
        - severity = critical
      receiver: Critical

Case 2: Users want to configure both alerts and notification destinations

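For users to configure alert destinations, they create an AlertmanagerConfig resource, and that configuration has to be applied to the administrator's Alertmanager. An AlertmanagerConfig resource created by a user is namespace-scoped, so it must be configured per Namespace/Project.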

The configuration is as follows.

Administrator's monitoring configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true 
    alertmanagerMain:
      enableUserAlertmanagerConfig: true

No user monitoring configuration is needed for alert notifications.

With this in place, configure an AlertmanagerConfig. The Slack URL for the destination is defined under the url key of a Secret named webhook.

apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: slack-routing
spec:
  route:
    receiver: sample
  receivers:
  - name: sample
    slackConfigs:
     - channel: '#openshift-on-kvm'
       apiURL:
         name: webhook
         key: url
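The AlertmanagerConfig above refers to a Secret named webhook with a key named url, which the post does not show. A minimal sketch of such a Secret follows, assuming it lives in the alert-sample project alongside the AlertmanagerConfig (the Secret is looked up in the AlertmanagerConfig's own namespace). The URL is the masked placeholder used in this post, not a real webhook.

```yaml
# Hypothetical Secret holding the Slack webhook URL (not taken from the original post).
apiVersion: v1
kind: Secret
metadata:
  name: webhook                 # matches apiURL.name in the AlertmanagerConfig
  namespace: alert-sample       # assumed: same project as the AlertmanagerConfig
type: Opaque
stringData:
  # Masked placeholder; replace with the webhook URL for your own Slack workspace.
  url: https://hooks.slack.com/services/YYYYYY
```

Keeping the URL in a Secret means the AlertmanagerConfig itself can be shared or version-controlled without exposing the webhook.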

Here are the results some time after the sample alert was set up. The alert is delivered to the administrator's Alertmanager, and a destination for the alert-sample project has been added to its configuration.

User: alerts
Error from server (NotFound): pods "alertmanager-user-workload-0" not found

Cluster: alerts
Alertname                            Starts At                Summary                                                                                                State
Watchdog                             2024-12-05 23:06:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.                active
UpdateAvailable                      2024-12-05 23:07:36 UTC  Your upstream update recommendation service recommends you update your cluster.                        active
PrometheusOperatorRejectedResources  2024-12-05 23:12:32 UTC  Resources rejected by Prometheus operator                                                              active
InsightsRecommendationActive         2024-12-05 23:15:03 UTC  An Insights recommendation is active for this cluster.                                                 active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
ClusterNotUpgradeable                2024-12-06 00:07:40 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                              active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                              active
PodDisruptionBudgetAtLimit           2024-12-06 00:09:08 UTC  The pod disruption budget is preventing further disruption to pods.                                    active
ExampleUserAlert                     2024-12-06 12:19:23 UTC  This is sample summary                                                                                 active

User workload: alertmanager.yaml
Error from server (NotFound): pods "alertmanager-user-workload-0" not found


Cluster: alertmanager.yaml
route:
  receiver: Default
  group_by:
  - namespace
  routes:
  - receiver: alert-sample/slack-routing/sample
    matchers:
    - namespace="alert-sample"
    continue: true
  - receiver: Watchdog
    matchers:
    - alertname = Watchdog
  - receiver: Critical
    matchers:
    - severity = critical
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
inhibit_rules:
- target_matchers:
  - severity =~ warning|info
  source_matchers:
  - severity = critical
  equal:
  - namespace
  - alertname
- target_matchers:
  - severity = info
  source_matchers:
  - severity = warning
  equal:
  - namespace
  - alertname
receivers:
- name: Critical
- name: Default
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXXXX
    channel: '#openshift-on-kvm'
- name: Watchdog
- name: alert-sample/slack-routing/sample
  slack_configs:
  - api_url: https://hooks.slack.com/services/YYYYYY
    channel: '#openshift-on-kvm'
templates: []

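The Alertmanager receiver name alert-sample/slack-routing/sample is long (it is built from the namespace, the AlertmanagerConfig name, and the receiver name), but alerts whose namespace is alert-sample are routed to it.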

Case 3: The administrator also sends many alert notifications and wants to separate that load from the users'

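This one comes from the administrator's side rather than the users'. To split the load of user alert notifications from that of administrator alert notifications, each side gets its own Alertmanager.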

The configuration is as follows. The user's monitoring configuration enables Alertmanager and enables AlertmanagerConfig.

Administrator's monitoring configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

User's monitoring configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true

With this in place, configure an AlertmanagerConfig. Its contents are the same as in the previous case:

apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: slack-routing
spec:
  route:
    receiver: sample
  receivers:
  - name: sample
    slackConfigs:
     - channel: '#openshift-on-kvm'
       apiURL:
         name: webhook
         key: url

Here are the results some time after the sample alert was set up. The alert is delivered to the user's Alertmanager, and the destination for the alert-sample project is also configured in the user's Alertmanager.

User: alerts
Alertname         Starts At                Summary                 State
ExampleUserAlert  2024-12-06 12:47:43 UTC  This is sample summary  active

Cluster: alerts
Alertname                            Starts At                Summary                                                                                                State
Watchdog                             2024-12-05 23:06:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.                active
UpdateAvailable                      2024-12-05 23:07:36 UTC  Your upstream update recommendation service recommends you update your cluster.                        active
PrometheusOperatorRejectedResources  2024-12-05 23:12:32 UTC  Resources rejected by Prometheus operator                                                              active
InsightsRecommendationActive         2024-12-05 23:15:03 UTC  An Insights recommendation is active for this cluster.                                                 active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                                       active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                                            active
ClusterNotUpgradeable                2024-12-06 00:07:40 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                              active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                              active
PodDisruptionBudgetAtLimit           2024-12-06 00:09:08 UTC  The pod disruption budget is preventing further disruption to pods.                                    active

User workload: alertmanager.yaml
route:
  receiver: Default
  group_by:
  - namespace
  routes:
  - receiver: alert-sample/slack-routing/sample
    matchers:
    - namespace="alert-sample"
    continue: true
receivers:
- name: Default
- name: alert-sample/slack-routing/sample
  slack_configs:
  - api_url: https://hooks.slack.com/services/YYYYY
    channel: '#openshift-on-kvm'
templates: []


Cluster: alertmanager.yaml
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - channel: '#openshift-on-kvm'
        api_url: >-
          https://hooks.slack.com/services/XXXXX
  - name: Watchdog
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: Default
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: Watchdog
    - matchers:
        - severity = critical
      receiver: Critical

Summary

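By changing the administrator's and user's settings, alert notifications can be configured in various ways, such as delegating destination settings to users. With this you should be able to design alerting with confidence. Finally, I had ChatGPT come up with a riddle, which I leave here as the closing note.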

What do a letter to Santa Claus and an alert notification destination have in common?

In both cases, who it is "delivered to" matters! 🎅📩

Thanks for reading.