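Observability Best Practices

Using Prometheus for production-scale monitoring

The recommended approach for production-scale monitoring of Istio meshes with Prometheus is to combine hierarchical federation with a collection of recording rules.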

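Installing Istio does not deploy Prometheus by default. However, the Getting Started instructions install the Option 1: Quick Start deployment of Prometheus described in the Prometheus integration guide. This deployment of Prometheus is intentionally configured with a very short retention window (6 hours). The quick-start Prometheus deployment is also configured to collect metrics from each Envoy proxy running in the mesh, augmenting each metric with a set of labels about its origin (instance, pod, and namespace).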

Figure: Architecture for production-scale monitoring of Istio using Prometheus.

Workload-level aggregation via recording rules

To aggregate metrics across instances and pods, update the default Prometheus configuration with the following recording rules:

groups:
- name: "istio.recording-rules"
  interval: 5s
  rules:
  - record: "workload:istio_requests_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)

  - record: "workload:istio_request_duration_milliseconds_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)

  - record: "workload:istio_request_duration_milliseconds_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)

  - record: "workload:istio_request_duration_milliseconds_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)

  - record: "workload:istio_request_bytes_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)

  - record: "workload:istio_request_bytes_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)

  - record: "workload:istio_request_bytes_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)

  - record: "workload:istio_response_bytes_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)

  - record: "workload:istio_response_bytes_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)

  - record: "workload:istio_response_bytes_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)

  - record: "workload:istio_tcp_sent_bytes_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total)

  - record: "workload:istio_tcp_received_bytes_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total)

  - record: "workload:istio_tcp_connections_opened_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)

  - record: "workload:istio_tcp_connections_closed_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)

If you are using the Prometheus Operator, use the following PrometheusRule resource instead:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-metrics-aggregation
  labels:
    app.kubernetes.io/name: istio-prometheus
spec:
  groups:
  - name: "istio.metricsAggregation-rules"
    interval: 5s
    rules:
    - record: "workload:istio_requests_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)"

    - record: "workload:istio_request_duration_milliseconds_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)"
    - record: "workload:istio_request_duration_milliseconds_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)"
    - record: "workload:istio_request_duration_milliseconds_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)"

    - record: "workload:istio_request_bytes_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)"
    - record: "workload:istio_request_bytes_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)"
    - record: "workload:istio_request_bytes_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)"

    - record: "workload:istio_response_bytes_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)"
    - record: "workload:istio_response_bytes_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)"
    - record: "workload:istio_response_bytes_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)"

    - record: "workload:istio_tcp_sent_bytes_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total)"
    - record: "workload:istio_tcp_received_bytes_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total)"
    - record: "workload:istio_tcp_connections_opened_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)"
    - record: "workload:istio_tcp_connections_closed_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)"
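
For the plain rule group (the first listing above), Prometheus only evaluates the rules once the file that contains them is listed under rule_files in its main configuration. The fragment below is a minimal sketch of that wiring. The file path is a hypothetical placeholder and the evaluation_interval value is an assumption; neither comes from the Istio quick-start deployment.

```yaml
# prometheus.yml (fragment)
global:
  # Default evaluation interval for rule groups that do not set their own.
  # The "istio.recording-rules" group overrides this with interval: 5s.
  evaluation_interval: 15s

rule_files:
  # Hypothetical path: point this at the file that holds the
  # "istio.recording-rules" group shown above.
  - /etc/prometheus/rules/istio-recording-rules.yaml
```

Before reloading Prometheus, the rule file can be validated with promtool check rules followed by the file path.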

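Federation using workload-level aggregated metrics

To establish Prometheus federation, modify the configuration of your production-ready deployment of Prometheus to scrape the federation endpoint of the Istio Prometheus.

Add the following job to your configuration: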

- job_name: 'istio-prometheus'
  honor_labels: true
  metrics_path: '/federate'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['istio-system']
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'workload:(.*)'
    target_label: __name__
    action: replace
  params:
    'match[]':
    - '{__name__=~"workload:(.*)"}'
    - '{__name__=~"pilot(.*)"}'

If you are using the Prometheus Operator, use the following configuration instead:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-federation
  labels:
    app.kubernetes.io/name: istio-prometheus
spec:
  namespaceSelector:
    matchNames:
    - istio-system
  selector:
    matchLabels:
      app: prometheus
  endpoints:
  - interval: 30s
    scrapeTimeout: 30s
    params:
      'match[]':
      - '{__name__=~"workload:(.*)"}'
      - '{__name__=~"pilot(.*)"}'
    path: /federate
    targetPort: 9090
    honorLabels: true
    metricRelabelings:
    - sourceLabels: ["__name__"]
      regex: 'workload:(.*)'
      targetLabel: "__name__"
      action: replace

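The key to the federation configuration is to match on the job in the Istio-deployed Prometheus that collects the Istio Standard Metrics, and to rename the collected metrics by removing the prefix used in the workload-level recording rules (workload:). As a result, existing dashboards and queries continue to work seamlessly when they are pointed at the production Prometheus instance instead of the Istio instance.

You can include additional metrics (for example, Envoy and Go metrics) when setting up federation.

Control plane metrics are also collected and federated up to the production Prometheus.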
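
As a sketch of how additional metrics could be included, the params section of the federation job can carry extra match[] selectors. The envoy_ and go_ name prefixes below are assumptions about how those metrics are named in your Istio Prometheus; check the actual metric names before relying on them.

```yaml
# Fragment of the federation scrape job: only the params section is shown.
params:
  'match[]':
  - '{__name__=~"workload:(.*)"}'   # workload-level recording rules
  - '{__name__=~"pilot(.*)"}'       # control plane metrics
  # Assumed extra selectors; adjust the prefixes to the metrics you need.
  - '{__name__=~"envoy_(.*)"}'
  - '{__name__=~"go_(.*)"}'
```

Each extra selector increases the number of series pulled through /federate on every scrape, so it is worth adding only the families that dashboards or alerts actually use.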

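Optimizing metrics collection with recording rules

Beyond using recording rules to aggregate over pods and instances, you may want to use recording rules to generate aggregated metrics tailored to your existing dashboards and alerts. Optimizing collection in this way can significantly reduce the resource consumption of your production Prometheus instances and improve query performance.

For example, imagine a custom monitoring dashboard that uses the following Prometheus queries:

Total rate of requests averaged over the past minute, by destination service name and namespace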

sum(irate(istio_requests_total{reporter="source"}[1m]))
by (
    destination_canonical_service,
    destination_workload_namespace
)

P95 client latency averaged over the past minute, by source and destination service name and namespace

histogram_quantile(0.95,
  sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
  by (
    destination_canonical_service,
    destination_workload_namespace,
    source_canonical_service,
    source_workload_namespace,
    le
  )
)

To make these metrics easy to identify for federation, you can add the following set of recording rules, which use the istio prefix, to the Istio Prometheus configuration:

groups:
- name: "istio.recording-rules"
  interval: 5s
  rules:
  - record: "istio:istio_requests:by_destination_service:rate1m"
    expr: |
      sum(irate(istio_requests_total{reporter="source"}[1m]))
      by (
        destination_canonical_service,
        destination_workload_namespace
      )
  - record: "istio:istio_request_duration_milliseconds_bucket:p95:rate1m"
    expr: |
      histogram_quantile(0.95,
        sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
        by (
          destination_canonical_service,
          destination_workload_namespace,
          source_canonical_service,
          source_workload_namespace,
          le
        )
      )

The production Prometheus instance would then be updated to federate from the Istio instance with:

a match clause of {__name__=~"istio:(.*)"}
a metric relabeling configuration with regex: "istio:(.*)"
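
Put together, the two settings above might look like the following scrape job. This is a sketch that reuses the job name, service discovery, and namespace from the earlier federation job; it assumes the Istio Prometheus still runs in istio-system.

```yaml
- job_name: 'istio-prometheus'
  honor_labels: true
  metrics_path: '/federate'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['istio-system']
  metric_relabel_configs:
  # Strip the "istio:" prefix from federated series names.
  # replacement defaults to '$1', the first capture group of the regex.
  - source_labels: [__name__]
    regex: 'istio:(.*)'
    target_label: __name__
    action: replace
  params:
    'match[]':
    # Federate only the dashboard-specific recording rules.
    - '{__name__=~"istio:(.*)"}'
```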

The original queries would then be replaced with:

istio_requests:by_destination_service:rate1m
avg(istio_request_duration_milliseconds_bucket:p95:rate1m)