Configuring webhook alerts for kube-prometheus
This setup uses a Telegram bot to deliver alert messages. First, talk to BotFather to create a bot and note its API token. Next, get the chat ID of the target group: search for the Get My ID bot and add it to the group, and it will reply with the group's chat ID; note that down as well. Then prepare the Python script below, substituting your own token and chat ID.
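Before running anything, you can sanity-check the token and chat ID directly against the Bot API; a quick sketch (fill in your own values for the placeholders):

# should return "ok":true plus the bot's username
curl "https://api.telegram.org/bot<your-token>/getMe"
# after adding the bot to the group, the updates listed here include the chat id
curl "https://api.telegram.org/bot<your-token>/getUpdates"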
import time

from flask import Flask, request
import requests

app = Flask(__name__)

TELEGRAM_BOT_TOKEN = 'the bot API token'
TELEGRAM_CHAT_ID = 'the chat id (starts with -)'


@app.route('/alert', methods=['POST'])
def post_alert():
    # Alertmanager POSTs its webhook payload here as JSON
    alert = request.get_json()
    send_telegram_message(alert)
    return '', 200


def send_telegram_message(message):
    receiver = message['receiver']
    status = message['status']
    # Collect every alert contained in the webhook payload
    alerts = message['alerts']
    messages = []
    for alert in alerts:
        alert_status = alert['status']
        alert_name = alert['labels']['alertname']
        severity = alert['labels']['severity']
        description = alert['annotations']['description']
        summary = alert['annotations']['summary']
        runbook_url = alert['annotations'].get('runbook_url', '')
        # Format one alert as a single message block
        text = (f"Receiver: {receiver}"
                f"\nStatus: {status}"
                f"\nAlert name: {alert_name}"
                f"\nAlert status: {alert_status}"
                f"\nSeverity: {severity}"
                f"\nDescription: {description}"
                f"\nSummary: {summary}"
                f"\nRunbook: {runbook_url}")
        messages.append(text)
    # Join all alerts into one notification
    alerts_info = "\n---\n".join(messages)
    url = f'https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage'
    payload = {
        'chat_id': TELEGRAM_CHAT_ID,
        'text': alerts_info,
        # NOTE: Telegram rejects 'Markdown' messages containing unescaped markup characters
        'parse_mode': 'Markdown'
    }
    # Retry until the request to Telegram goes through
    while True:
        try:
            requests.post(url, json=payload, timeout=10)
            break
        except Exception as e:
            print(e)
            time.sleep(3)


if __name__ == '__main__':
    app.run(port=5001, host="0.0.0.0")
The port is up to you; 5001 is used here. host controls which addresses the server listens on, and 0.0.0.0 means all interfaces. Run the script and test whether it can deliver a message to the channel. If it fails with missing libraries, install the dependencies first:
pip3 install flask
pip3 install requests
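To start the hook, and keep it running once it is verified, a minimal approach, assuming the script is saved as app.py:

python3 app.py
# once it works, detach it so it survives the SSH session
nohup python3 app.py > hook.log 2>&1 &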
Then prepare a JSON payload and POST it to the script's address and port:
curl -X POST -H "Content-Type: application/json" -d '{"receiver":"telegram-webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"KubeControllerManagerDown","prometheus":"monitoring/k8s","severity":"critical"},"annotations":{"description":"KubeControllerManager has disappeared from Prometheus target discovery.","runbook_url":"https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontrollermanagerdown","summary":"Target disappeared from Prometheus target discovery."},"startsAt":"2024-02-17T02:43:32.85Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus-k8s-0:9090/graph?g0.expr=absent%28up%7Bjob%3D%22kube-controller-manager%22%7D+%3D%3D+1%29&g0.tab=1","fingerprint":"838121cc4ca56ab5"},{"status":"firing","labels":{"alertname":"KubeSchedulerDown","prometheus":"monitoring/k8s","severity":"critical"},"annotations":{"description":"KubeScheduler has disappeared from Prometheus target discovery.","runbook_url":"https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeschedulerdown","summary":"Target disappeared from Prometheus target discovery."},"startsAt":"2024-02-17T02:43:21.848Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus-k8s-0:9090/graph?g0.expr=absent%28up%7Bjob%3D%22kube-scheduler%22%7D+%3D%3D+1%29&g0.tab=1","fingerprint":"92b2f1b7ee31decf"}],"groupLabels":{},"commonLabels":{"prometheus":"monitoring/k8s","severity":"critical"},"commonAnnotations":{"summary":"Target disappeared from Prometheus target discovery."},"externalURL":"http://alertmanager-main-1:9093","version":"4","truncatedAlerts":0}' http://<script-host>:5001/alert
The request path must include /alert. After sending the request, the bot's alert message should appear in the Telegram group.
If the message arrives, the script works; it just needs to keep running on one server. Next, configure the alert receiver in Alertmanager. With the kube-prometheus stack, edit Alertmanager's configuration YAML directly: cat manifests/alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.25.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'job']
      group_wait: 10s
      group_interval: 10m
      repeat_interval: 1h
      receiver: 'telegram-bot-webhook'
    receivers:
      - name: 'telegram-bot-webhook'
        webhook_configs:
          - url: 'http://<server-ip-running-the-script>:5001/alert'
            send_resolved: true
type: Opaque
group_wait: how long Alertmanager waits after the first alert of a group fires, to collect other alerts that may belong to the same group. Set to 10s, it means Alertmanager waits 10 seconds after the first alert to see whether related alerts also fire, so they can be sent together and reduce the number of individual notifications.
group_interval: once a notification for a group has been sent, how long Alertmanager waits before sending the next batch of notifications for that same group. Set to 10m, it means that even if alerts in the group are still firing, Alertmanager waits 10 minutes before notifying about the group again.
repeat_interval: how long Alertmanager waits before re-sending a notification for an alert it has already notified about, even if no new alerts join the group. Set to 1h, it means a continuously firing alert is re-notified once per hour. Adjust these three parameters to suit your needs. After editing, apply the file and restart the Alertmanager pods; once an alert fires, Alertmanager posts the payload to the script, which relays it to the Telegram channel.
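A minimal sketch of the apply-and-restart step, assuming the default kube-prometheus manifest path and pod labels:

kubectl apply -f manifests/alertmanager-secret.yaml
# deleting the pods forces them to restart with the new configuration
kubectl -n monitoring delete pod -l app.kubernetes.io/name=alertmanager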
The script can also run as a container inside Kubernetes. Create a directory, put the script in it, write a Dockerfile, and build an image:
FROM python:3.8-slim
# Set the working directory
WORKDIR /src
# Copy the current directory into the container's working directory
COPY . /src
# Install the dependencies
RUN pip install flask requests
# Start the script
CMD ["python", "app.py"]
Then build the image:
docker build -t <image-name>:<tag> .
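Then tag and push it to a registry the cluster can pull from; a sketch assuming Docker Hub and a hypothetical repository name:

docker tag <image-name>:<tag> <dockerhub-user>/telegram-hook:<tag>
docker push <dockerhub-user>/telegram-hook:<tag>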
A local registry, Docker Hub, or any cloud registry works. Once the image is pushed, create a Deployment plus a Service in the cluster that reference it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telegram-hook
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: telegram-hook
  template:
    metadata:
      labels:
        app: telegram-hook
    spec:
      containers:
        - name: telegram-hook
          image: <script-image>
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5001
              name: http
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
            limits:
              cpu: 50m
              memory: 100Mi
---
apiVersion: v1
kind: Service
metadata:
  name: telegram-hook
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: telegram-hook
  ports:
    - name: hook
      port: 5001
      targetPort: http
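After applying these manifests, the hook can be sanity-checked from inside the cluster; a sketch using a throwaway curl pod (the test payload is minimal but contains every field the script reads):

kubectl -n monitoring run hook-test --rm -it --restart=Never --image=curlimages/curl --command -- curl -s -X POST -H "Content-Type: application/json" -d '{"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","severity":"none"},"annotations":{"description":"connectivity test","summary":"connectivity test"}}]}' http://telegram-hook.monitoring.svc.cluster.local:5001/alert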
The YAML above is a reference; adjust it to your cluster. If the script runs inside Kubernetes, Alertmanager's configuration needs updating as well:
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.25.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'job']
      group_wait: 10s
      group_interval: 10m
      repeat_interval: 1h
      receiver: 'telegram-bot-webhook'
    receivers:
      - name: 'telegram-bot-webhook'
        webhook_configs:
          - url: 'http://telegram-hook.monitoring.svc.cluster.local:5001/alert'
            send_resolved: true
type: Opaque
It is the same file; the url simply switches to the in-cluster service address. After changing the configuration, delete the three alertmanager pods so they restart with the new settings.
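To confirm Alertmanager actually loaded the new configuration, one option is to port-forward to it and check its status endpoint, which embeds the active config (the service name is the kube-prometheus default):

kubectl -n monitoring port-forward svc/alertmanager-main 9093 &
curl -s http://localhost:9093/api/v2/status | grep telegram-bot-webhook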