Duty AI Ops
AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.
Features
- 🤖 AI-Powered Diagnostics: Uses LLM with function calling to analyze Kubernetes issues
- 🔍 Smart Event Detection: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restarts, etc.)
- 🧠 Correlation Engine: Detects patterns such as many pods failing on one node and diagnoses the shared root cause (the node) instead of each pod separately
- 📱 Telegram Notifications: Optional real-time alerts
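As a rough illustration of the event detection listed above, the following Rust sketch classifies a single container status. The `ContainerStatus` struct, the `PodProblem` enum, the `detect_pod_problem` function, and the restart threshold are hypothetical names chosen for this example, not the project's actual types. A real implementation would fill these fields from the Kubernetes API.

```rust
/// Problems this sketch can recognise (hypothetical; the real tool may track more).
#[derive(Debug, PartialEq)]
pub enum PodProblem {
    /// The container is waiting with reason "CrashLoopBackOff".
    CrashLoopBackOff,
    /// The container has restarted at least `restart_threshold` times.
    HighRestarts(u32),
}

/// Minimal stand-in for the fields of a Kubernetes container status
/// that matter here (assumed shape, not the project's type).
pub struct ContainerStatus<'a> {
    /// `state.waiting.reason` from the pod status, if the container is waiting.
    pub waiting_reason: Option<&'a str>,
    /// `restartCount` from the pod status.
    pub restart_count: u32,
}

/// Return the first problem found, or `None` if the container looks healthy.
/// `restart_threshold` is an assumed tunable, not a documented setting.
pub fn detect_pod_problem(
    status: &ContainerStatus,
    restart_threshold: u32,
) -> Option<PodProblem> {
    if status.waiting_reason == Some("CrashLoopBackOff") {
        return Some(PodProblem::CrashLoopBackOff);
    }
    if status.restart_count >= restart_threshold {
        return Some(PodProblem::HighRestarts(status.restart_count));
    }
    None
}
```

Checking the waiting reason before the restart count means a crash-looping container is reported as such even when its restart count is also high.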
How It Works
- Watches the Kubernetes cluster for pod/node events
- Correlates issues (e.g., if 5+ pods fail on the same node within 60s, it diagnoses the node instead of each pod)
- Analyzes using AI with access to k8s API tools (get pod details, logs, node status)
- Notifies via Telegram with structured diagnostic reports
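The correlation step above (5+ pod failures on one node within 60s) can be sketched as a per-node sliding window. This is a minimal sketch under stated assumptions, not the project's actual engine: `Correlator`, `Target`, and `record` are hypothetical names, and the threshold and window are passed in rather than read from any real config key.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// What to diagnose after a pod failure has been recorded.
#[derive(Debug, PartialEq)]
pub enum Target {
    /// Too few recent failures on the node: diagnose the pod itself.
    Pod(String),
    /// Threshold reached: diagnose the node as the likely common cause.
    Node(String),
}

/// Sliding-window count of pod failures per node (hypothetical type).
pub struct Correlator {
    threshold: usize,
    window: Duration,
    failures: HashMap<String, VecDeque<Instant>>,
}

impl Correlator {
    pub fn new(threshold: usize, window: Duration) -> Self {
        Self { threshold, window, failures: HashMap::new() }
    }

    /// Record a pod failure seen at `now` and decide what to diagnose.
    /// Callers are expected to pass non-decreasing timestamps.
    pub fn record(&mut self, node: &str, pod: &str, now: Instant) -> Target {
        let times = self.failures.entry(node.to_string()).or_default();
        times.push_back(now);
        // Drop failures that have aged out of the window.
        while let Some(&oldest) = times.front() {
            if now.saturating_duration_since(oldest) > self.window {
                times.pop_front();
            } else {
                break;
            }
        }
        if times.len() >= self.threshold {
            Target::Node(node.to_string())
        } else {
            Target::Pod(pod.to_string())
        }
    }
}
```

With `Correlator::new(5, Duration::from_secs(60))`, the first four failures on a node each yield `Target::Pod`, and the fifth within the window yields `Target::Node`. Taking `now` as a parameter instead of calling `Instant::now()` inside keeps the logic easy to test.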
Installation
# Clone repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops
# Build
cargo build --release
# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings
Configuration
# API Configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1" # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest" # any model supporting tools
# Concurrency
max_concurrent_diagnoses = 1 # maximum number of LLM API requests running in parallel
# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"
# System prompt (customize AI behavior)
system_prompt = """..."""
Usage
# Run with the default kubeconfig
cargo run --release
# Or use a specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release
Requirements
- Rust 1.70+
- Kubernetes cluster access (via kubeconfig or a Kubernetes service account)
- OpenAI-compatible LLM endpoint with function calling support
  - Tested with: Ollama (qwen3-tools, devstral-small-2)
  - Should work with: OpenAI, vLLM, etc.
Example Output
🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet.
12 pods evicted due to insufficient disk space.
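A report in the shape shown above could be rendered from a small struct. The sketch below is illustrative only: `Report` and `format_report` are hypothetical names, and the layout simply mirrors the example output rather than the tool's real message template.

```rust
/// Fields of a diagnostic report, mirroring the example output
/// (hypothetical struct, not the project's actual type).
pub struct Report<'a> {
    /// Resource kind, e.g. "Node" or "Pod".
    pub kind: &'a str,
    /// Resource name, e.g. "worker-1".
    pub name: &'a str,
    /// One-line problem summary.
    pub problem: &'a str,
    /// Longer root-cause explanation; may span several lines.
    pub root_cause: &'a str,
}

/// Render the report as plain text, one labelled field per line.
pub fn format_report(report: &Report) -> String {
    format!(
        "🔍 {}: {}\n📋 Problem: {}\n🔎 Root Cause: {}",
        report.kind, report.name, report.problem, report.root_cause
    )
}
```

Keeping the formatting in a pure function like this would let the same text be logged locally when Telegram notifications are disabled.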
Architecture
- Event Handlers: Process pod/node watch events
- Correlation Engine: Detect patterns (mass failures, repeated issues)
- AI Diagnostics: LLM with k8s tools analyzes and generates reports
- Telegram Notifier: Sends/updates notifications
License
WTFPL