# Duty AI Ops

AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.

## Features

- 🤖 **AI-Powered Diagnostics**: Uses an LLM with function calling to analyze Kubernetes issues
- 🔍 **Smart Event Detection**: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restart counts, etc.)
- 🧠 **Correlation Engine**: Detects patterns such as mass pod failures and diagnoses the root cause (e.g., a node issue) instead of each individual symptom
- 📱 **Telegram Notifications**: Optional real-time alerts with automatic "resolved" status updates
- ⚡ **Concurrent Processing**: Configurable parallel AI diagnosis requests with semaphore-based rate limiting (see the concurrency sketch at the end of this README)
- 🎯 **Resource-Aware**: Tracks pod CPU/memory requests and limits for better diagnostics

## How It Works

1. **Watches** the Kubernetes cluster for pod/node events
2. **Correlates** issues (e.g., if 5+ pods fail on the same node within 60s, diagnose the node; see the correlation sketch at the end of this README)
3. **Analyzes** the issue using AI with access to Kubernetes API tools (get pod details, logs, node status)
4. **Notifies** via Telegram with structured diagnostic reports
5. **Tracks** issue resolution and updates notifications automatically

## Installation

```bash
# Clone repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops

# Build
cargo build --release

# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings
```

## Configuration

```toml
# API Configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1"  # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest"

# Concurrency
max_concurrent_diagnoses = 1

# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"

# System prompt (customize AI behavior)
system_prompt = """..."""
```

## Usage

```bash
# Run with the default kubeconfig
cargo run --release

# Or use a specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release
```

## Requirements

- Rust 1.70+
- Kubernetes cluster access (via kubeconfig)
- OpenAI-compatible LLM endpoint with function calling support
  - Tested with: Ollama (qwen3-tools, devstral-small-2)
  - Should work with: OpenAI, vLLM, etc.

## Example Output

```
🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet. 12 pods evicted due to insufficient disk space.
```

## Architecture

- **Event Handlers**: Process pod/node watch events
- **Correlation Engine**: Detects patterns (mass failures, repeated issues)
- **AI Diagnostics**: LLM with Kubernetes tools analyzes issues and generates reports
- **Telegram Notifier**: Sends and updates notifications

## License

MIT
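
## Implementation Sketches

The sketches below illustrate two of the mechanisms described above. They are minimal, hypothetical examples: all names, types, and structure are assumptions for illustration, not the project's actual code.

### Correlation window

A sketch of the "5+ pod failures on the same node within 60s" rule from **How It Works**, using only the Rust standard library:

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical correlation window: if `threshold` pod failures are seen on the
/// same node within `window`, the node (not each pod) becomes the diagnosis target.
struct CorrelationEngine {
    window: Duration,
    threshold: usize,
    failures: HashMap<String, VecDeque<Instant>>, // node name -> recent failure times
}

impl CorrelationEngine {
    fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, failures: HashMap::new() }
    }

    /// Records a pod failure on `node`; returns true when the node itself
    /// should be diagnosed instead of the individual pod.
    fn record_pod_failure(&mut self, node: &str, now: Instant) -> bool {
        let timestamps = self.failures.entry(node.to_string()).or_default();
        timestamps.push_back(now);
        // Forget failures that fell outside the correlation window.
        while let Some(&oldest) = timestamps.front() {
            if now.duration_since(oldest) > self.window {
                timestamps.pop_front();
            } else {
                break;
            }
        }
        timestamps.len() >= self.threshold
    }
}

fn main() {
    let mut engine = CorrelationEngine::new(Duration::from_secs(60), 5);
    let start = Instant::now();
    for i in 0u64..5 {
        // Simulate five pod failures on the same node, 10 seconds apart.
        let escalate = engine.record_pod_failure("worker-1", start + Duration::from_secs(i * 10));
        println!("failure {}: diagnose the node? {}", i + 1, escalate);
    }
}
```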
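
### Concurrency limit

A sketch of semaphore-based rate limiting for `max_concurrent_diagnoses`, assuming a Tokio runtime (e.g. `tokio = { version = "1", features = ["full"] }`; the README does not state which async runtime is used):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

/// Hypothetical stand-in for one AI diagnosis request (LLM call + tool use).
async fn diagnose(issue: &str) {
    println!("diagnosing: {issue}");
    tokio::time::sleep(Duration::from_millis(200)).await;
}

#[tokio::main]
async fn main() {
    // Mirrors the `max_concurrent_diagnoses` setting: at most N diagnoses run at once.
    let max_concurrent_diagnoses = 1;
    let permits = Arc::new(Semaphore::new(max_concurrent_diagnoses));

    let mut handles = Vec::new();
    for issue in ["pod api-7f9c CrashLoopBackOff", "node worker-1 NotReady"] {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            // Wait for a permit before calling the LLM; it is released on drop.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            diagnose(issue).await;
        }));
    }
    for handle in handles {
        handle.await.expect("diagnosis task panicked");
    }
}
```

Each spawned task holds a permit only for the duration of one diagnosis, so at most `max_concurrent_diagnoses` LLM calls are in flight at any time.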