# Duty AI Ops

An AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.

## Features

- 🤖 **AI-Powered Diagnostics**: Uses an LLM with function calling to analyze Kubernetes issues
- 🔍 **Smart Event Detection**: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restart counts, etc.)
- 🧠 **Correlation Engine**: Detects patterns such as mass pod failures and diagnoses the root cause (e.g. a node issue) instead of each individual symptom
- 📱 **Telegram Notifications**: Optional real-time alerts with automatic "resolved" status updates
- ⚡ **Concurrent Processing**: Configurable parallel AI diagnosis requests with semaphore-based rate limiting
- 🎯 **Resource-Aware**: Tracks pod CPU/memory requests and limits for better diagnostics

## How It Works

1. **Watches** the Kubernetes cluster for pod/node events
2. **Correlates** issues (e.g., if 5+ pods fail on the same node within 60s, diagnose the node rather than each pod)
3. **Analyzes** issues using AI with access to Kubernetes API tools (pod details, logs, node status)
4. **Notifies** via Telegram with structured diagnostic reports
5. **Tracks** issue resolution and updates notifications automatically

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops

# Build
cargo build --release

# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings
```

## Configuration

```toml
# API configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1"  # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest"

# Concurrency
max_concurrent_diagnoses = 1

# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"

# System prompt (customize AI behavior)
system_prompt = """..."""
```

## Usage

```bash
# Run with the default kubeconfig
cargo run --release

# Or point at a specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release
```

## Requirements

- Rust 1.70+
- Kubernetes cluster access (via kubeconfig)
- OpenAI-compatible LLM endpoint with function-calling support
  - Tested with: Ollama (qwen3-tools, devstral-small-2)
  - Should work with: OpenAI, vLLM, etc.

## Example Output

```
🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet.
   12 pods evicted due to insufficient disk space.
```

## Architecture

- **Event Handlers**: Process pod/node watch events
- **Correlation Engine**: Detects patterns (mass failures, repeated issues)
- **AI Diagnostics**: An LLM with Kubernetes tools analyzes issues and generates reports
- **Telegram Notifier**: Sends and updates notifications

## License

MIT
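The correlation rule described above (N+ pod failures on the same node within a time window triggering a node-level diagnosis instead of per-pod diagnoses) can be sketched in Rust. This is a minimal illustration, not the project's actual implementation: the function name, the `(node, timestamp)` tuple shape, and the threshold/window parameters are assumptions.

```rust
use std::collections::HashMap;

/// If `threshold` or more pod failures land on the same node within
/// `window_secs`, return that node as the thing to diagnose instead of
/// the individual pods. Names and data shapes are illustrative only.
fn node_to_diagnose(
    failures: &[(String, u64)], // (node name, failure timestamp in seconds)
    window_secs: u64,
    threshold: usize, // assumed >= 1
) -> Option<String> {
    let mut by_node: HashMap<&str, Vec<u64>> = HashMap::new();
    for (node, ts) in failures {
        by_node.entry(node.as_str()).or_default().push(*ts);
    }
    for (node, mut stamps) in by_node {
        stamps.sort_unstable();
        // Check every run of `threshold` consecutive failures: if the run
        // fits inside the window, the node itself is the likely culprit.
        for run in stamps.windows(threshold) {
            if run[threshold - 1] - run[0] <= window_secs {
                return Some(node.to_string());
            }
        }
    }
    None
}

fn main() {
    let failures: Vec<(String, u64)> = [
        ("worker-1", 10), ("worker-1", 15), ("worker-1", 20),
        ("worker-1", 30), ("worker-1", 40), ("worker-2", 12),
    ]
    .iter()
    .map(|(n, t)| (n.to_string(), *t))
    .collect();
    // Five failures on worker-1 within 60s: diagnose the node.
    assert_eq!(node_to_diagnose(&failures, 60, 5), Some("worker-1".into()));
    // A single failure never trips the mass-failure rule.
    assert_eq!(node_to_diagnose(&failures[5..], 60, 5), None);
}
```

Sorting per node and sliding a fixed-size window keeps the check O(n log n) per node and avoids double-diagnosing each pod when the node is the shared cause.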
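The `max_concurrent_diagnoses` setting caps how many AI diagnosis requests run in parallel via semaphore-based rate limiting. The sketch below illustrates that idea with a std-only counting semaphore; the real tool presumably uses an async runtime's semaphore, and the function names here are invented for the example.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built from a Mutex and a Condvar.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // block until a permit is released
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Runs `jobs` stand-in "diagnosis" tasks gated by `max_concurrent` permits
/// and returns the peak number of tasks observed running simultaneously.
fn run_diagnoses(max_concurrent: usize, jobs: usize) -> usize {
    let sem = Arc::new(Semaphore::new(max_concurrent));
    let gauge = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let gauge = Arc::clone(&gauge);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut g = gauge.lock().unwrap();
                    g.0 += 1;
                    g.1 = g.1.max(g.0);
                }
                thread::sleep(Duration::from_millis(20)); // stand-in for an LLM call
                gauge.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let peak = gauge.lock().unwrap().1;
    peak
}

fn main() {
    // With a cap of 2, no more than 2 diagnoses ever overlap.
    assert!(run_diagnoses(2, 8) <= 2);
}
```

With `max_concurrent_diagnoses = 1` (the sample config's default) this degenerates to fully serialized diagnosis requests, which keeps load on a local LLM endpoint predictable.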