# Duty AI Ops

AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.

## Features

- 🤖 **AI-Powered Diagnostics**: Uses an LLM with function calling to analyze Kubernetes issues
- 🔍 **Smart Event Detection**: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restarts, etc.)
- 🧠 **Correlation Engine**: Detects patterns like mass pod failures and diagnoses the root cause (node issues) instead of individual symptoms
- 📱 **Telegram Notifications**: Optional real-time alerts with automatic "resolved" status updates
- ⚡ **Concurrent Processing**: Configurable parallel AI diagnosis requests with semaphore-based rate limiting (see the sketch below)
- 🎯 **Resource-Aware**: Tracks pod CPU/memory requests and limits for better diagnostics
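
The concurrency limiting mentioned in the ⚡ bullet above boils down to a semaphore that caps how many diagnoses run at once. Below is a minimal sketch of that pattern, assuming the tokio runtime; `diagnose` and the issue strings are placeholders, and only `max_concurrent_diagnoses` corresponds to a real config key.

```rust
// Minimal sketch of semaphore-based rate limiting with tokio.
// `diagnose` is a placeholder for one AI diagnosis request.
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn diagnose(issue: String) {
    // Stand-in for a single LLM diagnosis call.
    println!("diagnosing: {issue}");
}

#[tokio::main]
async fn main() {
    // Mirrors the `max_concurrent_diagnoses` config value.
    let max_concurrent_diagnoses = 1;
    let permits = Arc::new(Semaphore::new(max_concurrent_diagnoses));

    let issues = vec![
        "pod-a: CrashLoopBackOff".to_string(),
        "worker-1: NotReady".to_string(),
    ];

    let mut handles = Vec::new();
    for issue in issues {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            // Each task waits for a permit, so at most
            // `max_concurrent_diagnoses` diagnoses run concurrently.
            let _permit = permits.acquire().await.expect("semaphore closed");
            diagnose(issue).await;
        }));
    }

    for handle in handles {
        handle.await.expect("diagnosis task panicked");
    }
}
```

Raising `max_concurrent_diagnoses` in `config.toml` simply grants more permits, allowing more diagnoses to run in parallel.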
## How It Works

1. **Watches** the Kubernetes cluster for pod/node events
2. **Correlates** issues (e.g., if 5+ pods fail on the same node within 60s, diagnose the node; see the sketch after this list)
3. **Analyzes** using AI with access to k8s API tools (get pod details, logs, node status)
4. **Notifies** via Telegram with structured diagnostic reports
5. **Tracks** issue resolution and updates notifications automatically
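
A minimal sketch of the correlation rule from step 2, using the thresholds quoted above (5+ pod failures on the same node within 60 seconds); the types and names are illustrative, not the project's actual API:

```rust
// Illustrative only: groups recent pod failures by node and flags nodes that
// cross the threshold, so the node is diagnosed instead of each pod.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct PodFailure {
    node: String,
    at: Instant,
}

/// Returns nodes with at least `threshold` pod failures inside `window`.
fn nodes_to_diagnose(failures: &[PodFailure], window: Duration, threshold: usize) -> Vec<String> {
    let now = Instant::now();
    let mut per_node: HashMap<&str, usize> = HashMap::new();

    for failure in failures {
        // Only failures inside the correlation window count.
        if now.duration_since(failure.at) <= window {
            *per_node.entry(failure.node.as_str()).or_insert(0) += 1;
        }
    }

    per_node
        .into_iter()
        .filter(|(_, count)| *count >= threshold)
        .map(|(node, _)| node.to_string())
        .collect()
}
```

Calling `nodes_to_diagnose(&failures, Duration::from_secs(60), 5)` reproduces the rule above: any node it returns gets diagnosed as a whole instead of its individual pods.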
## Installation

```bash
# Clone repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops

# Build
cargo build --release

# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings
```
## Configuration

```toml
# API Configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1" # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest"

# Concurrency
max_concurrent_diagnoses = 1

# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"

# System prompt (customize AI behavior)
system_prompt = """..."""
```
## Usage

```bash
# Run with kubeconfig
cargo run --release

# Or use a specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release
```
## Requirements

- Rust 1.70+
- Kubernetes cluster access (via kubeconfig)
- OpenAI-compatible LLM endpoint with function calling support
  - Tested with: Ollama (qwen3-tools, devstral-small-2)
  - Should work with: OpenAI, vLLM, etc.
## Example Output

```
🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet.
12 pods evicted due to insufficient disk space.
```
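
As an illustration of how such a structured report could be modeled and rendered into the Telegram message format shown above, here is a hypothetical Rust sketch; the field and method names are invented for this example, not taken from the project:

```rust
// Hypothetical model of a structured diagnostic report; the project's real
// types are not shown here and may differ.
struct DiagnosticReport {
    target: String,     // e.g. "Node: worker-1"
    problem: String,    // short problem summary
    root_cause: String, // AI-generated root-cause analysis
}

impl DiagnosticReport {
    /// Renders the report in the notification format shown above.
    fn to_message(&self) -> String {
        format!(
            "🔍 {}\n📋 Problem: {}\n🔎 Root Cause: {}",
            self.target, self.problem, self.root_cause
        )
    }
}

fn main() {
    let report = DiagnosticReport {
        target: "Node: worker-1".to_string(),
        problem: "Node NotReady due to disk pressure".to_string(),
        root_cause: "Node has exceeded 85% disk usage on /var/lib/kubelet.".to_string(),
    };
    println!("{}", report.to_message());
}
```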
## Architecture

- **Event Handlers**: Process pod/node watch events
- **Correlation Engine**: Detect patterns (mass failures, repeated issues)
- **AI Diagnostics**: LLM with k8s tools analyzes and generates reports
- **Telegram Notifier**: Sends/updates notifications
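
Purely as an illustration of how these stages could be wired together, the sketch below connects them with tokio channels; every type and name in it is hypothetical, and the project's real internals may look quite different.

```rust
// Hypothetical wiring of the stages above with channels; not the project's code.
use tokio::sync::mpsc;

#[derive(Debug)]
struct ClusterIssue {
    kind: String,   // e.g. "NodeNotReady", "CrashLoopBackOff"
    target: String, // pod or node name
}

#[tokio::main]
async fn main() {
    // Event handlers feed raw issues into the correlation stage...
    let (event_tx, mut event_rx) = mpsc::channel::<ClusterIssue>(64);
    // ...which forwards only the issues worth a full AI diagnosis.
    let (diag_tx, mut diag_rx) = mpsc::channel::<ClusterIssue>(64);

    let correlator = tokio::spawn(async move {
        while let Some(issue) = event_rx.recv().await {
            // Real correlation logic would group per node, as sketched earlier.
            let _ = diag_tx.send(issue).await;
        }
    });

    let diagnoser = tokio::spawn(async move {
        while let Some(issue) = diag_rx.recv().await {
            // The AI diagnostics stage would call the LLM here, then notify Telegram.
            println!("diagnosing {} ({})", issue.target, issue.kind);
        }
    });

    // Simulate one event from a watch handler, then shut down cleanly.
    let _ = event_tx
        .send(ClusterIssue { kind: "NodeNotReady".into(), target: "worker-1".into() })
        .await;
    drop(event_tx);

    let _ = correlator.await;
    let _ = diagnoser.await;
}
```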
## License

MIT