2025-12-24 02:48:54 +00:00
# Duty AI Ops

An AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.

## Features

- 🤖 **AI-Powered Diagnostics**: Uses an LLM with function calling to analyze Kubernetes issues
- 🔍 **Smart Event Detection**: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restart counts, etc.)
- 🧠 **Correlation Engine**: Detects patterns such as mass pod failures and diagnoses the underlying cause (for example, a node issue)
- 📱 **Telegram Notifications**: Optional real-time alerts
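
To make the event detection feature more concrete, here is a minimal sketch of how a pod could be classified as problematic. It is not taken from the project's source: the `PodStatus` and `PodProblem` types, the `detect_pod_problem` function, and the restart threshold of 5 are all assumptions made for illustration.

```rust
/// Simplified view of a pod's status.
/// Hypothetical type: the field names are illustrative, not the tool's real ones.
#[derive(Debug)]
pub struct PodStatus {
    pub name: String,
    /// Container waiting reason as reported by Kubernetes, e.g. "CrashLoopBackOff".
    pub waiting_reason: Option<String>,
    pub restart_count: u32,
}

/// Problems this sketch knows how to flag.
#[derive(Debug, PartialEq, Eq)]
pub enum PodProblem {
    CrashLoopBackOff,
    HighRestarts(u32),
}

/// Assumed threshold for "high restarts"; the README does not state a value.
pub const RESTART_THRESHOLD: u32 = 5;

/// Returns the first problem found for the pod, or `None` if it looks healthy.
pub fn detect_pod_problem(pod: &PodStatus) -> Option<PodProblem> {
    // A crash loop is checked first because it is the more specific signal.
    if pod.waiting_reason.as_deref() == Some("CrashLoopBackOff") {
        return Some(PodProblem::CrashLoopBackOff);
    }
    if pod.restart_count >= RESTART_THRESHOLD {
        return Some(PodProblem::HighRestarts(pod.restart_count));
    }
    None
}
```

In a real watcher, a function like this would presumably be called for every pod event, with the result forwarded to the correlation step.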

## How It Works

1. **Watches** the Kubernetes cluster for pod and node events
2. **Correlates** issues (e.g., if 5+ pods fail on the same node within 60s, the node is diagnosed)
3. **Analyzes** issues using an LLM with access to Kubernetes API tools (get pod details, logs, node status)
4. **Notifies** via Telegram with structured diagnostic reports
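
The correlation rule in step 2 can be sketched as a sliding window of failure timestamps per node. This is only an illustration under stated assumptions: the `Correlator` type and its `record_failure` method are hypothetical names, and the sketch assumes failures are recorded in non-decreasing time order.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Tracks recent pod failures per node and flags a node once enough
/// failures fall inside a sliding time window.
/// Hypothetical type, not the project's actual correlation engine.
pub struct Correlator {
    threshold: usize,
    window: Duration,
    failures: HashMap<String, VecDeque<Instant>>,
}

impl Correlator {
    pub fn new(threshold: usize, window: Duration) -> Self {
        Self {
            threshold,
            window,
            failures: HashMap::new(),
        }
    }

    /// Records one pod failure on `node` at time `at`.
    /// Returns true when the node has at least `threshold` failures
    /// within `window` of `at`, meaning the node itself should be diagnosed.
    pub fn record_failure(&mut self, node: &str, at: Instant) -> bool {
        let times = self.failures.entry(node.to_string()).or_default();
        times.push_back(at);
        // Drop failures that are older than the window ending at `at`.
        while let Some(&oldest) = times.front() {
            if at.saturating_duration_since(oldest) > self.window {
                times.pop_front();
            } else {
                break;
            }
        }
        times.len() >= self.threshold
    }
}
```

With `Correlator::new(5, Duration::from_secs(60))`, the fifth failure recorded for the same node within a minute returns `true`, while failures on other nodes are counted separately.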

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops

# Build
cargo build --release

# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings
```

## Configuration

```toml
# API Configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1"  # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest"  # any model supporting tools

# Concurrency
max_concurrent_diagnoses = 1  # maximum parallel LLM API requests

# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"

# System prompt (customize AI behavior)
system_prompt = """..."""
```
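
To illustrate what `max_concurrent_diagnoses` controls, the sketch below caps how many diagnosis jobs run at once using a small counting semaphore built from standard-library primitives. The `DiagnosisLimiter` type is a hypothetical name, and the project may well use an async runtime's semaphore instead; this blocking version only demonstrates the idea.

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore that limits concurrent diagnosis jobs.
/// Hypothetical type for illustration only.
pub struct DiagnosisLimiter {
    available: Mutex<usize>,
    freed: Condvar,
}

impl DiagnosisLimiter {
    /// `max_concurrent` plays the role of `max_concurrent_diagnoses`.
    /// A value of 0 would block every job forever.
    pub fn new(max_concurrent: usize) -> Self {
        Self {
            available: Mutex::new(max_concurrent),
            freed: Condvar::new(),
        }
    }

    /// Blocks the calling thread until a slot is free, runs `job`,
    /// then releases the slot and returns the job's result.
    /// Note: if `job` panics, its slot is not released in this sketch.
    pub fn run<T>(&self, job: impl FnOnce() -> T) -> T {
        let mut slots = self.available.lock().unwrap();
        while *slots == 0 {
            // `wait` releases the lock while sleeping and reacquires it on wake-up.
            slots = self.freed.wait(slots).unwrap();
        }
        *slots -= 1;
        drop(slots); // do not hold the lock while the job runs

        let result = job();

        *self.available.lock().unwrap() += 1;
        self.freed.notify_one();
        result
    }
}
```

With the default of `1`, diagnoses run strictly one at a time, which is a reasonable choice for a single local model that cannot serve several requests in parallel.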

## Usage

```bash
# Run with the default kubeconfig
cargo run --release

# Or use a specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release
```

## Requirements

- Rust 1.70+
- Kubernetes cluster access (via kubeconfig or a Kubernetes service account)
- OpenAI-compatible LLM endpoint with function calling support
  - Tested with: Ollama (qwen3-tools, devstral-small-2)
  - Should work with: OpenAI, vLLM, etc.

## Example Output

```
🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet.
12 pods evicted due to insufficient disk space.
```
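
A report with this layout could be produced from a small struct, as in the sketch below. The `DiagnosticReport` type and its field names are assumptions for illustration; only the output format is taken from the example above.

```rust
/// Structured diagnosis ready to be sent as a notification.
/// Hypothetical type: field names are illustrative.
pub struct DiagnosticReport {
    /// Kind of the affected resource, e.g. "Node" or "Pod".
    pub resource_kind: String,
    pub resource_name: String,
    pub problem: String,
    /// Free-form explanation; may span several lines.
    pub root_cause: String,
}

impl DiagnosticReport {
    /// Renders the report in the same layout as the example output.
    pub fn render(&self) -> String {
        format!(
            "🔍 {}: {}\n📋 Problem: {}\n🔎 Root Cause: {}",
            self.resource_kind, self.resource_name, self.problem, self.root_cause
        )
    }
}
```

Keeping the report as structured fields until the last step makes it straightforward to reuse the same data for a different notification channel later.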

## Architecture

- **Event Handlers**: Process pod/node watch events
- **Correlation Engine**: Detects patterns (mass failures, repeated issues)
- **AI Diagnostics**: An LLM with Kubernetes tools analyzes issues and generates reports
- **Telegram Notifier**: Sends and updates notifications

## License

WTFPL