Duty AI Ops

AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.

Features

🤖 AI-Powered Diagnostics: Uses LLM with function calling to analyze Kubernetes issues
🔍 Smart Event Detection: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restarts, etc.)
🧠 Correlation Engine: Detects patterns such as mass pod failures and diagnoses the shared root cause (for example, a failing node) instead of each pod separately
📱 Telegram Notifications: Optional real-time alerts
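
The event detection described above can be pictured with a small sketch. It is not taken from the project source: the `ContainerSnapshot` struct, the `PodProblem` enum, the function names, and the restart threshold of 5 are all assumptions made for illustration. Only the Kubernetes values it checks (the `CrashLoopBackOff` waiting reason, the container restart count, and the node `Ready` condition status) come from the Kubernetes API itself.

```rust
/// Hypothetical, simplified view of one container's status.
/// In a real client these values would come from the pod's container statuses.
pub struct ContainerSnapshot<'a> {
    /// Reason of the `waiting` state, if the container is waiting.
    pub waiting_reason: Option<&'a str>,
    /// Number of restarts reported for the container.
    pub restart_count: u32,
}

/// Problems this sketch can report for a pod.
#[derive(Debug, PartialEq)]
pub enum PodProblem {
    CrashLoopBackOff,
    HighRestarts(u32),
}

/// Assumed threshold for "high restarts"; the real tool may use another value.
pub const RESTART_THRESHOLD: u32 = 5;

/// Returns the first problem found for a container, or `None` if it looks healthy.
pub fn detect_pod_problem(c: &ContainerSnapshot<'_>) -> Option<PodProblem> {
    if c.waiting_reason == Some("CrashLoopBackOff") {
        return Some(PodProblem::CrashLoopBackOff);
    }
    if c.restart_count >= RESTART_THRESHOLD {
        return Some(PodProblem::HighRestarts(c.restart_count));
    }
    None
}

/// A node counts as NotReady unless its `Ready` condition has status "True".
/// Treating a missing condition as NotReady is an assumption of this sketch.
pub fn node_is_not_ready(ready_condition_status: Option<&str>) -> bool {
    ready_condition_status != Some("True")
}
```

Checking the waiting reason before the restart count means a crash-looping container is reported once, under the more specific label.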

How It Works

1. Watches the Kubernetes cluster for pod and node events
2. Correlates issues (e.g., if 5+ pods fail on the same node within 60s, it diagnoses the node instead of each pod)
3. Analyzes each issue with an LLM that can call Kubernetes API tools (get pod details, logs, node status)
4. Notifies via Telegram with a structured diagnostic report
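
The correlation rule in the steps above (5+ pods failing on the same node within 60s) could be implemented as a sliding window per node. The sketch below is one possible way to do it, not the project's actual code: `NodeCorrelator`, `record_failure`, and the choice to count each pod only once are assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical sliding-window correlator (names are illustrative).
pub struct NodeCorrelator {
    /// Number of distinct failing pods that triggers a node diagnosis.
    threshold: usize,
    /// Time window in which failures are counted together.
    window: Duration,
    /// Node name -> recent (failure time, pod name) entries, oldest first.
    failures: HashMap<String, VecDeque<(Instant, String)>>,
}

impl NodeCorrelator {
    pub fn new(threshold: usize, window: Duration) -> Self {
        Self { threshold, window, failures: HashMap::new() }
    }

    /// Records a pod failure observed at `at` and returns `true` when the
    /// node has reached the threshold and should be diagnosed as a whole.
    /// Assumes failures are recorded in chronological order.
    pub fn record_failure(&mut self, node: &str, pod: &str, at: Instant) -> bool {
        let window = self.window;
        let threshold = self.threshold;
        let entries = self.failures.entry(node.to_string()).or_default();

        // Drop failures that have fallen out of the window.
        while entries
            .front()
            .map_or(false, |(t, _)| at.duration_since(*t) > window)
        {
            entries.pop_front();
        }

        // Count each pod once: replace any earlier entry for the same pod.
        entries.retain(|(_, p)| p != pod);
        entries.push_back((at, pod.to_string()));

        entries.len() >= threshold
    }
}
```

With `NodeCorrelator::new(5, Duration::from_secs(60))`, the fifth distinct pod failure on one node inside a minute returns `true`, at which point a caller could run a single node diagnosis rather than five pod diagnoses.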

Installation

# Clone repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops

# Build
cargo build --release

# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings

Configuration

# API Configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1"  # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest" # any model supporting tools

# Concurrency
max_concurrent_diagnoses = 1 # maximum number of diagnoses (LLM API requests) running in parallel

# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"

# System prompt (customize AI behavior)
system_prompt = """..."""
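
As a reading aid for the keys above, the following sketch models them as a Rust struct with a small validation step. It is an assumption about how such a config might be represented, not the project's actual types: the `Config` struct, its `validate` and `telegram_enabled` methods, and the rule that both Telegram keys must be set together are hypothetical. Parsing the TOML file itself is left out.

```rust
/// Hypothetical in-memory form of config.toml (field names mirror the keys).
pub struct Config {
    pub api_base: String,
    pub api_key: String,
    pub model: String,
    pub max_concurrent_diagnoses: usize,
    /// Telegram is optional, so both keys are modeled as `Option`.
    pub telegram_bot_token: Option<String>,
    pub telegram_chat_id: Option<String>,
    pub system_prompt: String,
}

impl Config {
    /// Checks the constraints this sketch assumes; the real tool may differ.
    pub fn validate(&self) -> Result<(), String> {
        if self.max_concurrent_diagnoses == 0 {
            return Err("max_concurrent_diagnoses must be at least 1".to_string());
        }
        match (&self.telegram_bot_token, &self.telegram_chat_id) {
            (Some(_), None) | (None, Some(_)) => Err(
                "telegram_bot_token and telegram_chat_id must be set together".to_string(),
            ),
            _ => Ok(()),
        }
    }

    /// Notifications are enabled only when both Telegram keys are present.
    pub fn telegram_enabled(&self) -> bool {
        self.telegram_bot_token.is_some() && self.telegram_chat_id.is_some()
    }
}
```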

Usage

# Run with the default kubeconfig
cargo run --release

# Or use specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release

Requirements

Rust 1.70+
Kubernetes cluster access (via kubeconfig or a Kubernetes service account)
OpenAI-compatible LLM endpoint with function calling support
- Tested with: Ollama (qwen3-tools, devstral-small-2)
- Should work with: OpenAI, vLLM, etc.

Example Output

🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet. 
   12 pods evicted due to insufficient disk space.

Architecture

Event Handlers: Process pod/node watch events
Correlation Engine: Detect patterns (mass failures, repeated issues)
AI Diagnostics: LLM with k8s tools analyzes and generates reports
Telegram Notifier: Sends/updates notifications

License

WTFPL