README.md

# Duty AI Ops

AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.

## Features

- 🤖 **AI-Powered Diagnostics**: Uses LLM with function calling to analyze Kubernetes issues
- 🔍 **Smart Event Detection**: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restarts, etc.)
- 🧠 **Correlation Engine**: Detects patterns like mass pod failures and diagnoses the root cause (node issues)
- 📱 **Telegram Notifications**: Optional real-time alerts

## How It Works

1. **Watches** Kubernetes cluster for pod/node events
2. **Correlates** issues (e.g., if 5+ pods fail on same node within 60s, diagnose the node)
3. **Analyzes** using AI with access to k8s API tools (get pod details, logs, node status)
4. **Notifies** via Telegram with structured diagnostic reports

## Installation

```bash
# Clone repository
git clone https://github.com/yourusername/duty-ai-ops
cd duty-ai-ops

# Build
cargo build --release

# Copy and configure
cp config.toml.example config.toml
# Edit config.toml with your settings
```

## Configuration

```toml
# API Configuration (OpenAI-compatible)
api_base = "http://localhost:11434/v1"  # Ollama, vLLM, etc.
api_key = "your-key"
model = "qwen3-tools:latest" # any model supporting tools

# Concurrency
max_concurrent_diagnoses = 1 # parallel OpenAI API requests

# Telegram (optional)
telegram_bot_token = "your-bot-token"
telegram_chat_id = "your-chat-id"

# System prompt (customize AI behavior)
system_prompt = """..."""
```

## Usage

```bash
# Run with kubeconfig
cargo run --release

# Or use specific kubeconfig
KUBECONFIG=/path/to/kubeconfig cargo run --release
```

## Requirements

- Rust 1.70+
- Kubernetes cluster access (via kubeconfig or kubernetes service account)
- OpenAI-compatible LLM endpoint with function calling support
  - Tested with: Ollama (qwen3-tools, devstral-small-2)
  - Should work with: OpenAI, vLLM, etc.

## Example Output

```
🔍 Node: worker-1
📋 Problem: Node NotReady due to disk pressure
🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet. 
   12 pods evicted due to insufficient disk space.
```

## Architecture

- **Event Handlers**: Process pod/node watch events
- **Correlation Engine**: Detect patterns (mass failures, repeated issues)
- **AI Diagnostics**: LLM with k8s tools analyzes and generates reports
- **Telegram Notifier**: Sends/updates notifications

## License

WTFPL
Init 2025-12-24 02:48:54 +00:00			`# Duty AI Ops`

			`AI-powered Kubernetes monitoring and diagnostics tool that automatically detects and diagnoses cluster issues.`

			`## Features`

			`- 🤖 AI-Powered Diagnostics: Uses LLM with function calling to analyze Kubernetes issues`
			`- 🔍 Smart Event Detection: Monitors pods and nodes for problems (CrashLoopBackOff, NotReady, high restarts, etc.)`
Update README.md 2025-12-24 02:52:37 +00:00			`- 🧠 Correlation Engine: Detects patterns like mass pod failures and diagnoses the root cause (node issues)`
			`- 📱 Telegram Notifications: Optional real-time alerts`
Init 2025-12-24 02:48:54 +00:00
			`## How It Works`

			`1. Watches Kubernetes cluster for pod/node events`
			`2. Correlates issues (e.g., if 5+ pods fail on same node within 60s, diagnose the node)`
			`3. Analyzes using AI with access to k8s API tools (get pod details, logs, node status)`
			`4. Notifies via Telegram with structured diagnostic reports`

			`## Installation`

			```bash
			`# Clone repository`
			`git clone https://github.com/yourusername/duty-ai-ops`
			`cd duty-ai-ops`

			`# Build`
			`cargo build --release`

			`# Copy and configure`
			`cp config.toml.example config.toml`
			`# Edit config.toml with your settings`
			```

			`## Configuration`

			```toml
			`# API Configuration (OpenAI-compatible)`
			`api_base = "http://localhost:11434/v1" # Ollama, vLLM, etc.`
			`api_key = "your-key"`
Update README.md 2025-12-24 02:52:37 +00:00			`model = "qwen3-tools:latest" # any model supporting tools`
Init 2025-12-24 02:48:54 +00:00
			`# Concurrency`
Update README.md 2025-12-24 02:52:37 +00:00			`max_concurrent_diagnoses = 1 # parallel OpenAI API requests`
Init 2025-12-24 02:48:54 +00:00
			`# Telegram (optional)`
			`telegram_bot_token = "your-bot-token"`
			`telegram_chat_id = "your-chat-id"`

			`# System prompt (customize AI behavior)`
			`system_prompt = """..."""`
			```

			`## Usage`

			```bash
			`# Run with kubeconfig`
			`cargo run --release`

			`# Or use specific kubeconfig`
			`KUBECONFIG=/path/to/kubeconfig cargo run --release`
			```

			`## Requirements`

			`- Rust 1.70+`
Update README.md 2025-12-24 02:52:37 +00:00			`- Kubernetes cluster access (via kubeconfig or kubernetes service account)`
Init 2025-12-24 02:48:54 +00:00			`- OpenAI-compatible LLM endpoint with function calling support`
			`- Tested with: Ollama (qwen3-tools, devstral-small-2)`
			`- Should work with: OpenAI, vLLM, etc.`

			`## Example Output`

			```
			`🔍 Node: worker-1`
			`📋 Problem: Node NotReady due to disk pressure`
			`🔎 Root Cause: Node has exceeded 85% disk usage on /var/lib/kubelet.`
			`12 pods evicted due to insufficient disk space.`
			```

			`## Architecture`

			`- Event Handlers: Process pod/node watch events`
			`- Correlation Engine: Detect patterns (mass failures, repeated issues)`
			`- AI Diagnostics: LLM with k8s tools analyzes and generates reports`
			`- Telegram Notifier: Sends/updates notifications`

			`## License`

Update README.md 2025-12-24 02:52:37 +00:00			`WTFPL`