# Metrics

OpenViking provides a machine-oriented metrics system for exposing runtime health, request quality, model usage, resource processing throughput, and probe health states.

Unlike the human-facing `/api/v1/observer/*` endpoints and the analytics-oriented `/api/v1/stats/*` endpoints, metrics are designed for:

- high-frequency scraping by Prometheus, Grafana Agent, and similar systems
- low-cardinality, aggregatable metric models
- monitoring, alerting, capacity observation, and regression diagnosis

## Overview

### Why Metrics

Metrics are well suited to answer questions like:

- Has HTTP traffic increased abnormally over the last few minutes?
- Are resource ingestion, retrieval, or model calls getting slower?
- Is there queue backlog?
- Are key dependencies such as storage, model providers, VikingDB, encryption, and async systems currently healthy?
- Is a specific tenant showing abnormal traffic or error rates?

Compared with logs and observer snapshots, metrics are better for:

- continuous scraping
- time-series aggregation
- dashboard visualization
- alert rules

### How Metrics Differ from Observer and Stats

| Capability | Best For | Output Format | Typical Usage |
|------------|----------|---------------|---------------|
| `/metrics` | online monitoring, alerting, trend aggregation | Prometheus exposition text | Grafana dashboards, Prometheus scraping |
| `/api/v1/observer/*` | human inspection of component snapshots | JSON / status tables | debugging, health checks |
| `/api/v1/stats/*` | analytics-oriented statistics | JSON | memory health, staleness, session extraction |

The boundary is:

- `/metrics` only carries **low-cardinality, low-cost** metrics
- `/api/v1/stats/*` continues to carry analytics-oriented statistics without being constrained by the Prometheus scraping model

## Metrics Architecture

The current metrics stack in OpenViking has four layers:

```text
Business logic / HTTP requests / background tasks
          │
          ▼
      DataSource
   (event emission / state reads)
          │
          ▼
      Collector
 (semantic routing + labels)
          │
          ▼
    MetricRegistry
   (in-process metric store)
          │
          ▼
      Exporter
 (Prometheus text rendering)
          │
          ▼
       /metrics
```

### DataSource

DataSources provide inputs to the metrics system in two main forms:

- **Event-based**: business code emits events at key points, such as retrieval completion, successful model calls, or resource ingestion stage completion
- **Read-based**: current state is read before `/metrics` export, such as queue state, lock state, or probe state

### Collector

Collectors turn inputs into metric semantics:

- choose which metric to write
- choose which labels to attach
- define how failure is exposed, such as `valid=1/0`

### MetricRegistry

The MetricRegistry is the in-process metric store that keeps the current metric values and serves them to the exporter.

### Exporter

The first exporter implementation is the Prometheus exporter, which renders registry contents into Prometheus exposition text.
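
To make the layering concrete, here is a minimal Python sketch of one event flowing through all four layers. The names used here (`MetricRegistry.inc`, `RetrievalCollector`, `render_prometheus`) are illustrative assumptions, not OpenViking's actual API, and the `account_id` label is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetricRegistry:
    """In-process store: (metric name, sorted label pairs) -> current value."""
    values: dict = field(default_factory=lambda: defaultdict(float))

    def inc(self, name, labels, amount=1.0):
        self.values[(name, tuple(sorted(labels.items())))] += amount


class RetrievalCollector:
    """Hypothetical collector: maps a retrieval event onto metrics and labels."""

    def __init__(self, registry):
        self.registry = registry

    def on_retrieval_completed(self, context_type, result_count):
        labels = {"context_type": context_type}
        self.registry.inc("openviking_retrieval_requests_total", labels)
        self.registry.inc("openviking_retrieval_results_total", labels, result_count)
        if result_count == 0:
            self.registry.inc("openviking_retrieval_zero_result_total", labels)


def render_prometheus(registry):
    """Exporter step: render registry contents as Prometheus sample lines."""
    lines = []
    for (name, labels), value in sorted(registry.values.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value:g}")
    return "\n".join(lines) + "\n"


registry = MetricRegistry()
collector = RetrievalCollector(registry)
# DataSource step: business code emits events at key points.
collector.on_retrieval_completed("resource", 3)
collector.on_retrieval_completed("resource", 0)
output = render_prometheus(registry)
print(output)
```

The point of the split is that business code only reports what happened; metric names, labels, and the output format are decided further down the pipeline.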

## Usage

### Accessing `/metrics`

In the current implementation, `/metrics` is not wired to `get_request_context` or any other auth dependency, so it behaves as an unauthenticated, public scrape endpoint.

```bash
curl http://localhost:1933/metrics
```

If your deployment protects `/metrics` at the gateway, reverse proxy, or service discovery layer, configure the scraper with whatever credentials that layer requires.
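
For ad-hoc checks outside Prometheus, the response can also be read from a script. The following is a minimal sketch of a parser for the common exposition-text cases (comments, optional labels, numeric values); it is not a complete implementation of the format, and the `queue="embedding"` label value in the sample is a made-up example.

```python
import re
from urllib.request import urlopen

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse Prometheus exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments and blanks
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def scrape(url="http://localhost:1933/metrics", timeout=5.0):
    """Fetch and parse /metrics; the default URL matches the curl example."""
    with urlopen(url, timeout=timeout) as response:
        return parse_exposition(response.read().decode("utf-8"))


example = """\
# TYPE openviking_queue_pending gauge
openviking_queue_pending{queue="embedding"} 4
openviking_lock_active 0
"""
parsed = parse_exposition(example)
```

Calling `scrape()` against a running server returns the same tuple shape as `parsed`, which is convenient for quick assertions in smoke tests.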

### Prometheus Scrape Example

```yaml
scrape_configs:
  - job_name: openviking
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:1933"]
```

### Understanding Common Labels

| Label | Meaning | Example |
|-------|---------|---------|
| `account_id` | tenant dimension label | `test-account`, `__unknown__`, `__overflow__` |
| `route` | HTTP route template | `/api/v1/search/find` |
| `method` | HTTP method | `GET`, `POST` |
| `status` | request or stage status | `200`, `ok`, `error` |
| `operation` | structured operation name | `search.find`, `resources.add_resource` |
| `context_type` | retrieval context type | `resource` |
| `provider` | model or external service provider | `volcengine` |
| `model_name` | model name | `doubao-seed-1-8-251228` |
| `stage` | stage label (defined by each metric family) | resource stage: `parse`; token attribution stage: `embed_query` |
| `valid` | whether the current sample is fresh and valid | `1` / `0` |

Notes:

- `account_id` is only enabled on controlled allowlisted metric families to prevent high-cardinality growth
- `valid=0` means the current state/probe sample is a fallback or stale value, not that the label itself is malformed
- `stage` semantics depend on the metric family:
  - `openviking_resource_stage_*`: resource ingestion pipeline stages (for example `parse/persist/process`)
  - `openviking_operation_tokens_total`: token attribution stages (for example `embed_query/rerank/vlm`)

## Key Metric Families

The metric summaries below are based on representative metrics currently exposed by the collectors in `openviking/metrics/collectors/`.

### Requests and Operations

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_http_requests_total` | Counter | `account_id, method, route, status` | total HTTP requests |
| `openviking_http_request_duration_seconds` | Histogram | `account_id, method, route, status` | HTTP latency distribution |
| `openviking_http_inflight_requests` | Gauge | `account_id, route` | current inflight requests (in-process approximation) |
| `openviking_operation_requests_total` | Counter | `account_id, operation, status` | total structured operations |
| `openviking_operation_duration_seconds` | Histogram | `account_id, operation, status` | structured operation duration distribution |

Typical usage:

- inspect whether `/api/v1/search/find` or `/api/v1/resources` is slowing down
- inspect whether a specific `operation` has elevated error rates
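
Histogram families such as `openviking_http_request_duration_seconds` are exposed as cumulative buckets, so "is this route slowing down" usually becomes a percentile estimate. The sketch below shows the linear interpolation that PromQL's `histogram_quantile` is based on, assuming standard `_bucket` series with `le` upper bounds; the bucket bounds and counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls into +Inf: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative counts per `le` bound, in seconds (sample data).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p95 = histogram_quantile(0.95, buckets)
```

With this sample, the 95th percentile lands inside the 0.5 to 1.0 second bucket. The estimate is only as precise as the bucket boundaries, which is why comparing the same percentile over time is more useful than reading it as an exact latency.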

### Retrieval and Resource Processing

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_retrieval_requests_total` | Counter | `account_id, context_type` | retrieval request count |
| `openviking_retrieval_results_total` | Counter | `account_id, context_type` | total retrieved results |
| `openviking_retrieval_latency_seconds` | Histogram | `account_id, context_type` | retrieval latency distribution |
| `openviking_retrieval_zero_result_total` | Counter | `account_id, context_type` | retrieval zero-result count |
| `openviking_retrieval_rerank_used_total` | Counter | `account_id` | number of retrievals that used rerank |
| `openviking_retrieval_rerank_fallback_total` | Counter | `account_id` | retrieval rerank fallback count |
| `openviking_resource_stage_total` | Counter | `account_id, stage, status` | count of resource ingestion stages |
| `openviking_resource_stage_duration_seconds` | Histogram | `account_id, stage, status` | duration distribution of ingestion stages |
| `openviking_resource_wait_duration_seconds` | Histogram | `account_id, operation` | resource ingestion wait duration distribution (for example queue waiting) |

Typical `stage` values include:

- `request`
- `parse`
- `summarize`
- `persist`
- `finalize`
- `process`

### Vector, Memory, and Semantic Metrics

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_vector_searches_total` | Counter | `operation` | vector search count |
| `openviking_vector_scored_total` | Counter | `operation` | total scored candidates |
| `openviking_vector_passed_total` | Counter | `operation` | total passed candidates |
| `openviking_vector_returned_total` | Counter | `operation` | total returned candidates |
| `openviking_vector_scanned_total` | Counter | `operation` | total scanned candidates |
| `openviking_memory_extracted_total` | Counter | `operation` | total extracted memory items |
| `openviking_semantic_nodes_total` | Counter | `status` | total semantic nodes |

### Model Calls and Tokens

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_model_calls_total` | Counter | `model_type, provider, model_name` | unified model call count |
| `openviking_model_tokens_total` | Counter | `model_type, provider, model_name, token_type` | unified model token count |
| `openviking_vlm_calls_total` | Counter | `account_id, provider, model_name` | VLM call count |
| `openviking_vlm_tokens_input_total` | Counter | `account_id, provider, model_name` | VLM input tokens |
| `openviking_vlm_tokens_output_total` | Counter | `account_id, provider, model_name` | VLM output tokens |
| `openviking_vlm_tokens_total` | Counter | `account_id, provider, model_name` | VLM total tokens |
| `openviking_vlm_call_duration_seconds` | Histogram | `account_id, provider, model_name` | VLM call duration distribution |
| `openviking_embedding_requests_total` | Counter | `account_id, status` | embedding request count |
| `openviking_embedding_latency_seconds` | Histogram | `account_id, status` | embedding latency distribution |
| `openviking_embedding_errors_total` | Counter | `account_id, error_code` | embedding error count |
| `openviking_embedding_calls_total` | Counter | `account_id, provider, model_name` | embedding provider call count (per-call) |
| `openviking_embedding_call_duration_seconds` | Histogram | `account_id, provider, model_name` | embedding provider call duration distribution (per-call) |
| `openviking_embedding_tokens_input_total` | Counter | `account_id, provider, model_name` | embedding input tokens (per-call aggregate) |
| `openviking_embedding_tokens_output_total` | Counter | `account_id, provider, model_name` | embedding output tokens (per-call aggregate; may not appear if always 0) |
| `openviking_embedding_tokens_total` | Counter | `account_id, provider, model_name` | embedding total tokens (per-call aggregate) |
| `openviking_rerank_calls_total` | Counter | `account_id, provider, model_name` | rerank provider call count (per-call) |
| `openviking_rerank_call_duration_seconds` | Histogram | `account_id, provider, model_name` | rerank provider call duration distribution (per-call) |
| `openviking_rerank_tokens_input_total` | Counter | `account_id, provider, model_name` | rerank input tokens (per-call aggregate) |
| `openviking_rerank_tokens_output_total` | Counter | `account_id, provider, model_name` | rerank output tokens (per-call aggregate; may not appear if always 0) |
| `openviking_rerank_tokens_total` | Counter | `account_id, provider, model_name` | rerank total tokens (per-call aggregate) |
| `openviking_operation_tokens_total` | Counter | `account_id, operation, stage, token_type` | operation token aggregation (token attribution stages) |

Notes:

- `openviking_model_*` gives a unified cross-model view for embedding and VLM usage
- `openviking_vlm_*`, `openviking_embedding_*`, and `openviking_rerank_*` are better suited for workload-specific dashboards

### Queues, Locks, and Runtime State

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_queue_processed_total` | Counter | `queue` | total processed items per queue |
| `openviking_queue_errors_total` | Counter | `queue` | total error count per queue |
| `openviking_queue_pending` | Gauge | `queue` | pending queue items |
| `openviking_queue_in_progress` | Gauge | `queue` | in-progress queue items |
| `openviking_lock_active` | Gauge | none | current active locks |
| `openviking_lock_waiting` | Gauge | none | locks currently waiting |
| `openviking_lock_stale` | Gauge | none | potentially stale locks |

These help answer:

- Is there queue backlog?
- Is there lock contention or stale locking?

### Tasks and Task Tracker

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_task_pending` | Gauge | `task_type` | pending tasks tracked by task tracker |
| `openviking_task_running` | Gauge | `task_type` | running tasks tracked by task tracker |
| `openviking_task_completed` | Gauge | `task_type` | completed tasks tracked by task tracker |
| `openviking_task_failed` | Gauge | `task_type` | failed tasks tracked by task tracker |

### Cache

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_cache_hits_total` | Counter | `level` | cache hit count |
| `openviking_cache_misses_total` | Counter | `level` | cache miss count |

### Session

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_session_lifecycle_total` | Counter | `account_id, action, status` | session lifecycle event count |
| `openviking_session_contexts_used_total` | Counter | `account_id, action` | total number of contexts used by sessions |
| `openviking_session_archive_total` | Counter | `account_id, status` | session archive count |

### Probes and Health State

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_service_readiness` | Gauge | may include `valid` | main service readiness |
| `openviking_api_key_manager_readiness` | Gauge | may include `valid` | API key manager readiness |
| `openviking_storage_readiness` | Gauge | `probe, valid` | storage probe, for example `agfs` |
| `openviking_model_provider_readiness` | Gauge | `provider, valid` | model provider readiness |
| `openviking_async_system_readiness` | Gauge | `probe, valid` | async system readiness |
| `openviking_retrieval_backend_readiness` | Gauge | `probe, valid` | retrieval backend readiness |
| `openviking_encryption_component_health` | Gauge | `valid` | overall encryption component health |
| `openviking_encryption_root_key_ready` | Gauge | `valid` | whether the root key is ready |
| `openviking_encryption_kms_provider_ready` | Gauge | `provider, valid` | KMS provider readiness |

Meaning of `valid`:

- `valid="1"`: the sample was produced by a successful refresh
- `valid="0"`: the sample is a fallback or stale value and should be treated with caution
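
When consuming these gauges from a script or alert pipeline, stale samples are best kept apart from genuine "not ready" readings. The sketch below assumes a gauge value of `1` means ready and `0` means not ready, and treats a missing `valid` label as valid; both are assumptions for illustration, not documented behavior.

```python
def classify_probe(labels, value):
    """Classify one readiness sample as 'stale', 'ready', or 'not_ready'."""
    if labels.get("valid", "1") == "0":
        # Fallback or stale value: do not trust the reading either way.
        return "stale"
    return "ready" if value >= 1 else "not_ready"


# (metric name, labels, value) tuples; the values are sample data.
samples = [
    ("openviking_storage_readiness", {"probe": "agfs", "valid": "1"}, 1.0),
    ("openviking_model_provider_readiness", {"provider": "volcengine", "valid": "0"}, 1.0),
    ("openviking_service_readiness", {}, 0.0),
]
states = {name: classify_probe(labels, value) for name, labels, value in samples}
```

Routing `stale` to a separate, lower-severity alert avoids paging on a probe that simply failed to refresh.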

### Encryption (Operational Metrics)

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_encryption_operations_total` | Counter | `account_id, operation, status` | encrypt/decrypt operation count |
| `openviking_encryption_duration_seconds` | Histogram | `account_id, operation, status` | encrypt/decrypt duration distribution |
| `openviking_encryption_bytes_total` | Counter | `account_id, operation` | encrypt/decrypt processed bytes total |
| `openviking_encryption_payload_size_bytes` | Histogram | `account_id, operation` | encrypt/decrypt payload size distribution |
| `openviking_encryption_auth_failed_total` | Counter | `account_id, status` | auth-failed count |
| `openviking_encryption_key_derivation_total` | Counter | `account_id, status` | key derivation count |
| `openviking_encryption_key_derivation_duration_seconds` | Histogram | `account_id, status` | key derivation duration distribution |
| `openviking_encryption_key_load_duration_seconds` | Histogram | `account_id, status, provider` | key load duration distribution |
| `openviking_encryption_key_cache_hits_total` | Counter | `account_id, provider` | key cache hit count |
| `openviking_encryption_key_cache_misses_total` | Counter | `account_id, provider` | key cache miss count |
| `openviking_encryption_key_version_usage_total` | Counter | `account_id, key_version` | key version usage count |

### Component and Observer Aggregate Metrics

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_component_health` | Gauge | `component, valid` | component health state |
| `openviking_component_errors` | Gauge | `component, valid` | component error state |
| `openviking_observer_components_total` | Gauge | `valid` | number of observed components |
| `openviking_observer_components_unhealthy` | Gauge | `valid` | number of unhealthy components |
| `openviking_observer_components_with_errors` | Gauge | `valid` | number of components with errors |

Typical `component` values include:

- `queue`
- `models`
- `lock`
- `retrieval`
- `vikingdb`

### VikingDB and Model Usage Statistics

| Metric Family | Type | Common Labels | Meaning |
|---------------|------|---------------|---------|
| `openviking_vikingdb_collection_health` | Gauge | `collection, valid` | collection health |
| `openviking_vikingdb_collection_vectors` | Gauge | `collection, valid` | current vector count per collection |
| `openviking_model_usage_available` | Gauge | `model_type, valid` | whether model usage statistics are currently available |

Possible `model_type` values include:

- `vlm`
- `embedding`
- `rerank`

## Configuration Example

### Enabling Metrics

In `ov.conf`, the metrics subsystem can be explicitly enabled through `server.observability.metrics`:

```json
{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true,
        "account_dimension": {
          "enabled": true,
          "max_active_accounts": 100,
          "metric_allowlist": [
            "openviking_http_requests_total",
            "openviking_http_request_duration_seconds",
            "openviking_http_inflight_requests",
            "openviking_operation_requests_total",
            "openviking_operation_duration_seconds",
            "openviking_vlm_calls_total",
            "openviking_vlm_call_duration_seconds",
            "openviking_rerank_*"
          ]
        }
      }
    }
  }
}
```

Recommended mental model:

- `server.observability.metrics.enabled`: master switch for the metrics subsystem
- `server.observability.metrics.account_dimension`: controls whether `account_id` labels are enabled and where they are allowed

### Exporters

By default, OpenViking exports metrics in the Prometheus exposition format at `/metrics`.
You can also enable additional exporters under `server.observability.metrics.exporters`.

Key fields:

- `server.observability.metrics.exporters.prometheus.enabled`: enable the Prometheus exporter (serves `/metrics`)
- `server.observability.metrics.exporters.otel.enabled`: enable OTLP export from the same in-process registry
- `server.observability.metrics.exporters.otel.protocol`: `"grpc"` or `"http"`
- `server.observability.metrics.exporters.otel.tls.insecure`: OTLP/gRPC only; `true` means plaintext (no TLS)
- `server.observability.metrics.exporters.otel.endpoint`: OTLP endpoint (for gRPC, use `host:4317`; for HTTP, use a full URL)
- `server.observability.metrics.exporters.otel.service_name`: OTLP `service.name` resource attribute (default `"openviking-server"`)
- `server.observability.metrics.exporters.otel.export_interval_ms`: OTLP push interval in milliseconds (default `10000`)

Example:

```json
{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true,
        "exporters": {
          "prometheus": {
            "enabled": true
          },
          "otel": {
            "enabled": true,
            "protocol": "grpc",
            "tls": {
              "insecure": true
            },
            "endpoint": "otel-collector:4317",
            "service_name": "openviking-server",
            "export_interval_ms": 10000
          }
        }
      }
    }
  }
}
```
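
The protocol, endpoint, and TLS fields constrain each other, so a small pre-deployment check can catch mismatches early. This is a hypothetical lint sketch built only from the field descriptions above; it is not OpenViking's own config validation, and treating a missing `otel.enabled` as disabled is an assumption.

```python
from urllib.parse import urlparse


def check_otel_exporter(config):
    """Return a list of problems found in the otel exporter settings."""
    otel = (
        config.get("server", {})
        .get("observability", {})
        .get("metrics", {})
        .get("exporters", {})
        .get("otel", {})
    )
    if not otel.get("enabled", False):  # assumption: absent means disabled
        return []
    problems = []
    protocol = otel.get("protocol")
    endpoint = otel.get("endpoint", "")
    if protocol not in ("grpc", "http"):
        problems.append('protocol must be "grpc" or "http"')
    elif protocol == "grpc":
        if "://" in endpoint or ":" not in endpoint:
            problems.append("grpc expects a host:port endpoint such as host:4317")
    else:
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("http expects a full URL endpoint")
        if "insecure" in otel.get("tls", {}):
            problems.append("tls.insecure only applies to OTLP/gRPC")
    return problems


good = {"server": {"observability": {"metrics": {"exporters": {"otel": {
    "enabled": True, "protocol": "grpc", "tls": {"insecure": True},
    "endpoint": "otel-collector:4317"}}}}}}
bad = {"server": {"observability": {"metrics": {"exporters": {"otel": {
    "enabled": True, "protocol": "http", "tls": {"insecure": True},
    "endpoint": "otel-collector:4317"}}}}}}
good_problems = check_otel_exporter(good)
bad_problems = check_otel_exporter(bad)
```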

### Recommended `account_id` Usage

- the account dimension is enabled by default, but only allowlisted metric families receive real tenant IDs (with an empty allowlist, `account_id` is still reported as `__unknown__`)
- do not turn `user_id`, `session_id`, or `resource_uri` into labels
- only enable tenant dimensions on a small set of critical dashboard and alert metrics
- `metric_allowlist` supports a limited wildcard syntax: a trailing `*` matches by prefix (e.g. `openviking_rerank_*`, `openviking_embedding_*`)
- a standalone `*` and full glob or regex patterns are not supported
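
The allowlist and cardinality rules above can be sketched as follows. The matching logic follows the wildcard rules as stated; the class name `AccountLabelResolver`, its methods, and the exact conditions that produce `__unknown__` and `__overflow__` are assumptions for illustration rather than the actual implementation.

```python
def allowlist_matches(metric_name, allowlist):
    """Exact match, or prefix match for patterns with a trailing '*'."""
    for pattern in allowlist:
        if pattern == "*":
            continue  # a standalone wildcard is not supported
        if pattern.endswith("*"):
            if metric_name.startswith(pattern[:-1]):
                return True
        elif metric_name == pattern:
            return True
    return False


class AccountLabelResolver:
    """Hypothetical helper that bounds the number of distinct account_id values."""

    def __init__(self, allowlist, max_active_accounts):
        self.allowlist = allowlist
        self.max_active_accounts = max_active_accounts
        self.active = set()

    def resolve(self, metric_name, account_id):
        if not account_id or not allowlist_matches(metric_name, self.allowlist):
            return "__unknown__"
        if account_id in self.active:
            return account_id
        if len(self.active) >= self.max_active_accounts:
            return "__overflow__"  # assumed meaning: tenant budget exhausted
        self.active.add(account_id)
        return account_id


resolver = AccountLabelResolver(
    ["openviking_http_requests_total", "openviking_rerank_*"],
    max_active_accounts=2,
)
first = resolver.resolve("openviking_rerank_calls_total", "acct-a")
second = resolver.resolve("openviking_http_requests_total", "acct-b")
third = resolver.resolve("openviking_rerank_calls_total", "acct-c")
other = resolver.resolve("openviking_vlm_calls_total", "acct-a")
```

Capping the active set keeps the number of time series per allowlisted family bounded no matter how many tenants send traffic.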

## Related Documentation

- [Architecture Overview](./01-architecture.md) - overall OpenViking architecture
- [Multi-Tenant](./11-multi-tenant.md) - `account/user/agent` isolation model
- [Data Encryption](./10-encryption.md) - storage-layer encryption and isolation
- [Metrics API](../api/09-metrics.md) - `/metrics` endpoint usage
- [Metrics Design](../../design/metric-design.md) - metrics system design details