Metrics
OpenViking provides a machine-oriented metrics system for exposing runtime health, request quality, model usage, resource processing throughput, and probe health states.
Unlike the human-facing /api/v1/observer/* endpoints and the analytics-oriented /api/v1/stats/* endpoints, Metrics are designed for:
- high-frequency scraping by Prometheus, Grafana Agent, and similar systems
- low-cardinality, aggregatable metric models
- monitoring, alerting, capacity observation, and regression diagnosis
Overview
Why Metrics
Metrics are well suited to answer questions like:
- Has HTTP traffic increased abnormally over the last few minutes?
- Are resource ingestion, retrieval, or model calls getting slower?
- Is there queue backlog?
- Are key dependencies such as storage, model providers, VikingDB, encryption, and async systems currently healthy?
- Is a specific tenant showing abnormal traffic or error rates?
Compared with logs and observer snapshots, metrics are better for:
- continuous scraping
- time-series aggregation
- dashboard visualization
- alert rules
How Metrics Differ from Observer and Stats
| Capability | Best For | Output Format | Typical Usage |
|---|---|---|---|
| `/metrics` | online monitoring, alerting, trend aggregation | Prometheus exposition text | Grafana dashboards, Prometheus scraping |
| `/api/v1/observer/*` | human inspection of component snapshots | JSON / status tables | debugging, health checks |
| `/api/v1/stats/*` | analytics-oriented statistics | JSON | memory health, staleness, session extraction |
The boundary is:
- `/metrics` only carries low-cardinality, low-cost metrics
- `/api/v1/stats/*` continues to carry analytics-oriented statistics without being constrained by the Prometheus scraping model
Metrics Architecture
The current metrics stack in OpenViking has four layers:
Business logic / HTTP requests / background tasks
│
▼
DataSource
(event emission / state reads)
│
▼
Collector
(semantic routing + labels)
│
▼
MetricRegistry
(in-process metric store)
│
▼
Exporter
(Prometheus text rendering)
│
▼
/metrics
DataSource
DataSources provide inputs to the metrics system in two main forms:
- Event-based: business code emits events at key points, such as retrieval completion, successful model calls, or resource ingestion stage completion
- Read-based: current state is read before `/metrics` export, such as queue state, lock state, or probe state
Collector
Collectors turn inputs into metric semantics:
- choose which metric to write
- choose which labels to attach
- define how failure is exposed, such as `valid=1/0`
MetricRegistry
The MetricRegistry is the in-process metric store that keeps the current metric values and serves them to the exporter.
Exporter
The first exporter implementation is the Prometheus exporter, which renders registry contents into Prometheus exposition text.
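The four layers above can be illustrated with a minimal sketch. The class and function names here (`MetricRegistry`, `RetrievalCollector`, `render_prometheus`) are hypothetical and are not taken from the OpenViking code base; the sketch only handles counters and omits the `account_id` label for brevity.

```python
from collections import defaultdict


class MetricRegistry:
    """Hypothetical in-process store: metric name -> {label tuple -> value}."""

    def __init__(self):
        self._counters = defaultdict(lambda: defaultdict(float))

    def inc(self, name, labels, amount=1.0):
        # Sort label pairs so the same label set always maps to one series.
        key = tuple(sorted(labels.items()))
        self._counters[name][key] += amount

    def snapshot(self):
        return {name: dict(series) for name, series in self._counters.items()}


class RetrievalCollector:
    """Hypothetical collector: decides which metrics and labels an event writes."""

    def __init__(self, registry):
        self._registry = registry

    def on_retrieval_completed(self, context_type, result_count):
        labels = {"context_type": context_type}
        self._registry.inc("openviking_retrieval_requests_total", labels)
        self._registry.inc("openviking_retrieval_results_total", labels, result_count)
        if result_count == 0:
            self._registry.inc("openviking_retrieval_zero_result_total", labels)


def render_prometheus(registry):
    """Render counters from the registry as Prometheus exposition text."""
    lines = []
    for name, series in sorted(registry.snapshot().items()):
        lines.append(f"# TYPE {name} counter")
        for key, value in sorted(series.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{name}{{{label_text}}} {value:g}")
    return "\n".join(lines) + "\n"
```

In this sketch the business code plays the DataSource role by calling `on_retrieval_completed(...)` at the point where a retrieval finishes, and the exporter only reads from the registry, so rendering never touches business logic.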
Usage
Accessing /metrics
In the current implementation, /metrics is not wired to get_request_context or other auth dependencies, so from the code-path perspective it currently behaves as a public scrape endpoint.
curl http://localhost:1933/metrics
If your deployment protects `/metrics` at the gateway, reverse proxy, or service discovery layer, attach auth according to the deployment environment.
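For quick checks outside Prometheus, the scrape output can also be read programmatically. The following is a simplified sketch, not a full exposition-format parser: it ignores comment lines, does not handle escaped quotes inside label values, and drops timestamps. The helper names `parse_samples` and `scrape` are illustrative, and the default URL assumes the unauthenticated local endpoint shown in the `curl` example.

```python
import re
from urllib.request import urlopen

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def scrape(url="http://localhost:1933/metrics", timeout=5.0):
    """Fetch and parse one scrape; the URL is an assumption about the deployment."""
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))
```

If the endpoint is protected upstream, the request would need whatever credentials the gateway or proxy expects, which this sketch does not attempt to model.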
Prometheus Scrape Example
scrape_configs:
  - job_name: openviking
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:1933"]
Understanding Common Labels
| Label | Meaning | Example |
|---|---|---|
| `account_id` | tenant dimension label | `test-account`, `__unknown__`, `__overflow__` |
| `route` | HTTP route template | `/api/v1/search/find` |
| `method` | HTTP method | `GET`, `POST` |
| `status` | request or stage status | `200`, `ok`, `error` |
| `operation` | structured operation name | `search.find`, `resources.add_resource` |
| `context_type` | retrieval context type | `resource` |
| `provider` | model or external service provider | `volcengine` |
| `model_name` | model name | `doubao-seed-1-8-251228` |
| `stage` | stage label (defined by each metric family) | resource stage: `parse`; token attribution stage: `embed_query` |
| `valid` | whether the current sample is fresh and valid | `1` / `0` |
Notes:
- `account_id` is only enabled on controlled allowlisted metric families to prevent high-cardinality growth
- `valid=0` means the current state/probe sample is a fallback or stale value, not that the label itself is malformed
- `stage` semantics depend on the metric family:
  - `openviking_resource_stage_*`: resource ingestion pipeline stages (for example `parse` / `persist` / `process`)
  - `openviking_operation_tokens_total`: token attribution stages (for example `embed_query` / `rerank` / `vlm`)
Key Metric Families
The metric summaries below are based on representative metrics currently exposed by the collectors in openviking/metrics/collectors/.
Requests and Operations
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_http_requests_total` | Counter | account_id, method, route, status | total HTTP requests |
| `openviking_http_request_duration_seconds` | Histogram | account_id, method, route, status | HTTP latency distribution |
| `openviking_http_inflight_requests` | Gauge | account_id, route | current inflight requests (in-process approximation) |
| `openviking_operation_requests_total` | Counter | account_id, operation, status | total structured operations |
| `openviking_operation_duration_seconds` | Histogram | account_id, operation, status | structured operation duration distribution |
Typical usage:
- inspect whether `/api/v1/search/find` or `/api/v1/resources` is slowing down
- inspect whether a specific `operation` has elevated error rates
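Answering "is this route slowing down" from a histogram usually means estimating a latency quantile from its cumulative `_bucket` series. In practice this is normally done on the monitoring side (for example with PromQL's `histogram_quantile`), but the idea can be sketched in a few lines. The function below is an illustrative approximation using linear interpolation inside the matching bucket; the bucket boundaries in the usage note are made up and are not OpenViking's actual bucket layout.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: iterable of (upper_bound, cumulative_count) pairs, where the
    last upper bound is math.inf, mirroring Prometheus `le` buckets.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered:
        raise ValueError("at least one bucket is required")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in ordered:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Cannot interpolate: fall back to the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, with hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the p95 rank falls halfway through the `0.5` to `1.0` bucket, so the estimate lands midway between those bounds. Comparing such an estimate across time windows for a single `route` is one way to spot a latency regression.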
Retrieval and Resource Processing
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_retrieval_requests_total` | Counter | account_id, context_type | retrieval request count |
| `openviking_retrieval_results_total` | Counter | account_id, context_type | total retrieved results |
| `openviking_retrieval_latency_seconds` | Histogram | account_id, context_type | retrieval latency distribution |
| `openviking_retrieval_zero_result_total` | Counter | account_id, context_type | retrieval zero-result count |
| `openviking_retrieval_rerank_used_total` | Counter | account_id | number of retrievals that used rerank |
| `openviking_retrieval_rerank_fallback_total` | Counter | account_id | retrieval rerank fallback count |
| `openviking_resource_stage_total` | Counter | account_id, stage, status | count of resource ingestion stages |
| `openviking_resource_stage_duration_seconds` | Histogram | account_id, stage, status | duration distribution of ingestion stages |
| `openviking_resource_wait_duration_seconds` | Histogram | account_id, operation | resource ingestion wait duration distribution (for example queue waiting) |
Typical stage values include:
- `request`
- `parse`
- `summarize`
- `persist`
- `finalize`
- `process`
Vector, Memory, and Semantic Metrics
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_vector_searches_total` | Counter | operation | vector search count |
| `openviking_vector_scored_total` | Counter | operation | total scored candidates |
| `openviking_vector_passed_total` | Counter | operation | total passed candidates |
| `openviking_vector_returned_total` | Counter | operation | total returned candidates |
| `openviking_vector_scanned_total` | Counter | operation | total scanned candidates |
| `openviking_memory_extracted_total` | Counter | operation | total extracted memory items |
| `openviking_semantic_nodes_total` | Counter | status | total semantic nodes |
Model Calls and Tokens
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_model_calls_total` | Counter | model_type, provider, model_name | unified model call count |
| `openviking_model_tokens_total` | Counter | model_type, provider, model_name, token_type | unified model token count |
| `openviking_vlm_calls_total` | Counter | account_id, provider, model_name | VLM call count |
| `openviking_vlm_tokens_input_total` | Counter | account_id, provider, model_name | VLM input tokens |
| `openviking_vlm_tokens_output_total` | Counter | account_id, provider, model_name | VLM output tokens |
| `openviking_vlm_tokens_total` | Counter | account_id, provider, model_name | VLM total tokens |
| `openviking_vlm_call_duration_seconds` | Histogram | account_id, provider, model_name | VLM call duration distribution |
| `openviking_embedding_requests_total` | Counter | account_id, status | embedding request count |
| `openviking_embedding_latency_seconds` | Histogram | account_id, status | embedding latency distribution |
| `openviking_embedding_errors_total` | Counter | account_id, error_code | embedding error count |
| `openviking_embedding_calls_total` | Counter | account_id, provider, model_name | embedding provider call count (per-call) |
| `openviking_embedding_call_duration_seconds` | Histogram | account_id, provider, model_name | embedding provider call duration distribution (per-call) |
| `openviking_embedding_tokens_input_total` | Counter | account_id, provider, model_name | embedding input tokens (per-call aggregate) |
| `openviking_embedding_tokens_output_total` | Counter | account_id, provider, model_name | embedding output tokens (per-call aggregate; may not appear if always 0) |
| `openviking_embedding_tokens_total` | Counter | account_id, provider, model_name | embedding total tokens (per-call aggregate) |
| `openviking_rerank_calls_total` | Counter | account_id, provider, model_name | rerank provider call count (per-call) |
| `openviking_rerank_call_duration_seconds` | Histogram | account_id, provider, model_name | rerank provider call duration distribution (per-call) |
| `openviking_rerank_tokens_input_total` | Counter | account_id, provider, model_name | rerank input tokens (per-call aggregate) |
| `openviking_rerank_tokens_output_total` | Counter | account_id, provider, model_name | rerank output tokens (per-call aggregate; may not appear if always 0) |
| `openviking_rerank_tokens_total` | Counter | account_id, provider, model_name | rerank total tokens (per-call aggregate) |
| `openviking_operation_tokens_total` | Counter | account_id, operation, stage, token_type | operation token aggregation (token attribution stages) |
Notes:
- `openviking_model_*` gives a unified cross-model view for embedding and VLM usage
- `openviking_vlm_*` and `openviking_embedding_*` are better suited for workload-specific dashboards
Queues, Locks, and Runtime State
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_queue_processed_total` | Counter | queue | total processed items per queue |
| `openviking_queue_errors_total` | Counter | queue | total error count per queue |
| `openviking_queue_pending` | Gauge | queue | pending queue items |
| `openviking_queue_in_progress` | Gauge | queue | in-progress queue items |
| `openviking_lock_active` | Gauge | none | current active locks |
| `openviking_lock_waiting` | Gauge | none | locks currently waiting |
| `openviking_lock_stale` | Gauge | none | potentially stale locks |
These help answer:
- Is there queue backlog?
- Is there lock contention or stale locking?
Tasks and Task Tracker
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_task_pending` | Gauge | task_type | pending tasks tracked by task tracker |
| `openviking_task_running` | Gauge | task_type | running tasks tracked by task tracker |
| `openviking_task_completed` | Gauge | task_type | completed tasks tracked by task tracker |
| `openviking_task_failed` | Gauge | task_type | failed tasks tracked by task tracker |
Cache
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_cache_hits_total` | Counter | level | cache hit count |
| `openviking_cache_misses_total` | Counter | level | cache miss count |
Session
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_session_lifecycle_total` | Counter | account_id, action, status | session lifecycle event count |
| `openviking_session_contexts_used_total` | Counter | account_id, action | session contexts used total |
| `openviking_session_archive_total` | Counter | account_id, status | session archive count |
Probes and Health State
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_service_readiness` | Gauge | may include valid | main service readiness |
| `openviking_api_key_manager_readiness` | Gauge | may include valid | API key manager readiness |
| `openviking_storage_readiness` | Gauge | probe, valid | storage probe, for example agfs |
| `openviking_model_provider_readiness` | Gauge | provider, valid | model provider readiness |
| `openviking_async_system_readiness` | Gauge | probe, valid | async system readiness |
| `openviking_retrieval_backend_readiness` | Gauge | probe, valid | retrieval backend readiness |
| `openviking_encryption_component_health` | Gauge | valid | overall encryption component health |
| `openviking_encryption_root_key_ready` | Gauge | valid | whether the root key is ready |
| `openviking_encryption_kms_provider_ready` | Gauge | provider, valid | KMS provider readiness |
Meaning of valid:
- `valid="1"`: the sample was produced by a successful refresh
- `valid="0"`: the sample is a fallback or stale value and should be treated with caution
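One way to picture the `valid` semantics is a read-based gauge that keeps its last successful reading and reports it with `valid="0"` when a refresh fails. The `ProbeGauge` class below is a hypothetical sketch written for this explanation; how OpenViking actually caches probe state, and which fallback value it uses before the first successful refresh, is not described here and is assumed.

```python
class ProbeGauge:
    """Hypothetical read-based gauge that flags stale samples via `valid`."""

    def __init__(self, name, probe, default=0.0):
        self._name = name
        self._probe = probe          # callable returning the current state
        self._last_value = default   # assumed fallback before any success

    def sample(self):
        """Return (value, valid) where valid is "1" for a fresh reading."""
        try:
            value = float(self._probe())
        except Exception:
            # Refresh failed: serve the last known value, marked as stale.
            return self._last_value, "0"
        self._last_value = value
        return value, "1"

    def render(self):
        value, valid = self.sample()
        return f'{self._name}{{valid="{valid}"}} {value:g}'
```

With this shape, an alert can distinguish "the component reports unhealthy" (value `0`, `valid="1"`) from "the probe itself could not refresh" (`valid="0"`), which are different operational problems.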
Encryption (Operational Metrics)
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_encryption_operations_total` | Counter | account_id, operation, status | encrypt/decrypt operation count |
| `openviking_encryption_duration_seconds` | Histogram | account_id, operation, status | encrypt/decrypt duration distribution |
| `openviking_encryption_bytes_total` | Counter | account_id, operation | encrypt/decrypt processed bytes total |
| `openviking_encryption_payload_size_bytes` | Histogram | account_id, operation | encrypt/decrypt payload size distribution |
| `openviking_encryption_auth_failed_total` | Counter | account_id, status | auth-failed count |
| `openviking_encryption_key_derivation_total` | Counter | account_id, status | key derivation count |
| `openviking_encryption_key_derivation_duration_seconds` | Histogram | account_id, status | key derivation duration distribution |
| `openviking_encryption_key_load_duration_seconds` | Histogram | account_id, status, provider | key load duration distribution |
| `openviking_encryption_key_cache_hits_total` | Counter | account_id, provider | key cache hit count |
| `openviking_encryption_key_cache_misses_total` | Counter | account_id, provider | key cache miss count |
| `openviking_encryption_key_version_usage_total` | Counter | account_id, key_version | key version usage count |
Component and Observer Aggregate Metrics
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_component_health` | Gauge | component, valid | component health state |
| `openviking_component_errors` | Gauge | component, valid | component error state |
| `openviking_observer_components_total` | Gauge | valid | number of observed components |
| `openviking_observer_components_unhealthy` | Gauge | valid | number of unhealthy components |
| `openviking_observer_components_with_errors` | Gauge | valid | number of components with errors |
Typical component values include:
- `queue`
- `models`
- `lock`
- `retrieval`
- `vikingdb`
VikingDB and Model Usage Statistics
| Metric Family | Type | Common Labels | Meaning |
|---|---|---|---|
| `openviking_vikingdb_collection_health` | Gauge | collection, valid | collection health |
| `openviking_vikingdb_collection_vectors` | Gauge | collection, valid | current vector count per collection |
| `openviking_model_usage_available` | Gauge | model_type, valid | whether model usage statistics are currently available |
Possible model_type values include:
- `vlm`
- `embedding`
- `rerank`
Configuration Example
Enabling Metrics
In ov.conf, the metrics subsystem can be explicitly enabled through server.observability.metrics:
{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true,
        "account_dimension": {
          "enabled": true,
          "max_active_accounts": 100,
          "metric_allowlist": [
            "openviking_http_requests_total",
            "openviking_http_request_duration_seconds",
            "openviking_http_inflight_requests",
            "openviking_operation_requests_total",
            "openviking_operation_duration_seconds",
            "openviking_vlm_calls_total",
            "openviking_vlm_call_duration_seconds",
            "openviking_rerank_*"
          ]
        }
      }
    }
  }
}
Recommended mental model:
- `server.observability.metrics.enabled`: master switch for the metrics subsystem
- `server.observability.metrics.account_dimension`: controls whether `account_id` labels are enabled and where they are allowed
Exporters
By default, OpenViking exports metrics via Prometheus exposition format at /metrics. You can also enable additional exporters under server.observability.metrics.exporters.
Key fields:
- `server.observability.metrics.exporters.prometheus.enabled`: enable the Prometheus exporter (serves `/metrics`)
- `server.observability.metrics.exporters.otel.enabled`: enable OTLP export from the same in-process registry
- `server.observability.metrics.exporters.otel.protocol`: `"grpc"` or `"http"`
- `server.observability.metrics.exporters.otel.tls.insecure`: OTLP/gRPC only; `true` means plaintext (no TLS)
- `server.observability.metrics.exporters.otel.endpoint`: OTLP endpoint (for gRPC, use `host:4317`; for HTTP, use a full URL)
- `server.observability.metrics.exporters.otel.service_name`: OTLP `service.name` resource attribute (default `"openviking-server"`)
- `server.observability.metrics.exporters.otel.export_interval_ms`: OTLP push interval in milliseconds (default `10000`)
Example:
{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true,
        "exporters": {
          "prometheus": {
            "enabled": true
          },
          "otel": {
            "enabled": true,
            "protocol": "grpc",
            "tls": {
              "insecure": true
            },
            "endpoint": "otel-collector:4317",
            "service_name": "openviking-server",
            "export_interval_ms": 10000
          }
        }
      }
    }
  }
}
Recommended account_id Usage
- enabled by default, but only allowlisted metric families will receive tenant ids (empty allowlist still yields `__unknown__`)
- do not turn `user_id`, `session_id`, or `resource_uri` into labels
- only enable tenant dimensions on a small set of critical dashboard and alert metrics
- `metric_allowlist` supports a limited wildcard syntax: only trailing `*` prefix matches (e.g. `openviking_rerank_*`, `openviking_embedding_*`)
- a standalone `*` is not supported, nor are full glob or regex patterns
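The allowlist rules above can be sketched as follows. This is an illustration, not the OpenViking implementation: `matches_allowlist` and `AccountLabeler` are hypothetical names, and two behaviours are assumptions made for the example, namely that non-allowlisted families fall back to `__unknown__` and that `__overflow__` is used once `max_active_accounts` distinct tenants have been seen. The sketch also never evicts accounts, which a real implementation may handle differently.

```python
UNKNOWN = "__unknown__"
OVERFLOW = "__overflow__"


def matches_allowlist(metric_name, allowlist):
    """Exact match, or prefix match for patterns ending in a single '*'."""
    for pattern in allowlist:
        if pattern == "*":
            continue  # a standalone wildcard is not supported
        if pattern.endswith("*"):
            prefix = pattern[:-1]
            # Reject patterns with any other '*' (no full glob support).
            if "*" not in prefix and metric_name.startswith(prefix):
                return True
        elif pattern == metric_name:
            return True
    return False


class AccountLabeler:
    """Hypothetical helper that bounds account_id label cardinality."""

    def __init__(self, allowlist, max_active_accounts):
        self._allowlist = list(allowlist)
        self._max = max_active_accounts
        self._active = set()

    def label_for(self, metric_name, account_id):
        # Assumption: families outside the allowlist report __unknown__.
        if not account_id or not matches_allowlist(metric_name, self._allowlist):
            return UNKNOWN
        if account_id in self._active:
            return account_id
        # Assumption: tenants beyond the cap are folded into __overflow__.
        if len(self._active) >= self._max:
            return OVERFLOW
        self._active.add(account_id)
        return account_id
```

Folding excess tenants into a single bucket keeps the number of time series per metric family bounded by the cap plus two sentinel values, which is the property that makes tenant labels safe to scrape at high frequency.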
Related Documentation
- Architecture Overview - overall OpenViking architecture
- Multi-Tenant - `account/user/agent` isolation model
- Data Encryption - storage-layer encryption and isolation
- Metrics API - `/metrics` endpoint usage
- Metrics Design - metrics system design details
