# Path Locks and Crash Recovery

OpenViking uses two simple primitives — **path locks** and **redo log** — to protect the consistency of core write operations (`rm`, `mv`, `add_resource`, `session.commit`), ensuring that VikingFS, VectorDB, and QueueManager remain consistent even when failures occur.

## Design Philosophy

OpenViking is a context database where FS is the source of truth and VectorDB is a derived index. A lost index can be rebuilt from source data, but lost source data is unrecoverable. Therefore:

> **Better to miss a search result than to return a bad one.**

## Design Principles

1. **Write-exclusive**: Path locks ensure only one write operation can operate on a path at a time
2. **On by default**: All data operations automatically acquire locks; no extra configuration needed
3. **Lock as protection**: LockContext acquires locks on entry, releases on exit — no undo/journal/commit semantics
4. **Only session_memory needs crash recovery**: RedoLog re-executes memory extraction after a process crash
5. **Queue operations run outside locks**: SemanticQueue/EmbeddingQueue enqueue operations are idempotent and retriable

## Architecture

```
Service Layer (rm / mv / add_resource / session.commit)
    |
    v
+--[LockContext async context manager]--+
|                                       |
|  1. Create LockHandle                 |
|  2. Acquire path lock (poll+timeout)  |
|  3. Execute operations (FS+VectorDB)  |
|  4. Release lock                      |
|                                       |
|  On exception: auto-release lock,     |
|  exception propagates unchanged       |
+---------------------------------------+
    |
    v
Storage Layer (VikingFS, VectorDB, QueueManager)
```

## Two Core Components

### Component 1: PathLock + LockManager + LockContext (Path Lock System)

**PathLock** implements file-based distributed locks with two lock types — POINT and SUBTREE — using fencing tokens to prevent TOCTOU races and automatic stale lock detection and cleanup.

**LockHandle** is a lightweight lock holder token:

```python
@dataclass
class LockHandle:
    id: str          # Unique ID used to generate fencing tokens
    locks: list[str] # Acquired lock file paths
    created_at: float # Handle creation time
    last_active_at: float # Last successful acquire/refresh time
```

**LockManager** is a global singleton managing lock lifecycle:
- Creates/releases LockHandles
- Background cleanup of leaked locks (in-process safety net)
- Executes RedoLog recovery on startup

**LockContext** is an async context manager encapsulating the lock/unlock lifecycle:

```python
from openviking.storage.transaction import LockContext, get_lock_manager

async with LockContext(get_lock_manager(), [path], lock_mode="point") as handle:
    # Perform operations under lock protection
    ...
# Lock automatically released on exit (including exceptions)
```

### Component 2: RedoLog (Crash Recovery)

Used only for the memory extraction phase of `session.commit`. Writes a marker before the operation, deletes it after success, and scans for leftover markers on startup to redo.

```
/local/_system/redo/{task_id}/redo.json
```

Memory extraction is idempotent — re-extracting from the same archive produces the same result.

## Consistency Issues and Solutions

### rm(uri)

| Problem | Solution |
|---------|----------|
| Delete file first, then index -> file gone but index remains -> search returns non-existent file | **Reverse order**: delete index first, then file. Index deletion failure -> both file and index intact |

**Locking strategy** (depends on target type):
- Deleting a **directory**: `lock_mode="subtree"`, locks the directory itself
- Deleting a **file**: `lock_mode="point"`, locks the file's parent directory

Operation flow:

```
1. Check whether target is a directory or file, choose lock mode
2. Acquire lock
3. Delete VectorDB index -> immediately invisible to search
4. Delete FS file
5. Release lock
```

VectorDB deletion fails -> exception thrown, lock auto-released, file and index both intact. FS deletion fails -> VectorDB already deleted but file remains, retry is safe.

### mv(old_uri, new_uri)

| Problem | Solution |
|---------|----------|
| File moved to new path but index points to old path -> search returns old path (doesn't exist) | Copy first then update index; clean up copy on failure |

**Locking strategy** (handled automatically via `lock_mode="mv"`):
- Moving a **directory**: SUBTREE lock on both source path and destination parent
- Moving a **file**: POINT lock on both source's parent and destination parent

Operation flow:

```
1. Check whether source is a directory or file, set src_is_dir
2. Acquire mv lock (internally chooses SUBTREE or POINT based on src_is_dir)
3. Copy to new location (source still intact, safe)
4. If directory, remove the lock file carried over by cp into the copy
5. Update VectorDB URIs
   - Failure -> clean up copy, source and old index intact, consistent state
6. Delete source
7. Release lock
```

### add_resource

| Problem | Solution |
|---------|----------|
| File moved from temp to final directory, then crash -> file exists but never searchable | Two separate paths for first-time add vs incremental update |
| Resource already on disk but rm deletes it while semantic processing / vectorization is still running -> wasted work | Lifecycle SUBTREE lock held from finalization through processing completion |

**First-time add** (target does not exist) — handled in `ResourceProcessor.process_resource` Phase 3.5:

```
1. Acquire POINT lock on parent of final_uri
2. agfs.mv temp directory -> final location
3. Acquire SUBTREE lock on final_uri (inside POINT lock, eliminating race window)
4. Release POINT lock
5. Clean up temp directory
6. Enqueue SemanticMsg(lifecycle_lock_handle_id=...) -> DAG runs on final
7. DAG starts lock refresh loop (refreshes the lock token and updates handle activity every lock_expire/2 seconds)
8. DAG complete + all embeddings done -> release SUBTREE lock
```

During this period, `rm` attempting to acquire a SUBTREE lock on the same path will fail with `ResourceBusyError`.

**Incremental update** (target already exists) — temp stays in place:

```
1. Acquire SUBTREE lock on target_uri (protect existing resource)
2. Enqueue SemanticMsg(uri=temp, target_uri=final, lifecycle_lock_handle_id=...)
3. DAG runs on temp, lock refresh loop active
4. DAG completion triggers sync_diff_callback or move_temp_to_target_callback
5. Callback completes -> release SUBTREE lock
```

Note: DAG callbacks do NOT wrap operations in an outer lock. Each `VikingFS.rm` and `VikingFS.mv` has its own lock internally. An outer lock would conflict with these inner locks causing deadlock.

**Server restart recovery**: SemanticMsg is persisted in QueueFS. On restart, `SemanticProcessor` detects that the `lifecycle_lock_handle_id` handle is missing from the in-memory LockManager and re-acquires a SUBTREE lock.

### session.commit()

| Problem | Solution |
|---------|----------|
| Messages cleared but archive not written -> conversation data lost | Phase 1 without lock (incomplete archive has no side effects) + Phase 2 with RedoLog |

LLM calls have unpredictable latency (5s~60s+) and cannot be inside a lock-holding operation. The design splits into two phases:

```
Phase 1 — Archive (no lock):
  1. Generate archive summary (LLM)
  2. Write archive (history/archive_N/messages.jsonl + summaries)
  3. Clear messages.jsonl
  4. Clear in-memory message list

Phase 2 — Memory extraction + write (RedoLog):
  1. Write redo marker (archive_uri, session_uri, user identity)
  2. Extract memories from archived messages (LLM)
  3. Write current message state
  4. Write relations
  5. Directly enqueue SemanticQueue
  6. Delete redo marker
```

**Crash recovery analysis**:

| Crash point | State | Recovery action |
|------------|-------|----------------|
| During Phase 1 archive write | No marker | Incomplete archive; next commit scans history/ for index, unaffected |
| Phase 1 archive complete but messages not cleared | No marker | Archive complete + messages still present = redundant but safe |
| During Phase 2 memory extraction/write | Redo marker exists | On startup: redo extraction + write + enqueue from archive |
| Phase 2 complete | Redo marker deleted | No recovery needed |

## LockContext

`LockContext` is an **async** context manager that encapsulates lock acquisition and release:

```python
from openviking.storage.transaction import LockContext, get_lock_manager

lock_manager = get_lock_manager()

# Point lock (write operations, semantic processing)
async with LockContext(lock_manager, [path], lock_mode="point"):
    # Perform operations...
    pass

# Subtree lock (delete operations)
async with LockContext(lock_manager, [path], lock_mode="subtree"):
    # Perform operations...
    pass

# MV lock (move operations)
async with LockContext(lock_manager, [src], lock_mode="mv", mv_dst_parent_path=dst):
    # Perform operations...
    pass
```

**Lock modes**:

| lock_mode | Use case | Behavior |
|-----------|----------|----------|
| `point` | Write operations, semantic processing | Lock the specified path; conflicts with any lock on the same path and any SUBTREE lock on ancestors |
| `subtree` | Delete operations | Lock the subtree root; conflicts with any lock on the same path, any lock on descendants, and any SUBTREE lock on ancestors |
| `mv` | Move operations | Directory move: SUBTREE lock on both source and destination; File move: POINT lock on source parent and destination (controlled by `src_is_dir`) |

**Exception handling**: `__aexit__` always releases locks and does not swallow exceptions. Lock acquisition failure raises `LockAcquisitionError`.

## Lock Types (POINT vs SUBTREE)

The lock mechanism uses two lock types to handle different conflict patterns:

| | POINT on same path | SUBTREE on same path | POINT on descendant | SUBTREE on ancestor |
|---|---|---|---|---|
| **POINT** | Conflict | Conflict | — | Conflict |
| **SUBTREE** | Conflict | Conflict | Conflict | Conflict |

- **POINT (P)**: Used for write and semantic-processing operations. Only locks a single directory. Blocks if any ancestor holds a SUBTREE lock.
- **SUBTREE (S)**: Used for rm and mv operations. Logically covers the entire subtree but only writes **one lock file** at the root. Before acquiring, scans all descendants and ancestor directories for conflicting locks.

## Lock Mechanism

### Lock Protocol

Lock file path: `{path}/.path.ovlock`

Lock file content (Fencing Token):
```
{handle_id}:{time_ns}:{lock_type}
```

Where `lock_type` is `P` (POINT) or `S` (SUBTREE).

### Lock Acquisition (POINT mode)

```
loop until timeout (poll interval: 200ms):
    1. Check target directory exists
    2. Check if target directory is locked by another operation
       - Stale lock? -> remove and retry
       - Active lock? -> wait
    3. Check all ancestor directories for SUBTREE locks
       - Stale lock? -> remove and retry
       - Active lock? -> wait
    4. Write POINT (P) lock file
    5. TOCTOU double-check: re-scan ancestors for SUBTREE locks
       - Conflict found: compare (timestamp, handle_id)
       - Later one (larger timestamp/handle_id) backs off (removes own lock) to prevent livelock
       - Wait and retry
    6. Verify lock file ownership (fencing token matches)
    7. Success

Timeout (default 0 = no-wait) raises LockAcquisitionError
```

### Lock Acquisition (SUBTREE mode)

```
loop until timeout (poll interval: 200ms):
    1. Check target directory exists
    2. Check if target directory is locked by another operation
       - Stale lock? -> remove and retry
       - Active lock? -> wait
    3. Check all ancestor directories for SUBTREE locks
       - Stale lock? -> remove and retry
       - Active lock? -> wait
    4. Scan all descendant directories for any locks by other operations
       - Stale lock? -> remove and retry
       - Active lock? -> wait
    5. Write SUBTREE (S) lock file (only one file, at the root path)
    6. TOCTOU double-check: re-scan descendants and ancestors
       - Conflict found: compare (timestamp, handle_id)
       - Later one (larger timestamp/handle_id) backs off (removes own lock) to prevent livelock
       - Wait and retry
    7. Verify lock file ownership (fencing token matches)
    8. Success

Timeout (default 0 = no-wait) raises LockAcquisitionError
```

### Lock Expiry Cleanup

**Stale lock detection**: PathLock checks the fencing token timestamp. Locks older than `lock_expire` (default 300s) are considered stale and are removed automatically during acquisition.

**In-process cleanup**: LockManager checks active LockHandles every 60 seconds. Handles that still own lock files but have been inactive for longer than `lock_expire` are force-released.

**Orphan locks**: Lock files left behind after a process crash are automatically removed via stale lock detection when any operation next attempts to acquire a lock on the same path.

## Crash Recovery

`LockManager.start()` automatically scans for leftover markers in `/local/_system/redo/` on startup:

| Scenario | Recovery action |
|----------|----------------|
| session_memory extraction crash | Redo memory extraction + write + enqueue from archive |
| Crash while holding lock | Lock file remains in AGFS; stale detection auto-cleans on next acquisition (default 300s expiry) |
| Crash after enqueue, before worker processes | QueueFS SQLite persistence; worker auto-pulls after restart |
| Orphan index | Cleaned on L2 on-demand load |

### Defense Summary

| Failure scenario | Defense | Recovery timing |
|-----------------|--------|-----------------|
| Crash during operation | Lock auto-expires + stale detection | Next acquisition of same path lock |
| Crash during add_resource semantic processing | Lifecycle lock expires + SemanticProcessor re-acquires on restart | Worker restart |
| Crash during session.commit Phase 2 | RedoLog marker + redo | On restart |
| Crash after enqueue, before worker | QueueFS SQLite persistence | Worker restart |
| Orphan index | L2 on-demand load cleanup | When user accesses |

## Configuration

Path locks are enabled by default with no extra configuration needed. **The default behavior is no-wait**: if the path is locked, `LockAcquisitionError` is raised immediately. To allow wait/retry, configure the `storage.transaction` section:

```json
{
  "storage": {
    "transaction": {
      "lock_timeout": 5.0,
      "lock_expire": 300.0
    }
  }
}
```

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `lock_timeout` | float | Lock acquisition timeout (seconds). `0` = fail immediately if locked (default). `> 0` = wait/retry up to this many seconds. | `0.0` |
| `lock_expire` | float | Lock inactivity threshold (seconds). Locks not refreshed within this window are treated as stale and reclaimed. | `300.0` |

### QueueFS Persistence

The lock mechanism relies on QueueFS using the SQLite backend to ensure enqueued tasks survive process restarts. This is the default configuration and requires no manual setup.

## Related Documentation

- [Architecture](./01-architecture.md) - System architecture overview
- [Storage](./05-storage.md) - AGFS and vector store
- [Session Management](./08-session.md) - Session and memory management
- [Configuration](../guides/01-configuration.md) - Configuration reference