# Context Extraction

OpenViking uses a three-layer async architecture for document parsing and context extraction.

## Overview

```
Input File → Parser → TreeBuilder → SemanticQueue → Vector Index
                ↓           ↓               ↓
            Parse &     Move Files     L0/L1 Generation
            Convert     Queue Semantic (LLM Async)
            (No LLM)
```

**Design Principle**: Parsing and semantics are separated. Parser doesn't call LLM; semantic generation is async.

## Parser

Parser handles document format conversion and structuring, creating file structure in temp directory.

### Supported Formats

| Format | Parser | Extensions | Status |
|--------|--------|------------|--------|
| Markdown | MarkdownParser | .md, .markdown | Supported |
| Plain text | TextParser | .txt | Supported |
| PDF | PDFParser | .pdf | Supported |
| HTML | HTMLParser | .html, .htm | Supported |
| Code | CodeRepositoryParser | .py, .js, .go, etc. | |
| Image | ImageParser | .png, .jpg, etc. | |
| Video | VideoParser | .mp4, .avi, etc. | |
| Audio | AudioParser | .mp3, .wav, etc. | |

### Core Flow (Document Example)

```python
# 1. Parse file
parse_result = registry.parse("/path/to/doc.md")

# 2. Returns temp directory URI
parse_result.temp_dir_path  # viking://temp/abc123
```

### Smart Splitting

```
If document_tokens <= 1024:
    → Save as single file
Else:
    → Split by headers
    → Section < 512 tokens → Merge
    → Section > 1024 tokens → Create subdirectory
```

### Return Result

```python
ParseResult(
    temp_dir_path: str,    # Temp directory URI
    source_format: str,    # pdf/markdown/html
    parser_name: str,      # Parser name
    parse_time: float,     # Duration (seconds)
    meta: Dict,            # Metadata
)
```

## TreeBuilder

TreeBuilder moves temp directory to AGFS and queues semantic processing.

### Core Flow

```python
building_tree = tree_builder.finalize_from_temp(
    temp_dir_path="viking://temp/abc123",
    scope="resources",  # resources/user/agent
)
```

### 5-Phase Processing

1. **Find document root**: Ensure exactly 1 subdirectory in temp
2. **Determine target URI**: Map base URI by scope
3. **Recursively move directory tree**: Copy all files to AGFS
4. **Clean up temp directory**: Delete temp files
5. **Queue semantic generation**: Submit SemanticMsg to queue

### URI Mapping

| scope | Base URI |
|-------|----------|
| resources | `viking://resources` |
| user | `viking://user` |
| agent | `viking://agent` |

## SemanticQueue

SemanticQueue handles async L0/L1 generation and vectorization.

### Message Structure

```python
SemanticMsg(
    id: str,            # UUID
    uri: str,           # Directory URI
    context_type: str,  # resource/memory/skill
    status: str,        # pending/processing/completed
)
```

### Processing Flow (Bottom-up)

```
Leaf directories → Parent directories → Root
```

### Single Directory Processing Steps

1. **Concurrent file summary generation**: Limited to 10 concurrent
2. **Collect child directory abstracts**: Read generated .abstract.md
3. **Generate .overview.md**: LLM generates L1 overview
4. **Extract .abstract.md**: Extract L0 from overview
5. **Write files**: Save to AGFS
6. **Vectorize**: Create Context and queue to EmbeddingQueue

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_concurrent_llm` | 10 | Concurrent LLM calls |
| `max_images_per_call` | 10 | Max images per VLM call |
| `max_sections_per_call` | 20 | Max sections per VLM call |

## Code Skeleton Extraction (AST Mode)

For code files, OpenViking supports AST-based skeleton extraction via tree-sitter as a lightweight alternative to LLM summarization, significantly reducing processing cost.

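
To make the idea of a "skeleton" concrete, here is a minimal sketch of the kind of structural summary such an extractor can produce without any LLM call. It is an illustration only: it uses Python's standard-library `ast` module instead of tree-sitter, handles Python source only, and the names `SkeletonSketch`, `signature`, and `extract_skeleton` are hypothetical, not OpenViking's actual `CodeSkeleton` data structures or API.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass, field


@dataclass
class SkeletonSketch:
    """Hypothetical container for illustration; not OpenViking's CodeSkeleton."""
    docstring: str = ""                                 # first line of the module docstring
    imports: list[str] = field(default_factory=list)    # import statements as source text
    functions: list[str] = field(default_factory=list)  # top-level function signatures
    classes: dict[str, list[str]] = field(default_factory=dict)  # class header -> method signatures


def signature(node: ast.FunctionDef | ast.AsyncFunctionDef) -> str:
    """Render 'name(arg1, arg2)' using positional argument names only."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"{node.name}({args})"


def extract_skeleton(source: str) -> SkeletonSketch:
    """Walk the top level of a Python module and collect its structure."""
    tree = ast.parse(source)
    sk = SkeletonSketch()

    doc = ast.get_docstring(tree)
    if doc:
        sk.docstring = doc.splitlines()[0]

    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            sk.imports.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sk.functions.append(signature(node))
        elif isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(b) for b in node.bases)
            header = f"{node.name}({bases})" if bases else node.name
            sk.classes[header] = [
                signature(m)
                for m in node.body
                if isinstance(m, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
    return sk


SAMPLE = '''"""Tiny demo module.

More detail here.
"""
import os
from typing import List


class Greeter(Base):
    def greet(self, name):
        return f"hi {name}"


def main(argv):
    return 0
'''

skeleton = extract_skeleton(SAMPLE)
print(skeleton)
```

The sketch never reads function bodies, which is why a skeleton of this kind is cheap to compute compared with sending the whole file to an LLM.
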
### Modes

Controlled by `code_summary_mode` in `ov.conf` (see [Configuration](../guides/01-configuration.md#code)):

| Mode | Description |
|------|-------------|
| `"ast"` | Extract structural skeleton for files ≥100 lines, skip LLM calls (**default**) |
| `"llm"` | Always use LLM for summarization (original behavior) |
| `"ast_llm"` | Extract AST skeleton first, then pass it as context to LLM for summarization |

### What AST Extracts

The skeleton includes:

- Module-level docstring (first line)
- Import statement list
- Class names, base classes, and method signatures (`ast` mode: first-line docstrings only; `ast_llm` mode: full docstrings)
- Top-level function signatures

### Supported Languages

The following languages have dedicated extractors built on tree-sitter:

| Language | Status |
|----------|--------|
| Python | Supported |
| JavaScript / TypeScript | Supported |
| Rust | Supported |
| Go | Supported |
| Java | Supported |
| C / C++ | Supported |

Other languages automatically fall back to LLM.

### Fallback Behavior

The following conditions trigger automatic fallback to LLM, with the reason logged. The overall pipeline is unaffected:

- Language not in the supported list
- File has fewer than 100 lines
- AST parse error
- Extraction produces an empty skeleton

### File Structure

```
openviking/parse/parsers/code/ast/
├── extractor.py      # Language detection and dispatch
├── skeleton.py       # CodeSkeleton / FunctionSig / ClassSkeleton data structures
└── languages/        # Per-language extractors
```

## Three Context Types Extraction

### Flow Comparison

| Phase | Resource | Memory | Skill |
|-------|----------|--------|-------|
| **Parser** | Common flow | Common flow | Common flow |
| **Base URI** | `viking://resources` | `viking://user/memories` | `viking://agent/skills` |
| **TreeBuilder scope** | resources | user/agent | agent |
| **SemanticMsg type** | resource | memory | skill |

### Resource Extraction

```python
# Add resource
await client.add_resource(
    "/path/to/doc.pdf",
    reason="API documentation"
)

# Flow: Parser → TreeBuilder(scope=resources) → SemanticQueue
```

### Skill Extraction

```python
# Add skill
await client.add_skill({
    "name": "search-web",
    "content": "# search-web\\n..."
})

# Flow: Direct write to viking://agent/skills/{name}/ → SemanticQueue
```

### Memory Extraction

```python
# Memory auto-extracted from session
await session.commit()

# Flow: MemoryExtractor → TreeBuilder(scope=user) → SemanticQueue
```

## Related Documents

- [Architecture Overview](./01-architecture.md) - System architecture
- [Context Layers](./03-context-layers.md) - L0/L1/L2 model
- [Storage Architecture](./05-storage.md) - AGFS and vector index
- [Session Management](./08-session.md) - Memory extraction details