Three-Layer Architecture: Model-Encoder-Extractor

Understanding the systematic approach to database schema processing in the db/ module

Overview

Every database system in the mcp_ohmy_sql/db/ module follows a standardized three-layer architecture that separates concerns and ensures consistent behavior across different database technologies. This document explains each layer using the relational/ sub-module as a concrete example.

The three layers work together in a pipeline:

  • Layer 1 - Model: Structures the raw data into typed Python objects

  • Layer 2 - Encoder: Transforms structured data into AI-friendly formats

  • Layer 3 - Extractor: Connects to databases and extracts raw metadata

Architecture Flow

Database → Extractor → Model Objects → Encoder → AI-Friendly Text

Raw SQL  → Layer 3   → Layer 1       → Layer 2 → Compact Schema
Metadata               (Pydantic)              → Representation

This separation ensures that:

  • Database-specific logic is isolated in extractors

  • Data validation and structure is handled by models

  • AI optimization is managed by encoders

  • Each layer can be tested and modified independently

Layer 1: Data Models (schema_1_model.py)

Purpose: Define structured data containers for database schema elements

Location: mcp_ohmy_sql/db/relational/schema_1_model.py

Core Responsibility: Transform raw database metadata into strongly-typed Python objects using Pydantic for validation and serialization.

Hierarchical Model Structure

The models follow a hierarchical structure that mirrors database organization:

DatabaseInfo
└── schemas: List[SchemaInfo]
    └── tables: List[TableInfo]
        └── columns: List[ColumnInfo]
            └── foreign_keys: List[ForeignKeyInfo]

Key Model Classes: ForeignKeyInfo, ColumnInfo, TableInfo, SchemaInfo, DatabaseInfo

Each class uses Pydantic for type validation and serialization, ensuring data consistency and providing clear documentation through field definitions. The hierarchical structure enables easy navigation of database metadata while maintaining type safety throughout the system.

Testing: The model layer has minimal testing requirements since it primarily defines data structures. The corresponding test file tests/db/relational/test_db_relational_schema_1_model.py contains simple import tests to ensure all model classes are properly defined and accessible.

Layer 2: Schema Encoders (schema_2_encoder.py)

Purpose: Transform structured model objects into compact, AI-optimized text representations

Location: mcp_ohmy_sql/db/relational/schema_2_encoder.py

Core Responsibility: Convert verbose database metadata into token-efficient formats that preserve essential information while reducing LLM context usage by ~70%.

Key Encoding Functions

The encoder provides specialized functions for each model type: encode_column_info(), encode_table_info(), encode_schema_info(), encode_database_info()

The encoding strategy focuses on constraint-aware compression - retaining maximum information while minimizing tokens through intelligent abbreviations and smart constraint logic that avoids redundant information (e.g., primary keys don’t need explicit NOT NULL markers).

Column Format: ${COLUMN_NAME}:${DATA_TYPE}${CONSTRAINTS} with abbreviations like *PK (Primary Key), *FK->Table.Column (Foreign Key), *NN (Not Null)

Table Format: SQL-like structure that’s immediately recognizable to both humans and AI systems, preserving visual hierarchy while dramatically reducing token count.

Testing: The encoder layer requires comprehensive testing to ensure accurate schema representation. The test file tests/db/relational/test_db_relational_schema_2_encoder.py creates mock schema model objects (ColumnInfo, TableInfo) with various constraint combinations and verifies the encoded output matches expected formats. Testing focuses on column and table encoding since schema and database objects are primarily data containers.

Layer 3: Schema Extractors (schema_3_extractor.py)

Purpose: Connect to databases and extract raw metadata using database-specific logic

Location: mcp_ohmy_sql/db/relational/schema_3_extractor.py

Core Responsibility: Handle the complexity of database introspection, type mapping, and metadata extraction while providing a clean interface to higher layers.

Key Extraction Functions

The extractor provides hierarchical construction functions: new_foreign_key_info(), new_column_info(), new_table_info(), new_schema_info(), new_database_info()

The extraction mechanism leverages SQLAlchemy’s built-in introspection capabilities to query system tables and metadata. For relational databases, this means using sa.inspect() and sa.MetaData.reflect() to discover tables, columns, constraints, and relationships. The key innovation is the comprehensive type mapping system that normalizes database-specific types into simplified LLM-friendly categories.

Type Mapping: Database-specific types (VARCHAR, BIGINT, TIMESTAMP) are mapped to universal categories (STR, INT, TS) for consistent AI consumption across different database systems.

Object Construction: Bottom-up approach building from foreign keys → columns → tables → schemas → databases, ensuring all relationships are properly captured.

Testing: The extractor layer requires real database connections for proper testing. The test file tests/db/relational/test_db_relational_schema_3_extractor.py uses in-memory SQLite databases to test the complete extraction pipeline, verifying that SQLAlchemy objects are correctly transformed into the corresponding model objects with accurate metadata, types, and constraints.

Layer Integration Example

Here’s how the three layers work together to process a database table:

1. Raw Database Metadata (what extractor receives):

CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
    email VARCHAR(255) UNIQUE NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

2. Model Objects (after Layer 3 → Layer 1):

TableInfo(
    name="customer",
    object_type=ObjectTypeEnum.TABLE,
    columns=[
        ColumnInfo(
            name="customer_id",
            type="INTEGER",
            llm_type=LLMTypeEnum.INT,
            primary_key=True,
            nullable=False
        ),
        ColumnInfo(
            name="email",
            type="VARCHAR(255)",
            llm_type=LLMTypeEnum.STR,
            unique=True,
            nullable=False
        )
    ]
)

3. AI-Friendly Output (after Layer 1 → Layer 2):

Table customer(
    customer_id:INT*PK,
    email:STR*UQ*NN,
    first_name:STR*NN,
    created_at:TS
)

Architecture Benefits

For Development:

  • Clear Separation: Each layer has distinct, well-defined responsibilities

  • Independent Testing: Layers can be unit tested with mock dependencies

  • Predictable Patterns: New database systems follow the same structure

  • Debugging Support: Intermediate objects can be inspected and validated

For Maintenance:

  • Isolated Changes: Modifications to one layer don’t affect others

  • Type Safety: Pydantic models catch data inconsistencies early

  • Documentation: Code structure serves as architectural documentation

  • Extensibility: New features can be added layer by layer

For AI Integration:

  • Optimized Output: Encoders are specifically designed for LLM consumption

  • Consistent Format: All database systems produce identical output format

  • Token Efficiency: Significant reduction in context usage

  • Semantic Preservation: Essential schema information is retained despite compression