Schema Encoding
This proof of concept (POC) explores different methods of encoding database schema information so that large language models (LLMs) can comprehend it more efficiently. We evaluated three encoding formats:
- Hierarchy-encoded format: A custom-designed structure that maximizes token efficiency while preserving all schema details.
- Formatted JSON (human-readable): A standard indented JSON format that is easy for humans to read and understand.
- Compact JSON (machine-optimized): A minified, single-line JSON format optimized for machine parsing.
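The difference between the two JSON variants can be sketched with Python's standard `json` module. The `schema` dictionary below is a hypothetical example made up for illustration, not the schema used in this POC, and the custom hierarchy-encoded format is not reproduced here because its layout is not specified in this section.

```python
import json

# Hypothetical schema, for illustration only (not the POC's actual schema).
schema = {
    "tables": [
        {
            "name": "customers",
            "columns": [
                {"name": "id", "type": "integer", "primary_key": True},
                {"name": "email", "type": "varchar(255)", "primary_key": False},
            ],
        }
    ]
}

# Formatted JSON: indented across multiple lines, easy for humans to read.
formatted = json.dumps(schema, indent=2)

# Compact JSON: a single line with no whitespace after separators.
compact = json.dumps(schema, separators=(",", ":"))

print(f"formatted: {len(formatted)} characters")
print(f"compact:   {len(compact)} characters")
```

Both strings decode to the same data; only the whitespace differs, which is where the character and token savings of the compact variant come from.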
We used OpenAI’s Tokenizer to calculate the number of tokens generated by each format and compared their sizes:
Token and Character Comparison:
| Encoding Format | Tokens | Characters |
|---|---|---|
| Hierarchy-encoded | 596 | 1980 |
| Formatted JSON | 8241 | 49581 |
| Compact JSON | 6437 | 22319 |
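To put these figures in relative terms, the following sketch computes the token reduction of one format against another and the average characters per token. The numbers are copied from the table; the names `measurements`, `token_reduction`, and `chars_per_token` are illustrative helpers, not part of the POC.

```python
# (tokens, characters) per encoding format, copied from the comparison table.
measurements = {
    "hierarchy": (596, 1980),
    "json formatted": (8241, 49581),
    "json compact": (6437, 22319),
}


def token_reduction(candidate: str, baseline: str) -> float:
    """Percentage of tokens saved by `candidate` relative to `baseline`."""
    candidate_tokens = measurements[candidate][0]
    baseline_tokens = measurements[baseline][0]
    return 100.0 * (1 - candidate_tokens / baseline_tokens)


def chars_per_token(fmt: str) -> float:
    """Average number of characters represented by one token."""
    tokens, chars = measurements[fmt]
    return chars / tokens


for baseline in ("json formatted", "json compact"):
    saved = token_reduction("hierarchy", baseline)
    print(f"hierarchy vs {baseline}: {saved:.1f}% fewer tokens")

for fmt in measurements:
    print(f"{fmt}: {chars_per_token(fmt):.2f} characters per token")
```

By this calculation, the hierarchy format uses roughly 93% fewer tokens than formatted JSON and about 91% fewer than compact JSON, while minifying JSON alone saves about 22% of the tokens.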