Schema Encoding

This proof of concept (POC) explores different ways of encoding database schema information so that large language models (LLMs) can comprehend it more efficiently. We evaluated three encoding formats:

  1. Hierarchy-encoded format: A custom-designed structure that aims to minimize token count while preserving all schema details.

  2. Formatted JSON (human-readable): A standard indented JSON format that is easy for humans to read and understand.

  3. Compact JSON (machine-optimized): A minified, single-line JSON format optimized for machine parsing.
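As a rough illustration of how the three formats differ, the sketch below encodes a small toy schema in each of them and prints the character counts. The schema, the function names, and in particular the layout produced by `to_hierarchy` are assumptions made for this example; the POC's actual hierarchy format is not shown in this document and may look different.

```python
import json

# Hypothetical toy schema; the schema used in the POC is not shown here.
schema = {
    "tables": [
        {
            "name": "customers",
            "columns": [
                {"name": "id", "type": "int", "primary_key": True},
                {"name": "email", "type": "varchar"},
            ],
        },
        {
            "name": "orders",
            "columns": [
                {"name": "id", "type": "int", "primary_key": True},
                {"name": "customer_id", "type": "int"},
            ],
        },
    ]
}


def to_formatted_json(s):
    """Indented, human-readable JSON."""
    return json.dumps(s, indent=2)


def to_compact_json(s):
    """Minified single-line JSON with no optional whitespace."""
    return json.dumps(s, separators=(",", ":"))


def to_hierarchy(s):
    """Assumed hierarchy layout: one line per table, '*' marks a primary key."""
    lines = []
    for table in s["tables"]:
        cols = ", ".join(
            f"{c['name']} {c['type']}" + ("*" if c.get("primary_key") else "")
            for c in table["columns"]
        )
        lines.append(f"{table['name']}: {cols}")
    return "\n".join(lines)


encodings = {
    "hierarchy": to_hierarchy(schema),
    "json formatted": to_formatted_json(schema),
    "json compact": to_compact_json(schema),
}

for name, text in encodings.items():
    print(f"{name:<15} {len(text):>5} characters")
```

Most of the savings in a layout like this come from dropping repeated key names, quotes, and braces, which JSON repeats for every column. Note that this toy layout only carries names, types, and primary keys; a format that preserves all schema details would need to encode more.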

We used OpenAI’s Tokenizer to calculate the number of tokens generated by each format and compared their sizes:

Token and Character Comparison:

| Encoding Format | Tokens | Characters |
|-----------------|--------|------------|
| hierarchy       | 596    | 1980       |
| json formatted  | 8241   | 49581      |
| json compact    | 6437   | 22319      |
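
To put the figures from the comparison above in relative terms, the short sketch below computes, for each format, the token reduction achieved by the hierarchy format and the average number of characters per token. The numbers are copied from the comparison; the helper names are only illustrative. By this arithmetic the hierarchy format uses roughly 93% fewer tokens than formatted JSON and roughly 91% fewer than compact JSON.

```python
# (tokens, characters) per encoding format, copied from the comparison above.
results = {
    "hierarchy": (596, 1980),
    "json formatted": (8241, 49581),
    "json compact": (6437, 22319),
}

baseline_tokens = results["hierarchy"][0]


def token_reduction(tokens, baseline):
    """Percentage of tokens saved by the baseline relative to `tokens`."""
    return 100.0 * (1 - baseline / tokens)


def chars_per_token(tokens, chars):
    """Average number of characters represented by one token."""
    return chars / tokens


for name, (tokens, chars) in results.items():
    reduction = token_reduction(tokens, baseline_tokens)
    density = chars_per_token(tokens, chars)
    print(f"{name:<15} saved by hierarchy: {reduction:5.1f}%  chars/token: {density:.2f}")
```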