Extract Metadata from AWS Redshift

本文详细介绍如何从 AWS Redshift 的 System Table 中提取所有 Schema, Table, Column 的完整元数据信息, 以便将这些结构化数据编码后提供给 LLM, 帮助我们生成更准确的 SQL 查询语句.

Setup Helpers

在开始元数据提取工作之前, 我们首先配置必要的工具和连接环境. 这些工具将帮助我们在 Notebook 中便捷地执行 SQL 查询并直观地预览查询结果.

[35]:
import polars as pl
import sqlalchemy as sa
from rich import print as rprint
from rich.console import Console
from rich.syntax import Syntax
import simple_aws_redshift.api as aws_rs
from boto_session_manager import BotoSesManager

from mcp_ohmy_sql.tests.aws.constants import aws_profile, database_name, namespace_name, workgroup_name

接下来创建与 Redshift Serverless 的数据库连接:

[36]:
# Create sqlalchemy engine.
bsm = BotoSesManager(profile_name=aws_profile)
params = aws_rs.RedshiftServerlessConnectionParams.new(
    redshift_serverless_client=bsm.redshiftserverless_client,
    namespace_name=namespace_name,
    workgroup_name=workgroup_name,
)
engine = params.get_engine()

为了提高开发效率, 我们定义一个通用的 SQL 执行和结果展示函数:

[37]:
# 定义一个用来运行 SQL 以及展示结果的 helper 函数
console = Console()
pl.Config(tbl_rows=20)

def run_sql(sql: str)-> pl.DataFrame:
    syntax = Syntax(sql, "sql", line_numbers=True)
    console.print(syntax)
    with engine.connect() as conn:
        rows = conn.execute(sa.text(sql)).fetchall()
        df = pl.DataFrame(rows)
    return df

sql = "SELECT 1 as value;"
run_sql(sql)
  1 SELECT 1 as value;                                                                                             
[37]:
shape: (1, 1)
value
i64
1

Explore Key System Tables

Redshift 继承了 PostgreSQL 的架构设计, 采用 Schema → Table → Column 的三层数据组织结构. 我们的目标是找到三个核心 SQL 查询, 分别用于获取:

  • Schema 信息 - 每行代表一个 Schema 及其属性

  • Table 信息 - 每行代表一个 Table 及其属性

  • Column 信息 - 每行代表一个 Column 及其详细属性

在编写复杂的元数据查询之前, 我们需要先了解 Redshift 中可用的系统表及其数据结构. 值得注意的是, Redshift 在 PostgreSQL 原有系统表基础上进行了扩展, 添加了许多 _info 后缀的视图 (如 pg_user_info, pg_class_info), 这些视图对 Redshift 用户更加友好, 我们将优先使用这些增强版本.

pg_user_info

pg_user_info 是 Amazon Redshift 提供的系统视图, 显示用户信息包括用户 ID 和密码过期时间等详细信息. 需要注意的是, 只有超级用户才能访问 PG_USER_INFO 视图. 对于我们的元数据提取需求, 使用基础的 pg_user 表就足够了, 因为我们主要需要获取用户名信息用于权限过滤.

[38]:
query = """
SELECT id, name
FROM users
WHERE active = 1;
"""
[39]:
sql = """
SELECT
    *
FROM pg_user t
LIMIT 10;
"""
run_sql(sql)
  1                                                                                                                
  2 SELECT                                                                                                         
  3     *                                                                                                          
  4 FROM pg_user t                                                                                                 
  5 LIMIT 10;                                                                                                      
  6                                                                                                                
[39]:
shape: (4, 8)
usenameusesysidusecreatedbusesuperusecatupdpasswdvaluntiluseconfig
stri64boolboolboolstrstrnull
"rdsdb"1truetruetrue"********""infinity"null
"admin"100truetruefalse"********"nullnull
"IAMR:cdk-hnb659fds-cfn-exec-ro…101truetruefalse"********"nullnull
"IAM:sanhe"102falsefalsefalse"********"nullnull

pg_description

pg_description 表是 PostgreSQL 的标准系统表, 用于存储数据库对象的注释信息 (也称为描述信息). 这些注释通过 SQL 的 COMMENT 语句定义, 相当于将文档直接嵌入到数据库结构中.

[40]:
sql = """
SELECT
    *
FROM pg_description t
LIMIT 10;
"""
run_sql(sql)
  1                                                                                                                
  2 SELECT                                                                                                         
  3     *                                                                                                          
  4 FROM pg_description t                                                                                          
  5 LIMIT 10;                                                                                                      
  6                                                                                                                
[40]:
shape: (10, 4)
objoidclassoidobjsubiddescription
i64i64i64str
124212550"I/O"
124312550"I/O"
124412550"I/O"
3112550"I/O"
124512550"I/O"
3312550"I/O"
3412550"I/O"
3512550"I/O"
3812550"I/O"
3912550"I/O"

我们可以通过 JOIN 操作将 pg_description 与其他系统表结合, 获取带有描述信息的完整元数据:

[41]:
sql = """
-- oid is the hidden database object id
SELECT
    pg_namespace.oid as object_id,
    pg_namespace.nspname as schema_name,
    pg_description.description as description
FROM pg_namespace
JOIN pg_description
    ON pg_namespace.oid = pg_description.objoid
WHERE
    pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')
    AND pg_namespace.nspname NOT LIKE 'pg_%'
LIMIT 20;
"""
run_sql(sql)
   1                                                                                                               
   2 -- oid is the hidden database object id                                                                       
   3 SELECT                                                                                                        
   4     pg_namespace.oid as object_id,                                                                            
   5     pg_namespace.nspname as schema_name,                                                                      
   6     pg_description.description as description                                                                 
   7 FROM pg_namespace                                                                                             
   8 JOIN pg_description                                                                                           
   9     ON pg_namespace.oid = pg_description.objoid                                                               
  10 WHERE                                                                                                         
  11     pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')                                     
  12     AND pg_namespace.nspname NOT LIKE 'pg_%'                                                                  
  13 LIMIT 20;                                                                                                     
  14                                                                                                               
[41]:
shape: (1, 3)
object_idschema_namedescription
i64strstr
2200"public""Standard public schema"

pg_namespace

pg_namespace 是 PostgreSQL 的核心系统表. 在数据库理论中, namespace 是 SQL schema 的底层实现结构, 每个 namespace 可以包含独立的关系 (表), 类型等对象, 且不会产生命名冲突. 重要特征:

  • 来源: PostgreSQL 原生系统表

  • 用户友好性: 是, 自动根据当前用户权限过滤结果

  • 数据粒度: 表中每一行代表一个 Schema

[42]:
sql = """
SELECT *
FROM pg_namespace t
LIMIT 20;
"""
df = run_sql(sql)
df
  1                                                                                                                
  2 SELECT *                                                                                                       
  3 FROM pg_namespace t                                                                                            
  4 LIMIT 20;                                                                                                      
  5                                                                                                                
[42]:
shape: (10, 3)
nspnamenspownernspacl
stri64str
"pg_toast"1null
"pg_internal"1null
"catalog_history"1null
"pg_mv"1null
"pg_automv"1null
"pg_auto_copy"1null
"pg_s3"1null
"pg_catalog"1"{rdsdb=UCDA/rdsdb,=U/rdsdb}"
"public"1"{rdsdb=UCDA/rdsdb,=UC/rdsdb}"
"information_schema"1"{rdsdb=UCDA/rdsdb,=U/rdsdb}"

为了获取用户自定义的 Schema (排除系统 Schema), 我们使用以下过滤条件:

[43]:
sql = """
-- oid is the hidden database object id
SELECT
    t.oid as object_id,
    t.*
FROM pg_namespace t
WHERE
    t.nspname NOT IN ('information_schema', 'catalog_history')
    AND t.nspname NOT LIKE 'pg_%'
LIMIT 20;
"""
run_sql(sql)
   1                                                                                                               
   2 -- oid is the hidden database object id                                                                       
   3 SELECT                                                                                                        
   4     t.oid as object_id,                                                                                       
   5     t.*                                                                                                       
   6 FROM pg_namespace t                                                                                           
   7 WHERE                                                                                                         
   8     t.nspname NOT IN ('information_schema', 'catalog_history')                                                
   9     AND t.nspname NOT LIKE 'pg_%'                                                                             
  10 LIMIT 20;                                                                                                     
  11                                                                                                               
[43]:
shape: (1, 4)
object_idnspnamenspownernspacl
i64stri64str
2200"public"1"{rdsdb=UCDA/rdsdb,=UC/rdsdb}"

这个 SQL 就是我们要的, 给出了所有的 Schema 和 Description.

[44]:
sql = """
SELECT
    pg_namespace.nspname as schema_name,
    pg_description.description as description
FROM pg_namespace
JOIN pg_description
    ON pg_namespace.oid = pg_description.objoid
WHERE
    pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')
    AND pg_namespace.nspname NOT LIKE 'pg_%'
;
"""
run_sql(sql)
   1                                                                                                               
   2 SELECT                                                                                                        
   3     pg_namespace.nspname as schema_name,                                                                      
   4     pg_description.description as description                                                                 
   5 FROM pg_namespace                                                                                             
   6 JOIN pg_description                                                                                           
   7     ON pg_namespace.oid = pg_description.objoid                                                               
   8 WHERE                                                                                                         
   9     pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')                                     
  10     AND pg_namespace.nspname NOT LIKE 'pg_%'                                                                  
  11 ;                                                                                                             
  12                                                                                                               
[44]:
shape: (1, 2)
schema_namedescription
strstr
"public""Standard public schema"

pg_class_info

PG_CLASS_INFO 是 Amazon Redshift 专有的增强视图, 包含表的创建时间和当前分布样式等详细信息. 重要特征:

  • 来源: Redshift 专有 (非 PostgreSQL 原生)

  • 用户友好性: 是, 只显示当前用户有权限访问的表

  • 数据粒度: 表中每一行代表一个 Table 及其属性

[45]:
sql = """
SELECT *
FROM PG_CLASS_INFO
LIMIT 5;
"""
run_sql(sql)
  1                                                                                                                
  2 SELECT *                                                                                                       
  3 FROM PG_CLASS_INFO                                                                                             
  4 LIMIT 5;                                                                                                       
  5                                                                                                                
[45]:
shape: (5, 33)
reloidrelnamerelnamespacereltyperelownerrelamrelfilenodereltablespacerelpagesreltuplesreltoastrelidreltoastidxidrelhasindexrelissharedrelkindrelnattsrelexternidrelisreplicatedrelispinnedreldiststylerelprojbaseidrelchecksreltriggersrelukeysrelfkeysrelrefsrelhasoidsrelhaspkeyrelhasrulesrelhassubclassrelaclreleffectivediststylerelcreationtime
i64stri64i64i64i64i64i64i64f64i64i64boolboolstri64i64boolbooli64i64i64i64i64i64i64boolboolboolboolnulli64null
376"pg_xactlock"110100166400.000falsetrue"s"14294967295falsefalse0000000falsefalsefalsefalsenull0null
16774"pg_aggregate_fnoid_index"110140316774000.000falsefalse"i"14294967295falsefalse0000000falsefalsefalsefalsenull0null
16775"pg_am_name_index"110140316775000.000falsefalse"i"14294967295falsefalse0000000falsefalsefalsefalsenull0null
16776"pg_am_oid_index"110140316776000.000falsefalse"i"14294967295falsefalse0000000falsefalsefalsefalsenull0null
16777"pg_amop_opc_strat_index"110140316777000.000falsefalse"i"34294967295falsefalse0000000falsefalsefalsefalsenull0null

通过与 pg_namespace 进行 JOIN 操作, 我们可以获得完整的 schema_name.table_name 信息, 并且包含 Redshift 特有的分布键信息:

[46]:
sql = """
SELECT
    pg_class_info.reloid AS table_id,
    trim(pg_namespace.nspname) AS schema_name,
    trim(pg_class_info.relname) AS table_name,
    pg_class_info.reldiststyle,
    pg_class_info.releffectivediststyle,
    CASE WHEN pg_class_info.reldiststyle = 0 THEN 'EVEN'::text
        WHEN pg_class_info.reldiststyle = 1 THEN 'KEY'::text
        WHEN pg_class_info.reldiststyle = 8 THEN 'ALL'::text
        WHEN pg_class_info.releffectivediststyle = 10 THEN 'AUTO(ALL)'::text
        WHEN pg_class_info.releffectivediststyle = 11 THEN 'AUTO(EVEN)'::text
        WHEN pg_class_info.releffectivediststyle = 12 THEN 'AUTO(KEY)'::text ELSE '<<UNKNOWN>>'::text END AS diststyle,
    pg_class_info.relcreationtime AS creation_time,
    pg_class_info.relowner AS owner
FROM pg_class_info
LEFT JOIN pg_namespace
    ON pg_class_info.relnamespace = pg_namespace.oid
WHERE
    pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')
    AND pg_namespace.nspname NOT LIKE 'pg_%'
LIMIT 100;
"""
run_sql(sql)
   1                                                                                                               
   2 SELECT                                                                                                        
   3     pg_class_info.reloid AS table_id,                                                                         
   4     trim(pg_namespace.nspname) AS schema_name,                                                                
   5     trim(pg_class_info.relname) AS table_name,                                                                
   6     pg_class_info.reldiststyle,                                                                               
   7     pg_class_info.releffectivediststyle,                                                                      
   8     CASE WHEN pg_class_info.reldiststyle = 0 THEN 'EVEN'::text                                                
   9         WHEN pg_class_info.reldiststyle = 1 THEN 'KEY'::text                                                  
  10         WHEN pg_class_info.reldiststyle = 8 THEN 'ALL'::text                                                  
  11         WHEN pg_class_info.releffectivediststyle = 10 THEN 'AUTO(ALL)'::text                                  
  12         WHEN pg_class_info.releffectivediststyle = 11 THEN 'AUTO(EVEN)'::text                                 
  13         WHEN pg_class_info.releffectivediststyle = 12 THEN 'AUTO(KEY)'::text ELSE '<<UNKNOWN>>'::text END AS d
  14     pg_class_info.relcreationtime AS creation_time,                                                           
  15     pg_class_info.relowner AS owner                                                                           
  16 FROM pg_class_info                                                                                            
  17 LEFT JOIN pg_namespace                                                                                        
  18     ON pg_class_info.relnamespace = pg_namespace.oid                                                          
  19 WHERE                                                                                                         
  20     pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')                                     
  21     AND pg_namespace.nspname NOT LIKE 'pg_%'                                                                  
  22 LIMIT 100;                                                                                                    
  23                                                                                                               
[46]:
shape: (23, 8)
table_idschema_nametable_namereldiststylereleffectivediststylediststylecreation_timeowner
i64strstri64i64strdatetime[μs]i64
135770"public""artist_pkey"00"EVEN"null102
135774"public""album_pkey"00"EVEN"null102
135782"public""genre_pkey"00"EVEN"null102
135786"public""mediatype_pkey"00"EVEN"null102
135790"public""track_pkey"00"EVEN"null102
135772"public""album"11"KEY"2025-06-16 20:18:56.956046102
135784"public""mediatype"88"ALL"2025-06-16 20:18:57.083479102
135780"public""genre"88"ALL"2025-06-16 20:18:57.024100102
135806"public""playlist_pkey"00"EVEN"null102
135810"public""playlisttrack_pkey"00"EVEN"null102
135830"public""customer_pkey"00"EVEN"null102
135820"public""employee"88"ALL"2025-06-16 20:18:57.340479102
135838"public""invoice_pkey"00"EVEN"null102
135828"public""customer"11"KEY"2025-06-16 20:18:57.409248102
135846"public""invoiceline_pkey"00"EVEN"null102
135836"public""invoice"11"KEY"2025-06-16 20:18:57.477920102
135844"public""invoiceline"11"KEY"2025-06-16 20:18:57.536679102
135788"public""track"11"KEY"2025-06-16 20:18:57.143263102
135856"public""albumsalesstats"00"EVEN"null102
135768"public""artist"88"ALL"2025-06-16 20:18:56.862195102

这个 SQL 就是我们要的, 给出了所有的 Table, Distribution Key 信息, 和 Owner 信息

[47]:
sql = """
SELECT
    trim(pg_namespace.nspname) AS schema_name,
    trim(pg_class_info.relname) AS table_name,
    CASE WHEN pg_class_info.reldiststyle = 0 THEN 'EVEN'::text
        WHEN pg_class_info.reldiststyle = 1 THEN 'KEY'::text
        WHEN pg_class_info.reldiststyle = 8 THEN 'ALL'::text
        WHEN pg_class_info.releffectivediststyle = 10 THEN 'AUTO(ALL)'::text
        WHEN pg_class_info.releffectivediststyle = 11 THEN 'AUTO(EVEN)'::text
        WHEN pg_class_info.releffectivediststyle = 12 THEN 'AUTO(KEY)'::text ELSE '<<UNKNOWN>>'::text END AS diststyle,
    pg_user.usename AS owner_name,
    pg_description.description as description
FROM pg_class_info
LEFT JOIN pg_namespace
    ON pg_class_info.relnamespace = pg_namespace.oid
LEFT JOIN pg_description
    ON pg_namespace.oid = pg_description.objoid
LEFT JOIN pg_user
    ON pg_class_info.relowner = pg_user.usesysid
WHERE
    pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')
    AND pg_namespace.nspname NOT LIKE 'pg_%'
LIMIT 100;
"""
run_sql(sql)
   1                                                                                                               
   2 SELECT                                                                                                        
   3     trim(pg_namespace.nspname) AS schema_name,                                                                
   4     trim(pg_class_info.relname) AS table_name,                                                                
   5     CASE WHEN pg_class_info.reldiststyle = 0 THEN 'EVEN'::text                                                
   6         WHEN pg_class_info.reldiststyle = 1 THEN 'KEY'::text                                                  
   7         WHEN pg_class_info.reldiststyle = 8 THEN 'ALL'::text                                                  
   8         WHEN pg_class_info.releffectivediststyle = 10 THEN 'AUTO(ALL)'::text                                  
   9         WHEN pg_class_info.releffectivediststyle = 11 THEN 'AUTO(EVEN)'::text                                 
  10         WHEN pg_class_info.releffectivediststyle = 12 THEN 'AUTO(KEY)'::text ELSE '<<UNKNOWN>>'::text END AS d
  11     pg_user.usename AS owner_name,                                                                            
  12     pg_description.description as description                                                                 
  13 FROM pg_class_info                                                                                            
  14 LEFT JOIN pg_namespace                                                                                        
  15     ON pg_class_info.relnamespace = pg_namespace.oid                                                          
  16 LEFT JOIN pg_description                                                                                      
  17     ON pg_namespace.oid = pg_description.objoid                                                               
  18 LEFT JOIN pg_user                                                                                             
  19     ON pg_class_info.relowner = pg_user.usesysid                                                              
  20 WHERE                                                                                                         
  21     pg_namespace.nspname NOT IN ('information_schema', 'catalog_history')                                     
  22     AND pg_namespace.nspname NOT LIKE 'pg_%'                                                                  
  23 LIMIT 100;                                                                                                    
  24                                                                                                               
[47]:
shape: (23, 5)
schema_nametable_namediststyleowner_namedescription
strstrstrstrstr
"public""artist_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""album_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""genre_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""mediatype_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""track_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""album""KEY""IAM:sanhe""Standard public schema"
"public""mediatype""ALL""IAM:sanhe""Standard public schema"
"public""genre""ALL""IAM:sanhe""Standard public schema"
"public""playlist_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""playlisttrack_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""customer_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""employee""ALL""IAM:sanhe""Standard public schema"
"public""invoice_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""customer""KEY""IAM:sanhe""Standard public schema"
"public""invoiceline_pkey""EVEN""IAM:sanhe""Standard public schema"
"public""invoice""KEY""IAM:sanhe""Standard public schema"
"public""invoiceline""KEY""IAM:sanhe""Standard public schema"
"public""track""KEY""IAM:sanhe""Standard public schema"
"public""albumsalesstats""EVEN""IAM:sanhe""Standard public schema"
"public""artist""ALL""IAM:sanhe""Standard public schema"

pg_table_def

pg_table_def 是 Redshift 中最重要的元数据表之一, 包含了所有表和列的详细定义信息. 重要特征:

  • 来源: Redshift 专有 (非 PostgreSQL 原生)

  • 用户友好性: 是, 只显示当前用户有权限访问的表

  • 数据粒度: 表中每一行代表一个 Column 的详细信息

这个表是我们进行列级元数据提取的核心数据源, 下面这个 Query 就已经够用了.

[48]:
sql = """
SELECT
    td.schemaname AS schema_name,
    td.tablename AS table_name,
    td.column AS column_name,
    td.type AS column_type,
    td.encoding AS column_encoding,
    td.distkey AS is_column_a_distkey,
    td.sortkey AS sortkey_position,
    td.notnull AS is_column_notnull
FROM pg_table_def td
WHERE td.schemaname NOT IN ('pg_catalog', 'information_schema')
LIMIT 20;
"""
run_sql(sql)
   1                                                                                                               
   2 SELECT                                                                                                        
   3     td.schemaname AS schema_name,                                                                             
   4     td.tablename AS table_name,                                                                               
   5     td.column AS column_name,                                                                                 
   6     td.type AS column_type,                                                                                   
   7     td.encoding AS column_encoding,                                                                           
   8     td.distkey AS is_column_a_distkey,                                                                        
   9     td.sortkey AS sortkey_position,                                                                           
  10     td.notnull AS is_column_notnull                                                                           
  11 FROM pg_table_def td                                                                                          
  12 WHERE td.schemaname NOT IN ('pg_catalog', 'information_schema')                                               
  13 LIMIT 20;                                                                                                     
  14                                                                                                               
[48]:
shape: (20, 8)
schema_nametable_namecolumn_namecolumn_typecolumn_encodingis_column_a_distkeysortkey_positionis_column_notnull
strstrstrstrstrbooli64bool
"public""album""albumid""integer""none"false1true
"public""album""title""character varying(255)""lzo"false0true
"public""album""artistid""integer""none"true2true
"public""albumsalesstats""albumid""integer""none"false0false
"public""albumsalesstats""albumtitle""character varying(255)""none"false0false
"public""albumsalesstats""artistname""character varying(255)""none"false0false
"public""albumsalesstats""totalsales""integer""none"false0false
"public""albumsalesstats""totalquantity""integer""none"false0false
"public""albumsalesstats""totalrevenue""numeric(10,2)""none"false0false
"public""albumsalesstats""avgtrackprice""numeric(10,2)""none"false0false
"public""albumsalesstats""tracksinalbum""integer""none"false0false
"public""artist""artistid""integer""none"false1true
"public""artist""name""character varying(255)""lzo"false0false
"public""customer""customerid""integer""none"true1true
"public""customer""firstname""character varying(255)""lzo"false0true
"public""customer""lastname""character varying(255)""lzo"false0true
"public""customer""company""character varying(255)""lzo"false0false
"public""customer""address""character varying(500)""lzo"false0false
"public""customer""city""character varying(100)""none"false3false
"public""customer""state""character varying(100)""lzo"false0false

Summary

通过探索这些关键的系统表, 我们现在掌握了从 Redshift 中提取完整元数据的方法:

  • Schema 层级: 使用 pg_namespace 获取所有用户定义的 Schema

  • Table 层级: 使用 pg_class_info 结合 pg_namespace 获取表的完整信息

  • Column 层级: 使用 pg_table_def 获取列的详细定义

这些查询的结果可以被结构化处理后提供给 LLM, 帮助生成更准确和上下文相关的 SQL 语句. 在后续的实现中, 我们将基于这些基础查询构建完整的元数据提取流程.

[ ]: