在本节,我们将学习如何使用 SUPER 数据类型将半结构化 JSON 数据加载到 Redshift 中的单个列中
我们的JSON数据源在以下 s3 位置:
在Redshift中创建一个表来存储这些数据:
drop table if exists transaction;
create table transaction (
data_json super
);
使用 COPY 语句加载数据。请注意noshred选项,该选项将数据加载到一个字段中:
copy transaction
from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'noshred';
可以看到成功的load了5条数据:
验证数据已加载:
编写一个提取客户相关属性的查询,这些字段位于 JSON 结构的顶层:
select data_json.c_custkey, data_json.c_phone, data_json.c_acctbal from transaction;
我们看到c_phone
字段有双引号,这是因为 Redshift 在从 SUPER 字段查询时不强制类型,从而确保如果数据类型随时间变化,加载过程不会失败。请参阅文档以了解有关动态类型的更多信息
。
现在我们对提取的字段进行类型转换,以便它们正确显示:
select data_json.c_custkey::int, data_json.c_phone::varchar, data_json.c_acctbal::decimal(18,2) from transaction;
上面的JSON 文档包含一个c_orders字段,这是另一个复杂的数据结构,c_orders结构包含另一个嵌套字段o_lineitems。
编写一个解数据嵌套(data unnest)
的查询, 确定订单和订单项的数量。要解除数据嵌套, 可以利用PartiQL
句法:
select count(1) from transaction t, t.data_json.c_orders o; # 结果是72条记录
select count(1) from transaction t, t.data_json.c_orders o, o.o_lineitems l; # 结果是251条记录
要获取customer, order
和lineitem
详细信息的完整数据集,可以执行以下查询,从嵌套层次结构中的每个级别提取数据:
select data_json.c_custkey::int, data_json.c_phone::varchar, data_json.c_acctbal::decimal(18,2), o.o_orderstatus::varchar, l.l_shipmode::varchar, l.l_extendedprice::decimal(18,2)
from transaction t, t.data_json.c_orders o, o.o_lineitems l; -- 此查询返回251条记录
之前执行Load时,我们指定了一个参数noshred
, 将数据加载到了一个字段中。
在某些场景中,顶级JSON数据不会经常更改, 而且在ETL过程中不会新增新的顶级字段。在这些情况下,可以使用ignorecase 和auto选项来加载JSON数据。
创建表时,我们包含JSON结构的第一层字段:
drop table if exists transaction_shred;
create table transaction_shred (
c_custkey bigint,
c_phone varchar(20),
c_acctbal decimal(18,2),
c_orders super
);
现在,使用 COPY 语句再次加载数据。这次使用auto选项,这将导致数据加载到与 JSON 对象中的列名称匹配的字段中。列映射将使用基于ignorecase关键字的不区分大小写的匹配:
copy transaction_shred from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'auto ignorecase';
获取customer, order
和lineitem
,但这一次,JSON结构的第一级已被提取出来:
select c_custkey, c_phone, c_acctbal, o.o_orderstatus::varchar, l.l_shipmode::varchar, l.l_extendedprice::decimal(18,2)
from transaction_shred t, t.c_orders o, o.o_lineitems l;
使用JSON_PARSE函数可以将 JSON 数据插入或更新到 SUPER 列中,此函数将数据解析为 JSON 格式并将其转换为 SUPER 数据类型,可以在 INSERT 或 UPDATE 语句中使用该数据类型。
如果查询(
insert / update
)中缺少JSON_PARSE
函数,Redshift 会将该值视为单个字符串,而不是解析成JSON 格式的字符串。
在transaction_shred表中插入新的客户订单:
INSERT INTO transaction_shred VALUES
(1234,
'800-867-5309',
441989.88,
JSON_PARSE(
'[{
"o_orderstatus":"F",
"o_clerk":"Clerk#0000001991",
"o_lineitems":[
{
"l_returnflag":"R",
"l_receiptdate":"2017-07-23",
"l_tax":0.03,
"l_shipmode":"TRUCK",
"l_suppkey":4799,
"l_shipdate":"2014-06-24",
"l_commitdate":"2014-06-05",
"l_partkey":54798,
"l_quantity":4,
"l_linestatus":"F",
"l_comment":"Net new order for new customer",
"l_extendedprice":28007.64,
"l_linenumber":1,
"l_discount":0.02,
"l_shipinstruct":"TAKE BACK RETURN"
}
],
"o_orderdate":"2014-06-01",
"o_shippriority":0,
"o_totalprice":28308.25,
"o_orderkey":1953697,
"o_comment":"wing for 997.1 gt3",
"o_orderpriority":"5-LOW"
}],'));
由于JSON_PARSE函数检测到格式错误的 JSON 文档,插入错误:
删除 JSON 末尾不需要的逗号,然后尝试再次插入数据。
INSERT INTO transaction_shred VALUES
(1234,
'800-867-5309',
441989.88,
JSON_PARSE( '[{
"o_orderstatus":"F",
"o_clerk":"Clerk#0000001991",
"o_lineitems":[
{
"l_returnflag":"R",
"l_receiptdate":"2017-07-23",
"l_tax":0.03,
"l_shipmode":"TRUCK",
"l_suppkey":4799,
"l_shipdate":"2014-06-24",
"l_commitdate":"2014-06-05",
"l_partkey":54798,
"l_quantity":4,
"l_linestatus":"F",
"l_comment":"Net new order for new customer",
"l_extendedprice":28007.64,
"l_linenumber":1,
"l_discount":0.02,
"l_shipinstruct":"TAKE BACK RETURN"
}
],
"o_orderdate":"2014-06-01",
"o_shippriority":0,
"o_totalprice":28308.25,
"o_orderkey":1953697,
"o_comment":"wing for 997.1 gt3",
"o_orderpriority":"5-LOW"
}]'));
插入成功。
接下来,更新刚刚插入的行以添加订单的新行:
update transaction_shred
set c_orders =
JSON_PARSE( '[{
"o_orderstatus":"F",
"o_clerk":"Clerk#0000001991",
"o_lineitems":[
{
"l_returnflag":"R",
"l_receiptdate":"2017-07-23",
"l_tax":0.03,
"l_shipmode":"TRUCK",
"l_suppkey":4799,
"l_shipdate":"2014-06-24",
"l_commitdate":"2014-06-05",
"l_partkey":54798,
"l_quantity":4,
"l_linestatus":"F",
"l_comment":"Net new order for new customer",
"l_extendedprice":28007.64,
"l_linenumber":1,
"l_discount":0.02,
"l_shipinstruct":"TAKE BACK RETURN"
},
{
"l_returnflag":"R",
"l_receiptdate":"2017-07-23",
"l_tax":0.03,
"l_shipmode":"TRUCK",
"l_suppkey":4799,
"l_shipdate":"2014-06-24",
"l_commitdate":"2014-06-05",
"l_partkey":54798,
"l_quantity":4,
"l_linestatus":"F",
"l_comment":"Net new order2 for new customer",
"l_extendedprice":28007.64,
"l_linenumber":2,
"l_discount":0.02,
"l_shipinstruct":"TAKE BACK RETURN"
}
],
"o_orderdate":"2014-06-01",
"o_shippriority":0,
"o_totalprice":28308.25,
"o_orderkey":1953697,
"o_comment":"wing for 997.1 gt3",
"o_orderpriority":"5-LOW"
}]')
where c_custkey = 1234;
最后,让我们获得customer, order
和lineitem
的详细信息:
select c_custkey, c_phone, c_acctbal, o.o_orderstatus::varchar, l.l_shipmode::varchar, l.l_extendedprice::decimal(18,2), l.l_linenumber::int
from transaction_shred t, t.c_orders o, o.o_lineitems l
order by c_custkey;