加载数据

SSB 介绍

SSB(Star Schema Benchmark 下载地址: https://github.com/Kyligence/ssb-kylin )是针对OLAP星型模型的开源压测工具。

测试的过程会创建5张表,表的数据量可以根据参数调整。SSB 的表结构如下图所示:

img

其中 lineorder 为事实表(fact table),其他四张为维度表(dimension table),每张维度表通过 primary key 与事实表关联,这是标准的星型结构。数据仓库系统通常使用星形结构

加载测试集

创建schema:

create schema ssb;
set search_path = ssb, public;  # 这样下面语句创建的物化视图会在ssb schema下面

创建没有指定sort keys, distribution styles, compression encodings的表:

CREATE TABLE part 
(
  p_partkey INTEGER NOT NULL,
  p_name VARCHAR(22) NOT NULL,
  p_mfgr VARCHAR(6) NOT NULL,
  p_category VARCHAR(7) NOT NULL,
  p_brand1 VARCHAR(9) NOT NULL,
  p_color VARCHAR(11) NOT NULL,
  p_type VARCHAR(25) NOT NULL,
  p_size INTEGER NOT NULL,
  p_container VARCHAR(10) NOT NULL
);

CREATE TABLE supplier 
(
  s_suppkey INTEGER NOT NULL,
  s_name VARCHAR(25) NOT NULL,
  s_address VARCHAR(25) NOT NULL,
  s_city VARCHAR(10) NOT NULL,
  s_nation VARCHAR(15) NOT NULL,
  s_region VARCHAR(12) NOT NULL,
  s_phone VARCHAR(15) NOT NULL
);

CREATE TABLE customer 
(
  c_custkey INTEGER NOT NULL,
  c_name VARCHAR(25) NOT NULL,
  c_address VARCHAR(25) NOT NULL,
  c_city VARCHAR(10) NOT NULL,
  c_nation VARCHAR(15) NOT NULL,
  c_region VARCHAR(12) NOT NULL,
  c_phone VARCHAR(15) NOT NULL,
  c_mktsegment VARCHAR(10) NOT NULL
);

CREATE TABLE dwdate 
(
  d_datekey INTEGER NOT NULL,
  d_date VARCHAR(19) NOT NULL,
  d_dayofweek VARCHAR(10) NOT NULL,
  d_month VARCHAR(10) NOT NULL,
  d_year INTEGER NOT NULL,
  d_yearmonthnum INTEGER NOT NULL,
  d_yearmonth VARCHAR(8) NOT NULL,
  d_daynuminweek INTEGER NOT NULL,
  d_daynuminmonth INTEGER NOT NULL,
  d_daynuminyear INTEGER NOT NULL,
  d_monthnuminyear INTEGER NOT NULL,
  d_weeknuminyear INTEGER NOT NULL,
  d_sellingseason VARCHAR(13) NOT NULL,
  d_lastdayinweekfl VARCHAR(1) NOT NULL,
  d_lastdayinmonthfl VARCHAR(1) NOT NULL,
  d_holidayfl VARCHAR(1) NOT NULL,
  d_weekdayfl VARCHAR(1) NOT NULL
);

CREATE TABLE lineorder 
(
  lo_orderkey INTEGER NOT NULL,
  lo_linenumber INTEGER NOT NULL,
  lo_custkey INTEGER NOT NULL,
  lo_partkey INTEGER NOT NULL,
  lo_suppkey INTEGER NOT NULL,
  lo_orderdate INTEGER NOT NULL,
  lo_orderpriority VARCHAR(15) NOT NULL,
  lo_shippriority VARCHAR(1) NOT NULL,
  lo_quantity INTEGER NOT NULL,
  lo_extendedprice INTEGER NOT NULL,
  lo_ordertotalprice INTEGER NOT NULL,
  lo_discount INTEGER NOT NULL,
  lo_revenue INTEGER NOT NULL,
  lo_supplycost INTEGER NOT NULL,
  lo_tax INTEGER NOT NULL,
  lo_commitdate INTEGER NOT NULL,
  lo_shipmode VARCHAR(10) NOT NULL
  );

加载SSB测试数据:

copy customer from 's3://awssampledbuswest2/ssbgz/customer' 
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3' 
gzip compupdate off region 'us-west-2';

copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate' 
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3' 
gzip compupdate off region 'us-west-2';

copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder' 
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3'
gzip compupdate off region 'us-west-2';

copy part from 's3://awssampledbuswest2/ssbgz/part' 
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3'
gzip compupdate off region 'us-west-2';

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier' 
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3'
gzip compupdate off region 'us-west-2';

1.根据实际情况,将iam_roleregion做替换

2.上面的命令中,COMPUPDATE=OFF, 这样在导入数据时不会开启自动压缩
参考: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html#copy-compupdate

小表导入过程大概要加载几秒,最大的表lineorder耗时最长,花了15多分钟:

image-20221104201706428

查询每张表的条数:

select count(*) from LINEORDER union
select count(*) from PART union
select count(*) from CUSTOMER union
select count(*) from SUPPLIER union
select count(*) from DWDATE;

结果:

image-20221104201945495