测试优化后性能

本节我们将从三个维度测试集群优化后的性能:

  1. 从S3导入redshift加载时间,上一节已完成
  2. 几张表的磁盘占用空间
  3. 查询语句性能

我们将把测试的数据继续记录到上一节的表格

磁盘占用存储空间统计

select stv_tbl_perm.name as "table", count(*) as "blocks (mb)"
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')
group by stv_tbl_perm.name
order by 1 asc;

结果:

image-20221104205738967

Benchmark 优化前 优化后
Load time (lineorder) 15m 23s 19m 27s
Storage Use(MB)
LINEORDER 25136 25839
PART 88 140
CUSTOMER 192 382
DWDATE 40 80
SUPPLIER 76 142
Total storage 25532 26583
Query execution time(ms)
Query 1 1850
Query 2 4642
Query 3 2891
Total execution time 9283

检查数据的分布情况。如果数据分布不均匀,某些节点工作负载就比其他的高,造成系统瓶颈。

执行:

select trim(name) as table, slice, sum(num_values) as rows, min(minvalue), max(maxvalue)
from svv_diskusage
where name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder') 
and col =0
group by name, slice
order by name, slice;

结果:

image-20221104205913454

由于有三张表使用了ALL分布,所以只会在每个node的第一个slice上分布。而lineorder使用了KEY分布,所以分布在所有的slice上。分布看起来比较均匀,不用进行调整。

测试查询性能

执行和前面同样的查询:

-- Query 1
-- Restrictions on only one dimension. 
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997 
and lo_discount between 1 and 3 
and lo_quantity < 24;

-- Query 2
-- Restrictions on two dimensions 

select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;

-- Query 3
-- Drill down in time to just one month 

select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;

结果如下:

Benchmark 优化前 优化后
Load time (lineorder) 15m 23s 19m 27s
Storage Use(MB)
LINEORDER 25136 25839
PART 88 140
CUSTOMER 192 382
DWDATE 40 80
SUPPLIER 76 142
Total storage 25532 26583
Query execution time(ms)
Query 1 1850 325
Query 2 4642 5534
Query 3 2891 255
Total execution time 9283 6114

结果评估

我们在上面的表格中,对导入时间存储空间查询时间进行了优化前后的比较

  1. 导入时间增加了4分钟,这是因为sorting, compressiondistribution都会增加加载时间。

  2. 存储空间不降反升,因为有些表启动了ALL distribution,会占用额外磁盘。另外没有数据分布不均匀的情况。

  3. 查询时间大大降低,通过sort keys, distribution, compression的优化,查询时间减少了。

总结

  1. sort key的最佳实践:
  • 对查询的列应用sort key(range 、filter、equal查询)

  • 对join的列应用sort key

  1. distribution最佳实践:
  • fact table 和一个 dimension table 的公共列使用KEY distribution
  • 将剩下的dimension table使用ALL distribution
  1. 使用automatic compression一般可以收到最好的压缩效果。