Redshift > 实验 - Redshift表优化 > 测试优化后性能

测试优化后性能

本节我们将从三个维度测试集群优化后的性能：

从S3导入redshift加载时间，上一节已完成
几张表的磁盘占用空间
查询语句性能

我们将把测试的数据继续记录到上一节的表格

磁盘占用存储空间统计

select stv_tbl_perm.name as "table", count(*) as "blocks (mb)"
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')
group by stv_tbl_perm.name
order by 1 asc;

结果：

Benchmark	优化前	优化后
Load time (lineorder)	15m 23s	19m 27s
Storage Use(MB)
LINEORDER	25136	25839
PART	88	140
CUSTOMER	192	382
DWDATE	40	80
SUPPLIER	76	142
Total storage	25532	26583
Query execution time(ms)
Query 1	1850
Query 2	4642
Query 3	2891
Total execution time	9283

检查数据的分布情况。如果数据分布不均匀，某些节点工作负载就比其他的高，造成系统瓶颈。

执行：

select trim(name) as table, slice, sum(num_values) as rows, min(minvalue), max(maxvalue)
from svv_diskusage
where name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder') 
and col =0
group by name, slice
order by name, slice;

结果：

由于有三张表使用了ALL分布，所以只会在每个node的第一个slice上分布。而lineorder使用了KEY分布，所以分布在所有的slice上。分布看起来比较均匀，不用进行调整。

测试查询性能

执行和前面同样的查询：

-- Query 1
-- Restrictions on only one dimension. 
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997 
and lo_discount between 1 and 3 
and lo_quantity < 24;

-- Query 2
-- Restrictions on two dimensions 

select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;

-- Query 3
-- Drill down in time to just one month 

select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;

结果如下：

Benchmark	优化前	优化后
Load time (lineorder)	15m 23s	19m 27s
Storage Use(MB)
LINEORDER	25136	25839
PART	88	140
CUSTOMER	192	382
DWDATE	40	80
SUPPLIER	76	142
Total storage	25532	26583
Query execution time(ms)
Query 1	1850	325
Query 2	4642	5534
Query 3	2891	255
Total execution time	9283	6114

结果评估

我们在上面的表格中，对导入时间、存储空间、查询时间进行了优化前后的比较

导入时间增加了4分钟，这是因为sorting, compression和distribution都会增加加载时间。
存储空间不降反升，因为有些表启动了ALL distribution，会占用额外磁盘。另外没有数据分布不均匀的情况。
查询时间大大降低，通过sort keys, distribution, compression的优化，查询时间减少了。

总结

sort key的最佳实践：

对查询的列应用sort key(range 、filter、equal查询)
对join的列应用sort key

distribution最佳实践：

将 fact table 和一个 dimension table 的公共列使用KEY distribution
将剩下的dimension table使用ALL distribution。

使用automatic compression一般可以收到最好的压缩效果。