本节我们将从三个维度测试集群优化后的性能:
我们将把测试的数据继续记录到上一节的表格
select stv_tbl_perm.name as "table", count(*) as "blocks (mb)"
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')
group by stv_tbl_perm.name
order by 1 asc;
结果:
Benchmark | 优化前 | 优化后 |
---|---|---|
Load time (lineorder) | 15m 23s | 19m 27s |
Storage Use(MB) | ||
LINEORDER | 25136 | 25839 |
PART | 88 | 140 |
CUSTOMER | 192 | 382 |
DWDATE | 40 | 80 |
SUPPLIER | 76 | 142 |
Total storage | 25532 | 26583 |
Query execution time(ms) | ||
Query 1 | 1850 | |
Query 2 | 4642 | |
Query 3 | 2891 | |
Total execution time | 9283 |
检查数据的分布情况。如果数据分布不均匀,某些节点工作负载就比其他的高,造成系统瓶颈。
执行:
select trim(name) as table, slice, sum(num_values) as rows, min(minvalue), max(maxvalue)
from svv_diskusage
where name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')
and col =0
group by name, slice
order by name, slice;
结果:
由于有三张表使用了ALL分布,所以只会在每个node的第一个slice上分布。而lineorder使用了KEY分布,所以分布在所有的slice上。分布看起来比较均匀,不用进行调整。
执行和前面同样的查询:
-- Query 1
-- Restrictions on only one dimension.
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997
and lo_discount between 1 and 3
and lo_quantity < 24;
-- Query 2
-- Restrictions on two dimensions
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;
-- Query 3
-- Drill down in time to just one month
select c_city, s_city, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;
结果如下:
Benchmark | 优化前 | 优化后 |
---|---|---|
Load time (lineorder) | 15m 23s | 19m 27s |
Storage Use(MB) | ||
LINEORDER | 25136 | 25839 |
PART | 88 | 140 |
CUSTOMER | 192 | 382 |
DWDATE | 40 | 80 |
SUPPLIER | 76 | 142 |
Total storage | 25532 | 26583 |
Query execution time(ms) | ||
Query 1 | 1850 | 325 |
Query 2 | 4642 | 5534 |
Query 3 | 2891 | 255 |
Total execution time | 9283 | 6114 |
我们在上面的表格中,对导入时间
、存储空间
、查询时间
进行了优化前后的比较
导入时间增加了4分钟,这是因为sorting, compression
和distribution
都会增加加载时间。
存储空间不降反升,因为有些表启动了ALL distribution,会占用额外磁盘。另外没有数据分布不均匀的情况。
查询时间大大降低,通过sort keys, distribution, compression
的优化,查询时间减少了。
sort key
的最佳实践:对查询的列应用sort key
(range 、filter、equal查询)
对join的列应用sort key
fact table
和一个 dimension table
的公共列使用KEY distribution
dimension table
使用ALL distribution
。automatic compression
一般可以收到最好的压缩效果。