Redshift Spectrum

上一节我们涉及到了电影数据的两个表,本节我们将进行4个表的Join查询,前三张表将导入Redshift,最后一张title.principals存放在S3,使用Redshift Spectrum查询:

image-20221028025432082

中间两份数据在前面已经下载并上传到S3, 我们只需要下载第一份和最后一份数据:

wget https://datasets.imdbws.com/name.basics.tsv.gz
wget https://datasets.imdbws.com/title.principals.tsv.gz

将第一份数据上传到之前创建的S3, 跟中间两份数据放在一起:

aws s3 cp name.basics.tsv.gz s3://redshift-kpf

将最后一份数据单独上传到一个S3, 用于后面使用Redshift Spectrum来查询它:

aws s3 mb s3://redshift-kpf-spectrum
aws s3 cp title.principals.tsv.gz s3://redshift-kpf-spectrum/principles/    # Redshift Spectrum查询外部表时,必须要是S3的一个子目录

image-20221028031950409

给Redshift加权限

由于Redshift Spectrum要使用Glue Data CatalogAthena Data Catalog,我们先在之前的IAM Role上增加对这两个服务的访问:

image-20221028032240939

添加完成后的效果:

image-20221028032357662

创建表定义

上一节已经创建好了titletitle_basics表定义, 我们再把name_basicsprinciples表定义创建好:

CREATE TABLE imdb.name_basics(
  nconst VARCHAR (20),
  primaryName VARCHAR (200),
  birthYear VARCHAR (4),
  deathYear VARCHAR (4),
  primaryProfession VARCHAR (5000),
  knownForTitles VARCHAR (5000)
);

# 创建外部数据库 - All external tables must be created in an external schema, which you create using a CREATE EXTERNAL SCHEMA statement.  在执行这条语句时,需要有glue的访问权限
CREATE EXTERNAL SCHEMA spectrum_schema FROM data catalog 
database 'spectrum_db' 
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3'
create external database if not exists;  

此时查看Glue的数据库,发现spectrum_db被创建:

image-20221028032518682

创建外部表:

CREATE EXTERNAL TABLE spectrum_schema.principles(
  tconst VARCHAR (20),
  ordering BIGINT,
  nconst VARCHAR (20),
  category VARCHAR (500),
  job VARCHAR (500),
  characters VARCHAR(5000)
)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://redshift-kpf-spectrum/principles';

查看Glue的数据表,发现principles:

image-20221028032809373

导入数据并进行查询

name_basics需要从S3导入:

COPY imdb.name_basics
FROM 's3://redshift-kpf/name.basics.tsv.gz'
IGNOREHEADER 1
ACCEPTINVCHARS
DELIMITER '\t'
GZIP
iam_role 'arn:aws:iam::145197526627:role/redshift-access-s3';

principles数据可以直接从spectrum里查询:

image-20221028033504652

使用Redshift Spectrum,执行join查询:

select
  title.title,
  title.language,
  title_basics.primarytitle,
  title_basics.startyear,
  title_basics.genres,
  principles.nconst,
  name_basics.primaryname,
  name_basics.primaryprofession
from imdb.title
  join imdb.title_basics on title.titleid = title_basics.tconst
  join spectrum_schema.principles on title.titleid = principles.tconst
  join imdb.name_basics on name_basics.nconst = principles.nconst
where title.language = 'en'
and title_basics.startyear is not null
order by title_basics.startyear desc
limit 500;

执行结果如下:

image-20221028033845050