
Loading Data in Parallel

Redshift is not an OLTP database, so avoid loading data with INSERT statements. Use the COPY command to load data instead.

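How COPY behaves when loading data

When COPY loads multiple compressed files, Redshift loads them in parallel.

When COPY loads a single large compressed file, Redshift can only load it serially, which is much slower.

When COPY loads a single large uncompressed file, Redshift automatically splits it into parts and loads them in parallel across the slices of each node.

For this reason, when you need to load a large compressed file, split it into multiple smaller files (1 MB to 1 GB each) before loading.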
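The splitting step can be sketched in Python as follows. This is a minimal illustration rather than an official tool: the function name `split_into_gzip_parts`, the `.partNNNN.gz` naming scheme, and the choice of gzip are all assumptions made for this example. It assumes a line-oriented text file, cuts only on line boundaries so no record is split across parts, and measures sizes before compression, so choose the part count so that each resulting file falls within the recommended range.

```python
import gzip
from pathlib import Path
from typing import List


def split_into_gzip_parts(src: Path, out_dir: Path, parts: int) -> List[Path]:
    """Split a line-oriented text file into at most `parts` gzip files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Target size per part in uncompressed bytes (rounded up, at least 1).
    target = max(1, -(-src.stat().st_size // parts))
    written: List[Path] = []
    out = None
    size = 0
    with src.open("rb") as fin:
        for line in fin:
            # Start a new part only between lines, and never exceed `parts` files;
            # any remainder goes into the last part.
            if out is None or (size >= target and len(written) < parts):
                if out is not None:
                    out.close()
                path = out_dir / f"{src.name}.part{len(written):04d}.gz"
                out = gzip.open(path, "wb")
                written.append(path)
                size = 0
            out.write(line)
            size += len(line)
    if out is not None:
        out.close()
    return written
```

The resulting files can then be uploaded under a common S3 key prefix so that a single COPY command picks them all up.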

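Loading with a manifest

When loading multiple files in parallel, one option is to specify an S3 folder path (key prefix). The other is to use a manifest, a file that lists the files to load.

A manifest has the following format: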

{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}

The mandatory field is optional and defaults to false. When it is true, COPY fails if the listed file does not exist.
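A manifest in this format can be generated rather than written by hand. The following is a minimal Python sketch; the helper name `build_manifest` is hypothetical, and the URLs are taken from the sample above. It only produces the JSON text, and uploading that text to S3 is left to whatever tool you already use.

```python
import json
from typing import Iterable


def build_manifest(urls: Iterable[str], mandatory: bool = True) -> str:
    """Return manifest JSON text with one entry per S3 object URL."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


# Example: list one file from each of two buckets.
manifest_text = build_manifest([
    "s3://mybucket-alpha/2013-10-04-custdata",
    "s3://mybucket-beta/2013-10-05-custdata",
])
print(manifest_text)
```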

A manifest is a good fit for the following scenarios:

Loading only a specific set of files
Loading files from different buckets or from different prefixes
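The two ways of pointing COPY at multiple files differ only in the source path and the MANIFEST keyword. The sketch below builds both statements as strings. The helper `copy_statement`, the table name `customer`, the manifest key `custdata.manifest`, and the IAM role ARN are hypothetical placeholders, and the GZIP option assumes the files are gzip-compressed. Plain string formatting is used for illustration only and is not safe for untrusted input.

```python
def copy_statement(table: str, source: str, iam_role: str, manifest: bool = False) -> str:
    """Build a COPY statement for gzip files under a key prefix or listed in a manifest."""
    options = ["GZIP"]
    if manifest:
        # MANIFEST tells COPY that `source` is a manifest file, not a key prefix.
        options.append("MANIFEST")
    return (
        f"COPY {table}\n"
        f"FROM '{source}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        + " ".join(options) + ";"
    )


ROLE = "arn:aws:iam::123456789012:role/MyRedshiftRole"  # placeholder ARN

# Load every object whose key starts with the given prefix.
by_prefix = copy_statement("customer", "s3://mybucket-alpha/2013-10", ROLE)

# Load exactly the objects listed in the manifest, possibly from several buckets.
by_manifest = copy_statement(
    "customer", "s3://mybucket-alpha/custdata.manifest", ROLE, manifest=True
)

print(by_prefix)
print(by_manifest)
```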