Creating AWS external tables
You can query the data in your AWS S3 files by creating an external table for Redshift Spectrum, along with a partition update strategy, which then lets you query the data as you would any other Redshift table.
To create the table and describe the external schema, referencing the columns and the location of your S3 files, I usually run DDL statements in AWS Athena. If files are added on a daily basis, use a date string as your partition.
1. Create the table, with the schema indicated via DDL:
CREATE EXTERNAL TABLE `<external table name>`(
  `id` string,
  `user_id` string,
  `time` bigint,
  `app_id` string,
  `campaign_id` string,
  `campaign_name` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<file_path>'
TBLPROPERTIES (
  'parquet.compress'='SNAPPY')
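To actually query this table from Redshift, Spectrum also needs an external schema in Redshift that points at the Athena/Glue catalog database. A minimal sketch, assuming a hypothetical catalog database named `analytics` and a placeholder IAM role ARN (substitute your own values):

```sql
-- Run in Redshift: expose the Athena/Glue database as an external schema.
-- 'analytics' and the IAM role ARN below are placeholders, not real values.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::<account_id>:role/<spectrum_role>';
```

After this, the external table shows up in Redshift as `spectrum.<table>`.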
2. Alter your table daily to add new partitions by date. You can use Athena to run the following:
ALTER TABLE {database}.{table} ADD IF NOT EXISTS PARTITION (dt='{date}');
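If your S3 prefixes are not laid out Hive-style (`.../dt=<date>/`), the new partition also needs an explicit `LOCATION` so Athena knows which prefix it maps to. A sketch with hypothetical database, table, and bucket names:

```sql
-- Register one day's partition, pointing it at its S3 prefix explicitly.
-- 'analytics', 'events', and the bucket path are placeholder names.
ALTER TABLE analytics.events ADD IF NOT EXISTS
  PARTITION (dt='2021-06-01')
  LOCATION 's3://<bucket>/<table>/dt=2021-06-01/';
```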
3. If you need to do an initial bulk load, you can right-click the table options in the Athena UI and choose Load partitions. It will run a statement like this:
MSCK REPAIR TABLE <table>;
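With partitions registered, you can query the table from Redshift (through the external schema) or from Athena, filtering on the partition column so only the relevant S3 prefixes are scanned. A sketch using hypothetical schema and table names:

```sql
-- Filtering on the partition column (dt) prunes which S3 prefixes are read.
-- 'spectrum' and 'events' are placeholder names for your own schema/table.
SELECT campaign_name, COUNT(*) AS events
FROM spectrum.events
WHERE dt = '2021-06-01'
GROUP BY campaign_name;
```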
A gotcha I ran into is that the S3 path in the DDL statement is case sensitive, where LOCATION is indicated:
LOCATION
's3://<bucket>/<table>/this.file.path.is.case.sensitive'
Another error I ran into was syntax related. For DDL statements, make sure you are using backticks to enclose your table and column names. I am referencing this section:
CREATE EXTERNAL TABLE `<external table name>`(
  `id` string,
  `user_id` string,
  `time` bigint,
  `app_id` string,
  `campaign_id` string,
  `campaign_name` string)
PARTITIONED BY (
  `dt` string)
If you use quotes instead, you may get an error that reads:
line 1:8: no viable alternative at input 'create external' (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: 9c5b9120-5992-4329-8f6a-7ce9c6607e4c)
For external tables with schemas that can change, you can additionally use AWS Glue to crawl your data and detect new fields.