In my previous post on migrating a large amount of data I walked through the challenges we faced when using AWS Data Pipeline. Now that our Data Pipeline job is running weekly, we decided to copy the related data tables to Redshift as well. This came with its own challenges. We use the Incremental copy of RDS MySQL table to Redshift template supplied by AWS.
Duplicates
Unlike the now 600GB+ table, the related tables' schemas have IDs. This makes a difference: it's somewhat easier to deal with duplicate entries. You can define the ID key in the Data Pipeline job and it will overwrite existing values in Redshift.
Set the ID as the Redshift table distribution key. If the ID is not the distribution key, set it as one of the Redshift table sort keys. Then choose the proper insert mode: select OVERWRITE_EXISTING.
This should avoid the insertion of duplicates, as Redshift creates a staging table before inserting into the original table and then copies only the unique values into it.
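To make that concrete, here is a minimal sketch of what such a table definition could look like on the Redshift side. The table and column names (my_table, id, payload, updated_at) are purely hypothetical, not from our actual schema:

CREATE TABLE my_table (
  id BIGINT NOT NULL,       -- the ID defined in the Data Pipeline job
  payload VARCHAR(1024),
  updated_at TIMESTAMP
)
DISTKEY (id)                -- id as the distribution key
SORTKEY (id, updated_at);   -- or make id one of the sort keys instead

With the ID as the distribution (or sort) key, the OVERWRITE_EXISTING insert mode can match and replace existing rows instead of appending duplicates.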
Tables with No IDs
What if your table has no IDs though, and you are not sure whether you have duplicates? To check if duplicates exist, you can run the following on Redshift:
SELECT count(*)
FROM (
  SELECT ROW_NUMBER() OVER (PARTITION BY COL_1, COL_2, COL_N ORDER BY COL_1, COL_2, COL_N) AS r, t.*
  FROM TABLENAME t
) x
WHERE x.r > 1;
Replace COL_1 to COL_N with your columns and replace TABLENAME with your table name. It is a bit tedious to type all your column names; our table has 11. (Source)
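As an example, for a hypothetical three-column table called events (the table and column names below are made up for illustration), the check would look like this:

SELECT count(*)
FROM (
  SELECT ROW_NUMBER() OVER (
           PARTITION BY event_type, payload, created_at
           ORDER BY event_type, payload, created_at
         ) AS r,
         t.*
  FROM events t
) x
WHERE x.r > 1;

Any result greater than 0 means you have duplicate rows.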
Oh no, duplicates! What to do if you do have duplicates? Delete them, of course! Easier said than done though: you can't delete the duplicates in place. A temp table has to be created with unique rows only, and depending on your table size this can be a real issue. We've had to resize our cluster to accommodate removing duplicates. I've posted a feature request.
To remove the duplicates, run the following on Redshift:
begin;
lock table public.TABLENAME;
-- create a new table with the same columns (including data types) and sort keys
create table if not exists public."TABLENAME_mig" (
  COL_1,
  COL_2,
  COL_N
)
SORTKEY(COL_1, COL_2, COL_N);
alter table public."TABLENAME_mig" add primary key (COL_1);
-- copy only one row per set of duplicates into the new table
insert into public."TABLENAME_mig"
SELECT COL_1, COL_2, COL_N
FROM (
  SELECT ROW_NUMBER() OVER (PARTITION BY COL_1, COL_2, COL_N ORDER BY COL_1, COL_2, COL_N) AS r, t.*
  FROM TABLENAME t
) x
WHERE x.r = 1;
analyze public."TABLENAME_mig";
-- swap the deduplicated table in place of the original
drop table public.TABLENAME cascade;
alter table public."TABLENAME_mig" rename to TABLENAME;
commit;
vacuum delete only;
Replace COL_1 to COL_N with your columns and replace TABLENAME with your table name. (Source)
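Keep in mind that the columns in the create table statement need their data types as well. Using the same hypothetical events table from above, the filled-in create statement might look like this:

create table if not exists public."events_mig" (
  event_type VARCHAR(64),
  payload VARCHAR(1024),
  created_at TIMESTAMP
)
SORTKEY(event_type, payload, created_at);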
Character encodings
Another issue that we constantly encounter is importing the S3 file to Redshift. This does not always succeed, as the data, which is plain CSV, might have encoding issues. In our case it was null-byte characters. Have a look at how to prepare your input data.
The problem is that we don't have control over the options that dump the records to CSV; the table records have already been written to CSV and stored on S3. We can check the errors and evaluate what to do next.
Run the following to check for errors on Redshift:
select * from stl_load_errors order by starttime desc;
This produces output with the err_reason column containing something similar to:
Missing newline: Unexpected character 0x6f found at location 31
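Since stl_load_errors can get noisy over time, it can also help to select just the relevant columns; err_reason, filename, line_number and raw_line are all standard columns of that system table:

select starttime, filename, line_number, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 20;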
To resolve this, check the Activity Logs of the RDSToS3CopyActivity for the name of the CSV file on S3. You should see the S3 file name at the end of the Activity Logs; it usually starts with s3://.
Now that you have the S3 location, execute the following on Redshift:
COPY TABLENAME
FROM 's3://PATH/TO/S3FILE.csv'
CREDENTIALS 'aws_access_key_id=ACCESS_KEY;aws_secret_access_key=SECRET_KEY'
NULL AS '\0'
CSV;
Replace s3://PATH/TO/S3FILE.csv with the S3 location you got from the Activity Logs, replace TABLENAME with your table name, and add your access credentials in place of ACCESS_KEY and SECRET_KEY. (Source)
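Once the manual COPY has finished, a quick way to confirm that the file loaded cleanly is to check that no new entries for it appear in stl_load_errors. The filename filter below assumes the same S3 path used in the COPY above:

select count(*)
from stl_load_errors
where filename like '%S3FILE.csv%'
  and starttime > dateadd(hour, -1, getdate());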
Of course, you could have your application avoid saving NULL bytes in the first place, but this is not always possible.
Multiple NULL Bytes
In the rare case that you get a shitload of NULL bytes in your data, the solution above will not work, as the NULL AS option only seems to replace a single NULL byte. What I currently do is download the S3 file to an EC2 instance (much quicker on EC2), strip the NULL bytes, and upload the cleaned file back to S3.
# download the CSV from S3
aws s3 cp s3://S3FILE.csv .
# strip the NULL bytes
tr -d '\000' < S3FILE.csv > S3FILE_OUT.csv
# upload the cleaned file back to S3
aws s3 cp S3FILE_OUT.csv s3://S3FILE.csv
Conclusion
For now those are the only issues that I have encountered with my weekly and monthly job runs. Hopefully this saves somebody else some valuable time. In the end the rewards far outweigh the challenges: our data scientists can now run queries in minutes to seconds on Redshift that took hours to days on Aurora.
Tags: aurora, aws, datapipeline, rds, redshift