Dealing with Data Pipeline Failures

In my previous post on migrating a large amount of data, I walked through the challenges we faced when using AWS Data Pipeline. Now that our Data Pipeline job runs weekly, we decided to copy the related data tables to Redshift as well. This came with its own challenges. We use the Incremental copy of RDS MySQL table to Redshift template supplied by AWS.

Duplicates

Unlike the now 600GB+ table, the related tables have id columns. This makes a difference: duplicate entries are somewhat easier to deal with. You can define the id key in the Data Pipeline job and it will overwrite existing values in Redshift.

Set the id as the Redshift table distribution key. If the id is not the distribution key, set the id as one of the Redshift table sort keys.
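
For reference, here is a minimal sketch of what such a target table's DDL might look like. The table and column names (orders, id, created_at, and so on) are hypothetical; the point is only where the DISTKEY and SORTKEY go.

-- Hypothetical example table; only the key placement matters here.
-- id is both the distribution key and the leading sort key, so the
-- pipeline's overwrite step can match incoming rows to existing ones.
create table orders (
  id         bigint not null,
  account_id bigint,
  status     varchar(32),
  created_at timestamp,
  primary key (id)
)
distkey(id)
sortkey(id, created_at);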


Choose the proper insert mode. Select OVERWRITE_EXISTING.


This should avoid the insertion of duplicates, as Redshift first loads the incoming rows into a staging table and then copies only the unique values into the original table.
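
As far as I can tell, this is roughly the standard Redshift staging-table merge. A sketch of the manual equivalent, assuming a hypothetical staging table TABLENAME_staging that already holds the incoming batch:

-- Rough manual equivalent of OVERWRITE_EXISTING (sketch only).
-- TABLENAME_staging is a hypothetical staging table holding the new batch;
-- id is the key defined in the Data Pipeline job.
begin;
-- remove the rows that are about to be replaced
delete from TABLENAME
using TABLENAME_staging
where TABLENAME.id = TABLENAME_staging.id;
-- insert the fresh copies
insert into TABLENAME
select * from TABLENAME_staging;
commit;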

Tables with No IDs

What if your table has no IDs, though, and you are not sure whether you have duplicates? To check whether duplicates exist, you can run the following on Redshift:

SELECT count(*) FROM (
  SELECT
    ROW_NUMBER() OVER (
      PARTITION BY COL_1, COL_2, COL_N
      ORDER BY COL_1, COL_2, COL_N
    ) AS r,
    t.*
  FROM TABLENAME t
) x
WHERE x.r > 1;

Replace COL_1 to COL_N with your columns and replace TABLENAME with your table name. A bit tedious to type all your column names; our table has 11. (Source)
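
If typing the column list gets tedious, you can pull the column names out of Redshift's pg_table_def catalog view and paste them in, assuming the table's schema is on your search_path:

-- List the columns of TABLENAME so they can be pasted into the
-- PARTITION BY / ORDER BY clauses above.
-- Note: pg_table_def only shows tables in schemas on the search_path.
select "column"
from pg_table_def
where tablename = 'tablename';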

Oh no, duplicates! What to do if you do have duplicates? Delete them, of course! Easier said than done, though. You can't delete the duplicates in place; a temp table has to be created with only the unique rows. And depending on your table size, this can be a real issue. We've had to resize our cluster to accommodate removing duplicates. I've posted a feature request.

To remove duplicates, run the following on Redshift:

begin;
lock table public.TABLENAME;
-- Column types must be supplied here; replace the _TYPE placeholders
-- with the actual types of your columns.
create table if not exists public."TABLENAME_mig" (
  COL_1 COL_1_TYPE,
  COL_2 COL_2_TYPE,
  COL_N COL_N_TYPE
)
sortkey(COL_1, COL_2, COL_N);
alter table public."TABLENAME_mig" add primary key (COL_1);
insert into public."TABLENAME_mig"
select COL_1, COL_2, COL_N from (
  select
    row_number() over (
      partition by COL_1, COL_2, COL_N
      order by COL_1, COL_2, COL_N
    ) as r,
    t.*
  from TABLENAME t
) x
where x.r = 1;
analyze public."TABLENAME_mig";
drop table public.TABLENAME cascade;
alter table public."TABLENAME_mig" rename to TABLENAME;
commit;
-- vacuum cannot run inside a transaction, so it goes after the commit
vacuum delete only;

Replace COL_1 to COL_N with your columns (and their types) and replace TABLENAME with your table name. (Source)

Character encodings

Another issue that we constantly encounter is importing the S3 file into Redshift. This does not always succeed, as the data, which is plain CSV, might have encoding issues. In our case it was null-byte characters. Have a look at how to prepare your input data.


The problem is that we don't have control over the options used to dump the records to CSV; the table records have already been written to CSV and stored on S3. We can check the errors and evaluate what to do next.

Run the following on Redshift to check for errors:

select * from stl_load_errors order by starttime desc;

This produces output similar to the following in the err_reason column:

Missing newline: Unexpected character 0x6f found at location 31
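
If the full stl_load_errors output is too wide to read comfortably, a trimmed-down variant of the same query (just a convenience, nothing more) shows the columns that usually matter:

-- Most recent load errors, limited to the useful columns:
-- which file failed, on which line and column, and why.
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;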

To resolve this, check the Activity Logs of the RDSToS3CopyActivity to find the name of the CSV file on S3.


You should see the S3 file name at the end of the Activity Logs; it usually starts with s3://.

Now that you have the S3 location, execute the following on Redshift:

COPY TABLENAME
FROM 's3://PATH/TO/S3FILE.csv'
CREDENTIALS 'aws_access_key_id=ACCESS_KEY;aws_secret_access_key=SECRET_KEY'
NULL AS '\0'
CSV;

Replace s3://PATH/TO/S3FILE.csv with the S3 location you got from the Activity Logs, replace TABLENAME with your table name, and fill in your access credentials for ACCESS_KEY and SECRET_KEY. (Source)
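
If you would rather not paste access keys into the query, COPY also accepts an IAM role attached to the cluster. A variant of the same command, with a placeholder role ARN:

-- Same COPY, authenticating with an IAM role attached to the cluster
-- instead of access keys. The role ARN below is a placeholder.
COPY TABLENAME
FROM 's3://PATH/TO/S3FILE.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/YOUR_REDSHIFT_ROLE'
NULL AS '\0'
CSV;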

Of course, you could have your application avoid saving NULL bytes in the first place, but this is not always possible.

Multiple NULL Bytes

In the rare case that you get a shitload of NULL bytes in your data, the solution above will not work, as the NULL AS option only seems to replace one NULL byte. What I currently do is download the S3 file to an EC2 instance (much quicker on EC2), strip the NULL bytes, and upload the file back to S3.

aws s3 cp s3://S3FILE.csv .
tr -d '\000' < S3FILE.csv > S3FILE_OUT.csv
aws s3 cp S3FILE_OUT.csv s3://S3FILE.csv

Conclusion

For now those are the only issues that I have encountered with my weekly and monthly job runs. Hopefully this saves somebody else some valuable time. In the end the rewards far outweigh the challenges! Our data scientists can now run queries in minutes to seconds on Redshift that took hours to days on Aurora.
