Archive for July, 2016

AWS Summit Notes

July 12, 2016

 

What a great first AWS Summit on African soil!

[Photo: 2016-07-12 08.14]

Even had the honour of getting a quick pic with the legend himself 🙂

What a massive turnout as well! 600+ people turned up. I never knew the interest in AWS was this big in Cape Town.

We had great talks from MiX Telematics on how they manage 66TB of data.

 

[Photo: 2016-07-12 09.39]

We also heard how Jumo World came to trust AWS security in the unsecured lending space. Some points I took away were around compliance reporting, which I might need in the future.

Other key points of discussion were ECS and the Lambda service. Lambda shows real promise in the microservices space. The Serverless Architecture demo was great: non-technical, but reducing the cost of running a website from $4000 to $13 really got a lot of oohs and aahs.

There was a voice-deploy demo using Amazon Echo. The backend was powered by AWS Lambda, which in turn triggered a CodeDeploy job to update code from a voice command. Some cool shit!

[Photo: 2016-07-12 16.28]

Werner ended the day off with a talk on Machine Learning (kudos for his AVB t-shirt and red chucks), discussing how machine learning has become the differentiator in some spheres of business.

[Photo: 2016-07-12 21.22]

All in all it turned out to be a good day. We even had some time to mingle over a beer or 6 and exceptional wines from Saronsberg 🙂

AWS Summit Cape Town

July 11, 2016

[Photo: Cape Town airport]

Yes, you heard correctly! AWS Summit is coming to little old Cape Town! The heart of South Africa, and probably the home of EC2 and a few other services. Not to mention a great place for developers to work, live and play. 🙂

If you haven’t already registered, it might not be too late: https://aws.amazon.com/south-africa/summit-cape-town.

Dr. Werner Vogels will be the keynote speaker. That alone should get you there, him being the CTO of Amazon and all. 😉

Looking forward to hearing some cool announcements pertaining to Africa. Maybe an edge location in the making?

AWS re:Invent 2016

July 6, 2016

The dates have been set! Time for the largest gathering of AWS users in the world. If you are lucky enough to attend this year, there are a few things you should know before going.

Book the closest hotel.

That would be the Venetian, as it all goes down there. Second choice would be the Mirage. The Encore/Wynn are very classy hotels, but they are quite a walk from the Venetian.

Arrive early.

Try to be there a day early, so you can adjust to the weather and shake off the side effects of jet lag.

Don’t try and walk!

Everything is damn far from everything else. Rather jump on a bus and walk from the nearest stop. You can get a 3-day bus pass for around $22. Money well spent.

Get connected.

Get yourself a SIM card for your phone. There’s wireless in the Venetian, but not all places outside are free. Install the official app to stay in touch with conference activities.

Nothing good happens after 12.

If you would like to partake in the majority of events, get back to your room by 12 (that’s my age talking). Always have a buddy with you. Good to have buddies nearby.

Shop

It’s year end, and if you’re looking for bargains, there are bargains to be had, mainly off The Strip though. You might even decide to pack light and get there a day early to shop for some jeans and the like 🙂

Explore

Use TripAdvisor to see what’s cool to do in the area if you have spare time. There’s a ton of stuff to do! (Besides the obvious! 😉)

Related Links:

http://reinvent.awsevents.com/

https://blogs.aws.amazon.com/security/post/Tx31S9L7QPFKF6O/Start-Planning-for-AWS-re-Invent-2016

https://blog.cloudability.com/aws-reinvent-2016-registration-open-what-to-expect/

 

Dealing with Data Pipeline Failures

July 6, 2016

In my previous post on migrating a large amount of data I walked through the challenges faced when using AWS Data Pipeline. Now that we have our Data Pipeline job running weekly, we decided to copy related data tables to Redshift as well. This came with its own challenges. We use the Incremental copy of RDS MySQL table to Redshift template supplied by AWS.

Duplicates

Unlike the now 600GB+ table, the related table schemas have IDs. This makes a difference: it’s somewhat easier to deal with duplicate entries. You can define the id key in the Data Pipeline job and it will overwrite existing values in Redshift.

Set the id as the Redshift table distribution key. If the id is not the distribution key, set the id as one of the Redshift table sort keys.

[Screenshot: 2016-07-06 9.48 AM]
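
For reference, the table on the Redshift side might be declared along these lines. This is just a minimal sketch: TABLENAME, the extra columns and all the types are placeholders, not our actual schema.

create table public.TABLENAME (
  id BIGINT NOT NULL,     -- the id key defined in the Data Pipeline job
  COL_1 VARCHAR(256),     -- placeholder columns and types
  COL_2 VARCHAR(256)
)
DISTKEY(id)               -- id as the distribution key...
SORTKEY(id);              -- ...and/or one of the sort keys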

Choose the proper insert mode. Select OVERWRITE_EXISTING.

[Screenshot: 2016-07-06 9.51 AM]

This should avoid the insertion of duplicates, as Redshift creates a staging table before inserting into the original table, then copies only the unique values into the original table.
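
Under the hood this is the usual Redshift staging-table merge pattern. Roughly something like the following (a sketch only, with an assumed staging table name; it is not the exact SQL the template generates):

begin;
-- remove target rows that are about to be replaced by staged rows with the same id
delete from TABLENAME
using TABLENAME_staging s
where TABLENAME.id = s.id;
-- then copy the staged rows across
insert into TABLENAME
select * from TABLENAME_staging;
end;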

Tables with No IDs

What if your table has no IDs, though, and you are not sure whether you have duplicates? To check whether duplicates exist, you can run the following on Redshift:

SELECT count(*) FROM (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY COL_1, COL_2, COL_N
            order by COL_1, COL_2, COL_N) AS r,
        t.*
    from TABLENAME t
) x
where x.r > 1;

Replace COL_1 to COL_N with your columns and replace TABLENAME with your table name. A bit tedious to type all your column names; our table has 11. (Source)
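
If you would rather see which rows are duplicated, rather than just a count, a plain GROUP BY does the trick too (same placeholders as above):

SELECT COL_1, COL_2, COL_N, count(*) AS copies
FROM TABLENAME
GROUP BY COL_1, COL_2, COL_N
HAVING count(*) > 1;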

Oh no, duplicates! What to do if you do have duplicates? Delete them, of course! Easier said than done though. You can’t delete the duplicates in place; a temp table has to be created with unique rows only. And depending on your table size, this can be a real issue. We’ve had to resize our cluster to accommodate removing duplicates. I’ve posted a feature request.

To remove duplicates, run the following on Redshift:

begin;
lock table public.TABLENAME;
-- new table with the same columns; replace the placeholder types with your actual column types
create table if not exists public."TABLENAME_mig" (
  COL_1 BIGINT,
  COL_2 VARCHAR(256),
  COL_N VARCHAR(256)
)
SORTKEY(COL_1, COL_2, COL_N);
alter table public."TABLENAME_mig" add primary key (COL_1);
-- keep only one row per distinct (COL_1, COL_2, COL_N) combination
insert into public."TABLENAME_mig" SELECT COL_1, COL_2, COL_N FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY COL_1, COL_2, COL_N order by COL_1, COL_2, COL_N) AS r,
    t.*
  from TABLENAME t
) x
where x.r = 1;
analyze public."TABLENAME_mig";
-- swap the deduplicated table in for the original
drop table public.TABLENAME cascade;
alter table public."TABLENAME_mig" rename to TABLENAME;
commit;
vacuum delete only;

Replace COL_1 to COL_N with your columns and replace TABLENAME with your table name. (Source)

Character encodings

Another issue that we constantly encounter is importing the S3 file into Redshift. This does not always succeed, as the data, which is plain CSV, might have encoding issues. In our case it was null-byte characters. Have a look at how to prepare your input data.

[Screenshot: 2016-07-06 10.34 AM]

The problem is we don’t have control over the options that dump the records to CSV. The table records have already been written to CSV and stored on S3. We can check the errors and evaluate what to do next.

Running the following to check for errors on Redshift:

select * from stl_load_errors order by starttime desc;

Produces output similar to:

Missing newline: Unexpected character 0x6f found at location 31                                     

In the err_reason column.
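
A select * is a lot to scroll through; narrowing it down to a handful of columns (these all exist in stl_load_errors) makes the failing file and line easier to spot:

select starttime, filename, line_number, colname, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;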

To resolve it, check the Activity Logs of the RDSToS3CopyActivity to see the CSV S3 file name.

[Screenshot: 2016-07-06 10.41 AM]

You should see the S3 file name at the end of the Activity Logs; it usually starts with s3://.

Now that you have the S3 location, execute the following on Redshift:

COPY TABLENAME
FROM 's3://PATH/TO/S3FILE.csv'
CREDENTIALS 'aws_access_key_id=ACCESS_KEY;aws_secret_access_key=SECRET_KEY'
NULL AS '\0'
CSV;

Replace s3://PATH/TO/S3FILE.csv with the S3 location you got from the Activity Logs, replace TABLENAME with your table name, and add your access credentials ACCESS_KEY and SECRET_KEY. (Source)

Of course, you could have your application not save NULL bytes in the first place. But this is not always possible.

Multiple NULL Bytes

In the rare case that you get a shitload of NULL bytes in your data, the solution above will not work, as the NULL AS option only seems to replace one NULL byte. What I currently do is download the S3 file to an EC2 instance (much quicker on EC2), strip the NULL bytes, and upload the file back to S3:

# download the CSV from S3
aws s3 cp s3://S3FILE.csv .
# strip the NULL bytes
tr < S3FILE.csv -d '\000' > S3FILE_OUT.csv
# upload the cleaned file back over the original
aws s3 cp S3FILE_OUT.csv s3://S3FILE.csv

Conclusion

For now those are the only issues that I have encountered with my weekly and monthly job runs. Hopefully this saves somebody else some valuable time. In the end the rewards far outweigh the challenges! Our data scientists can now run queries in minutes to seconds on Redshift that took hours to days on Aurora.