This has happened to us all at some point in time. We start off with a MySQL instance, just a few tables, very basic stuff. Our application starts growing, and we decide to use MySQL for all forms of data. And here is our problem: out of the box, MySQL does not deal well with very large datasets.
We are a small team and don’t have the capacity to manage the infrastructure side of things, so we use AWS where we can. We started with an AWS MySQL RDS instance and slowly but surely scaled vertically. Then we tried RDS Aurora, with its improved performance over RDS MySQL. That only worked for a while as well. As most gurus will tell you, “it seems time that you move to something scalable…” In comes Redshift. I won’t go into the details, but it is definitely worth a read to see how it deals with BIG data. Have a read here http://docs.aws.amazon.com/redshift.
Thought I would make a few notes on my experience moving a +-500GB table from RDS Aurora to Redshift.
The Template
To copy a table from RDS to Redshift you can use the “Full copy of RDS MySQL table to Redshift” template. There’s an incremental version of the template as well. The settings that I discuss below are all based on these two templates.
Yes, you can script it yourself. Looking back at the time it took me to get the template to work, that might have been the better option.
There are many third-party companies that will do this for you as well; they have made a business out of ETL.
Instance Type
The template uses a fairly small instance. I would recommend something with a bit more memory, as the process will run for hours, if not days. If you see the following error in your logs, you probably need a larger instance:
Can not read response from server. Expected to read 85 bytes, read 3 bytes before connection was unexpectedly lost.
You can change the Instance Type by editing the template in Architect. See below.
The template does 3 retries. You can imagine your S3 bill if it times out on a few hundred GB.
Instance Store Size
The EC2 instance used has an 8GB instance store, which is way too small. To remedy this, create your own custom AMI; I chose the standard Amazon Linux AMI. For further explanation, read http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html. When you launch the instance that you will create the AMI from, make sure the instance store is larger than your table. See below.
The root volume does not have to be SSD; magnetic works just fine, and it’s cheaper. Once your AMI has been created, you will have to specify it in the template. See the image in Instance Type above. The Image Id is specified in the same resource.
Instance Location and Access
If your RDS instance resides in a private VPC, like ours, put your EC2 Instance in that VPC as well. See the image in Instance Type above. The Subnet Id is specified in the same resource.
The Security Group parameter should be deleted, as it cannot be used in a VPC; use the Security Group Ids parameter instead. Give the EC2 instance the security group IDs required to access your RDS instance. See the image in Instance Type above. The Security Group Ids are specified in the same resource.
Make your Redshift instance publicly accessible so that the EC2 Instance can access the cluster.
RDS Connection
I’ve also seen the error below when the RDS server has been very busy:
Can not read response from server. Expected to read 85 bytes, read 2 bytes before connection was unexpectedly lost.
Try to run your migration when things are not too busy. This is obviously not possible for everyone, but find the least busy time on your RDS instance to do the initial migration.
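If you want a rough sense of how busy the source is at a given moment, a quick check against MySQL’s processlist helps. Just a rough sketch, not a proper load metric:

-- Count the connections on the MySQL/Aurora source that are actually doing work.
SELECT COUNT(*) AS active_connections
FROM information_schema.processlist
WHERE command <> 'Sleep';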
Template Parameters
There are parameters marked optional, but they are actually required. Yes! Optional MANDATORY parameters.
For some reason, the myRedshiftTypeConvOverrideMap parameter has to be set, even if you don’t have overrides. I set the value to tinyint(1):smallint.
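If you are not sure whether your table even has columns that need an override, a quick look at the source schema helps. A minimal sketch, assuming a hypothetical schema name of mydb:

-- Find MySQL columns declared as tinyint(1), the type most likely to need
-- an explicit Redshift conversion override.
SELECT table_name, column_name, column_type
FROM information_schema.columns
WHERE table_schema = 'mydb'
  AND column_type = 'tinyint(1)';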
Cluster Size
The raw size of the table was +-425GB. The Redshift cluster was configured with 4 dc1.large instances, which have 160GB of storage per node. Unfortunately, the template failed to ingest the table as it was too large for the 4 nodes. Rather provision a bit more capacity to get the initial raw data in; I would suggest 50% more. In our case, we provisioned 6 nodes.
Redshift Table
The template uses a table migration script to create a Redshift table based on your MySQL table. This might fail as well. We tried using MySQL table partitioning to improve performance on Aurora, and the table migration could not interpret this. You can find an explanation here http://aws.mannem.me/?p=1249. A workaround is to create the table yourself, which causes the template to skip the table creation step.
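As an illustration, here is a minimal sketch of pre-creating the target table by hand. The table, columns, distribution key and sort key are all hypothetical and should match your own MySQL schema:

-- Hypothetical target table, created manually so the template skips its own
-- table creation step. Adjust names, types and keys for your data.
CREATE TABLE events (
    id         BIGINT NOT NULL,
    user_id    BIGINT,
    event_type VARCHAR(64),
    payload    VARCHAR(4096),
    created_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (created_at);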
Redshift Copy
The Data Pipeline COPY command is pretty basic and can’t be modified, which is a problem in some cases. If your table contains null byte characters, Redshift sees them as newlines and your pipeline will fail. It will not delete the S3 file though. Check the activity logs of the RDSToS3CopyActivity to see where it wrote the S3 file, then run the COPY command manually using a Redshift client. See below:
COPY TABLENAME
FROM 's3://path/to/file.csv'
CREDENTIALS 'aws_access_key_id=ACCESS_ID;aws_secret_access_key=SECRET_KEY'
NULL AS '\0'  -- load null bytes as NULL instead of failing the load
CSV;
Redshift Errors
To check Redshift load errors, simply run:
select * from stl_load_errors order by starttime desc;
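If the full rows are hard to read, a narrower version of the same query (just a sketch) makes the failing file, column and reason easier to spot:

-- The most recent load failures: which file, line and column failed,
-- the offending value, and Redshift's reason.
SELECT starttime,
       filename,
       line_number,
       colname,
       raw_field_value,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;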
I made some more notes on errors in my Dealing with Data Pipeline Failures post.
Other Notes
To really save time, create a copy of your large table with a 10MB subset of your data. This should be enough to test the copy end to end, and it speeds up fixing the inevitable failures.
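On the MySQL side, something like the following is enough to build that test table. Just a sketch: the table name and row count are hypothetical, and the right LIMIT depends on your row size.

-- Build a small test copy of the source table to run the pipeline against
-- end to end. The rows selected are arbitrary without an ORDER BY.
CREATE TABLE events_sample LIKE events;

INSERT INTO events_sample
SELECT * FROM events LIMIT 50000;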
If that succeeds, read https://blogs.aws.amazon.com/bigdata. In that post, Issue #1 on column encoding is of great importance. You can use the Amazon Redshift Column Encoding Utility to check what encodings your table columns should have, then recreate the table with the proper encodings.
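Redshift itself can also recommend encodings with its built-in ANALYZE COMPRESSION command, which samples the data you have already loaded. A minimal sketch with a hypothetical table name:

-- Ask Redshift for a recommended compression encoding per column, based on
-- a sample of the loaded data.
ANALYZE COMPRESSION events;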
The difference in table size is considerable. Our +-500GB table in Aurora turns out to be +-140GB when compressed correctly. Insane!
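To check what a table actually occupies on the cluster, you can query the svv_table_info system view. Table name hypothetical; size is reported in 1MB blocks:

-- Compressed size and row count of a table as stored on the cluster.
SELECT "table",
       size / 1024.0 AS size_gb,
       tbl_rows
FROM svv_table_info
WHERE "table" = 'events';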
Conclusion
There are plenty of configuration parameters missing from the Data Pipeline template, or at the very least better documentation is needed. Hopefully this will save somebody a few days, or in my case, a couple of weeks.
Tags: aurora, datapipeline, rds, redshift
July 7, 2016 at 6:32 am
Did you consider AWS DMS to convert from Aurora to Redshift? (This is also being discussed on Reddit/AWS)
July 7, 2016 at 9:35 am
The guys in the AWS SA User Group suggested that as well. To be honest, I haven’t had time to look at it yet. I remember the group not being able to tell me whether truncates/deletes are replicated as well. Eventually we will truncate the Aurora source table, and that should not be replicated.
July 7, 2016 at 3:14 pm
Another question from Reddit:
> Why are you going to Redshift? Unless you are doing data warehousing (OLAP/BI/reporting) this may not be the way to go. There are a few exceptions, but Redshift may not be the solution if you have an OLTP type application. Solutions I have seen have an RDBMS for OLTP with replication to a data warehouse (Sybase IQ) for reports.
July 7, 2016 at 3:28 pm
Reporting and BI, yes. At the moment, any queries involving that now +-700GB table in Aurora take quite some time. We are looking at eventually truncating that table and only keeping a few weeks of data. I am investigating AWS DMS today. First hurdle: bin logging is not turned on and requires a server restart! 🙂
February 6, 2017 at 11:38 pm
I’m using the Full copy from RDS to S3 template and am running into the “connect timed out” error. I used the default for the security group but it still gave me the same error. How did you set your security group up? Thanks.
February 10, 2017 at 9:41 am
You need to attach a security group that has access to your RDS instance to your EC2 instance. In the Resources section there is a field called Security Group Ids; put the SG id in there.
April 27, 2017 at 10:00 pm
[…] and fails with using Datapipeline to archive Aurora tables to Redshift. This led to a comment from rjhintz about using AWS DMS instead. I initially went with Datapipeline because we would eventually […]