Posts Tagged ‘rds’

Amazon Aurora Row size limit.

October 31, 2018

We started using Aurora as soon as it was released. With the obvious pros over using MySQL. There’s never been an issue running our production workloads.

Recently we encountered an issue that one would have assumed would be a non-issue in Aurora. The row size of a column is limited to a little less than 8KB. Our application had a description text area that needed to accept a little more than 8KB of data. [1]

“Amazon Aurora MySQL does not support compressed tables that is tables created with ROW_FORMAT=COMPRESSED, hence we request you to change the row_format to DYNAMIC in BARRACUDA file format” [5]

What I found on the forums wasn’t helpful. The solutions there were a no go, use Oracle lol. [6]


The issue occurs when the file format chosen by InnoDB is ANTELOPE. (Please note that ANTELOPE is the original InnoDB file format and currently this file format supports COMPACT and REDUNDANT row formats for the InnoDB tables and this currently is the default file format in MySQL 5.6. This is kept as default to ensure that the earlier MySQL versions have maximum compatibility with 5.6 MySQL. Please note that this is the case with MySQL also which leverages ANTELOPE as default file format.)

Currently InnoDB has another file format called BARRACUDA which is the newest file format for InnoDB and this currently supports all InnoDB row formats including the newer COMPRESSED and DYNAMIC besides the COMPACT and REDUNDANT row format from ANTELOPE file format. The COMPRESSED and DYNAMIC row formats include compressed tables, efficient storage of off-page columns and index key prefixes up to 3072 bytes [4]. So when you create a table with row format in DYNAMIC or COMPRESSED, InnoDB can store long variable-length column values for VARCHAR, BLOB and TEXT data types fully off-page. [4]

Having said this please kindly note that both DYNAMIC and COMPRESSED row formats support index key prefixes up to 3072 bytes, to get the prefix of more than 768 bytes we will have to use turn the innodb_large_prefix parameter to ON that can give key prefixes up to 3072 bytes. DYNAMIC row format can maintain the efficiency of storing the entire row in the index node, however, this format avoids the problem of filling B-tree nodes with a large number of data bytes of long columns.


1. By default innodb_file_per_table parameter is set to ON in Cluster Parameter group, you can have a different custom cluster parameter group just in case if would like to change the collation and character sets in future, for now even with your default cluster parameter group the innodb_file_per_table parameter is set to ON.

  • The default values are innodb_file_format = Antelope [2] and innodb_large_prefix = 0 [3]

2. Tweak these respective parameters in a custom DB parameter group called ‘aurora56’

  • innodb_file_format value to BARRACUDA
  • innodb_large_prefix value to 1 so that you can get up to 3072 bytes for key prefixes.

Assign the custom DB parameter group to you Aurora instance.

3. Perform the ALTER TABLE after changing the values of parameter groups as suggested earlier and tag the custom parameter groups to your instance and after ensuring that the parameter groups have taken into effect you may perform the below.



This solution was made available by AWS Support. And I thought it sounded pretty smart for a workaround. Credit to Ashika that picked up my ticket.


[1] Limits on InnoDB Tables:

[2] innodb_file_format:


[4] InnoDB File-Format Management:

[5] Compressed tables not supported:

[6] Forums.

Dealing with Data Pipeline Failures

July 6, 2016

In my previous post on migrating a large amount of data I walked through the challenges faced when using AWS Data Pipeline. Now that we have our Data Pipeline job running weekly, we decided to copy related data tables to Redshift as well. This came with it’s own challenges. We use the Incremental copy of RDS MySQL table to Redshift template supplied by AWS.


Unlike the, now 600GB+ table, the related table schemas have ids. This makes a difference. It’s somewhat easier to deal with duplicate entries. You can define the id key in the Data Pipeline job and it will overwrite existing values in Redshift.

Set the id as the Redshift table distribution key. If the id is not the distribution key, set the id as one of the Redshift table sort keys.

Screen Shot 2016-07-06 at 9.48.44 AM

Choose the proper insert mode. Select OVERWRITE_EXISTING.

Screen Shot 2016-07-06 at 9.51.40 AM

This should avoid the insertion of duplicates. As Redshift creates a staging table before inserting into the original table. Then copies unique values into original table.

Tables with No IDs

What if your table has no IDs though and you are not sure if you have duplicates. To check if duplicates exist you can run the following on Redshift:

SELECT count(*) FROM (
          order by COL_1, COL_2, COL_n) AS r,
       from TABLENAME t
     ) x
where x.r > 1;

Replace COL_1 to COL_N with you columns and replace TABLENAME with your table name. A bit tedious to type all your column names, our table has 11. (Source)

Oh No! duplicates! What to do, if you do have duplicates? Delete them of course! Easier said than done though. You can’t delete the duplicates in the table. A temp table has to be created with unique rows only. And depending on your table size, this can be a real issue. We’ve have to resize our cluster to accommodate removing duplicates. I’ve posted a feature request.

To remove duplicates  run the following on Redshift:

lock table public.TABLENAME;
create table if not exists public."TABLENAME_mig" (
alter table public."TABLENAME_mig" add primary key (COL_1);
insert into public."TABLENAME_mig" SELECT COL_1,COL_2, COL_N FROM (
      from TABLENAME t
    ) x
where x.r = 1;
analyze public."TABLENAME_mig";
drop table public.TABLENAME cascade;
alter table public."TABLENAME_mig" rename to TABLENAME;
vacuum delete only;

Replace COL_1 to COL_N with you columns and replace TABLENAME with your table name. (Source)

Character encodings

Another issue that we constantly encounter is importing the S3 file to Redshift. This does not always succeed. As the data, which is plain CSV, might have encoding issues. In our case it was null-byte characters. Have a look at how to prepare your input data.

Screen Shot 2016-07-06 at 10.34.34 AM.png

The problem is we don’t have control on the options that dump the records to CSV. The table records has been written to CSV and stored on S3 already. We can check the errors and evaluate what to do next.

Running the following to check for errors on Redshift:

select * from  stl_load_errors order by starttime desc

Produces output similar to:

Missing newline: Unexpected character 0x6f found at location 31                                     

In the err_reason column.

To resolve, check your Activity Logs of the RDSToS3CopyActivity to see the CSV S3 file name.

Screen Shot 2016-07-06 at 10.41.25 AM

You should see the S3 file name at the end of the Activity Logs, it usually starts with s3://.

Now that you have the S3 location, execute the following on Redshift:

FROM 's3://PATH/TO/S3FILE.csv'
CREDENTIALS 'aws_access_key_id=ACCESS_KEY;aws_secret_access_key=SECRET_KEY’
NULL AS '\0'

Replace s3://PATH/TO/S3FILE.csv with the S3 location you got from Activity Logs and replace TABLENAME with your table name and add your access credentials ACCESS_KEY and SECRET_KEY. (Source)

Of course, you could have your application not saving saving NULL bytes. But this is not always possible.

Multiple NULL Bytes

In the rare case that you get a shitload of NULL bytes in your data, the solution above will not work. As the NULL as command only seems to replace one NULL byte. What I currently do is download the S3 file to an EC2 instance. (Much quicker on EC2) Then replace the NULL bytes and upload the file to S3.

aws s3 cp s3://S3FILE.csv .
tr < S3FILE.csv -d '\000' > S3FILE_OUT.csv
aws s3 cp S3FILE.csv s3://S3FILE.csv


For now those are the only issues that I have encountered with my weekly, monthly job runs. Hopefully this saves somebody else some valuable time. In the end the challenges far outweigh the rewards! Our data scientists are now able to do queries that took hours to days on Aurora in minutes to seconds on Redshift.

Migrating large(500GB+) datasets to Redshift.

May 19, 2016

This has happened to us all at some point in time. We start off with a MySQL instance. Just a few tables, very basic stuff. Our application starts growing. We then decide to use MySQL for all forms of data. And here is our problem. Out of the box it does not deal well with very large datasets.

We a small team and don’t have the capacity to manage the infrastructure side of things. So we use AWS where we can. We started with an AWS MySQL RDS instance and slowly but surely scaled vertically. Then we tried RDS Aurora with its improved performance compared to RDS MySQL. This only worked for a while as well. As most gurus will tell you, “it seems time that you move to something scalable…” In comes Redshift. I won’t go into the details, but it is definitely worth a read to see how they deal with BIG data. Have a read here

Thought I would make a few notes on my experience in moving +-500GB table from RDS Aurora to Redshift.

The Template

To copy a table from RDS to Redshift you can use the “Full copy of RDS MySQL table to Redshift” template. There’s an incremental version of the template as well. The settings that I discuss below are all based on these two templates.

Yes you can script it yourself. Looking back, and the time that I took to get it to work! That might have been a better  option.

There are many Third-Party companies that do this for you as well. They have made a business out of ETLs.

Instance type

The template uses a fairly small instance. I would recommend something with a bit more memory. As the process will run for hours if not days. If you see the following error in your logs, you probably need a larger instance:

Can not read response from server. Expected to read 85 bytes, read 3 bytes before connection was unexpectedly lost.

You can change the Instance Type by editing the template in Architect. See below.

Screen Shot 2016-05-19 at 1.44.48 PM

The template does 3 retries. You can imagine your S3 bill if it should timeout on a few 100 GB.

Instance Store Size

The EC2 instance used has an 8GB instance store. This is way too small. To remedy this create your own custom AMI. I chose the standard Amazon Linux AMI. For further explanation read this When you launch the instance that you will create the AMI from, make sure the instance store is larger than your table. See below.

Screen Shot 2016-05-19 at 3.27.13 PM.png

The root instance store does not have to be SSD. Magnetic works just fine. And it’s cheaper. Once your AMI has been created you will have to specify it in the template. See the image in Instance Type above. The Image Id is specified in the same resource.

Instance Location and Access

If your RDS instance resides in a private VPC, like ours, put your EC2 Instance in that VPC as well. See the image in Instance Type above. The Subnet Id is specified in the same resource.

The Security Group parameter should be deleted. As it cannot be used in a VPC, instead use the Security Group Ids parameter instead. Give the EC2 instance the Security Group Ids required to access your RDS instance. See the image in Instance Type above. TheSecurity Group Ids are specified in the same resource.

Make your Redshift instance publicly accessible so that the EC2 Instance can access the cluster.

RDS Connection

I’ve also seen the error below when the RDS server has been very busy:

Can not read response from server. Expected to read 85 bytes, read 2 bytes before connection was unexpectedly lost.

Try and run your migration when things are not to busy. This is obviously not possible for everyone. But find the least busy time on your RDS instance to do the initial migration.

Template Parameters

There are parameters marked optional, but they are actually required. Yes! Optional MANDATORY parameters.

For some reason the myRedshiftTypeConvOverrideMap has to be set. Even if you don’t have overrides. I set the value to tinyint(1):smallint.

Cluster Size

The raw size of the table was +-425GB. The Redshift cluster was configured with 4 dc1.large instances. They contain 160GB/node. Unfortunately the template failed to ingest the table as it was too large for the 4 nodes. Rather provision a bit more to get the initial raw data in. I would suggest 50% more. In our case we provisioned 6 nodes.

Redshift Table

The template uses a table migration script to create a Redshift table based on your MySQL table. This might fail as well. We tried using MySQL table partitioning to improve performance on Aurora. The table migration could not interpret this though. You can get an explanation here A workaround is to create the table yourself. This will cause the template to skip the table creation step.

Redshift Copy

The Datapipeline Copy command is pretty basic. And can’t be modified. This is a problem for some cases. If your table contains null byte characters, Redshift sees this as newlines. Your Datapipeline will fail. But it will not delete the S3 file. Check the activity logs of the RDSToS3CopyActivity to see where it wrote the S3 file. Then manually use COPY command using a Redshift client. See below:

 FROM 's3://path/to/file.csv' 
 CREDENTIALS 'aws_access_key_id=ACCESS_ID;aws_secret_access_key=SECRET_KEY' 
 null as '\0'

Redshift Errors

To check Redshift errors simple run:

select * from  stl_load_errors order by starttime desc;

I made some more notes on errors in Dealing with Data Pipeline Failures post.

Other Notes

To really save time, create a copy of your large table with a 10MB subset of your data. This should be enough data to test the copy end to end. And speed up fixing constant failures.

If that succeeds. Read In that post Issue #1 on column encoding is of great importance. You can use the Amazon Redshift Column Encoding Utility and check what encoding your table columns should be. Then recreate that table with the proper encodings.

The difference in table size is considerable. Our +- 500 GB table in Aurora turns out to be +- 140GB compressed correctly. Insane!


There are plenty of configuration parameters that are missing from the Datapipline template. Or at least better documentation. Hopefully this will save somebody a few days, or in my case a couple of weeks.