Rowanto Luo

Just another blog. or log.

Playing with AWS Redshift - Sum of 1.2 billion rows took 8 seconds

One of the things that makes zanox an exciting place to work is the innovation time. We used to always have 20% of our work time as innovation time, where we could learn new stuff, which makes life a whole lot more interesting. About a month before this writing, the format of the innovation time changed: instead of doing whatever we want on Fridays, we need to make transparent what we're learning, and the company will support us by providing what we need. And that's when we got the idea with Redshift.

So what did we do? Basically, we mimicked the schema of one of our most important tables and generated data for it. Together with one of my colleagues, I wrote a program that generates CSV data and uploads it to S3 directly. Since reading/writing files is slow and we didn't want to fill the drive, the files are generated in RAM in batches and uploaded to AWS using multipart upload. We built a simple application to do exactly that using Java 8 and the AWS Java SDK, and made it configurable in the size of each batch and how many batches are created in parallel. Here I have to admit my knowledge about parallel execution is still not deep enough. We ran the small application on an EC2 c3.4xlarge instance, which has 16 cores and 30 GB RAM. We gave the Java generator app 15 GB of heap and ran it with 10 threads; each thread creates a 500 MB file in RAM and uploads it to AWS. And now here's the mystery: with 10 threads, the time each thread took to generate the 500 MB byte array output stream was about 7-8 times slower than when working single-threaded.
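To give an idea of the approach, here's a minimal sketch of the in-RAM batch generation with a fixed thread pool. The column names, row counts, and sizes are made up for illustration, and the actual S3 multipart upload (done via the AWS Java SDK in the real app) is only indicated by a comment:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CsvBatchGenerator {

    // Generate one CSV batch entirely in memory.
    // Columns (userId, clicks, amount) are hypothetical.
    static byte[] generateBatch(int rows, long seed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Random rnd = new Random(seed);
        for (int i = 0; i < rows; i++) {
            String line = String.format("%d,%d,%.2f\n",
                    rnd.nextInt(1_000_000),   // userId
                    rnd.nextInt(100),         // clicks
                    rnd.nextDouble() * 10.0); // amount
            byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        int threads = 3; // the sweet spot we ended up with
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int batch = 0; batch < threads; batch++) {
            final long seed = batch;
            pool.submit(() -> {
                byte[] data = generateBatch(1_000, seed);
                // In the real app, this buffer is streamed to S3 via the
                // SDK's multipart upload instead of being discarded here.
                System.out.println("batch " + seed + ": " + data.length + " bytes");
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

The point of the pool is simply to keep several batches in flight at once; as described above, we never figured out why more than ~3 threads made each individual batch slower.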

We parallelized thinking we'd use the idle resources, but it got slower instead. We weren't using the hard drive, it was all in RAM, and we had enough CPU, but somehow it didn't work. After a little tweaking, we found that with about 3 threads we retain the original single-threaded speed, so something was wrong there, and we don't know why. Since generating the data wasn't the main goal of our innovation time, we put the mystery aside for a while and reduced the instance to a c3.2xlarge with 8 cores and 15 GB RAM, because all that extra CPU and RAM couldn't be used anyway. Using 3 threads and a 10 GB heap for the application, it took 12 hours to generate 1 TB of data dynamically and upload it to AWS via multipart upload.

And we were past the biggest hurdle, which is getting the data there. Now we just needed to create the schema and use the Redshift COPY command to pull the data in from S3. The process was pretty straightforward: we used the userId as the distribution key and added a few other sort keys that made sense. It took 20 hours for the COPY command to import the 1 TB of data from S3 into the cluster. We used 10 nodes of dw2.large; each node has 0.16 TB of SSD storage, 15 GB RAM, and 2 CPUs with a total of 7 EC2 compute units.
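The setup looked roughly like this. The table and column names here are placeholders (the real schema mirrors our production table), and the credentials clause is elided:

```sql
-- Hypothetical schema: userId as distribution key,
-- plus sort keys matching the common query pattern.
CREATE TABLE clicks (
    user_id     BIGINT NOT NULL,
    campaign_id INTEGER,
    clicks      INTEGER,
    amount      DECIMAL(12,2),
    created_at  TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (user_id, created_at);

-- Load the generated CSV files from S3; Redshift parallelizes
-- the COPY across all files under the prefix.
COPY clicks
FROM 's3://my-bucket/generated-data/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ',';
```

Distributing on user_id means all rows for one user land on the same node, which is exactly what our group-by-user aggregation wants.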

We have about 15 columns, and I wrote a query over all records: a sum on 2 columns, an average on 1 column, and a count, all grouped by the user id. It took 8 seconds. It took 8 seconds! I was a little taken aback that it could aggregate 1 TB of data in 8 seconds. Then we counted all the records: there are 1.2 billion entries in there. That was a bit of a letdown for me, because I didn't know 1 TB would amount to only 1.2 billion entries. My expectation was a few times more; 1.2 billion entries doesn't really sound like big data. Some companies I've read about generate 30 billion rows a week (although not necessarily in the same table).
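The 8-second query was of this shape (column names are again placeholders, not the real schema):

```sql
-- Full-table aggregation over all 1.2 billion rows:
-- two sums, one average, and a count per user.
SELECT user_id,
       SUM(clicks)   AS total_clicks,
       SUM(amount)   AS total_amount,
       AVG(duration) AS avg_duration,
       COUNT(*)      AS row_count
FROM clicks
GROUP BY user_id;
```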

I am not a DBA, and I don't know what the usual number of entries in a very big table is. But given the specs of the Redshift cluster, deep in my heart I think Elasticsearch aggregations would also be able to perform this feat. Imagine 10 nodes of the same spec running Elasticsearch. Even more interesting: could we query at the same speed with 10 sharded and tuned MySQL, MSSQL, or PostgreSQL instances? Man, this stuff is interesting.

Don't get me wrong, I'm not dissatisfied with Redshift. I am sure Redshift is very fast; the letdown is that I don't feel we've reached its limit yet. I really want to push it and break it. Only when I see the real limit of Redshift will I be satisfied. So the quest still has to continue..

p.s. If I update this post, I will give more specific specs of the results.