AWS Redshift — Automated Table Design
This article covers the automated table design feature of AWS Redshift, which handles data distribution for efficient storage, retrieval, and fast query processing.
Data Distribution in Redshift
A typical Redshift cluster has multiple nodes, each divided into slices, across which the data is distributed. When a query is executed, the query optimizer determines where blocks of data need to be located to best run the query.
The data is then physically moved, or redistributed, and this happens in parallel with the execution of the query. Redistribution can involve either sending specific rows to the nodes that need them for a join, or broadcasting an entire table to all of the nodes.
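As a sketch of how this surfaces in practice: the redistribution the planner chooses is visible in the query plan as DS_* attributes. The tables and columns below are hypothetical, invented just for illustration.

```sql
-- Hypothetical schema. The plan labels the join with a redistribution
-- attribute such as DS_DIST_NONE (no movement needed), DS_DIST_INNER
-- (rows of the inner table are sent to the matching nodes), or
-- DS_BCAST_INNER (the entire inner table is broadcast to all nodes).
EXPLAIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id;
```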
When you load data into Redshift, it distributes the rows to the compute nodes and their slices. How the data is distributed depends on the distribution style we choose while creating the table. Data distribution has two goals:
- Distribute the workload uniformly among the nodes in the cluster.
- Minimize data movement as a query runs.
If either of these is skewed, query performance suffers. By choosing the best distribution style, including the distribution key where applicable, we can balance the data distribution and significantly improve overall system performance.
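To make the styles concrete, here is a minimal sketch of the three distribution styles Redshift supports; the table and column names are hypothetical.

```sql
-- KEY: rows with the same user_id land on the same slice, so a join
-- on user_id needs no redistribution.
CREATE TABLE orders (
    order_id BIGINT,
    user_id  BIGINT,
    total    DECIMAL(12, 2)
)
DISTSTYLE KEY DISTKEY (user_id);

-- EVEN: rows are spread round-robin across slices; a reasonable
-- default when no single join key dominates.
CREATE TABLE events (
    event_id BIGINT,
    payload  VARCHAR(256)
)
DISTSTYLE EVEN;

-- ALL: a full copy of the table on every node; suited to small,
-- slowly changing dimension tables.
CREATE TABLE countries (
    code CHAR(2),
    name VARCHAR(64)
)
DISTSTYLE ALL;
```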
Another major consideration when distributing data is sorting, which is achieved through sort keys. Sort keys let Redshift scan the data efficiently and are especially useful for queries with range-restricted predicates.
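As a brief sketch, a sort key is declared at table creation time; the table and columns here are hypothetical, chosen to match the example that follows.

```sql
-- With a sort key on name, queries with range-restricted predicates
-- on name can skip blocks whose value ranges cannot match.
CREATE TABLE users (
    id   BIGINT,
    name VARCHAR(64)
)
SORTKEY (name);

-- A range-restricted predicate that benefits from the sort key:
SELECT id, name
FROM users
WHERE name LIKE 'R%';
```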
For instance, if I request the list of users whose names start with "R" and the data is not sorted, I need to go through all the data blocks (1 MB each in Redshift) and pick the rows starting with "R". Because names starting with "R" could be present in any block, I would most likely have to scan the whole data set.
On the other hand, if the data is sorted, we can keep metadata with each data block…