12 Amazon Redshift Query Optimization Techniques

published on 09 October 2024

Want lightning-fast Redshift queries? Here's your cheat sheet:

  1. Design tables smartly
  2. Load data efficiently
  3. Write better SQL
  4. Use materialized views
  5. Manage workloads
  6. Cache query results
  7. Update statistics
  8. Run VACUUM commands
  9. Read query plans
  10. Leverage column storage
  11. Use Redshift Spectrum
  12. Scale for multiple users

Quick wins:

  • Use ZSTD encoding to slash storage by 50%+
  • Run ANALYZE when "stats off" hits 10%
  • Set up WLM with 8 queues for query management

Technique | Benefit | Potential Drawback
Table Design | Faster queries | Needs upfront planning
Better Loading | Less disk usage | May change existing processes
Smarter SQL | Uses fewer resources | Requires ongoing review
Materialized Views | Pre-computes complex math | Uses more storage
Workload Management | Prioritizes key queries | Can be tricky to set up

Remember: Monitor with Redshift Advisor to boost efficiency and cut costs.

1. Good Table Design

Table design can make or break your Amazon Redshift queries. It's all about setting up your tables for fast, efficient data access.

Here's what matters:

1. Sort Keys

Think of sort keys as your book's index. They help Redshift find data fast.

Two types:

  • Compound - best when queries filter on a consistent prefix of the key columns
  • Interleaved - gives each key column equal weight, but costs more to maintain

Pro tip: For compound keys, start with the column that has the fewest unique values. This helps Redshift skip more data blocks when scanning.
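
Here's a minimal sketch, assuming a hypothetical web_events table where region has far fewer distinct values than event_date:

-- Hypothetical table: lead the compound key with the low-cardinality column
CREATE TABLE web_events (
  event_date DATE,
  region VARCHAR(20),
  user_id BIGINT,
  page_url VARCHAR(500)
)
COMPOUND SORTKEY (region, event_date);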

2. Distribution Keys

These spread your data across the cluster. Choose based on how you join tables:

Style | Use Case
EVEN | Big fact tables, no clear key
KEY | Fact and dimension tables you join often
ALL | Small dimension tables joining with big fact tables

Example: If you're always joining 'sales' and 'customer' tables on customer ID, use that as the distribution key for both.
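
As a minimal sketch (hypothetical schemas, column lists trimmed):

-- Both tables distribute on customer_id, so matching rows land on the same node
CREATE TABLE customer (
  customer_id BIGINT,
  name VARCHAR(100)
)
DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE sales (
  sale_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);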

3. Compression

Compression saves space and speeds up I/O. But leave the first column of a compound sort key uncompressed (RAW) - compressing it can force Redshift to scan more blocks than necessary.

Tip: Use ANALYZE COMPRESSION to find the best fit for each column.
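
For example, on a hypothetical orders table (heads up: this command takes an exclusive table lock, so run it in a quiet window):

-- Hypothetical table; returns suggested encodings per column
ANALYZE COMPRESSION orders;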

Remember: Changing a distribution key later means ALTER TABLE ... ALTER DISTKEY (which rewrites the table) or a full deep copy. Plan carefully!

To check your design:

  • Use PG_TABLE_DEF for column info
  • Check SVV_TABLE_INFO for data distribution and stats

Seeing skewed data or lots of unsorted rows? Time to rethink your design.
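
This query against SVV_TABLE_INFO surfaces the worst offenders:

-- Tables with high skew_rows or unsorted percentages need attention
SELECT "table", skew_rows, unsorted, stats_off
FROM svv_table_info
ORDER BY skew_rows DESC;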

2. Better Data Loading

Loading data into Amazon Redshift can make or break your query performance. Here's how to do it right:

Use COPY, not INSERT

COPY is your go-to for bulk inserts. It's way faster than INSERT for lots of rows because it:

  • Uses all compute nodes in parallel
  • Supports multiple data sources (S3, DynamoDB, EMR HDFS, SSH)
  • Handles compressed files (GZIP, LZO for CSV)
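
A minimal load from S3 might look like this (bucket and IAM role are placeholders):

-- Placeholder bucket and role ARN
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
GZIP;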

Size matters

For best results:

  • Use multiple small files, not one big one
  • Aim for 1MB-1GB per compressed file
  • Make the file count a multiple of your cluster's slice count

Find your slice count:

SELECT COUNT(*) AS number_of_slices FROM stv_slices;

Compress your data

Compression saves space and speeds up I/O. For CSV, use GZIP or LZO.

Use Change Data Capture (CDC)

Only copy new or changed data. This cuts disk usage, reduces redundant data, and minimizes VACUUM operations.

Tweak COPY command

For UPSERT operations, add:

COMPUPDATE OFF STATUPDATE OFF

These options skip automatic compression analysis and statistics updates during the load - unnecessary work when you're loading into a staging table for a merge.
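
For example, a staging load with those options (table name, bucket, and role are placeholders):

-- Placeholder names; load staging data without triggering compression analysis or ANALYZE
COPY staging_sales
FROM 's3://my-bucket/incremental/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV GZIP
COMPUPDATE OFF STATUPDATE OFF;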

ETL tools can help

For complex data loading, consider:

Tool | Good | Not so good
AWS Glue | Native AWS integration | Needs coding skills
Integrate.io | No-code, drag-and-drop | Third-party solution
Fivetran | AWS partner, automated | Possible extra costs

3. Writing Better Queries

Want to speed up your Amazon Redshift queries? It's all about smart SQL. Here's how:

1. Pick your columns

Don't use SELECT *. It's lazy. Choose what you need:

SELECT customer_id, order_date, total_amount
FROM orders
WHERE order_date > '2023-01-01';

2. Filter early

Use WHERE to cut the fat:

SELECT sales.salesid, listing.listid
FROM sales
JOIN listing ON sales.listid = listing.listid
WHERE sales.saletime > '2023-01-01'
  AND listing.listtime > '2023-01-01';

3. Join smart

INNER JOINs are usually faster. For small sets, try subqueries:

SELECT sum(sales.qtysold)
FROM sales
WHERE listid IN (
  SELECT listid
  FROM listing
  WHERE listtime > '2023-01-01'
);

4. CASE for complex stuff

Don't scan tables multiple times. Use CASE:

SELECT
  CASE
    WHEN total_amount < 100 THEN 'Low'
    WHEN total_amount < 1000 THEN 'Medium'
    ELSE 'High'
  END AS spend_category,
  COUNT(*) AS order_count
FROM orders
GROUP BY 1;

5. Match GROUP BY and ORDER BY

Keep the column order the same:

SELECT category, subcategory, SUM(sales)
FROM products
GROUP BY category, subcategory
ORDER BY category, subcategory;

6. EXPLAIN is your friend

See how your query runs:

EXPLAIN
SELECT * FROM sales
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31';

It's like X-ray vision for your queries. Use it to spot and fix slow ones.

4. Using Materialized Views

Think of materialized views in Amazon Redshift as pre-cooked meals for your queries. They store complex query results, saving you from number-crunching every time. Here's how to use them:

1. Create a materialized view

CREATE MATERIALIZED VIEW city_sales AS (
  SELECT st.city, SUM(sa.amount) as total_sales
  FROM sales sa, store st
  WHERE sa.store_id = st.id
  GROUP BY st.city
);

This view pre-calculates total sales for each city.

2. Query the view

SELECT * FROM city_sales WHERE total_sales > 1000000;

3. Keep it fresh

When base data changes, refresh the view:

REFRESH MATERIALIZED VIEW city_sales;
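
Or skip the scheduling entirely - Redshift can refresh for you. A sketch using the same hypothetical tables:

-- AUTO REFRESH lets Redshift refresh the view as base data changes
CREATE MATERIALIZED VIEW city_sales_auto
AUTO REFRESH YES
AS
SELECT st.city, SUM(sa.amount) AS total_sales
FROM sales sa
JOIN store st ON sa.store_id = st.id
GROUP BY st.city;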

Use materialized views for frequent queries, when you need quick results (but not real-time data), or to simplify complex joins or aggregations.

After Amazon Redshift introduced materialized views in March 2020, some users saw query times drop from 20 seconds to 900 milliseconds - a 96% improvement!

Here's a neat trick: Redshift can automatically rewrite your queries to use materialized views when possible. Let it do the heavy lifting for you.

5. Managing Workloads

Amazon Redshift's Workload Management (WLM) is your secret weapon for supercharging query execution. Without it? All queries fight for resources like kids in a candy store. Not pretty.

Here's how to tame the WLM beast:

  1. Create user logins: Give everyone their own key to the kingdom.
  2. Group users: Think of it like sorting laundry:
    • Loads (COPY and UNLOAD)
    • Transforms (INSERT, UPDATE, DELETE)
    • Ad-hoc (SELECT)
  3. Set up queues: Redshift gives you up to 8 queues sharing a total of 50 slots. But don't go crazy - keep total slots at 15 or fewer for best results.
  4. Embrace Automatic WLM: It's like having a traffic cop for your queries. It juggles memory and concurrency to keep things flowing.

Want to see Automatic WLM in action? Check out this JSON setup:

[ 
  { "user_group": ["ETL_users"], "priority": "highest", "queue_type": "auto", "auto_wlm": true }, 
  { "user_group": ["Dashboard_users"], "priority": "high", "queue_type": "auto", "auto_wlm": true }, 
  { "query_group": ["Adhoc_query"], "user_group": ["Analytics_users"], "priority": "normal", "queue_type": "auto", "auto_wlm": true }, 
  { "priority": "low", "queue_type": "auto", "auto_wlm": true }, 
  { "short_query_queue": true } 
]
  5. Turn on Concurrency Scaling: It's like having a backup generator for your queries. More power when you need it, less when you don't.
  6. Use Short Query Acceleration: Let those quick queries jump the line.
  7. Keep an eye on things: Watch your system throughput. It's the pulse of your WLM setup.

6. Caching Query Results

Think of Amazon Redshift's query result caching as a super-fast memory bank for your data queries. It's built-in and works automatically to speed up repeat queries.

Here's the gist:

  1. Run a query
  2. Redshift saves the results
  3. Run it again? Redshift checks for a cached version
  4. If found, you get instant results

The best part? It's on by default. No setup needed.

What's in it for you?

  • Lightning-fast cached queries (under 100ms)
  • Smart caching (only when data hasn't changed)
  • Great for dashboards and BI tools

Want to check if your queries are using the cache? Try this:

SELECT userid, query, elapsed, source_query
FROM svl_qlog
WHERE userid > 1
ORDER BY query DESC;

If source_query isn't null, you've hit the cache!

Need to turn off caching? Use this:

SET enable_result_cache_for_session TO off;

7. Updating Table Statistics

Think of table statistics as your database's GPS. Without them, your query planner is lost.

Here's why they're crucial:

  • They help Redshift allocate memory efficiently
  • They speed up query execution
  • They prevent performance hiccups

To update stats, use the ANALYZE command:

ANALYZE mytable;

For a specific column:

ANALYZE mytable (mycolumn);

When should you run ANALYZE? This query shows tables needing attention:

SELECT "schema" + '.' + "table" 
FROM svv_table_info 
WHERE stats_off > 10;

It lists tables with stats off by more than 10%.

Pro tip: After big data changes, run VACUUM then ANALYZE.

Action | Use Case
ANALYZE | After major data changes
VACUUM | Before ANALYZE if you've deleted or updated a lot

Redshift runs ANALYZE automatically on new tables and, in the background, on tables with stale stats. But don't count on it after big loads - run it yourself.


8. Using Vacuum Commands

Think of your Redshift database as a messy closet. Over time, it gets cluttered with "ghost rows" - deleted data that still takes up space. This slows down your queries. Enter the VACUUM command.

VACUUM does two things:

  1. Cleans up deleted rows
  2. Sorts data within each slice

Here's a quick look at VACUUM options:

Command | Purpose
VACUUM FULL | Clean and sort
VACUUM DELETE ONLY | Just clean
VACUUM SORT ONLY | Just sort
VACUUM REINDEX | Clean, sort, and reindex

When to run VACUUM:

  • After big DELETE or UPDATE operations
  • Weekly for your whole cluster
  • Daily for busy tables

To find tables needing a vacuum, run:

SELECT "schema" + '.' + "table" 
FROM svv_table_info 
WHERE unsorted > 10;

This shows tables with over 10% unsorted data.
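
You can also vacuum one table at a time and tune the sort threshold. By default, VACUUM skips sorting when a table is already 95% sorted; TO ... PERCENT changes that:

-- Sort the sales table until it's at least 99% sorted
VACUUM FULL sales TO 99 PERCENT;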

"Vacuuming can slow down ETL jobs and queries by up to 80%", warns a Redshift expert.

So, schedule VACUUM during quiet times.

9. Reading Query Plans

Query plans are roadmaps for your Redshift queries. They show how Redshift will execute a query, helping you spot performance issues. To get a query plan, use the EXPLAIN command:

EXPLAIN SELECT * FROM sales WHERE saledate = '2021-12-25';

This breaks down the query's execution steps. Let's look at a more complex example:

EXPLAIN SELECT eventname, COUNT(*) 
FROM event 
GROUP BY eventname;

QUERY PLAN
-------------------------------------------------------------------
XN HashAggregate  (cost=131.97..133.41 rows=576 width=17)
  ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=17)

What does this mean?

Operation | Description
XN HashAggregate | Groups results using a hash table
XN Seq Scan | Scans the entire 'event' table

The cost values are relative units, not milliseconds - compare them within a plan. Higher numbers? Costlier operations.

To decode query plans:

  1. Start from the bottom, most indented line
  2. Work your way up
  3. Look for high-cost operations
  4. Check if row counts match expectations

Row counts off? Might be outdated statistics. Run ANALYZE to update them.

Want deeper insights? Use system tables:

SELECT * FROM STL_QUERYTEXT 
WHERE query = your_query_id;

This shows your query's full text, helping you spot issues in complex queries.
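
For actual (not estimated) per-step metrics after a query runs, SVL_QUERY_SUMMARY helps - is_diskbased = 't' flags steps that spilled to disk:

-- Replace your_query_id with the query ID you're investigating
SELECT query, seg, step, rows, bytes, is_diskbased, label
FROM svl_query_summary
WHERE query = your_query_id
ORDER BY seg, step;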

10. Column-Based Storage

Amazon Redshift uses column-based storage to speed up queries. Instead of storing data by rows, it organizes it by columns. This makes data retrieval and analysis much faster.

Why is column-based storage so effective?

  1. It's efficient. Redshift only reads the columns it needs for a query.
  2. It compresses better. Similar data types are stored together.
  3. It's quick for aggregations. Calculations on specific columns are faster.

Let's look at a real-world example:

SELECT AVG(sales_amount) 
FROM sales 
WHERE sale_date > '2023-01-01';

For this query, Redshift only reads the sales_amount and sale_date columns. It ignores everything else in the sales table. This cuts down on data scanning big time.

Want to get the most out of column-based storage? Here's how:

  • Only include columns you need in your SELECT statements
  • Filter tables before joins using predicates
  • Use DISTKEY and SORTKEY for columns you query often

Here's a quick comparison of row-based vs. column-based storage:

Aspect | Row-Based | Column-Based
Data Organization | By record | By field
Query Performance | Better for full record retrieval | Better for analytical queries
Compression | Less efficient | More efficient
I/O Operations | More for analytical queries | Fewer for analytical queries

Column-based storage isn't just a fancy feature - it's a game-changer for data analysis in Redshift.

11. Using Redshift Spectrum

Redshift Spectrum is a powerful tool for querying massive datasets. It lets you run SQL on S3 data without loading it into Redshift tables. No more complex ETL processes!

Why use Spectrum?

  • Query S3 data directly
  • Pay only for what you query
  • Handle huge datasets without increasing cluster size

Here's a real example:

An online store keeps recent data in Redshift and older data in S3. With Spectrum, they analyze both together, spotting long-term trends without moving data around.

To get the most out of Spectrum:

1. Use columnar formats (Parquet, ORC)

2. Compress your data

3. Partition smartly
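
Getting there means registering an external schema and table first. A sketch (catalog database, role, bucket, and partition column are all placeholders):

-- Placeholder catalog database and role ARN
CREATE EXTERNAL SCHEMA s3_external_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Placeholder columns; partitions get registered later via ALTER TABLE ... ADD PARTITION
CREATE EXTERNAL TABLE s3_external_schema.lineitem_part_parq (
  l_orderkey BIGINT,
  l_extendedprice DECIMAL(12,2),
  l_discount DECIMAL(12,2),
  l_shipdate DATE
)
PARTITIONED BY (l_shipyear INT)
STORED AS PARQUET
LOCATION 's3://my-bucket/lineitem/';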

Check out this comparison:

Aspect | All in Redshift | Using Spectrum
Storage Cost | Higher | Lower
Query Speed | Fast | Fast (hot data), slightly slower (cold data)
Scalability | Limited | Nearly unlimited
Data Freshness | Immediate | Slight lag for S3 data

Spectrum shines when joining Redshift and S3 data. Here's a sample query:

SELECT l_orderkey, SUM(l_extendedprice * (1 - l_discount)) AS revenue, 
       o_orderdate, o_shippriority
FROM customer, orders, s3_external_schema.LINEITEM_PART_PARQ
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < DATE '1995-03-15'
  AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate
LIMIT 20;

This query mixes Redshift tables with an S3 table.

Keep in mind: Spectrum charges $5 per TB scanned. To save money:

  • Use efficient file formats
  • Compress data
  • Partition wisely

12. Scaling for Many Users

As your Redshift cluster grows, you'll need to handle more queries simultaneously. Here's how to scale effectively:

1. Use Workload Management (WLM)

WLM helps manage resources and prioritize queries. Set up to 8 queues with different properties:

Queue | Purpose | Concurrency | Memory
ETL | Data loading | 2 | 30%
BI | Reporting | 10 | 50%
Ad-hoc | Quick queries | 5 | 20%

2. Enable Concurrency Scaling

This adds extra processing power when needed. To turn it on:

  • Log into the AWS Console
  • Open your cluster's WLM configuration (in its parameter group)
  • Set "Concurrency Scaling mode" to "auto" on the queues that should scale

3. Use Short Query Acceleration (SQA)

SQA lets quick queries jump the queue. Enable it in WLM settings to speed up ad-hoc queries.

4. Consider Redshift Serverless

For unpredictable workloads, Redshift Serverless might be your best bet. It scales automatically and you only pay for what you use.

5. Monitor and Adjust

Keep tabs on your cluster's performance. Use AWS CloudWatch to track key metrics and tweak settings as needed.

Scaling isn't one-size-fits-all. Take ExampleCorp:

  • Enterprise Data Warehouse: Handles all write workloads and some reads
  • Marketing Data Mart: Uses manual WLM for predictable ETL and BI tasks
  • Auditor Data Mart: Uses Redshift Serverless for infrequent workloads

Mix and match these strategies to find what works for your setup.

Good and Bad Points

Let's break down the pros and cons of Amazon Redshift query optimization techniques:

Technique | Pros | Cons
Good Table Design | Boosts query speed; cuts data skew; better compression | Needs upfront planning; might need redesign as data grows
Better Data Loading | Faster data input; fewer disk usage spikes | Requires proper ETL setup; might change existing processes
Writing Better Queries | Uses fewer resources; speeds up queries | Needs ongoing query reviews
Using Materialized Views | Faster for repeat workloads; pre-computes complex math | Uses more storage; needs upkeep
Managing Workloads | Better resource use; prioritizes key queries | Can be tricky to set up right
Caching Query Results | Speeds up repeat queries; uses less compute power | Might give old data if not managed
Updating Table Statistics | Better query planning; overall speed boost | Needs regular upkeep
Using Vacuum Commands | Frees up space; faster queries | Can hog resources; needs scheduling during slow times
Reading Query Plans | Spots performance issues; guides optimization | Takes skill to understand
Column-Based Storage | Less disk I/O; better compression | Not great for OLTP work
Using Redshift Spectrum | Query S3 data; separate compute and storage scaling | Can cost more if not watched
Scaling for Many Users | Handles more queries; better user experience | Can cost more; needs careful monitoring

These techniques can boost performance, but they come with trade-offs. For example, caching speeds up queries, but needs upkeep. As Nathan Sullins from Udemy Tech Blog says:

"We do daily vacuuming and analyzing of our tables. We leverage scripts from AWS Labs to assist with this."

Choosing the right distribution style matters too. KEY distribution can speed up JOINs but might overload some nodes. EVEN distribution is great for table scans but not for joins.

Scaling in Redshift is flexible but needs careful management. Concurrency scaling can boost throughput up to 10x, but watch those costs. One user found out the hard way:

In a live production cluster, a user querying a table with 443,744 rows ran the query 374,372 times, resulting in 125 minutes of query time due to excessive queuing.

This shows why proper workload management and scaling strategies are crucial.

Wrap-up

Let's recap the key techniques for Amazon Redshift query optimization:

1. Table Design and Data Loading

Smart table design reduces skew and improves compression. Better loading means faster input and less disk usage. Check this out:

Technique | Impact
ZSTD encoding | Can cut storage by 50%+
CDC | Fewer rows ingested, smaller disk spikes

2. Query Writing and Workload Management

Write smarter queries and manage workloads. Amazon Redshift WLM offers 8 queues with 50 slots for query management.

"We vacuum and analyze tables daily. AWS Labs scripts help us out." - Nathan Sullins, Udemy Tech Blog

3. Materialized Views and Caching

Materialized views speed up repeat work. Caching uses less compute power. But watch out for stale data!

4. Statistics and Maintenance

VACUUM and ANALYZE keep performance high. Run ANALYZE when "stats off" hits 10%+.

5. Scaling and Spectrum

Redshift Spectrum queries S3 data directly. It's powerful, but mind the costs:

One user queried a 443,744-row table 374,372 times. Result? 125 minutes of query time due to queuing.

6. Monitoring and Optimization

Use Redshift Advisor. It watches your cluster and suggests ways to boost efficiency and cut costs.

FAQs

How to speed up queries in Redshift?

Want faster Redshift queries? Here's how:

1. Table Design

Pick the right sort key and distribution style. This can make your queries WAY faster.

2. Compress Data

Use automatic compression. It saves space AND speeds up I/O.

3. Use Constraints

Redshift doesn't enforce them, but the query optimizer uses primary and foreign key constraints as hints to make smarter plans.

4. Right-Size Columns

Smaller columns = faster queries. Simple as that.

5. Date/Time Data Types

For date columns, use date/time types. Not character types.
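
In a column definition, that looks like this (hypothetical table):

CREATE TABLE orders (
  order_id BIGINT,
  order_date DATE,       -- not VARCHAR(10)
  created_at TIMESTAMP   -- not CHAR(19)
);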

6. Write Better Queries

Only SELECT what you need. Use the right join types.

Here's a quick look at these tricks:

Technique | What it does
Sort Key | Scans less data
Distribution Style | Moves less data
Compression | Saves space, faster I/O
Constraints | Smarter query plans
Small Columns | Less storage, faster queries
Date/Time Types | Better for date filtering and math

Try these out. Your queries will thank you.
