Redshift vs. BigQuery

You could get better results with Redshift if:

  • Your business uses data warehouses for everyday operations
  • Your business has consistently predictable workloads and operations
  • Your business needs a simpler solution in terms of cost and management
  • Your workflow is not based on other Google technologies

You could see better results with BigQuery if:

  • You use data warehouses for data mining operations
  • Data warehousing is not a big part of your daily operations
  • Your business takes on a lot of projects with wildly different conditions
  • Your business changes operations often, making workflows less consistent
  • You already rely on other Google platforms for business processes

3 Key Differences Between Redshift and BigQuery

  1. Amazon Redshift is provisioned on clusters and nodes. Google BigQuery is serverless.
  2. Redshift supports up to 1,600 columns in a single table; BigQuery supports 10,000.
  3. Redshift requires periodic management tasks like vacuuming tables; BigQuery handles management automatically.

What is Redshift?

Redshift is Amazon’s data warehouse, and it’s part of Amazon’s massive overall cloud platform, AWS.

Amazon acquired the source code for Redshift from ParAccel, which was developing the ParAccel Analytic Database — a PostgreSQL-based database that used columnar data organization. So, Redshift is an MPP data warehouse built on a PostgreSQL fork.

While Redshift shares many commonalities with PostgreSQL (such as its relational model), it is unique in its columnar structure. It doesn’t support indexes, instead using distribution styles and keys for data organization. Amazon also built a query execution engine for Redshift that differs from PostgreSQL’s.

The important thing to note about Redshift being built on a PostgreSQL fork is that it retains some transactional qualities, making it a hybrid database of sorts. Redshift can still roll back transactions, which is a relatively rare feature in the data warehouse market.

More than 10,000 companies use Redshift, according to HG Insights. Customers include well-known brands such as McDonald’s, Pfizer, and Lyft.

A Redshift database is a cluster of worker database nodes. These nodes are able to split up a query, execute its constituent parts in parallel, then combine the results. As a result, you can horizontally scale performance by adding nodes to your cluster. If you check out the Redshift pricing page, you’ll see that adding a node gives you more processing power as well as more storage.

This is sort of a problem though: sometimes you need a lot of storage space but don’t need the extra processing power. The converse may also be true. This coupling of “storage” and “compute” is one of the main drawbacks of Redshift compared to BigQuery.

Redshift comes with a bunch of configurations to optimize query performance, like sort keys and dist keys. You can also configure compression for your columns, or perform table maintenance tasks like vacuum and analyze. It ends up being close to a full-time job to keep Redshift humming along, and honestly, I think that time could be better spent actually doing analytics. BigQuery doesn’t have as many bells and whistles, but you end up not really missing them. Maintaining Redshift can be a real chore.
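To make the routine concrete, a nightly maintenance job mostly boils down to issuing VACUUM and ANALYZE per table. A minimal sketch (the table names are hypothetical; in practice you would execute each statement over your Redshift connection):

```python
# Sketch of a nightly Redshift maintenance job. On Redshift, VACUUM
# reclaims space from deleted rows and re-sorts data; ANALYZE refreshes
# the statistics the query planner relies on.

def maintenance_statements(tables):
    """Yield the VACUUM/ANALYZE statements to run for each table."""
    for table in tables:
        yield f"VACUUM FULL {table};"  # reclaim space and re-sort rows
        yield f"ANALYZE {table};"      # refresh planner statistics

if __name__ == "__main__":
    for stmt in maintenance_statements(["events", "orders"]):
        print(stmt)  # in practice: cursor.execute(stmt)
```

This is the kind of housekeeping BigQuery simply does for you behind the scenes.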

What is BigQuery?

BigQuery is Google’s data warehouse, and it’s part of Google’s massive overall cloud architecture, Google Cloud.

BigQuery was one of the first major cloud data warehouses on the market, following in the columnar footsteps of C-Store and MonetDB. Under the hood, BigQuery executes Dremel (a query engine developed by Google for read-only nested data that supports an SQL-like syntax) over a REST interface.

More than 3,000 companies use BigQuery, including The New York Times, HSBC Bank, and Spotify, according to HG Insights.

BigQuery is, frankly, bad at doing joins. The recommended way to handle one-to-many relationships is through repeated records. Repeated records are a radically different data storage pattern that Google essentially made up. They are incredibly useful and they work very well, but you’ll need to internalize this feature of BigQuery to make effective use of its power.
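To illustrate the difference, here is a sketch of a one-to-many relationship (a hypothetical order with line items): BigQuery’s repeated records keep the children nested inside one row, while a join-based warehouse like Redshift needs the data flattened into one row per child.

```python
# One order with repeated line items, roughly how BigQuery's repeated
# records model it: a single row whose "items" field is a list of records.
order = {
    "order_id": 1001,
    "customer": "alice",
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

def flatten(order):
    """Flatten the nested record into the one-row-per-item shape a
    join-based warehouse would store, repeating the order-level columns."""
    return [
        {"order_id": order["order_id"], "customer": order["customer"], **item}
        for item in order["items"]
    ]

rows = flatten(order)
# rows holds two flat rows, each duplicating order_id and customer
```

With repeated records, the nesting replaces the join entirely; with the flat shape, you would join an orders table to an items table at query time.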

Notice: Updating Tables Must Use a Staging Table

Redshift can do upserts, but only via a convoluted process. Redshift doesn’t provide an UPSERT command to update a table. Users must load data into a staging table and then join the staging table with the target table to run an UPDATE statement and an INSERT statement.
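The staging-table pattern can also be done as a delete-then-insert pair, which is a common variant of the same idea. A minimal sketch using an in-memory SQLite database as a stand-in (table and column names are hypothetical; the two DML statements are written so the same pair would also be valid on Redshift):

```python
import sqlite3

# Demo of the staging-table "upsert" pattern on an in-memory SQLite DB.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE target  (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO target  VALUES (1, 'old'), (2, 'keep');
    INSERT INTO staging VALUES (1, 'new'), (3, 'added');
""")

# 1) delete the target rows that are about to be replaced
con.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
# 2) insert every staged row (both the updates and the brand-new rows)
con.execute("INSERT INTO target SELECT * FROM staging")
con.execute("DROP TABLE staging")  # the staging table is throwaway

rows = sorted(con.execute("SELECT id, val FROM target"))
# rows == [(1, 'new'), (2, 'keep'), (3, 'added')]
```

Compare this with a single-statement upsert in databases that support one; on Redshift the whole dance has to happen inside one transaction to keep readers consistent.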

Price: Redshift vs. BigQuery

Redshift

Redshift’s pricing model is extremely simple. With Redshift, you can choose between Dense Compute and Dense Storage nodes. The cheapest node you can spin up, a dc2.large with 160GB of storage, will cost you $0.25 per hour. Dense Storage runs at $0.425 per TB per hour, and this cost covers both storage and processing. So, the lowest price you can get on Redshift is $306 per TB per month.

And, you can pay upfront for massive discounts.

This makes Redshift interesting to work with. If you can calculate your run-times and how often you’ll need to spin up each node, you can cut costs dramatically, especially if you pay upfront. Since most businesses aren’t going to be running their Redshift nodes constantly, getting granular is usually in your best interest.

For example, you may only run Redshift during the day when people are interacting with your stack or service. If that’s the case, you can adjust your upfront buying habits to reflect that behavior.

BigQuery

BigQuery’s pricing is much more complicated. On the surface, BigQuery looks cheaper. Storage costs $20 per TB per month, a good $286 per TB cheaper than Redshift. But BigQuery charges separately for storage and querying: queries cost $5 per TB scanned. So, while storage is cheaper, query costs can add up quickly.

There are some pros and cons to this model. Really, BigQuery is perfect for a certain type of customer. Let’s say your business deals with spiky workloads: you run rapid queries a few times a day. BigQuery would be a far better option, since with Redshift you pay by the hour whether or not you’re querying. BigQuery may also be the best solution for data scientists running ML or data mining operations, since those produce extremely large, spiky workloads.

Who Wins?

BigQuery costs $20 per TB per month for storage, plus $5 per TB of data processed by queries against that storage.

Redshift costs $306 per TB per month for storage AND unlimited processing on that storage.

There isn’t a clear winner here. Redshift is more economical for everyday data warehouse operations for most businesses. But BigQuery is better for businesses doing data mining or those with highly variable workloads.
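The trade-off reduces to simple arithmetic. A sketch using the prices quoted above (these are the article’s figures, not current list prices): for each TB stored, BigQuery stops being cheaper once monthly query volume crosses a break-even point.

```python
# Break-even between the two pricing models quoted above:
# Redshift bundles storage + unlimited querying at $306/TB/month;
# BigQuery charges $20/TB/month for storage plus $5 per TB scanned.
REDSHIFT_PER_TB = 306.0
BQ_STORAGE_PER_TB = 20.0
BQ_QUERY_PER_TB = 5.0

def bigquery_monthly(storage_tb, scanned_tb):
    """Monthly BigQuery bill for given storage and query volume."""
    return storage_tb * BQ_STORAGE_PER_TB + scanned_tb * BQ_QUERY_PER_TB

def breakeven_scanned_tb(storage_tb=1.0):
    """TB scanned per month at which BigQuery matches Redshift's cost."""
    return storage_tb * (REDSHIFT_PER_TB - BQ_STORAGE_PER_TB) / BQ_QUERY_PER_TB

print(breakeven_scanned_tb())  # 57.2 TB scanned/month per TB stored
```

In other words, with these prices you would have to scan roughly 57× your stored data every month before Redshift’s flat rate wins, which is why the answer depends so heavily on workload shape.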

Performance Benefits of Each Platform

In our experience with clients, Redshift is great at handling everyday business processes. This means spinning up nodes during work hours for BI tools and interfaces. It’s less expensive, has plenty of power to handle semi-complex schemas, and it’s easy to use.

BigQuery is great at handling niche business workloads that query big chunks of data in a small timeframe, and at serving data scientists doing ML and data mining.

In many cases, the difference between the two is going to depend on your Redshift resources. If you’re paying for a single dc2.large node, BigQuery is most likely going to outperform Redshift. But if you’re spinning up an expensive 8-node dc1.8xlarge cluster, Redshift is probably going to outperform BigQuery.

Manageability: Redshift vs. BigQuery

  1. Data types/updates and deletes
  2. Ease-of-use
  3. Security
  4. Integrations

Data Types/Updates and Deletes

Redshift supports standard SQL data types, while BigQuery works with standard SQL data types plus a small range of non-standard ones. One of the biggest benefits of BigQuery is that it treats nested data as a first-class citizen thanks to Dremel. With Redshift, you have to flatten out your data before running a query.

Both of them can handle updates and deletes when something goes wrong in a query. Since BigQuery and Redshift are append-oriented, many assume they can’t do updates and deletes. They can. On BigQuery, updates and deletes exist, but they’re relatively expensive and your options are limited, so they aren’t widely used. With Redshift, you can reclaim space with PostgreSQL-style vacuuming (which has its own host of complications), so update and delete support is typically better on Redshift.

Also, Redshift gives users the ability to roll back transactions, which BigQuery doesn’t.

Ease-of-Use

Out-of-the-box, BigQuery is much simpler to use than RedShift. You don’t have to perform many tweaks, cluster management is a breeze, and the complexities of database config, etc. are handled by BigQuery.

Security

When it comes to security, both systems are comparable. Redshift uses AWS IAM for identity, and BigQuery uses Google Cloud IAM. Both services work well for almost all business scenarios. Google does have great B2B identity management with OAuth, so you can give third parties identity controls without introducing them into your entire ecosystem.

Integrations

Both Google and Amazon (unsurprisingly) have a wealth of integrations available. Redshift is built on a PostgreSQL fork, so most tools that can talk to PostgreSQL can also connect to it, while BigQuery plugs naturally into the rest of the Google Cloud ecosystem.

Real Test Deployment on CiHMS

ETL using DMS

  1. One t3.micro EC2 instance hosts the DMS instance and handles around 6 tasks; each task can handle backfill and CDC for all tables
  2. During backfill, DMS easily uses all of the EC2 instance’s resources, and Redshift’s as well
  3. Once it switches to CDC, EC2 usage drops to 10–20%
  4. Redshift also peaks during backfill, with CPU easily reaching 80–90%
  5. Redshift supports elastic resize (total time around 10–20 minutes), although the node types are limited (only around 5 types)
  6. Redshift with 4 r3.xlarge nodes ($0.30/node/hour, around $9xx/month) sits at around 50% CPU in CDC-only mode
  7. Elastic resizing limits node-type choices (only around 4 or 5), and scaling up scales both compute and storage together (in our case we need compute much more than storage)
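As a quick sanity check on item 6, the monthly figure follows directly from the node price (assuming on-demand pricing and 24/7 operation; exact prices vary by region):

```python
# Sanity check on the cluster cost above: 4 nodes at ~$0.30 per node-hour,
# running around the clock for an average month (~730 hours).
nodes = 4
price_per_node_hour = 0.30
hours_per_month = 730  # 24 * 365 / 12

monthly_cost = nodes * price_per_node_hour * hours_per_month
print(f"${monthly_cost:.0f}/month")  # $876 with these assumptions
```

That lands in the same ballpark as the “$9xx/month” quoted above; reserved-instance pricing or a 31-day month shifts the number somewhat.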

Notice on ETL

  1. All table/column names are changed to lowercase
  2. Updates and deletes are processed automatically, so no DWH layer is needed
  3. CPU is quite high during backfill; it goes back to normal in CDC, but still hovers around 5x%
  4. Both ‘vacuum’ and ‘analyze’ commands need to run frequently, or performance will suffer heavily
  5. Redshift does not have a binary data type, so all joins on binary IDs need to be changed in code
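One workaround for item 5, sketched below: hex-encode binary IDs before loading, so the join key becomes a plain VARCHAR and equal bytes always compare equal as strings. (The 16-byte UUID key and function name are hypothetical examples, not part of the actual migration code.)

```python
import uuid

# Redshift (as tested here) has no binary column type, so a 16-byte
# binary id from the source database has to be stored as text.
binary_id = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes

def to_redshift_key(raw: bytes) -> str:
    """Encode a binary id as the lowercase hex string we load instead."""
    return raw.hex()

key = to_redshift_key(binary_id)
# key == "12345678123456781234567812345678" (32 hex chars for 16 bytes)
```

The encoding has to be applied consistently on both sides of every join, which is exactly why the SQL had to be rewritten in code.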

Notice on Query

  1. Queries are 4 to 5 times slower than BigQuery even with the cluster in ‘idle’ mode (4–5 min vs. 40–50 sec)
  2. In ‘load’ mode, it is much slower still

Notice on Dbt

  1. Only the adapter needs to change, from BigQuery to Redshift
  2. A service account with full rights to read/write the tables must be provided
  3. Because the DWH is now available at all times, we propose switching from table materialization to view materialization, changing the concept from scheduled preparation to on-demand requests
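In dbt, the switch described in item 3 is a one-line materialization change. A hypothetical model sketch (the model and column names are made up for illustration):

```sql
-- Hypothetical dbt model: switching from table to view materialization
-- turns a scheduled build into an on-demand query against the sources.
{{ config(materialized='view') }}

select booking_id, checkin_date, total_amount
from {{ ref('stg_bookings') }}
```

With `materialized='view'`, the warehouse does the work when a report queries the view, instead of on the dbt run schedule, which is what shifts cost and latency to request time.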

Notice on Visualizer

  1. JasperReports can use the JDBC adapter (you set up everything, including the service account and key) or the AWS data source adapter (it takes care of everything; you only provide the service account)
    https://community.jaspersoft.com/wiki/what-difference-between-aws-data-source-and-jdbc-connection
  2. For Power BI, AWS provides a native Redshift adapter
    https://learn.microsoft.com/en-us/power-bi/connect-data/desktop-connect-redshift

Conclusion on Replacing BigQuery with Redshift

  1. Some maintenance commands (‘analyze’, ‘vacuum’) need to run periodically, or after important events, to keep performance up
  2. All dbt transformations need to be rewritten, focusing on:
    1. Table and column names
    2. Throwing away the DWH layer
    3. Rewriting any logic that joins on binary IDs
  3. In case of switching the model from scheduled preparation to on-demand requests, we need to:
    1. Change materialization from table to view
    2. Cross-check whether JasperReports can issue a query and wait a long time (around 4 to 5 minutes); if not, the backend needs to act in a request-and-cache mode
    3. Limit the number of concurrent queries
  4. Deadlock and bottleneck problems can happen, especially with multiple concurrent queries or when running mass insert/update commands (backfill mode). The more nodes, the fewer deadlocks, and the more we pay
  5. Testing on one of our warehouses (around 1TB in size, queried roughly once an hour), with the same budget we get a Redshift system that is around 5 to 6 times slower than BigQuery. The difference will vary with warehouse size and query frequency. Notably, on Redshift you pay one lump sum, while on BigQuery the cost skyrockets once we increase query frequency to maintain data freshness

#bigdata #comparison #bigquery #redshift
