Snowflake Summit 2023 - Instacart Cost Optimization

date
Jun 27, 2023
slug
snowflake-summit-2023-instacart-cost-optimization
author
status
Public
tags
Blog
summary
Instacart 50% Snowflake Cost Optimization Key Recommendations
type
Post
thumbnail
Cover-Image-Snowflake-Summit-June-2024@2x.png
updatedAt
Jul 7, 2023 08:20 AM
 

Introduction

In today's world, organizations are increasingly adopting cloud data warehousing solutions like Snowflake. With this adoption, it becomes crucial to ensure efficient resource utilization and cost management. In this blog post, we will delve into the key strategies implemented by Instacart to enhance observability, monitoring, and accountability within Snowflake. By implementing these strategies, you will be empowered to make informed decisions and optimize your data warehouse usage effectively.
 

Observability & Monitoring

 
The first step towards optimizing Snowflake is to look at the virtual warehouse level. Snowflake offers a range of warehouse sizes, from X-Small to 6X-Large, each with a different credits-per-hour rate. It is essential to choose the appropriate warehouse size for your workload to avoid under-utilization or over-provisioning.
 
 
To start, it is recommended to begin with an X-Small warehouse and scale up as needed. This approach allows for better cost control and resource allocation. When monitoring your Snowflake environment, working at millisecond granularity is crucial; Snowflake reports query execution times in milliseconds.
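As a minimal sketch of that starting point (the warehouse name and auto-suspend setting are illustrative, not from the talk):

-- Illustrative starting point: an X-Small warehouse that suspends quickly when idle.
create warehouse if not exists analytics_xs
  warehouse_size = 'XSMALL'
  auto_suspend = 60          -- suspend after 60 seconds of inactivity
  auto_resume = true
  initially_suspended = true;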
 
Instacart has developed an open-source DBT package that provides powerful tools for monitoring and managing Snowflake queries. This package enables you to gain insights into query performance and identify potential bottlenecks. Instacart uses the Query Fingerprinting technique to understand how to manage their warehouses effectively.
 
The idea behind query fingerprinting is to group together queries that are structurally identical but have different literal values in the WHERE clause. This technique allows for efficient tracking and optimization of similar queries.
 
hash(trim(REGEXP_REPLACE(trim(lower(query_text)), '([\n]|[,"\)\(|$:]|[0-9])|(''[^'']*'')', ''))) as query_fingerprint,
 
Proper query tagging using the Instacart DBT package can also accomplish this objective 👆🏼.
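As a rough sketch (not part of the Instacart package), the same fingerprint expression can be run against the ACCOUNT_USAGE.QUERY_HISTORY view to see which query shapes run most often; the seven-day window is arbitrary:

-- Sketch: group recent queries by fingerprint to find the most frequent query shapes.
select
    hash(trim(regexp_replace(trim(lower(query_text)),
        '([\n]|[,"\)\(|$:]|[0-9])|(''[^'']*'')', ''))) as query_fingerprint,
    count(*)                                           as total_times_run,
    any_value(query_text)                              as example_query
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -7, current_timestamp())
group by 1
order by total_times_run desc
limit 50;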
 

Assign Budgets

 
To successfully reduce costs by 50 percent, Instacart implemented budgeting for each pillar within the company. They assigned DRIs (Directly Responsible Individuals) to ensure responsible usage within each business unit.
 
Here are some suggestions for setting budgets:
  • Utilize the previous year's usage as a baseline.
  • Anticipate business growth when budgeting.
  • Consider future growth in the number of people and applications.
 
In addition, setting up alerting via Slack or other methods can notify business units when they are nearing their budgets. It is recommended to send alerts at various budget thresholds, such as 30%, 60%, 80%, 90%, 95%, and 100%.
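A minimal sketch of the check such an alert could run, assuming a hypothetical team_budgets mapping table (warehouse_name, monthly_credit_budget); the thresholds mirror the ones suggested above:

-- Sketch: month-to-date credit spend per warehouse vs. a hypothetical budget table.
with spend as (
    select
        warehouse_name,
        sum(credits_used) as mtd_credits
    from snowflake.account_usage.warehouse_metering_history
    where start_time >= date_trunc('month', current_date())
    group by 1
)
select
    s.warehouse_name,
    s.mtd_credits,
    b.monthly_credit_budget,
    round(100 * s.mtd_credits / b.monthly_credit_budget, 1) as pct_of_budget
from spend s
join team_budgets b using (warehouse_name)
where s.mtd_credits / b.monthly_credit_budget >= 0.30   -- lowest alert threshold
order by pct_of_budget desc;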
 

General Daily Observability

Instacart relies on key metrics for daily observability, including:
  1. Daily Spend vs Budget: This metric provides daily guidance on the spend incurred compared to the assigned budget for each business unit.
  2. Daily Warehouse Spend: It offers visibility into which warehouse is the top spender on a given day.
These two metrics serve as basic observability indicators and provide a general overview of the daily operations.
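Both can be derived from the ACCOUNT_USAGE views. For example, a sketch of daily warehouse spend (the 30-day window is arbitrary):

-- Sketch: credits per warehouse per day, to spot the top daily spenders.
select
    to_date(start_time)   as usage_date,
    warehouse_name,
    sum(credits_used)     as credits
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_date())
group by 1, 2
order by usage_date desc, credits desc;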
 
 

Optimizing Warehouse Utilization

 
To further drive down costs, optimizing the size of warehouses for each query or workload is essential. Snowflake's QUERY HISTORY feature allows you to view and expose query metrics.
 
Key metrics to consider when optimizing warehouse utilization include:
  • Query runtime: This metric helps identify the execution time of queries, enabling you to optimize their performance.
  • Bytes scanned: It provides insights into the amount of data processed by queries and can guide the need for a larger warehouse.
  • Spilling: Monitoring queries that spill data to local or remote storage helps identify inefficiencies and optimize performance.
 
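A rough sketch of pulling these metrics from the ACCOUNT_USAGE.QUERY_HISTORY view (the seven-day window is arbitrary):

-- Sketch: per-query runtime, bytes scanned, and spilling over the last 7 days.
select
    query_id,
    warehouse_name,
    warehouse_size,
    execution_time / 1000               as execution_seconds,
    bytes_scanned,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -7, current_timestamp())
  and execution_status = 'SUCCESS'
order by bytes_spilled_to_remote_storage desc;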
 

Grouping Queries into Buckets

 
Grouping queries into buckets based on query fingerprinting is a valuable technique used by Instacart. It involves categorizing queries into small, medium, and large buckets, similar to the Kubernetes T-Shirt sizing method. To assign queries to appropriate buckets, you can use the following formula:
 
total_weight = sum(credits_per_hour * execution_time)
 
By leveraging the total_weight, you can classify queries into two categories:
  • Expensive: The top 50 queries with the highest total_weight.
  • Frequent: The top 50 queries with the highest total_times_run.
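A sketch of this bucketing against QUERY_HISTORY; the size labels and credit rates assume standard (non-Snowpark-optimized) warehouses and should be verified against your account:

-- Sketch: rank query fingerprints by total_weight (credits/hour * execution time in hours).
with history as (
    select
        hash(trim(regexp_replace(trim(lower(query_text)),
            '([\n]|[,"\)\(|$:]|[0-9])|(''[^'']*'')', ''))) as query_fingerprint,
        execution_time / 1000 / 3600                       as execution_hours,
        case warehouse_size                                 -- standard warehouse credit rates
            when 'X-Small'  then 1   when 'Small'    then 2
            when 'Medium'   then 4   when 'Large'    then 8
            when 'X-Large'  then 16  when '2X-Large' then 32
            when '3X-Large' then 64  when '4X-Large' then 128
        end                                                as credits_per_hour
    from snowflake.account_usage.query_history
    where start_time >= dateadd('day', -7, current_timestamp())
)
select
    query_fingerprint,
    sum(credits_per_hour * execution_hours) as total_weight,
    count(*)                                as total_times_run
from history
group by 1
order by total_weight desc      -- order by total_times_run for the "Frequent" list
limit 50;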
 

Query Fingerprinting Metrics

Tracking metrics and utilizing query fingerprinting allows for significant cost reduction. Two key metrics to focus on for cost optimization are:
 
  1. Queries Bucketed by Runtime: This metric determines which queries should be moved to larger warehouses. Ideally, all queries should fall within the two fastest buckets (0-1 second and/or 1-10 seconds). If queries take longer, consider upgrading the warehouse to improve performance.
  2. Number of Queries Spilling to Remote Storage: Spilling happens when a warehouse cannot fit an operation in memory. Snowflake first tries to store the overflow temporarily on the local disk of the warehouse nodes; because this means extra I/O, any query that spills takes longer than a similar query on similar data whose operations fit in memory. If the local disk is not large enough to hold the spilled data, Snowflake then writes to remote cloud storage, which shows up in the query profile as "Bytes spilled to remote storage". We should monitor spilling because queries that spill to remote storage run slower, which means they use more credits and cost more money.
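A sketch that surfaces both metrics per warehouse from QUERY_HISTORY; the bucket boundaries mirror the ones above and the time window is arbitrary:

-- Sketch: bucket queries by runtime and count remote-storage spills per warehouse.
select
    warehouse_name,
    case
        when execution_time < 1000   then '0-1s'
        when execution_time < 10000  then '1-10s'
        when execution_time < 60000  then '10-60s'
        else '60s+'
    end                                               as runtime_bucket,
    count(*)                                          as query_count,
    count_if(bytes_spilled_to_remote_storage > 0)     as remote_spill_count
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -7, current_timestamp())
group by 1, 2
order by warehouse_name, runtime_bucket;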
 
 

Built-In Snowflake Features to Leverage

Query Profile

The Query Profile can also be your best friend for quickly cutting down costs. Whenever you see the “Step” number jump into the 1000s or 2000s, it means the query failed and was submitted again for retry.
This is a clear indicator that the query is too big for the warehouse, or that there are complexities that need to be simplified.
 
 

Most Expensive Nodes

The Most Expensive Nodes box in Snowflake provides insights into the operators (nodes) within a query that consume the most time. By analyzing this information, you can identify the parts of a query that can be optimized for better performance and cost efficiency.
 
For example:
  • Aggregate [9] is taking up 30 percent of the query time to process X billion rows into Y billion rows. You always want to look out for drastic increases in data size, which generally indicates a problem.
  • TableScan [21] and TableScan [15] are taking up 40 percent, so optimizing those two scans can significantly reduce the query time and bring costs down.
 
 

Profile Overview

The profile overview in Snowflake offers valuable insights based on the selected query graph. Analyzing this information can help identify areas for improvement, such as reducing remote disk I/O and optimizing synchronization and initialization time.
 
For example, here are the key takeaways from the selected query:
  • This query spent significant time on “Remote Disk I/O”. This means the query has to write out to disk and read back from S3, which we want to avoid when trying to keep costs down.
  • Synchronization and Initialization time is spent acquiring the compute nodes and setting up the cluster to run the query. For small queries that don’t need many nodes, this is wasted time; smaller warehouses with only a few nodes spin up more quickly than larger ones.
 
 
 

Statistics

Before scaling up to a larger warehouse, it is important to ensure that the current warehouse has been optimized as much as possible. Pay attention to the Bytes scanned metric, as it provides guidance on when a larger warehouse may be necessary: when Bytes scanned is consistently high, a bigger warehouse (with more total memory and I/O bandwidth) will help.
 
 
Furthermore, we should try to keep Bytes Spilled to local/remote storage to a minimum in order to improve query performance. There are two types of spilling:
 
  1. Local Storage Spilling: Spilling to local storage (the compute nodes’ SSDs) is generally okay; SSD access is fast, so some spilling here is acceptable.
  2. Remote Storage Spilling: Spilling to remote storage (S3 bucket storage) is extremely slow compared to local storage, so any spilling to remote storage will significantly reduce query performance.
 
Finally, if query optimization has already been done and remote spilling is still occurring, moving the job to a bigger warehouse will help significantly by improving the query time. We should expect the cost to come out roughly the same: the next warehouse size up costs twice as many credits per hour, but the query should finish in roughly half the time.
 
 

General Warehouse Sizing Recommendations

Instacart follows the sizing recommendations outlined below to determine the appropriate warehouse size:
notion image
It is generally advised not to assign warehouses based on teams or organizations. Starting with an X-Small warehouse and adjusting based on workload requirements provides better resource utilization.
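When a workload does outgrow its warehouse, resizing is a one-line change and can be stepped back down just as easily (the warehouse name is illustrative):

-- Sketch: move an existing warehouse up one size once the workload justifies it.
alter warehouse transform_wh set warehouse_size = 'SMALL';
show warehouses like 'transform_wh';   -- confirm the new size and current state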
 

Storage is Cheap but Costs Can Creep Up

Snowflake storage is generally cheap. Over time, though, this cost can creep up and become significant. Always clean up old tables that are not actively being used. Make sure you use Transient and Temporary tables in addition to Permanent tables to avoid surprises with Time Travel, which adds to storage costs. Time Travel retention should always be set at the table level and not at the schema level.
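A minimal sketch of both practices, with illustrative table names:

-- Sketch: a transient table has no Fail-safe period and at most 1 day of Time Travel,
-- and retention is set explicitly at the table level rather than inherited from the schema.
create transient table staging.events_tmp like analytics.events;
alter table analytics.events set data_retention_time_in_days = 1;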
 
 
 

Take Care of your Clones

Cloning a table doesn’t actually copy any data; a clone is a snapshot of the table’s micro-partitions at the time the clone is created. Over time, these clones can end up holding onto micro-partitions that have since been replaced with newer micro-partitions.
 
Until the cloned table is dropped, Snowflake will continue to hold onto the older micro-partition files. It’s best practice to alert on cloned tables that are aging and either drop or recreate the clone. Once a clone is recreated, its pointers are updated to the micro-partitions that are currently active on the parent table.
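A rough sketch of both steps, using the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view (the clone and table names are illustrative):

-- Find tables whose old micro-partitions are retained only because clones still reference them.
select
    table_catalog,
    table_schema,
    table_name,
    table_created,
    retained_for_clone_bytes
from snowflake.account_usage.table_storage_metrics
where retained_for_clone_bytes > 0
order by retained_for_clone_bytes desc;

-- Recreating a stale clone repoints it at the parent table's currently active micro-partitions.
create or replace table analytics.events_clone clone analytics.events;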
 

Conclusion

Snowflake isn’t expensive compared to other data warehouses; in most cases it’s quite the opposite. However, putting the right monitoring and controls in place, and educating users on best practices, is key to keeping costs down.