Snowflake Summit 2023 - Instacart Cost Optimization
Introduction
In today's world, organizations are increasingly adopting cloud data warehousing solutions like Snowflake. With this adoption, it becomes crucial to ensure efficient resource utilization and cost management. In this blog post, we will delve into the key strategies implemented by Instacart to enhance observability, monitoring, and accountability within Snowflake. By implementing these strategies, you will be empowered to make informed decisions and optimize your data warehouse usage effectively.
Observability & Monitoring
The first step towards optimizing Snowflake is to look at the virtual warehouse level. Snowflake offers a range of warehouse sizes, from X-Small to 6X-Large, each with a different credit-per-hour rate. It is essential to choose the appropriate warehouse size based on your workload requirements to avoid underutilization or over-provisioning. To start, it is recommended to begin with an X-Small warehouse and scale up as needed. This approach allows for better cost control and resource allocation.
When monitoring your Snowflake environment, leveraging millisecond granularity is crucial. Instacart has developed an open-source DBT package that provides powerful tools for monitoring and managing Snowflake queries, enabling you to gain insight into query performance and identify potential bottlenecks. Instacart uses a Query Fingerprinting technique to understand how to effectively manage their warehouses. The idea behind query fingerprinting is to group together queries that are identical except for the literal values in the WHERE clause, which allows similar queries to be tracked and optimized as a single unit:
hash(trim(REGEXP_REPLACE(trim(lower(query_text)), '([\n]|[,"\)\(|$:]|[0-9])|(''[^'']*'')', ''))) as query_fingerprint,
Proper query tagging using the Instacart DBT package can also accomplish this objective 👆🏼.
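Here's a rough sketch of how that fingerprint can be applied against the account usage views to see which query shapes dominate. The 7-day window and the ordering are illustrative choices, not part of Instacart's package:

```sql
-- Group recent queries by fingerprint to see which query "shapes" dominate.
-- Sketch only: window and ordering are illustrative.
WITH fingerprinted AS (
    SELECT
        hash(trim(REGEXP_REPLACE(trim(lower(query_text)),
            '([\n]|[,"\)\(|$:]|[0-9])|(''[^'']*'')', ''))) AS query_fingerprint,
        warehouse_name,
        total_elapsed_time,          -- milliseconds
        bytes_scanned
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
)
SELECT
    query_fingerprint,
    warehouse_name,
    COUNT(*)                       AS times_run,
    AVG(total_elapsed_time) / 1000 AS avg_runtime_seconds,
    SUM(bytes_scanned)             AS total_bytes_scanned
FROM fingerprinted
GROUP BY 1, 2
ORDER BY times_run DESC;
```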
Assign Budgets
To successfully reduce costs by 50 percent, Instacart implemented budgeting for each pillar within the company. They assigned DRIs (Directly Responsible Individuals) to ensure responsible usage within each business unit.
Here are some suggestions for setting budgets:
- Utilize the previous year's usage as a baseline.
- Anticipate business growth when budgeting.
- Consider future growth in the number of people and applications.
In addition, setting up alerting via Slack or other methods can notify business units when they are nearing their budgets. It is recommended to send alerts at various budget thresholds, such as 30%, 60%, 80%, 90%, 95%, and 100%.
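Here's a minimal sketch of such a threshold check, assuming a hypothetical team_budgets mapping table that holds each business unit's monthly credit budget per warehouse (the table, its columns, and the exact thresholds are illustrative):

```sql
-- Compare month-to-date credit usage per warehouse against a (hypothetical) budget table
-- and report which alert threshold has been crossed. Run this on a schedule and forward
-- the results to Slack with whatever alerting tool you already use.
WITH month_to_date AS (
    SELECT
        warehouse_name,
        SUM(credits_used) AS credits_used_mtd
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATE_TRUNC('month', CURRENT_DATE())
    GROUP BY 1
)
SELECT
    b.business_unit,
    m.warehouse_name,
    m.credits_used_mtd,
    b.monthly_credit_budget,
    ROUND(100 * m.credits_used_mtd / b.monthly_credit_budget, 1) AS pct_of_budget,
    CASE
        WHEN m.credits_used_mtd >= b.monthly_credit_budget        THEN '100%'
        WHEN m.credits_used_mtd >= 0.95 * b.monthly_credit_budget THEN '95%'
        WHEN m.credits_used_mtd >= 0.90 * b.monthly_credit_budget THEN '90%'
        WHEN m.credits_used_mtd >= 0.80 * b.monthly_credit_budget THEN '80%'
        WHEN m.credits_used_mtd >= 0.60 * b.monthly_credit_budget THEN '60%'
        ELSE '30%'
    END AS threshold_crossed
FROM month_to_date m
JOIN team_budgets b ON b.warehouse_name = m.warehouse_name   -- hypothetical mapping table
WHERE m.credits_used_mtd >= 0.30 * b.monthly_credit_budget;
```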
General Daily Observability
Instacart relies on key metrics for daily observability, including:
- Daily Spend vs Budget: This metric provides daily guidance on the spend incurred compared to the assigned budget for each business unit.
- Daily Warehouse Spend: This metric offers visibility into which warehouse is the top spender on a given day.
These two metrics serve as basic observability indicators and provide a general overview of the daily operations.
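The second metric can be pulled straight from the account usage views; here's a minimal sketch (the 30-day window is an illustrative choice):

```sql
-- Daily credit usage per warehouse, with each day's top spender ranked first.
SELECT
    TO_DATE(start_time) AS usage_date,
    warehouse_name,
    SUM(credits_used)   AS credits_used,
    RANK() OVER (PARTITION BY TO_DATE(start_time)
                 ORDER BY SUM(credits_used) DESC) AS spend_rank
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY usage_date DESC, spend_rank;
```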
Optimizing Warehouse Utilization
To further drive down costs, optimizing the size of warehouses for each query or workload is essential. Snowflake's QUERY HISTORY feature allows you to view and expose query metrics. Key metrics to consider when optimizing warehouse utilization include the following (a sketch for pulling them follows the list):
- Query runtime: This metric helps identify the execution time of queries, enabling you to optimize their performance.
- Bytes scanned: This metric provides insight into the amount of data processed by a query and can signal when a larger warehouse is needed.
- Spilling: Monitoring queries that spill data to local or remote storage helps identify inefficiencies and optimize performance.
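All three metrics are exposed on the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view; here's a minimal sketch for pulling them (the 1-day window and the limit are illustrative):

```sql
-- Per-query runtime, bytes scanned, and spill volumes for the last day.
SELECT
    query_id,
    warehouse_name,
    warehouse_size,
    total_elapsed_time / 1000 AS runtime_seconds,
    bytes_scanned,
    bytes_spilled_to_local_storage,
    bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
ORDER BY bytes_spilled_to_remote_storage DESC, runtime_seconds DESC
LIMIT 100;
```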
Grouping Queries into Buckets
Grouping queries into buckets based on query fingerprinting is a valuable technique used by Instacart. It involves categorizing queries into small, medium, and large buckets, similar to the Kubernetes T-Shirt sizing method. To assign queries to appropriate buckets, you can use the following formula:
total_weight = sum(credits_per_hour * execution_time)
By leveraging the total_weight, you can classify queries into two categories (a sketch of both lists follows):
- Expensive: The top 50 queries with the highest total_weight.
- Frequent: The top 50 queries with the highest total_times_run.
Query Fingerprinting Metrics
Tracking metrics and utilizing query fingerprinting allows for significant cost reduction. Two key metrics to focus on for cost optimization are:
- Queries Bucketed by Runtime: This metric determines which queries should be upgraded to larger warehouses. Ideally, all queries should fall within the two fastest buckets (0-1 second and/or 1-10 seconds). If queries take longer, consider upgrading the warehouse to improve performance.
- Number of Queries Spilling to Remote Storage: Spilling happens when a warehouse cannot fit an operation in memory. Snowflake first writes the overflow to the local disk of a warehouse node; because this means extra I/O, a query that spills will take longer than a similar query on similar data whose operations fit in memory. If the local disk is not sufficient to hold the spilled data, Snowflake then writes to remote cloud storage, which shows up in the query profile as "Bytes spilled to remote storage". Queries that spill to remote storage run slower, consume more credits, and therefore cost more money, so this metric is worth monitoring closely (both metrics are sketched below).
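Here's a minimal sketch of both metrics computed per query fingerprint (the extra 10-60s and 60s+ buckets and the 7-day window are illustrative additions):

```sql
-- Count queries per runtime bucket for each fingerprint; fingerprints whose queries
-- routinely land in the slower buckets, or spill to remote storage, are candidates
-- for a larger warehouse or for query optimization.
SELECT
    hash(trim(REGEXP_REPLACE(trim(lower(query_text)),
        '([\n]|[,"\)\(|$:]|[0-9])|(''[^'']*'')', ''))) AS query_fingerprint,
    CASE
        WHEN total_elapsed_time < 1000  THEN '0-1s'
        WHEN total_elapsed_time < 10000 THEN '1-10s'
        WHEN total_elapsed_time < 60000 THEN '10-60s'
        ELSE '60s+'
    END                                           AS runtime_bucket,
    COUNT(*)                                      AS query_count,
    COUNT_IF(bytes_spilled_to_remote_storage > 0) AS queries_spilling_to_remote
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY query_fingerprint, runtime_bucket;
```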
Built-In Snowflake Features to Leverage
Query Profile
The Query Profile can be your best friend for quickly cutting down costs. Whenever you see a "Step" number jump up into the 1000s or 2000s, it means the query failed and was submitted again for retry. This is a clear indicator that the query is too big for the warehouse, or that there are complexities that need to be simplified.
Most Expensive Nodes
The Most Expensive Nodes box in Snowflake shows which operators in the query consume the most time. By analyzing this information, you can identify the parts of a query that can be optimized for better performance and cost efficiency. For example:
- Aggregate [9] is taking up 30% of the query time to process X billion rows into Y billion rows. Always look out for drastic increases in data size, which generally indicate problems.
- TableScan [21] and TableScan [15] are taking up 40%, so optimizing those two scans can significantly reduce the query time and bring costs down.
Profile Overview
The profile overview in Snowflake offers valuable insights based on the selected query graph. Analyzing this information can help identify areas for improvement, such as reducing remote disk I/O and optimizing synchronization and initialization time.
For example, here are the key takeaways from the selected query:
- This query spent significant time on "Remote Disk I/O". This means the query is having to write out to disk and read from S3, which we want to avoid when trying to keep costs down.
- Synchronization and Initialization is time spent acquiring the compute nodes and setting up the cluster to run the query. For small queries that don't need many nodes, this is wasted time; smaller warehouses with only a few nodes are set up more quickly than larger ones.
Statistics
Before scaling up to a larger warehouse, it is important to ensure that the current warehouse has been optimized as much as possible. Pay attention to the Bytes scanned metric, as it provides guidance on when a larger warehouse may be necessary: as Bytes scanned grows, a bigger warehouse helps more. Furthermore, we should keep Bytes spilled to local/remote storage to a minimum in order to improve query performance. There are two types of spilling:
- Local Storage Spilling: Spilling to local storage (the compute nodes' SSDs) is generally okay, since SSD access is fast.
- Remote Storage Spilling: Spilling to remote storage (S3 bucket storage) is extremely slow compared to local storage, so any spilling to remote storage will significantly reduce query performance.
Finally, if query optimization has already been done and remote spilling is still occurring, moving the job to a bigger warehouse will help significantly by improving the query time, and we should expect the cost to come out roughly the same, since the larger warehouse costs more per hour but runs for less time.
General Warehouse Sizing Recommendations
Instacart follows a set of sizing recommendations to determine the appropriate warehouse size. It is generally advised not to assign warehouses based on teams or organizations; starting with an X-Small warehouse and adjusting based on workload requirements provides better resource utilization.
Storage is Cheap but Costs Can Creep Up
Snowflake storage is generally cheap, but over time this cost can creep up and become significant. Always clean up old tables that are not actively being used. Make sure you use Transient and Temporary tables, in addition to Permanent tables, to avoid surprises with Time Travel, which adds to storage costs. Time Travel retention should always be set at the table level and not at the schema level.
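As a small sketch of those two recommendations (the table names are illustrative; transient tables and DATA_RETENTION_TIME_IN_DAYS are standard Snowflake syntax):

```sql
-- Transient tables skip Fail-safe and are a good fit for easily rebuilt staging data.
CREATE TRANSIENT TABLE staging.events_scratch (
    event_id   NUMBER,
    event_time TIMESTAMP_NTZ,
    payload    VARIANT
);

-- Set Time Travel retention explicitly on the table (not the schema),
-- so long-lived tables don't silently accumulate retained history.
ALTER TABLE analytics.orders SET DATA_RETENTION_TIME_IN_DAYS = 1;
```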
Take Care of your Clones
Cloning a table doesn't actually copy any data; a clone is a snapshot of the table's micro-partitions at the time of the clone. Over time, these clones can end up holding onto micro-partitions that have since been replaced with newer micro-partitions. Until the cloned table is dropped, Snowflake will continue to hold onto the older micro-partition files. It's best practice to alert on cloned tables that are aging and either drop the clone or recreate it. Once a clone is recreated, its pointers are updated to the micro-partitions that are currently active on the parent table.
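Here's a sketch of both halves of that practice, assuming the standard columns on SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS (the 90-day threshold and the table names are illustrative):

```sql
-- Find clones older than ~90 days that are still holding storage.
SELECT
    table_catalog,
    table_schema,
    table_name,
    table_created,
    active_bytes
FROM snowflake.account_usage.table_storage_metrics
WHERE clone_group_id <> id                                  -- table was created by cloning
  AND active_bytes > 0
  AND table_created < DATEADD('day', -90, CURRENT_TIMESTAMP())
ORDER BY active_bytes DESC;

-- Recreating the clone points it at the parent's currently active micro-partitions.
CREATE OR REPLACE TABLE analytics.orders_clone CLONE analytics.orders;
```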
Snowflake isn’t expensive compared to others. In fact, it’s quite the opposite in most cases. However, putting in place the right monitoring and controls, as well as educating users around best practices is key to keeping costs down.