Questions
A BigQuery query scans 10TB and costs $50 per run. Optimize it to under $5.
The Scenario
A daily analytics report query costs $50 per execution:
-- Current query: scans 10TB, costs $50
SELECT
user_id,
COUNT(*) as event_count,
SUM(revenue) as total_revenue
FROM `project.analytics.events`
WHERE event_date >= '2024-01-01'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 1000;
The events table is 50TB total, with new data added daily. The query runs multiple times per day for different date ranges.
The Challenge
Reduce query costs by 90% using BigQuery optimization techniques: partitioning, clustering, materialized views, and query best practices.
A junior engineer might export data to a smaller table, add LIMIT expecting it to reduce the scan, or avoid BigQuery entirely. These approaches don't address the root cause: LIMIT doesn't reduce the data scanned, and exporting or leaving BigQuery gives up its analytical power.
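For example, both of the queries below process the same number of bytes against the hypothetical events table; LIMIT only truncates the result set after the scan has already happened:
-- Same bytes processed either way: LIMIT is applied after the full scan
SELECT user_id, revenue FROM `project.analytics.events`;
SELECT user_id, revenue FROM `project.analytics.events` LIMIT 1000;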
A senior engineer implements partitioning by date (so queries only scan relevant partitions), clustering on frequently filtered columns, materialized views for repeated aggregations, and query optimization techniques. They also consider BI Engine for sub-second queries.
Step 1: Understand Current Table Structure
-- Check table size and row count
SELECT
  table_id,
  ROUND(size_bytes / POW(10, 12), 2) AS size_tb,
  ROUND(row_count / POW(10, 9), 2) AS rows_billions
FROM `project.analytics.__TABLES__`
WHERE table_id = 'events';
-- The table DDL shows any existing PARTITION BY / CLUSTER BY clauses
SELECT table_name, ddl
FROM `project.analytics.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'events';
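If the option values themselves are needed (table expiration, partition-filter requirement, labels), the TABLE_OPTIONS view lists them; a minimal check:
-- Per-table options such as partition_expiration_days or require_partition_filter
SELECT option_name, option_value
FROM `project.analytics.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE table_name = 'events';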
-- Check which columns recent queries filter on (from query history)
SELECT
  query,
  total_bytes_processed,
  total_slot_ms
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (
    SELECT 1
    FROM UNNEST(referenced_tables) AS t
    WHERE t.table_id = 'events'
  )
ORDER BY total_bytes_processed DESC
LIMIT 20;
Step 2: Implement Partitioning
-- Create partitioned table
CREATE TABLE `project.analytics.events_partitioned`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM `project.analytics.events`;
-- Or add partitioning to existing table schema
-- (requires table recreation)
-- Set partition expiration for old data
ALTER TABLE `project.analytics.events_partitioned`
SET OPTIONS (
partition_expiration_days = 365,
require_partition_filter = true -- Prevent full scans!
);
Step 3: Add Clustering
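Before adding clustering, note a side effect of Step 2: with require_partition_filter = true, queries that omit a date predicate are rejected outright. A minimal illustration against the table created above:
-- Fails with an error along the lines of "Cannot query over table ... without a filter
-- over column(s) ... that can be used for partition elimination":
-- SELECT COUNT(*) FROM `project.analytics.events_partitioned`;
-- With a partition predicate, the query runs and scans only the matching day:
SELECT COUNT(*)
FROM `project.analytics.events_partitioned`
WHERE DATE(event_timestamp) = CURRENT_DATE();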
-- Clustering sorts data within partitions
-- Great for high-cardinality columns used in WHERE/JOIN
CREATE TABLE `project.analytics.events_optimized`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type, country
AS SELECT * FROM `project.analytics.events`;
-- Optimized query now scans much less data
SELECT
user_id,
COUNT(*) as event_count,
SUM(revenue) as total_revenue
FROM `project.analytics.events_optimized`
WHERE DATE(event_timestamp) >= '2024-01-01' -- Uses partition
AND user_id IN (SELECT user_id FROM active_users) -- Filters on the leading clustering column
GROUP BY user_id;
Step 4: Use Materialized Views
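Materialized views pay off when the same aggregation runs repeatedly. Query history gives a rough sense of how often that happens; the regular expression below is illustrative, not a precise match:
-- How often did report-style aggregations run in the last 7 days, and what did they scan?
SELECT
  COUNT(*) AS matching_queries,
  ROUND(SUM(total_bytes_processed) / POW(10, 12), 2) AS total_tb_scanned
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND REGEXP_CONTAINS(LOWER(query), r'(?s)sum\(revenue\).*group by user_id');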
-- Pre-compute common aggregations
CREATE MATERIALIZED VIEW `project.analytics.daily_user_stats`
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (
enable_refresh = true,
refresh_interval_minutes = 60
)
AS
SELECT
DATE(event_timestamp) as event_date,
user_id,
event_type,
COUNT(*) as event_count,
SUM(revenue) as total_revenue,
APPROX_COUNT_DISTINCT(session_id) as sessions -- exact COUNT(DISTINCT) is restricted in incremental materialized views
FROM `project.analytics.events_optimized`
GROUP BY 1, 2, 3;
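If fresher results are needed before the next scheduled refresh, a refresh can be triggered manually with BigQuery's system procedure:
-- Force an immediate refresh instead of waiting up to refresh_interval_minutes
CALL BQ.REFRESH_MATERIALIZED_VIEW('project.analytics.daily_user_stats');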
-- Query the materialized view (much faster and cheaper)
SELECT
user_id,
SUM(event_count) as total_events,
SUM(total_revenue) as total_revenue
FROM `project.analytics.daily_user_stats`
WHERE event_date >= '2024-01-01'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 1000;
Step 5: Query Optimization Best Practices
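One more note on Step 4 before moving on to general query hygiene: BigQuery can automatically rewrite qualifying aggregate queries against the base table to read from a matching materialized view ("smart tuning"), so even unchanged report queries may get cheaper. A sketch, assuming the view defined above:
-- Targets the base table, but may be served transparently from daily_user_stats
-- because the aggregation can be derived from the view definition
SELECT
  user_id,
  SUM(revenue) AS total_revenue
FROM `project.analytics.events_optimized`
WHERE DATE(event_timestamp) >= '2024-01-01'
GROUP BY user_id;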
-- BEFORE: Inefficient query patterns
SELECT * -- Selects ALL columns
FROM events
WHERE EXTRACT(YEAR FROM event_timestamp) = 2024 -- Function prevents partition pruning
AND JSON_EXTRACT_SCALAR(metadata, '$.campaign') = 'summer'; -- Parsing JSON scans every row's metadata
-- AFTER: Optimized query
SELECT
user_id,
event_type,
revenue -- Only needed columns
FROM `project.analytics.events_optimized`
WHERE event_timestamp >= '2024-01-01' -- Partition pruning works
AND event_timestamp < '2025-01-01'
AND campaign = 'summer'; -- Use an extracted column instead of parsing JSON
-- Use approximate functions for large datasets
SELECT
APPROX_COUNT_DISTINCT(user_id) as unique_users, -- Much faster
APPROX_QUANTILES(revenue, 100)[OFFSET(50)] as median_revenue
FROM events_optimized
WHERE DATE(event_timestamp) = CURRENT_DATE(); -- Partition filter
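When approximate distinct counts need to be combined across arbitrary date ranges, HLL sketches can be stored once and merged later. A sketch of the pattern; the daily_user_sketches table is hypothetical:
-- Store one HyperLogLog sketch of user_id per day
CREATE TABLE `project.analytics.daily_user_sketches` AS
SELECT
  DATE(event_timestamp) AS event_date,
  HLL_COUNT.INIT(user_id) AS users_sketch
FROM `project.analytics.events_optimized`
WHERE DATE(event_timestamp) >= '2024-01-01'
GROUP BY event_date;
-- Merge sketches for an approximate distinct count over any range of days
SELECT HLL_COUNT.MERGE(users_sketch) AS approx_unique_users
FROM `project.analytics.daily_user_sketches`
WHERE event_date BETWEEN '2024-01-01' AND '2024-03-31';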
-- Avoid self-joins, use window functions
-- BEFORE: Self-join to get previous row
SELECT a.*, b.revenue as prev_revenue
FROM events a
JOIN events b ON a.user_id = b.user_id
AND a.event_timestamp = (
SELECT MIN(event_timestamp)
FROM events
WHERE user_id = a.user_id
AND event_timestamp > b.event_timestamp
);
-- AFTER: Window function
SELECT
*,
LAG(revenue) OVER (PARTITION BY user_id ORDER BY event_timestamp) as prev_revenue
FROM events_optimized;
Step 6: Implement Slot Reservations for Predictable Costs
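Before committing to slots, it helps to estimate how many the workload actually consumes; a rough sizing query over the last 7 days of job history:
-- Average slots in use = total slot-milliseconds / milliseconds in the window (7 days here)
SELECT
  ROUND(SUM(total_slot_ms) / (1000 * 60 * 60 * 24 * 7), 1) AS avg_slots_last_7_days
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY';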
# For predictable, high-volume workloads, capacity-based (flat-rate) pricing is often cheaper
# e.g., 100 slots at roughly $2,000/month instead of paying $5/TB per query
# Create reservation
bq mk --reservation \
--project_id=project \
--location=US \
--slots=100 \
analytics-reservation
# Assign projects to reservation
bq mk --reservation_assignment \
--project_id=project \
--location=US \
--reservation_id=analytics-reservation \
--job_type=QUERY \
--assignee_type=PROJECT \
--assignee_id=analytics-project
Step 7: Use BI Engine for Dashboard Queries
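BI Engine itself is configured outside SQL (Console or API), as the notes below summarize. Whether a given query was actually accelerated can be checked afterwards from job metadata; a sketch, assuming the bi_engine_statistics record exposed by the JOBS view:
-- bi_engine_mode typically reports full, partial, or disabled acceleration
SELECT
  query,
  bi_engine_statistics.bi_engine_mode AS bi_engine_mode
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 20;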
-- Create BI Engine reservation for sub-second queries
-- Caches data in memory
-- Best for:
-- - Dashboard queries
-- - Repeated aggregations
-- - Time-series data
-- Configure via Console or API
-- 1GB BI Engine = ~$40/month
-- Dramatically faster for cached queries
Cost Comparison
BEFORE:
- Table: 50TB, no partitioning
- Query scans: 10TB (the WHERE filter selects rows, but without partitioning nothing is pruned)
- Cost: 10TB × $5/TB = $50 per query
- Daily cost (5 runs): $250
AFTER:
- Table: 50TB, partitioned by day, clustered
- Query scans: 500GB (partition pruning + clustering)
- Cost: 0.5TB × $5/TB = $2.50 per query
- With materialized view: 50GB scan = $0.25
Optimization impact:
- Partitioning: ~90% reduction (only the needed dates are scanned: 10TB → ~1TB)
- Clustering: ~50% additional (pruning within partitions: ~1TB → ~500GB)
- Materialized view: ~90% additional (pre-aggregated: ~500GB → ~50GB)
- Total: 95%+ reduction, 99%+ when the materialized view is used
Monitoring Query Costs
-- Find expensive queries
SELECT
user_email,
query,
ROUND(total_bytes_processed / POW(10,12), 4) as tb_processed,
ROUND(total_bytes_processed / POW(10,12) * 5, 2) as estimated_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 20;
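To track spend over time rather than hunting one-off offenders, the same view can be rolled up by day and user (still assuming the $5/TB on-demand rate used above):
-- Estimated daily on-demand cost per user
SELECT
  DATE(creation_time) AS run_date,
  user_email,
  ROUND(SUM(total_bytes_processed) / POW(10, 12) * 5, 2) AS estimated_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY run_date, user_email
ORDER BY estimated_cost_usd DESC;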
-- Set up cost controls: cap per-query spend with the maximum bytes billed setting
-- (job option / bq --maximum_bytes_billed) and set custom query-usage quotas per project or user
BigQuery Optimization Checklist
| Technique | Cost Reduction | Implementation |
|---|---|---|
| Partition by date | 50-90% | PARTITION BY DATE(timestamp) |
| Clustering | 20-50% additional | CLUSTER BY column1, column2 |
| Select specific columns | 30-70% | Avoid SELECT * |
| Materialized views | 80-95% | Pre-compute aggregations |
| BI Engine | N/A (faster) | Cache for dashboards |
| Flat-rate pricing | Predictable | For high-volume workloads |
Practice Question
Why does adding 'LIMIT 1000' to a BigQuery query NOT reduce the amount of data scanned or the cost?