
A BigQuery query scans 10TB and costs $50 per run. Optimize it to under $5.


The Scenario

A daily analytics report query costs $50 per execution:

-- Current query: scans 10TB, costs $50
SELECT
  user_id,
  COUNT(*) as event_count,
  SUM(revenue) as total_revenue
FROM `project.analytics.events`
WHERE event_date >= '2024-01-01'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 1000;

The events table is 50TB total, with new data added daily. The query runs multiple times per day for different date ranges.

The Challenge

Reduce query costs by 90% using BigQuery optimization techniques: partitioning, clustering, materialized views, and query best practices.

Wrong Approach

A junior engineer might export data to a smaller table, add LIMIT expecting it to reduce scans, or avoid BigQuery entirely. These approaches don't address the root cause: on-demand pricing bills for bytes read, LIMIT is applied after the scan so it doesn't reduce data scanned, and moving the workload off BigQuery gives up its analytical power.

Right Approach

A senior engineer partitions the table by date (so queries scan only the relevant partitions), clusters it on frequently filtered columns, builds materialized views for repeated aggregations, and tightens the queries themselves (selecting only needed columns, keeping filters directly on the partition column). They also consider BI Engine for sub-second dashboard queries.

Step 1: Understand Current Table Structure

-- Check table size and row count
SELECT
  table_id,
  ROUND(size_bytes / POW(10,12), 2) as size_tb,
  ROUND(row_count / POW(10,9), 2) as rows_billions
FROM `project.analytics.__TABLES__`
WHERE table_id = 'events';

-- Check current partitioning and clustering columns
SELECT
  column_name,
  is_partitioning_column,
  clustering_ordinal_position
FROM `project.analytics.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events';

-- Find the most expensive recent queries that touch the events table
-- (inspect their WHERE clauses to choose partition and clustering columns)
SELECT
  query,
  total_bytes_processed,
  total_slot_ms
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (
    SELECT 1 FROM UNNEST(referenced_tables) AS t
    WHERE t.table_id = 'events'
  )
ORDER BY total_bytes_processed DESC
LIMIT 20;

Step 2: Implement Partitioning

-- Create partitioned table
CREATE TABLE `project.analytics.events_partitioned`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM `project.analytics.events`;

-- Or add partitioning to existing table schema
-- (requires table recreation)

-- Set partition expiration for old data
ALTER TABLE `project.analytics.events_partitioned`
SET OPTIONS (
  partition_expiration_days = 365,
  require_partition_filter = true  -- Prevent full scans!
);
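
To confirm the partitions exist and see how much data lands in each day, query the dataset's INFORMATION_SCHEMA.PARTITIONS view (a quick sanity check; adjust the project and dataset to your environment):

-- List the newest partitions and their sizes
SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(10,9), 2) as size_gb
FROM `project.analytics.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'events_partitioned'
ORDER BY partition_id DESC
LIMIT 10;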

Step 3: Add Clustering

-- Clustering sorts data within partitions
-- Great for high-cardinality columns used in WHERE/JOIN

CREATE TABLE `project.analytics.events_optimized`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type, country
AS SELECT * FROM `project.analytics.events`;

-- Optimized query now scans much less data
SELECT
  user_id,
  COUNT(*) as event_count,
  SUM(revenue) as total_revenue
FROM `project.analytics.events_optimized`
WHERE DATE(event_timestamp) >= '2024-01-01'  -- Uses partition
  AND user_id IN (SELECT user_id FROM active_users)  -- Filters on the first clustering column
GROUP BY user_id;
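
Clustering pays off most when the filter on a clustering column is a constant, because BigQuery can then skip whole blocks at execution time (bytes billed can come in well under the dry-run estimate). A minimal sketch; the user ID value is hypothetical:

-- Point lookup on the first clustering column
SELECT
  event_type,
  revenue
FROM `project.analytics.events_optimized`
WHERE DATE(event_timestamp) = '2024-06-01'  -- partition pruning
  AND user_id = 'user_12345';  -- constant filter on a clustering column enables block pruning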

Step 4: Use Materialized Views

-- Pre-compute common aggregations
CREATE MATERIALIZED VIEW `project.analytics.daily_user_stats`
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60
)
AS
SELECT
  DATE(event_timestamp) as event_date,
  user_id,
  event_type,
  COUNT(*) as event_count,
  SUM(revenue) as total_revenue,
  APPROX_COUNT_DISTINCT(session_id) as sessions  -- exact COUNT(DISTINCT) is not supported in materialized views
FROM `project.analytics.events_optimized`
GROUP BY 1, 2, 3;

-- Query the materialized view (much faster and cheaper)
SELECT
  user_id,
  SUM(event_count) as total_events,
  SUM(total_revenue) as total_revenue
FROM `project.analytics.daily_user_stats`
WHERE event_date >= '2024-01-01'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 1000;
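
Materialized views refresh automatically on the interval configured above, and BigQuery's smart tuning can rewrite qualifying queries against the base table to read from the view. After a large backfill you can also force a refresh by hand:

-- Trigger an immediate refresh of the materialized view
CALL BQ.REFRESH_MATERIALIZED_VIEW('project.analytics.daily_user_stats');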

Step 5: Query Optimization Best Practices

-- BEFORE: Inefficient query patterns
SELECT *  -- Selects ALL columns
FROM events
WHERE EXTRACT(YEAR FROM event_timestamp) = 2024  -- Function prevents partition pruning
  AND JSON_VALUE(metadata, '$.campaign') = 'summer';  -- Parses JSON for every row scanned

-- AFTER: Optimized query
SELECT
  user_id,
  event_type,
  revenue  -- Only needed columns
FROM `project.analytics.events_optimized`
WHERE event_timestamp >= '2024-01-01'  -- Partition pruning works
  AND event_timestamp < '2025-01-01'
  AND campaign = 'summer';  -- Filter on an extracted column instead of JSON
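
-- The campaign filter above assumes the JSON field was promoted to a real column.
-- A rough one-time backfill sketch (column name and JSON path are illustrative;
-- a full-table UPDATE on 50TB has its own cost, so prefer extracting the field at load time):
ALTER TABLE `project.analytics.events_optimized`
ADD COLUMN IF NOT EXISTS campaign STRING;

UPDATE `project.analytics.events_optimized`
SET campaign = JSON_VALUE(metadata, '$.campaign')
WHERE campaign IS NULL;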
-- Use approximate functions for large datasets
SELECT
  APPROX_COUNT_DISTINCT(user_id) as unique_users,  -- Much faster
  APPROX_QUANTILES(revenue, 100)[OFFSET(50)] as median_revenue
FROM events_optimized
WHERE DATE(event_timestamp) = CURRENT_DATE();

-- Avoid self-joins, use window functions
-- BEFORE: Self-join to get previous row
SELECT a.*, b.revenue as prev_revenue
FROM events a
JOIN events b ON a.user_id = b.user_id
  AND a.event_timestamp = (
    SELECT MIN(event_timestamp)
    FROM events
    WHERE user_id = a.user_id
      AND event_timestamp > b.event_timestamp
  );

-- AFTER: Window function
SELECT
  *,
  LAG(revenue) OVER (PARTITION BY user_id ORDER BY event_timestamp) as prev_revenue
FROM events_optimized;

Step 6: Implement Slot Reservations for Predictable Costs

# For predictable, high-volume workloads, capacity-based pricing can beat on-demand:
# legacy flat-rate was ~$2,000/month per 100 slots (new purchases use slot
# reservations under BigQuery editions, priced per slot-hour)

# Create reservation
bq mk --reservation \
  --project_id=project \
  --location=US \
  --slots=100 \
  analytics-reservation

# Assign projects to reservation
bq mk --reservation_assignment \
  --project_id=project \
  --location=US \
  --reservation_id=analytics-reservation \
  --job_type=QUERY \
  --assignee_type=PROJECT \
  --assignee_id=analytics-project
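
Before committing to slots, check whether the project would actually keep 100 slots busy. A rough estimate from the same jobs view used earlier (100 reserved slots equals about 2,400 slot-hours of capacity per day):

-- Approximate slot-hours consumed per day over the last 30 days
SELECT
  DATE(creation_time) as day,
  ROUND(SUM(total_slot_ms) / (1000 * 60 * 60), 1) as slot_hours
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day;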

Step 7: Use BI Engine for Dashboard Queries

-- Create BI Engine reservation for sub-second queries
-- Caches data in memory

-- Best for:
-- - Dashboard queries
-- - Repeated aggregations
-- - Time-series data

-- Configure via Console or API
-- 1GB BI Engine = ~$40/month
-- Dramatically faster for cached queries

Cost Comparison

BEFORE:
- Table: 50TB, no partitioning
- Query scans: 10TB (filters in WHERE)
- Cost: 10TB × $5/TB = $50 per query
- Daily cost (5 runs): $250

AFTER:
- Table: 50TB, partitioned by day, clustered
- Query scans: 500GB (partition pruning + clustering)
- Cost: 0.5TB × $5/TB = $2.50 per query
- With materialized view: 50GB scan = $0.25

Optimization impact:
- Partitioning: ~90% reduction (only the needed dates are scanned: 10TB down to ~1TB)
- Clustering: ~50% further reduction within those partitions (~1TB down to ~500GB)
- Materialized view: ~90% further reduction on top (pre-aggregated: ~500GB down to ~50GB)
- Total: 95% reduction, over 99% with the materialized view

Monitoring Query Costs

-- Find expensive queries
SELECT
  user_email,
  query,
  ROUND(total_bytes_processed / POW(10,12), 4) as tb_processed,
  ROUND(total_bytes_processed / POW(10,12) * 5, 2) as estimated_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 20;
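
To watch the trend rather than individual jobs, roll the same view up into estimated spend per user per day (this assumes the $5/TB on-demand rate used throughout):

-- Estimated on-demand spend per user per day, last 30 days
SELECT
  DATE(creation_time) as day,
  user_email,
  ROUND(SUM(total_bytes_processed) / POW(10,12) * 5, 2) as estimated_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY day, user_email
ORDER BY day DESC, estimated_cost_usd DESC;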

-- Set up cost controls:
-- - set maximum_bytes_billed on individual queries so a runaway query fails instead of billing
-- - set custom quotas (query bytes billed per day, per project or per user) in the Cloud console

BigQuery Optimization Checklist

Technique               | Cost Reduction     | Implementation
Partition by date       | 50-90%             | PARTITION BY DATE(timestamp)
Clustering              | 20-50% additional  | CLUSTER BY column1, column2
Select specific columns | 30-70%             | Avoid SELECT *
Materialized views      | 80-95%             | Pre-compute aggregations
BI Engine               | N/A (faster)       | Cache for dashboards
Flat-rate pricing       | Predictable        | For high-volume workloads

Practice Question

Why does adding 'LIMIT 1000' to a BigQuery query NOT reduce the amount of data scanned or the cost?