BigQuery Query Rerun Manager

A production-oriented, metadata-driven system for executing and re-running grouped BigQuery queries using Docker and Kubernetes.

BigQuery · Kubernetes · Docker · Python · GCP IAM

TL;DR

Role

Lead Data Engineer

Timeline

4 months

Team

Solo project

Year

2025

Queries/day: 500+
Avg execution time: <30s
Success rate: 98.5%

Overview

Built a production-grade system for managing BigQuery query execution and re-runs. Queries are stored as metadata in BigQuery, executed via Dockerized Python applications, and orchestrated using Kubernetes Jobs. All executions are logged back to BigQuery for full observability and auditability.

Problem

In production analytics pipelines, BigQuery queries often fail due to upstream data issues, schema changes, or transient errors. Re-running queries manually was error-prone, hard to audit, and difficult to scale across multiple query groups. There was no lightweight, metadata-driven mechanism to safely re-execute grouped queries with proper logging and traceability.

Approach

Designed a cloud-native system that executes grouped BigQuery queries using metadata stored in BigQuery. Built as a Dockerized Python application deployed via Kubernetes Jobs/CronJobs. All execution status, errors, and timestamps are logged back to BigQuery for full observability. Queries are defined dynamically via metadata tables, allowing updates without code changes.
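To make the metadata-driven part concrete, here is a minimal sketch of how the app might read one group's query definitions. The table path (`my-project.ops.query_metadata`) and columns (`query_id`, `sql_text`, `is_enabled`, `run_order`) are assumptions for illustration, not the production schema.

```python
from google.cloud import bigquery

def fetch_group_queries(client: bigquery.Client, group_id: str) -> list:
    """Return the enabled queries for one group, in execution order."""
    sql = """
        SELECT query_id, sql_text
        FROM `my-project.ops.query_metadata`  -- illustrative table path
        WHERE group_id = @group_id AND is_enabled
        ORDER BY run_order
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("group_id", "STRING", group_id)
        ]
    )
    return list(client.query(sql, job_config=job_config).result())
```

Because definitions live in a table, adding, disabling, or reordering a query is a single DML statement against the metadata table, with no image rebuild or redeploy.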

Architecture

Kubernetes Jobs trigger Docker containers running the Python app → read query metadata from BigQuery → execute SQL queries sequentially → log all results (success/failure, timestamps, errors) back to a BigQuery execution-log table. Authentication is handled by a dedicated GCP service account rather than user credentials.
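The execute-and-log loop, condensed to a hedged sketch under the same assumed schema (the log-table path and column names are illustrative, not the production code). On the cluster, `bigquery.Client()` typically picks up the Job's service account through Application Default Credentials, so no key handling appears in the code.

```python
import datetime

from google.cloud import bigquery

LOG_TABLE = "my-project.ops.execution_log"  # illustrative table path

def run_and_log(client: bigquery.Client, group_id: str, queries) -> None:
    """Execute each query in order and write one log row per attempt."""
    for q in queries:
        started = datetime.datetime.now(datetime.timezone.utc)
        log_row = {
            "group_id": group_id,
            "query_id": q["query_id"],
            "started_at": started.isoformat(),
        }
        try:
            job = client.query(q["sql_text"])
            job.result()  # block until the query finishes
            log_row.update(status="SUCCESS", bq_job_id=job.job_id)
        except Exception as exc:  # record the failure, keep the group running
            log_row.update(status="FAILED", error_message=str(exc))
        log_row["ended_at"] = datetime.datetime.now(
            datetime.timezone.utc
        ).isoformat()
        errors = client.insert_rows_json(LOG_TABLE, [log_row])
        if errors:
            raise RuntimeError(f"could not write execution log: {errors}")
```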

High-Level Architecture

[Diagram: BigQuery Rerun Manager Architecture]

Execution Flow

[Diagram: BigQuery Execution Flow]

Data Model

[Diagram: BigQuery Data Model]
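Given the description above, the two core tables could be declared with the BigQuery Python client roughly as follows; the column set is my reconstruction from the prose, not the actual DDL.

```python
from google.cloud import bigquery

# Reconstructed, not actual, schemas for the two core tables.
QUERY_METADATA_SCHEMA = [
    bigquery.SchemaField("query_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("group_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("sql_text", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("run_order", "INTEGER"),
    bigquery.SchemaField("is_enabled", "BOOLEAN"),
]

EXECUTION_LOG_SCHEMA = [
    bigquery.SchemaField("group_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("query_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("status", "STRING", mode="REQUIRED"),  # SUCCESS/FAILED
    bigquery.SchemaField("bq_job_id", "STRING"),
    bigquery.SchemaField("error_message", "STRING"),
    bigquery.SchemaField("started_at", "TIMESTAMP"),
    bigquery.SchemaField("ended_at", "TIMESTAMP"),
]

def create_tables(client: bigquery.Client) -> None:
    """One-time setup; the dataset path is illustrative."""
    for name, schema in [("query_metadata", QUERY_METADATA_SCHEMA),
                         ("execution_log", EXECUTION_LOG_SCHEMA)]:
        table = bigquery.Table(f"my-project.ops.{name}", schema=schema)
        client.create_table(table, exists_ok=True)
```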

Key Decisions

Store query definitions as metadata in BigQuery so queries can be added or changed without a code deploy. Use Kubernetes Jobs/CronJobs rather than a full orchestration framework, keeping the batch workload simple. Log every execution back to BigQuery so the audit trail is queryable with the same tooling as the data itself.

Results

Eliminated manual query re-runs, saving ~10 hours/week of engineering time. Achieved a 98.5% success rate with automatic error logging. Reduced mean time to recovery from 2 hours to 15 minutes through better visibility. Query groups now scale independently via separate Kubernetes Jobs.

What I Learned

Metadata-driven systems provide incredible flexibility but require careful schema design. Kubernetes Jobs are perfect for batch workloads—simpler than full orchestration frameworks for focused use cases. Logging everything to BigQuery made debugging trivial and enabled data-driven optimization of query patterns.

What I'd Do Next

Add Slack alerting on failures, implement retry policies per query, enable parallel execution for large query groups, build a simple UI dashboard for execution history, and integrate with Airflow for complex DAG dependencies.
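For the per-query retry item, the policy could be as small as the sketch below, with `max_retries` and `backoff_s` promoted to new columns on the (hypothetical) query_metadata table; names are illustrative.

```python
import time

from google.cloud import bigquery

def run_with_retries(client: bigquery.Client, sql: str,
                     max_retries: int = 3, backoff_s: float = 5.0):
    """Retry a failed query with linear backoff before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            return client.query(sql).result()
        except Exception:
            if attempt == max_retries:
                raise  # final failure is logged by the caller
            time.sleep(backoff_s * attempt)
```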