04/Projects

Selected work.

A deeper look at the systems I've built — what was broken, the architecture I chose, and what changed because of it.

// Project 01·2022 — 2024

Enterprise Cloud Modernization Program

// Headline number

~1 TB

processed / day

// Architecture

[live pipeline graph · 9 nodes · 9 edges · active jobs: ingest_users_mysql · transform_price_history · load_to_bigquery · backfill_alteryx_exports]

Company-wide modernization from legacy Alteryx and MySQL workflows to a cloud-native GCP ecosystem using Python, BigQuery, Airflow, PySpark, Terraform, Dataproc, Cloud Composer, GCS, CI/CD, and metadata-driven automation. Started as a proof of concept and became the architecture used to scale data processing, reduce runtime and cost, and support self-service operations across the business.

// The problem

Legacy Alteryx and MySQL workflows couldn't keep pace with growing data volume. ETL ran for hours, gave teams no operational visibility, and required engineering to onboard every new data source by hand.

// My approach

Designed and built a cloud-native architecture on GCP: BigQuery for warehousing, Cloud Composer + Airflow for orchestration, Dataproc + PySpark for heavy transforms. The core innovation was a metadata-driven framework that lets analysts add new pipelines by editing config files — not writing code. Terraform managed everything as code; CI/CD shipped changes safely. Migrations ran incrementally so production never stopped.
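The metadata-driven idea above can be sketched as a small generator: a pipeline is declared as config, and the framework expands it into an ordered task list. This is a minimal illustration, not the real framework; all names and config keys here are hypothetical.

```python
# Hypothetical config entry an analyst would add; in practice this
# would live in a version-controlled config file rather than code.
PIPELINE_CONFIG = {
    "name": "ingest_users_mysql",
    "source": {"type": "mysql", "table": "users"},
    "target": {"dataset": "analytics", "table": "users"},
    "schedule": "0 2 * * *",
}

def build_tasks(cfg: dict) -> list[str]:
    """Expand one config entry into an ordered task chain:
    extract -> stage to GCS -> load to BigQuery -> validate."""
    name = cfg["name"]
    return [
        f"{name}.extract_{cfg['source']['type']}",
        f"{name}.stage_to_gcs",
        f"{name}.load_to_bq_{cfg['target']['dataset']}.{cfg['target']['table']}",
        f"{name}.validate_row_counts",
    ]

print(build_tasks(PIPELINE_CONFIG))
```

In an Airflow deployment, a generator like this would emit operator instances inside a DAG file instead of task-name strings; the point is that adding a source means adding config, not code.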

// Stack

gcp · bigquery · airflow · cloud-composer · pyspark · terraform · dataproc · gcs · python · cloud-migration · alteryx · mysql

// Outcome

  • Pipeline runtimes cut from hours to minutes
  • ~1 TB processed daily across business domains
  • Right-sized Dataproc clusters reduced cost per workload
  • Self-service onboarding for non-engineering teams

// Project 02·2023 — 2024

Cortex

// Headline number

~90%

routine investigations · self-serve

// Operations app

[live ops dashboard · cortex · pipeline ops · DAGs: user_etl · price_sync · refresh_cache · tenant_metrics]

End-user App Engine application built with React, Python, BigQuery, MySQL, GCS, Airflow, and other GCP services. It connects DAGs, metadata, BigQuery processes, MySQL state, logs, errors, and pipeline status into a single operational interface, giving internal users real-time tracking, process snapshots, monitoring views, data checks, reporting utilities, and metadata controls so they can manage complex cloud workflows without engineering intervention.

// The problem

Engineering owned all operational visibility — logs, pipeline state, data checks, errors. Business teams had to file tickets to see what was happening with their own data. This created bottlenecks, slowed incident response, and eroded trust in the platform.

// My approach

Built a React + App Engine app that consolidates Airflow DAGs, BigQuery state, MySQL metadata, and pipeline logs into one operational interface. Real-time updates for status changes, structured search across logs and errors, one-click data validation, and controls to rerun, skip, or override pipelines without engineering tickets.
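The consolidation step can be illustrated in miniature: merge orchestrator run state and error logs into one status row per pipeline, the kind of unified view the app surfaces. Field names and schemas below are assumptions for the sketch, not Cortex's real data model.

```python
def pipeline_status(runs: list[dict], errors: list[dict]) -> list[dict]:
    """One consolidated row per pipeline: latest run state + error count."""
    by_pipeline: dict[str, dict] = {}
    # Later runs overwrite earlier ones, so the latest state wins.
    for run in sorted(runs, key=lambda r: r["started_at"]):
        by_pipeline[run["pipeline"]] = {
            "pipeline": run["pipeline"],
            "state": run["state"],
            "errors": 0,
        }
    for err in errors:
        if err["pipeline"] in by_pipeline:
            by_pipeline[err["pipeline"]]["errors"] += 1
    return list(by_pipeline.values())

runs = [
    {"pipeline": "user_etl", "state": "success", "started_at": 1},
    {"pipeline": "user_etl", "state": "running", "started_at": 2},
    {"pipeline": "price_sync", "state": "failed", "started_at": 1},
]
errors = [{"pipeline": "price_sync", "msg": "schema drift"}]
print(pipeline_status(runs, errors))
```

In the real app the inputs would come from the Airflow metadata database, BigQuery job history, and structured log queries rather than in-memory lists.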

// Stack

app-engine · react · python · bigquery · mysql · gcs · airflow · dags · gcp · internal-tools · metadata-driven · metadata-management

// Outcome

  • Engineering tickets for operational issues dropped sharply
  • Business teams self-serve ~90% of routine investigations
  • Mean-time-to-detect on data issues materially improved
  • Designed, built, and maintained as a single-engineer project

// Project 03·2021 — 2022

Automated Mover Modeling System

// Headline number

100s

models trained in parallel

// ML pipeline · stage 1/6

[pipeline animation · stage 1/6: split data into train (70%) / test / val · 8,420 rows · stratified split]

Metadata-driven machine learning infrastructure supporting parallel model training, retraining, and scoring across large client portfolios with PySpark, Airflow, imbalance correction, and automated orchestration.

// The problem

Modeling client portfolios at scale meant training, retraining, and scoring hundreds of models in parallel. The old workflow lived in notebooks — slow, error-prone, manually triggered, and impossible to audit reliably.

// My approach

Built a metadata-driven ML platform on PySpark + Airflow. Data scientists configure new model runs via YAML; the system handles class imbalance correction, parallel training across client segments, model versioning, and automated scoring. Every run is fully reproducible from its logged config and metrics.
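Two pieces of the approach can be sketched together: a config entry describing a run, and inverse-frequency class weighting as one simple form of imbalance correction. The config keys and names are illustrative, not the platform's real schema, and the production system applies this at PySpark scale rather than over Python lists.

```python
from collections import Counter

# Hypothetical run config a data scientist would submit (as YAML in
# practice); one run fans out across the listed client segments.
RUN_CONFIG = {
    "model": "mover_propensity",
    "segments": ["client_a", "client_b"],
    "target": "moved_within_90d",
}

def class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights so minority-class errors cost more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# 9:1 imbalance -> weights {0: ~0.56, 1: 5.0}
labels = [0] * 90 + [1] * 10
print(class_weights(labels))
```

Logging the config and the derived weights alongside metrics is what makes each run reproducible from its metadata alone.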

// Stack

machine-learning · pyspark · airflow · metadata-driven · model-orchestration · imbalanced-classification · retraining · scoring

// Outcome

  • Hundreds of models trained in parallel runs
  • Manual notebook work eliminated from the loop
  • Imbalance correction built in by default
  • Reproducibility through metadata-versioned configs

// Project 04·2020 — 2022

ROI and QBR Reporting Automation

// Headline number

€60k

annual time savings unlocked

// Report sheet · Q4 ROI

[animated spreadsheet · fx=SUM(B2:E4) · quarterly ROI by account: ACME CORP · BETA INC · GAMMA LTD]

Automated reporting systems that reduced ROI report generation from minutes or hours to seconds and produced fully formatted QBR PowerPoint reports in under one minute.

// The problem

ROI reports took 30 minutes to hours to generate manually and were prone to formatting errors. Quarterly business reviews (QBRs) ate entire days of slide assembly. Account managers spent more time formatting than analyzing.

// My approach

Built Django + React systems backed by BigQuery to generate ROI reports on demand (seconds) and assemble fully formatted QBR PowerPoint decks programmatically. The QBR engine reads from analytics tables and writes branded slides with consistent typography, charts, and narrative templates.
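The deck-assembly step can be sketched as a pure function: turn quarterly metrics into a list of slide specs that a rendering layer (python-pptx, say) would then draw with the branded templates. The account name, fields, and layout names below are illustrative assumptions.

```python
def build_qbr_slides(account: str, quarterly_roi: dict[str, float]) -> list[dict]:
    """One title slide plus one metric slide per quarter, in order."""
    slides = [{"layout": "title", "text": f"QBR · {account}"}]
    for quarter, roi in quarterly_roi.items():
        slides.append({
            "layout": "metric",
            "title": f"{quarter} ROI",
            "body": f"{account} delivered €{roi:,.0f} in {quarter}",
        })
    return slides

deck = build_qbr_slides("Example Co", {"Q1": 42_000, "Q2": 58_000})
print(len(deck), deck[1]["title"])  # 3 slides: title + Q1 + Q2
```

Separating the slide plan from rendering is what keeps the engine fast: the data query and the template pass each stay simple and independently testable.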

// Stack

reporting · automation · roi · qbr · powerpoint · django · react · bigquery · python · analytics

// Outcome

  • ROI generation: 30 min to hours → seconds
  • QBR PowerPoint decks: a full day → under one minute
  • ~€60k annual time savings unlocked
  • Account teams refocus on analysis, not assembly