Data Engineering

Financial Data Warehouse (ETL Pipeline)

Enterprise-grade data warehouse transforming fragmented ERP data into a unified Star Schema for single-source-of-truth reporting.

🎯 Business Challenge

The finance team was drowning in data chaos: five separate systems (ERP, CRM, Excel exports, a legacy SQL Server, bank feeds) with no unified view, leading to:

  • Conflicting Numbers: Sales report from CRM showed €2.5M revenue, accounting system showed €2.3M—which is correct?
  • Manual Reconciliation: Analysts spending 10 hours/week joining data across systems in Excel.
  • Slow Queries: Generating month-end reports took 45 minutes due to complex joins across denormalized tables.
  • No Historical Tracking: Couldn't analyze trends (e.g., "How has gross margin changed over 3 years?").

"We had 5 versions of 'the truth'. The CFO would ask a question and get 3 different answers depending on who responded."

💡 Solution Architecture

I designed and implemented a centralized data warehouse using Star Schema methodology, with automated ETL pipelines ensuring data freshness and consistency.

Phase 1: Data Warehouse Design (Star Schema)

  • Fact Tables: Sales Transactions, GL Entries, Inventory Movements (grain: 1 row per transaction).
  • Dimension Tables: Customers, Products, Time (Date), Accounts (Chart of Accounts), Departments.
  • Slowly Changing Dimensions (SCD Type 2): Track historical changes (e.g., customer address changes over time).
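
The SCD Type 2 mechanics can be sketched in pandas (illustrative only; the `valid_from`/`valid_to` column names are assumptions, and NaN handling is simplified):

```python
import pandas as pd

def scd2_upsert(dim, incoming, key, tracked, as_of):
    """Expire changed rows and append new versions (SCD Type 2).

    Simplified sketch: any difference in a tracked column counts as
    a change (NaN-vs-NaN would also register as changed).
    """
    current = dim[dim['valid_to'].isna()]
    joined = current.merge(incoming, on=key, suffixes=('', '_new'))
    old_vals = joined[tracked].values
    new_vals = joined[[c + '_new' for c in tracked]].values
    changed_keys = joined.loc[(old_vals != new_vals).any(axis=1), key]
    # Close out the old versions...
    dim.loc[dim[key].isin(changed_keys) & dim['valid_to'].isna(), 'valid_to'] = as_of
    # ...and append the new ones with an open-ended validity window
    new_rows = incoming[incoming[key].isin(changed_keys)].copy()
    new_rows['valid_from'] = as_of
    new_rows['valid_to'] = pd.NA
    return pd.concat([dim, new_rows], ignore_index=True)
```

A customer address change thus produces two rows: the old one closed at the change date, the new one open-ended, so point-in-time queries stay correct.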

Phase 2: ETL Pipeline Development

  • Extract: Pull data from ERP (SAP), CRM (Salesforce), Excel files (FTP), legacy SQL Server.
  • Transform: Standardize formats, handle nulls, deduplicate, calculate derived metrics (gross margin, YoY growth).
  • Load: Insert into warehouse (PostgreSQL) using UPSERT logic (update existing, insert new).
  • Scheduling: Automated nightly runs via Python + cron (incremental loads for performance).
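
The incremental-load step can be sketched as a high-water-mark filter (illustrative; the watermark bookkeeping shown here is an assumption, not the production scheduler):

```python
import pandas as pd

def incremental_slice(source, last_loaded):
    """Return only rows newer than the previous run's high-water mark."""
    return source[source['transaction_date'] > last_loaded]

def advance_watermark(fresh, last_loaded):
    """After a successful load, move the watermark to the newest row processed."""
    return fresh['transaction_date'].max() if not fresh.empty else last_loaded
```

Each nightly run then touches only the delta since the last run, which is what keeps the batch window short as the fact table grows.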

Phase 3: Power BI Integration

  • Connected Power BI to warehouse via DirectQuery (real-time dashboards).
  • Pre-built star schema enabled drag-and-drop reporting (no complex SQL needed).
  • Created reusable DAX measures: YTD Revenue, MTD Expenses, Customer Lifetime Value.

Technical Implementation

import pandas as pd
from sqlalchemy import create_engine, text

# 1. Extract from multiple sources
erp_data = pd.read_sql("SELECT * FROM sales", erp_connection)
crm_data = pd.read_csv("crm_export.csv")

# 2. Transform: join sources, derive metrics, normalize types
merged = pd.merge(erp_data, crm_data, on='customer_id', how='left')
merged['gross_margin'] = (merged['revenue'] - merged['cogs']) / merged['revenue']
merged['transaction_date'] = pd.to_datetime(merged['transaction_date'])

# 3. Load to warehouse: stage the batch, then a true UPSERT
# (a plain append would duplicate rows on re-run; ON CONFLICT keeps the load idempotent)
warehouse_engine = create_engine('postgresql://user:pass@localhost/warehouse')
merged.to_sql('stg_sales', warehouse_engine, if_exists='replace', index=False)
with warehouse_engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO fact_sales
        SELECT * FROM stg_sales
        ON CONFLICT (transaction_id) DO UPDATE
            SET revenue = EXCLUDED.revenue, cogs = EXCLUDED.cogs
    """))

# 4. Update dimensions: insert customers not yet present
# (SCD Type 2 versioning of changed attributes runs as a separate step)
existing_customers = pd.read_sql("SELECT customer_id FROM dim_customer", warehouse_engine)
new_customers = merged[~merged['customer_id'].isin(existing_customers['customer_id'])]
new_customers.to_sql('dim_customer', warehouse_engine, if_exists='append', index=False)

📈 Key Results & Business Impact

Quantified Outcomes

  • 70% Faster Queries: Month-end reports down from 45min to 13min (optimized Star Schema).
  • 1M+ Transactions: Warehouse handles 1.2M transactions with sub-second query response.
  • 100% Data Integrity: Eliminated discrepancies through single source of truth.
  • 5 Sources Unified: ERP, CRM, Excel, legacy SQL Server, bank feeds—all in one place.

Business Metrics Improved

  • Reconciliation Time: 10 hours/week → 1 hour/week (90% reduction).
  • Report Accuracy: Zero "version conflicts" in executive dashboards.
  • Self-Service Analytics: Business users can now build their own reports without IT help.

Strategic Insights

  • Historical Trend Analysis: Discovered 15% decline in gross margin over 2 years due to supplier price increases.
  • Customer Segmentation: Identified top 20% of customers generating 80% of revenue.
  • Inventory Optimization: Detected slow-moving SKUs tying up €500k in working capital.

🛠️ Technical Methodology

Star Schema Design Principles

  1. Fact Tables: Contain metrics (revenue, cost, quantity) + foreign keys to dimensions.
  2. Dimension Tables: Contain descriptive attributes (customer name, product category, date).
  3. Grain: Define granularity (e.g., 1 row per invoice line item).
  4. Denormalization: Dimensions are denormalized for query performance.
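
These principles show up in even a toy pandas example (hypothetical data): a narrow fact table carrying metrics plus foreign keys, joined once to a denormalized dimension, with no multi-hop joins:

```python
import pandas as pd

# Fact table: metrics + foreign keys only
fact_sales = pd.DataFrame({
    'product_id': [1, 1, 2],
    'date_id': [20240101, 20240102, 20240101],
    'revenue': [100.0, 150.0, 80.0],
})
# Dimension table: denormalized descriptive attributes
dim_product = pd.DataFrame({
    'product_id': [1, 2],
    'category': ['Hardware', 'Software'],
})

# One join per dimension, then aggregate
report = (fact_sales
          .merge(dim_product, on='product_id')
          .groupby('category', as_index=False)['revenue'].sum())
```

The same shape is what lets a BI tool resolve "revenue by category" with a single join instead of walking a normalized schema.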

ETL Best Practices

  • Incremental Loads: Only load new/changed records (not full refresh every night).
  • Data Validation: Check for nulls, duplicates, referential integrity before loading.
  • Error Handling: Log failed records to error table for manual review.
  • Idempotency: ETL can be re-run without creating duplicates (UPSERT logic).
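
The validation and error-handling practices can be combined into a small pre-load gate (a sketch; the rule set and the shape of the error records here are assumptions):

```python
import pandas as pd

def validate_batch(df, key, required):
    """Split a batch into loadable rows and rejected rows tagged with a reason."""
    # Rule 1: required columns must not be null
    bad_null = df[df[required].isna().any(axis=1)].assign(reason='null in required column')
    # Rule 2: the business key must be unique within the batch
    bad_dup = df[df.duplicated(subset=key, keep='first')].assign(reason='duplicate key')
    errors = pd.concat([bad_null, bad_dup], ignore_index=True)
    good = df.drop(bad_null.index.union(bad_dup.index))
    return good, errors
```

Rejected rows would then land in the error table for manual review instead of silently corrupting the warehouse.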

Performance Optimization

  • Indexes: B-tree indexes on foreign keys, composite indexes on frequently filtered columns.
  • Partitioning: Fact tables partitioned by month for faster queries on recent data.
  • Materialized Views: Pre-aggregated tables for common reports (daily/monthly totals).
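
Monthly partition maintenance can be scripted; a sketch that emits PostgreSQL declarative-partitioning DDL (the helper and its naming convention are illustrative, not the production tooling):

```python
from datetime import date

def month_partition_ddl(table, year, month):
    """Generate DDL for one monthly range partition of a fact table."""
    start = date(year, month, 1)
    # Range upper bound is exclusive: first day of the following month
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{table}_{year}_{month:02d}"
    return (f"CREATE TABLE {name} PARTITION OF {table} "
            f"FOR VALUES FROM ('{start}') TO ('{end}');")
```

Running this ahead of each month keeps queries on recent data scanning one small partition instead of the full fact table.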

🎓 Lessons Learned

What Worked

  • Stakeholder Alignment: Involved end-users early to define business requirements and key metrics.
  • Incremental Development: Started with Sales data, then added Finance, then Inventory (iterative approach).
  • Documentation: Created data dictionary explaining every field (reduced support tickets by 80%).

Challenges Overcome

  • Data Quality: Source systems had 12% duplicate records; implemented fuzzy matching to deduplicate.
  • Schema Changes: ERP vendor changed field names in update; added abstraction layer to handle schema evolution.
  • Performance Degradation: Queries slowed as data grew; added partitioning and archived old data.
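
The fuzzy-matching dedup mentioned above can be sketched with the standard library's `difflib` (a simplified stand-in for the production matcher; the 0.9 similarity threshold is an assumption):

```python
from difflib import SequenceMatcher

def dedupe_fuzzy(names, threshold=0.9):
    """Keep the first representative of each group of near-identical names."""
    kept = []
    for name in names:
        norm = name.strip().lower()  # cheap normalization before comparing
        if not any(SequenceMatcher(None, norm, k).ratio() >= threshold for k in kept):
            kept.append(norm)
    return kept
```

This pairwise approach is O(n²), so at scale one would block on a cheap key (e.g. first letters) before comparing; the sketch only shows the matching idea.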

🚀 Future Enhancements

  • Real-Time Streaming: Move from batch (nightly) to streaming (Apache Kafka) for real-time dashboards.
  • Data Quality Framework: Automated data profiling and anomaly detection (Great Expectations).
  • Cloud Migration: Migrate from on-prem PostgreSQL to cloud (Snowflake/BigQuery).
  • Machine Learning: Add predictive models (sales forecasting, churn prediction) directly in warehouse.

📚 Technical Stack Deep Dive

  • PostgreSQL: Relational database hosting the Star Schema, tuned for analytical (OLAP-style) queries via indexing, partitioning, and materialized views.
  • Python (Pandas, SQLAlchemy): ETL scripting, data transformations.
  • Power BI: Visualization layer, DAX measures, self-service reporting.
  • Cron: Job scheduling for nightly ETL runs.
  • Git: Version control for ETL scripts and schema migrations.