Data Engineering

Financial Data Warehouse (ETL Pipeline)

Enterprise-grade data warehouse transforming fragmented ERP data into a unified Star Schema for single-source-of-truth reporting.

🎯 Business Challenge

The finance team was drowning in data chaos: five separate systems (ERP, CRM, Excel exports, a legacy SQL Server, bank feeds) with no unified view, leading to:

  • Conflicting Numbers: Sales report from CRM showed €2.5M revenue, accounting system showed €2.3M—which is correct?
  • Manual Reconciliation: Analysts spending 10 hours/week joining data across systems in Excel.
  • Slow Queries: Generating month-end reports took 45 minutes due to complex joins across denormalized tables.
  • No Historical Tracking: Couldn't analyze trends (e.g., "How has gross margin changed over 3 years?").

"We had 5 versions of 'the truth'. The CFO would ask a question and get 3 different answers depending on who responded."

💡 Solution Architecture

I designed and implemented a centralized data warehouse using Star Schema methodology, with automated ETL pipelines ensuring data freshness and consistency.

Phase 1: Data Warehouse Design (Star Schema)

  • Fact Tables: Sales Transactions, GL Entries, Inventory Movements (grain: 1 row per transaction).
  • Dimension Tables: Customers, Products, Time (Date), Accounts (Chart of Accounts), Departments.
  • Slowly Changing Dimensions (SCD Type 2): Track historical changes (e.g., customer address changes over time).
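
The SCD Type 2 mechanics can be sketched in pandas (illustrative only; the `valid_from`/`valid_to` column names are assumptions, and NaN handling is simplified):

```python
import pandas as pd

def scd2_upsert(dim, incoming, key, tracked, as_of):
    """Expire changed rows and append new versions (SCD Type 2).

    Simplified sketch: any difference in a tracked column counts as
    a change (NaN-vs-NaN would also register as changed).
    """
    current = dim[dim['valid_to'].isna()]
    joined = current.merge(incoming, on=key, suffixes=('', '_new'))
    old_vals = joined[tracked].values
    new_vals = joined[[c + '_new' for c in tracked]].values
    changed_keys = joined.loc[(old_vals != new_vals).any(axis=1), key]
    # Close out the old versions...
    dim.loc[dim[key].isin(changed_keys) & dim['valid_to'].isna(), 'valid_to'] = as_of
    # ...and append the new ones with an open-ended validity window
    new_rows = incoming[incoming[key].isin(changed_keys)].copy()
    new_rows['valid_from'] = as_of
    new_rows['valid_to'] = pd.NA
    return pd.concat([dim, new_rows], ignore_index=True)
```

A customer address change thus produces two rows: the old one closed at the change date, the new one open-ended, so point-in-time queries stay correct.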

Phase 2: ETL Pipeline Development

  • Extract: Pull data from ERP (SAP), CRM (Salesforce), Excel files (FTP), legacy SQL Server.
  • Transform: Standardize formats, handle nulls, deduplicate, calculate derived metrics (gross margin, YoY growth).
  • Load: Insert into warehouse (PostgreSQL) using UPSERT logic (update existing, insert new).
  • Scheduling: Automated nightly runs via Python + cron (incremental loads for performance).
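
The incremental-load step can be sketched as a high-water-mark filter (illustrative; the watermark bookkeeping shown here is an assumption, not the production scheduler):

```python
import pandas as pd

def incremental_slice(source, last_loaded):
    """Return only rows newer than the previous run's high-water mark."""
    return source[source['transaction_date'] > last_loaded]

def advance_watermark(fresh, last_loaded):
    """After a successful load, move the watermark to the newest row processed."""
    return fresh['transaction_date'].max() if not fresh.empty else last_loaded
```

Each nightly run then touches only the delta since the last run, which is what keeps the batch window short as the fact table grows.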

Phase 3: Power BI Integration

  • Connected Power BI to warehouse via DirectQuery (real-time dashboards).
  • Pre-built star schema enabled drag-and-drop reporting (no complex SQL needed).
  • Created reusable DAX measures: YTD Revenue, MTD Expenses, Customer Lifetime Value.

Technical Implementation

import pandas as pd
from sqlalchemy import create_engine, text

# 1. Extract from multiple sources
erp_data = pd.read_sql("SELECT * FROM sales", erp_connection)
crm_data = pd.read_csv("crm_export.csv")

# 2. Transform: join sources, derive metrics, normalize types
merged = pd.merge(erp_data, crm_data, on='customer_id', how='left')
merged['gross_margin'] = (merged['revenue'] - merged['cogs']) / merged['revenue']
merged['transaction_date'] = pd.to_datetime(merged['transaction_date'])

# 3. Load to warehouse: stage the batch, then a true UPSERT
# (a plain append would duplicate rows on re-run; ON CONFLICT keeps the load idempotent)
warehouse_engine = create_engine('postgresql://user:pass@localhost/warehouse')
merged.to_sql('stg_sales', warehouse_engine, if_exists='replace', index=False)
with warehouse_engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO fact_sales
        SELECT * FROM stg_sales
        ON CONFLICT (transaction_id) DO UPDATE
            SET revenue = EXCLUDED.revenue, cogs = EXCLUDED.cogs
    """))

# 4. Update dimensions: insert customers not yet present
# (SCD Type 2 versioning of changed attributes runs as a separate step)
existing_customers = pd.read_sql("SELECT customer_id FROM dim_customer", warehouse_engine)
new_customers = merged[~merged['customer_id'].isin(existing_customers['customer_id'])]
new_customers.to_sql('dim_customer', warehouse_engine, if_exists='append', index=False)

📈 Key Results & Business Impact

Quantified Outcomes

  • 70% Faster Queries: Month-end reports down from 45min to 13min (optimized Star Schema).
  • 1M+ Transactions: Warehouse handles 1.2M transactions with sub-second query response.
  • 100% Data Integrity: Eliminated discrepancies through single source of truth.
  • 5 Sources Unified: ERP, CRM, Excel, legacy SQL Server, bank feeds—all in one place.

Business Metrics Improved

  • Reconciliation Time: 10 hours/week → 1 hour/week (90% reduction).
  • Report Accuracy: Zero "version conflicts" in executive dashboards.
  • Self-Service Analytics: Business users can now build their own reports without IT help.

Strategic Insights

  • Historical Trend Analysis: Discovered 15% decline in gross margin over 2 years due to supplier price increases.
  • Customer Segmentation: Identified top 20% of customers generating 80% of revenue.
  • Inventory Optimization: Detected slow-moving SKUs tying up €500k in working capital.

🛠️ Technical Methodology

Star Schema Design Principles

  1. Fact Tables: Contain metrics (revenue, cost, quantity) + foreign keys to dimensions.
  2. Dimension Tables: Contain descriptive attributes (customer name, product category, date).
  3. Grain: Define granularity (e.g., 1 row per invoice line item).
  4. Denormalization: Dimensions are denormalized for query performance.
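
These principles show up in even a toy pandas example (hypothetical data): a narrow fact table carrying metrics plus foreign keys, joined once to a denormalized dimension, with no multi-hop joins:

```python
import pandas as pd

# Fact table: metrics + foreign keys only
fact_sales = pd.DataFrame({
    'product_id': [1, 1, 2],
    'date_id': [20240101, 20240102, 20240101],
    'revenue': [100.0, 150.0, 80.0],
})
# Dimension table: denormalized descriptive attributes
dim_product = pd.DataFrame({
    'product_id': [1, 2],
    'category': ['Hardware', 'Software'],
})

# One join per dimension, then aggregate
report = (fact_sales
          .merge(dim_product, on='product_id')
          .groupby('category', as_index=False)['revenue'].sum())
```

The same shape is what lets a BI tool resolve "revenue by category" with a single join instead of walking a normalized schema.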

ETL Best Practices

  • Incremental Loads: Only load new/changed records (not full refresh every night).
  • Data Validation: Check for nulls, duplicates, referential integrity before loading.
  • Error Handling: Log failed records to error table for manual review.
  • Idempotency: ETL can be re-run without creating duplicates (UPSERT logic).
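
The validation and error-handling practices can be combined into a small pre-load gate (a sketch; the rule set and the shape of the error records here are assumptions):

```python
import pandas as pd

def validate_batch(df, key, required):
    """Split a batch into loadable rows and rejected rows tagged with a reason."""
    # Rule 1: required columns must not be null
    bad_null = df[df[required].isna().any(axis=1)].assign(reason='null in required column')
    # Rule 2: the business key must be unique within the batch
    bad_dup = df[df.duplicated(subset=key, keep='first')].assign(reason='duplicate key')
    errors = pd.concat([bad_null, bad_dup], ignore_index=True)
    good = df.drop(bad_null.index.union(bad_dup.index))
    return good, errors
```

Rejected rows would then land in the error table for manual review instead of silently corrupting the warehouse.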

Performance Optimization

  • Indexes: B-tree indexes on foreign keys, composite indexes on frequently filtered columns.
  • Partitioning: Fact tables partitioned by month for faster queries on recent data.
  • Materialized Views: Pre-aggregated tables for common reports (daily/monthly totals).
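
Monthly partition maintenance can be scripted; a sketch that emits PostgreSQL declarative-partitioning DDL (the helper and its naming convention are illustrative, not the production tooling):

```python
from datetime import date

def month_partition_ddl(table, year, month):
    """Generate DDL for one monthly range partition of a fact table."""
    start = date(year, month, 1)
    # Range upper bound is exclusive: first day of the following month
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{table}_{year}_{month:02d}"
    return (f"CREATE TABLE {name} PARTITION OF {table} "
            f"FOR VALUES FROM ('{start}') TO ('{end}');")
```

Running this ahead of each month keeps queries on recent data scanning one small partition instead of the full fact table.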

🎓 Lessons Learned

What Worked

  • Stakeholder Alignment: Involved end-users early to define business requirements and key metrics.
  • Incremental Development: Started with Sales data, then added Finance, then Inventory (iterative approach).
  • Documentation: Created data dictionary explaining every field (reduced support tickets by 80%).

Challenges Overcome

  • Data Quality: Source systems had 12% duplicate records; implemented fuzzy matching to deduplicate.
  • Schema Changes: ERP vendor changed field names in update; added abstraction layer to handle schema evolution.
  • Performance Degradation: Queries slowed as data grew; added partitioning and archived old data.
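
The fuzzy-matching dedup mentioned above can be sketched with the standard library's `difflib` (a simplified stand-in for the production matcher; the 0.9 similarity threshold is an assumption):

```python
from difflib import SequenceMatcher

def dedupe_fuzzy(names, threshold=0.9):
    """Keep the first representative of each group of near-identical names."""
    kept = []
    for name in names:
        norm = name.strip().lower()  # cheap normalization before comparing
        if not any(SequenceMatcher(None, norm, k).ratio() >= threshold for k in kept):
            kept.append(norm)
    return kept
```

This pairwise approach is O(n²), so at scale one would block on a cheap key (e.g. first letters) before comparing; the sketch only shows the matching idea.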

🚀 Future Enhancements

  • Real-Time Streaming: Move from batch (nightly) to streaming (Apache Kafka) for real-time dashboards.
  • Data Quality Framework: Automated data profiling and anomaly detection (Great Expectations).
  • Cloud Migration: Migrate from on-prem PostgreSQL to cloud (Snowflake/BigQuery).
  • Machine Learning: Add predictive models (sales forecasting, churn prediction) directly in warehouse.

📚 Technical Stack Deep Dive

  • PostgreSQL: Relational database hosting the Star Schema, tuned for analytical (OLAP-style) queries via indexing, partitioning, and materialized views.
  • Python (Pandas, SQLAlchemy): ETL scripting, data transformations.
  • Power BI: Visualization layer, DAX measures, self-service reporting.
  • Cron: Job scheduling for nightly ETL runs.
  • Git: Version control for ETL scripts and schema migrations.