🎯 Business Challenge
The finance team was drowning in data chaos: 5 different systems (ERP, CRM, Excel, legacy databases) with no unified view, leading to:
- Conflicting Numbers: Sales report from CRM showed €2.5M revenue, accounting system showed €2.3M—which is correct?
- Manual Reconciliation: Analysts spending 10 hours/week joining data across systems in Excel.
- Slow Queries: Generating month-end reports took 45 minutes due to complex joins across the source systems' normalized transactional tables.
- No Historical Tracking: Couldn't analyze trends (e.g., "How has gross margin changed over 3 years?").
"We had 5 versions of 'the truth'. The CFO would ask a question and get 3 different answers depending on who responded."
💡 Solution Architecture
I designed and implemented a centralized data warehouse using Star Schema methodology, with automated ETL pipelines ensuring data freshness and consistency.
Phase 1: Data Warehouse Design (Star Schema)
- Fact Tables: Sales Transactions, GL Entries, Inventory Movements (grain: 1 row per transaction).
- Dimension Tables: Customers, Products, Time (Date), Accounts (Chart of Accounts), Departments.
- Slowly Changing Dimensions (SCD Type 2): Track historical changes (e.g., customer address changes over time).
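The SCD Type 2 pattern above can be sketched in pandas. The column names (`valid_from`, `valid_to`, `is_current`) and the `apply_scd2` helper are illustrative, not the production schema:

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, changes: pd.DataFrame, key: str,
               tracked: list, today: str) -> pd.DataFrame:
    """Close out changed rows and append new versions (SCD Type 2)."""
    current = dim[dim['is_current']]
    merged = current.merge(changes, on=key, suffixes=('', '_new'))
    # Keys whose tracked attributes differ need a new version
    changed = merged[
        (merged[[c + '_new' for c in tracked]].values != merged[tracked].values).any(axis=1)
    ][key]
    # Expire the old versions instead of overwriting them
    dim.loc[dim[key].isin(changed) & dim['is_current'],
            ['valid_to', 'is_current']] = [today, False]
    # Append the new versions as the current rows
    new_rows = changes[changes[key].isin(changed)].copy()
    new_rows['valid_from'], new_rows['valid_to'], new_rows['is_current'] = today, None, True
    return pd.concat([dim, new_rows], ignore_index=True)

dim = pd.DataFrame({'customer_id': [1], 'address': ['Old St 1'],
                    'valid_from': ['2020-01-01'], 'valid_to': [None], 'is_current': [True]})
changes = pd.DataFrame({'customer_id': [1], 'address': ['New Ave 9']})
dim = apply_scd2(dim, changes, 'customer_id', ['address'], '2023-06-01')
```

After the run the old address row survives with a closed `valid_to`, so queries like "gross margin by customer region as of 2021" remain answerable.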
Phase 2: ETL Pipeline Development
- Extract: Pull data from ERP (SAP), CRM (Salesforce), Excel files (FTP), legacy SQL Server.
- Transform: Standardize formats, handle nulls, deduplicate, calculate derived metrics (gross margin, YoY growth).
- Load: Insert into warehouse (PostgreSQL) using UPSERT logic (update existing, insert new).
- Scheduling: Automated nightly runs via Python + cron (incremental loads for performance).
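The incremental-load step of the nightly run can be sketched with a timestamp watermark. This uses sqlite3 in place of the production SQL Server/Postgres connections, and the `sales`/`updated_at` names are illustrative:

```python
import sqlite3

# Stand-in for a source system connection
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 100.0, "2023-01-01"), (2, 250.0, "2023-01-02"),
                 (3, 80.0, "2023-01-03")])

# Watermark: the newest timestamp already loaded into the warehouse
watermark = "2023-01-01"

# Incremental extract: only rows changed since the last run
rows = src.execute(
    "SELECT id, amount, updated_at FROM sales WHERE updated_at > ?",
    (watermark,),
).fetchall()

# Advance the watermark for the next nightly run
watermark = max(r[2] for r in rows)
```

Persisting the watermark between runs (e.g. in a small control table) is what keeps nightly loads proportional to the day's changes rather than the full history.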
Phase 3: Power BI Integration
- Connected Power BI to the warehouse via DirectQuery, so dashboards query live data instead of cached extracts.
- Pre-built star schema enabled drag-and-drop reporting (no complex SQL needed).
- Created reusable DAX measures: YTD Revenue, MTD Expenses, Customer Lifetime Value.
Technical Implementation
```python
import pandas as pd
from sqlalchemy import create_engine, text

# 1. Extract from multiple sources
erp_data = pd.read_sql("SELECT * FROM sales", erp_connection)
crm_data = pd.read_csv("crm_export.csv")

# 2. Transform: join, derive metrics, standardize types
merged = pd.merge(erp_data, crm_data, on='customer_id', how='left')
merged['gross_margin'] = (merged['revenue'] - merged['cogs']) / merged['revenue']
merged['transaction_date'] = pd.to_datetime(merged['transaction_date'])

# 3. Load to warehouse, idempotently: clear the load window first,
# then append, so a re-run never creates duplicate fact rows
warehouse_engine = create_engine('postgresql://user:pass@localhost/warehouse')
with warehouse_engine.begin() as conn:
    conn.execute(text("DELETE FROM fact_sales WHERE transaction_date >= :start"),
                 {"start": merged['transaction_date'].min()})
merged.to_sql('fact_sales', warehouse_engine, if_exists='append', index=False)

# 4. Update dimensions: append customers not yet in dim_customer
# (a full SCD Type 2 update would also close out changed rows)
existing = pd.read_sql("SELECT customer_id FROM dim_customer", warehouse_engine)
new_customers = merged[~merged['customer_id'].isin(existing['customer_id'])]
new_customers.to_sql('dim_customer', warehouse_engine, if_exists='append', index=False)
```

📈 Key Results & Business Impact
Quantified Outcomes
- 70% Faster Queries: Month-end reports down from 45min to 13min (optimized Star Schema).
- 1M+ Transactions: Warehouse handles 1.2M transactions with sub-second query response.
- 100% Data Integrity: Eliminated discrepancies through single source of truth.
- 5 Sources Unified: ERP, CRM, Excel, legacy SQL Server, bank feeds—all in one place.
Business Metrics Improved
- Reconciliation Time: 10 hours/week → 1 hour/week (90% reduction).
- Report Accuracy: Zero "version conflicts" in executive dashboards.
- Self-Service Analytics: Business users can now build their own reports without IT help.
Strategic Insights
- Historical Trend Analysis: Discovered 15% decline in gross margin over 2 years due to supplier price increases.
- Customer Segmentation: Identified top 20% of customers generating 80% of revenue.
- Inventory Optimization: Detected slow-moving SKUs tying up €500k in working capital.
🛠️ Technical Methodology
Star Schema Design Principles
- Fact Tables: Contain metrics (revenue, cost, quantity) + foreign keys to dimensions.
- Dimension Tables: Contain descriptive attributes (customer name, product category, date).
- Grain: Define granularity (e.g., 1 row per invoice line item).
- Denormalization: Dimensions are denormalized for query performance.
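The payoff of these principles is that reporting becomes one join per dimension plus an aggregation. A minimal sketch in pandas, with illustrative table and column names:

```python
import pandas as pd

# Illustrative star: one fact table with surrogate keys into denormalized dimensions
fact_sales = pd.DataFrame({'product_key': [1, 1, 2],
                           'date_key': [20230101, 20230102, 20230101],
                           'revenue': [100.0, 150.0, 80.0]})
dim_product = pd.DataFrame({'product_key': [1, 2],
                            'category': ['Hardware', 'Software']})

# A typical star-schema query: join fact to dimension, aggregate by attribute
report = (fact_sales.merge(dim_product, on='product_key')
          .groupby('category', as_index=False)['revenue'].sum())
```

Because every dimension attribute is one hop from the fact table, BI tools can generate this join automatically, which is what enables the drag-and-drop reporting mentioned above.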
ETL Best Practices
- Incremental Loads: Only load new/changed records (not full refresh every night).
- Data Validation: Check for nulls, duplicates, referential integrity before loading.
- Error Handling: Log failed records to error table for manual review.
- Idempotency: ETL can be re-run without creating duplicates (UPSERT logic).
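The idempotent UPSERT can be sketched with `INSERT ... ON CONFLICT DO UPDATE`, which PostgreSQL and SQLite share. This demo uses sqlite3 for self-containment; the `fact_sales`/`txn_id` names are illustrative:

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (txn_id INTEGER PRIMARY KEY, revenue REAL)")

def load(rows):
    # UPSERT: re-running the same batch updates rows instead of duplicating them
    wh.executemany(
        "INSERT INTO fact_sales (txn_id, revenue) VALUES (?, ?) "
        "ON CONFLICT(txn_id) DO UPDATE SET revenue = excluded.revenue",
        rows,
    )

load([(1, 100.0), (2, 250.0)])
load([(1, 120.0), (2, 250.0)])   # re-run with a corrected value: no duplicates

count, = wh.execute("SELECT COUNT(*) FROM fact_sales").fetchone()
rev, = wh.execute("SELECT revenue FROM fact_sales WHERE txn_id = 1").fetchone()
```

The natural key (`txn_id` here) is what makes re-runs safe: a failed nightly job can simply be restarted without manual cleanup.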
Performance Optimization
- Indexes: B-tree indexes on foreign keys, composite indexes on frequently filtered columns.
- Partitioning: Fact tables partitioned by month for faster queries on recent data.
- Materialized Views: Pre-aggregated tables for common reports (daily/monthly totals).
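A minimal sketch of the index and pre-aggregation ideas, again in sqlite3 for self-containment. SQLite has no materialized views, so the summary is a plain table rebuilt by the ETL; in PostgreSQL the equivalent is `CREATE MATERIALIZED VIEW` plus `REFRESH MATERIALIZED VIEW` after each load:

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (date_key TEXT, revenue REAL)")
wh.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [("2023-01", 100.0), ("2023-01", 150.0), ("2023-02", 80.0)])

# Index on the commonly filtered column (same CREATE INDEX syntax in Postgres)
wh.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

# Pre-aggregated monthly totals, so common reports skip the fact-table scan
wh.execute("""CREATE TABLE agg_monthly_sales AS
              SELECT date_key, SUM(revenue) AS revenue
              FROM fact_sales GROUP BY date_key""")
totals = wh.execute(
    "SELECT date_key, revenue FROM agg_monthly_sales ORDER BY date_key"
).fetchall()
```

The trade-off is freshness: pre-aggregates are only as current as the last ETL run, which is acceptable for nightly-batch reporting.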
🎓 Lessons Learned
What Worked
- Stakeholder Alignment: Involved end-users early to define business requirements and key metrics.
- Incremental Development: Started with Sales data, then added Finance, then Inventory (iterative approach).
- Documentation: Created data dictionary explaining every field (reduced support tickets by 80%).
Challenges Overcome
- Data Quality: Source systems had 12% duplicate records; implemented fuzzy matching to deduplicate.
- Schema Changes: ERP vendor changed field names in update; added abstraction layer to handle schema evolution.
- Performance Degradation: Queries slowed as data grew; added partitioning and archived old data.
🚀 Future Enhancements
- Real-Time Streaming: Move from batch (nightly) to streaming (Apache Kafka) for real-time dashboards.
- Data Quality Framework: Automated data profiling and anomaly detection (Great Expectations).
- Cloud Migration: Migrate from on-prem PostgreSQL to cloud (Snowflake/BigQuery).
- Machine Learning: Add predictive models (sales forecasting, churn prediction) directly in warehouse.
📚 Technical Stack Deep Dive
- PostgreSQL: relational database serving as the warehouse, tuned for analytical queries (star schema, indexes, partitioning).
- Python (Pandas, SQLAlchemy): ETL scripting, data transformations.
- Power BI: Visualization layer, DAX measures, self-service reporting.
- Cron: Job scheduling for nightly ETL runs.
- Git: Version control for ETL scripts and schema migrations.