Data Science / Business Intelligence

Customer Churn & Retention Analytics

Predicting customer lifetime and churn risk with survival analysis and gradient boosting.

🎯 Business Challenge

The company was experiencing a 20% annual customer churn rate, but the existing CRM system only flagged customers after they had already cancelled. This reactive approach meant lost revenue and retention budget wasted on customers who were already gone.

Key challenges included:

  • Time-to-Churn Blindness: Not knowing when a customer would churn made it impossible to time retention campaigns effectively.
  • Feature Overload: 80+ customer attributes with no clear indication of which actually predicted churn.
  • Censored Data Problem: Many customers were still active (hadn't churned yet). A standard classifier has to treat them as non-churners, although all we really know is that they have not churned so far.

"We were spending €50k/month on retention campaigns targeting customers who had already mentally checked out 6 months prior."

💡 Solution Architecture

I implemented a two-phase hybrid system combining survival analysis (for temporal prediction) with gradient boosting (for feature importance and risk scoring).

Phase 1: Survival Analysis (Kaplan-Meier & Cox Regression)

  • Kaplan-Meier Curves: Estimated the probability of survival (retention) over time for different customer segments.
  • Cox Proportional Hazards: Estimated how strongly each feature (contract type, support tickets, payment delays) raised or lowered the churn hazard.
  • This phase answered: "How long until this customer churns?"
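
To make the Kaplan-Meier step and its handling of censored customers concrete, here is a minimal dependency-free sketch of the estimator. It is not the lifelines implementation used in the project; the function name and the toy tenures are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] for every time t at which at least one churn was observed.

    durations: tenure per customer; events: 1 = churn observed, 0 = still active (censored).
    """
    churned_at = Counter(t for t, e in zip(durations, events) if e)
    left_at = Counter(durations)  # customers leaving the risk set at t (churned or censored)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(left_at):
        d = churned_at.get(t, 0)
        if d:
            # Product-limit step: multiply by the share of at-risk customers who survive t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= left_at[t]
    return curve


# Hypothetical tenures in months (illustrative only, not project data)
durations = [2, 3, 3, 5, 6, 6, 8, 12]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"month {t}: retention probability {s:.3f}")
```

Censored customers never trigger a drop in the curve, but they stay in the at-risk count until their last observed month. That is the information a plain churned/retained label throws away.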

Phase 2: Gradient Boosting Classifier

  • Trained on a binary outcome (churned vs retained) using engineered features from survival analysis.
  • Features included: days since last login, support ticket velocity, payment history volatility.
  • This phase answered: "Why is this customer likely to churn?"

Technical Implementation

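from lifelines import KaplanMeierFitter, CoxPHFitter
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Assumed inputs: `data` is a DataFrame with one row per customer, including
# `tenure` (duration) and `churned` (1 = churn observed, 0 = still active);
# `features` and `engineered_features` are lists of column names.

# Survival Analysis - Time to Churn
kmf = KaplanMeierFitter()
kmf.fit(durations=data['tenure'], event_observed=data['churned'])

# Cox Regression - hazard ratio per feature
cph = CoxPHFitter()
cph.fit(data[features + ['tenure', 'churned']], duration_col='tenure', event_col='churned')

# Gradient Boosting for Churn Probability
X = data[engineered_features]
y = data['churned']

# Illustrative stratified hold-out (split parameters are assumptions);
# the time-based validation described under Model Evaluation would replace it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier(n_estimators=200, max_depth=5)
model.fit(X_train, y_train)

# Risk Score: predicted churn probability. It is a 90-day risk only if
# `churned` is labelled as "churned within 90 days of the scoring date".
data['churn_risk_90d'] = model.predict_proba(X)[:, 1]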

📈 Key Results & Business Impact

Quantified Outcomes

  • 30% Revenue at Risk: Identified high-risk customer segments representing €1.2M in annual recurring revenue.
  • AUC of 0.87: Significantly outperformed the previous rule-based system (0.62). AUC measures how well the model ranks churners above non-churners; it is not a percentage of correct predictions.
  • 3-Month Warning Window: Could predict churn 90 days in advance with 75% accuracy, enabling proactive intervention.
  • €250k Annual Savings: Reduced wasted retention spending by targeting only high-probability saves.
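
Since AUC and accuracy are easy to confuse, the sketch below computes both on a small made-up sample with a 20% churn rate. The labels and scores are hypothetical and only illustrate how the two metrics can diverge; in the project itself a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def roc_auc(y_true, scores):
    """Probability that a random churner is scored above a random non-churner (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of customers classified correctly at a fixed probability threshold."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Hypothetical sample: 2 churners out of 10 customers
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.40, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02]

auc = roc_auc(y_true, scores)
acc = accuracy(y_true, scores)
print(f"AUC = {auc:.4f}, accuracy at 0.5 = {acc:.2f}")
```

Here the ranking is nearly perfect, yet at a 0.5 threshold no churner is flagged at all and accuracy still looks respectable. This is why the ranking metric and the operating threshold are reported separately.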

Strategic Insights Uncovered

  • Support Ticket Paradox: Customers with 0 tickets were at higher risk than those with 1-2 (indicating disengagement).
  • Payment Method Predictor: Customers using manual bank transfers had 3x higher churn than auto-debit users.
  • Onboarding Critical Window: 80% of churns occurred within the first 6 months, pointing to onboarding issues.

🛠️ Technical Methodology

Feature Engineering

  1. Recency Metrics: Days since last purchase, login, support contact.
  2. Frequency Metrics: Average monthly transactions, support tickets per year.
  3. Monetary Metrics: ARPU (Average Revenue Per User), payment volatility.
  4. Derived Features: "Engagement Score" = (logins × transactions) / tenure.
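
A minimal sketch of the recency and engagement features listed above. The field names, the snapshot date, and the zero-tenure fallback are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Customer:
    # Hypothetical field names
    last_login: date
    logins: int
    transactions: int
    tenure_months: int


def recency_days(last_event: date, as_of: date) -> int:
    """Days elapsed between the last event and the scoring date."""
    return (as_of - last_event).days


def engagement_score(c: Customer) -> float:
    """(logins × transactions) / tenure; returns 0.0 for zero tenure (an assumed convention)."""
    if c.tenure_months <= 0:
        return 0.0
    return c.logins * c.transactions / c.tenure_months


as_of = date(2024, 6, 30)  # hypothetical scoring date
c = Customer(last_login=date(2024, 6, 1), logins=40, transactions=6, tenure_months=12)

recency = recency_days(c.last_login, as_of)
score = engagement_score(c)
print(recency, score)
```

Computing recency against a fixed scoring date, rather than "today", keeps the features reproducible when the model is retrained on historical snapshots.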

Model Evaluation

  • Stratified K-Fold Cross-Validation: Kept the 20% churn rate consistent across folds, so that every fold's metrics reflect the real class imbalance.
  • Precision-Recall Trade-off: Optimized for recall (catching as many potential churners as possible), even at the cost of some false positives.
  • Time-Based Validation: Trained on months 1-10, validated on months 11-12 to simulate real deployment.
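
The time-based split and the recall-first threshold choice can be sketched as follows. The helper names, the `month` field, and the toy hold-out scores are hypothetical; the point is only to show the mechanics of picking the highest threshold that still meets a recall target.

```python
def time_split(rows, last_train_month):
    """Split chronologically so validation data is strictly later than training data."""
    train = [r for r in rows if r["month"] <= last_train_month]
    valid = [r for r in rows if r["month"] > last_train_month]
    return train, valid


def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as a churner."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def threshold_for_recall(y_true, scores, target_recall):
    """Highest score threshold whose recall meets the target, or None if none does."""
    for t in sorted(set(scores), reverse=True):
        precision, recall = precision_recall(y_true, scores, t)
        if recall >= target_recall:
            return t, precision, recall
    return None


# Months 1-10 for training, 11-12 for validation
rows = [{"month": m} for m in range(1, 13)]
train, valid = time_split(rows, last_train_month=10)

# Separate toy hold-out labels and model scores (illustrative only)
y_holdout = [1, 0, 1, 0, 0, 1, 0, 0]
holdout_scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
choice = threshold_for_recall(y_holdout, holdout_scores, target_recall=1.0)
print(len(train), len(valid), choice)
```

Lowering the recall target moves the threshold up and trades missed churners for fewer false positives, which is the lever the retention team can tune.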

Deployment

  • Weekly batch scoring of entire customer base (~50k customers).
  • Top 500 highest-risk customers flagged for CRM team review.
  • Risk scores integrated into Power BI dashboards for executive reporting.
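
The flagging step of the weekly batch could look roughly like this. The customer IDs and scores are made up, and `flag_top_n` is a hypothetical helper rather than the production job.

```python
import heapq


def flag_top_n(risk_by_customer, n=500):
    """Return the n (customer_id, risk) pairs with the highest risk, highest first."""
    return heapq.nlargest(n, risk_by_customer.items(), key=lambda kv: kv[1])


# Hypothetical weekly scores
weekly_scores = {"C001": 0.12, "C002": 0.91, "C003": 0.47, "C004": 0.78}
flagged = flag_top_n(weekly_scores, n=2)
print(flagged)
```

`heapq.nlargest` avoids sorting the full customer base when only a small review list is needed, although at roughly 50k customers a full sort would also be cheap.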

🎓 Lessons Learned

What Worked

  • Survival Analysis First: Understanding when customers churn helped prioritize feature engineering.
  • Business Alignment: Collaborated with the retention team to set "actionable" risk thresholds (top 10% vs top 1%).
  • Explainability: SHAP values helped explain individual predictions to the CRM team ("This customer is high-risk because...").

Challenges Overcome

  • Data Quality Issues: 15% of customer records had missing tenure data, requiring imputation via account creation dates.
  • Class Imbalance: Only 20% churn rate meant the model initially over-predicted "no churn". Fixed with SMOTE oversampling.
  • Feature Leakage: Initial model included "last payment date" which leaked information about churn (churned customers obviously had old last payments).
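
The tenure imputation mentioned above can be sketched as below. This assumes tenure is measured in days and that the end of the observation period is the churn date for churned customers and the snapshot date otherwise; both are assumptions, as is the function name.

```python
from datetime import date


def impute_tenure_days(tenure_days, created_on, end_date):
    """Keep a recorded tenure; otherwise derive it from the account creation date.

    end_date: churn date for churned customers, snapshot date for active ones (assumed convention).
    """
    if tenure_days is not None:
        return tenure_days
    return (end_date - created_on).days


snapshot = date(2024, 1, 15)  # hypothetical snapshot date
imputed = impute_tenure_days(None, created_on=date(2023, 1, 15), end_date=snapshot)
kept = impute_tenure_days(200, created_on=date(2023, 1, 15), end_date=snapshot)
print(imputed, kept)
```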

🚀 Future Enhancements

  • Real-Time Scoring: Move from weekly batch to event-triggered scoring (e.g., score immediately after a support ticket closure).
  • Causal Inference: Use uplift modeling to identify which customers would actually respond to retention offers vs those who would stay anyway.
  • Deep Learning (LSTM): Capture temporal sequences (e.g., "login pattern suddenly changed") using recurrent neural networks.
  • A/B Testing Framework: Measure actual retention lift from model-driven campaigns vs control groups.
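
As a starting point for the A/B testing idea, retention lift can be measured as the difference in retention rates between a treated and a control group, with a two-proportion z-statistic as a rough significance check. The group sizes and counts below are invented for illustration only.

```python
from math import sqrt


def retention_lift(retained_treated, n_treated, retained_control, n_control):
    """Return (lift, z): difference in retention rates and a pooled two-proportion z-statistic."""
    p_t = retained_treated / n_treated
    p_c = retained_control / n_control
    lift = p_t - p_c
    pooled = (retained_treated + retained_control) / (n_treated + n_control)
    se = sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    z = lift / se if se else 0.0
    return lift, z


# Hypothetical campaign: 250 high-risk customers received an offer, 250 did not
lift, z = retention_lift(retained_treated=200, n_treated=250,
                         retained_control=175, n_control=250)
print(f"lift = {lift:.3f}, z = {z:.2f}")
```

Randomly assigning flagged customers to the two groups is what makes this lift attributable to the campaign rather than to the model's selection.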

📚 Technical Stack Deep Dive

  • Lifelines: Python library for survival analysis (Kaplan-Meier, Cox regression).
  • Scikit-Learn: Gradient Boosting, hyperparameter tuning (GridSearchCV), model evaluation.
  • Pandas: Data wrangling, feature engineering, cohort analysis.
  • Matplotlib/Seaborn: Survival curves, feature importance plots, risk distributions.
  • SHAP: Model explainability for stakeholder communication.