🎯 Business Challenge
Financial fraud is one of the most asymmetric problems in data science: the dataset is 99.87% legitimate transactions, yet the cost of missing a fraudulent one vastly outweighs the cost of a false alarm. Standard supervised classifiers fail here because they optimise for accuracy — and predicting "not fraud" every single time gives 99.87% accuracy while catching zero fraud.
The core challenges were:
- Extreme Class Imbalance: Only 0.13% of the 6.3M PaySim transactions are fraudulent, making traditional classifiers useless out of the box.
- No Ground Truth at Inference Time: A real payment system cannot wait for a label. The model must make an unsupervised decision based solely on transaction behaviour.
- Latency Requirements: Production fraud scoring must happen in milliseconds — not seconds — to avoid blocking legitimate payments.
- Explainability Gap: A fraud flag without a risk level and anomaly score is operationally useless for analysts who need to triage alerts.
"A model that predicts 'not fraud' every time is 99.87% accurate — and completely worthless."
💡 Solution Architecture
I designed a three-component unsupervised ensemble served behind a production FastAPI layer with full MLflow experiment tracking. The ensemble approach was chosen specifically to reduce the false positive rate that any single anomaly detector would produce.
Component 1 — Isolation Forest
- Detects anomalies by isolating observations through random feature splits.
- Particularly effective for high-dimensional financial data with irregular distributions.
- Caught 285 true fraud cases in the test set independently.
Component 2 — PCA Reconstruction Error
- Compresses the transaction feature space and reconstructs it; legitimate transactions reconstruct well, fraudulent ones produce high residual error.
- Caught 301 true fraud cases independently — stronger on TRANSFER and CASH-OUT patterns.
Component 3 — Voting Ensemble (Final Model)
- A transaction is flagged as fraud only when multiple detectors agree, dramatically reducing false alarms.
- Final result: 318 true fraud detections with 8.3% precision — meaning 1 in 12 alerts is actionable, a significant improvement over random flagging.
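The agreement rule above can be sketched on synthetic data. This is a minimal illustration assuming an AND-style vote between the two detectors; the production ensemble's exact voting rule, thresholds, and contamination settings may differ:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 10))                    # "legitimate" behaviour
X_test = np.vstack([rng.normal(size=(95, 10)),           # 95 normal rows
                    rng.normal(loc=6.0, size=(5, 10))])  # 5 clearly anomalous rows

# Detector 1: Isolation Forest (random-split isolation)
iso = IsolationForest(contamination=0.05, random_state=42).fit(X_train)
iso_flag = iso.predict(X_test) == -1          # -1 marks an anomaly

# Detector 2: PCA reconstruction error
pca = PCA(n_components=3).fit(X_train)
recon = pca.inverse_transform(pca.transform(X_test))
err = ((X_test - recon) ** 2).sum(axis=1)
pca_flag = err > np.quantile(err, 0.95)       # top 5% reconstruction residuals

# Voting: flag only when both detectors agree, cutting false alarms
ensemble_flag = iso_flag & pca_flag
print(ensemble_flag.sum(), "flagged of", len(X_test))
```

Requiring agreement can only shrink the flagged set relative to either detector alone, which is exactly the false-positive reduction the ensemble is designed for.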
MLOps Pipeline
PaySim Dataset → Feature Engineering → Isolation Forest + PCA → Voting Ensemble → FastAPI → /predict

Feature Engineering

```python
def to_X(t: TransactionIn):
    # Encode transaction type; unknown types map to -1
    type_encoded = TYPE_MAP.get(t.type.upper(), -1)
    # Engineered balance features
    balance_diff_orig = t.newbalanceOrig - t.oldbalanceOrg
    balance_diff_dest = t.newbalanceDest - t.oldbalanceDest
    amount_vs_balance = t.amount / t.oldbalanceOrg if t.oldbalanceOrg > 0 else t.amount
    return np.array([[
        t.amount, t.step, type_encoded,
        t.oldbalanceOrg, t.newbalanceOrig,
        t.oldbalanceDest, t.newbalanceDest,
        balance_diff_orig, balance_diff_dest, amount_vs_balance
    ]])
```

Three engineered features (`balance_diff_orig`, `balance_diff_dest`, `amount_vs_balance`) were the most predictive — a TRANSFER that drains the origin balance to zero is the single strongest fraud signal in PaySim data.
🚀 API Design & Production Readiness
Endpoints
- GET /health — Service health, model load status, uptime.
- GET /model/info — Active version, feature list, architecture description.
- POST /predict — Score a single transaction. Returns `is_fraud`, `fraud_probability`, `risk_level` (LOW / MEDIUM / HIGH / CRITICAL), `anomaly_score`, and `inference_ms`.
- POST /predict/batch — Score up to 500 transactions in a single call, returning aggregate fraud rate alongside per-transaction scores.
- POST /model/reload — Hot-reload the model bundle from disk without restarting the service.
Risk Stratification
```python
def risk(probability: float) -> str:
    if probability >= 0.80: return "CRITICAL"
    if probability >= 0.60: return "HIGH"
    if probability >= 0.35: return "MEDIUM"
    return "LOW"
```

Example Request / Response
```bash
# Request
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 9000, "step": 2, "type": "TRANSFER",
       "oldbalanceOrg": 9000, "newbalanceOrig": 0,
       "oldbalanceDest": 0, "newbalanceDest": 0}'
```

Response:

```json
{
  "transaction_id": "TX-1748291234567",
  "is_fraud": true,
  "fraud_probability": 0.9241,
  "risk_level": "CRITICAL",
  "anomaly_score": 0.4241,
  "model_version": "v1.0.0",
  "inference_ms": 3.2,
  "timestamp": "2025-01-15T10:30:00Z"
}
```

📈 Model Performance & Results
Ensemble vs. Individual Detectors
- Isolation Forest alone: 285 true positives, ~10.0% precision.
- PCA Reconstruction alone: 301 true positives, ~10.6% precision.
- Voting Ensemble (final): 318 true positives, 8.3% precision — highest absolute recall while maintaining operationally useful precision.
Key Metrics at Default Threshold
- Recall: 64.6% — captured 318 of 492 actual fraud cases in the test set.
- Precision: 8.3% — acceptable for an unsupervised system on 0.13% base rate data; every flagged batch is reviewed by a human analyst.
- False Positives: 3,516 — legitimate transactions incorrectly flagged; filtered downstream by the risk stratification layer (CRITICAL/HIGH alerts only for auto-action).
Why Recall Is the Right Metric Here
In fraud detection, a missed fraud (false negative) carries a direct financial loss. A false alarm (false positive) costs analyst time — far cheaper. Optimising for recall at the cost of precision is the correct business trade-off, with the risk stratification layer serving as the triage mechanism.
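As a quick sanity check, the headline numbers above are internally consistent. Using only the counts reported in this section:

```python
# Confusion-matrix counts reported for the voting ensemble on the test set
tp = 318         # true positives: fraud cases caught
fn = 492 - tp    # false negatives: 492 actual frauds minus those caught
fp = 3516        # false positives: legitimate transactions flagged

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"recall={recall:.1%}, precision={precision:.1%}")  # recall=64.6%, precision=8.3%
```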
🛠️ Technical Methodology
Dataset — PaySim
- 6.3 million rows of simulated mobile money transactions (Lopez-Rojas et al., 2016 — CC0 Public Domain).
- 5 transaction types: CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER.
- Fraud rate: 0.13% — only TRANSFER and CASH-OUT transactions are ever fraudulent in PaySim.
Model Persistence & Versioning
- The full ensemble bundle (all three models + metadata) is serialised with Joblib to a single `.joblib` file.
- Every training run is tracked in MLflow (parameters, metrics, artefacts) using a local SQLite backend.
- The API supports hot-reloading via `POST /model/reload` — no downtime required for model updates.
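A minimal sketch of the single-bundle persistence described above, assuming the bundle is a plain dict of fitted models plus metadata. The key names and directory here are illustrative, not the project's actual schema:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(500, 10))

# Serialise the ensemble components plus metadata as one bundle
bundle = {
    "iso_forest": IsolationForest(random_state=0).fit(X),
    "pca": PCA(n_components=3).fit(X),
    "metadata": {"version": "v1.0.0", "n_features": 10},
}
path = os.path.join(tempfile.mkdtemp(), "fraud_ensemble.joblib")
joblib.dump(bundle, path)

# A hot-reload endpoint can simply re-run this load step against the same path
loaded = joblib.load(path)
print(loaded["metadata"]["version"])  # v1.0.0
```

Keeping preprocessing objects and models in one artefact means a reload can never pair a new model with stale preprocessing state.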
Lifespan Management (FastAPI)
```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("🚀 Starting Fraud Detection API...")
    load_model()  # loads bundle from disk at startup
    yield
    # graceful shutdown logic here
```

🎓 Lessons Learned
What Worked
- Ensemble over Single Model: No single unsupervised detector was consistently better across all fraud patterns. The voting approach gave the best coverage.
- Feature Engineering Over Raw Features: The three engineered balance features (`balance_diff_orig`, `balance_diff_dest`, `amount_vs_balance`) contributed more signal than the raw balances alone.
- Risk Stratification: Mapping a continuous anomaly score to four discrete risk levels (LOW → CRITICAL) made the output operationally useful for analysts without requiring them to interpret raw probabilities.
Challenges Overcome
- Unsupervised Threshold Calibration: Without labelled data at inference time, setting the fraud threshold required backtesting against the PaySim test set and manually tuning the contamination parameter.
- Probability Calibration: Isolation Forest returns decision function scores, not true probabilities. I mapped these via `prob = clip(0.5 - raw_score, 0, 1)` — a pragmatic approximation that produces monotonically consistent risk rankings.
- Batch Efficiency: The naive approach of looping `predict()` per transaction was ~40× slower than stacking all inputs into a single NumPy matrix and calling the ensemble once.
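The batch speed-up comes from plain NumPy vectorisation. A small sketch on synthetic data with an illustrative model, showing that one stacked call produces exactly the same results as a per-row loop:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
model = IsolationForest(random_state=1).fit(rng.normal(size=(1000, 10)))
batch = rng.normal(size=(500, 10))   # 500 transactions, the batch-endpoint limit

# Slow: one predict() call per transaction
slow = np.array([model.predict(row.reshape(1, -1))[0] for row in batch])

# Fast: stack all rows into one matrix and call predict once
fast = model.predict(batch)

print(np.array_equal(slow, fast))  # True: same labels, far fewer calls
```

The per-call overhead (input validation, tree traversal setup) dominates single-row scoring, which is where the reported ~40× gap comes from.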
🚀 Future Enhancements
- Supervised Fine-Tuning: Once a labelled dataset accumulates from analyst reviews, add an XGBoost layer trained on confirmed fraud labels.
- Real-Time Streaming: Replace the REST batch endpoint with a Kafka consumer for true sub-second event-driven scoring.
- Drift Detection: Monitor feature distributions over time and trigger automatic retraining when statistical drift exceeds a threshold.
- SHAP Integration: Add per-prediction feature attribution so analysts understand why a transaction was flagged, not just that it was.
- Docker + CI/CD: Containerise the service and add a GitHub Actions pipeline that retrains, evaluates, and deploys on new data automatically.
📚 Technical Stack Deep Dive
- FastAPI: Async REST framework with automatic OpenAPI docs, Pydantic validation, and lifespan management.
- Scikit-Learn: Isolation Forest, PCA, and the VotingClassifier ensemble wrapper.
- MLflow: Experiment tracking, model versioning, and artefact storage (SQLite backend for portability).
- Joblib: Efficient serialisation of the full ensemble bundle including preprocessing objects.
- Pydantic: Strict input validation with field constraints (`amount > 0`, `step >= 1`) — rejects malformed requests before they reach the model.
- NumPy: Vectorised feature construction and batch matrix assembly for low-latency inference.
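The validation behaviour can be sketched with a minimal Pydantic model. Field names follow the PaySim schema used throughout this write-up; the defaults are illustrative, not the project's actual definitions:

```python
from pydantic import BaseModel, Field, ValidationError

class TransactionIn(BaseModel):
    amount: float = Field(gt=0)    # must be strictly positive
    step: int = Field(ge=1)        # simulation time step, starts at 1
    type: str
    oldbalanceOrg: float = 0.0
    newbalanceOrig: float = 0.0
    oldbalanceDest: float = 0.0
    newbalanceDest: float = 0.0

# A well-formed transaction validates cleanly
ok = TransactionIn(amount=9000, step=2, type="TRANSFER", oldbalanceOrg=9000)
print(ok.amount)

# Malformed input is rejected before it can reach the model;
# FastAPI turns this into a 422 response automatically
try:
    TransactionIn(amount=-5, step=0, type="PAYMENT")
except ValidationError as e:
    print(len(e.errors()))  # 2 — one violation each for amount and step
```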