🎯 Business Challenge
Financial fraud is one of the most asymmetric problems in data science: the dataset is 99.87% legitimate transactions, yet the cost of missing a fraudulent one vastly outweighs the cost of a false alarm. Standard supervised classifiers fail here because they optimise for accuracy — and predicting "not fraud" every single time gives 99.87% accuracy while catching zero fraud.
The core challenges were:
- Extreme Class Imbalance: Only 0.13% of the 6.3M PaySim transactions are fraudulent, making traditional classifiers useless out of the box.
- No Ground Truth at Inference Time: A real payment system cannot wait for a label. The model must make an unsupervised decision based solely on transaction behaviour.
- Latency Requirements: Production fraud scoring must happen in milliseconds — not seconds — to avoid blocking legitimate payments.
- Explainability Gap: A fraud flag without a risk level and anomaly score is operationally useless for analysts who need to triage alerts.
"A model that predicts 'not fraud' every time is 99.87% accurate — and completely worthless."
💡 Solution Architecture
I designed a three-component unsupervised ensemble served behind a production FastAPI layer with full MLflow experiment tracking. The ensemble approach was chosen specifically to reduce the false positive rate that any single anomaly detector would produce.
Component 1 — Isolation Forest
- Detects anomalies by isolating observations through random feature splits.
- Particularly effective for high-dimensional financial data with irregular distributions.
- Caught 285 true fraud cases in the test set independently.
Component 2 — PCA Reconstruction Error
- Compresses the transaction feature space and reconstructs it; legitimate transactions reconstruct well, fraudulent ones produce high residual error.
- Caught 301 true fraud cases independently — stronger on TRANSFER and CASH-OUT patterns.
Component 3 — Voting Ensemble (Final Model)
- A transaction is flagged as fraud only when multiple detectors agree, dramatically reducing false alarms.
- Final result: 318 true fraud detections with 8.3% precision — meaning 1 in 12 alerts is actionable, a significant improvement over random flagging.
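The agreement rule above can be sketched on synthetic data. This is a minimal illustration assuming an AND-style vote between the two detectors; the production ensemble's exact voting rule, thresholds, and contamination settings may differ:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 10))                    # "legitimate" behaviour
X_test = np.vstack([rng.normal(size=(95, 10)),           # 95 normal rows
                    rng.normal(loc=6.0, size=(5, 10))])  # 5 clearly anomalous rows

# Detector 1: Isolation Forest (random-split isolation)
iso = IsolationForest(contamination=0.05, random_state=42).fit(X_train)
iso_flag = iso.predict(X_test) == -1          # -1 marks an anomaly

# Detector 2: PCA reconstruction error
pca = PCA(n_components=3).fit(X_train)
recon = pca.inverse_transform(pca.transform(X_test))
err = ((X_test - recon) ** 2).sum(axis=1)
pca_flag = err > np.quantile(err, 0.95)       # top 5% reconstruction residuals

# Voting: flag only when both detectors agree, cutting false alarms
ensemble_flag = iso_flag & pca_flag
print(ensemble_flag.sum(), "flagged of", len(X_test))
```

Requiring agreement can only shrink the flagged set relative to either detector alone, which is exactly the false-positive reduction the ensemble is designed for.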
MLOps Pipeline
PaySim Dataset → Feature Engineering → Isolation Forest + PCA → Voting Ensemble → FastAPI → /predict

Feature Engineering

```python
def to_X(t: TransactionIn):
    # Encode transaction type; unknown types map to -1
    type_encoded = TYPE_MAP.get(t.type.upper(), -1)
    # Engineered balance features
    balance_diff_orig = t.newbalanceOrig - t.oldbalanceOrg
    balance_diff_dest = t.newbalanceDest - t.oldbalanceDest
    amount_vs_balance = t.amount / t.oldbalanceOrg if t.oldbalanceOrg > 0 else t.amount
    return np.array([[
        t.amount, t.step, type_encoded,
        t.oldbalanceOrg, t.newbalanceOrig,
        t.oldbalanceDest, t.newbalanceDest,
        balance_diff_orig, balance_diff_dest, amount_vs_balance
    ]])
```

Three engineered features (`balance_diff_orig`, `balance_diff_dest`, `amount_vs_balance`) were the most predictive — a TRANSFER that drains the origin balance to zero is the single strongest fraud signal in PaySim data.
🚀 API Design & Production Readiness
Endpoints
- GET /health — Service health, model load status, uptime.
- GET /model/info — Active version, feature list, architecture description.
- POST /predict — Score a single transaction. Returns `is_fraud`, `fraud_probability`, `risk_level` (LOW / MEDIUM / HIGH / CRITICAL), `anomaly_score`, and `inference_ms`.
- POST /predict/batch — Score up to 500 transactions in a single call, returning aggregate fraud rate alongside per-transaction scores.
- POST /model/reload — Hot-reload the model bundle from disk without restarting the service.
Risk Stratification
```python
def risk(probability: float) -> str:
    if probability >= 0.80: return "CRITICAL"
    if probability >= 0.60: return "HIGH"
    if probability >= 0.35: return "MEDIUM"
    return "LOW"
```

Example Request / Response
```bash
# Request
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 9000, "step": 2, "type": "TRANSFER",
       "oldbalanceOrg": 9000, "newbalanceOrig": 0,
       "oldbalanceDest": 0, "newbalanceDest": 0}'
```

Response:

```json
{
  "transaction_id": "TX-1748291234567",
  "is_fraud": true,
  "fraud_probability": 0.9241,
  "risk_level": "CRITICAL",
  "anomaly_score": 0.4241,
  "model_version": "v1.0.0",
  "inference_ms": 3.2,
  "timestamp": "2025-01-15T10:30:00Z"
}
```

📈 Model Performance & Results
Ensemble vs. Individual Detectors
- Isolation Forest alone: 285 true positives, ~10.0% precision.
- PCA Reconstruction alone: 301 true positives, ~10.6% precision.
- Voting Ensemble (final): 318 true positives, 8.3% precision — highest absolute recall while maintaining operationally useful precision.
Key Metrics at Default Threshold
- Recall: 64.6% — captured 318 of 492 actual fraud cases in the test set.
- Precision: 8.3% — acceptable for an unsupervised system on 0.13% base rate data; every flagged batch is reviewed by a human analyst.
- False Positives: 3,516 — legitimate transactions incorrectly flagged; filtered downstream by the risk stratification layer (CRITICAL/HIGH alerts only for auto-action).
Why Recall Is the Right Metric Here
In fraud detection, a missed fraud (false negative) carries a direct financial loss. A false alarm (false positive) costs analyst time — far cheaper. Optimising for recall at the cost of precision is the correct business trade-off, with the risk stratification layer serving as the triage mechanism.
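As a quick sanity check, the headline numbers above are internally consistent. Using only the counts reported in this section:

```python
# Confusion-matrix counts reported for the voting ensemble on the test set
tp = 318         # true positives: fraud cases caught
fn = 492 - tp    # false negatives: 492 actual frauds minus those caught
fp = 3516        # false positives: legitimate transactions flagged

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"recall={recall:.1%}, precision={precision:.1%}")  # recall=64.6%, precision=8.3%
```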
🛠️ Technical Methodology
Dataset — PaySim
- 6.3 million rows of simulated mobile money transactions (Lopez-Rojas et al., 2016 — CC0 Public Domain).
- 5 transaction types: CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER.
- Fraud rate: 0.13% — only TRANSFER and CASH-OUT transactions are ever fraudulent in PaySim.
Model Persistence & Versioning
- The full ensemble bundle (all three models + metadata) is serialised with Joblib to a single `.joblib` file.
- Every training run is tracked in MLflow (parameters, metrics, artefacts) using a local SQLite backend.
- The API supports hot-reloading via `POST /model/reload` — no downtime required for model updates.
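A minimal sketch of the single-bundle persistence described above, assuming the bundle is a plain dict of fitted models plus metadata. The key names and directory here are illustrative, not the project's actual schema:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(500, 10))

# Serialise the ensemble components plus metadata as one bundle
bundle = {
    "iso_forest": IsolationForest(random_state=0).fit(X),
    "pca": PCA(n_components=3).fit(X),
    "metadata": {"version": "v1.0.0", "n_features": 10},
}
path = os.path.join(tempfile.mkdtemp(), "fraud_ensemble.joblib")
joblib.dump(bundle, path)

# A hot-reload endpoint can simply re-run this load step against the same path
loaded = joblib.load(path)
print(loaded["metadata"]["version"])  # v1.0.0
```

Keeping preprocessing objects and models in one artefact means a reload can never pair a new model with stale preprocessing state.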
Lifespan Management (FastAPI)
```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("🚀 Starting Fraud Detection API...")
    load_model()  # loads bundle from disk at startup
    yield
    # graceful shutdown logic here
```

🎓 Lessons Learned
What Worked
- Ensemble over Single Model: No single unsupervised detector was consistently better across all fraud patterns. The voting approach gave the best coverage.
- Feature Engineering Over Raw Features: The three engineered balance features (`balance_diff_orig`, `balance_diff_dest`, `amount_vs_balance`) contributed more signal than the raw balances alone.
- Risk Stratification: Mapping a continuous anomaly score to four discrete risk levels (LOW → CRITICAL) made the output operationally useful for analysts without requiring them to interpret raw probabilities.
Challenges Overcome
- Unsupervised Threshold Calibration: Without labelled data at inference time, setting the fraud threshold required backtesting against the PaySim test set and manually tuning the contamination parameter.
- Probability Calibration: Isolation Forest returns decision function scores, not true probabilities. I mapped these via `prob = clip(0.5 - raw_score, 0, 1)` — a pragmatic approximation that produces monotonically consistent risk rankings.
- Batch Efficiency: The naive approach of looping `predict()` per transaction was ~40× slower than stacking all inputs into a single NumPy matrix and calling the ensemble once.
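The batch speed-up comes from plain NumPy vectorisation. A small sketch on synthetic data with an illustrative model, showing that one stacked call produces exactly the same results as a per-row loop:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
model = IsolationForest(random_state=1).fit(rng.normal(size=(1000, 10)))
batch = rng.normal(size=(500, 10))   # 500 transactions, the batch-endpoint limit

# Slow: one predict() call per transaction
slow = np.array([model.predict(row.reshape(1, -1))[0] for row in batch])

# Fast: stack all rows into one matrix and call predict once
fast = model.predict(batch)

print(np.array_equal(slow, fast))  # True: same labels, far fewer calls
```

The per-call overhead (input validation, tree traversal setup) dominates single-row scoring, which is where the reported ~40× gap comes from.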
🚀 Future Enhancements
- Supervised Fine-Tuning: Once a labelled dataset accumulates from analyst reviews, add an XGBoost layer trained on confirmed fraud labels.
- Real-Time Streaming: Replace the REST batch endpoint with a Kafka consumer for true sub-second event-driven scoring.
- Drift Detection: Monitor feature distributions over time and trigger automatic retraining when statistical drift exceeds a threshold.
- SHAP Integration: Add per-prediction feature attribution so analysts understand why a transaction was flagged, not just that it was.
- Docker + CI/CD: Containerise the service and add a GitHub Actions pipeline that retrains, evaluates, and deploys on new data automatically.
📚 Technical Stack Deep Dive
- FastAPI: Async REST framework with automatic OpenAPI docs, Pydantic validation, and lifespan management.
- Scikit-Learn: Isolation Forest, PCA, and the VotingClassifier ensemble wrapper.
- MLflow: Experiment tracking, model versioning, and artefact storage (SQLite backend for portability).
- Joblib: Efficient serialisation of the full ensemble bundle including preprocessing objects.
- Pydantic: Strict input validation with field constraints (`amount > 0`, `step >= 1`) — rejects malformed requests before they reach the model.
- NumPy: Vectorised feature construction and batch matrix assembly for low-latency inference.
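The validation behaviour can be sketched with a minimal Pydantic model. Field names follow the PaySim schema used throughout this write-up; the defaults are illustrative, not the project's actual definitions:

```python
from pydantic import BaseModel, Field, ValidationError

class TransactionIn(BaseModel):
    amount: float = Field(gt=0)    # must be strictly positive
    step: int = Field(ge=1)        # simulation time step, starts at 1
    type: str
    oldbalanceOrg: float = 0.0
    newbalanceOrig: float = 0.0
    oldbalanceDest: float = 0.0
    newbalanceDest: float = 0.0

# A well-formed transaction validates cleanly
ok = TransactionIn(amount=9000, step=2, type="TRANSFER", oldbalanceOrg=9000)
print(ok.amount)

# Malformed input is rejected before it can reach the model;
# FastAPI turns this into a 422 response automatically
try:
    TransactionIn(amount=-5, step=0, type="PAYMENT")
except ValidationError as e:
    print(len(e.errors()))  # 2 — one violation each for amount and step
```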