
Building a Churn Prediction Model from Zero

How I took a Customer Success team from zero predictive capability to proactive intervention with a deployed ML pipeline.

The Problem

The Customer Success team had zero predictive capability. None. CS reps were entirely reactive, learning a customer was at risk only after it was too late to do anything about it. No models, no scores, no early warning system of any kind.

Under private equity ownership, ARR is the metric that drives valuation multiples. Every churned customer is a direct hit. This was not a data science experiment. It was protecting the number that determines what the company is worth.

The Data Pipeline

I built a Snowflake table that combined two event types:

  • Renewals from the CRM (won renewal opportunities)
  • Churns from customer account records (date of customer loss)

I filtered to events from 2022 onward, then generated five timeframes for each event: at event, 1 month before, 3 months before, 6 months before, and 1 year before. Each timeframe was joined to the closest customer health score by date using a ranked join.
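The actual join lived in Snowflake SQL, but the same "closest score at or before each timeframe" logic can be sketched in pandas with merge_asof (a stand-in for the ranked join; all account IDs, dates, and scores below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the timeframe table: one row per
# (account, timeframe snapshot date) generated from each event.
events = pd.DataFrame({
    "account_id": ["A", "A", "B"],
    "snapshot_date": pd.to_datetime(["2023-06-01", "2023-03-01", "2023-06-01"]),
})

# Hypothetical health score history per account
scores = pd.DataFrame({
    "account_id": ["A", "A", "B"],
    "score_date": pd.to_datetime(["2023-05-20", "2023-02-15", "2023-05-30"]),
    "health_score": [72, 81, 55],
})

# merge_asof requires both frames sorted on the join key
events = events.sort_values("snapshot_date")
scores = scores.sort_values("score_date")

# For each snapshot, take the closest score at or before that date
joined = pd.merge_asof(
    events,
    scores,
    left_on="snapshot_date",
    right_on="score_date",
    by="account_id",        # only match scores from the same account
    direction="backward",   # nearest earlier score, never a future one
)
print(joined[["account_id", "snapshot_date", "health_score"]])
```

In SQL this is typically a ROW_NUMBER() over the date difference, keeping rank 1 per timeframe; merge_asof expresses the same idea in one call.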

The interesting part was not capturing a single health score. It was structuring the same metric at different lead times, letting me answer not just "can we predict churn" but "how far in advance?" That second question is the one the business actually cares about.

Model Selection

I tested five model types across all timeframes:

  • Logistic Regression
  • Random Forest
  • Gradient Boosting
  • XGBoost
  • LightGBM

Tuning was done with GridSearchCV, train/test splits, and class weighting to handle imbalance (via sklearn's compute_class_weight).
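A minimal sketch of that tuning setup on synthetic data (the grids, features, and sample sizes here are placeholders, not the production values). Logistic Regression takes class weights directly, while Gradient Boosting takes them as per-sample weights at fit time:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data standing in for the renewal/churn table
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Class weights via sklearn's compute_class_weight, as in the pipeline
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# Logistic Regression accepts class_weight as a constructor argument
lr = GridSearchCV(
    LogisticRegression(class_weight=class_weight, max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
lr.fit(X_train, y_train)

# Gradient Boosting has no class_weight parameter, so pass
# equivalent per-sample weights through GridSearchCV.fit
sample_weight = np.array([class_weight[c] for c in y_train])
gb = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=5,
)
gb.fit(X_train, y_train, sample_weight=sample_weight)
print(lr.best_params_, gb.best_params_)
```

Scoring the grid search on ROC AUC rather than accuracy keeps the tuning itself honest about the imbalance, for the reasons covered in the next section.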

Why Not Just Accuracy?

Accuracy lies when classes are imbalanced. If 90% of customers renew, a model that always predicts "no churn" scores 90% accuracy while being completely useless. So I evaluated on two metrics that actually matter:

  • AUC (discrimination): Can the model rank high-risk customers above low-risk ones?
  • Brier score (calibration): When the model says 40% churn probability, do roughly 40% actually churn?

Both matter because the CS team was using probability estimates to prioritize outreach. Those numbers had to be trustworthy, not just directionally correct.
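The accuracy trap, and why both metrics are needed, can be demonstrated in a few lines on toy data (the churn rate and probability ranges below are illustrative, not the real book of business):

```python
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score

rng = np.random.default_rng(7)

# Toy ground truth: ~10% churners (1), ~90% renewals (0)
y_true = (rng.random(2000) < 0.10).astype(int)

# A degenerate model that always predicts "no churn"
constant = np.zeros(len(y_true))
print("accuracy:", accuracy_score(y_true, constant))  # high, and useless
print("AUC:", roc_auc_score(y_true, constant))        # 0.5: zero ranking power

# A simulated informative model: churners tend to get higher probabilities
p_hat = np.where(
    y_true == 1,
    rng.uniform(0.3, 0.9, len(y_true)),  # churners scored high
    rng.uniform(0.0, 0.4, len(y_true)),  # renewals scored low
)
print("AUC:", roc_auc_score(y_true, p_hat))
print("Brier:", brier_score_loss(y_true, p_hat))
```

The constant model's Brier score alone would look deceptively decent (roughly the base rate), which is exactly why calibration and discrimination were checked together rather than relying on either one.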

Results

Timeframe          Best Model            AUC     Brier
At event           Gradient Boosting     ~0.92   0.078
1 month before     Gradient Boosting     ~0.89   0.071
3 months before    Logistic Regression   0.775   0.210
6 months before    Insufficient data (36 samples, 6 churns)
12 months before   Insufficient data

Key Decisions

Gradient Boosting won short-term. It handled the non-linear relationship between health scores and churn better than Logistic Regression, consistently outperforming it on both AUC and Brier score wherever there was sufficient training data.

Logistic Regression won at 3 months because Gradient Boosting was overfitting on the smaller sample size. I chose the right model for each use case, not a one-size-fits-all approach.

I refused to deploy 6/12-month models. With only 36 samples and 6 churns at the 6-month window, I recommended not deploying despite organizational appetite for longer-range prediction. I set a threshold of 150–200 samples with 30–40 churns and built a data collection plan. Deploying a model with that little data would erode trust when it inevitably produced bad predictions.

Deployment

The output lived in a Tableau dashboard with risk tiers and recommended actions. Not descriptive. Not diagnostic. Prescriptive: specific accounts, specific next steps, updated daily. The CS team used it every morning for over a year. This was not a model gathering dust in a notebook.

"Matt presented your findings clearly and confidently, answered follow up questions, and helped derive real impact on saving at-risk customers!"

— Direct Manager

What I'd Do Next

Expand the feature set beyond just the health score: product usage data, support ticket volume and sentiment (I built a separate NLP pipeline for ticket classification), contract value, number of integrations, time since last login.

Continue collecting data at 6- and 12-month timeframes. Explore time-series approaches that model the trajectory of the health score rather than just point-in-time snapshots.

Stack

Snowflake (data assembly) → Python (modeling: sklearn, GridSearchCV) → Tableau (visualization and delivery)

I built every layer: the SQL for data assembly, the Python for modeling, the Tableau dashboard for delivery. Zero predictive capability to prescriptive analytics with recommended actions per account, end to end.