The Problem
Financial fraud costs institutions billions annually. According to the Association of Certified Fraud Examiners, organizations lose an estimated 5% of revenue to fraud each year. The challenge is not just detecting fraud—it is detecting it accurately. A system that flags every tenth transaction as suspicious is useless in production. Banks need precision: catch the real fraud, let legitimate transactions flow.
When I set out to build a fraud detection system, I wanted something more than a textbook classifier. I wanted a pipeline that could handle the severe class imbalance inherent in fraud data (typically less than 0.2% of transactions are fraudulent), combine multiple detection strategies, and produce a single actionable risk score. The result is a hybrid system that achieved 93% precision with a 0.92 AUC on held-out test data.
Feature Engineering: The Real Work
Raw transaction data—amounts, timestamps, merchant categories—is not enough. The signal lives in the patterns. Feature engineering is where the real detective work happens, and it accounts for roughly 70% of the model's performance.
I engineered three categories of features:
Velocity Checks
How frequently is a card being used? A card that has three transactions in five minutes at three different merchants is suspicious. I computed rolling windows at multiple timescales:
import numpy as np
import pandas as pd

def compute_velocity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute transaction velocity at multiple time windows."""
    df = df.sort_values(['card_id', 'timestamp'])
    # Rolling windows only operate on numeric columns, so integer-encode merchants.
    df['_merchant_code'] = df['merchant_id'].astype('category').cat.codes
    for window in ['1H', '6H', '24H', '7D']:
        # Time-based windows anchor on the timestamp column; the sort above keeps
        # the groupby-rolling output aligned with the frame's row order.
        df[f'tx_count_{window}'] = (
            df.groupby('card_id')
            .rolling(window, on='timestamp')['_merchant_code']
            .count()
            .to_numpy()
        )
        # Rolling has no nunique(), so count distinct merchant codes with apply().
        df[f'unique_merchants_{window}'] = (
            df.groupby('card_id')
            .rolling(window, on='timestamp')['_merchant_code']
            .apply(lambda s: len(np.unique(s)), raw=True)
            .to_numpy()
        )
    return df.drop(columns='_merchant_code')
Amount Anomalies
Every cardholder has a spending profile. A $5,000 purchase from someone who typically spends $30-50 is a strong signal. Rather than using raw amounts, I computed z-scores relative to each cardholder's historical distribution:
def compute_amount_features(df: pd.DataFrame) -> pd.DataFrame:
    """Flag transactions that deviate from a cardholder's norm."""
    stats = (
        df.groupby('card_id')['amount']
        .agg(amount_mean_hist='mean', amount_std_hist='std')
        .reset_index()
    )
    df = df.merge(stats, on='card_id')
    # Floor the std (NaN for single-transaction cards) so z-scores stay finite.
    std = df['amount_std_hist'].fillna(0.0).clip(lower=1.0)
    df['amount_zscore'] = (df['amount'] - df['amount_mean_hist']) / std
    df['amount_ratio_to_avg'] = df['amount'] / df['amount_mean_hist'].clip(lower=0.01)
    return df
Temporal Patterns
Fraud follows time-of-day and day-of-week patterns. Transactions at 3 AM on a Tuesday from a cardholder who exclusively shops during business hours deserve extra scrutiny. I encoded cyclical time features using sine and cosine transforms to preserve the circular nature of time:
def compute_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Extract cyclical time features."""
    hour = df['timestamp'].dt.hour
    dow = df['timestamp'].dt.dayofweek
    df['hour_sin'] = np.sin(2 * np.pi * hour / 24)
    df['hour_cos'] = np.cos(2 * np.pi * hour / 24)
    df['dow_sin'] = np.sin(2 * np.pi * dow / 7)
    df['dow_cos'] = np.cos(2 * np.pi * dow / 7)
    # Flag hours that are unusual for this cardholder.
    card_hour_stats = (
        df.groupby('card_id')['hour_sin']
        .agg(hour_sin_usual='mean', hour_sin_spread='std')
        .reset_index()
    )
    df = df.merge(card_hour_stats, on='card_id')
    df['unusual_time'] = (
        np.abs(df['hour_sin'] - df['hour_sin_usual'])
        > 2 * df['hour_sin_spread'].fillna(0.0).clip(lower=0.1)
    ).astype(int)
    return df
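Chained together, the three functions turn a raw transaction log into a model-ready feature frame. Here is a quick usage sketch on a toy frame; the column names (card_id, timestamp, merchant_id, amount) are the ones assumed throughout, and the values are invented purely for illustration:

import pandas as pd

transactions = pd.DataFrame({
    'card_id': [101, 101, 101, 202],
    'timestamp': pd.to_datetime(['2024-03-01 09:15', '2024-03-01 09:17',
                                 '2024-03-01 09:18', '2024-03-02 14:30']),
    'merchant_id': ['m1', 'm2', 'm3', 'm9'],
    'amount': [42.50, 38.00, 4999.00, 25.00],
})

features = compute_time_features(
    compute_amount_features(
        compute_velocity_features(transactions)
    )
)
print(features[['tx_count_1H', 'unique_merchants_1H', 'amount_zscore', 'unusual_time']])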
Model Selection: Why a Hybrid Approach
I evaluated two fundamentally different approaches: Random Forest (supervised, learns from labeled fraud cases) and Isolation Forest (unsupervised, detects anomalies without labels). Each has distinct strengths.
Random Forest excels when you have labeled training data. It learns the specific patterns of known fraud types. However, it struggles with novel fraud patterns it has never seen. Isolation Forest, on the other hand, detects anything that looks "different" from normal transactions—making it effective against new attack vectors—but it produces more false positives because not every anomaly is fraud.
The solution is to combine both into a hybrid scoring system:
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

class HybridFraudScorer:
    """Combine supervised and unsupervised models into a single risk score."""

    def __init__(self, rf_weight: float = 0.6, iso_weight: float = 0.4):
        self.rf = RandomForestClassifier(
            n_estimators=200,
            max_depth=12,
            class_weight='balanced_subsample',
            random_state=42,
        )
        self.iso = IsolationForest(
            n_estimators=150,
            contamination=0.002,
            random_state=42,
        )
        self.rf_weight = rf_weight
        self.iso_weight = iso_weight

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'HybridFraudScorer':
        self.rf.fit(X, y)
        self.iso.fit(X[y == 0])  # Train only on legitimate transactions
        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        rf_proba = self.rf.predict_proba(X)[:, 1]
        iso_scores = -self.iso.score_samples(X)  # Higher = more anomalous
        # Min-max normalize the anomaly scores so both components share a 0-1 scale.
        iso_normalized = (iso_scores - iso_scores.min()) / (
            iso_scores.max() - iso_scores.min()
        )
        return self.rf_weight * rf_proba + self.iso_weight * iso_normalized
The Random Forest handles known fraud patterns with high confidence, while the Isolation Forest catches anomalous transactions that do not match any known pattern. The weighted combination produces a single risk score between 0 and 1.
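A quick usage sketch of the scorer on synthetic inputs (the feature matrix, labels, and shapes below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 12))   # stand-in for the engineered feature matrix
y_train = np.zeros(5000, dtype=int)
y_train[:10] = 1                        # ~0.2% fraud prevalence
X_new = rng.normal(size=(8, 12))        # incoming transactions to score

scorer = HybridFraudScorer(rf_weight=0.6, iso_weight=0.4).fit(X_train, y_train)
risk = scorer.score(X_new)              # one risk score in [0, 1] per transaction
review_queue = np.argsort(risk)[::-1]   # highest-risk transactions first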
Handling Class Imbalance
With fraud representing less than 0.2% of transactions, naive training would produce a model that simply predicts "not fraud" for everything and achieves 99.8% accuracy. Useless.
I addressed this with three complementary strategies: class_weight='balanced_subsample' in the Random Forest (which up-weights minority class samples), SMOTE (Synthetic Minority Over-sampling Technique) for generating synthetic fraud examples, and threshold tuning on the final risk score using precision-recall curves rather than the default 0.5 cutoff.
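For the SMOTE step, imbalanced-learn provides the standard implementation; here is a minimal sketch of what the oversampling might look like, reusing the synthetic X_train / y_train from the scorer sketch above (the library choice and the sampling ratio are my assumptions, not necessarily the project's exact setup):

from imblearn.over_sampling import SMOTE

# Resample only the training split: synthetic fraud rows must never leak
# into validation or test data.
smote = SMOTE(sampling_strategy=0.05, random_state=42)  # illustrative ratio
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(f'fraud share: {y_train.mean():.4f} -> {y_train_res.mean():.4f}')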
Results
On a held-out test set of 100,000 transactions (with realistic fraud prevalence):
- Precision: 93% — When the system flags a transaction, it is correct 93% of the time
- Recall: 87% — The system catches 87% of actual fraud
- AUC-ROC: 0.92 — Strong discriminative ability across all thresholds
- False positive rate: 0.08% — Only 8 in 10,000 legitimate transactions are incorrectly flagged
The hybrid approach outperformed either model alone. The Random Forest by itself achieved 0.89 AUC, and the Isolation Forest alone reached 0.81. The combination leverages complementary strengths.
Lessons Learned
Feature engineering dominates model selection. Switching from raw features to engineered velocity/amount/time features improved AUC from 0.74 to 0.89 with the same model. Switching models with the same features improved it by only 0.03.
Precision matters more than recall in production. A fraud analyst can investigate 50 flagged transactions per day. If half are false positives, you are wasting their time and eroding trust in the system. Optimizing for precision at a reasonable recall threshold is the right trade-off.
The unsupervised component is insurance. Fraud evolves. New attack patterns appear that supervised models have never seen. The Isolation Forest acts as a safety net, catching anomalies that do not match historical fraud but look structurally unusual.
Threshold tuning is a business decision, not a technical one. The optimal cutoff depends on the cost of false positives (analyst time, customer friction) versus false negatives (fraud losses). I exposed the threshold as a configurable parameter so that risk teams can adjust it based on their operational capacity.
The Class Imbalance Trap
Class imbalance came up briefly above, but it deserves its own section because it is the single most common mistake in fraud detection projects. In a typical transaction dataset, fraud represents less than 0.2% of records, so a model that blindly predicts "not fraud" for every single transaction achieves 99.8% accuracy. This is the accuracy paradox, and it renders standard accuracy metrics meaningless for imbalanced classification problems.
The real challenge is not achieving high accuracy—it is achieving high precision and recall simultaneously on the minority class. Precision tells you what fraction of your fraud alerts are actually fraud (critical for analyst workload). Recall tells you what fraction of actual fraud you are catching (critical for loss prevention). The F1 score balances both, and the precision-recall AUC is a far better evaluation metric than ROC-AUC for imbalanced datasets.
I used three techniques to combat imbalance: class_weight='balanced_subsample' in the Random Forest, SMOTE oversampling of the minority class, and threshold optimization using the precision-recall curve. The threshold that maximized F1 on the validation set was 0.38—well below the default 0.5—because the model's probability estimates are naturally compressed toward the majority class.
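The metric and threshold choices come down to a few lines with scikit-learn; a sketch, where y_val and risk_val stand for the validation-set labels and hybrid risk scores (names assumed, not taken from the project code):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

# PR-AUC penalizes false alarms on the rare class far more harshly than ROC-AUC.
print('ROC-AUC:', roc_auc_score(y_val, risk_val))
print('PR-AUC :', average_precision_score(y_val, risk_val))

# Sweep every cutoff implied by the validation scores and keep the F1-maximizing one.
precision, recall, thresholds = precision_recall_curve(y_val, risk_val)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best_threshold = thresholds[np.argmax(f1[:-1])]  # the final PR point has no threshold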
Why Temporal Splits Matter
Most tutorials split data randomly into training and test sets. For fraud detection, this is a subtle but critical mistake. Random splitting creates temporal leakage: the model trains on transactions from June and July, then "predicts" fraud in May. In production, you never have future data available at prediction time.
I used a strict temporal split: training data from months 1-8, validation from month 9, and test from months 10-12. This simulates the real deployment scenario where the model only sees past transactions and must predict future ones. The performance drop from random splitting to temporal splitting was significant—AUC dropped from 0.96 to 0.92—but the temporal results are honest and reflect what you would actually see in production.
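In code the split is nothing fancier than cutting on the timestamp column; a sketch assuming the engineered feature frame from earlier and a single year of data:

# Cut on time, never at random: the model must only ever see the past.
df = df.sort_values('timestamp')
month = df['timestamp'].dt.month

train_df = df[month <= 8]    # months 1-8
val_df   = df[month == 9]    # month 9
test_df  = df[month >= 10]   # months 10-12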
Temporal splitting also reveals concept drift. Fraud patterns evolve as attackers adapt to detection systems. A model trained on January data may perform well through March but degrade by June as new attack vectors emerge. Monitoring model performance over time and retraining on recent data is essential for maintaining production accuracy.
What Production Fraud Systems Need
Academic fraud detection models and production fraud systems have fundamentally different requirements. Building this project taught me what the gap looks like:
- Explainability: When an analyst reviews a flagged transaction, they need to know why the model flagged it. A probability score alone is not actionable. I integrated SHAP (SHapley Additive exPlanations) values to provide per-transaction feature attribution, so the top three reasons for each flag are surfaced alongside the risk score.
- Latency constraints: In production, fraud scoring happens in real-time as transactions are processed. The model must return a score in under 50ms. Random Forests are well-suited for this—prediction is just a series of tree traversals—but feature computation (especially rolling windows) requires careful engineering to avoid expensive database queries on every transaction.
- Drift detection: A model that worked last quarter may not work this quarter. Production systems need automated monitoring of input feature distributions and model performance metrics, with alerts when statistical drift exceeds a threshold (a minimal per-feature check is sketched after this list). Without drift detection, a model can silently degrade for months.
- Feedback loops: Analyst decisions (confirm fraud / dismiss alert) should feed back into the training pipeline. This creates a virtuous cycle where the model improves from operational experience. But it also creates a selection bias: the model only receives feedback on transactions it flagged, not the ones it missed.
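One lightweight way to implement the drift check mentioned above (a sketch under my own assumptions, not the project's monitoring code) is a per-feature two-sample Kolmogorov-Smirnov test between a reference window and the most recent scoring window:

import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(reference: pd.DataFrame, current: pd.DataFrame,
                     feature_cols: list, alpha: float = 0.01) -> list:
    """Return the features whose recent distribution differs from the reference."""
    flagged = []
    for col in feature_cols:
        _, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < alpha:  # distributions differ beyond what chance would allow
            flagged.append(col)
    return flagged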
Comparison to Published Benchmarks
The most widely used public benchmark for credit card fraud detection is the Kaggle Credit Card Fraud Detection dataset (284,807 transactions, 492 fraudulent). Published results on this dataset typically achieve AUC-ROC between 0.95 and 0.98, but these numbers are inflated by random splitting and PCA-preprocessed features that remove real-world noise.
On my own dataset with temporal splitting and raw engineered features, the 0.92 AUC is competitive with production-grade systems. Research papers from major financial institutions report AUC between 0.90 and 0.95 on real transaction data, with the top performers using deep learning ensembles and graph neural networks to model cardholder behavior as a network. My hybrid Random Forest + Isolation Forest approach lands in the middle of this range while maintaining interpretability—a trade-off that most production teams prefer over marginal accuracy gains from black-box models.