The Problem

Financial fraud costs institutions billions annually. According to the Association of Certified Fraud Examiners, organizations lose an estimated 5% of revenue to fraud each year. The challenge is not just detecting fraud—it is detecting it accurately. A system that flags every tenth transaction as suspicious is useless in production. Banks need precision: catch the real fraud, let legitimate transactions flow.

When I set out to build a fraud detection system, I wanted something more than a textbook classifier. I wanted a pipeline that could handle the severe class imbalance inherent in fraud data (typically less than 0.2% of transactions are fraudulent), combine multiple detection strategies, and produce a single actionable risk score. The result is a hybrid system that achieved 93% precision with a 0.92 AUC on held-out test data.

Feature Engineering: The Real Work

Raw transaction data—amounts, timestamps, merchant categories—is not enough. The signal lives in the patterns. Feature engineering is where the real detective work happens, and it accounts for roughly 70% of the model's performance.

I engineered three categories of features:

Velocity Checks

How frequently is a card being used? A card that has three transactions in five minutes at three different merchants is suspicious. I computed rolling windows at multiple timescales:

import pandas as pd

def compute_velocity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute transaction velocity at multiple time windows."""
    df = df.sort_values(['card_id', 'timestamp']).reset_index(drop=True)
    # Time-based rolling windows require a DatetimeIndex, so roll over
    # each card's transactions indexed by timestamp.
    by_card = df.set_index('timestamp').groupby('card_id')['merchant_id']

    for window in ['1H', '6H', '24H', '7D']:
        # count() gives the number of transactions in the trailing window.
        df[f'tx_count_{window}'] = by_card.rolling(window).count().to_numpy()
        # Rolling has no nunique(), so use apply(); this assumes
        # merchant_id is numeric (rolling.apply is numeric-only).
        df[f'unique_merchants_{window}'] = (
            by_card.rolling(window).apply(lambda s: s.nunique(), raw=False).to_numpy()
        )

    return df
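The per-card, time-based rolling is the subtle part of this function, so here is a minimal toy example of the pattern (card and merchant IDs are made up for illustration):

```python
import pandas as pd

tx = pd.DataFrame({
    'card_id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime([
        '2024-01-01 10:00', '2024-01-01 10:02', '2024-01-01 10:04',
        '2024-01-01 09:00', '2024-01-02 09:00',
    ]),
    'merchant_id': [1, 2, 3, 7, 7],
})

# Count each card's transactions in a trailing one-hour window.
counts = (
    tx.set_index('timestamp')
    .sort_index()
    .groupby('card_id')['merchant_id']
    .rolling('1h')
    .count()
)
print(counts.tolist())  # [1.0, 2.0, 3.0, 1.0, 1.0]
```

Card A's three transactions within four minutes push its one-hour count to 3, while card B's transactions a day apart each count as 1.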

Amount Anomalies

Every cardholder has a spending profile. A $5,000 purchase from someone who typically spends $30-50 is a strong signal. Rather than using raw amounts, I computed z-scores relative to each cardholder's historical distribution:

def compute_amount_features(df: pd.DataFrame) -> pd.DataFrame:
    """Flag transactions that deviate from a cardholder's norm."""
    stats = (
        df.groupby('card_id')['amount']
        .agg(['mean', 'std'])
        .rename(columns={'mean': 'amount_mean', 'std': 'amount_std'})
        .reset_index()  # merge needs card_id as a column, not the index
    )
    df = df.merge(stats, on='card_id')

    # Floor std at 1.0 (filling NaN for single-transaction cards)
    # to avoid divide-by-zero blowups.
    safe_std = df['amount_std'].fillna(1.0).clip(lower=1.0)
    df['amount_zscore'] = (df['amount'] - df['amount_mean']) / safe_std
    df['amount_ratio_to_avg'] = df['amount'] / df['amount_mean'].clip(lower=0.01)

    return df

Temporal Patterns

Fraud follows time-of-day and day-of-week patterns. Transactions at 3 AM on a Tuesday from a cardholder who exclusively shops during business hours deserve extra scrutiny. I encoded cyclical time features using sine and cosine transforms to preserve the circular nature of time:

import numpy as np

def compute_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Extract cyclical time features."""
    hour = df['timestamp'].dt.hour
    dow = df['timestamp'].dt.dayofweek

    df['hour_sin'] = np.sin(2 * np.pi * hour / 24)
    df['hour_cos'] = np.cos(2 * np.pi * hour / 24)
    df['dow_sin'] = np.sin(2 * np.pi * dow / 7)
    df['dow_cos'] = np.cos(2 * np.pi * dow / 7)

    # Flag unusual hours for this cardholder
    card_hour_stats = (
        df.groupby('card_id')['hour_sin']
        .agg(hour_sin_usual='mean', hour_sin_std='std')
        .reset_index()  # merge needs card_id as a column
    )
    df = df.merge(card_hour_stats, on='card_id')
    df['unusual_time'] = (
        np.abs(df['hour_sin'] - df['hour_sin_usual'])
        > 2 * df['hour_sin_std'].fillna(0.1).clip(lower=0.1)
    ).astype(int)

    return df
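To see why the sine/cosine encoding matters: 23:00 and midnight are adjacent in real time but 23 units apart as raw hour values. A quick check (the helper name is mine, not from the project) shows the encoded distance behaves correctly:

```python
import numpy as np

def hour_distance(h1: int, h2: int) -> float:
    """Euclidean distance between two hours in (sin, cos) space."""
    def encode(h):
        return np.array([np.sin(2 * np.pi * h / 24), np.cos(2 * np.pi * h / 24)])
    return float(np.linalg.norm(encode(h1) - encode(h2)))

print(hour_distance(23, 0))  # small: adjacent hours on the clock
print(hour_distance(12, 0))  # large: opposite sides of the clock
```

A model fed raw hours would treat 23:00 as maximally far from 00:00; the encoded features preserve the wrap-around.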

Model Selection: Why a Hybrid Approach

I evaluated two fundamentally different approaches: Random Forest (supervised, learns from labeled fraud cases) and Isolation Forest (unsupervised, detects anomalies without labels). Each has distinct strengths.

Random Forest excels when you have labeled training data. It learns the specific patterns of known fraud types. However, it struggles with novel fraud patterns it has never seen. Isolation Forest, on the other hand, detects anything that looks "different" from normal transactions—making it effective against new attack vectors—but it produces more false positives because not every anomaly is fraud.

The solution is to combine both into a hybrid scoring system:

import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

class HybridFraudScorer:
    """Combine supervised and unsupervised models into a single risk score."""

    def __init__(self, rf_weight: float = 0.6, iso_weight: float = 0.4):
        self.rf = RandomForestClassifier(
            n_estimators=200,
            max_depth=12,
            class_weight='balanced_subsample',
            random_state=42
        )
        self.iso = IsolationForest(
            n_estimators=150,
            contamination=0.002,
            random_state=42
        )
        self.rf_weight = rf_weight
        self.iso_weight = iso_weight

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'HybridFraudScorer':
        self.rf.fit(X, y)
        self.iso.fit(X[y == 0])  # Train only on legitimate transactions
        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        rf_proba = self.rf.predict_proba(X)[:, 1]
        iso_scores = -self.iso.score_samples(X)  # Higher = more anomalous
        # Guard against a zero range when all samples score identically
        span = iso_scores.max() - iso_scores.min()
        iso_normalized = (iso_scores - iso_scores.min()) / max(span, 1e-9)
        return self.rf_weight * rf_proba + self.iso_weight * iso_normalized

The Random Forest handles known fraud patterns with high confidence, while the Isolation Forest catches anomalous transactions that do not match any known pattern. The weighted combination produces a single risk score between 0 and 1.

Handling Class Imbalance

With fraud representing less than 0.2% of transactions, naive training would produce a model that simply predicts "not fraud" for everything and achieves 99.8% accuracy. Useless.

I addressed this with three complementary strategies: class_weight='balanced_subsample' in the Random Forest (which up-weights minority class samples), SMOTE (Synthetic Minority Over-sampling Technique) for generating synthetic fraud examples, and threshold tuning on the final risk score using precision-recall curves rather than the default 0.5 cutoff.
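SMOTE's core idea is simple enough to sketch: create a synthetic fraud example by interpolating between a real minority sample and one of its nearest minority neighbors. This is a minimal numpy illustration of that idea (the function name is mine; a real pipeline would use a library implementation such as imblearn's SMOTE):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5,
                 seed: int = 42) -> np.ndarray:
    """Generate synthetic minority samples by interpolating from a
    random minority point toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is each point itself
    rows = rng.integers(0, len(X_min), n_new)   # base samples
    cols = rng.integers(1, k + 1, n_new)        # which neighbor to use
    neighbors = X_min[idx[rows, cols]]
    lam = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[rows] + lam * (neighbors - X_min[rows])
```

Because every synthetic point lies on a segment between two real minority samples, the new examples stay inside the minority class's local geometry rather than being random noise.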

Results

On a held-out test set of 100,000 transactions with realistic fraud prevalence, the hybrid approach outperformed either model alone: the Random Forest by itself achieved 0.89 AUC, the Isolation Forest alone reached 0.81, and the combined scorer hit 0.92. The combination leverages complementary strengths.

Lessons Learned

Feature engineering dominates model selection. Switching from raw features to engineered velocity/amount/time features improved AUC from 0.74 to 0.89 with the same model. Switching models with the same features improved it by only 0.03.

Precision matters more than recall in production. A fraud analyst can investigate 50 flagged transactions per day. If half are false positives, you are wasting their time and eroding trust in the system. Optimizing for precision at a reasonable recall threshold is the right trade-off.

The unsupervised component is insurance. Fraud evolves. New attack patterns appear that supervised models have never seen. The Isolation Forest acts as a safety net, catching anomalies that do not match historical fraud but look structurally unusual.

Threshold tuning is a business decision, not a technical one. The optimal cutoff depends on the cost of false positives (analyst time, customer friction) versus false negatives (fraud losses). I exposed the threshold as a configurable parameter so that risk teams can adjust it based on their operational capacity.

The Class Imbalance Trap

This deserves its own section because it is the single most common mistake in fraud detection projects. In a typical transaction dataset, fraud represents less than 0.2% of records. A model that blindly predicts "not fraud" for every single transaction achieves 99.8% accuracy. This is the accuracy paradox, and it renders standard accuracy metrics meaningless for imbalanced classification problems.
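The accuracy paradox is easy to demonstrate numerically:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100,000 transactions at 0.2% fraud prevalence, and a "model"
# that predicts "not fraud" for everything.
y_true = np.array([1] * 200 + [0] * 99_800)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.998
print(recall_score(y_true, y_pred))    # 0.0 -- catches zero fraud
```

An impressive-looking 99.8% accuracy, and every single fraudulent transaction sails through.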

The real challenge is not achieving high accuracy—it is achieving high precision and recall simultaneously on the minority class. Precision tells you what fraction of your fraud alerts are actually fraud (critical for analyst workload). Recall tells you what fraction of actual fraud you are catching (critical for loss prevention). The F1 score balances both, and the precision-recall AUC is a far better evaluation metric than ROC-AUC for imbalanced datasets.

I used three techniques to combat imbalance: class_weight='balanced_subsample' in the Random Forest, SMOTE oversampling of the minority class, and threshold optimization using the precision-recall curve. The threshold that maximized F1 on the validation set was 0.38—well below the default 0.5—because the model's probability estimates are naturally compressed toward the majority class.
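The threshold search itself is a few lines with scikit-learn's precision-recall curve (a sketch; the function name is mine):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def tune_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Pick the score cutoff that maximizes F1 on validation data."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the last
    # point (precision=1, recall=0) so the arrays align with thresholds.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-9, None)
    return float(thresholds[np.argmax(f1)])
```

Run on validation scores, this returns the F1-optimal cutoff (0.38 in my case) instead of the default 0.5.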

Why Temporal Splits Matter

Most tutorials split data randomly into training and test sets. For fraud detection, this is a subtle but critical mistake. Random splitting creates temporal leakage: the model trains on transactions from June and July, then "predicts" fraud in May. In production, you never have future data available at prediction time.

I used a strict temporal split: training data from months 1-8, validation from month 9, and test from months 10-12. This simulates the real deployment scenario where the model only sees past transactions and must predict future ones. The performance drop from random splitting to temporal splitting was significant—AUC dropped from 0.96 to 0.92—but the temporal results are honest and reflect what you would actually see in production.
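The split itself is trivial once you commit to it; a sketch assuming a `timestamp` column and a single year of data:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame):
    """Chronological split: months 1-8 train, month 9 validation,
    months 10-12 test. Assumes all transactions fall in one year."""
    month = df['timestamp'].dt.month
    return df[month <= 8], df[month == 9], df[month >= 10]

tx = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2024-01-15', '2024-08-31', '2024-09-10', '2024-11-02']),
    'amount': [10.0, 25.0, 40.0, 99.0],
})
train, val, test = temporal_split(tx)
print(len(train), len(val), len(test))  # 2 1 1
```

The discipline is in never letting any later-month row influence feature statistics or model fitting for earlier months.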

Temporal splitting also reveals concept drift. Fraud patterns evolve as attackers adapt to detection systems. A model trained on January data may perform well through March but degrade by June as new attack vectors emerge. Monitoring model performance over time and retraining on recent data is essential for maintaining production accuracy.

What Production Fraud Systems Need

Academic fraud detection models and production fraud systems have fundamentally different requirements. Building this project taught me what that gap looks like: production systems need thresholds that risk teams can tune to their analyst capacity, monitoring for concept drift with regular retraining on recent data, and scores interpretable enough for analysts to act on.

Comparison to Published Benchmarks

The most widely used public benchmark for credit card fraud detection is the Kaggle Credit Card Fraud Detection dataset (284,807 transactions, 492 fraudulent). Published results on this dataset typically achieve AUC-ROC between 0.95 and 0.98, but these numbers are inflated by random splitting and PCA-preprocessed features that remove real-world noise.

On my own dataset with temporal splitting and raw engineered features, the 0.92 AUC is competitive with production-grade systems. Research papers from major financial institutions report AUC between 0.90 and 0.95 on real transaction data, with the top performers using deep learning ensembles and graph neural networks to model cardholder behavior as a network. My hybrid Random Forest + Isolation Forest approach lands in the middle of this range while maintaining interpretability—a trade-off that most production teams prefer over marginal accuracy gains from black-box models.
