Are Outliers Always a Problem?

No... sometimes we need to treat them with respect.

The dataset in question is a highly imbalanced, labelled credit card fraud detection dataset. Its five-number summary indicated the presence of extreme outliers.
Our initial approach was to take a hard line with the outliers and cap them using the 1.5*IQR rule. This gave us poor results.
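
For reference, the classic 1.5*IQR rule can be sketched as follows. This is a minimal illustration on synthetic heavy-tailed data, not the notebook's code; the helper name `iqr_clip` and the generated feature are assumptions made for the example.

```python
import numpy as np

def iqr_clip(values, k=1.5):
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper), (lower, upper)

# Synthetic heavy-tailed feature (illustrative only, not the fraud data)
rng = np.random.default_rng(0)
feature = rng.standard_t(df=2, size=10_000)

clipped, (lower, upper) = iqr_clip(feature)
# Share of points that fell outside the fences and were pulled back to them
share_clipped = np.mean((feature < lower) | (feature > upper))
print(f"fences: [{lower:.2f}, {upper:.2f}], share clipped: {share_clipped:.1%}")
```

Every point beyond a fence ends up with the same value, so any signal carried by how extreme a point is gets lost.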

We then wrote a custom data-prep function whose options could be tuned via grid search. The best recall came with the widest quantile range, one wide enough to contain most of the outliers.

Below is a code fragment with the custom data-prep function. For the complete code in context refer to my Kaggle notebook.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Defined elsewhere in the notebook; assumed here to be the column names V1 to V28
V1_28 = [f'V{i}' for i in range(1, 29)]

def outlier_limiter_compressor(Xdf, amtx='log', v1_28x='custom', qrange=(0.10, 0.90)):
    '''Return a transformed copy of the input dataframe.

    amtx:   transform for `Amount`: 'log' or 'minmax'.
    v1_28x: transform for `V1` to `V28`: 'custom', 'standard' or 'robust'.
    qrange: (low, high) quantiles used by the 'custom' and 'robust' options.
    '''
    cXdf = Xdf.copy()

    if v1_28x == 'custom':
        for col in V1_28:     # apply outlier limits to each column
            qlo, qhi = cXdf[col].quantile([qrange[0], qrange[1]]).values
            lowr, uppr = qlo - (qhi - qlo) * 1.5, qhi + (qhi - qlo) * 1.5
            # Hard-limit outliers to 1.5x the inter-quantile range beyond each quantile
            cXdf[col] = cXdf[col].clip(lower=lowr, upper=uppr)
    elif v1_28x == 'standard':
        scaler = StandardScaler()
        cXdf[V1_28] = scaler.fit_transform(cXdf[V1_28])
    elif v1_28x == 'robust':
        # RobustScaler expects the quantile range in percent
        scaler = RobustScaler(quantile_range=(qrange[0] * 100, qrange[1] * 100))
        cXdf[V1_28] = scaler.fit_transform(cXdf[V1_28])
    else:
        raise ValueError("Invalid option for v1_28x: Use 'custom', 'standard' or 'robust'")

    # Now transform the Amount column
    if amtx == 'log':
        # log10 for values above 1; smaller values are left unchanged, which avoids log(0)
        cXdf['Amount'] = cXdf['Amount'].map(lambda x: np.log10(x) if x > 1 else x)
    elif amtx == 'minmax':
        # Or try the minmax scaler
        mmscaler = MinMaxScaler()
        cXdf['Amount'] = mmscaler.fit_transform(cXdf[['Amount']]).ravel()
    else:
        raise ValueError("Invalid option for amtx: Use 'log' or 'minmax'")
    return cXdf
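
One way a data-prep function like this could be made tunable via grid search is to wrap it in scikit-learn's `FunctionTransformer` and search over its keyword arguments. The sketch below is an assumption about the wiring, not the notebook's actual pipeline: `clip_by_quantile_range` is a simplified, hypothetical stand-in for the function above, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def clip_by_quantile_range(X, qrange=(0.10, 0.90)):
    """Hypothetical stand-in: clip each column to 1.5x the
    inter-quantile range beyond the chosen quantiles."""
    X = pd.DataFrame(X)
    qlo, qhi = X.quantile(qrange[0]), X.quantile(qrange[1])
    spread = (qhi - qlo) * 1.5
    return X.clip(lower=qlo - spread, upper=qhi + spread, axis=1)

# Synthetic imbalanced data (illustrative only): positives sit in the tail of V1
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.05).astype(int)
X = pd.DataFrame(rng.standard_normal((n, 3)), columns=["V1", "V2", "V3"])
X.loc[y == 1, "V1"] += 6.0

pipe = Pipeline([
    ("prep", FunctionTransformer(clip_by_quantile_range)),
    ("clf", RandomForestClassifier(n_estimators=25, class_weight="balanced",
                                   random_state=0)),
])

# Each candidate is a full kw_args dict passed to the prep function
param_grid = {
    "prep__kw_args": [{"qrange": q} for q in [(0.25, 0.75), (0.10, 0.90), (0.01, 0.99)]],
}

search = GridSearchCV(pipe, param_grid, scoring="recall", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that `FunctionTransformer` is stateless, so in this sketch the clipping limits are recomputed on whatever data is passed in, including validation folds. A transformer that learns the limits in `fit` would avoid that.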

Takeaway

We achieved a recall of over 80% using a tuned data-prep pipeline (built on the custom function above) together with a class-weighted random forest on this heavily imbalanced dataset (about 0.17% positive cases). One important lesson is that, for a heavily imbalanced dataset, outliers need to be treated with care: they cannot be removed automatically, and their effect needs to be studied and mitigated (say, by compressing them to a log scale) to suit the model. Perhaps the outliers hold the key to the positive cases.
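
The difference between capping and compressing can be seen on a handful of made-up amounts. This is only an illustration with invented values and an arbitrary cap, not data from the notebook.

```python
import numpy as np

# Made-up transaction amounts, all greater than 1
amounts = np.array([5.0, 20.0, 80.0, 2_500.0, 25_000.0])

# Hard limit at an arbitrary cap: the two largest values become identical
clipped = np.clip(amounts, None, 100.0)

# Log compression: the range shrinks but the ordering is preserved
compressed = np.log10(amounts)

print(clipped)
print(compressed.round(2))
```

After capping, the model can no longer tell the two largest amounts apart, whereas the log scale keeps them distinct while reducing their leverage.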