The subject dataset is a highly imbalanced, labelled credit card fraud detection
dataset. Its five-number summary indicated the presence of extreme outliers.
Our initial approach was to take a hard line on the outliers
and limit them using the 1.5*IQR rule. This gave poor results.
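For readers unfamiliar with the rule, here is a minimal sketch of 1.5*IQR limiting on a single column. The helper name `clip_iqr` and the toy values are illustrative assumptions, not code from the notebook.

```python
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy data (made up for illustration): one extreme value among small ones
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
clipped = clip_iqr(values)
print(clipped.tolist())
```

The extreme value is pulled down to the upper fence while the other values pass through unchanged, which is exactly the "hard" behaviour that can erase the signal carried by rare fraud cases.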
We then wrote a custom data-prep function whose outlier handling could be tuned
via grid search. Recall was best with the widest quantile range, the one that
kept most of the outliers within its limits.
Below is a code fragment with the custom data-prep function. For the complete
code in context, refer to my Kaggle notebook.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Assumption: V1_28 is not defined in this fragment; taken here to be the
# list of feature columns V1..V28
V1_28 = [f'V{i}' for i in range(1, 29)]

def outlier_limiter_compressor(Xdf, amtx='log', v1_28x='custom', qrange=(0.10, 0.90)):
    '''Input dataframe for transformation. 2 options to transform `Amount` and 3 to transform `V1 to V28`.'''
    cXdf = Xdf.copy()  # work on a copy so the caller's dataframe is untouched
    if v1_28x == 'custom':
        for col in V1_28:  # apply outlier limits to each column
            qlo, qhi = cXdf[col].quantile([qrange[0], qrange[1]]).values
            # Hard limit outliers using the 1.5 inter-quantile rule
            lowr, uppr = qlo - (qhi - qlo) * 1.5, qhi + (qhi - qlo) * 1.5
            cXdf[col] = cXdf[col].clip(lower=lowr, upper=uppr)
    elif v1_28x == 'standard':
        scaler = StandardScaler()
        cXdf[V1_28] = scaler.fit_transform(cXdf[V1_28])
    elif v1_28x == 'robust':
        scaler = RobustScaler(quantile_range=(qrange[0] * 100, qrange[1] * 100))
        cXdf[V1_28] = scaler.fit_transform(cXdf[V1_28])
    else:
        raise ValueError("Invalid option for v1_28x: Use 'custom','standard' or 'robust'")
    # Now transform the Amount column to log
    if amtx == 'log':
        # Values of 1 or less are left unchanged
        cXdf['Amount'] = cXdf['Amount'].map(lambda x: np.log10(x) if x > 1 else x)
    elif amtx == 'minmax':
        # Or try the minmax scaler
        mmscaler = MinMaxScaler()
        cXdf['Amount'] = mmscaler.fit_transform(cXdf['Amount'].values.reshape(-1, 1)).ravel()
    else:
        raise ValueError("Invalid option for amtx: Use 'log' or 'minmax'")
    return cXdf
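The sketch below shows one way a quantile range like `qrange` could be exposed to a grid search. It is not the notebook's code: `quantile_clip` is a simplified, hypothetical stand-in for the function above, the data is synthetic, and logistic regression is only a lightweight placeholder classifier. Note that this stateless version recomputes the quantiles on whatever data it is given.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def quantile_clip(X, qrange=(0.10, 0.90)):
    """Hypothetical stand-in: clip each column to 1.5x its inter-quantile range."""
    X = np.asarray(X, dtype=float)
    qlo, qhi = np.quantile(X, qrange, axis=0)  # per-column quantiles
    spread = qhi - qlo
    return np.clip(X, qlo - 1.5 * spread, qhi + 1.5 * spread)


# Synthetic imbalanced data (about 5% positives), for illustration only
X, y = make_classification(n_samples=2000, n_features=6, weights=[0.95], random_state=0)

pipe = Pipeline([
    ("prep", FunctionTransformer(quantile_clip)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Candidate quantile ranges, from narrow to wide
qranges = [(0.25, 0.75), (0.10, 0.90), (0.01, 0.99)]
param_grid = {"prep__kw_args": [{"qrange": q} for q in qranges]}

search = GridSearchCV(pipe, param_grid, scoring="recall", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Passing the range through `kw_args` keeps the data-prep step inside the pipeline, so the search scores each candidate range by cross-validated recall rather than by eye.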
Takeaway
We achieved a recall above 80% using a tuned data-prep pipeline (built on our
custom function above) along with a class-weighted random forest ensemble for this
heavily imbalanced dataset (about 0.17% positive cases, i.e. a fraction of 0.0017).
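As a rough illustration of the modelling side, the sketch below fits a class-weighted random forest and reports recall. The synthetic data, split, and hyperparameters are assumptions chosen for brevity; they are not the notebook's settings and will not reproduce its score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 2% positives), for illustration only
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency,
# so errors on the rare positive class cost more during training
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

recall = recall_score(y_test, clf.predict(X_test))
print(f"Recall on the positive class: {recall:.3f}")
```

Stratifying the split keeps the positive rate similar in the training and test sets, which matters when positives are this rare.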
One important lesson is that, for a heavily imbalanced dataset, outliers need to
be treated with care: they cannot be removed automatically, but their effect
needs to be studied and mitigated (say, by compressing them to a log scale) to suit
the model. Perhaps the outliers hold the key to the positive cases.