Non-parametric One-way ANOVA

Correlating a categorical predictor with a continuous response variable

Regression datasets often contain a mix of categorical and continuous predictor variables. When the number of categorical variables is large, how do you pick the ones that are relevant to the regression (i.e., correlated with the response variable)?

The Ames Housing dataset is a property dataset containing 79 features, of which 43 are categorical. The task is to predict the sale price of houses using regression. As part of the feature selection step, we will use one-way ANOVA to rank the categorical features so that we can pick the top ones.

There are two ways to do this:

The visual approach

In this approach, we plot boxplots showing the distribution of SalePrice across the values of various categorical variables.

[Boxplots: distribution of SalePrice across the values of several categorical variables]

To see the boxplots in context, go to my Kaggle notebook here.

From the boxplots, we can easily see that Paved Drive and Sale Type (among others) have a clear influence on SalePrice.
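If you want to reproduce such a plot, a minimal sketch with seaborn would look like the following (housingdf_dc is the dataframe name used in the code fragment later in this post; 'Paved Drive' is just one example feature):

import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: housingdf_dc holds the Ames data with 'Paved Drive' and 'SalePrice' columns.
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=housingdf_dc, x='Paved Drive', y='SalePrice', ax=ax)
ax.set_title('SalePrice by Paved Drive')
plt.show()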

The visual approach works when there are only a few categorical variables. When the number of categorical variables (also called factors) grows beyond a handful, we need a quantitative approach to rank their influence on the target variable.

The quantitative approach

Traditional ANOVA has strict requirements: the response variable must be Gaussian within each group, with equal variances across groups. Our dataset violates these conditions.
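One quick way to check this (a sketch, not part of the original notebook) is to run Levene's test for equal variances and a normality test on the group-wise SalePrice samples of a given feature:

import scipy.stats as stats

# Assumption: housingdf_dc is the Ames dataframe used in the fragment below.
groups = [g['SalePrice'].values for _, g in housingdf_dc.groupby('Paved Drive')]

# Levene's test: a small p-value means the equal-variance assumption does not hold.
_, p_levene = stats.levene(*groups)

# Shapiro-Wilk per group: small p-values mean the samples are not Gaussian.
p_normal = [stats.shapiro(g).pvalue for g in groups]
print(p_levene, p_normal)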

Hence, we will use the Kruskal-Wallis test, which is the non-parametric version of one-way ANOVA.
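For reference, the H statistic reported below is computed from ranks rather than raw values (ignoring the tie correction that scipy applies):

H = 12 / (N(N+1)) * Σ (R_i² / n_i) - 3(N+1)

where N is the total number of observations, n_i is the size of group i, and R_i is the sum of the ranks in group i. A larger H means the groups' rank distributions differ more.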

We will follow these steps:

  • Group the dataframe by the values of each categorical variable.
  • Extract the response variable column from each group.
  • Use these columns to run the Kruskal-Wallis test.
  • Create a dataframe sorted by p-value, then by H statistic, and pick the top n.

The following is a code fragment from the same notebook mentioned above.

import pandas as pd
import scipy.stats as stats

response_var = 'SalePrice'  # or the log of 'SalePrice'
CATVAL, DF = 0, 1  # C-style enum. See note below.
res_list = []  # result list of tuples: (feature, H statistic, p-value)

for feat in cat_feat_names_fin:
    # Split the main dataframe into one group per value of this categorical feature.
    # Structure of group_list: list of tuples - first element is the category value,
    # second element is the corresponding sub-dataframe.
    # Note: use the enum values above to index into each tuple.
    group_list = list(housingdf_dc.groupby(feat))

    # Collect the response variable column of each group; reset the list per feature.
    arg_list = []
    for grp in group_list:  # grp is a tuple: (category value, sub-dataframe)
        arg_list.append(grp[DF][response_var].values)

    # Apply the Kruskal-Wallis test (the non-parametric one-way ANOVA).
    h, p = stats.kruskal(*arg_list)
    res_list.append((feat, h, p))

fvalue_df = pd.DataFrame(res_list, columns=['feat', 'h', 'p'])
# We now have a dataframe of H statistics and p-values; we want the n features
# with the strongest association with the response variable.
fvalue_df.sort_values(['p', 'h'], ascending=[True, False], inplace=True)
fvalue_df.reset_index(drop=True, inplace=True)  # the new index doubles as a rank.
# If two categorical features are related (as shown by a chi-square test), we can
# keep the one with the lower index, as it is more strongly associated with the
# response variable.
fvalue_df

In the outer loop, we iterate over the list of categorical variables, splitting the main dataframe into sub-frames, one per category value. In the inner loop, we extract the response variable vector from each sub-frame and append it to the argument list for the test. We then call stats.kruskal to obtain the H statistic and p-value for that feature and append the result to a list, which we finally convert into a dataframe sorted by p-value and H statistic.
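To turn the sorted dataframe into an actual selection, something along the following lines works. This is a sketch: the cutoff of 10 is arbitrary, and 'Garage Qual'/'Garage Cond' are just an illustrative pair for the chi-square relatedness check mentioned in the code comments.

import pandas as pd
import scipy.stats as stats

# Pick the top n features by rank (n = 10 is an arbitrary illustrative cutoff).
top_feats = fvalue_df.head(10)['feat'].tolist()

# Chi-square check for a pair of candidate features: if they are dependent,
# keep only the higher-ranked (lower-index) one.
ctab = pd.crosstab(housingdf_dc['Garage Qual'], housingdf_dc['Garage Cond'])
chi2, p, dof, expected = stats.chi2_contingency(ctab)
if p < 0.05:  # dependent: drop the lower-ranked feature
    top_feats = [f for f in top_feats if f != 'Garage Cond']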

We then get a table of categorical features in decreasing order of importance, as seen below (the columns are the feature name, the Kruskal-Wallis H statistic, and the p-value).

feat h p
0 Sale Condition 11862.088786 0.000000e+00
1 Sale Type 11649.484168 0.000000e+00
2 Paved Drive 11426.564666 0.000000e+00
3 Garage Cond 11268.378841 0.000000e+00
4 Garage Qual 11166.915197 0.000000e+00
5 Garage Finish 11030.495476 0.000000e+00
6 Garage Type 10328.708545 0.000000e+00
7 Functional 9674.236518 0.000000e+00
8 Kitchen Qual 9645.993093 0.000000e+00
9 Electrical 8721.048294 0.000000e+00
10 Central Air 8576.530553 0.000000e+00
11 Heating QC 8441.378149 0.000000e+00
12 Heating 7935.587928 0.000000e+00
13 BsmtFin Type 2 7924.419589 0.000000e+00
14 BsmtFin Type 1 7901.301287 0.000000e+00
15 Bsmt Exposure 7387.120701 0.000000e+00
16 Bsmt Cond 7124.662257 0.000000e+00
17 Bsmt Qual 7061.629472 0.000000e+00
18 Foundation 6096.974304 0.000000e+00
19 Exter Cond 5375.322697 0.000000e+00
20 Exter Qual 5344.786094 0.000000e+00
21 Mas Vnr Type 4407.873986 0.000000e+00
22 Exterior 2nd 4044.730460 0.000000e+00
23 Exterior 1st 3605.698606 0.000000e+00
24 Roof Matl 3166.997822 0.000000e+00
25 Roof Style 3150.723404 0.000000e+00
26 House Style 3072.309979 0.000000e+00
27 Bldg Type 2840.675901 0.000000e+00
28 Condition 2 2764.800185 0.000000e+00
29 Condition 1 2740.046612 0.000000e+00
30 Neighborhood 2630.599196 0.000000e+00
31 Lot Config 1406.872181 9.047556e-271
32 Land Slope 1414.790250 4.264197e-270
33 Land Contour 1358.608290 5.005876e-267
34 Utilities 1361.113269 4.387615e-265
35 Lot Shape 1298.090210 1.578868e-257
36 MS Zoning 1047.959037 2.291721e-209
37 Street 1055.446473 3.057046e-209
38 MS SubClass 676.652485 2.472914e-135

The list throws up some surprising and not-so-surprising insights!