
Evaluating relations between variables with ANOVA
Analysis of variance (ANOVA) is a statistical data analysis method invented by statistician Ronald Fisher. The method partitions the values of a continuous variable according to the values of one or more corresponding categorical variables in order to analyze variance. ANOVA is a form of linear modeling. If we model with one categorical variable, we speak of one-way ANOVA. In this recipe, we use two categorical variables, so we have two-way ANOVA. In two-way ANOVA, we create a contingency table, that is, a table containing counts for all combinations of the two categorical variables (we will see a contingency table example soon). The linear model is then given by the equation:

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$$

This is an additive model where $\mu_{ij}$ is the mean of the continuous variable corresponding to one cell of the contingency table, $\mu$ is the mean for the whole data set, $\alpha_i$ is the contribution of the first categorical variable, $\beta_j$ is the contribution of the second categorical variable, and $\gamma_{ij}$ is a cross-term (the interaction between the two). We will apply this model to weather data.
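To make the decomposition concrete, the following minimal sketch splits a small table of cell means into a grand mean, row effects, column effects, and a cross-term. It assumes a balanced design (equal counts in every cell), so that the unweighted mean of the cell means coincides with the mean of the whole data set. The cell means are made-up numbers and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical cell means mu_ij: rows are levels of the first factor,
# columns are levels of the second factor (values are made up).
cell_means = np.array([[3.0, 4.0, 5.0],
                       [5.0, 7.0, 12.0]])

mu = cell_means.mean()                    # grand mean (balanced design assumed)
alpha = cell_means.mean(axis=1) - mu      # contribution of the first factor
beta = cell_means.mean(axis=0) - mu       # contribution of the second factor
# Whatever the two main effects do not explain is the cross-term.
gamma = cell_means - mu - alpha[:, None] - beta[None, :]

# Putting the four pieces back together recovers every cell mean.
reconstructed = mu + alpha[:, None] + beta[None, :] + gamma
print(mu, alpha, beta, gamma, sep='\n')
```

With these definitions, the effects sum to zero over each index, and the cross-term vanishes exactly when the cell means are fully explained by the two main effects.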
How to do it...
The following steps apply two-way ANOVA with wind speed as the continuous variable, rain as a binary variable, and wind direction as a categorical variable:
- The imports are as follows:
    from statsmodels.formula.api import ols
    import dautil as dl
    from statsmodels.stats.anova import anova_lm
    import seaborn as sns
    import matplotlib.pyplot as plt
    from IPython.display import HTML
- Load the data and fit the model with statsmodels:
    df = dl.data.Weather.load().dropna()
    df['RAIN'] = df['RAIN'] > 0  # turn rain into a binary variable
    formula = 'WIND_SPEED ~ C(RAIN) + C(WIND_DIR)'  # C() marks a categorical variable
    lm = ols(formula, df).fit()
    hb = dl.HTMLBuilder()
    hb.h1('ANOVA Applied to Weather Data')
    hb.h2('ANOVA results')
    hb.add_df(anova_lm(lm), index=True)
- Display a truncated contingency table and visualize the data with Seaborn:
    df['WIND_DIR'] = dl.data.Weather.categorize_wind_dir(df)
    hb.h2('Truncated Contingency table')
    hb.add_df(df.groupby([df['RAIN'], df['WIND_DIR']]).count().head(3), index=True)  # counts per combination
    sns.pointplot(y='WIND_SPEED', x='WIND_DIR', hue='RAIN', data=df[['WIND_SPEED', 'RAIN', 'WIND_DIR']])
    HTML(hb.html)
Refer to the following screenshot for the end result (see the anova.ipynb file in this book's code bundle):
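The recipe relies on the dautil package and the weather data that ship with it. As a rough, self-contained sketch of the same workflow, the code below fits a two-way ANOVA to synthetic data using only NumPy, pandas, and statsmodels. The data values, effect sizes, and wind direction labels are assumptions invented for illustration and are not the recipe's data. Note that the sketch uses `*` in the formula, which adds the cross-term to the two main effects, whereas the recipe's formula uses `+` and therefore fits main effects only.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the weather data (made-up values, not real measurements).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    'RAIN': rng.random(n) < 0.4,                      # binary factor
    'WIND_DIR': rng.choice(['N', 'E', 'S', 'W'], n),  # categorical factor
})
# Wind speed = grand mean + one contribution per factor + noise.
dir_effect = df['WIND_DIR'].map({'N': 0.0, 'E': 0.5, 'S': 1.0, 'W': 1.5})
df['WIND_SPEED'] = 4.0 + 1.0 * df['RAIN'] + dir_effect + rng.normal(0, 1, n)

# Contingency table: counts for every combination of the two factors.
counts = pd.crosstab(df['RAIN'], df['WIND_DIR'])

# '*' expands to both main effects plus the RAIN:WIND_DIR cross-term.
lm = ols('WIND_SPEED ~ C(RAIN) * C(WIND_DIR)', df).fit()
table = anova_lm(lm)

print(counts)
print(table)
```

The ANOVA table has one row per model term plus a residual row. Because the synthetic data are generated without an interaction, the cross-term row is expected to explain little variance compared with the two main effects.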

See also
- The Wikipedia page for two-way ANOVA at https://en.wikipedia.org/wiki/Two-way_analysis_of_variance (retrieved August 2015)
- The Wikipedia page about the contingency table at https://en.wikipedia.org/wiki/Contingency_table (retrieved August 2015)