Python Data Analysis Cookbook


Evaluating relations between variables with ANOVA

Analysis of variance (ANOVA) is a statistical data analysis method invented by statistician Ronald Fisher. This method partitions the data of a continuous variable by the values of one or more corresponding categorical variables and analyzes the variance within and between the resulting groups. ANOVA is a form of linear modeling. With one categorical variable, we speak of one-way ANOVA. In this recipe, we use two categorical variables, so we have two-way ANOVA. In two-way ANOVA, we create a contingency table, that is, a table containing counts for all combinations of the two categorical variables (we will see a contingency table example soon). The linear model is then given by the equation:

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$$

This is an additive model where $\mu_{ij}$ is the mean of the continuous variable corresponding to one cell of the contingency table, $\mu$ is the mean for the whole data set, $\alpha_i$ is the contribution of the first categorical variable, $\beta_j$ is the contribution of the second categorical variable, and $\gamma_{ij}$ is a cross-term (the interaction between the two variables). We will apply this model to weather data.
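To make the decomposition concrete, here is a minimal sketch in plain Python. The 2x2 table of cell means is invented for illustration, and the effects are computed under sum-to-zero constraints for a balanced design; that convention is an assumption, since the text does not say which constraints are used.

```python
# Hypothetical cell means mu_ij for a balanced 2x2 design (invented numbers).
# Rows index the first categorical variable (i), columns the second (j).
cell_means = [[2.0, 4.0],
              [3.0, 9.0]]

n_rows = len(cell_means)
n_cols = len(cell_means[0])

# Grand mean mu: the average over all cells (valid for a balanced design).
mu = sum(sum(row) for row in cell_means) / (n_rows * n_cols)

# Main effects: row and column means relative to the grand mean.
alpha = [sum(row) / n_cols - mu for row in cell_means]
beta = [sum(cell_means[i][j] for i in range(n_rows)) / n_rows - mu
        for j in range(n_cols)]

# Cross-term: the part of each cell mean the main effects do not explain.
gamma = [[cell_means[i][j] - mu - alpha[i] - beta[j]
          for j in range(n_cols)]
         for i in range(n_rows)]

# The four pieces add back up to every cell mean.
for i in range(n_rows):
    for j in range(n_cols):
        rebuilt = mu + alpha[i] + beta[j] + gamma[i][j]
        assert abs(rebuilt - cell_means[i][j]) < 1e-12

print('mu    =', mu)
print('alpha =', alpha)
print('beta  =', beta)
print('gamma =', gamma)
```

If every entry of the cross-term were zero, the two variables would contribute independently and the cell means would be fully described by the main effects alone.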

How to do it...

The following steps apply two-way ANOVA with wind speed as the continuous variable, rain as a binary variable, and wind direction as a categorical variable:

The imports are as follows:

from statsmodels.formula.api import ols
import dautil as dl
from statsmodels.stats.anova import anova_lm
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import HTML

Load the data and fit the model with statsmodels:

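# Load the weather data and drop rows with missing values
df = dl.data.Weather.load().dropna()
# Turn the rain amounts into a binary flag (rain / no rain)
df['RAIN'] = df['RAIN'] > 0
# C(...) tells the formula parser to treat a column as categorical
formula = 'WIND_SPEED ~ C(RAIN) + C(WIND_DIR)'
lm = ols(formula, df).fit()
# Collect the ANOVA table in an HTML report
hb = dl.HTMLBuilder()
hb.h1('ANOVA Applied to Weather Data')
hb.h2('ANOVA results')
hb.add_df(anova_lm(lm), index=True)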
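The fit above depends on the dautil weather data. As a rough, self-contained sketch of the same kind of fit, the following uses synthetic data; every value, effect size, and the four wind-direction labels are invented for illustration. It also shows one possible variation: the recipe's formula uses `+`, which fits main effects only, whereas `*` in the formula syntax adds the interaction term corresponding to the cross-term in the model equation.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the weather data (all numbers are invented).
rain = rng.random(n) < 0.3
wind_dir = rng.choice(['N', 'E', 'S', 'W'], size=n)
# Assumed effects: rainy days and westerly winds are somewhat windier.
wind_speed = (4.0 + 1.5 * rain + 1.0 * (wind_dir == 'W')
              + rng.normal(scale=1.0, size=n))

df = pd.DataFrame({'WIND_SPEED': wind_speed,
                   'RAIN': rain,
                   'WIND_DIR': wind_dir})

# Main effects only, mirroring the recipe's formula.
additive = ols('WIND_SPEED ~ C(RAIN) + C(WIND_DIR)', data=df).fit()

# '*' expands to both main effects plus their interaction.
full = ols('WIND_SPEED ~ C(RAIN) * C(WIND_DIR)', data=df).fit()

# anova_lm returns a DataFrame with one row per term plus the residual.
table = anova_lm(full)
print(table)
```

Each row of the resulting table reports degrees of freedom, sums of squares, an F statistic, and a p-value for one term, so the interaction row gives a direct view of whether the cross-term matters for the data at hand.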

Display a truncated contingency table and visualize the data with Seaborn:

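# Replace the wind direction values with direction categories
df['WIND_DIR'] = dl.data.Weather.categorize_wind_dir(df)
hb.h2('Truncated Contingency table')
# Count observations per (RAIN, WIND_DIR) combination; show the first 3 rows
hb.add_df(df.groupby([df['RAIN'], df['WIND_DIR']]).count().head(3),
          index=True)

# Plot the mean wind speed per wind direction, split by rain
sns.pointplot(y='WIND_SPEED', x='WIND_DIR',
              hue='RAIN', data=df[['WIND_SPEED', 'RAIN', 'WIND_DIR']])
HTML(hb.html)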

Refer to the following screenshot for the end result (see the anova.ipynb file in this book's code bundle):
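The contingency table built in the steps above can also be sketched with `pandas.crosstab`, which lays out the counts as a two-way grid. The tiny data set below is invented for illustration, and the four direction labels are an assumption rather than the categories dautil produces.

```python
import pandas as pd

# Tiny invented sample of rain flags and wind-direction categories.
obs = pd.DataFrame({
    'RAIN': [True, False, False, True, False, True, False, False],
    'WIND_DIR': ['N', 'N', 'E', 'S', 'S', 'S', 'W', 'E'],
})

# Rows: RAIN levels; columns: WIND_DIR levels; cells: observation counts.
counts = pd.crosstab(obs['RAIN'], obs['WIND_DIR'])
print(counts)

# The same counts in long form, one row per observed combination.
long_counts = obs.groupby(['RAIN', 'WIND_DIR']).size()
print(long_counts)
```

The grid form includes combinations that never occur as zeros, whereas the long form from `groupby` lists only the combinations present in the data.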


See also