I’m giving a talk for the Kansas City R Users Group on how to get a preliminary impression of relationships between pairs of variables. Here is the R code and output that I will use.
Simple measures of association
There are several ways to describe bivariate relationships prior to data analysis. The methods fall largely into three groups: measures of the relationship between two continuous variables, between two categorical variables, and between a categorical variable and a continuous variable.
The best graphical summary of two continuous variables is a scatterplot. You should add a smoothing curve or spline to the graph to emphasize the general trend and any departures from linearity.
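A minimal sketch of such a scatterplot in base R follows. The `home` data frame here is a simulated stand-in, built only so the example runs on its own, and the choice of `lowess()` as the smoother is an assumption rather than the method used in the talk.

```r
# Hypothetical stand-in for the `home` data frame.
# The values are simulated for illustration and are NOT the real data.
set.seed(1)
n <- 100
sqft <- runif(n, 800, 3500)
home <- data.frame(
  sqft  = sqft,
  price = 50000 + 60 * sqft + rnorm(n, sd = 20000)
)

# Scatterplot of two continuous variables
plot(home$sqft, home$price, xlab = "sqft", ylab = "price")

# Overlay a lowess smoothing curve to show the general trend
smooth <- lowess(home$sqft, home$price)
lines(smooth, col = "red", lwd = 2)
```

If the smooth bends noticeably away from a straight line, that is a hint that a single correlation coefficient may understate the relationship.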
      price sqft  age
price   1.0  0.8 -0.2
sqft    0.8  1.0  0.0
age    -0.2  0.0  1.0
In a correlation matrix like this one, a correlation larger than 0.7 or smaller than -0.7 indicates a strong linear relationship. A correlation between 0.3 and 0.7 or between -0.7 and -0.3 indicates a weak linear relationship. A correlation between -0.3 and 0.3 indicates little or no linear relationship.
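A correlation matrix of this kind can be computed with `cor()`. The sketch below again uses a simulated stand-in for `home`, so its numbers will not match the matrix above. The helper `describe_r()` is a hypothetical name introduced here to apply the cutoffs, and treating values exactly at 0.3 or 0.7 as "weak" is an assumption, since the rule of thumb does not say which side the boundaries fall on.

```r
# Hypothetical stand-in for the `home` data frame (simulated, not the real data)
set.seed(2)
n <- 100
sqft <- runif(n, 800, 3500)
home <- data.frame(
  price = 50000 + 60 * sqft + rnorm(n, sd = 20000),
  sqft  = sqft,
  age   = runif(n, 0, 60)
)

# Pearson correlations among the continuous variables, rounded to one decimal
r <- round(cor(home[, c("price", "sqft", "age")]), 1)
print(r)

# Hypothetical helper: label a correlation using the 0.3 and 0.7 cutoffs
describe_r <- function(r) {
  ifelse(abs(r) > 0.7, "strong",
         ifelse(abs(r) >= 0.3, "weak", "little or none"))
}

# ifelse() keeps the matrix shape, so this prints a labelled matrix
print(describe_r(r))
```

Applied to the values shown earlier, `describe_r(0.8)` gives "strong", while `describe_r(-0.2)` and `describe_r(0.0)` both give "little or none".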
The best graphical summary of the relationship between a continuous variable and a categorical variable is a boxplot.
# One boxplot of price for each level of the categorical variable
boxplot(price ~ features, data = home)
boxplot(price ~ northeast, data = home)
boxplot(price ~ custom_build, data = home)
boxplot(price ~ corner_lot, data = home)
If your categorical variable is binary, you can also use a scatterplot. The binary variable goes on the y-axis, and a trend line is critical because the points themselves fall on only two horizontal lines.
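One way to draw this plot is sketched below, assuming the binary variable is coded 0/1. The data are a simulated stand-in for `home`, and the coding of `northeast` is an assumption. The trend line uses `lowess()` with `iter = 0`, because the default robustness iterations tend to behave poorly when the response takes only two values.

```r
# Hypothetical stand-in for the `home` data frame (simulated, not the real data).
# `northeast` is assumed to be coded 0/1.
set.seed(3)
n <- 100
price <- runif(n, 80000, 300000)
home <- data.frame(
  price     = price,
  northeast = rbinom(n, 1, plogis((price - 190000) / 50000))
)

# Binary variable on the y axis, continuous variable on the x axis
plot(home$price, home$northeast, xlab = "price", ylab = "northeast (0/1)")

# Trend line: local average of the 0/1 values, which roughly tracks
# the proportion of northeast homes at each price
trend <- lowess(home$price, home$northeast, iter = 0)
lines(trend, col = "red", lwd = 2)
```

The height of the trend line can be read as an approximate proportion, so a rising line suggests the category becomes more common as the continuous variable increases.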