Correlation on the “mtcars” Dataset
So, have y'all ever wondered about that “Advanced Engineering Mathematics” course you had to take, and why in the world you would ever need it? Don’t you worry! I promise it is of great use, and here is one reason why.
A major part of analyzing data to solve a business problem relies on techniques like correlation, regression, and hypothesis testing. Heard of them? Yes, your fundamentals in linear algebra, statistics, and vectors are all in use here, but to apply them to huge volumes of data we implement these techniques in programming languages built for that purpose. So let us discuss how exactly these concepts are used.
I will take the simple concept of correlation as an example here. Correlation measures how strongly two variables move together, and it is summarized by a coefficient between -1 and 1. By understanding how closely each variable tracks the other, we can get an idea of how one is likely to change when the other changes. Correlation can be either positive or negative. It is positive when one variable increases as the other increases, and negative when an increase in one variable goes with a decrease in the other. I will now show you how the correlation of two variables can be found in R.
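To make the idea concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand and compares it with R's built-in cor(). The vectors x and y are made-up numbers for illustration only; they do not come from any dataset used in this post.

```r
# Two small made-up vectors, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's r: the summed product of deviations from the mean,
# scaled by the square root of the product of the summed squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function should give the same value
r_builtin <- cor(x, y)

print(r_manual)
print(r_builtin)

# Flipping the sign of y flips the sign of the correlation,
# which is what a negative correlation looks like
r_negative <- cor(x, -y)
print(r_negative)
```

Since y mostly rises as x rises, the coefficient comes out positive; negating y turns the same relationship into a negative one of equal strength.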
Several public datasets ship with R and can be loaded directly from RStudio with a single command. I will use the “mtcars” dataset, and we’ll learn how to apply multiple methods and algorithms to similar public datasets.
Before starting your code, you will have to install and load some packages to access the functions that you need.
Packages are installed with install.packages() and loaded with library(); “corrplot” is the package that we need for plotting the correlations.
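A minimal sketch of that step could look like this. The requireNamespace() check and the explicit CRAN mirror URL are my own additions (assumptions, not part of the original code) so that the snippet also runs outside an interactive RStudio session.

```r
# Install corrplot from CRAN only if it is not already installed.
# The mirror URL is an assumption; any CRAN mirror should work.
if (!requireNamespace("corrplot", quietly = TRUE)) {
  install.packages("corrplot", repos = "https://cloud.r-project.org")
}

# Load the package so that corrplot() is available in this session
library(corrplot)
```

Installing is needed only once per machine, while library() has to be called in every new R session.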
mtcars is a dataset about automobiles and various aspects of their design and performance. The data() function loads datasets that ship with R and its installed packages, while the head() function returns the first few rows of the dataset.
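Loading and peeking at the data might look like the sketch below. The call to str() is an extra I added to show the column types; it is not mentioned in the text above.

```r
# Load the built-in mtcars dataset into the workspace
data(mtcars)

# Show the first six rows
head(mtcars)

# Optional extra: list every column with its type and first values
str(mtcars)
```

All the columns in mtcars are numeric, which is convenient because a correlation matrix can then be computed on the whole data frame at once.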
corrplot() is a function used for creating a correlation plot in R. It offers several display methods, such as circle, pie, and color. I chose number here because it does two tasks in one go: it draws the plot and prints the value of each correlation coefficient in the plot itself.
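Here is a sketch of how the plot can be produced. I am assuming the correlation matrix is built with cor(mtcars) and stored in a variable named c, as described in this post; other setups are possible.

```r
library(corrplot)
data(mtcars)

# Correlation matrix of all 11 numeric columns (Pearson by default).
# The name c follows this post; R still finds the c() function when
# it is called, but a longer name is safer in real projects.
c <- cor(mtcars)

# Draw the plot with the coefficients printed as numbers
corrplot(c, method = "number")
```

Swapping method = "number" for "circle", "pie", or "color" changes only how each cell is drawn, not the underlying values.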
Interpreting the output of the analysis is just as important as producing it.
From the correlation plot, we can see that the variables wt and disp are strongly positively correlated (about 0.89), while wt and mpg are also strongly correlated, but negatively (about -0.87). The attributes disp, cyl, hp, and mpg are the ones most strongly correlated with the other variables when compared to the rest.
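If reading the values off a plot feels error-prone, one option is to rank the variable pairs by the strength of their correlation. This is a sketch of my own, not part of the original analysis, and the names corr_matrix and pair_table are hypothetical.

```r
data(mtcars)
corr_matrix <- cor(mtcars)

# Keep each pair only once by blanking the diagonal and lower triangle
corr_matrix[lower.tri(corr_matrix, diag = TRUE)] <- NA

# Reshape to long form: one row per pair of variables
pair_table <- as.data.frame(as.table(corr_matrix), stringsAsFactors = FALSE)
names(pair_table) <- c("var1", "var2", "r")
pair_table <- pair_table[!is.na(pair_table$r), ]

# Sort so the strongest relationships (positive or negative) come first
pair_table <- pair_table[order(-abs(pair_table$r)), ]
head(pair_table, 5)
```

Sorting by the absolute value treats a strong negative correlation as just as interesting as a strong positive one.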
The variable c passed to corrplot() contains the correlation matrix of the data.
To get the correlation between just two variables, we use the “cor” function on those two columns.
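For example, the two pairs discussed above can be checked directly. The variable names r_wt_mpg, r_wt_disp, and rho_wt_mpg are hypothetical, and the Spearman call is an optional extra I added to show the method argument.

```r
data(mtcars)

# Pearson correlation between weight and miles per gallon
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
print(r_wt_mpg)

# Pearson correlation between weight and displacement
r_wt_disp <- cor(mtcars$wt, mtcars$disp)
print(r_wt_disp)

# Optional: a rank-based (Spearman) correlation for the first pair
rho_wt_mpg <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
print(rho_wt_mpg)
```

The first value should be negative and the second positive, matching what the correlation plot shows for these pairs.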