"A picture is worth a thousand words"A complex idea can be understood effectively with the help of visual representations. Exploratory Data Analysis (EDA) helps us to understand the nature of the data with the help of summary statistics and visualizations capturing the details which numbers can't.
In this post, let us explore
- Visualizing the data
- Summarizing the data
- Correlation matrix
Visualization
Depending upon the type of data, we can choose different types of graphs for visualization. I have listed some of the possible graphing options under different combinations of types of data:
- When both variables are continuous
- Example: Weight, Height. We can use scatter plots
- Distribution plots
- Kernel Density Estimation plots
- Joint plots
- Pair plots
- When one variable is categorical and the other is continuous
- Example: Place, Rainfall
- We can go for box plots
- Bar plots
- When both variables are categorical
- Cross-tabulation
- Correspondence analysis
- Heatmap
- Mosaic plots
mosaic plot |
Summarizing the data
- Use describe() option in pandas to summarize the data
- If the data set is only numerical, describe() will display summary statistics for all columns
- Even if all columns are categorical, describe() will display summary statistics for all columns
- But if both categorical and numerical columns are present, by default describe() will display summary statistics of only numerical columns. In that case, we can use describe(include='all')
Correlation Matrix
Correlation matrix provides the correlation coefficients among the variables. I prefer to have p-values along with correlation coefficients in the correlation matrix.
Following is the code from tozCSS answer on stackoverflow. This gives a correlation matrix along with correlation coefficients and p-value. I added only one more line of code to format correlation coefficient even it is not significant.
Summary
In this post, we have explored various visualization techniques, when to use which graph, how to get the summary statistics and correlation matrix.
If you have questions or suggestion, do share. I will be happy to respond.