Visualization is an important skill for a data scientist. A good visualization communicates the insights from an analysis clearly, and it is also a good way to understand a dataset better. Our brains are wired to extract patterns and trends from visual data more easily than from reading text or numbers.
In this article, I will cover visualization from the basics using Python. Below are the steps we will follow:
- Step 1: Importing data
- Step 2: Basic visualization using Matplotlib
- Step 3: More advanced visualizations, still using Matplotlib
- Step 4: Building quick visualizations for data analysis using Seaborn
- Step 5: Building interactive charts
By the end of this journey, you will be equipped with everything required to build a visualization. Though we will not cover every visualization that can be built, you will learn the concepts behind building a chart, so it should be easy to build new charts that are not covered in this article.
The scripts and the data used in this article can also be found in the git repository here. All data is in the “Data” folder of that repository, and the scripts are in the folders ‘Day23, Day 24, and Day25’.
The first step is to read the required datasets. We can use pandas to read the data. Below is a simple command that reads data from a CSV file:
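A minimal sketch of reading a CSV with pandas. To keep it runnable on its own, it reads from an in-memory string; the column names and values are made up for illustration, and the file path in the comment is hypothetical.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the columns and values are made up.
csv_text = """date,customer_id,sales
2021-01-01,C001,120.5
2021-01-01,C002,80.0
2021-01-02,C001,45.25
"""

# With a real file, pass its path instead, e.g. pd.read_csv("Data/your_file.csv")
# (hypothetical path). parse_dates converts the column to datetime values.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.head())    # first rows
print(df.dtypes)    # column types, worth checking before plotting
```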
After reading the dataset, it is important to transform it into a shape that suits the visualization we want to build. For example, if we have sales details at the customer level and want a chart showing the day-wise sales trend, we first need to group the data and aggregate it at the day level, and then plot the result as a trend chart.
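The day-level aggregation described above could look like the following sketch. The DataFrame is made up, and the column names ("date", "customer_id", "sales") are assumptions rather than the article's actual dataset.

```python
import pandas as pd

# Made-up customer-level sales records (several rows per day).
sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03",
    ]),
    "customer_id": ["C001", "C002", "C001", "C003", "C002"],
    "sales": [120.0, 80.0, 45.0, 60.0, 30.0],
})

# Collapse to one row per day by summing sales across customers.
daily_sales = sales.groupby("date", as_index=False)["sales"].sum()

print(daily_sales)
```

The resulting `daily_sales` table has one row per day, which is the shape a trend (line) chart expects.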
Basic Visualization using Matplotlib
Let us start with some basic visualization. It is better to begin with ‘fig, ax = plt.subplots()’: ‘plt.subplots()’ returns a tuple containing a figure object and an axes object, which are assigned to the variables ‘fig’ and ‘ax’ respectively. You can draw a chart without it, but the figure object lets you change the figure itself, for example to resize the chart or save it to a file. Similarly, the ‘ax’ variable is used to draw the data and to set the axis labels and the title. Below is a simple example where I have passed the data as an array and plotted it directly.
In the above code, the required libraries are imported first, then ‘plt.subplots()’ is used to generate the figure and axes objects, and finally the data is passed directly as an array to the axes object to draw the chart. In the second chart, the axes variable ‘ax’ is also given labels for the x-axis and y-axis, and a title.
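A minimal sketch of such an example, using made-up values. It draws two charts: a bare one, and a second one with a custom size, axis labels and a title.

```python
import matplotlib.pyplot as plt

values = [3, 5, 4, 8, 6]  # made-up numbers

# First chart: pass the data straight to the axes object.
fig, ax = plt.subplots()
ax.plot(values)

# Second chart: same data, with a resized figure, axis labels and a title.
fig, ax = plt.subplots()
fig.set_size_inches(8, 4)           # resizing goes through the figure object
ax.plot(values)
ax.set_xlabel("Index")              # labels and title go through the axes object
ax.set_ylabel("Value")
ax.set_title("A simple line chart")

# fig.savefig("simple_line_chart.png") would write the chart to disk.
```

In a script, call `plt.show()` at the end to display the figures; in a notebook they render automatically.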
Now, let’s use some real data and learn to build interesting charts and customize them to make them more intuitive. As explained, in most real-life use cases the data needs some transformation before it can be charted. Below is an example where I have used the Netflix data, transformed to give the number of movies and TV shows per year. I have again used ‘plt.subplots()’, but have added a few more details to make the chart more intuitive and self-explanatory.
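The sketch below shows the idea on a tiny made-up table rather than the real Netflix file. The column names "type" and "release_year" are assumptions about the dataset's layout, so adjust them to match your data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up title-level records; "type" and "release_year" are assumed column names.
titles = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "release_year": [2018, 2018, 2019, 2019, 2019, 2020, 2020],
})

# One row per year, one column per type, values = number of titles.
counts = titles.groupby(["release_year", "type"]).size().unstack(fill_value=0)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(counts.index, counts["Movie"], marker="o", label="Movies")
ax.plot(counts.index, counts["TV Show"], marker="o", label="TV shows")
ax.set_xticks(list(counts.index))   # show whole years only
ax.set_xlabel("Release year")
ax.set_ylabel("Number of titles")
ax.set_title("Titles per year by type")
ax.legend()                         # uses the label= values above
```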
A few more customizations can be made to the above chart, such as adding a second y-axis (a dual-axis chart). In the above case there isn’t much difference between the number of movies and TV shows, so the data appears OK on a single axis. If there were a huge difference between them, the smaller series would be squashed flat and the chart would be hard to read. In those cases we can use a dual axis, so that the attribute with smaller values gets its own scale.
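A dual-axis chart can be sketched with Matplotlib's `twinx()`, which adds a second y-axis sharing the same x-axis. The numbers below are made up so that the two series differ by roughly two orders of magnitude.

```python
import matplotlib.pyplot as plt

years = [2017, 2018, 2019, 2020]
movies = [1200, 1500, 1400, 1700]   # made-up, large values
tv_shows = [12, 18, 25, 21]         # made-up, much smaller values

fig, ax_left = plt.subplots(figsize=(8, 4))
ax_left.plot(years, movies, color="tab:blue", marker="o")
ax_left.set_xlabel("Year")
ax_left.set_ylabel("Movies", color="tab:blue")
ax_left.set_xticks(years)

# twinx() creates a second y-axis that shares the same x-axis.
ax_right = ax_left.twinx()
ax_right.plot(years, tv_shows, color="tab:orange", marker="s")
ax_right.set_ylabel("TV shows", color="tab:orange")

ax_left.set_title("Movies and TV shows on separate y-axes")
```

Coloring each y-axis label like its line is one simple way to tell the reader which scale belongs to which series.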
We can also use a scatter plot to bring out any relationship between the variables we are plotting. It helps reveal the correlation between variables, that is, what happens to one attribute when the other increases or decreases.
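As an illustration, here is a scatter plot on synthetic data. The variable names ("ad_spend", "revenue") and the linear relationship between them are invented for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: revenue rises with ad spend, plus random noise.
rng = np.random.default_rng(seed=0)
ad_spend = rng.uniform(10, 100, size=50)
revenue = 3.0 * ad_spend + rng.normal(0, 20, size=50)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(ad_spend, revenue, alpha=0.7)
ax.set_xlabel("Ad spend")
ax.set_ylabel("Revenue")
ax.set_title("Revenue vs. ad spend")

# A correlation coefficient puts a number on what the plot shows.
corr = np.corrcoef(ad_spend, revenue)[0, 1]
print(f"Pearson correlation: {corr:.2f}")
```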
More advanced visualizations, still using Matplotlib
Once you are comfortable with the simple trend charts covered so far, you are ready to move on to slightly more advanced charts and features that let you customize your visualizations further.
Bar charts help us compare multiple values at the same time by plotting them side by side. There are different kinds of bar charts:
- Vertical Bar Chart
- Horizontal Bar Chart
- Stacked Bar Chart
Below is an example of a bar chart with a number of customizations added to the plot:
- Axis labels and a title are added
- Font sizes are set explicitly
- The figure size is set as well (the default chart would look much smaller and cluttered)
- A function is used to write the value on top of each bar, so that viewers can read the exact numbers
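The customizations listed above can be sketched as follows. The categories and values are made up, and `add_value_labels` is a hypothetical helper written for this example, not a Matplotlib function.

```python
import matplotlib.pyplot as plt


def add_value_labels(ax, bars, fontsize=11):
    """Write each bar's height just above it (hypothetical helper)."""
    for bar in bars:
        height = float(bar.get_height())
        ax.annotate(
            f"{height:g}",
            xy=(bar.get_x() + bar.get_width() / 2, height),  # top center of bar
            xytext=(0, 3),                                   # 3 points above it
            textcoords="offset points",
            ha="center",
            va="bottom",
            fontsize=fontsize,
        )


categories = ["A", "B", "C", "D"]   # made-up data
values = [23, 45, 12, 36]

fig, ax = plt.subplots(figsize=(8, 5))          # larger than the default size
bars = ax.bar(categories, values)
ax.set_xlabel("Category", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Count per category", fontsize=14)
add_value_labels(ax, bars)
```

On Matplotlib 3.4 or newer, `ax.bar_label(bars)` does the same job as the helper in one call.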
Horizontal and Stacked Bar Chart
Vertical bar charts are the most common, but horizontal bar charts are useful too, especially when the data labels are long and would be difficult to fit below a vertical bar. In a stacked bar chart, the bars within a category are stacked on top of one another. Below is an example that implements horizontal and stacked bar charts. The code also customizes the chart colors.
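A sketch of both variants on made-up data: `barh` draws the horizontal chart, and the `bottom` argument of `bar` stacks the second series on top of the first. The genre names and counts are invented.

```python
import matplotlib.pyplot as plt

# Made-up counts per genre; long labels are where horizontal bars help.
genres = ["International Movies", "Documentaries", "Stand-Up Comedy"]
movies = [30, 18, 12]
tv_shows = [14, 9, 4]

fig, (ax_h, ax_s) = plt.subplots(1, 2, figsize=(12, 4))

# Horizontal bars: categories go on the y-axis, so long labels stay readable.
ax_h.barh(genres, movies, color="teal")
ax_h.set_xlabel("Number of movies")
ax_h.set_title("Horizontal bar chart")

# Stacked bars: the second series starts where the first one ends.
ax_s.bar(genres, movies, color="teal", label="Movies")
ax_s.bar(genres, tv_shows, bottom=movies, color="orange", label="TV shows")
ax_s.set_ylabel("Number of titles")
ax_s.set_title("Stacked bar chart")
ax_s.legend()

fig.tight_layout()  # keep labels from overlapping between the two panels
```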
Pie and Donut Chart
Pie charts are useful for showing the proportion of different categories in the data. A pie chart can easily be turned into a donut chart by covering its center with a circle and re-aligning the text and values to suit the new shape. Below is a simple example where I have implemented a pie chart and then modified it into a donut chart.
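A minimal sketch of that approach with made-up proportions: the donut is the same pie with the percentage labels pushed outward and a white circle drawn over the center.

```python
import matplotlib.pyplot as plt

labels = ["Movies", "TV shows", "Documentaries"]   # made-up categories
sizes = [55, 30, 15]                               # made-up shares

fig, (ax_pie, ax_donut) = plt.subplots(1, 2, figsize=(10, 5))

# Plain pie chart with percentage labels inside the wedges.
ax_pie.pie(sizes, labels=labels, autopct="%1.0f%%", startangle=90)
ax_pie.set_title("Pie chart")

# Donut: move the percentages toward the rim, then cover the center.
ax_donut.pie(sizes, labels=labels, autopct="%1.0f%%", startangle=90,
             pctdistance=0.8)
ax_donut.add_artist(plt.Circle((0, 0), 0.6, color="white"))
ax_donut.set_title("Donut chart")
```

An alternative that avoids the extra circle is passing `wedgeprops={"width": 0.4}` to `pie`, which draws ring-shaped wedges directly.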
Why is it important to learn Matplotlib?
Matplotlib is a very important visualization library in Python because many other Python visualization libraries depend on it. Some of the benefits of learning Matplotlib are:
- It is easy to learn
- It is efficient
- It allows a lot of customization, which makes it possible to build almost any kind of visual
- Libraries like Seaborn are built on top of Matplotlib
I have covered only the most essential visualizations in Matplotlib, but the important point is that by practicing these charts you will have acquired the knowledge needed to build many more. Matplotlib supports a large number of visualizations; here is the link to the gallery of all supported charts.
Building quick visualizations for data analysis using Seaborn
We have covered a variety of visualizations using the Matplotlib library. You may have noticed that although Matplotlib offers a high degree of customization, it involves a lot of coding. That can be time-consuming, especially during exploratory analysis, when you want to make a few quick plots to understand the data better and reach decisions faster. That is exactly what the Seaborn library offers. Here are some benefits of using Seaborn:
- Its default themes are attractive
- Simple and quick to build visualizations especially for data analysis
- Its declarative API allows us to just focus on the key elements of the charts
There are a few downsides too: it doesn’t offer as much customization, and it can run into memory issues, especially with large datasets. Still, the benefits outweigh the disadvantages.
Visualizations with just one line of code
Below are some simple visualizations, each implemented with a single line of code using the Seaborn library.
As shown in the above snapshot, the visualizations are created with just a single line of code each, and they look quite presentable as well. The Seaborn library is widely used in the data analysis phase because we can build charts quickly and with minimal or no effort to make them presentable. Visualization is key in data analysis because it helps bring out patterns in the data, and Seaborn is well suited to that purpose.
Heatmaps are another interesting visualization, widely used on time-series data to bring out seasonality and other patterns in the dataset. However, to build a heatmap we first need to transform the data into a grid format, with one dimension on the rows and another on the columns. Below is sample code that transforms the data to suit the heatmap plot and then uses Seaborn to build the heatmap.
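One way to do that transformation is a pandas pivot table with years as rows and months as columns. The sketch below uses randomly generated daily sales, so the column names and values are assumptions, not the article's data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic daily sales for two full years.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(seed=1)
daily = pd.DataFrame({
    "date": dates,
    "sales": rng.integers(50, 150, size=len(dates)),
})

# Heatmaps need a grid: one row per year, one column per month.
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month
monthly = daily.pivot_table(index="year", columns="month",
                            values="sales", aggfunc="sum")

fig, ax = plt.subplots(figsize=(10, 3))
sns.heatmap(monthly, cmap="YlGnBu", ax=ax)
ax.set_title("Monthly sales by year")
```

With real seasonal data, recurring dark or light columns would point to months that are consistently strong or weak.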
Pair Plot: my favorite Seaborn feature
I consider the pair plot one of the best features of the Seaborn library. It compares each attribute in the dataset with every other attribute visually, again in a single line of code. Below is sample code to build pair plots. A pair plot may not be feasible when the dataset has a large number of columns, because the number of panels grows with the square of the number of columns. In those cases, the pair plot can be restricted to a specific set of attributes.
Building interactive charts
While working on data science projects, there will sometimes be a requirement to share a visualization with the business teams. Dashboarding tools are widely used for this purpose, but suppose you notice an interesting pattern during data analysis and would like to share it with a business user. If it is shared as an image, there is not much the business user can do with it. If it is shared as an interactive chart, the business user can look into the granular details by zooming in or out, or use other features to interact with the chart. Below is an example where we create an HTML file as output. The file includes the visualization, can be shared with any other user, and can simply be opened in a web browser.
If you are keen to learn about visualizations using Python, please check out my playlist below. It includes three videos, with a total tutorial length of just over one hour.