Author: Zoi Karampatzaki (Data Analyst)
In my experience as a consultant, I have found that many people are not aware that the scatter plot exists. Even those who know it exists often do not know how to ask for one. With this article, I want to broaden the horizons of people who do not have a data-related background and give them the vocabulary to communicate with data analysts.
First of all, the key word to keep in mind is relationship: a scatter plot visualizes the relationship between two variables. For example, the height and weight of an individual are not independent of each other since, in general, the taller a person is, the more they weigh. If we plot this relationship, we get a set of dots that follow a pattern. This pattern can be linear, quadratic, or even more complicated. A scatter plot is usually part of exploratory analysis, where visualizations help us understand the data better before we proceed to statistical or machine learning methods.
The Scatter Chart in Power BI
Among the default visuals, one can find the so-called scatter chart and apply metrics to it directly. Unlike other environments, such as Python and R, where only two variables are needed (the x and y axes), in Power BI we also have the field “value”. In this field, we add the level at which the analysis is done. For instance, in a height-weight plot, the value will be the person ID. In a scatter plot that shows the relationship between sales and profit, the value will be a product ID, a sale ID, etc. This way, we can visualize our data at many levels, since Power BI aggregates automatically when needed. A legend can also be added to the graph, showing with different colors which dots belong to a specific group (gender, age group, etc.).
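To make the idea of an analysis level concrete, here is a minimal Python sketch of the aggregation described above. It assumes that the aggregation is a sum and that the data is a list of sale rows. The rows, the product IDs, and the helper name aggregate_by_key are all hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Hypothetical sale rows: (product_id, sales, profit). The values are made up.
rows = [
    ("P1", 100.0, 20.0),
    ("P1", 50.0, 5.0),
    ("P2", 200.0, 60.0),
    ("P3", 80.0, 10.0),
    ("P3", 20.0, 2.0),
]


def aggregate_by_key(records):
    """Sum sales and profit per key, giving one (x, y) point per key."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for key, sales, profit in records:
        totals[key][0] += sales
        totals[key][1] += profit
    return {key: (x, y) for key, (x, y) in totals.items()}


# One dot per product: five sale rows collapse into three points.
points = aggregate_by_key(rows)
for product_id, (sales, profit) in sorted(points.items()):
    print(product_id, sales, profit)
```

Choosing a different key (for example a sale ID instead of a product ID) would change how many dots appear, which is the role the “value” field plays in the chart.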
Consider an example dataset where we aim to visualize the relationship between the number of activities and the number of deals that a user closed. The x and y axis fields were filled with the respective variables, and the field “value” contained the User ID. On the right side, in the “Visualizations” column, if we click on the lens icon and turn on the “Trend line”, we obtain the trend line of the data points. In our case, the automatically created trend line is not a good fit. One can actually observe two main trends instead of one.
Apart from the legend, which shows how the data is distributed across a classification that already exists in the dataset, one can run automatic clustering. By clicking on the three dots in the upper right corner of the visual, one can choose “Automatically find clusters”. The program automatically creates a classification of the data and shows the labels as a new variable in the Fields section (right). When one drags and drops the new cluster variable into the “Legend” field, the clusters are shown in the graph.
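For readers curious about what a clustering step can look like, here is a minimal Python sketch. It uses plain k-means, which is a swapped-in technique for illustration and not necessarily the algorithm Power BI applies. The data points, the seeding rule, and the function name k_means are assumptions made for this example.

```python
import math

# Hypothetical (activities, deals) per user. The values are made up.
points = [
    (50, 4), (60, 5), (70, 6), (80, 7),
    (10, 5), (15, 8), (20, 10), (25, 13),
]


def k_means(data, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points, until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Deterministic start: seed with the users having the fewest and the most activities.
seeds = [min(points), max(points)]
labels, centroids = k_means(points, list(seeds))
print(labels)
print(centroids)
```

The resulting labels play the same role as the new cluster variable that is dropped into the “Legend” field. Note that the two axes are on different scales here, so the activities dominate the distance; in practice the variables are often standardized first.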
Finally, in order to get two trend lines instead of one, we click the lens icon again, but this time we turn off the option “combine series”. Now the trend lines are clearly more representative of the data. The visual shows that there are two categories of users: those in Cluster 1, who win fewer deals despite many activities, and those in Cluster 2, who, in contrast, close more deals without a high number of activities.
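The difference between one combined trend line and one trend line per cluster can be sketched in Python as follows. The sketch assumes that a trend line is an ordinary least-squares line. The data, the cluster labels, and the helper name fit_line are hypothetical and chosen only to mimic the pattern described above.

```python
# Hypothetical (activities, deals, cluster) per user. The values are made up.
users = [
    (50, 4, "Cluster 1"), (60, 5, "Cluster 1"),
    (70, 6, "Cluster 1"), (80, 7, "Cluster 1"),
    (10, 5, "Cluster 2"), (15, 8, "Cluster 2"),
    (20, 10, "Cluster 2"), (25, 13, "Cluster 2"),
]


def fit_line(pairs):
    """Least-squares line y = slope * x + intercept, plus R squared."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# One line through all users (the "combine series" case).
combined = fit_line([(x, y) for x, y, _ in users])

# One line per cluster.
per_cluster = {
    name: fit_line([(x, y) for x, y, c in users if c == name])
    for name in sorted({c for _, _, c in users})
}

print("combined:", combined)
for name, fit in per_cluster.items():
    print(name, fit)
```

With this made-up data, the combined line has a slightly negative slope and a low R squared, while each cluster on its own follows a clearly increasing line with a high R squared.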
To conclude, a scatter plot can be part of the analysis that precedes a linear regression between two variables. It visualizes the relationship between two metrics. Other factors, such as categorical variables, might play an important role in this relationship. Power BI provides an automatic clustering option that splits the dataset into groups, which can lead to trend lines that describe the data more accurately.