Our goal for the first scatter plot is to compare the endorsements a mod receives with the total number of times that mod is downloaded. How are the two measures related to each other? My initial assumption is that they would be heavily correlated, with high total downloads corresponding to high endorsements, but how related will they be and how much variation will there be? Another interesting question is: Which mods stand out within these measures? There are sure to be a few outliers, data points that do not fall within the same area as the majority of points.
Learning Objectives
- Create a scatter plot.
- Use Columns and Rows to add dimensions/measures to a worksheet.
- Format scatter plot points.
- Format chart axes.
What happens if we forget to switch from tags+ to mods? We will be creating a point in the plot for each mod. In mods, one record corresponds to one mod, so this will be a straightforward process. In tags+, many mods have multiple records, one for each tag, so the default aggregation would add together the endorsements and downloads across those records, inflating each mod's values in proportion to the number of tags it uses. Changing from sum to average would correct the inflation, although it adds an extra step. It would not help with mods that have no tags at all, however, because those mods do not appear in the tags+ dataset. The only solution for this issue (that does not involve creating a new dataset) is to use the mods dataset.
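To make the inflation concrete, here is a minimal Python sketch of the same aggregation done outside Tableau. The two small lists are made-up stand-ins for the mods and tags+ datasets, and the field names (`mod_id`, `tag`, `endorsements`) and the `aggregate` helper are assumptions for illustration, not the real column names.

```python
# Hypothetical miniature versions of the two datasets (all values are made up).
mods = [
    {"mod_id": 1, "endorsements": 100},
    {"mod_id": 2, "endorsements": 40},
    {"mod_id": 3, "endorsements": 10},  # this mod has no tags
]

# tags+ style: one record per (mod, tag) pair. Mod 1 appears three times,
# and mod 3, which has no tags, does not appear at all.
tags_plus = [
    {"mod_id": 1, "tag": "tag-a", "endorsements": 100},
    {"mod_id": 1, "tag": "tag-b", "endorsements": 100},
    {"mod_id": 1, "tag": "tag-c", "endorsements": 100},
    {"mod_id": 2, "tag": "tag-a", "endorsements": 40},
]


def aggregate(records, how):
    """Group records by mod_id and combine endorsements with 'sum' or 'mean'."""
    groups = {}
    for record in records:
        groups.setdefault(record["mod_id"], []).append(record["endorsements"])
    if how == "sum":
        return {mod_id: sum(values) for mod_id, values in groups.items()}
    return {mod_id: sum(values) / len(values) for mod_id, values in groups.items()}


sum_from_tags = aggregate(tags_plus, "sum")   # mod 1 is tripled
avg_from_tags = aggregate(tags_plus, "mean")  # mod 1 is correct, mod 3 is missing
sum_from_mods = aggregate(mods, "sum")        # one record per mod, nothing to fix

print(sum_from_tags)
print(avg_from_tags)
print(sum_from_mods)
```

In this sketch the average repairs the value for mod 1, but no choice of aggregation can bring back mod 3, which is why we switch to the mods dataset.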
Scatter plots rely more heavily on Rows and Columns than our previous charts did. However, it is not necessary to know in advance how we want the data arranged. We will demonstrate this by adding both our measures to the Rows field.
This is the resulting chart:
Unfortunately, we only have one point on our chart. This is because Tableau does not know how to group our data. By default, it is treating everything in the dataset as one group, and therefore one point, taking the sum of all endorsements and the sum of all total downloads from the dataset. If we want it to do something different, we will have to tell Tableau how to group the data. Do we want a point for each category? For each country? For each mod? Since we do want a point for each mod, we need some kind of unique identifier so that no mods will be accidentally grouped together. This is another situation where Mod ID is extremely useful. In a pinch we could perhaps use the mod names, but we do not have a guarantee that these are always different.
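The grouping behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not of how Tableau works internally. The records, the mod names, and the `group_points` helper are all hypothetical.

```python
# Hypothetical records; two different mods happen to share a name.
records = [
    {"mod_id": 1, "name": "Better Torches", "endorsements": 100, "downloads": 5000},
    {"mod_id": 2, "name": "Better Torches", "endorsements": 40, "downloads": 1500},
    {"mod_id": 3, "name": "Sky Lanterns", "endorsements": 10, "downloads": 700},
]


def group_points(rows, key=None):
    """Sum downloads and endorsements per group; key=None means one group."""
    points = {}
    for row in rows:
        group = row[key] if key is not None else "all"
        downloads, endorsements = points.get(group, (0, 0))
        points[group] = (downloads + row["downloads"],
                         endorsements + row["endorsements"])
    return points


one_point = group_points(records)            # no grouping: a single point
by_mod_id = group_points(records, "mod_id")  # one point per mod
by_name = group_points(records, "name")      # two mods merged into one point

print(one_point)
print(by_mod_id)
print(by_name)
```

Grouping by name silently merges the two mods that share one, which is the risk we avoid by using Mod ID.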
Now we should see a point for each mod. As we can see, endorsements and total downloads are indeed heavily correlated. The most extreme outlier in our dataset, the point at the top right, is for the mod SkyUI, a complete overhaul of the Skyrim user interface.
We will apply several formatting options to make the scatter plot cleaner and easier to use. We will format the points, remove unnecessary grid lines, modify the tooltip, and edit our axes.
The circle outlines we are currently using for our data points feel very cluttered to me. Changing them to a simpler shape would make it easier to examine the visualization and understand what is happening. Something the outlines do very well, however, is indicate where there is a lot of point overlap. In the center the overlap is so extreme that we basically have a solid blue blob. If we use a simpler shape, like solid circles, such a blob would not have the same meaning, since we could achieve it with far fewer mods. We will mitigate this slightly by making our points partially transparent. That way we will see darker areas where there is overlap.
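To see why partial transparency shows overlap, it helps to work out how opacity accumulates. The sketch below assumes standard "over" alpha compositing, where n stacked marks of opacity a combine to 1 - (1 - a)^n. Whether Tableau blends marks in exactly this way is an assumption, and the opacity value 0.3 is an arbitrary example.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n overlapping marks that each have opacity alpha."""
    return 1 - (1 - alpha) ** n


# Assumed per-mark opacity of 0.3; print how dark the stack gets as n grows.
for n in (1, 2, 5, 10):
    print(n, round(stacked_opacity(0.3, n), 3))
```

Under this assumption a lone point stays light, and an area only approaches a solid color once many points are stacked there, so darkness becomes a rough indicator of density.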
As with the bar chart, we have unnecessary grid lines, which are chartjunk. We will remove these.
Note that this time we do not need to switch to the Columns tab to change Grid Lines there. If you check, you will see that Grid Lines is already set to None. It would seem that the issue we had with bar charts is not an issue for scatter plots.
The biggest issue with our tooltip is that it currently displays Mod ID in addition to endorsements and total downloads. Although Mod ID is useful for telling Tableau how to group and plot our data, it is not a particularly meaningful value, either to us or our audience. Mod name would be much more informative and interesting.
Finally, we will make a few edits to the axes. There are two main changes worth making. The first is making the range fixed. The endorsements axis currently goes up to somewhere between 700,000 and 800,000, but if we filtered the dataset, perhaps showing only mods with the category Books and Scrolls, the range would adjust to fit that limited dataset. Unless the viewer noticed the changed range, it would look like mods from Books and Scrolls had more endorsements (and total downloads) than they actually do. Keeping the range fixed eliminates this problem. It does mean we may be left with extra whitespace depending on what subset of mods we are looking at, but I find this a small price to pay.
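A little arithmetic shows how misleading an automatic range can be. In the sketch below, the subset values and the fixed maximum of 800,000 are assumptions chosen for illustration. The automatic range is simplified to end exactly at the largest value in view, which is not necessarily how Tableau pads its axes.

```python
def relative_height(value, axis_max):
    """How far up the axis a value is drawn, as a fraction of the axis length."""
    return value / axis_max


# Hypothetical endorsement counts for a small filtered subset of mods.
subset = [5_000, 12_000, 20_000]

auto_max = max(subset)  # simplification: automatic range ends at the largest value
fixed_max = 800_000     # assumed fixed upper bound for the full dataset

top_mod = max(subset)
print(relative_height(top_mod, auto_max))   # drawn at the very top of the axis
print(relative_height(top_mod, fixed_max))  # drawn close to the bottom
```

With the automatic range, the top mod in the subset lands where SkyUI does in the full chart. With the fixed range it stays near the bottom, which matches its actual standing.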
The other change we will make is to the axis ticks themselves. We have a tick mark every 100,000 endorsements, making for a fairly busy axis. These extra tick marks are not really helping us or our viewers, however. As long as I have a general idea of the range, that is good enough to estimate what is happening in the scatter plot. If I want to know exactly how many endorsements or downloads a mod has, I can hover over the data point and examine the tooltip. We will therefore set larger gaps between tick marks.
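As a rough illustration of the effect, the sketch below lists the tick positions for two intervals on the same fixed range. The upper bound of 800,000 and the wider interval of 400,000 are assumed values, and `tick_positions` is a hypothetical helper, not a Tableau setting.

```python
def tick_positions(axis_max, step):
    """Tick values from 0 up to axis_max (inclusive) at a fixed interval."""
    return list(range(0, axis_max + 1, step))


fixed_max = 800_000  # assumed fixed upper bound of the axis

busy = tick_positions(fixed_max, 100_000)    # a tick every 100,000
sparse = tick_positions(fixed_max, 400_000)  # a wider, assumed interval

print(len(busy), busy)
print(len(sparse), sparse)
```

The wider interval still marks the bottom, middle, and top of the range, which is enough to estimate a point's position before checking the tooltip for exact values.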