While bar charts are good for displaying one measurement for categorical data, they are not nearly so helpful for comparing different measurements to each other. For example, endorsements and downloads are two interesting measures we have not worked with yet. If we want to see how they relate to each other, even a side-by-side bar chart (where each category has two bars, one for endorsements and one for downloads) will not be very helpful in making any meaningful comparisons. Further, if we want to examine the relationship between the two measures for individual mods, rather than for entire categories of mods, we would be out of luck. We would need 10,000 bars! Instead, we will make use of the scatter plot, an excellent type of chart for finding the correlation between two measurements.
In this section we will make two scatter plots. The first will compare Endorsements to Total Downloads, and the second will compare a calculated field we will call Approval Ratio to Total Downloads.
Our goal for the first scatter plot is to compare the endorsements a mod receives with the total number of times that mod is downloaded. How are the two measures related to each other? My initial assumption is that they would be heavily correlated, with high total downloads corresponding to high endorsements, but how related will they be and how much variation will there be? Another interesting question is: Which mods stand out within these measures? There are sure to be a few outliers, data points that do not fall within the same area as the majority of points.
What happens if we use tags.csv instead of mods.csv? We will be creating a point in the plot for each mod. In mods.csv, one record corresponds to one mod, so this will be a straightforward process. In tags.csv, many mods have multiple records, so the default aggregation would add together the endorsements and downloads from each of these records, inflating a mod's values in proportion to the number of tags it uses. Changing the aggregation from sum to average would correct the inflation, at the cost of an extra step, but it leaves a second problem unsolved: some mods may not have tags at all, and these mods would not appear in our tags dataset. The only solution for this issue (that does not involve creating a new dataset) is to use the mods dataset.
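Although this chapter works entirely in Tableau, the aggregation problem may be easier to see in a few lines of code. The following is a minimal sketch in Python with pandas, using two tiny made-up tables in place of mods.csv and tags.csv. The column names (mod_id, tag, endorsements, total_downloads) and all of the values are assumptions for illustration only, not the real contents of the files.

```python
import pandas as pd

# Hypothetical stand-in for mods.csv: one record per mod.
mods = pd.DataFrame({
    "mod_id": [1, 2, 3],
    "endorsements": [100, 40, 5],
    "total_downloads": [1000, 500, 80],
})

# Hypothetical stand-in for tags.csv: one record per (mod, tag) pair.
# Mod 1 has three tags, mod 2 has one tag, and mod 3 has no tags at all.
tags = pd.DataFrame({
    "mod_id": [1, 1, 1, 2],
    "tag": ["armor", "weapons", "lore", "ui"],
    "endorsements": [100, 100, 100, 40],
    "total_downloads": [1000, 1000, 1000, 500],
})

measures = ["endorsements", "total_downloads"]

# Summing per mod counts each mod once per tag, inflating its values.
summed = tags.groupby("mod_id")[measures].sum()

# Averaging per mod undoes the inflation...
averaged = tags.groupby("mod_id")[measures].mean()

# ...but mods without tags never appear in the tags table at all.
missing = mods.loc[~mods["mod_id"].isin(averaged.index), "mod_id"].tolist()

print(summed)
print(averaged)
print("Mods missing from the tags table:", missing)
```

In this toy data, the sum triples the values for the mod with three tags, the average restores them, and the untagged mod is absent either way, which is why the mods dataset is the safer source.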
Scatter plots rely more heavily on the Rows and Columns fields (found at the top of the application) than our previous charts did. However, it is not necessary to know in advance how we want the data arranged. We will demonstrate this by adding both our measures to the Rows field.
This is the resulting chart:
Unfortunately, we only have one point on our chart. This is because Tableau does not know how to group our data. By default, it is treating everything in the dataset as one group, and therefore one point, taking the sum of all endorsements and the sum of all total downloads from the dataset. If we want it to do something different, we will have to tell Tableau how to group the data. Do we want a point for each category? For each country? For each mod? Since we do want a point for each mod, we need some kind of unique identifier so that no mods will accidentally be grouped together. This is another situation where Mod ID is extremely useful. In a pinch, we could perhaps use the mod names, but we do not have a guarantee that these will always be different.
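The effect of grouping on the number of points can be sketched outside Tableau as well. The pandas example below uses a small made-up table; the column names and values are hypothetical, and the groupby calls only approximate the kind of aggregation described above rather than reproduce what Tableau does internally.

```python
import pandas as pd

# Hypothetical stand-in for the mods dataset. Two different mods share a name.
mods = pd.DataFrame({
    "mod_id": [1, 2, 3, 4],
    "name": ["Better Swords", "Better Swords", "Clear Map", "New UI"],
    "category": ["weapons", "weapons", "ui", "ui"],
    "endorsements": [100, 40, 5, 900],
    "total_downloads": [1000, 500, 80, 12000],
})

measures = ["endorsements", "total_downloads"]

# No grouping: every record is summed into a single point.
single_point = mods[measures].sum()

# Grouping by category: one point per category.
per_category = mods.groupby("category")[measures].sum()

# Grouping by name: mods that share a name are merged by accident.
per_name = mods.groupby("name")[measures].sum()

# Grouping by a unique identifier: one point per mod.
per_mod = mods.groupby("mod_id")[measures].sum()

print(len(per_category), len(per_name), len(per_mod))
```

With four mods in the table, grouping by category yields two points, grouping by name yields three because two mods share a name, and only grouping by the unique identifier yields all four.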
Now we should see a point for each mod. As we can see, endorsements and total downloads are indeed heavily correlated. The most extreme outlier in our dataset, the point at the top right, is for the mod SkyUI, a complete overhaul of the Skyrim user interface.
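If we wanted a number to go with the visual impression, a correlation coefficient is one option. The sketch below computes the Pearson correlation with pandas on a handful of made-up values; the mod names and figures are hypothetical and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical values for a few mods, indexed by a made-up name.
mods = pd.DataFrame(
    {
        "endorsements": [5, 40, 100, 900],
        "total_downloads": [80, 500, 1000, 12000],
    },
    index=["Mod A", "Mod B", "Mod C", "Mod D"],
)

# Pearson correlation between the two measures (values near 1 indicate
# a strong positive linear relationship).
correlation = mods["endorsements"].corr(mods["total_downloads"])

# The point furthest to the top right is the mod with the most downloads.
top_mod = mods["total_downloads"].idxmax()

print(round(correlation, 3), top_mod)
```

A single coefficient hides the outliers, so it complements the scatter plot rather than replacing it.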