Condensed R For Data Science: Data Visualisation

Louis Becker
Jul 7, 2018 in R
cross-validation grid_search clustering dimensionality_reduction machine_learning regression reinforcement_learning

Data Visualisation Chapter

This piece is part of a series that serves as a condensed help guide that I use to explore R and the tidyverse packages as I work through R for Data Science available here

First, we install/load the relevant packages by installing the tidyverse. This “package” effectively contains all packages developed by the Hadley Wickham:

library("tidyverse")

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.1     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data Visualisation Chapter

In this chapter we will work with the ggplot2 package, but I’m lazy so I just call the entire tidyverse.

We start with some scatter plots of the mpg dataset, Scatter 1.

ggplot(mpg) + geom_point(aes(displ, hwy)) + ggtitle("Scatter 1")

We could (and often should) be explicit about which parameters we are assigning variables to. Notice the mapping, x and y parameters that have been inserted into the code used to generate the previous graph.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + ggtitle("Scatter 1")

The function ggplot() creates an empty graph off the underlying data set. The geom_ functions then add a layer to that.

# Practice
ggplot(mpg) + geom_point(aes(y = hwy, x = cyl)) + ggtitle("Scatter 2")

Here is another example:

ggplot(mpg) + geom_point(aes(y = class, x = drv)) + ggtitle("Scatter 3")

By the way, Scatter 3 is not a useful plot because the variables shown here cannot have a linear relationship.

An epic feature of ggplot2 is that it Can add a third variable to a 2D plot by mapping this variable to an aesthetic – the visual property of the objects in my plot(size, shape, colour of points). This is described as the level. We can use color, size, shape and alpha (transparency) in the aesthetic function to indicate the third variable! This is known as scaling. In essence, just map the aesthetic and ggplot does the rest.

ggplot(mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class)) + ggtitle("Scatter 4")

Aesthetic mapping Exercises

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

In the above example, the dots are red despite the color = “blue” specification because it is specified inside the aes() field. Setting the colour and other such parameters uniformly should be outside this mapping field in the layer field.

Possible aesthetics:

alpha (transparency)
colour
fill (the fill arsthetic can also be used in a variable capacity to show different colours for different observations)
group (use categorical variable here. It can replace color, and groups objects on the plot. See further down.)
shape
size
stroke (understand a little)

You can also map subject to certain constraints e.g.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl < 6))

This will show which of the observations adhere to a particular condition or not.

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

ggplot(mpg) +
  geom_point(aes(x = cyl, y = hwy, col = trans)) +
  facet_wrap(~ class, nrow = 3)

Note: “The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.”

Facets can be done for two variables. To do this, add facet_grid() to the plot call. The first argument of this mapping is also a formula.

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

To facet with just one variable, use a “.” At the other end of the formula.

Geoms

“A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses.”

We can turn this…

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

…into this…

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess'

# Loess regression is the default

…by changing the geom. Every geom takes a mapping argument, but not every aes() works with every geom. For example, with line types:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

## `geom_smooth()` using method = 'loess'

Many geoms use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects (or use color, as previously mentioned).

We can plot multiple geoms for the same data and layer them over each other. This, however, causes repetition in the code. If, for example, the axes change, every geom’s x and y specs would have to change.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess'

In the case of layering different geoms, we can specify parameters like the axes as global options in the ggplot chart object:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess'

 # Same data passed to different geoms.

Epicly enough, we can still add features to each geom that will affect that layer only. For example, notice the colour in the following:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()

## `geom_smooth()` using method = 'loess'

I can also subset the data in a particular geom with filter() from the dplyr package.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

## `geom_smooth()` using method = 'loess'

# se = TRUE/FALSE displays confidence interval around the geom_smooth line or not.

Geom Exercises

Here I write code to recreate the graphs given in the book.

# Plot 1
ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 5) +
  geom_smooth(se = F)

## `geom_smooth()` using method = 'loess'

# Plot 2
ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 5) +
  geom_smooth(se = F, aes(group = drv))

## `geom_smooth()` using method = 'loess'

# Plot 3
ggplot(mpg, aes(displ, hwy, colour =  drv)) +
  geom_point() +
  geom_smooth(se = F)

## `geom_smooth()` using method = 'loess'

# Plot 4
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(col = drv)) +
  geom_smooth(se = F)

## `geom_smooth()` using method = 'loess'

# Plot 5
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour =  drv)) +
  geom_smooth(se = F, aes(linetype = drv))

## `geom_smooth()` using method = 'loess'

# Plot 6 -> think it's with shape? verify
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour =  drv)) +
  geom_smooth(se = F, aes(linetype = drv))

## `geom_smooth()` using method = 'loess'

Statistical Transformations

Consider a bar chart based on the diamonds data set:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

As you can see, this graph does not use a y-variable. This is because a bar chart creates new values to plot by design. It works with the number of observations within a range or bin and displays the count per bin. This is the same for histograms and frequency polygons. The algorithm that calculates these new values for the graph is called a stat(), short short for statistical transformation.

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called “Computed variables”. That describes how it computes two new variables: count and prop.

Generally, stats and geoms are interchangable. We recreate the same plot with the following.

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you would use stat() explicitly:

Overiding the default stat in a particular ggplot because you want a different stat. In the example below we change the default stat from “count” to “identity”. Identity uses the actual value assigned to a variable instead of its count.

# Create a small table with the tribble command.
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

# Now plot the demo table, but override the default stat in the bar chart geom.
# The count geom is overriden to
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

Override the default mapping of transformed variables e.g. display proportion on y-axis rather than count.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

You might want to draw greater attention to the statistical transformation in your code.

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

Transformation Exercises

ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth, ymin = min(diamonds$depth), ymax = max(diamonds$depth))) #Close enough

# Automatically groups according to cut, or x variable.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))

# Needs to group relative to whole x variable. Can say group=1 or group = "x"
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..,  group = 1))

Position Adjustments

You can colour a bar chart using the colour or fill aesthetics. The fill fill argument is more useful, since you can map a variable to this aesthetic to display different colours. A stacked bar chart of sorts.

# The "colour aesthetic" adds the border.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))

# The "fill" aesthetic adds the interior colour.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

The fill aesthetic can also be used as a variable to show different combinations.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.

position = identity

position = “identity” will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them vertically. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.

# Position = identity uses the raw value of the observation. Here, it places the bar at it's exact location
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.2, position = "identity")

# Here we specify the fill as NA. and empty the bars.
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
  geom_bar(fill = NA, position = "identity")

The identity position adjustment is more useful for 2d geoms, like points, where it is the default.

position = fill

position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

position = dodge

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Something interesting: position = jitter

Recall our first scatterplot.

ggplot(mpg) + geom_point(aes(displ, hwy)) + ggtitle("Scatter 1")

This plot contains only 126 points even though there are 234 observations in the dataset. The values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it hard to see where the mass of the data is.

This gridding and overplooting can be avoided by setting position to “jitter”. Position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise. Here is an expample:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales.

Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().

To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.

Position Adjustment exercises

# How to fix this
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Do this:
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()

# Default postion is dodge, putting data displays next to each other.
ggplot(data = mpg, mapping = aes(x = trans, y = hwy)) +
  geom_boxplot()

Coordinate Systems

The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.

Coord_flip()

coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

# Normal Plot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# Add new coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

Coord_quickmap()

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.

Coord_polar()

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

# flipped bar chart
bar + coord_flip()

# Coxcomb chart
bar + coord_polar()

Coordinate Systems Exercises

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() + #adds line to graph by slope and intercept
  coord_fixed() # Default ratio is 1. 1 unit of x is the same as 1 unit of y

StatsLab