5 Activity 2
5.1 Box plots
The nottem data set records the monthly average air temperatures at Nottingham Castle from 1920 to 1939.
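The exercise below refers to R's printed output for this data set. The original code box is not reproduced here; a minimal sketch of what it presumably ran, using only base R, is:

```r
# nottem ships with base R (package datasets), so nothing needs to be loaded.
# Printing a monthly time series lays it out with one row per year
# and one column per month.
print(nottem)

# Basic facts about the series: first period, last period, observations per year.
start(nottem)
end(nottem)
frequency(nottem)
```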
Exercise 5.1
- Per the output, what was the average temperature in March of 1926?
- Using the min() function, find the lowest average temperature during this period, i.e. min(nottem). Do the same with max() to find the highest average temperature.
- Writing nottem == min(nottem) checks each entry of the data for whether it equals the minimum. The month that has the minimum will display as TRUE in the output. In which month and year did the lowest average temperature occur?
If you accidentally use only one equals sign, you will overwrite the nottem dataset. You can reset it with rm(nottem).
- In which month and year did the highest average temperature occur?
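The comparison in the exercise above returns a long table of TRUE/FALSE values. As a possible shortcut, which() can pick out the position of the TRUE entry, and cycle() can translate that position into a month. The variable names here (idx, month_of_min, year_of_min) are made up for illustration, and the year arithmetic assumes the series starts in January, as nottem does.

```r
# Position(s) in the series where the value equals the overall minimum
idx <- which(nottem == min(nottem))

# cycle() gives the month number (1 to 12) of each observation
month_of_min <- month.abb[cycle(nottem)[idx]]

# 12 observations per year, and the first observation is January of the start year
year_of_min <- start(nottem)[1] + (idx - 1) %/% 12

month_of_min
year_of_min
```

Swapping min for max gives the same lookup for the highest average temperature.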
Next, we will use box and whisker plots to compare the median temperatures per month. If we just called boxplot(nottem) right now, it would draw a single box plot of the entire data set. So we have to prepare our data by converting it into a matrix first.
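The code box that performs this conversion is not reproduced here. One possible way to do it is sketched below; the name temperatures and the column labels "Jan" to "Dec" are taken from the exercise that follows, while the row labels and axis titles are assumptions.

```r
# Fill a 20 x 12 matrix row by row: one row per year, one column per month.
temperatures <- matrix(
  as.numeric(nottem),
  ncol = 12,
  byrow = TRUE,
  dimnames = list(as.character(1920:1939), month.abb)  # rows = years, columns = "Jan" ... "Dec"
)

# boxplot() on a matrix draws one box per column, i.e. one box per month.
boxplot(temperatures, xlab = "Month", ylab = "Average temperature")
```

Because the series is stored in time order, filling by row puts each year on its own row, so every column collects the 20 values for a single month.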
Exercise 5.2
- Reading the plot, what are the top 5 months, in order, when sorted by the median of their average temperatures? Note: the median is the line in the middle of the box.
- It’s a bit hard to compare January and December at the moment. Change the last line to read boxplot(temperatures[,c("Jan", "Dec")]) and then state which of these two months has the higher median average temperature and which has the higher maximum average temperature.
Now let’s look at the numbers.
- Between January and December, which has the greater mean average temperature?
- For this particular data, would you say the means are generally close to or far from the medians?
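To put numbers beside the box plots, the mean and median of each month can be tabulated. The following is only a sketch: it works directly from nottem rather than from the matrix used in the course's code box, and the names month_index, values, and comparison are hypothetical.

```r
# Month number (1 to 12) for each of the 240 observations
month_index <- as.integer(cycle(nottem))
values <- as.numeric(nottem)

# Mean and median of the 20 yearly values for each month
monthly_mean <- tapply(values, month_index, mean)
monthly_median <- tapply(values, month_index, median)

# One row per month, so means and medians can be compared side by side
comparison <- data.frame(
  month = month.abb,
  mean = as.numeric(monthly_mean),
  median = as.numeric(monthly_median)
)
comparison
```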
5.2 Histograms
The faithful dataset contains data on the length of eruption times and waiting times in between eruptions for the Old Faithful geyser in Yellowstone National Park.
As always, we’ll start by asking R to show us what’s in the variable:
You’ll notice that R gives us a very long list. If we wanted to get just a sneak peek at the first 10 rows, we could have instead used the function head(faithful, 10).
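As a quick illustration of the shorter preview, the sketch below uses head() as described above and adds str(), which is not mentioned in the original text but is a base R function that summarizes the structure of a data frame.

```r
# First 10 rows only, instead of all 272
head(faithful, 10)

# Compact overview: number of rows, column names, and column types
str(faithful)
```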
Now that we know what the dataset looks like, we see that there are two columns: eruptions (duration of each eruption in minutes) and waiting (time to the next eruption in minutes). Let’s extract these variables and use a histogram to help us picture the data.
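The code box for this step is not reproduced here. A minimal sketch of the extraction and the two histograms might look like this; the variable names eruption_time and waiting_time, and the titles and axis labels, are assumptions.

```r
# Pull each column out of the data frame as a plain numeric vector
eruption_time <- faithful$eruptions
waiting_time <- faithful$waiting

# One histogram per variable
hist(eruption_time, main = "Eruption time", xlab = "Minutes")
hist(waiting_time, main = "Waiting time", xlab = "Minutes")
```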
Exercise 5.3
- Notice that the data doesn’t have just one hill but two hills to it. What is the name for this shape of data?
- Use the summary function to get the median and mean of each column. Per the histogram, is there a lot of data near the mean/median or does the data lie elsewhere in the picture?
- In the code box before this, change hist to boxplot for both eruption time and waiting time. Do these box plots give an accurate picture of how the data looks and is distributed? (Remember: about 50% of the data will be positioned in the box and about 25% along each whisker.)
For comparison, here are some histograms of the Nottingham Castle temperatures. These have one hill so they’re uni___?
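The original figures are not reproduced here, and the text does not say which months they showed. As a sketch, histograms for two arbitrarily chosen months (January and July) can be drawn from nottem so their shape can be compared with the faithful histograms.

```r
# Split the series by month number: 1 = January, 7 = July (arbitrary picks)
values <- as.numeric(nottem)
month_index <- as.integer(cycle(nottem))
jan <- values[month_index == 1]
jul <- values[month_index == 7]

# Draw the two histograms side by side, then restore the plot layout
old_par <- par(mfrow = c(1, 2))
hist(jan, main = "January", xlab = "Average temperature")
hist(jul, main = "July", xlab = "Average temperature")
par(old_par)
```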
5.3 Scatter plots
One question you could ask when seeing the Old Faithful histograms is whether longer waiting times lead to longer eruption times. To visualize that question, we need to plot both variables at the same time. The tool for this is the scatter plot, which puts a point at each \((x, y)\) coordinate where \(x\) is a waiting time and \(y\) is an eruption time.
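The plot itself is not reproduced here. A minimal sketch of such a scatter plot is given below; the trend line is assumed to be an ordinary least-squares line fitted with lm(), which the original text does not confirm, and the name trend is hypothetical.

```r
# Waiting time on the x-axis, eruption time on the y-axis
plot(
  faithful$waiting, faithful$eruptions,
  xlab = "Waiting time (minutes)",
  ylab = "Eruption time (minutes)"
)

# Assumed trend line: straight-line fit of eruption time on waiting time
trend <- lm(eruptions ~ waiting, data = faithful)
abline(trend, col = "red")

# The fitted line can also be read off numerically for a given waiting time
predicted <- predict(trend, newdata = data.frame(waiting = 80))
predicted
```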
Exercise 5.4
- Per the scatter plot, are points which have a lower waiting time (further to the left) associated with a higher or lower eruption time (up/down)?
- If the waiting time to the latest eruption is 80 minutes, about how long does the trend line predict the eruption time will be?
We will discuss scatter plots in more detail later in the course.