16.5 Exercises

16.5.1 A simple function

With a sample of data on a continuous scale, it can be useful to plot a histogram and then superimpose a normal density function to assess the suitability of the normal model. For the purpose of the exercise, we can simply simulate some random data using the rnorm function.

x <- rnorm(50)

In plotting the histogram, it will help to use the additional probability argument in the hist function so that the histogram is scaled to have area 1. To superimpose the normal density, we can define a grid of values along the horizontal axis and then compute the values of the normal density function there, using dnorm. The normal density curve can then be drawn simply by connecting these positions with lines, using the lines function. The code below does this.

xgrid <- seq(min(x), max(x), length.out = 100)
ygrid <- dnorm(xgrid)
hist(x, probability = TRUE)
lines(xgrid, ygrid, lwd = 2, col = "blue")
  • Turn the code into a function, with a name of your choosing, with a single argument x, the data to be plotted. As the code above assumes the mean and standard deviation of the normal density to be 0 and 1 respectively, amend the code to draw a density function whose mean and standard deviation match the sample mean and sample standard deviation of the data. [The mean and sd functions will be helpful here, as will the mean and sd arguments of the dnorm function.] Test the function out with data of your choice (or simulated data).
  • Add arguments to the function to allow the user to control the colour and line width of the plotted normal density function.
  • Amend the code of the histogram and normal density example above to ensure that (a) the density curve is never truncated because its peak is higher than the highest histogram bar and (b) the normal density always extends to at least 3 standard deviations on either side of the mean. [*This task is a little trickier. Notice that if you set the argument plot = FALSE in an initial call to hist and store the result, you can then get information on the height and spread of the bars, among other things.]
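
One possible way of putting the three tasks together is sketched below. It is only an illustration, not the intended solution: the function name hist_normal, the default colour and line width, and the choice to return the axis limits invisibly are all assumptions made here. The sketch relies on the density and breaks components of the object returned by hist when plot = FALSE.

```r
# hist_normal is a hypothetical name; the defaults for col and lwd are arbitrary.
hist_normal <- function(x, col = "blue", lwd = 2) {
  m <- mean(x)
  s <- sd(x)

  # Compute the histogram without drawing it, to learn the bar heights and breaks
  h <- hist(x, plot = FALSE)

  # Horizontal range: cover the bars and at least 3 sd either side of the mean
  xlim <- range(h$breaks, m - 3 * s, m + 3 * s)
  # Vertical range: cover the tallest bar and the peak of the normal density
  ylim <- c(0, max(h$density, dnorm(m, mean = m, sd = s)))

  # Evaluate the fitted normal density on a grid spanning the horizontal range
  xgrid <- seq(xlim[1], xlim[2], length.out = 200)
  ygrid <- dnorm(xgrid, mean = m, sd = s)

  hist(x, probability = TRUE, xlim = xlim, ylim = ylim)
  lines(xgrid, ygrid, col = col, lwd = lwd)

  # Return the limits invisibly so that they can be inspected if required
  invisible(list(xlim = xlim, ylim = ylim))
}

set.seed(1)
hist_normal(rnorm(50), col = "red", lwd = 3)
```

Because the peak of a normal density lies at its mean, evaluating dnorm at the sample mean is enough to find the height that the vertical axis must accommodate.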

16.5.2 A simple function: thresholds

Write a function which accepts a vector and a single numeric value as input and which then calculates how many elements of the vector exceed the threshold set by the given numerical value. Apply this function to demographic data across the world to calculate the number of countries whose population exceeds 50 million. Code to read and collate UN data is available at the end of Section 4.1. If you prefer, a simpler option is to use the gapminder data, available in the gapminder package.

Plot how this number changes over time.
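
A minimal sketch of such a function is given below, as one possibility among many. The name count_above is hypothetical, and ignoring missing values is an assumption made here rather than something the exercise specifies. The second half assumes that the gapminder package is installed and uses its pop and year columns; it is skipped otherwise.

```r
# count_above is a hypothetical name for the function the exercise asks for.
count_above <- function(x, threshold) {
  # x > threshold is a logical vector; summing it counts the TRUE values.
  # Assumption: missing values are ignored rather than propagated.
  sum(x > threshold, na.rm = TRUE)
}

# A small made-up vector, purely to check the behaviour
count_above(c(3, 8, 12, NA, 20), 10)

# Optional: apply to the gapminder data, if that package is installed
if (requireNamespace("gapminder", quietly = TRUE)) {
  gm <- gapminder::gapminder
  # One count per year: countries with population above 50 million
  counts <- tapply(gm$pop, gm$year, count_above, threshold = 50e6)
  plot(as.numeric(names(counts)), counts, type = "b",
       xlab = "Year", ylab = "Countries with population > 50 million")
}
```

Passing the threshold through tapply as a named argument keeps count_above general, so the same function can be reused with other thresholds or other variables.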

16.5.3 Improving the sim.max function

Consider how you might improve the coding of the sim.max function. Here are two possibilities.

  1. Generate a full set of n observations from the lognormal distribution. Now remove those observations which do not lie below the threshold. This can be used as a good starting point for the samp vector, rather than starting from nothing.

  2. Amend the code within the while loop to generate as many observations from the lognormal distribution as are still needed, again removing those which fail the threshold condition. This should greatly reduce the number of loops required.
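
The two suggestions above can be sketched together as follows. The definition of sim.max is not reproduced in this section, so the function below, sim_below, is a hypothetical stand-in rather than an amended version of it: its name, its arguments and the fact that it returns the sample itself are all assumptions. It also assumes, following item 1, that observations are retained only when they lie below the threshold.

```r
# Hypothetical stand-in for sim.max, whose definition is not shown here.
# Assumption: observations are kept only if they lie below `threshold`.
sim_below <- function(n, threshold, meanlog = 0, sdlog = 1) {
  # Lognormal values are positive, so a non-positive threshold can never be met
  stopifnot(threshold > 0)

  # Suggestion 1: start from a full batch of n, keeping those below the threshold
  samp <- rlnorm(n, meanlog, sdlog)
  samp <- samp[samp < threshold]

  # Suggestion 2: top up with as many draws as are still missing,
  # again keeping only those below the threshold
  while (length(samp) < n) {
    extra <- rlnorm(n - length(samp), meanlog, sdlog)
    samp <- c(samp, extra[extra < threshold])
  }
  samp
}

set.seed(1)
s <- sim_below(1000, threshold = 2)
length(s)
max(s)
```

Each pass through the loop requests only the number of observations still missing, so the sample can never grow beyond n and no trimming is needed at the end.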