16.5 Exercises
16.5.1 A simple function
With a sample of data on a continuous scale, it can be useful to plot a histogram and then superimpose a normal density function to assess the suitability of the normal model. For the purpose of the exercise, we can simply simulate some random data using the rnorm function.
In plotting the histogram, it will help to use the additional probability argument in the hist function so that the histogram is scaled to have area 1. To superimpose the normal density, we can define a grid of values along the horizontal axis and then compute the values of the normal density function there, using dnorm. The normal density curve can then be drawn simply by connecting these positions with lines, using the lines function. The code below does this.
x <- rnorm(50)   # simulated data; the sample size of 50 is an arbitrary choice
xgrid <- seq(min(x), max(x), length.out = 100)
ygrid <- dnorm(xgrid)
hist(x, probability = TRUE)
lines(xgrid, ygrid, lwd = 2, col = "blue")
- Turn the code into a function, with a name of your choosing, with a single argument x, the data to be plotted. As the code above assumes the mean and standard deviation of the normal density to be 0 and 1 respectively, amend the code to draw a density function whose mean and standard deviation match the sample mean and sample standard deviation of the data. [The mean and sd functions will be helpful here, as will the mean and sd arguments of the dnorm function.] Test the function out with data of your choice (or simulated data).
- Add arguments to the function to allow the user to control the colour and line width of the plotted normal density function.
- Amend the code of the histogram and normal density example above to ensure that (a) the density curve is never truncated because its peak is higher than the highest histogram bar and (b) the normal density always stretches out to at least 3 standard deviations either side of the mean. [This task is a little trickier. Notice that if you set the argument plot = FALSE in an initial call to hist and store the result then you can get information on the height and spread of the bars, among other things.]
16.5.2 A simple function: thresholds
Write a function which accepts a vector and a single numeric value as input and which then calculates how many elements of the vector exceed the threshold set by the given numerical value. Apply this function to demographic data across the world to calculate the number of countries whose population exceeds 50 million. Code to read and collate UN data is available at the end of Section 4.1. If you prefer, a simpler option is to use the gapminder data, available in the gapminder package.
Then plot how this count changes over time.
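One building block that may help is the fact that a comparison returns a logical vector, and that summing a logical vector counts its TRUE values. The sketch below illustrates this on made-up numbers; the objects pop, big, toy and counts are invented for illustration and are not taken from the UN or gapminder data.

```r
pop <- c(12, 67, 51, 8, 140)   # made-up populations, in millions
big <- pop > 50                # logical vector: TRUE where the threshold is exceeded
big
sum(big)                       # TRUE counts as 1, so this is the number exceeding 50

# The same idea applied within groups, using a made-up data frame with a year column
toy <- data.frame(year = c(2000, 2000, 2000, 2010, 2010, 2010),
                  pop  = c(40, 55, 90, 52, 60, 95))
counts <- tapply(toy$pop > 50, toy$year, sum)
counts                         # one count per year
```

A vector of counts by year such as this could be passed to plot to show the change over time. With real data, missing values may need attention, for example through the na.rm argument of sum.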
16.5.3 Improving the sim.max function
Consider how you might improve the coding of the sim.max function. Here are two possibilities.
- Generate a full set of n observations from the lognormal distribution. Now remove those observations which do not lie below the threshold. This can be used as a good starting point for the samp vector, rather than starting from nothing.
- Amend the code within the while loop to generate only as many observations from the lognormal distribution as are needed, again removing those which do not lie above the threshold. This should greatly reduce the number of loops required.
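The two ideas can be sketched together as follows. This is not the sim.max function itself, whose code is not reproduced in this section: sample_below is a hypothetical helper, and it assumes that the observations to be kept are those lying below the threshold. The default meanlog and sdlog values are simply the defaults of rlnorm.

```r
# Hypothetical helper (not sim.max): collect n lognormal observations
# which lie below `threshold`. The acceptance rule is an assumption.
sample_below <- function(n, threshold, meanlog = 0, sdlog = 1) {
  samp <- rlnorm(n, meanlog, sdlog)   # full batch as a starting point
  samp <- samp[samp < threshold]      # discard those not below the threshold
  while (length(samp) < n) {
    # generate only as many observations as are still needed
    extra <- rlnorm(n - length(samp), meanlog, sdlog)
    samp  <- c(samp, extra[extra < threshold])
  }
  samp
}

set.seed(1)                           # arbitrary seed, for reproducibility only
s <- sample_below(1000, threshold = 2)
length(s)                             # exactly n observations are returned
all(s < 2)                            # and all of them lie below the threshold
```

Because each pass through the loop generates a whole batch rather than a single value, the number of iterations is typically very small unless the threshold excludes most of the distribution.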