8.2 Comparing proportions


Example: Smoking and lung cancer
In a famous historical study of the association between smoking and lung cancer, Doll & Hill compared the numbers of smokers and non-smokers in samples of lung cancer patients and controls. The data for females are shown below.

                 cases  controls
    smokers         41        28
    non-smokers     19        32

Is there evidence of a link between smoking and lung cancer?


Details of how these data were collected are given in the paper. The choice of an appropriate control group raises interesting questions; in this study, the controls were other hospital patients who were not suffering from lung cancer.

The principal question of interest is whether the proportion of smokers among the cases differs from the proportion of smokers among the controls. We denote the underlying true proportions among the cases and controls by $p_1$ and $p_2$ respectively, with corresponding sample sizes $n_1$ and $n_2$. We can estimate the true proportions by the sample proportions,
$$\hat{p}_1 = 41/60 = 0.683, \qquad \hat{p}_2 = 28/60 = 0.467.$$
We can also calculate the standard error of each sample proportion as
$$se_1 = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1} = \sqrt{0.683 \times 0.317/60} = 0.060,$$
$$se_2 = \sqrt{\hat{p}_2(1-\hat{p}_2)/n_2} = \sqrt{0.467 \times 0.533/60} = 0.064.$$
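
The sample proportions and standard errors for the two groups can be checked numerically. The following is a minimal sketch in Python, using only the counts from the table; the helper name `proportion_and_se` is an illustrative choice and is not part of the original study or of any library.

```python
from math import sqrt


def proportion_and_se(successes: int, n: int) -> tuple[float, float]:
    """Return the sample proportion and its estimated standard error.

    The standard error uses the usual binomial formula
    sqrt(p_hat * (1 - p_hat) / n).
    """
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se


# Counts from the table: smokers among the 60 cases and the 60 controls.
p1_hat, se1 = proportion_and_se(41, 60)
p2_hat, se2 = proportion_and_se(28, 60)

print(f"cases:    p1_hat = {p1_hat:.3f}, se1 = {se1:.3f}")
print(f"controls: p2_hat = {p2_hat:.3f}, se2 = {se2:.3f}")
```

Rounded to three decimal places, the printed values should match the hand calculation in the text.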

However, it is the difference between the two groups which is of interest to us. A natural estimate of the difference in proportions $p_1 - p_2$ is the difference of the sample proportions, $\hat{p}_1 - \hat{p}_2 = 0.683 - 0.467 = 0.216$. We can also calculate the standard error of this difference by combining the individual standard errors, as follows:
$$se_{\text{difference}} = \sqrt{se_1^2 + se_2^2}.$$
Notice that the squared standard errors are added together, despite the fact that the estimates of the proportions are being subtracted. This is because we are measuring the uncertainty involved, and the uncertainty of the difference combines the uncertainties of the individual components. With the present data this gives
$$se_{\text{difference}} = \sqrt{0.060^2 + 0.064^2} = 0.088.$$
An approximate 95% confidence interval for the difference in proportions is then
$$0.216 \pm 2 \times 0.088, \quad \text{i.e.} \quad 0.216 \pm 0.176, \quad \text{i.e.} \quad (0.040, 0.392).$$
Since this confidence interval does not contain 0, we have clear evidence that the proportions of smokers in the case and control groups are different.
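
The interval calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function name `diff_in_proportions_ci` is hypothetical, and the default multiplier of 2 is the rounded value used in the text (about 1.96 under a normal approximation). Because the code carries full precision through every step, its results may differ from the hand calculation in the third decimal place.

```python
from math import sqrt


def diff_in_proportions_ci(x1: int, n1: int, x2: int, n2: int,
                           multiplier: float = 2.0):
    """Approximate confidence interval for p1 - p2.

    Returns the estimated difference, its standard error, and the
    interval (lower, upper) = difference -/+ multiplier * standard error.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Squared standard errors of the two sample proportions.
    se1_sq = p1_hat * (1 - p1_hat) / n1
    se2_sq = p2_hat * (1 - p2_hat) / n2
    diff = p1_hat - p2_hat
    # Squared standard errors are added, even though the estimates are subtracted.
    se_diff = sqrt(se1_sq + se2_sq)
    lower = diff - multiplier * se_diff
    upper = diff + multiplier * se_diff
    return diff, se_diff, (lower, upper)


# Smokers among cases (41 of 60) and among controls (28 of 60).
diff, se_diff, (lower, upper) = diff_in_proportions_ci(41, 60, 28, 60)
print(f"difference = {diff:.3f}, se = {se_diff:.3f}")
print(f"approximate 95% CI: ({lower:.3f}, {upper:.3f})")
print("interval excludes 0:", lower > 0 or upper < 0)
```

Passing `multiplier=1.96` gives a slightly narrower interval; with these data it still lies entirely above 0.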