Statistical Modelling
A conceptual, visual and practical introduction
The University of Glasgow
adrian.bowman@glasgow.ac.uk
14 November, 2024
Preface: what’s the problem?
The world abounds with data - but scientific and investigative work does not begin with data. It is true that data which are already available may give ideas and suggest hypotheses, but a serious project will start by thinking carefully about well-defined objectives. In other words, scientific work begins with a clearly stated problem.
Statistical modelling refers to the process by which we collect and use data to gain insight into the problem we have defined and to lead us towards a conclusion. In an interesting paper on scientific and statistical methods, MacKay and Oldford (2000) structured the process in the following broad steps, referred to by the acronym PPDAC. Some of the key questions which are likely to arise in each step have been highlighted.
- Problem
  - What are the key questions we would like to address?
  - What is the context in which these questions are framed?
- Plan
  - How should our experiment be designed?
  - What data should be collected and how much?
- Data
  - How should the data be checked for validity and consistency?
  - What methods of visual exploration should we use?
- Analysis
  - How should an appropriate model be constructed?
  - What does analysis of our model tell us about our problem?
- Conclusion
  - How should we report to others on our conclusions?
  - What limitations and caveats should we highlight?
This broad view of statistical modelling sets the agenda for the book.
There are many textbooks and resources available which discuss statistical modelling - why another one? This book takes a particular approach.
- The focus is on conceptual understanding of the main ideas behind statistical thinking and modelling. Some technical details are provided for those who are interested but engagement with this material is optional.
- There is a strong theme of visual communication of both data and the concepts behind statistical models.
- The approach is also practical, with extensive reference to the widely used statistical computing environment R as a means of engaging with the concepts and implementing the methods discussed. In particular, there is extensive use of real datasets. The aim is to discuss interesting and scientifically important questions, where possible with the data used in published papers. As data become increasingly ‘open’, datasets are read from publicly available sources wherever possible.
The target audience is those who need statistical methods to understand data. A good example is PhD students who are well motivated to analyse their experimental data. The scientific contexts of the examples come from a wide range of application areas, including the life sciences, the social sciences and topics of general interest. The aim is to provide examples which are both interesting and accessible, without the need for detailed technical knowledge of a particular area.
Throughout the book there are frequent references to the widely used statistical computing system R. It is possible to read this book without using R but it is primarily intended that the reader will use this powerful system to engage with the examples and exercises and with the whole process of statistical modelling. A description of how to install R, and the popular system RStudio which helpfully manages some aspects of the environment, is available immediately after this preface. A gentle introduction to R is provided in Chapter 2.
Acknowledgements
Please note that this is a work in progress. Please forgive some rough edges in the presentation here and there.