lm() function in R
Ethan Tse
December 21, 2025
December 24, 2025

# load libraries
library(ggplot2)
library(titanic)
Linear models are among the most commonly used tools in biomedical research, and I think many researchers would benefit from a better understanding of such a fundamental tool.
Linear models also sit at the core of many statistical methods and machine learning models, which is why they appear so prominently in statistics and data science courses. Although they appear simple, they are deceptively complex, and many statisticians spend their entire careers learning and developing techniques to improve them.
Decades of methodological work have expanded their flexibility, allowing them to handle surprisingly complex experimental designs while remaining relatively easy to interpret. This balance between expressiveness and interpretability is a major reason they continue to be used so widely.
Throughout my training, I have generally been encouraged to start with linear models and to move on to more complex approaches only when simpler ones are clearly inadequate. In practice, especially in biomedical research, more sophisticated models often need to perform substantially better than linear models to justify the added complexity and loss of interpretability. This trade-off becomes particularly important when the goal is inference—understanding relationships in the data—rather than prediction.
In this post, I focus on the most basic case: linear regression, which can be viewed as a building block for more advanced linear modeling frameworks.
Because the emphasis here is on [computation], I will keep the theoretical discussion light. Readers interested in the statistical foundations of linear models can refer to upcoming posts in the [theory] series.
We will work primarily with the stats::lm() function in R (which I think is a language very well suited for classical statistical analysis), and explore how it behaves in practice, its many function arguments, and how to interpret its output.
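As a quick orientation to those arguments, here is a minimal sketch that lists the formal arguments of stats::lm() without fitting anything. The variable name lm_args is my own placeholder, and the list is simply whatever your installed version of R reports.

```r
# List the names of the formal arguments of stats::lm().
# `lm_args` is a hypothetical name used only for this illustration.
lm_args <- names(formals(stats::lm))

# Print them so we can see what can be passed to lm().
print(lm_args)
```

Running ?lm in the console opens the help page that describes each of these arguments in detail.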
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
Call:
stats::lm(formula = Fare ~ Age, data = titanic_train)

Coefficients:
(Intercept)          Age  
      24.30         0.35  
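To make the printed output above more concrete, here is a minimal sketch of the same kind of call. It assumes nothing beyond base R: instead of the full titanic_train data, it uses only the Age and Fare values of the six rows previewed in the table, so the estimates will not match the 24.30 and 0.35 shown above. The names preview, fit, complete, slope, and intercept are my own placeholders.

```r
# Age and Fare for the six rows previewed in the table above.
# (Assumption: a tiny stand-in for titanic_train so this runs without the package.)
preview <- data.frame(
  Age  = c(22, 38, 26, 35, 35, NA),
  Fare = c(7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583)
)

# Fit Fare as a linear function of Age, mirroring the call shown above.
fit <- stats::lm(Fare ~ Age, data = preview)

# Estimated intercept and slope.
print(coef(fit))

# Number of rows actually used: with R's default na.action (na.omit),
# the row with a missing Age is dropped before fitting.
print(nobs(fit))

# For a single predictor, the least-squares estimates have a closed form:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
complete  <- preview[!is.na(preview$Age), ]
slope     <- cov(complete$Age, complete$Fare) / var(complete$Age)
intercept <- mean(complete$Fare) - slope * mean(complete$Age)

# These should agree with coef(fit) up to floating-point error.
print(c(intercept, slope))
```

With the titanic package loaded, replacing preview with titanic_train in the lm() call should reproduce the output printed above.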
When we call stats::lm in R, we are calling the lm() function from the base stats package. When we “fit” a linear model, we are
This writeup represents my knowledge of the topic, and I by no means claim to be an expert.
Please email me with any comments.