Survival Analysis
Content is still being updated.
- Survival analysis concerns a special kind of outcome variable: the time until an event occurs.
- For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer.
- We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment.
- This sounds like a regression problem. But there is an important complication: some of the patients have survived until the end of the study. Such a patient's survival time is said to be censored.
- We do not want to discard this subset of surviving patients, since the fact that they survived at least five years amounts to valuable information.

# Survival and Censoring Times
- For each individual, we suppose that there is a true failure or event time T, as well as a true censoring time C.
- The survival time represents the time at which the event of interest occurs (such as death).
- By contrast, the censoring time is the time at which censoring occurs: for example, the time at which the patient drops out of the study or the study ends.
- We observe either the survival time T or else the censoring time C.
Specifically, we observe the random variable

$$Y = \min(T, C).$$

If the event occurs before censoring (i.e. T < C) then we observe the true survival time T; if censoring occurs before the event (T > C) then we observe the censoring time C. We also observe a status indicator,

$$\delta = \begin{cases} 1 & \text{if } T \le C, \\ 0 & \text{if } T > C. \end{cases}$$

Finally, in our dataset we observe n pairs $(Y, \delta)$, which we denote as $(y_1, \delta_1), \ldots, (y_n, \delta_n)$.

## A Closer Look at Censoring

- Suppose that a number of patients drop out of a cancer study early because they are very sick.
- An analysis that does not take into consideration the reason why the patients dropped out will likely overestimate the true average survival time.
- Similarly, suppose that males who are very sick are more likely to drop out of the study than females who are very sick. Then a comparison of male and female survival times may wrongly suggest that males survive longer than females.
- In general, we need to assume that, conditional on the features, the event time T is independent of the censoring time C. The two examples above violate the assumption of independent censoring.
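The effect of violating independent censoring can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: event times are drawn from an exponential distribution with a hypothetical mean of 2 years, and the mean is estimated with the standard censored-data exponential estimate (total observed time divided by number of observed events), which is valid when censoring is independent. The dropout rule in the second scenario (half of the patients with T < 1 leave the study shortly before their event) is invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_mean = 2.0  # hypothetical mean survival time, in years

# True event times T (assumed exponential purely for illustration).
T = rng.exponential(true_mean, size=n)

def observe(T, C):
    """Return the observed time Y = min(T, C) and the status indicator delta."""
    Y = np.minimum(T, C)
    delta = (T <= C).astype(int)  # 1 if the event was observed, 0 if censored
    return Y, delta

def exponential_mean_estimate(Y, delta):
    """Censored-data estimate of the mean under an exponential model:
    total observed time divided by the number of observed events."""
    return Y.sum() / delta.sum()

# Scenario 1: censoring times drawn independently of T.
C_indep = rng.exponential(4.0, size=n)
est_indep = exponential_mean_estimate(*observe(T, C_indep))

# Scenario 2: informative censoring. Half of the "very sick" patients
# (here: T < 1 year) drop out shortly before their event would occur.
very_sick = T < 1.0
drops_out = very_sick & (rng.random(n) < 0.5)
C_inform = np.where(drops_out, 0.9 * T, np.inf)
est_inform = exponential_mean_estimate(*observe(T, C_inform))

print(f"true mean:                     {true_mean:.2f}")
print(f"estimate, independent censor.: {est_indep:.2f}")
print(f"estimate, informative censor.: {est_inform:.2f}")
```

In the second scenario the early deaths of dropouts are never counted as events, so the same estimator that works under independent censoring reports a mean survival time that is too large.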
The Survival Curve
- The survival function (or curve) is defined as $S(t) = \Pr(T > t)$.
- This decreasing function quantifies the probability of surviving past time t.
- For example, suppose that a company is interested in modeling customer churn. Let T represent the time that a customer cancels a subscription to the company's service.
- Then S(t) represents the probability that a customer cancels later than time t. The larger the value of S(t), the less likely that the customer will cancel before time t.

## Kaplan-Meier Survival Curve
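One standard way to estimate S(t) from observed pairs of times and status indicators is the Kaplan-Meier (product-limit) estimator: at each distinct event time, the running estimate is multiplied by the fraction of individuals still at risk who did not experience the event. The following is a minimal sketch assuming NumPy; the function name `kaplan_meier` and the toy churn data are hypothetical and are not taken from these notes.

```python
import numpy as np

def kaplan_meier(y, delta):
    """Kaplan-Meier estimate of S(t).

    y     : observed times (event or censoring time, whichever came first)
    delta : 1 if the event was observed, 0 if the observation was censored

    Returns the distinct event times and the estimated survival
    probability just after each of them.
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    event_times = np.unique(y[delta == 1])  # sorted distinct event times
    surv = []
    s = 1.0
    for d in event_times:
        at_risk = np.sum(y >= d)                  # still under observation at d
        events = np.sum((y == d) & (delta == 1))  # events occurring exactly at d
        s *= (at_risk - events) / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical toy data: months until a customer cancels (delta = 1)
# or until we stop observing that customer (delta = 0).
y = np.array([2, 3, 3, 5, 6, 8, 8, 10])
delta = np.array([1, 1, 0, 1, 0, 1, 1, 0])

times, surv = kaplan_meier(y, delta)
for t, s in zip(times, surv):
    print(f"t = {t:4.1f}   estimated S(t) = {s:.3f}")
```

When no observation is censored, this estimate reduces to the plain fraction of individuals whose time exceeds t, which matches the definition of S(t) directly.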
The Log-Rank Test
The Proportional Hazards Model
The Hazard Function
The hazard function or hazard rate, also known as the force of mortality, is formally defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t \mid T > t)}{\Delta t}.$$
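The hazard can be read as the limit, as Δt shrinks to zero, of Pr(t < T ≤ t + Δt | T > t) / Δt, that is, the instantaneous event rate among individuals still at risk at time t. The sketch below checks this reading numerically. It assumes a Weibull distribution for T, chosen here only because its survival function and hazard have simple closed forms; the parameter values are hypothetical.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = Pr(T > t) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))

def hazard_from_definition(t, dt, shape, scale):
    """Approximate h(t) by Pr(t < T <= t + dt | T > t) / dt for a small dt."""
    s_t = weibull_survival(t, shape, scale)
    s_next = weibull_survival(t + dt, shape, scale)
    conditional_prob = (s_t - s_next) / s_t  # Pr(t < T <= t + dt | T > t)
    return conditional_prob / dt

def weibull_hazard(t, shape, scale):
    """Closed-form Weibull hazard: (shape / scale) * (t / scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

shape, scale, t = 1.5, 3.0, 2.0  # hypothetical values
approx = hazard_from_definition(t, dt=1e-6, shape=shape, scale=scale)
exact = weibull_hazard(t, shape, scale)
print(f"finite-difference hazard: {approx:.6f}")
print(f"closed-form hazard:       {exact:.6f}")
```

With shape equal to 1 the Weibull reduces to the exponential distribution, and the hazard is the same constant at every t.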