Survival Analysis

Zhao Cong

Content is still being updated.

  • Survival analysis concerns a special kind of outcome variable: the time until an event occurs.
  • For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer.
  • We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment.
  • Sounds like a regression problem. But there is an important complication: some of the patients have survived until the end of the study. Such a patient's survival time is said to be censored.
  • We do not want to discard this subset of surviving patients, since the fact that they survived at least five years amounts to valuable information.

Survival and Censoring Times

  • For each individual, we suppose that there is a true failure or event time T, as well as a true censoring time C.
  • The survival time represents the time at which the event of interest occurs (such as death).
  • By contrast, the censoring time is the time at which censoring occurs: for example, the time at which the patient drops out of the study or the study ends.
  • We observe either the survival time T or else the censoring time C. Specifically, we observe the random variable
    $$Y = \min(T, C).$$
  • If the event occurs before censoring (i.e. T < C) then we observe the true survival time T; if censoring occurs before the event (T > C) then we observe the censoring time C.
  • We also observe a status indicator,
    $$\delta = \begin{cases} 1 & \text{if } T \le C \\ 0 & \text{if } T > C. \end{cases}$$
  • Finally, in our dataset we observe n pairs (Y, δ), which we denote as (y_1, δ_1), ..., (y_n, δ_n).

A Closer Look at Censoring

  • Suppose that a number of patients drop out of a cancer study early because they are very sick.
  • An analysis that does not take into consideration the reason why the patients dropped out will likely overestimate the true average survival time.
  • Similarly, suppose that males who are very sick are more likely to drop out of the study than females who are very sick. Then a comparison of male and female survival times may wrongly suggest that males survive longer than females.
  • In general, we need to assume that, conditional on the features, the event time T is independent of the censoring time C. The two examples above violate the assumption of independent censoring.
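
As a small illustration of the observed data under independent censoring, the sketch below simulates true event times T and censoring times C and builds the observed pairs (y_i, δ_i). The function name `observe`, the exponential and uniform distributions, and all parameter values are assumptions made for this example only.

```python
import random

def observe(event_times, censoring_times):
    """Return the observed pairs (y_i, delta_i) from true times T and C."""
    pairs = []
    for t, c in zip(event_times, censoring_times):
        y = min(t, c)               # observed time Y = min(T, C)
        delta = 1 if t <= c else 0  # status: 1 = event observed, 0 = censored
        pairs.append((y, delta))
    return pairs

rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 1000
# Hypothetical setup: exponential event times with mean 4 years, and
# censoring times drawn independently of T, uniform on [0, 5] years.
T = [rng.expovariate(1 / 4) for _ in range(n)]
C = [rng.uniform(0, 5) for _ in range(n)]
pairs = observe(T, C)

mean_true = sum(T) / n
mean_observed = sum(y for y, _ in pairs) / n
print(f"mean of true T:     {mean_true:.2f}")
print(f"mean of observed Y: {mean_observed:.2f}")
print(f"fraction censored:  {1 - sum(d for _, d in pairs) / n:.2f}")
```

Because y_i can never exceed the true event time, averaging the observed times as if they were all event times can only understate survival, even when censoring is independent. This is one reason the status indicator has to be carried through the analysis.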

The Survival Curve

  • The survival function (or curve) is defined as
    $$S(t) = \Pr(T > t).$$
  • This decreasing function quantifies the probability of surviving past time t.
  • For example, suppose that a company is interested in modeling customer churn. Let T represent the time that a customer cancels a subscription to the company's service.
  • Then S(t) represents the probability that a customer cancels later than time t. The larger the value of S(t), the less likely it is that the customer will cancel before time t.

Kaplan-Meier Survival Curve
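
A minimal sketch of the standard Kaplan-Meier (product-limit) estimate of S(t), computed from observed times and status indicators. At each distinct event time d_k, the running estimate is multiplied by (r_k − q_k) / r_k, where r_k is the number of individuals still at risk and q_k is the number of events at d_k. The function name `kaplan_meier` and the toy churn data are hypothetical, and censored observations tied with an event time are counted as still at risk, which is the usual convention.

```python
def kaplan_meier(times, status):
    """Kaplan-Meier estimate of the survival function.

    times:  observed times y_i
    status: delta_i (1 = event observed, 0 = censored)
    Returns a list of (d_k, S_hat(d_k)) for each distinct event time d_k.
    """
    event_times = sorted({y for y, d in zip(times, status) if d == 1})
    surv = 1.0
    curve = []
    for d_k in event_times:
        # r_k: individuals whose observed time is at least d_k
        at_risk = sum(1 for y in times if y >= d_k)
        # q_k: events (not censorings) occurring exactly at d_k
        events = sum(1 for y, d in zip(times, status) if y == d_k and d == 1)
        surv *= (at_risk - events) / at_risk
        curve.append((d_k, surv))
    return curve

# Hypothetical toy data: five subscribers, times in months.
y = [2, 3, 3, 5, 8]
delta = [1, 0, 1, 1, 0]
for d_k, s in kaplan_meier(y, delta):
    print(d_k, round(s, 3))
```

The estimate is a step function that drops only at observed event times. Censored individuals contribute by remaining in the risk set until their censoring time, so their partial information is not discarded.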

The Log-Rank Test

The Proportional Hazards Model

The Hazard Function

The hazard function or hazard rate, also known as the force of mortality, is formally defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t \mid T > t)}{\Delta t},$$

where T is the (true) survival time. It is the death rate in the instant after time t, given survival up to that time. The hazard function is the basis for the Proportional Hazards Model, discussed next.
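
To make the "death rate in the instant after time t" concrete, the sketch below approximates the hazard numerically for one assumed model. Since Pr(t < T ≤ t + Δt | T > t) = (S(t) − S(t + Δt)) / S(t), the hazard can be approximated from the survival function with a small Δt. The exponential model, the rate value, and the function names are assumptions for illustration; for an exponential event time the hazard is constant and equal to the rate.

```python
import math

def survival(t, rate):
    """S(t) = Pr(T > t) for an exponential event time with the given rate."""
    return math.exp(-rate * t)

def hazard_numeric(t, rate, dt=1e-6):
    """Approximate h(t) = Pr(t < T <= t + dt | T > t) / dt for small dt."""
    s_t = survival(t, rate)
    s_next = survival(t + dt, rate)
    return (s_t - s_next) / (s_t * dt)

rate = 0.25  # hypothetical rate: one event per 4 time units on average
for t in (0.5, 2.0, 10.0):
    # Each value should be close to `rate`, regardless of t.
    print(t, hazard_numeric(t, rate))
```

The survival probability falls as t grows, but the conditional rate of failure among those still alive stays the same under this model, which is the distinction between S(t) and h(t).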

The proportional hazards assumption states that

$$h(t \mid x_i) = h_0(t) \exp\left( \sum_{j=1}^{p} x_{ij} \beta_j \right),$$

where $h_0(t) \ge 0$ is an unspecified function, known as the baseline hazard. It is the hazard function for an individual with features $x_{i1} = \cdots = x_{ip} = 0$. The name proportional hazards arises from the fact that the hazard function for an individual with feature vector $x_i$ is some unknown function $h_0(t)$ times the factor $\exp\left( \sum_{j=1}^{p} x_{ij} \beta_j \right)$. The quantity $\exp\left( \sum_{j=1}^{p} x_{ij} \beta_j \right)$ is called the relative risk for the feature vector $x_i = (x_{i1}, \ldots, x_{ip})^T$, relative to that for the feature vector $x_i = (0, \ldots, 0)^T$.

Partial Likelihood

AUC for Survival Analysis: the C-index

References

  • An Introduction to Statistical Learning
  • An Introduction to Statistical Learning with Python (Online Course)