Logistic Regression

Regression analysis is used to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression, rather than linear regression, should be used. Logistic regression (sometimes called the logistic model or logit model) analyzes the relationship between multiple independent variables and a dependent variable and estimates the probability of occurrence of an event by fitting data to a logistic curve.


Logistic regression allows the researcher to test models to predict categorical outcomes with two or more categories, such as male/female, young/old, presence/absence of a condition, or success/failure.

There can be only a single dependent variable in logistic regression. The dependent variable is usually dichotomous: it takes the value 1 (success) with probability p, or the value 0 (failure) with probability 1 - p. This type of variable is called a binary variable. The independent variables can be categorical, continuous, or a mix of both in one model. Since logistic regression models the probability of success relative to the probability of failure, the impact of predictor variables is usually explained in terms of odds ratios. In this way, logistic regression estimates the odds of a certain event occurring.

Logistic regression uses maximum likelihood estimation to estimate the parameters most likely to have generated the observed data. It is based on the assumption that the underlying relationship between the variables follows an S-shaped probability curve. It transforms the probability of an event occurring into its odds. Odds reflect the ratio of two probabilities: the probability that an event occurs to the probability that it does not.
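In symbols, writing p for the probability of the event, the odds and the logit (the log of the odds), which logistic regression models as a linear function of the predictors, are:

$$\text{odds} = \frac{p}{1-p}, \qquad \operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

For example, if p = .8, the odds are .8/.2 = 4; the event is four times as likely to occur as not.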

Logistic regression is part of a category of statistical models called generalized linear models. It predicts the probability of occurrence of an event by fitting data to a logistic curve, and it can make use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex, and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product.
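A minimal sketch of this heart-attack example in Python, assuming statsmodels and simulated (not real) data with hypothetical variables age, sex, and bmi:

```python
# A hedged sketch of the heart-attack example: simulated data, not a
# real study. Assumes numpy and statsmodels are installed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(30, 75, n)
sex = rng.integers(0, 2, n)            # 0 = female, 1 = male
bmi = rng.normal(27, 4, n)

# Hypothetical "true" coefficients, used only to simulate outcomes.
p = 1 / (1 + np.exp(-(-10 + 0.10 * age + 0.8 * sex + 0.15 * bmi)))
heart_attack = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([age, sex, bmi]))
result = sm.Logit(heart_attack, X).fit()

print(result.summary())        # coefficients, Wald z values, p-values
print(np.exp(result.params))   # coefficients expressed as odds ratios
```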

There are binary and multinomial types of logistic regression. For binary logistic regression, the response variable can have only two categories. For multinomial logistic regression, there may be two or more categories, usually more. It is important to specify the desired reference category, which should be meaningful. Binary logistic regression by default predicts the higher of the two categories of the dependent variable (usually 1), using the lower (usually 0) as the reference category. Multinomial logistic regression by default predicts all categories of the dependent variable except the highest, which is used as the reference category.
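Note that the default reference category differs by package. As a small illustration, assuming statsmodels (whose MNLogit takes the lowest-coded category as the reference, unlike the SPSS default described above) and placeholder data:

```python
# Multinomial sketch with a placeholder 3-category outcome coded 0/1/2.
# statsmodels' MNLogit uses the lowest-coded category (0) as reference.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = rng.integers(0, 3, size=300)   # placeholder outcome, three levels

result = sm.MNLogit(y, sm.add_constant(x)).fit()
print(result.summary())  # one coefficient set per non-reference category
```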

The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most concise model possible. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable. Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or deleted, a procedure called stepwise regression.
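One way to automate this kind of variable selection in Python, assuming scikit-learn (this uses cross-validated fit rather than the likelihood-ratio or Wald criteria of classic stepwise procedures):

```python
# Forward sequential selection with a logistic model: predictors are
# added one at a time while they improve cross-validated fit.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the kept predictors
```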

Research Examples

As an example of logistic regression, consider a study whose goal is to model the response to a drug as a function of the dose of the drug administered. The target (dependent) variable takes the value 1 if the patient is successfully treated by the drug and 0 if the treatment is not successful.
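A minimal sketch of this dose-response model, with made-up doses and outcomes rather than real trial data:

```python
# Fit P(successful treatment) as a logistic function of dose.
import numpy as np
import statsmodels.api as sm

dose     = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6], dtype=float)
response = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1])

result = sm.Logit(response, sm.add_constant(dose)).fit()

# Estimated probability of success at a hypothetical dose of 3.5 units.
print(result.predict([1.0, 3.5]))
```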

Logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks. Logistic regression also provides knowledge of the relationships and strengths among the variables (e.g., smoking 10 packs a day places an individual at a higher risk of developing cancer than working in an asbestos mine).
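Continuing the fitted `result` from the heart-attack sketch above, the odds ratios and their confidence intervals, which is how such risks are usually reported, can be read off directly:

```python
# Assumes `result` is the fitted Logit model from the earlier sketch.
import numpy as np

print(np.exp(result.params))      # odds ratio per predictor
print(np.exp(result.conf_int()))  # 95% CI on the odds-ratio scale
```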

Another example is predicting the 10-year risk of death from heart disease from the predictor variables of age, sex, and cholesterol level. In such a model, increasing age is associated with an increased risk of death from heart disease, female sex is associated with a decreased risk of death from heart disease, and increasing cholesterol is associated with an increased risk of death.

Assumptions of Logistic Regression

  • True conditional probabilities are a logistic function of the independent variables.
  • No important variables are omitted.
  • No extreme outliers.
  • Independent variables are measured without error.
  • Observations are independent of one another.
  • No multicollinearity: the independent variables are not linear combinations of each other, and there is no singularity or redundancy among them (see the VIF sketch after this list).
  • The sample is large enough.
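A common check for the multicollinearity assumption, assuming statsmodels, is the variance inflation factor (VIF); values well above about 10 are often read as problematic redundancy:

```python
# VIF sketch on simulated predictors, one of which is deliberately
# made nearly collinear with another.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)  # near-copy of x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):                   # skip the constant
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.2f}")
```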

Statistics

Omnibus tests of model coefficients: a "goodness of fit" test of whether the model with its predictors is a significant improvement over the null model; the significance value should be less than .05.

The Hosmer-Lemeshow statistic evaluates goodness-of-fit. Poor fit is indicated by a significance value less than .05, so to support the model a significance value greater than .05 is needed. The test statistic is a chi-square statistic with a desirable outcome of non-significance, indicating that the model's predictions do not differ significantly from the observed values.
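Since many Python libraries do not ship a Hosmer-Lemeshow test, a hand-rolled sketch of the usual "deciles of risk" version might look like this, assuming arrays `y` of observed 0/1 outcomes and `p` of model-predicted probabilities:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square test over `groups` risk groups."""
    order = np.argsort(p)
    y, p = y[order], p[order]
    hl = 0.0
    for idx in np.array_split(np.arange(len(p)), groups):
        obs = y[idx].sum()      # observed events in this risk group
        exp = p[idx].sum()      # model-expected events in this group
        n = len(idx)
        hl += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return hl, chi2.sf(hl, groups - 2)   # p > .05 supports the model
```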

A Wald test is used to test the statistical significance of each coefficient (b) in the model. It calculates a z statistic by dividing the coefficient by its standard error. This z value is then squared, yielding a Wald statistic with a chi-square distribution.
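In symbols, for a coefficient b with standard error SE(b):

$$z = \frac{b}{SE(b)}, \qquad W = z^2 \sim \chi^2_1$$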

The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model (L1) over the maximized value of the likelihood function for the simpler model (L0). Taking the natural log of this ratio and multiplying by -2 yields a chi-squared statistic.
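In symbols:

$$LR = -2 \ln\!\left(\frac{L_0}{L_1}\right) = -2\left(\ln L_0 - \ln L_1\right) \sim \chi^2$$

with degrees of freedom equal to the number of parameters added in the full model. (In statsmodels, the fitted Logit result exposes this model-versus-null test as result.llr and result.llr_pvalue.)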

Effect size - Cox & Snell R square and Nagelkerke R square: pseudo R square statistics.
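For reference, with L0 and L1 the maximized null and full-model likelihoods and n the sample size, these are:

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{\,2/n}}$$

Nagelkerke's version rescales Cox & Snell's so that it can reach a maximum of 1.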

Research Information to Report

  • Assumptions tested
  • Independent variables and dependent variable
  • Sample size (N)
  • Chi-square value, probability, and statistical significance
  • Odds ratios and confidence intervals
  • Cox & Snell R square and Nagelkerke R square values

Presenting the results from logistic regression (Pallant, 2007, p. 178)

Direct logistic regression was performed to assess the impact of a number of factors on the likelihood that respondents would report that they had a problem with their sleep. The model contained five independent variables (sex, age, problems getting to sleep, problems staying asleep, and hours of sleep per weeknight). The full model containing all predictors was statistically significant, χ² (5, N = 241) = 76.02, p < .001.

References

Pallant, J. (2007). SPSS survival manual. New York: McGraw-Hill Education.

Polit, D. F., & Beck, C. T. (2008). Nursing research: Generating and assessing evidence for nursing practice (8th ed.). Philadelphia: Wolters Kluwer Health.
