Regression and Generalized Linear Models


Prerequisites:

Linear Algebra, Probability and Statistics (random variables, distributions, inference).


Course description:

This course is designed for advanced mathematics and statistics students who intend to pursue research or professional work in data analysis and statistical modeling. The course bridges the gap between the theoretical foundations of the General Linear Model (Gaussian response) and the broader framework of Generalized Linear Models (GLMs) for non-Gaussian data. The teaching objective is to enable students to master the rigorous mathematical underpinnings of regression analysis—utilizing matrix algebra and projection geometry—while simultaneously developing practical skills in analyzing categorical and count data. Students will apply GLMs to a variety of real-world problems and become familiar with modern statistical software (e.g., R or SAS). Topics progress from classical linear regression to contingency tables, logistic regression, loglinear models, and advanced methods such as random effects and generalized additive models.


Learning objectives:

By the end of this course, students will be able to:

·  Construct and analyze linear regression models using matrix notation, deriving properties of estimators (unbiasedness, variance) and proving the Gauss-Markov theorem.

·  Evaluate model adequacy through rigorous diagnostic techniques, including residual analysis, influence measures (Cook's distance, leverage), and multicollinearity assessment.

·  Generalize linear model concepts to non-Gaussian data by defining the components of Generalized Linear Models (random component, systematic component, link function) and the exponential family of distributions.

·  Apply specific GLM techniques—including logistic regression for binary/multinomial data and loglinear models for contingency tables—to interpret relationships in categorical data.

·  Synthesize advanced modeling concepts, such as random effects, shrinkage methods, and tree-based classification, to solve complex problems involving clustered or high-dimensional data.

·  Implement statistical models using standard software packages, interpreting output to draw valid statistical inferences.
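As a taste of the matrix formulation named in the first objective, here is a minimal NumPy sketch of least squares via the normal equations; the simulated design and coefficient values are purely illustrative:

```python
import numpy as np

# Minimal sketch: OLS in matrix notation, beta_hat = (X'X)^{-1} X'y,
# the BLUE under the Gauss-Markov assumptions. Simulated data for illustration.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -0.5, 0.3])                      # "true" coefficients (illustrative)
y = X @ beta + rng.normal(scale=0.1, size=n)                # Gaussian errors

# Solve the normal equations (in practice a QR/lstsq solver is more stable)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' projects y onto the column space of X;
# its diagonal entries are the leverages used in influence diagnostics.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverages = np.diag(H)
print(beta_hat.round(2))                          # close to the true beta
print(np.isclose(leverages.sum(), X.shape[1]))    # trace(H) = rank(X)
```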


Detailed topics covered:

Part I: Advanced Linear Regression and Modern Extensions

· Classical Theory (Review): Geometry of least squares (projections), Gauss-Markov theorem, hypothesis testing in general linear models ($t$ and $F$-tests, contrasts), distribution of quadratic forms, ANOVA.

· Modern Model Selection and Regularization: Bias-variance trade-off, Ridge Regression (theory and Bayesian interpretation), The Lasso (L1 penalty) and sparsity, Elastic Net, SCAD penalty, cross-validation strategies for tuning parameter selection.

· High-Dimensional Inference: Regression in the $p > n$ setting, variable selection consistency, Oracle properties, false discovery rates (FDR) in regression.

· Non-Parametric and Semi-Parametric Regression: Kernel density estimation, local polynomial regression (Loess), regression splines (B-splines, natural cubic splines), smoothing splines and reproducing kernel Hilbert spaces (RKHS).

· Robust Regression: M-estimators, breakdown points, influence functions, robust regression for heavy-tailed error distributions.
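The regularization topics above can be illustrated with a minimal NumPy sketch of ridge regression, $\hat\beta_{\text{ridge}} = (X'X + \lambda I)^{-1}X'y$; the simulated data and the penalty value $\lambda = 10$ are chosen only for illustration (in practice $\lambda$ is tuned by cross-validation):

```python
import numpy as np

# Sketch: ridge regression shrinks coefficients toward zero relative to OLS.
rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])   # sparse "true" signal (illustrative)
y = X @ beta + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols   = ridge(X, y, 0.0)     # lambda = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 10.0)    # heavier penalty, more shrinkage
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))  # shrinkage in L2 norm
```

The Lasso replaces the $L_2$ penalty with an $L_1$ penalty, which has no closed form but produces exact zeros, illustrating the sparsity contrast discussed in the bullet above.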

Part II: Categorical Data Analysis and GLMs

· Foundations of GLMs: Exponential dispersion family, link functions, likelihood equations, Newton-Raphson and Fisher scoring algorithms, deviance and goodness-of-fit.

· Inference for Contingency Tables: Fisher’s Exact Test (hypergeometric distribution), exact conditional inference for logistic regression, small-sample inference, and Simpson’s paradox.

· Logistic Regression Extensions: Probit and complementary log-log models, conditional logistic regression for matched pairs, separation of points (infinite estimates) and penalized likelihood solutions.

· Multinomial and Ordinal Response Models: Nominal responses (baseline-category logit models), ordinal responses (cumulative logit models, proportional odds assumption).

· Bayesian Analysis for Categorical Data: Prior specification for GLM parameters, Bayesian inference for proportions, posterior computation (MCMC introduction), Bayesian model averaging.

· Loglinear and Graphical Models: Inference for loglinear models, connection to graphical models (conditional independence graphs), collapsibility.
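The Fisher scoring algorithm named in the GLM foundations bullet reduces, for the canonical logit link, to iteratively reweighted least squares. A minimal sketch on simulated binary data (all constants illustrative; real analyses would use a fitted software routine):

```python
import numpy as np

# Fisher scoring for logistic regression: at each step solve a weighted
# least-squares problem with working weights W = diag(mu_i (1 - mu_i)).
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])                 # illustrative coefficients
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

beta = np.zeros(2)
for _ in range(25):                               # scoring iterations
    eta = X @ beta                                # linear predictor (systematic component)
    mu = 1.0 / (1.0 + np.exp(-eta))               # mean via the inverse logit link
    W = mu * (1.0 - mu)                           # GLM working weights
    # Scoring update: beta <- beta + (X' W X)^{-1} X'(y - mu)
    beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
print(beta.round(2))                              # approximate MLE, near beta_true
```

At convergence the score equations $X'(y - \mu) = 0$ hold, which is the likelihood-equation characterization of the MLE for a canonical link.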

Part III: Complex Data Structures

· Correlated and Clustered Data: Marginal vs. subject-specific models, Generalized Estimating Equations (GEE) for non-normal longitudinal data, sandwich variance estimators.

· Generalized Linear Mixed Models (GLMM): Random effects structure, integral approximation methods (Laplace approximation, quadrature), prediction of random effects (BLUPs).

· Advanced Classification and Smoothing: Generalized Additive Models (GAMs) for categorical data, Classification Trees (CART), Random Forests for classification, connections to supervised machine learning.

· Specialized Models: Zero-inflated Poisson and Negative Binomial models (for excess zeros), Bradley-Terry models for paired preferences, Rasch models.
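As one concrete instance of the specialized models listed above, a zero-inflated Poisson mixes a structural-zero component (probability $\pi$) with a Poisson($\lambda$) component, so $P(Y=0) = \pi + (1-\pi)e^{-\lambda}$ and $E[Y] = (1-\pi)\lambda$. A minimal simulation sketch with illustrative parameter values:

```python
import numpy as np

# Zero-inflated Poisson: with probability pi the response is a structural zero,
# otherwise Poisson(lam). Parameter values below are illustrative only.
rng = np.random.default_rng(3)
pi, lam, n = 0.3, 2.0, 100_000

structural_zero = rng.random(n) < pi
y = np.where(structural_zero, 0, rng.poisson(lam, size=n))

# Closed-form checks: P(Y = 0) = pi + (1 - pi) exp(-lam); E[Y] = (1 - pi) lam
p0_theory = pi + (1 - pi) * np.exp(-lam)
print(np.isclose((y == 0).mean(), p0_theory, atol=0.01))
print(np.isclose(y.mean(), (1 - pi) * lam, atol=0.05))
```

The excess of zeros relative to a plain Poisson with the same mean is exactly what motivates these models for count data such as insurance claims or species counts.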