Huijing Wang

Statistical inference under latent class models, with application to risk assessment in cancer survivorship studies 

Motivated by a cancer survivorship program, this PhD thesis aims to develop methodology for risk assessment, classification, and prediction. We use latent class models to formulate the primary data, which are collected from a cohort with two underlying categories, the at-risk and not-at-risk classes, and we conduct both cross-sectional and longitudinal analyses. We begin with a maximum pseudo-likelihood estimator (pseudo-MLE) as an alternative to the maximum likelihood estimator (MLE) for event counts under a mixture Poisson distribution. The pseudo-MLE utilizes supplementary information on the not-at-risk class from a different population. It reduces the computational burden and potentially increases the estimation efficiency. To obtain statistical methods that are more robust than likelihood-based methods to distribution misspecification, we adapt the well-established generalized estimating equations (GEE) approach under the mean-variance model corresponding to the mixture Poisson distribution. The inherent computing and efficiency issues in the application of GEEs motivate two sets of extended GEE estimating functions, using the primary data supplemented by information from the second population alone or together with the available information on individuals in the cohort who are deemed to belong to the at-risk class. We derive asymptotic properties of the proposed pseudo-MLE and the estimators from the extended GEEs, and we estimate their variances by extended Huber sandwich estimators. We use simulation to examine the finite-sample properties of the estimators in terms of both efficiency and robustness. The simulation studies verify the consistency of the proposed parameter estimators and their variance estimators. They also show that the pseudo-MLE has efficiency comparable to that of the MLE, and the extended GEE estimators are robust to distribution misspecification while maintaining satisfactory efficiency.
Further, we present an extension of the favourable extended GEE estimator to longitudinal settings by adjusting for within-subject correlation.
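
As a rough illustration of the pseudo-likelihood idea described above, the following minimal sketch fits a two-class Poisson mixture without covariates. It is not the estimator developed in the thesis: the function names, the covariate-free model, and the use of an EM iteration are all assumptions made for illustration. The not-at-risk rate is estimated from the supplementary sample alone and then held fixed, so only the at-risk proportion and the at-risk rate are estimated from the primary counts.

```python
import math


def poisson_pmf(y, rate):
    """Poisson probability mass at count y; rate is assumed to be positive."""
    return math.exp(y * math.log(rate) - rate - math.lgamma(y + 1))


def pseudo_mle(primary, supplementary, n_iter=500):
    """Hypothetical two-step fit of a two-class Poisson mixture.

    Step 1: estimate the not-at-risk rate lam0 from the supplementary
            sample alone (its sample mean) and treat it as known.
    Step 2: maximize the resulting pseudo-likelihood of the primary
            counts over the at-risk proportion p and at-risk rate lam1,
            here by EM iterations with lam0 held fixed.
    """
    lam0 = sum(supplementary) / len(supplementary)

    # Crude starting values (an assumption of this sketch).
    p = 0.5
    lam1 = float(max(primary))

    for _ in range(n_iter):
        # E-step: posterior probability that each subject is at risk.
        weights = []
        for y in primary:
            at_risk = p * poisson_pmf(y, lam1)
            not_at_risk = (1.0 - p) * poisson_pmf(y, lam0)
            weights.append(at_risk / (at_risk + not_at_risk))

        # M-step: update p and lam1 only; lam0 is never re-estimated.
        total = sum(weights)
        p = total / len(primary)
        lam1 = sum(w * y for w, y in zip(weights, primary)) / total

    return lam0, p, lam1
```

Holding the not-at-risk rate fixed leaves two free parameters instead of three in the second step, which is one simple way to see how supplementary information can lighten the computation.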

The proposed methodology is illustrated with physician claims from the cancer program that motivated this research. We fit different latent class models for the counts and costs of the physician visits and apply the proposed estimators. We use the parameter estimates to identify the risk of subsequent and ongoing problems arising from the subjects’ initial cancer diagnoses. We perform risk classification and prediction using the fitted latent class models.

Keywords: Cross-Sectional Analysis; Event Count; Extended GEE Estimation; Likelihood and Pseudo-Likelihood Estimation; Longitudinal Analysis; Medical Cost; Medical Insurance Information; Physician Claims; Risk Factor; Robust Variance Estimation