Variable-Weighted Ultrametric Optimization for Mixed-Type Data: Continuous, Ordinal, Nominal, Binary Symmetric and Binary Asymmetric

Eric Sayre successfully defended his Ph.D. thesis entitled "Variable-Weighted Ultrametric Optimization for Mixed-Type Data: Continuous, Ordinal, Nominal, Binary Symmetric and Binary Asymmetric" on 17 July 2009.

Scientific research begins with hypothesis generation, for which cluster analysis (CA) can be used. Traditionally, CA involves equally weighted continuous variables and a subjective choice of linkage and stopping rules. Variable weighting for cluster analysis (VWCA), beginning with De Soete (1985/6), produces weights that may be useful for hypothesis generation. De Soete’s VWCA optimized ultrametricity, a property of better-separated clusters, without requiring CA.
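
To make the ultrametricity idea concrete: a distance d is ultrametric when d(i,j) <= max(d(i,k), d(k,j)) for every triple of records. The sketch below is only an illustration of that inequality, not De Soete's or the thesis's objective function; the function name and the two toy matrices are hypothetical.

```python
import itertools
import numpy as np

def ultrametric_violation(d):
    """Largest amount by which any triple breaks d(i,j) <= max(d(i,k), d(k,j))."""
    n = d.shape[0]
    worst = 0.0
    for i, j, k in itertools.permutations(range(n), 3):
        excess = d[i, j] - max(d[i, k], d[k, j])
        worst = max(worst, excess)
    return worst

# Hypothetical toy data: two tight pairs {0, 1} and {2, 3}, far apart.
separated = np.array([
    [0.0, 1.0, 5.0, 5.0],
    [1.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
])

# Hypothetical toy data: four points evenly spaced on a line (no clusters).
chain = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))

print(ultrametric_violation(separated))
print(ultrametric_violation(chain))
```

The well-separated matrix satisfies the inequality for every triple, while the evenly spaced points do not, which is the sense in which ultrametricity tracks cluster separation.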

We developed variable-weighted ultrametric optimization for mixed-type data (VWUO-MD), starting with a variable-weighted, multivariate distance for data with any number of continuous, ordinal, nominal, binary symmetric and binary asymmetric (e.g., rare disease) variables. In Monte Carlo simulations under several distributions, we found that the weights are consistent with a priori relationships between variables. For some relationships (e.g., single-group linear), the method performs poorly. Compared to De Soete's method, VWUO-MD better penalizes zero weights and, through a strategic random restart procedure, better ensures a unique solution. The bootstrap covariance matrix is slightly conservative. For mixtures of at least four continuous/nominal variables, a U-statistic-based covariance matrix performs well. Point estimates and covariances are invariant to column/category/record order and to affine transformations.
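
The abstract does not give the form of the variable-weighted distance. As a rough sketch of how one distance can combine the five variable types, the example below follows the general approach of Gower's coefficient, which is an assumption swapped in for illustration and not necessarily the thesis's definition. The function name, the type labels and the example records are all hypothetical.

```python
def weighted_mixed_distance(x, y, kinds, weights, ranges):
    """Gower-style weighted distance between two records (illustrative only).

    kinds[v] is one of "continuous", "ordinal", "nominal",
    "binary_symmetric", "binary_asymmetric".
    ranges[v] is the observed range of variable v; it is used only for
    continuous and ordinal variables.
    """
    num = 0.0
    den = 0.0
    for xv, yv, kind, w, r in zip(x, y, kinds, weights, ranges):
        if kind in ("continuous", "ordinal"):
            # Range-scaled absolute difference; ordinal values are
            # assumed to be coded as integer ranks.
            diff = abs(xv - yv) / r
        elif kind == "binary_asymmetric" and xv == 0 and yv == 0:
            # Joint absence (e.g., neither record has the rare disease)
            # is treated as uninformative, so the variable is skipped.
            continue
        else:
            # Nominal and remaining binary cases: mismatch indicator.
            diff = float(xv != yv)
        num += w * diff
        den += w
    return num / den if den > 0 else 0.0

# Hypothetical records with one variable of each type.
kinds = ["continuous", "ordinal", "nominal",
         "binary_symmetric", "binary_asymmetric"]
weights = [0.4, 0.2, 0.2, 0.1, 0.1]
ranges = [10.0, 4.0, None, None, None]
a = [2.0, 1, "red", 0, 0]
b = [7.0, 3, "blue", 0, 0]

print(weighted_mixed_distance(a, b, kinds, weights, ranges))
```

In a scheme like this, a variable with weight zero drops out of the distance entirely, which is why the treatment of zero weights matters when the weights themselves are being optimized.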

We analyzed a subset of the Joint Canada/United States Survey of Health: working, mature students 50+ years old who received health services in the past year (n=167), split into training and testing segments. Prescreening within types and backwards elimination with VWUO-MD reduced the variable space. The final 14 variable weights were plotted as a scree plot. On the testing segment, a model was fit from the upper scree plot variables. Similar models were fit from the lower scree plot variables and from the variables rejected by prescreening and by backwards elimination. Models were ordered on overall statistical significance, and the upper model had the best fit, indicating that VWUO-MD had successfully mined these data for hypotheses. We learned that reduction in activities due to a long-term health condition was associated with consultations with a mental health professional in the past year (odds ratio=12.25, 95% CI=1.67, 90.02).
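
As a reading aid for the reported interval, the short check below assumes (the abstract does not say so) that the 95% CI is a Wald interval computed on the log-odds scale. Under that assumption the point estimate should sit near the geometric mean of the limits, and the implied standard error of the log odds ratio can be backed out. Variable names are illustrative.

```python
import math

# Values reported in the abstract.
odds_ratio = 12.25
ci_low, ci_high = 1.67, 90.02

# Assumption: Wald interval on the log scale, log(OR) +/- z * SE.
z = 1.959963984540054  # 97.5th percentile of the standard normal

# Implied standard error of log(OR) from the interval width.
se_log_or = (math.log(ci_high) - math.log(ci_low)) / (2 * z)

# Midpoint of the interval on the log scale, back-transformed.
center = math.exp((math.log(ci_low) + math.log(ci_high)) / 2)

print(round(se_log_or, 2), round(center, 2))
```

The back-transformed midpoint agrees with the reported odds ratio to within rounding, so the interval is consistent with the log-scale assumption; its width reflects the small testing segment.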

Although additional research is needed, VWUO-MD in its present form produces variable weights that may be informative for hypothesis generation on data with varied mixtures of data types.

This type of interdisciplinary work is a hallmark of our program in Applied Statistics at Simon Fraser University. For more information, please contact Eric Sayre (ecsayre@stat.sfu.ca) or his supervisors Larry Weldon/Richard Lockhart (weldon@stat.sfu.ca or lockhart@stat.sfu.ca), Department of Statistics and Actuarial Science.

2009-08-03