Longitudinal consensus clustering with flexmix — longitudinal_consensus

This function performs longitudinal clustering with flexmix. To obtain robust results, the data is repeatedly subsampled and the clustering is performed on each subsample. The results are combined into a consensus matrix, and a final hierarchical clustering step is performed on this matrix. This follows the approach of the ConsensusClusterPlus package.

Usage

longitudinal_consensus_cluster(
  data = NULL,
  id_column = NULL,
  max_k = 3,
  reps = 10,
  p_item = 0.8,
  model_list = NULL,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id"),
  title = "untitled_consensus_cluster",
  final_linkage = c("average", "ward.D", "ward.D2", "single", "complete", "mcquitty",
    "median", "centroid"),
  seed = 3794,
  verbose = FALSE
)
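The vector default shown for final_linkage is the usual R idiom for arguments handled with match.arg(). The sketch below illustrates that idiom in base R; pick_linkage is a hypothetical helper, and that the package resolves the argument this way internally is an assumption.

```r
# Hypothetical helper that mimics the vector default from the signature
pick_linkage <- function(final_linkage = c("average", "ward.D", "ward.D2", "single",
                                           "complete", "mcquitty", "median", "centroid")) {
  # match.arg() returns the first choice when the argument is left at its default
  # and raises an error for values that are not among the listed choices
  match.arg(final_linkage)
}

default_linkage <- pick_linkage()           # argument left at its default
chosen_linkage <- pick_linkage("ward.D2")   # explicit choice
# An unknown linkage name is rejected with an error
invalid_rejected <- inherits(try(pick_linkage("nearest"), silent = TRUE), "try-error")
```

With this idiom, leaving the argument untouched selects the first listed value, which matches the documented default of average.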

Arguments

data: a data.frame with one or several observations per subject. It must contain one column that specifies which subject each entry (row) belongs to. This ID column is specified in id_column. Otherwise, there are no restrictions on the column names, as the model is specified in flexmix_formula.
id_column: name (character vector) of the ID column in data that identifies all observations of one subject
max_k: maximum number of clusters, default is 3
reps: number of repetitions, default is 10
p_item: fraction of samples drawn in each subsample, default is 0.8
model_list: either one flexmix driver or a list of flexmix drivers of class FLXMR
flexmix_formula: a formula object that describes the flexmix model relative to the formula in the flexmix drivers (the dot in the flexmix drivers is replaced, see the example). This means that you usually specify only the right-hand side of the formula here. However, this is neither enforced nor checked, which gives you more flexibility with the flexmix interface
title: name of the clustering; used if writeTable = TRUE
final_linkage: linkage used for the last hierarchical clustering step on the consensus matrix; has to be average, ward.D, ward.D2, single, complete, mcquitty, median or centroid. The default is average
seed: seed for reproducibility
verbose: boolean if status messages should be displayed. Default is FALSE
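The dot replacement described for flexmix_formula can be illustrated with base R's update(), which substitutes a dot in one formula with the corresponding part of another. This is only a sketch of the idea, not necessarily how flexmix combines the formulas internally. The grouping part (| patient_id) is left out here because flexmix treats the term after the bar as the grouping factor rather than as a predictor.

```r
# Driver formula with a dot placeholder, as passed to a flexmix driver
driver_formula <- var_1 ~ .

# Right-hand side only, as passed via flexmix_formula (grouping part omitted)
rhs_formula <- ~ s(visit, k = 4)

# update() replaces the dot in driver_formula with the right-hand side of rhs_formula;
# s() is not evaluated here, so no modelling package needs to be loaded
combined <- update(rhs_formula, driver_formula)
print(combined)
```

The combined formula has var_1 as the response and the smooth term in visit as the predictor, which is why each driver in model_list only needs to name its outcome variable.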

Value

An object (list) of class lcc with length max_k. The first entry general_information contains the entries:

`consensus_matrices`	a list of all consensus matrices (for all specified clusters)

`cluster_assignments`	a `data.frame` with an ID column named after `id_column` and a column for every specified number of clusters, e.g. `assignment_num_clus_2`

`call`	the call, i.e. all arguments with which `longitudinal_consensus_cluster` was called

The other entries correspond to the number of specified clusters (e.g. the second entry corresponds to 2 specified clusters) and each contains a list with the following entries:

`consensus_matrix`	the consensus matrix

`consensus_tree`	the result of the hierarchical clustering on the consensus matrix

`consensus_class`	the resulting class for every observation

`found_flexmix_clusters`	a vector of the number of clusters actually found by `flexmix` (which can deviate from the specified number)

Details

The data types longitudinal_consensus_cluster can handle depend on how the flexmix models are set up. In principle, all data types are supported for which there is a flexmix driver with the desired outcome variable.

If you follow the dimension reduction approach outlined in vignette("Example clustering analysis", package = "longmixr"), the input data types depend on what FAMD from the FactoMineR package can handle. FAMD accepts numeric variables and treats all other variables as factors, which it can also handle.
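The resampling scheme that the function follows (repeated subsampling, a consensus matrix, then hierarchical clustering) can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: k-means on one toy value per subject stands in for the flexmix model fit, subjects are assumed to be the unit that is subsampled, and the names together and sampled are made up for this sketch.

```r
set.seed(1)
# Toy data: 20 subjects in two well-separated groups, one summary value per subject
n_subjects <- 20
values <- c(rnorm(10, mean = -3), rnorm(10, mean = 3))

reps <- 10      # number of subsampling repetitions
p_item <- 0.8   # fraction of subjects drawn in each repetition
k <- 2          # number of clusters requested

together <- matrix(0, n_subjects, n_subjects)  # times i and j shared a cluster
sampled <- matrix(0, n_subjects, n_subjects)   # times i and j were drawn together

for (r in seq_len(reps)) {
  idx <- sort(sample(n_subjects, size = floor(p_item * n_subjects)))
  # Stand-in clustering step (k-means); the package fits flexmix models here
  labels <- kmeans(values[idx], centers = k, nstart = 5)$cluster
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(labels, labels, "==")
}

# Consensus: share of joint draws in which a pair ended up in the same cluster
consensus <- ifelse(sampled > 0, together / sampled, 0)
diag(consensus) <- 1

# Final hierarchical clustering on the consensus matrix with average linkage
tree <- hclust(as.dist(1 - consensus), method = "average")
consensus_class <- cutree(tree, k = k)
```

Pairs of subjects that are clustered together in most subsamples get a consensus value close to 1, so using 1 - consensus as the distance lets the final hierarchical step group them.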

Examples

set.seed(5)
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
model_list <- list(
  flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
  flexmix::FLXMRmgcv(as.formula("var_2 ~ ."))
)
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id")
)
#> 2 : *
#> 2 : *
#> 2 : *
# not run
# plot(clustering)
# end not run