Longitudinal consensus clustering with flexmix
Source: R/longitudinal_consensus_cluster.R
This function performs longitudinal clustering with flexmix. To obtain robust
results, the data is repeatedly subsampled and the clustering is performed on
each subsample. The results are combined in a consensus matrix, and a final
hierarchical clustering step is performed on this matrix. This follows the
approach of the ConsensusClusterPlus package.
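The subsampling idea described above can be sketched in base R. This is only an illustration, not the code of this function: kmeans stands in for the flexmix mixture model, and the toy data and the names x, together and sampled are assumptions made for this sketch.

```r
# Toy data: 20 subjects forming two well-separated groups in two dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = -3), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
n <- nrow(x)
reps <- 10     # number of subsampling repetitions
p_item <- 0.8  # fraction of subjects drawn in each repetition
k <- 2         # number of clusters requested

together <- matrix(0, n, n)  # how often a pair ended up in the same cluster
sampled <- matrix(0, n, n)   # how often a pair was drawn in the same subsample

for (r in seq_len(reps)) {
  idx <- sample(n, size = floor(p_item * n))
  # Stand-in clustering step (assumption): kmeans instead of flexmix.
  cl <- kmeans(x[idx, , drop = FALSE], centers = k, nstart = 5)$cluster
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# Consensus value: share of joint draws in which a pair was co-clustered.
consensus_matrix <- ifelse(sampled > 0, together / sampled, 0)
```

Values close to 1 mark pairs of subjects that are assigned to the same cluster in almost every subsample in which both appear, values close to 0 mark pairs that are almost never clustered together.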
Usage
longitudinal_consensus_cluster(
data = NULL,
id_column = NULL,
max_k = 3,
reps = 10,
p_item = 0.8,
model_list = NULL,
flexmix_formula = as.formula("~s(visit, k = 4) | patient_id"),
title = "untitled_consensus_cluster",
final_linkage = c("average", "ward.D", "ward.D2", "single", "complete", "mcquitty",
"median", "centroid"),
seed = 3794,
verbose = FALSE
)
Arguments
- data
a data.frame with one or several observations per subject. It needs to contain one column that specifies which subject each entry (row) belongs to. This ID column is specified in id_column. Otherwise, there are no restrictions on the column names, as the model is specified in flexmix_formula.
- id_column
name (character vector) of the ID column in data that identifies all observations of one subject
- max_k
maximum number of clusters, default is 3
- reps
number of repetitions, default is 10
- p_item
fraction of samples contained in each subsample, default is 0.8
- model_list
either one flexmix driver or a list of flexmix drivers of class FLXMR
- flexmix_formula
a formula object that describes the flexmix model relative to the formula in the flexmix drivers (the dot in the flexmix drivers is replaced, see the example). That means that you usually only specify the right-hand side of the formula here. However, this is neither enforced nor checked, to give you more flexibility over the flexmix interface.
- title
name of the clustering; used if writeTable = TRUE
- final_linkage
linkage used for the last hierarchical clustering step on the consensus matrix; has to be one of average, ward.D, ward.D2, single, complete, mcquitty, median or centroid. The default is average.
- seed
seed for reproducibility
- verbose
boolean indicating whether status messages should be displayed. Default is FALSE.
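How the dot in a driver formula is filled in from flexmix_formula can be illustrated with stats::update(). This is a sketch of the idea only: whether flexmix performs the substitution in exactly this way is an assumption, the names driver_formulas and shared_rhs are made up for the illustration, and the grouping term "| patient_id" is left out because it is not part of the substitution shown here.

```r
# One driver formula per outcome variable; the dot is a placeholder for the
# shared right-hand side.
driver_formulas <- list(var_1 ~ ., var_2 ~ .)

# Shared right-hand side, written without the grouping term "| patient_id".
shared_rhs <- ~ s(visit, k = 4)

# update() replaces the dot in each driver formula with the shared right-hand side.
full_formulas <- lapply(driver_formulas, function(f) update(shared_rhs, f))
print(full_formulas)
```

Each resulting formula has the outcome variable of its driver on the left-hand side and the smooth term over visit on the right-hand side.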
Value
An object (list) of class lcc with length max_k.
The first entry general_information contains the entries:
consensus_matrices | a list of all consensus matrices (for all specified clusters) |
cluster_assignments | a data.frame with an ID column named after id_column and a column for every specified number of clusters, e.g. assignment_num_clus_2 |
call | the call, i.e. all arguments with which longitudinal_consensus_cluster was called |
The other entries correspond to the number of specified clusters (e.g. the second entry corresponds to 2 specified clusters) and each contains a list with the following entries:
consensus_matrix | the consensus matrix |
consensus_tree | the result of the hierarchical clustering on the consensus matrix |
consensus_class | the resulting class for every observation |
found_flexmix_clusters | a vector with the number of clusters actually found by flexmix (which can deviate from the specified number) |
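The relation between consensus_matrix, consensus_tree and consensus_class can be sketched with base R. The 4 x 4 matrix below is hypothetical, and using 1 minus the consensus value as the distance follows the ConsensusClusterPlus convention; that this function computes the tree in exactly the same way is an assumption.

```r
# Hypothetical consensus matrix: subjects a/b and c/d usually co-cluster.
consensus_matrix <- matrix(c(1.0, 0.9, 0.1, 0.0,
                             0.9, 1.0, 0.2, 0.1,
                             0.1, 0.2, 1.0, 0.8,
                             0.0, 0.1, 0.8, 1.0),
                           nrow = 4,
                           dimnames = list(letters[1:4], letters[1:4]))

# High consensus means small distance; "average" is the default final_linkage.
consensus_tree <- hclust(as.dist(1 - consensus_matrix), method = "average")

# Cutting the tree at k clusters gives one class label per subject.
consensus_class <- cutree(consensus_tree, k = 2)
print(consensus_class)
```

Subjects a and b receive one label and subjects c and d the other, because their pairwise consensus values are high within and low across the two pairs.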
Details
The data types that longitudinal_consensus_cluster can handle depend on how
the flexmix models are set up. In principle, all data types are supported for
which there is a flexmix driver with the desired outcome variable.
If you follow the dimension reduction approach outlined in
vignette("Example clustering analysis", package = "longmixr"), the input data
types depend on what FAMD from the FactoMineR package can handle. FAMD accepts
numeric variables and treats all other variables as factor variables, which it
can handle as well.
Examples
set.seed(5)
# Simulated data: 10 patients with 4 visits each and two outcome variables
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
# One flexmix driver per outcome; the dot is replaced by flexmix_formula
model_list <- list(flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
                   flexmix::FLXMRmgcv(as.formula("var_2 ~ .")))
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id"))
#> 2 : *
#> 2 : *
#> 2 : *
# not run
# plot(clustering)
# end not run