--- title: "Fitting Pedigree-Based Variance Component Models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Fitting Pedigree-Based Variance Component Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE) ``` # Introduction A core goal of behavior genetics is to decompose observed phenotypic variance into genetic and environmental components. Extended pedigrees — multi-generational families with known parentage — provide richer information about relatedness than twin studies alone and allow researchers to estimate a wider array of variance components. The `BGmisc` package provides a complete pipeline for pedigree-based variance component modeling: 1. **Compute** relatedness matrices with `ped2add()`, `ped2cn()`, `ped2mit()`, and `ped2ce()` 2. **Check identification** with `identifyComponentModel()` 3. **Build** OpenMx group models with `buildOneFamilyGroup()` 4. **Fit** the combined model with `buildPedigreeMx()` and `mxRun()` (or the wrapper `fitPedigreeModel()`) This vignette builds up in three parts: a single-family model, a two-family multi-group model, and a scaled-up simulation study. An extension vignette `vignette("v61_pedigree_model_fitting", package = "BGmisc")` covers more complex models using squirrel data from the Kluane Red Squirrel Project. ## Prerequisites ```{r setup} library(BGmisc) has_openmx <- requireNamespace("OpenMx", quietly = TRUE) has_mvtnorm <- requireNamespace("mvtnorm", quietly = TRUE) if (has_openmx) { library(OpenMx) } else { message( "OpenMx is not installed. Code will be shown but not executed.\n", "Install with: install.packages('OpenMx')" ) } run_models <- has_openmx ``` # Data Requirements at a Glance Before diving in, here is a concise reference for what your data must look like. Return to this section whenever something is unclear. ## Pedigree data frame One row per individual, at minimum: | Column | Type | Description | |--------|------|-------------| | `ID` | integer/character | Unique individual identifier | | `dadID` | same as `ID` | Father's ID; `NA` if unknown | | `momID` | same as `ID` | Mother's ID; `NA` if unknown | | `sex` | integer | `0` = male, `1` = female | | `famID` | integer/character | Extended family identifier | Column names are flexible — pass `personID = "myID"`, `famID = "famID"`, etc. to the `ped2*` functions. > **Tip:** Use `checkIDs()`, `checkSex()`, and `checkParents()` to validate your pedigree before computing relatedness matrices. ## Relatedness matrices Each matrix returned by `ped2add()`, `ped2cn()`, etc. is square ($n \times n$), symmetric, and **labeled**: `rownames` and `colnames` are the individual IDs. These labels are the link between phenotype data and the model. ## Phenotype data format The model-fitting functions expect phenotype data as a **one-row matrix**, where: - Each **column** is one individual. - **Column names** must exactly match `rownames(add_matrix)` for that family — same IDs, same order. - The vector of those column names is called `obs_ids` throughout this vignette. If your raw data is in long format (one row per person), you need to pivot it wide — this is shown step by step in Part 1 below. ## Missing phenotypes Individuals present in the pedigree only to trace relatedness (e.g., deceased ancestors) must be **removed from the matrices** before fitting. The `ped2*` functions include every individual; you subset them explicitly to those with observed data. --- # Part 1: Single-Family Model We start with one family from the built-in `hazard` dataset. Every step is written out explicitly — so you can see exactly what happens. ```{r load-hazard, eval = run_models} data(hazard) # Two families: famID 1 and famID 2 table(hazard$famID) ``` ## Step 1: Subset to one family and inspect ```{r single-fam-subset, eval = run_models} fam1 <- subset(hazard, famID == 1) # What does the pedigree look like? head(fam1[, c("famID", "ID", "sex", "dadID", "momID", "DA2")]) ``` `DA2` is the phenotype we will model. It is binary in the actual dataset; for a real continuous-trait analysis you would substitute your own measure. ## Step 2: Compute relatedness matrices ```{r single-fam-matrices, eval = run_models} # Additive genetic relatedness add_mat1 <- ped2add(fam1, sparse = FALSE, famID = "famID", personID = "ID", momID = "momID", dadID = "dadID", sex = "sex", keep_ids = fam1$ID ) # keep all individuals for now # Common nuclear environment (1 if same biological parents, 0 otherwise) cn_mat1 <- ped2cn(fam1, sparse = FALSE, famID = "famID", personID = "ID", momID = "momID", dadID = "dadID", sex = "sex", keep_ids = fam1$ID ) mt_mat1 <- ped2mit(fam1, sparse = FALSE, famID = "famID", personID = "ID", momID = "momID", dadID = "dadID", sex = "sex", keep_ids = fam1$ID ) # The matrices are square and labeled with individual IDs dim(add_mat1) rownames(add_mat1) ``` ## Step 3: Prepare the phenotype data The model needs a **one-row matrix** with one column per individual, in the same order as the relatedness matrix rows. ```{r single-fam-pheno, eval = run_models} # Individual IDs in the order the matrices use id_order1 <- rownames(add_mat1) # Pull phenotype values, reordered to match the matrix pheno_vals1 <- fam1$DA2[match(id_order1, as.character(fam1$ID))] names(pheno_vals1) <- id_order1 # Who actually has an observed phenotype? observed1 <- !is.na(pheno_vals1) cat("Individuals in pedigree:", length(id_order1), "\n") cat("Individuals with observed phenotype:", sum(observed1), "\n") # obs_ids: the IDs of observed individuals — these will be column names # They must be syntactically valid R names (OpenMx requirement) obs_ids1 <- make.names(id_order1[observed1]) # One-row matrix: one column per observed individual. # as.double() is required — OpenMx will error if the matrix storage mode # is integer rather than double, even if the values look numeric. pheno_row1 <- matrix( as.double(pheno_vals1[observed1]), nrow = 1, dimnames = list(NULL, obs_ids1) ) pheno_row1[1, 1:6] # first few values ``` ## Step 4: Subset matrices to observed individuals Individuals without a phenotype must be removed from the matrices — their rows and columns are dropped. ```{r single-fam-subset-mat, eval = run_models} # Use the original (non-make.names) IDs to index the matrix raw_obs_ids1 <- id_order1[observed1] add_mat1_obs <- add_mat1[raw_obs_ids1, raw_obs_ids1] cn_mat1_obs <- cn_mat1[raw_obs_ids1, raw_obs_ids1] mt_mat1_obs <- mt_mat1[raw_obs_ids1, raw_obs_ids1] # Rename rows/cols to match obs_ids1 (the make.names versions) rownames(add_mat1_obs) <- colnames(add_mat1_obs) <- obs_ids1 rownames(cn_mat1_obs) <- colnames(cn_mat1_obs) <- obs_ids1 rownames(mt_mat1_obs) <- colnames(mt_mat1_obs) <- obs_ids1 dim(add_mat1_obs) ``` ## Step 5: Check model identification Before fitting, verify that the variance components you want to estimate are statistically identifiable given this pedigree's structure. ```{r single-fam-identify, eval = run_models} id_check1 <- identifyComponentModel( A = add_mat1_obs, Cn = cn_mat1_obs, Mt = mt_mat1_obs, E = diag(1, nrow(add_mat1_obs)) ) id_check1 ``` ## Step 6: Build and fit the model ```{r single-fam-fit, eval = run_models} # Starting values for variance components start_vars <- list( ad2 = 0.3, # additive genetic cn2 = 0.1, # common nuclear environment ce2 = 0, # common extended (not estimated here) mt2 = 0.1, # mitochondrial dd2 = 0, # dominance (not estimated here) am2 = 0, # A x Mt interaction (not estimated here) ee2 = 0.5 # unique environment ) # Build the group model for family 1 group1 <- buildOneFamilyGroup( group_name = "family1", Addmat = add_mat1_obs, Nucmat = cn_mat1_obs, Mtdmat = mt_mat1_obs, full_df_row = pheno_row1, obs_ids = obs_ids1 ) # Wrap in a full pedigree model and fit model1 <- buildPedigreeMx( model_name = "SingleFamilyModel", vars = start_vars, group_models = list(group1) ) fitted1 <- mxRun(model1) summary(fitted1) ``` The estimated variance components are in `fitted1$ModelOne`: ```{r single-fam-results, eval = run_models} cat("Additive genetic (Vad):", fitted1$ModelOne$Vad$values, "\n") cat("Common nuclear (Vcn):", fitted1$ModelOne$Vcn$values, "\n") cat("Mitochondrial (Vmt):", fitted1$ModelOne$Vmt$values, "\n") cat("Unique environ. (Ver):", fitted1$ModelOne$Ver$values, "\n") ``` With a single family, estimates have wide uncertainty. Part 2 adds a second family to improve precision. # Part 2: Two-Family Multi-Group Model Multi-group modeling means that **variance component parameters are shared** across families, but each family has its own relatedness matrix and phenotype data. We write out the steps for family 2 explicitly — the same steps as Part 1, just on different data — so you can see exactly what repeats and what changes. ## Family 2: matrices ```{r two-fam-matrices, eval = run_models} fam2 <- subset(hazard, famID == 2) add_mat2 <- ped2add(fam2, sparse = FALSE, famID = "famID", personID = "ID", momID = "momID", dadID = "dadID", sex = "sex", keep_ids = fam2$ID ) cn_mat2 <- ped2cn(fam2, sparse = FALSE, famID = "famID", personID = "ID", momID = "momID", dadID = "dadID", sex = "sex", keep_ids = fam2$ID ) mt_mat2 <- ped2mit(fam2, sparse = FALSE, famID = "famID", personID = "ID", momID = "momID", dadID = "dadID", sex = "sex", keep_ids = fam2$ID ) dim(add_mat2) ``` ## Family 2: phenotype data ```{r two-fam-pheno, eval = run_models} id_order2 <- rownames(add_mat2) pheno_vals2 <- fam2$DA2[match(id_order2, as.character(fam2$ID))] names(pheno_vals2) <- id_order2 observed2 <- !is.na(pheno_vals2) obs_ids2 <- make.names(id_order2[observed2]) raw_obs_ids2 <- id_order2[observed2] pheno_row2 <- matrix( as.double(pheno_vals2[observed2]), nrow = 1, dimnames = list(NULL, obs_ids2) ) cat("Family 2 observed:", sum(observed2), "of", length(id_order2), "\n") ``` ## Family 2: subset matrices ```{r two-fam-subset-mat, eval = run_models} add_mat2_obs <- add_mat2[raw_obs_ids2, raw_obs_ids2] cn_mat2_obs <- cn_mat2[raw_obs_ids2, raw_obs_ids2] mt_mat2_obs <- mt_mat2[raw_obs_ids2, raw_obs_ids2] rownames(add_mat2_obs) <- colnames(add_mat2_obs) <- obs_ids2 rownames(cn_mat2_obs) <- colnames(cn_mat2_obs) <- obs_ids2 rownames(mt_mat2_obs) <- colnames(mt_mat2_obs) <- obs_ids2 ``` ## Check identification across both families ```{r two-fam-identify, eval = run_models} n_total <- nrow(add_mat1_obs) + nrow(add_mat2_obs) id_check2 <- identifyComponentModel( A = list(add_mat1_obs, add_mat2_obs), Cn = list(cn_mat1_obs, cn_mat2_obs), Mt = list(mt_mat1_obs, mt_mat2_obs), E = diag(1, n_total) ) id_check2 ``` ## Build and fit the two-family model Each family gets its own group model. The key difference from Part 1: `buildPedigreeMx()` now receives both groups, and `mxFitFunctionMultigroup()` combines their likelihoods. The variance component parameters (`Vad`, `Vcn`, `Ver`) are estimated jointly. ```{r two-fam-fit, eval = run_models} # Family 2 group model (identical call structure to family 1) group2 <- buildOneFamilyGroup( group_name = "family2", Addmat = add_mat2_obs, Nucmat = cn_mat2_obs, Mtdmat = mt_mat2_obs, full_df_row = pheno_row2, obs_ids = obs_ids2 ) # Both groups share the same variance component parameters model2 <- buildPedigreeMx( model_name = "TwoFamilyModel", vars = start_vars, group_models = list(group1, group2) ) fitted2 <- mxRun(model2) summary(fitted2) ``` ```{r two-fam-results, eval = run_models} cat("Additive genetic (Vad):", fitted2$ModelOne$Vad$values, "\n") cat("Common nuclear (Vcn):", fitted2$ModelOne$Vcn$values, "\n") cat("Mitochondrial (Vmt):", fitted2$ModelOne$Vmt$values, "\n") cat("Unique environ. (Ver):", fitted2$ModelOne$Ver$values, "\n") ``` The estimates from both families together are generally more stable than those from either family alone. --- # Part 3: Scaling Up — Many Families and Parameter Recovery Once you understand the pattern from Parts 1 and 2, you can scale to many families using a loop. We switch to **simulated data** here so we have known true parameter values to check our estimates against. ## Simulate pedigrees and data ```{r simulate-many, eval = has_mvtnorm && run_models} library(mvtnorm) set.seed(2024) # True variance components true_var <- list( ad2 = 0.40, cn2 = 0.10, ee2 = 0.40, ce2 = 0, mt2 = 0.10, dd2 = 0, am2 = 0 ) n_families <- 15 # Pre-allocate storage add_list <- vector("list", n_families) cn_list <- vector("list", n_families) mt_list <- vector("list", n_families) obs_ids_list <- vector("list", n_families) pheno_list <- vector("list", n_families) for (i in seq_len(n_families)) { ped_i <- simulatePedigree(kpc = 3, Ngen = 4, marR = 0.6) A_i <- as.matrix(ped2add(ped_i, sparse = FALSE)) Cn_i <- as.matrix(ped2cn(ped_i, sparse = FALSE)) Mt_i <- as.matrix(ped2mit(ped_i, sparse = FALSE)) n_i <- nrow(A_i) # Implied covariance from true parameters V_i <- true_var$ad2 * A_i + true_var$cn2 * Cn_i + true_var$mt2 * Mt_i + true_var$ee2 * diag(1, n_i) y_i <- rmvnorm(1, sigma = V_i) ids_i <- make.names(rownames(A_i)) rownames(A_i) <- colnames(A_i) <- ids_i rownames(Cn_i) <- colnames(Cn_i) <- ids_i rownames(Mt_i) <- colnames(Mt_i) <- ids_i add_list[[i]] <- A_i cn_list[[i]] <- Cn_i mt_list[[i]] <- Mt_i obs_ids_list[[i]] <- ids_i pheno_list[[i]] <- matrix(y_i, nrow = 1, dimnames = list(NULL, ids_i)) } cat("Family sizes:", vapply(add_list, nrow, integer(1)), "\n") ``` ## Build and fit the multi-family model ```{r fit-many, eval = has_mvtnorm && run_models} group_models <- lapply(seq_len(n_families), function(i) { buildOneFamilyGroup( group_name = paste0("ped", i), Addmat = add_list[[i]], Nucmat = cn_list[[i]], Mtdmat = mt_list[[i]], full_df_row = pheno_list[[i]], obs_ids = obs_ids_list[[i]] ) }) multi_model <- buildPedigreeMx( model_name = "MultiPedigreeModel", vars = start_vars, group_models = group_models ) fitted_multi <- mxRun(multi_model) ``` ## Compare estimates to true values ```{r compare-many, eval = has_mvtnorm && run_models} data.frame( Component = c( "Additive genetic (Vad)", "Common nuclear (Vcn)", "Mitochondrial (Vmt)", "Unique environ. (Ver)" ), True = c( true_var$ad2, true_var$cn2, true_var$mt2, true_var$ee2 ), Estimated = round(c( fitted_multi$ModelOne$Vad$values, fitted_multi$ModelOne$Vcn$values, fitted_multi$ModelOne$Vmt$values, fitted_multi$ModelOne$Ver$values ), 4) ) ``` With ten families, estimates should be substantially closer to the true values than with a single family. ## Using the high-level wrapper For convenience, `fitPedigreeModel()` wraps the build and fit steps together, and uses `mxTryHard()` for robust optimization: ```{r fit-wrapper, eval = has_mvtnorm && run_models} fitted_easy <- fitPedigreeModel( model_name = "EasyFit", vars = start_vars, group_models = group_models, tryhard = TRUE ) summary(fitted_easy) ``` # Understanding the Covariance Structure The model estimated above is defined by: $$V = \sigma^2_a \mathbf{A} + \sigma^2_{cn} \mathbf{C}_n + \sigma^2_e \mathbf{I}$$ where $\mathbf{A}$ is the additive genetic relatedness matrix, $\mathbf{C}_n$ is the common nuclear environment matrix, $\mathbf{I}$ is the identity matrix (unique environment), and the $\sigma^2$ terms are the variance components estimated by the model. This is an extension of the classical twin model to arbitrary pedigree structures. Additional components can be added by passing `Extmat` (common extended environment), `Mtdmat` (mitochondrial), or `Dmgmat` (dominance) to `buildOneFamilyGroup()`. See `vignette("v1_modelingvariancecomponents", package = "BGmisc")` for background on identification. # Summary | Part | Data | Key point | |------|------|-----------| | 1 — Single family | `hazard`, family 1 | Full pipeline, one group model | | 2 — Two families | `hazard`, families 1 & 2 | Same steps twice; `buildPedigreeMx()` combines them | | 3 — Many families | Simulated | Loop over families; verify parameter recovery | The helper functions (`buildOneFamilyGroup()`, `buildPedigreeMx()`, `fitPedigreeModel()`) handle the translation from pedigree relatedness matrices into OpenMx model specifications. The pattern is always the same: one call to `buildOneFamilyGroup()` per family, then one call to `buildPedigreeMx()` to combine them.