---
title: "Fitting Pedigree-Based Variance Component Models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fitting Pedigree-Based Variance Component Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Introduction

A core goal of behavior genetics is to decompose observed phenotypic variance into genetic and environmental components. Extended pedigrees — multi-generational families with known parentage — provide richer information about relatedness than twin studies alone and allow researchers to estimate a wider array of variance components. 

The `BGmisc` package provides a complete pipeline for pedigree-based variance component modeling:

1. **Compute** relatedness matrices with `ped2add()`, `ped2cn()`, `ped2mit()`, and `ped2ce()`
2. **Check identification** with `identifyComponentModel()`
3. **Build** OpenMx group models with `buildOneFamilyGroup()`
4. **Fit** the combined model with `buildPedigreeMx()` and `mxRun()` (or the wrapper `fitPedigreeModel()`)

This vignette builds up in three parts: a single-family model, a two-family multi-group model, and a scaled-up simulation study. An extension vignette  `vignette("v61_pedigree_model_fitting", package = "BGmisc")` covers more complex models using squirrel data from the Kluane Red Squirrel Project.

## Prerequisites

```{r setup}
library(BGmisc)

has_openmx <- requireNamespace("OpenMx", quietly = TRUE)
has_mvtnorm <- requireNamespace("mvtnorm", quietly = TRUE)

if (has_openmx) {
  library(OpenMx)
} else {
  message(
    "OpenMx is not installed. Code will be shown but not executed.\n",
    "Install with: install.packages('OpenMx')"
  )
}

run_models <- has_openmx
```


# Data Requirements at a Glance

Before diving in, here is a concise reference for what your data must look like. Return to this section whenever something is unclear.

## Pedigree data frame

One row per individual, at minimum:

| Column | Type | Description |
|--------|------|-------------|
| `ID` | integer/character | Unique individual identifier |
| `dadID` | same as `ID` | Father's ID; `NA` if unknown |
| `momID` | same as `ID` | Mother's ID; `NA` if unknown |
| `sex` | integer | `0` = male, `1` = female |
| `famID` | integer/character | Extended family identifier |

Column names are flexible — pass `personID = "myID"`, `famID = "famID"`, etc. to the `ped2*` functions.

> **Tip:** Use `checkIDs()`, `checkSex()`, and `checkParents()` to validate your pedigree before computing relatedness matrices.

## Relatedness matrices

Each matrix returned by `ped2add()`, `ped2cn()`, etc. is square ($n \times n$), symmetric, and **labeled**: `rownames` and `colnames` are the individual IDs. These labels are the link between phenotype data and the model.

## Phenotype data format

The model-fitting functions expect phenotype data as a **one-row matrix**, where:

- Each **column** is one individual.
- **Column names** must exactly match `rownames(add_matrix)` for that family — same IDs, same order.
- The vector of those column names is called `obs_ids` throughout this vignette.

If your raw data is in long format (one row per person), you need to pivot it wide — this is shown step by step in Part 1 below.

## Missing phenotypes

Individuals present in the pedigree only to trace relatedness (e.g., deceased ancestors) must be **removed from the matrices** before fitting. The `ped2*` functions include every individual; you subset them explicitly to those with observed data.

---

# Part 1: Single-Family Model

We start with one family from the built-in `hazard` dataset. Every step is written out explicitly — so you can see exactly what happens.

```{r load-hazard, eval = run_models}
data(hazard)

# Two families: famID 1 and famID 2
table(hazard$famID)
```

## Step 1: Subset to one family and inspect

```{r single-fam-subset, eval = run_models}
fam1 <- subset(hazard, famID == 1)

# What does the pedigree look like?
head(fam1[, c("famID", "ID", "sex", "dadID", "momID", "DA2")])
```

`DA2` is the phenotype we will model. It is binary in the actual dataset; for a real continuous-trait analysis you would substitute your own measure.

## Step 2: Compute relatedness matrices

```{r single-fam-matrices, eval = run_models}
# Additive genetic relatedness
add_mat1 <- ped2add(fam1,
  sparse = FALSE,
  famID = "famID", personID = "ID",
  momID = "momID", dadID = "dadID", sex = "sex",
  keep_ids = fam1$ID
) # keep all individuals for now


# Common nuclear environment (1 if same biological parents, 0 otherwise)
cn_mat1 <- ped2cn(fam1,
  sparse = FALSE,
  famID = "famID", personID = "ID",
  momID = "momID", dadID = "dadID", sex = "sex",
  keep_ids = fam1$ID
)

mt_mat1 <- ped2mit(fam1,
  sparse = FALSE,
  famID = "famID", personID = "ID",
  momID = "momID", dadID = "dadID", sex = "sex",
  keep_ids = fam1$ID
)

# The matrices are square and labeled with individual IDs
dim(add_mat1)
rownames(add_mat1)
```

## Step 3: Prepare the phenotype data

The model needs a **one-row matrix** with one column per individual, in the same order as the relatedness matrix rows.

```{r single-fam-pheno, eval = run_models}
# Individual IDs in the order the matrices use
id_order1 <- rownames(add_mat1)

# Pull phenotype values, reordered to match the matrix
pheno_vals1 <- fam1$DA2[match(id_order1, as.character(fam1$ID))]
names(pheno_vals1) <- id_order1

# Who actually has an observed phenotype?
observed1 <- !is.na(pheno_vals1)
cat("Individuals in pedigree:", length(id_order1), "\n")
cat("Individuals with observed phenotype:", sum(observed1), "\n")

# obs_ids: the IDs of observed individuals — these will be column names
# They must be syntactically valid R names (OpenMx requirement)
obs_ids1 <- make.names(id_order1[observed1])

# One-row matrix: one column per observed individual.
# as.double() is required — OpenMx will error if the matrix storage mode
# is integer rather than double, even if the values look numeric.
pheno_row1 <- matrix(
  as.double(pheno_vals1[observed1]),
  nrow = 1,
  dimnames = list(NULL, obs_ids1)
)
pheno_row1[1, 1:6] # first few values
```

## Step 4: Subset matrices to observed individuals

Individuals without a phenotype must be removed from the matrices — their rows and columns are dropped.

```{r single-fam-subset-mat, eval = run_models}
# Use the original (non-make.names) IDs to index the matrix
raw_obs_ids1 <- id_order1[observed1]

add_mat1_obs <- add_mat1[raw_obs_ids1, raw_obs_ids1]
cn_mat1_obs <- cn_mat1[raw_obs_ids1, raw_obs_ids1]
mt_mat1_obs <- mt_mat1[raw_obs_ids1, raw_obs_ids1]

# Rename rows/cols to match obs_ids1 (the make.names versions)
rownames(add_mat1_obs) <- colnames(add_mat1_obs) <- obs_ids1
rownames(cn_mat1_obs) <- colnames(cn_mat1_obs) <- obs_ids1
rownames(mt_mat1_obs) <- colnames(mt_mat1_obs) <- obs_ids1

dim(add_mat1_obs)
```

## Step 5: Check model identification

Before fitting, verify that the variance components you want to estimate are statistically identifiable given this pedigree's structure.

```{r single-fam-identify, eval = run_models}
id_check1 <- identifyComponentModel(
  A  = add_mat1_obs,
  Cn = cn_mat1_obs,
  Mt = mt_mat1_obs,
  E  = diag(1, nrow(add_mat1_obs))
)
id_check1
```

## Step 6: Build and fit the model

```{r single-fam-fit, eval = run_models}
# Starting values for variance components
start_vars <- list(
  ad2 = 0.3, # additive genetic
  cn2 = 0.1, # common nuclear environment
  ce2 = 0, # common extended (not estimated here)
  mt2 = 0.1, # mitochondrial
  dd2 = 0, # dominance (not estimated here)
  am2 = 0, # A x Mt interaction (not estimated here)
  ee2 = 0.5 # unique environment
)

# Build the group model for family 1
group1 <- buildOneFamilyGroup(
  group_name  = "family1",
  Addmat      = add_mat1_obs,
  Nucmat      = cn_mat1_obs,
  Mtdmat      = mt_mat1_obs,
  full_df_row = pheno_row1,
  obs_ids     = obs_ids1
)

# Wrap in a full pedigree model and fit
model1 <- buildPedigreeMx(
  model_name   = "SingleFamilyModel",
  vars         = start_vars,
  group_models = list(group1)
)

fitted1 <- mxRun(model1)
summary(fitted1)
```

The estimated variance components are in `fitted1$ModelOne`:

```{r single-fam-results, eval = run_models}
cat("Additive genetic (Vad):", fitted1$ModelOne$Vad$values, "\n")
cat("Common nuclear  (Vcn):", fitted1$ModelOne$Vcn$values, "\n")
cat("Mitochondrial (Vmt):", fitted1$ModelOne$Vmt$values, "\n")
cat("Unique environ. (Ver):", fitted1$ModelOne$Ver$values, "\n")
```

With a single family, estimates have wide uncertainty. Part 2 adds a second family to improve precision.


# Part 2: Two-Family Multi-Group Model

Multi-group modeling means that **variance component parameters are shared** across families, but each family has its own relatedness matrix and phenotype data. We write out the steps for family 2 explicitly — the same steps as Part 1, just on different data — so you can see exactly what repeats and what changes.

## Family 2: matrices

```{r two-fam-matrices, eval = run_models}
fam2 <- subset(hazard, famID == 2)

add_mat2 <- ped2add(fam2,
  sparse = FALSE,
  famID = "famID", personID = "ID",
  momID = "momID", dadID = "dadID", sex = "sex",
  keep_ids = fam2$ID
)

cn_mat2 <- ped2cn(fam2,
  sparse = FALSE,
  famID = "famID", personID = "ID",
  momID = "momID", dadID = "dadID", sex = "sex",
  keep_ids = fam2$ID
)

mt_mat2 <- ped2mit(fam2,
  sparse = FALSE,
  famID = "famID", personID = "ID",
  momID = "momID", dadID = "dadID", sex = "sex",
  keep_ids = fam2$ID
)

dim(add_mat2)
```

## Family 2: phenotype data

```{r two-fam-pheno, eval = run_models}
id_order2 <- rownames(add_mat2)

pheno_vals2 <- fam2$DA2[match(id_order2, as.character(fam2$ID))]
names(pheno_vals2) <- id_order2

observed2 <- !is.na(pheno_vals2)
obs_ids2 <- make.names(id_order2[observed2])
raw_obs_ids2 <- id_order2[observed2]

pheno_row2 <- matrix(
  as.double(pheno_vals2[observed2]),
  nrow = 1,
  dimnames = list(NULL, obs_ids2)
)

cat("Family 2 observed:", sum(observed2), "of", length(id_order2), "\n")
```

## Family 2: subset matrices

```{r two-fam-subset-mat, eval = run_models}
add_mat2_obs <- add_mat2[raw_obs_ids2, raw_obs_ids2]
cn_mat2_obs <- cn_mat2[raw_obs_ids2, raw_obs_ids2]
mt_mat2_obs <- mt_mat2[raw_obs_ids2, raw_obs_ids2]

rownames(add_mat2_obs) <- colnames(add_mat2_obs) <- obs_ids2
rownames(cn_mat2_obs) <- colnames(cn_mat2_obs) <- obs_ids2
rownames(mt_mat2_obs) <- colnames(mt_mat2_obs) <- obs_ids2
```

## Check identification across both families

```{r two-fam-identify, eval = run_models}
n_total <- nrow(add_mat1_obs) + nrow(add_mat2_obs)

id_check2 <- identifyComponentModel(
  A  = list(add_mat1_obs, add_mat2_obs),
  Cn = list(cn_mat1_obs, cn_mat2_obs),
  Mt = list(mt_mat1_obs, mt_mat2_obs),
  E  = diag(1, n_total)
)
id_check2
```

## Build and fit the two-family model

Each family gets its own group model. The key difference from Part 1: `buildPedigreeMx()` now receives both groups, and `mxFitFunctionMultigroup()` combines their likelihoods. The variance component parameters (`Vad`, `Vcn`, `Ver`) are estimated jointly.

```{r two-fam-fit, eval = run_models}
# Family 2 group model (identical call structure to family 1)
group2 <- buildOneFamilyGroup(
  group_name  = "family2",
  Addmat      = add_mat2_obs,
  Nucmat      = cn_mat2_obs,
  Mtdmat      = mt_mat2_obs,
  full_df_row = pheno_row2,
  obs_ids     = obs_ids2
)

# Both groups share the same variance component parameters
model2 <- buildPedigreeMx(
  model_name   = "TwoFamilyModel",
  vars         = start_vars,
  group_models = list(group1, group2)
)

fitted2 <- mxRun(model2)
summary(fitted2)
```

```{r two-fam-results, eval = run_models}
cat("Additive genetic (Vad):", fitted2$ModelOne$Vad$values, "\n")
cat("Common nuclear  (Vcn):", fitted2$ModelOne$Vcn$values, "\n")
cat("Mitochondrial (Vmt):", fitted2$ModelOne$Vmt$values, "\n")
cat("Unique environ. (Ver):", fitted2$ModelOne$Ver$values, "\n")
```

The estimates from both families together are generally more stable than those from either family alone.

---


# Part 3: Scaling Up — Many Families and Parameter Recovery

Once you understand the pattern from Parts 1 and 2, you can scale to many families using a loop. We switch to **simulated data** here so we have known true parameter values to check our estimates against.

## Simulate pedigrees and data

```{r simulate-many, eval = has_mvtnorm && run_models}
library(mvtnorm)
set.seed(2024)

# True variance components
true_var <- list(
  ad2 = 0.40, cn2 = 0.10, ee2 = 0.40,
  ce2 = 0, mt2 = 0.10, dd2 = 0, am2 = 0
)

n_families <- 15

# Pre-allocate storage
add_list <- vector("list", n_families)
cn_list <- vector("list", n_families)
mt_list <- vector("list", n_families)
obs_ids_list <- vector("list", n_families)
pheno_list <- vector("list", n_families)

for (i in seq_len(n_families)) {
  ped_i <- simulatePedigree(kpc = 3, Ngen = 4, marR = 0.6)

  A_i <- as.matrix(ped2add(ped_i, sparse = FALSE))
  Cn_i <- as.matrix(ped2cn(ped_i, sparse = FALSE))
  Mt_i <- as.matrix(ped2mit(ped_i, sparse = FALSE))
  n_i <- nrow(A_i)

  # Implied covariance from true parameters
  V_i <- true_var$ad2 * A_i +
    true_var$cn2 * Cn_i +
    true_var$mt2 * Mt_i +
    true_var$ee2 * diag(1, n_i)

  y_i <- rmvnorm(1, sigma = V_i)

  ids_i <- make.names(rownames(A_i))
  rownames(A_i) <- colnames(A_i) <- ids_i
  rownames(Cn_i) <- colnames(Cn_i) <- ids_i
  rownames(Mt_i) <- colnames(Mt_i) <- ids_i

  add_list[[i]] <- A_i
  cn_list[[i]] <- Cn_i
  mt_list[[i]] <- Mt_i
  obs_ids_list[[i]] <- ids_i
  pheno_list[[i]] <- matrix(y_i, nrow = 1, dimnames = list(NULL, ids_i))
}

cat("Family sizes:", vapply(add_list, nrow, integer(1)), "\n")
```

## Build and fit the multi-family model

```{r fit-many, eval = has_mvtnorm && run_models}
group_models <- lapply(seq_len(n_families), function(i) {
  buildOneFamilyGroup(
    group_name  = paste0("ped", i),
    Addmat      = add_list[[i]],
    Nucmat      = cn_list[[i]],
    Mtdmat      = mt_list[[i]],
    full_df_row = pheno_list[[i]],
    obs_ids     = obs_ids_list[[i]]
  )
})

multi_model <- buildPedigreeMx(
  model_name   = "MultiPedigreeModel",
  vars         = start_vars,
  group_models = group_models
)

fitted_multi <- mxRun(multi_model)
```

## Compare estimates to true values

```{r compare-many, eval = has_mvtnorm && run_models}
data.frame(
  Component = c(
    "Additive genetic (Vad)",
    "Common nuclear  (Vcn)",
    "Mitochondrial (Vmt)",
    "Unique environ. (Ver)"
  ),
  True = c(
    true_var$ad2, true_var$cn2,
    true_var$mt2,
    true_var$ee2
  ),
  Estimated = round(c(
    fitted_multi$ModelOne$Vad$values,
    fitted_multi$ModelOne$Vcn$values,
    fitted_multi$ModelOne$Vmt$values,
    fitted_multi$ModelOne$Ver$values
  ), 4)
)
```

With ten families, estimates should be substantially closer to the true values than with a single family.

## Using the high-level wrapper

For convenience, `fitPedigreeModel()` wraps the build and fit steps together, and uses `mxTryHard()` for robust optimization:

```{r fit-wrapper, eval = has_mvtnorm && run_models}
fitted_easy <- fitPedigreeModel(
  model_name   = "EasyFit",
  vars         = start_vars,
  group_models = group_models,
  tryhard      = TRUE
)
summary(fitted_easy)
```


# Understanding the Covariance Structure

The model estimated above is defined by:

$$V = \sigma^2_a \mathbf{A} + \sigma^2_{cn} \mathbf{C}_n + \sigma^2_e \mathbf{I}$$

where $\mathbf{A}$ is the additive genetic relatedness matrix, $\mathbf{C}_n$ is the common nuclear environment matrix, $\mathbf{I}$ is the identity matrix (unique environment), and the $\sigma^2$ terms are the variance components estimated by the model.

This is an extension of the classical twin model to arbitrary pedigree structures. Additional components can be added by passing `Extmat` (common extended environment), `Mtdmat` (mitochondrial), or `Dmgmat` (dominance) to `buildOneFamilyGroup()`. See `vignette("v1_modelingvariancecomponents", package = "BGmisc")` for background on identification.


# Summary

| Part | Data | Key point |
|------|------|-----------|
| 1 — Single family | `hazard`, family 1 | Full pipeline, one group model |
| 2 — Two families | `hazard`, families 1 & 2 | Same steps twice; `buildPedigreeMx()` combines them |
| 3 — Many families | Simulated | Loop over families; verify parameter recovery |

The helper functions (`buildOneFamilyGroup()`, `buildPedigreeMx()`, `fitPedigreeModel()`) handle the translation from pedigree relatedness matrices into OpenMx model specifications. The pattern is always the same: one call to `buildOneFamilyGroup()` per family, then one call to `buildPedigreeMx()` to combine them.