Sum of sister clade differences

Brief description: This is a script that aids the evaluation of distribution of taxonomic groups in a phylogenetic tree. Here, the metric of choice is the sum of sister clade differences, considering taxonomic groups (here phyla) as binary traits (presence/absense from a tree tip). For each taxonomic group, a vector of presence/absence is passed together with the phylogenetic tree into a custom function and a for() loop is used to compute the metric for all considered taxonomic groups.

Extract the function and work with own data. Consult the tutorial for any data processing ideas.

Section 1 : Preparation

Step 1: Let’s see all the necessary packages needed for the analysis and load them. If they are not installed, we will install them first.

# Here we store the necessary and installed packages 
necessary_packages <- c("readxl","dplyr","phytools", "ape","caper") 
installed_packages <- rownames(installed.packages())

# Here we check if the necessary packages are installed. If not, R installs them 
for (pkg in necessary_packages) {
  if (! pkg %in% installed_packages) {
    tryCatch({
      install.packages(pkg)
    }, error = function(e) {
      BiocManager::install(pkg)
    })
  }
}

# Here, for every package included in the necessary_packages variable, R will load it. The character.only = TRUE ensures that library() will try to interpret the package names as characters (as it needs for it to work) 
for (pkg in necessary_packages) {
  library(pkg, character.only = TRUE)
  print(paste("Loaded", pkg))
}

## [1] "Loaded readxl"

## [1] "Loaded dplyr"

## [1] "Loaded phytools"
## [1] "Loaded ape"

## [1] "Loaded caper"

Section 2 : Data Import

Step 1: Let’s create a phylogenetic tree. It will be a small one created from scratch.

# This is the topology in Newick format
example_tree_topology <- "((((A,B), (C,(D,E))), (F, ((G,(H,I)),(J,K)))), ((L,(M,N)), ((((O,P),Q), (R,S)), (T,U))));"

# This will read the topology as a tree
example_tree <- read.tree(text = example_tree_topology)

# Let's see the tree
plot(example_tree)

# We can also label the internal nodes and tips
plot(example_tree)
nodelabels(text = (length(example_tree$tip.label) + 1):(length(example_tree$tip.label) + example_tree$Nnode), cex = 0.75)
tiplabels(text = 1:length(example_tree$tip.label), cex = 1.25, adj = -0)

# We will assign a vector with binary values (0/1) representing the location of a specific taxonomic group
example_tree_trait <- tibble(Tip = LETTERS[1:21], Trait = c(0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0))
  
tip_states <- example_tree_trait$Trait

example_tree_trait

## # A tibble: 21 × 2
##    Tip   Trait
##    <chr> <dbl>
##  1 A         0
##  2 B         0
##  3 C         0
##  4 D         1
##  5 E         1
##  6 F         0
##  7 G         0
##  8 H         0
##  9 I         0
## 10 J         0
## # ℹ 11 more rows

Section 3 : Creating the function

Step 1: Let’s create the function.

sister.clade.diff <- function(tree, tip.states) {
  
  if (!inherits(tree, "phylo")) {
    stop("Argument 'tree' must be of type 'phylo'")   # This stops the user if they haven't provided a phylogenetic tree in the correct format
  } else {
    if (!inherits(tip.states, "numeric")) {
      stop("Argument 'tip.states' must be of type 'numeric'")  # This stops the user if they haven't provided a numeric vector for presence/absense
    } else {
      if (! isTRUE(all(tip.states %in% c(0,1)))) {
        stop("Argument 'tip.states' must contain numeric values of 0 or 1")  # This stops the user if they haven't provided a numeric vector for presence/absense with values 0 or 1
      } else {
        if (length(tip.states) != length(tree$tip.label)) {
          stop("Number of values in the tip.states argument must agree with the number of tips in the phylo argument")  # This stops the user if the number of tips in the tree and the number of presence/absense values do not agree
        }
      }
    }
  }

  # We initiate some vectors to store the node values for (i) difference of sister clades, (ii) sum of trait values in sister clades, (iii) number of descendant tips and (iv) mean of trait values in sister clades
  # Each vector has the same number of spots as the number of internal nodes in the tree
  node_diff_values_list <- rep(NA, tree$Nnode)
  node_sum_values_list <- rep(NA, tree$Nnode)
  node_child_number_list <- rep(NA, tree$Nnode)
  node_mean_values_list <- rep(NA, tree$Nnode)
  
  # As long as any of the spots is not filled with a valid value, the loop will run
  while (any(is.na(node_diff_values_list))) {
    
    # For all internal nodes in reverse order 
    for (node in (length(tree$tip.label) + tree$Nnode) : (length(tree$tip.label) + 1)) {
      
      # It is time to designate the 2 child nodes. This we will retrieve from the phylo object edge info
      childs <- tree$edge[tree$edge[,1] == node, ][,2]
      
      # If the child nodes are numbered between 1 and maximum tip labels, it means they are both tips. In this case, sister clade difference of the node is easily computed by the difference in child trait values
      if (childs[1] %in% 1:length(tree$tip.label) & childs[2] %in% 1:length(tree$tip.label)) {
        
        print(paste('Processing node', node))
        
        # Filling the correct spot in the vectors with the respective metric. For example, metrics for the 12th internal node (node number 12 + N of tree tips) will be always in the 12th spot in the vectors
        node_diff_values_list[node - length(tree$tip.label)] <- abs(tip.states[childs[1]] - tip.states[childs[2]])
        node_sum_values_list[node - length(tree$tip.label)] <- sum(tip.states[childs[1]],tip.states[childs[2]])
        node_child_number_list[node - length(tree$tip.label)] <- 2
        node_mean_values_list[node - length(tree$tip.label)] <- node_sum_values_list[node - length(tree$tip.label)]/node_child_number_list[node - length(tree$tip.label)]
        
      } else {
        
        # If not both child nodes are tips, the loop will just skip to the next node
        next
      }
    }
    
    # This is the second step, after all nodes connecting only to tips are processed. Now we will iterate for all other intenral nodes starting from the edge of the tree towards the root (hence the reverse order in the for() loop range for node)
    for (node in (length(tree$tip.label) + tree$Nnode) : (length(tree$tip.label) + 1)) {
      childs <- tree$edge[tree$edge[,1] == node, ][,2]
      
      # If not both child nodes are numberd between the number range for tips, then computation shall be different, as mean values of clades are to be considered.
      # The second part of the if() condition secures that both child tips have computed values in the respective vectors, so the computation for the node in question can proceed
      if (
        ! ( childs[1] %in% 1:length(tree$tip.label) & childs[2] %in% 1:length(tree$tip.label))  & 
        ( ! any(is.na(node_sum_values_list[c(childs[childs[] > length(tree$tip.label)])-length(tree$tip.label)])))) {
        
        print(paste('Processing node', node))
        
        # We will get all descendants of the node in question (the two direct child nodes/tips plus all subsequent descendants). We will remove all descendants numbered outside of the tip range, so only tip descendants will be retained and internal nodes will be omitted
        descendants <- getDescendants(tree, node)
        descendants <- descendants[!descendants > length(tree$tip.label)]
        
        # Number of child will here be set as number of tip descendants. Sum and mean values will be calculated as before
        node_child_number_list[node - length(tree$tip.label)] <- length(descendants)
        node_sum_values_list[node - length(tree$tip.label)] <- sum(tip.states[descendants])
        node_mean_values_list[node - length(tree$tip.label)] <- node_sum_values_list[node - length(tree$tip.label)] / node_child_number_list[node - length(tree$tip.label)]
        
        # We will set a temporary vector v
        v <- c()
        # For the two immediate child nodes, if one is a tip, we will set its value inside v as NA
        for (child in childs) {
          if (child <= length(tree$tip.label)) {
            v[grep(child, childs)] <- NA
          } else {
            v[grep(child, childs)] <- child
          }
        }
        
        # We will create a 4-value vector, with trait value and mean value for both child nodes. If one is a node, the index for tip.states will be out of bounds and return NA. If one is a tip, its value in v will be NA and the index for the node_mean_values_list will not work, thus returning NA
        child_values <- c(tip.states[childs[1]], 
                          node_mean_values_list[v[1]-length(tree$tip.label)],
                          tip.states[childs[2]],
                          node_mean_values_list[v[2]-length(tree$tip.label)])
        
        # We will only keep the valid values
        child_values <- child_values[!is.na(child_values)]
        
        # We will calculate the difference. Both can be means, or one can be mean and one can be trait value of tip
        node_diff_values_list[node - length(tree$tip.label)] <- abs(child_values[1] - child_values[2])
      }
    }
  }
  
  # We will create a reporting table for number of tip descendants, sum and mean of difference and the difference of sister clades for each internal node
  internal_nodes_info <- tibble(Node = seq((length(tree$tip.label)+1),((length(tree$tip.label) + tree$Nnode)),1), Descendants = node_child_number_list, Sum = node_sum_values_list, Mean = node_mean_values_list, Diff = node_diff_values_list)
  
  # We will calculate the sum of sister clade differences and normalise based on the abundance of presence cases (1) in the dataset
  Sum_Of_Sister_Clade_Differences <- sum(internal_nodes_info$Diff)
  Normalised_Sum_Of_Sister_Clade_Differences <- Sum_Of_Sister_Clade_Differences * (1-(length(tip.states[tip.states == 1])/length(tip.states)))
  
  # We will print the values rounded
  print(paste("Sum of differences is", round(Sum_Of_Sister_Clade_Differences,3), "normalised to", round(Normalised_Sum_Of_Sister_Clade_Differences,3), "based on relative abundance of positive cases"))  
  
  # We will create a list to export to the global environment. It will help us retrieve and store values when running the sister.clade.diff() function in a for() loop for multiple taxonomic groups in the same tree
  list_to_export <- list(node_metrics = internal_nodes_info, sumdiff = Sum_Of_Sister_Clade_Differences, norm_sumdiff = Normalised_Sum_Of_Sister_Clade_Differences)
  
  assign("sister.clade.diff.output", list_to_export, envir = .GlobalEnv)
  
  # This will return the first 20 rows of the reporting table and the absolute and normalised sum of sister clades differences in the R console/terminal
    return(list(internal_nodes_info = head(internal_nodes_info, 20), 
         Sum_Of_Sister_Clade_Differences = Sum_Of_Sister_Clade_Differences, 
         Normalised_Sum_Of_Sister_Clade_Differences = Normalised_Sum_Of_Sister_Clade_Differences))
  }

Equation for computing and normalising the sum of sister clade differences. Here n denotes normalized, T is the number of tips in the tree, N is the number of internal nodes, Ti is the number of tips for phylum i. Mean trait a and mean trait b refer to the mean trait values of all descendant tips of the two child nodes a and b, or the trait value if a or b is tip

Step 2: Let’s test the function. We will first deliberately give wrong arguments to see that the function reports errors correctly

# This is expected to give error due to wrong tip.states vector
sister.clade.diff(tree = example_tree, tip.states = example_tree)

## Error in sister.clade.diff(tree = example_tree, tip.states = example_tree): Argument 'tip.states' must be of type 'numeric'

# This is expected to give error due to no phylo
phylo_wrong <- c(1,2,3,4,4)
sister.clade.diff(tree = phylo_wrong, tip.states = phyla)

## Error in sister.clade.diff(tree = phylo_wrong, tip.states = phyla): Argument 'tree' must be of type 'phylo'

# This is expected to give error due to wrong numerical vector for tip.states, as it does not only include 0,1
tip_states_wrong <- c(1,0,0,1,2,0,0)
sister.clade.diff(tree = example_tree, tip.states = tip_states_wrong)

## Error in sister.clade.diff(tree = example_tree, tip.states = tip_states_wrong): Argument 'tip.states' must contain numeric values of 0 or 1

# This is expected to give error due to the vector for tip.states having different number of values than the number of tips in the tree
tip_states_small <- c(0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
sister.clade.diff(tree = example_tree, tip.states = tip_states_small)

## Error in sister.clade.diff(tree = example_tree, tip.states = tip_states_small): Number of values in the tip.states argument must agree with the number of tips in the phylo argument

# Remove all test objects
rm(phylo_wrong, tip_states_wrong, tip_states_small)

Step 3: Let’s test the function. We will now run it with the correect arguments and the example tree and trait values

# This is expected to work
sister.clade.diff(tree = example_tree, tip.states = tip_states)

## [1] "Processing node 41"
## [1] "Processing node 40"
## [1] "Processing node 39"
## [1] "Processing node 35"
## [1] "Processing node 32"
## [1] "Processing node 31"
## [1] "Processing node 27"
## [1] "Processing node 25"
## [1] "Processing node 38"
## [1] "Processing node 37"
## [1] "Processing node 36"
## [1] "Processing node 34"
## [1] "Processing node 33"
## [1] "Processing node 30"
## [1] "Processing node 29"
## [1] "Processing node 28"
## [1] "Processing node 26"
## [1] "Processing node 24"
## [1] "Processing node 23"
## [1] "Processing node 22"
## [1] "Sum of differences is 4.213 normalised to 3.21 based on relative abundance of positive cases"

## $internal_nodes_info
## # A tibble: 20 × 5
##     Node Descendants   Sum  Mean  Diff
##    <dbl>       <dbl> <dbl> <dbl> <dbl>
##  1    22          21     5 0.238 0.118
##  2    23          11     2 0.182 0.4  
##  3    24           5     2 0.4   0.667
##  4    25           2     0 0     0    
##  5    26           3     2 0.667 1    
##  6    27           2     2 1     0    
##  7    28           6     0 0     0    
##  8    29           5     0 0     0    
##  9    30           3     0 0     0    
## 10    31           2     0 0     0    
## 11    32           2     0 0     0    
## 12    33          10     3 0.3   0.429
## 13    34           3     0 0     0    
## 14    35           2     0 0     0    
## 15    36           7     3 0.429 0.6  
## 16    37           5     3 0.6   1    
## 17    38           3     3 1     0    
## 18    39           2     2 1     0    
## 19    40           2     0 0     0    
## 20    41           2     0 0     0    
## 
## $Sum_Of_Sister_Clade_Differences
## [1] 4.21342
## 
## $Normalised_Sum_Of_Sister_Clade_Differences
## [1] 3.210225

Step 4: We can use the function inside a for() loop for multiple taxonomic groups in the same tree

# Let's add two more taxonomic groups in different tips of the tree
B <- c(1,1,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1)
C <- c(0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0)
example_tree_trait_three_traits <- cbind(example_tree_trait, B, C)
colnames(example_tree_trait_three_traits) <- c("Tip", "A", "B", "C")

# We now have a more rich trait dataframe
head(example_tree_trait_three_traits, 15)

##    Tip A B C
## 1    A 0 1 0
## 2    B 0 1 0
## 3    C 0 0 0
## 4    D 1 0 0
## 5    E 1 0 0
## 6    F 0 1 0
## 7    G 0 0 1
## 8    H 0 1 0
## 9    I 0 1 0
## 10   J 0 0 1
## 11   K 0 0 1
## 12   L 0 0 0
## 13   M 0 0 0
## 14   N 0 0 0
## 15   O 1 0 0

# We will prepare an empty list and two empty vectors to store info from each iteration of the for loop. They will store (i) the report tables, (ii) sum of sister clade differences and (iii) normalised sum of sister clade differences respectively
node_metrics_df_list <- list()
sumdiff_list <- c()
norm_sumdiff_list <- c()

# For each phylum in the dataset
for (phylum in colnames(example_tree_trait_three_traits)[-1]) {
  # Extract its tip_states vector from the dataframe
  tip_states <- example_tree_trait_three_traits[,phylum]
  # Run the custom function
  sister.clade.diff(example_tree, tip.states = tip_states)
  # Save the stored objects in the list and vectors
  node_metrics_df_list[[phylum]] <- sister.clade.diff.output$node_metrics
  sumdiff_list[phylum] <- sister.clade.diff.output$sumdiff
  norm_sumdiff_list[phylum] <- sister.clade.diff.output$norm_sumdiff
}

## [1] "Processing node 41"
## [1] "Processing node 40"
## [1] "Processing node 39"
## [1] "Processing node 35"
## [1] "Processing node 32"
## [1] "Processing node 31"
## [1] "Processing node 27"
## [1] "Processing node 25"
## [1] "Processing node 38"
## [1] "Processing node 37"
## [1] "Processing node 36"
## [1] "Processing node 34"
## [1] "Processing node 33"
## [1] "Processing node 30"
## [1] "Processing node 29"
## [1] "Processing node 28"
## [1] "Processing node 26"
## [1] "Processing node 24"
## [1] "Processing node 23"
## [1] "Processing node 22"
## [1] "Sum of differences is 4.213 normalised to 3.21 based on relative abundance of positive cases"
## [1] "Processing node 41"
## [1] "Processing node 40"
## [1] "Processing node 39"
## [1] "Processing node 35"
## [1] "Processing node 32"
## [1] "Processing node 31"
## [1] "Processing node 27"
## [1] "Processing node 25"
## [1] "Processing node 38"
## [1] "Processing node 37"
## [1] "Processing node 36"
## [1] "Processing node 34"
## [1] "Processing node 33"
## [1] "Processing node 30"
## [1] "Processing node 29"
## [1] "Processing node 28"
## [1] "Processing node 26"
## [1] "Processing node 24"
## [1] "Processing node 23"
## [1] "Processing node 22"
## [1] "Sum of differences is 5.364 normalised to 3.831 based on relative abundance of positive cases"
## [1] "Processing node 41"
## [1] "Processing node 40"
## [1] "Processing node 39"
## [1] "Processing node 35"
## [1] "Processing node 32"
## [1] "Processing node 31"
## [1] "Processing node 27"
## [1] "Processing node 25"
## [1] "Processing node 38"
## [1] "Processing node 37"
## [1] "Processing node 36"
## [1] "Processing node 34"
## [1] "Processing node 33"
## [1] "Processing node 30"
## [1] "Processing node 29"
## [1] "Processing node 28"
## [1] "Processing node 26"
## [1] "Processing node 24"
## [1] "Processing node 23"
## [1] "Processing node 22"
## [1] "Sum of differences is 4.582 normalised to 3.709 based on relative abundance of positive cases"

# Get a table phyla plus the sum of sister clade differences and th enormalised sum of sister clade differences values
all_phyla_diff_metrices <- tibble(Phylum = colnames(example_tree_trait_three_traits)[-1], Sum.of.differences = sumdiff_list, Normalised.Sum.of.Differences = norm_sumdiff_list)

all_phyla_diff_metrices

## # A tibble: 3 × 3
##   Phylum Sum.of.differences Normalised.Sum.of.Differences
##   <chr>               <dbl>                         <dbl>
## 1 A                    4.21                          3.21
## 2 B                    5.36                          3.83
## 3 C                    4.58                          3.71

Thanks for using this script! Please always remember to credit the creators of the scripts you use. And buy a programmer a beer 🍺 ! They really need one!

Sum of sister clade differences

Alex Kanellopoulos

2025-06-26