7  Statistics

Statistics often feels like a solitary pursuit, hopefully this page helps ecologists share and learn data science skills. © Steve Markham

Statistical tests are undertaken to improve ecological understanding, separating the single from the noise. It enhances the ecological description when writing reports, making the report more evidenced based and therefore more robust. When the data is not clear, or there is too much data to see, good statistics brings comprehension. Then there is a tendency to perceive connections or meaningful patterns between unrelated or random things, termed apophenia; here statistical testing can demonstrate it is random or unrelated.

There are many reference books for ecologists on statistics such as the gentle introductions Statistics for Terrified Biologists (Emden 2008), Practical Statistics for Field Biology (Fowler, Cohen, and Jarvis 1998) and Choosing and Using Statistics: A Biologist’s Guide (Dytham 2011). More comprehensive guides to ecological statistics are available from Numerical Ecology (Legendre and Legendre 2012) and Analysiing Ecological Data (Zuur, Ieno, and Smith 2007). The Numerical Ecology with R (Borcard, Gillet, and Legendre 2011) has a focus on ecological statistics using the R environment, while recent theory and application can be found in Ecological Statistics (Fox, Negrete-Yankelevich, and And 2015).

7.1 Data Exploration for Statistics.

Before undertaking data analysis it is prudent to perform some exploration of the data. Data exploration will help identify potential problems in the data and assist in deciding what type of analysis to conduct.

A six step protocol for data exploration based on (Zuur, Ieno, and Elphick 2010) is listed below:

  1. Outliers
  2. Normality
  3. An excess of zeros
  4. Multicollinearity
  5. Relationships among response and independent variables
  6. Independence of the response variable.

7.2 Summary Statistics

The Common Pipistrelle passes have been aggregated for every 30 minutes; Table 7.1 shows summary statistics for these bat passes during September and October. The statistics for September and October look similar and this is confirmed by the kernel density (Glossary B.2.4) plot shown in Figure 7.1.

Show the code
Table <- agg_data  %>%
  group_by(MonthFull, DateTime30) %>% 
  tally() %>% 
  rename(Month = MonthFull) %>% 
  summarise(Min = min(n, na.rm = T),
            Q1 = quantile(n, c(0.25), na.rm = T),
            Mean = mean(n, na.rm = T),
            Median = median(n, na.rm = T),
            Q3 = quantile(n, c(0.75), na.rm = T),
            Max = max(n, na.rm = T),
            sd = sd(n, na.rm = T),
            Nr = n())
Table %>% 
flextable() %>%
  fontsize(size = 9.5, part = "all") %>% 
  width(j = 1, width = 1) %>%
  width(j = 2:9, width = 0.6) %>%
  bg(bg = "black", part = "header") %>%
  color(color = "white", part = "header") %>% 
  colformat_int(j = 9, big.mark = " ") %>% 
  colformat_double(j = c(4, 8), digits = 2)
Table 7.1: Summary Statistics for Common Pipistrelle Passes

Month

Min

Q1

Mean

Median

Q3

Max

sd

Nr

September

1

1

1.92

1

2

8

1.55

83

October

1

1

3.08

2

4

16

3.28

91

Where:

  • Min:- Minimum value.
  • Q1:- Lower quartile, or first quartile (Q1), the value under which 25% of data points are found when they are arranged in increasing order (Glossary B.3.1).
  • Mean:- Arithmetic mean of a dataset is the sum of all values divided by the total number of values.
  • Median:- The middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).
  • Q3:- Upper quartile, or third quartile (Q3), is the value under which 75% of data points are found when arranged in increasing order (Glossary B.3.1).
  • Max:- Maximum value.
  • sd:- Standard deviation, a summary measure of the differences of each observation from the mean.
  • Nr:- Number of values.
Show the code
agg_data2 <- agg_data %>% 
  group_by(MonthFull, DateTime30) %>%
  tally() %>% 
  rename(Month = MonthFull) %>% 
  ungroup() %>% 
  pivot_wider(names_from = "Month", values_from = "n")


Sep <- agg_data2 %>% 
  select(September) %>% 
  filter(!is.na(September)) %>% 
  mutate(Passes = as.numeric(September),
         Month = "September")

Oct <- agg_data2 %>% 
  select(October) %>% 
  filter(!is.na(October)) %>% 
  mutate(Passes = as.numeric(October),
         Month = "October")

Count30 <- bind_rows(Sep, Oct)

ggplot(Count30, aes(Passes, fill = Month )) +
  geom_density(alpha = 0.5) +
  scale_fill_nejm() +
  labs(x = "Passes (Number / 30 minutes)") +
  theme_bw() +
  theme(legend.position = "bottom")
Figure 7.1: Comon Pipistrelle Passes September and October

7.3 Comparing Two Groups

Using the Wilcoxon signed rank test (Glossary B.5.1) we can decide, using a hypothesis test (Glossary B.5.3), whether the population distributions, summarised in Table 7.1 and Figure 7.1 are identical. An advantage of the Wilcoxon signed rank test is that it can be undertaken without assuming the data follows the normal distribution; the density plot in Figure 7.1 shows the data is unlikely to be normal. The two data samples for September and October need to be independent; the number of passes recorded separately in September and October are distinct populations where the passes do not affect each other.

It is assumed The survey effort (length of time the bat detectors where deployed and the locations are the same for both months)

Without assuming the data to have normal distribution, decide at .01 significance level (p < 0.01)1 if the observations of Common pipistrelle bats (every 30 minutes) for September and October have identical data distribution.

  • The null hypothesis H0: is that the common pipistrelle observation data for September and October are identical populations.
  • The alternate hypothesis H1: is that the common pipistrelle observation data of September and October are not identical populations.

To test the hypothesis, we apply the wilcox.test function to compare the independent samples.

Show the code
library(broom)

mw <- wilcox.test(Passes ~ Month, data=Count30) 

#Tidy the out put into a table
tidy(mw) %>% 
  gt() %>% 
  tab_style(
    style = list(
      cell_fill(color = "black"),
      cell_text(color = "white", weight = "bold")
      ),
    locations = cells_column_labels(
      columns = c(everything())
    )
  ) 
Table 7.2: Summary Statistics for Common Pipistrell Passes
statistic p.value method alternative
4533 0.0129358 Wilcoxon rank sum test with continuity correction two.sided

Results from the the wilcox.test are shown in Table 7.2; as the p-value (Glossary B.5.4) turns out to be 0.0129358, and is not less than the .01 significance level2, we do not reject the null hypothesis. At .01 significance level (p < 0.01), we conclude that the common pipistrelle observation data for September and October are identical populations.

7.4 Comparing More Than Two Groups

7.4.1 An Obvious Difference

The statics data in the iBats package has some interesting Barbastelle (Barbastella barbastellus) bat activity, it would be interesting and aid our understanding if we compare the activity between static locations and see if they are significantly different.

Before undertaking any statistical test always visualise the data the statistical test is applied to; here the box plot is used (Glossary B.2.2). Figure 7.2 shows the Barbastelle recorded at several locations.

The average bat pass rate per hour can been calculated for each night and each static location using the formula shown in Equation 7.1:

\[AverageActivity = \frac{Batpasses}{Nightlength} \tag{7.1}\]

Where: Batpasses = number of bat pass during the night at the location
Nightlength = length of the night in decimal hours
AverageActivity = average (mean) number of bat passes per hour for each night there was activity

Show the code
# Add data and time information to the statics data using the iBats::date_time_info
statics_plus <- iBats::date_time_info(statics)

# Add sun and night time metrics to the statics data using the iBats::sun_night_metrics() function.
statics_plus <- iBats::sun_night_metrics(statics_plus)


graph_data <- statics_plus %>%
    filter(Species == "Barbastella barbastellus") %>% 
    group_by(Description, Night, night_length_hr) %>%
    # count number of passes per night by species - makes coloumn "n""
    tally() %>% 
    # calculate average bat passes per hour for each Night and species
    mutate(ave_act_per_hr = n / night_length_hr) %>%
    # Remove Night Length column from the Table
    select(-night_length_hr, -n) 

ggplot(graph_data, aes(y = ave_act_per_hr, x = Description)) +
                  geom_jitter(fill = "#1f78b4", #Barbastelle colour
                              colour = "black", 
                              shape = 23, 
                              alpha = 0.7, 
                              size = 3) +
                  geom_boxplot(colour = "grey30", fill = "transparent") + 
                  labs(y = "Barbastelle Activity  \n(Nightly Average Passes per Hour)") +
                  theme_bw() + 
                  theme(legend.position = "none", # No legend
                  axis.text.x = element_text(size=12, face="bold"), 
                  axis.text.y = element_text(size=12,face="bold"), 
                  axis.title.y = element_text(face="bold"), 
                  axis.title.x = element_blank(), # no y title (just bat names)
                  panel.grid.major = element_blank(), #remove grid lines
                  panel.grid.minor = element_blank())
Figure 7.2: Barbastelle Activity at Each Static (Nightly Average Passes per Hour)

The comparison of locations is undertaken with the Kruskal-Wallis test (Glossary B.5.2). The Kruskal–Wallis test is a rank-based test that is similar to the Wilcoxon signed rank test but can be applied to one-way data with more than two groups. If there are just two locations the Wilcoxon signed rank test should be applied. The Kruskal–Wallis test may be used when there are only two samples, but the Wilcoxon signed rank test is more powerful for two samples and is preferred. Both tests assume that the observations are independent. The probability threshold for statistical significance, which should always be chosen before the test is undertaken, is: P < 0.05.

  • The Null Hypothesis: bat pass rates per hour are from distributions with the same median.

  • The Alternative Hypothesis: bat pass rates per hour are from distributions with a different median.

The function kruskal.test, from base R, is used to undertake the Kruskal-Wallis test. A rule of thumb for the Kruskal–Wallis test is each group, (in this case case the number of nightly average bats pace vales for each static location) must have a sample size of 5 or more.

Show the code
# Add data and time information to the statics data using the iBats::date_time_info
statics_plus <- iBats::date_time_info(statics)

# Add sun and night time metrics to the statics data using the iBats::sun_night_metrics() function.
statics_plus <- iBats::sun_night_metrics(statics_plus)


test_data <- statics_plus %>%
  filter(Species == "Barbastella barbastellus") %>%
  group_by(Description, Night, night_length_hr) %>%
  # count number of passes per night by species - makes coloumn "n""
  tally() %>%
  # calculate average bat passes per hour for each Night and species
  mutate(ave_act_per_hr = n / night_length_hr) %>%
  # Remove Night Length column from the Table
  select(-night_length_hr, -n)


# Check at least 2 locations and a minimum of 5 observations per location
# Only do KW on locations with 5 or more observations
# if just two locations do Mann Whitney
check_data <- test_data %>%
  group_by(Description) %>%
  tally()

# filter for Statics with more than 5 values
StaticsWithPlus5 <- check_data %>%
  filter(n >= 5) %>%
  pull(Description)

test_data <- test_data %>%
  filter(Description %in% StaticsWithPlus5)

# Extract the p-value from the kruskal.test
stat_pvalue <- kruskal.test(ave_act_per_hr ~ Description, data = test_data)$p.value

With reference to Figure 7.2 there are several static locations where the activity can be compared, this is more than two locations, therefore the Kruskal-Wallis test is undertaken rather than the Mann-Whitney-Wilcoxon test. Location 1 with only three results is excluded from the test.

The Kruskal-Wallis test undertaken for the Barbastelle at the following static locations: Static 2, Static 3, Static 4, and Static 5 produced a P value (9.17e-08) less than the chosen threshold for statistical significance of 0.05; therefore the null hypothesis is rejected, activity between some static locations is likely to be different.

What the Kruskal-Wallis test does not indicate, is which static locations are different; to determine this we need to undertake post hoc testing, this can be undertaken with the Dunn’s Test (Glossary B.5.2).

Show the code
library(dunn.test)

dunn_result <- dunn.test(test_data$ave_act_per_hr, factor(test_data$Description), method = "bonferroni")

Results of the Dunn’s test, performed after the Kruskal-Wallis test (post hoc), are shown in Table 7.3.

Show the code
df <- tibble(dunn_result$comparisons, dunn_result$P.adjusted)

colnames(df) <- c("Comparison", "P.adj")

resultsTable <- df %>%
  filter(P.adj < 0.05) %>%
  select(`Locations with a significant difference (P<0.05)` = Comparison, `adjusted P ` = P.adj)

resultsTable %>%
  flextable(col_keys = colnames(.)) %>%
  fontsize(part = "header", size = 12) %>%
  fontsize(part = "body", size = 12) %>%
  bold(part = "header") %>%
  autofit(add_w = 0.1, add_h = 0.1) %>%
  bg(bg = "black", part = "header") %>%
  color(color = "white", part = "header") %>%
  align(align = "center", part = "header") %>%
  align(j = 2, align = "right", part = "body") %>%
  bold(j = 1, bold = TRUE, part = "body")     
Table 7.3: Results of Post-hoc testing with the Dunn’s Test

Locations with a significant difference (P<0.05)

adjusted P

Static 2 - Static 4

0.0000002372373

Static 3 - Static 4

0.0004837731033

Static 4 - Static 5

0.0013231129871

The Dunn’s Test carries out multiple comparisons therefore a P value adjustment needs to be made to avoid a false significant result. For this P value adjustment the Bonferroni method is applied; a simple technique for controlling the overall probability of a false significant result when multiple comparisons are to be carried out.

Table 7.3 gives the results of the post hoc testing with the Dunn’s test and shows that Barbastelle activity at Location 4 is significantly different (at P<0.05) to activity at Locations 2, 3, and 5.

The statistical tests show that Barbastelle bat activity recorded at Location 4 is significantly different; this knowledge is evidence based and can be stated with confidence when reporting.

7.4.2 Less Obvious Difference

Figure 7.3 shows Common pipistrelle activity for the static locations comparison of activity can be undertaken with the Kruskal-Wallis test with the following hypothesis:

  • The Null Hypothesis: bat pass rates per hour are from distributions with the same median.

  • The Alternative Hypothesis: bat pass rates per hour are from distributions with a different median.

Show the code
# Add data and time information to the statics data using the iBats::date_time_info
statics_plus <- iBats::date_time_info(statics)

# Add sun and night time metrics to the statics data using the iBats::sun_night_metrics() function.
statics_plus <- iBats::sun_night_metrics(statics_plus)


graph_data <- statics_plus %>%
    filter(Species == "Pipistrellus pipistrellus") %>% 
    group_by(Description, Night, night_length_hr) %>%
    # count number of passes per night by species - makes coloumn "n""
    tally() %>% 
    # calculate average bat passes per hour for each Night and species
    mutate(ave_act_per_hr = n / night_length_hr) %>%
    # Remove Night Length column from the Table
    select(-night_length_hr, -n) 

ggplot(graph_data, aes(y = ave_act_per_hr, x = Description)) +
                  geom_jitter(fill = "#ffff99", #Common Pipistrelle colour
                              colour = "black", 
                              shape = 23, 
                              alpha = 0.7, 
                              size = 3) +
                  geom_boxplot(colour = "grey30", fill = "transparent") + 
                  labs(y = "Common Pipistrelle Activity  \n(Nightly Average Passes per Hour)") +
                  theme_bw() + 
                  theme(legend.position = "none", # No legend
                  axis.text.x = element_text(size=12, face="bold"), 
                  axis.text.y = element_text(size=12,face="bold"), 
                  axis.title.y = element_text(face="bold"), 
                  axis.title.x = element_blank(), # no y title (just bat names)
                  panel.grid.major = element_blank(), #remove grid lines
                  panel.grid.minor = element_blank())
Figure 7.3: Common Pipistrelle Activity at Each Static (Nightly Average Passes per Hour)
Show the code
# Add data and time information to the statics data using the iBats::date_time_info
statics_plus <- iBats::date_time_info(statics)

# Add sun and night time metrics to the statics data using the iBats::sun_night_metrics() function.
statics_plus <- iBats::sun_night_metrics(statics_plus)


test_data <- statics_plus %>%
  filter(Species == "Pipistrellus pipistrellus") %>%
  group_by(Description, Night, night_length_hr) %>%
  # count number of passes per night by species - makes coloumn "n""
  tally() %>%
  # calculate average bat passes per hour for each Night and species
  mutate(ave_act_per_hr = n / night_length_hr) %>%
  # Remove Night Length column from the Table
  select(-night_length_hr, -n)


# Check at least 2 locations and a minimum of 5 observations per location
# Only do KW on locations with 5 or more observations
# if just two locations do Mann Whitney
check_data <- test_data %>%
  group_by(Description) %>%
  tally()

# filter for Statics with more than 5 values
StaticsWithPlus5 <- check_data %>%
  filter(n >= 5) %>%
  pull(Description)

test_data <- test_data %>%
  filter(Description %in% StaticsWithPlus5)

# Extract the p-value from the kruskal.test
stat_pvalue <- kruskal.test(ave_act_per_hr ~ Description, data = test_data)$p.value

The Kruskal-Wallis test undertaken for the Common pipistrelle at the following static locations: Static 1, Static 2, Static 3, Static 4, and Static 5 produced a P value (3.30e-16) less than the chosen threshold for statistical significance of 0.05; therefore the null hypothesis is rejected, activity between some static locations is likely to be different. The Dunn’s test can be applied to determine the static locations that are different.

Show the code
dunn_result <- dunn.test(test_data$ave_act_per_hr, factor(test_data$Description), method = "bonferroni")

Results of the Dunn’s test, performed after the Kruskal-Wallis test (post hoc), are shown in Table 7.4.

Show the code
df <- tibble(dunn_result$comparisons, dunn_result$P.adjusted)

colnames(df) <- c("Comparison", "P.adj")

resultsTable <- df %>%
  filter(P.adj < 0.05) %>%
  select(`Locations with a significant difference (P<0.05)` = Comparison, `adjusted P ` = P.adj)

resultsTable %>%
  flextable(col_keys = colnames(.)) %>%
  fontsize(part = "header", size = 12) %>%
  fontsize(part = "body", size = 12) %>%
  bold(part = "header") %>%
  autofit(add_w = 0.1, add_h = 0.1) %>%
  bg(bg = "black", part = "header") %>%
  color(color = "white", part = "header") %>%
  align(align = "center", part = "header") %>%
  align(j = 2, align = "right", part = "body") %>%
  bold(j = 1, bold = TRUE, part = "body")     
Table 7.4: Results of Post-hoc testing with the Dunn’s Test

Locations with a significant difference (P<0.05)

adjusted P

Static 1 - Static 2

0.0000939731690135662070

Static 1 - Static 3

0.0460478959637407314620

Static 1 - Static 4

0.0000000000000007857217

Static 2 - Static 4

0.0000001079868368327319

Static 3 - Static 4

0.0000000296352155723235

Static 4 - Static 5

0.0000000014826746844802

Table 7.4 gives the results of the post hoc testing with the Dunn’s test; post P value adjustment with the Bonferroni method showing:

  • Common pipistrelle activity at static 4 is significantly different (at P<0.05) to all the other static locations.
  • Common pipistrelle activity at static 1 is significantly different to static locations 2, 3 and 4. The significant difference (at P<0.05) between static 1 and static 3 is not easily determined from Figure 7.3.
  • Common pipistrelle activity at static 2, static 3, and static 5 are not significantly different (at P<0.05).

7.5 Bootstrap Confidence Intervals

Table 7.5 gives summary statistics for Barbastelle (Barbastelle barbastellus) pass observations per night at Static 5.

Show the code
# Libraries (Packages) used
library(tidyverse)
library(mosaic)
library(gt)
library(gtExtras)
library(iBats)
library(infer)

# Add data and time information to the statics data using the iBats::date_time_info
statics_plus <- iBats::date_time_info(statics)

# Group by Description and Night and Count the Observations
barb_static5 <- statics_plus %>% 
  filter(Species == "Barbastella barbastellus", Description == "Static 5") %>% 
  group_by(Description, Night) %>% 
  tally() 

# The summary statistics are saved into a variable riven_cond_stats 
cond_stats <- favstats(n~Description, data = barb_static5)

# riven_cond_stats is made into a the table (using the code below)
cond_stats %>% 
  # Create the table with the gt package
  gt() %>% 
  # Style the header to black fill and white text
  tab_style(
    style = list(
      cell_fill(color = "black"),
      cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels(
      columns = c(everything())
    )
  ) %>% 
  #gt_color_rows(median, palette = "ggsci::yellow_material") %>% 
  tab_options(data_row.padding = px(2)) 
Table 7.5: Barbastelle Observations (Passes) at Static 5
Description min Q1 median Q3 max mean sd n missing
Static 5 1 1 2 4 11 3.173913 2.639769 23 0

For Static 5 there is a mean of 3.17 passes per night for the 23 nights the bat detector was deployed. It is valuable to know the confidence of this mean number of passes observed. The population of passes, the number of Barbastelle’s passes for every night in history, is not available; in practice there is only the sample of data and not the entire population. The bootstrap (Glossary B.3.2) is a statistical method that allows us to approximate the sampling distribution (and therefore confidence intervals) even without access to the population.

In bootstrapping the sample itself becomes the population. Samples from the population are drawn many times over and every time from all the population. This process is called re-sampling with replacement and creates a bootstrap distribution. The infer package has functions for re-sampling with replacement and estimating confidence limits.

Show the code
# bootstrap the number of Barb pass 1000 time and calculate the mean 
bootstrap_distribution <- barb_static5 %>% 
  specify(response = n) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

#calculation the 95% confidence intervals
percentile_ci <- bootstrap_distribution %>% 
  get_confidence_interval(level = 0.95, type = "percentile")

The 95% confidence interval is computed by piping bootstrap_distribution into the get_confidence_interval() function from the infer package, with the confidence level set to 0.95 and the confidence interval type to be “percentile”. The results are saved in percentile_ci.

The 95% confidence interval is the middle 95% of values of the bootstrap distribution, the 2.5th and 97.5th percentiles, which are percentile_ci$lower_ci = 2.22 and percentile_ci$lower_ci = 4.31, respectively. This is known as the percentile method for constructing confidence intervals.

The confidence interval (2.22, 4.31) can be visualized, see Figure 7.4, by piping the bootstrap_distribution data frame into the visualize() function and adding a shade_confidence_interval() layer. The endpoints argument are set to the percentile_ci.

Show the code
visualize(bootstrap_distribution) + 
  shade_confidence_interval(endpoints = percentile_ci)
Figure 7.4: Boostrap Distribution of the Mean Barbastelle Observations (Passes) at Static 5

The bootstrapping of confidence intervals can be used for all the factors in a description (e.g. static locations), the month of activity or the species observed. The code below creates Table 7.6; confidence intervals for nightly species observations (passes) where there is more than 10 bat observations (passes) for the species in total. Figure 7.5 illustrates the mean nightly species observations with error bars indicating the confidence intervals.

Show the code
#List species with more than 10 passes observed
SpeciesList <- statics_plus %>%
  group_by(Species) %>% 
  count() %>% 
  filter(n > 10)

SpeciesList <- levels(factor(SpeciesList$Species))

Table_ci <- NULL #blank data.frame for table

for(BatName in SpeciesList) {

  
  night_obs <- statics_plus %>% 
    #filter(Species == "Pipistrellus pipistrellus", Description == static) %>% 
    #group_by(Description, Night) %>% 
    filter(Species == BatName) %>% 
    group_by(Species, Night) %>% 
    tally() 
  
  # Calculate the mean
  species_mean <- mean(night_obs$n)
  
  # bootstrap the number of Barb pass 1000 time and calculate the mean 
  bootstrap_distribution <- night_obs %>% 
    specify(response = n) %>% 
    generate(reps = 1000, type = "bootstrap") %>% 
    calculate(stat = "mean")
  
  #calculation the 95% confidence intervals
  species_ci <- bootstrap_distribution %>% 
    get_confidence_interval(level = 0.95, type = "percentile")
  
  species_ci$mean <- species_mean
  
  species_ci$species <- BatName
  
  Table_ci <- rbind(Table_ci, species_ci)
  
}

Table_ci %>% 
  # round numbers to 1 decimal place
  mutate(mean = round(mean, digits = 1),
         lower_ci = round(lower_ci, digits = 1),
         upper_ci = round(upper_ci, digits = 1)) %>% 
  #Arrange descending for readability
  arrange(desc(mean)) %>% 
  # configure column headings
  select(Species = species, Mean = mean, `Lower 95% ci` = lower_ci, `Upper 95% ci` = upper_ci) %>% 
  gt() %>% 
  # Style the header to black fill and white text
  tab_style(
    style = list(
      cell_fill(color = "black"),
      cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels(
      columns = c(everything())
    )
  ) %>% 
    # Make bat scientific name italic
  tab_style(
    style = list(
      cell_text(style = "italic")
      ),
    locations = cells_body(
      columns = c(Species)
    )
  ) %>% 
  tab_footnote(
    footnote = "Species with more than 10 bat observations (passes) in total",
    locations = cells_column_labels(
      columns = Species
    )) %>% 
  tab_options(data_row.padding = px(2)) 
Table 7.6: Nightly Species Observations (Passes) Mean and 95% Confidence Intervals
Species1 Mean Lower 95% ci Upper 95% ci
Pipistrellus pipistrellus 77.7 54.1 104.9
Myotis spp. 9.5 7.5 11.9
Barbastella barbastellus 8.7 6.3 11.8
Nyctalus noctula 6.1 4.4 8.1
Pipistrellus spp. 5.7 4.2 7.4
Plecotus spp. 4.5 3.2 6.0
Pipistrellus nathusii 3.6 2.6 4.6
Pipistrellus pygmaeus 3.6 2.5 4.9
Rhinolophus ferrumequinum 3.6 2.3 5.6
Rhinolophus hipposideros 2.1 1.7 2.6
Eptesicus serotinus 1.3 1.0 1.7
1 Species with more than 10 bat observations (passes) in total
Show the code
graph_bat_colours <- iBats::bat_colours_default(Table_ci$species)

Table_ci %>% 
  ggplot(aes(x=reorder(species, mean), y=mean, fill=species)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=lower_ci, ymax=upper_ci),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9)) +
  scale_fill_manual(values = graph_bat_colours) +
  coord_flip() +
  labs(
    y = "Mean Nightly Species Observations") +
  theme_bw() +
  theme(
    legend.position = "none", # No legend
    axis.text.x = element_text(size = 10, angle = 0, face = "bold"),
    axis.text.y = element_text(size = 10, face = "bold.italic"), # bat names italic
    axis.title.y = element_blank(), # no y title (just bat names)
    axis.title.x = element_text(size = 10), # no x title
    panel.grid.major = element_blank(), # remove grid lines
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.grid.major.x = element_line(colour = "grey20", linewidth = 0.1, linetype = "dashed")
  )
Figure 7.5: Nightly Species Observations (Passes) and 95% Confidence Intervals (error bars)

7.6 Bat Assemblage

The assessment of individual species ignores the fact there is also the species assemblage; the taxonomically related group of species (i.e. bats) occupying the same geographical area at the same time. The assemblage of bat species can be explored using multivariate methods.

For most surveys the number of species observed is greater than one, making it a biological assemblage. We can relate this assemblage to other factors such as location, habitat, time, weather, etc. Multivariate analysis gives a way of exploring the bat assemblage differences with respect to factors such as the location/habitat or month/year or combination of both.

7.6.1 Cluster Analysis

Table 7.7 shows a summary matrix of bat passes per species for each static (i.e. location) We would like to know which locations are similar/dissimilar.

Show the code
Tab_L_S <- statics %>% # Location - Species
  group_by(Species, Description) %>% 
  count() 

maxValue <- max(Tab_L_S$n, na.rm = T)

Tab_L_S <- Tab_L_S %>% 
  pivot_wider(names_from = Species, values_from = n) %>% 
  replace(is.na(.), 0)

ncols <- ncol(Tab_L_S)

#Make coloured palette
colourer <- scales::col_numeric(
  palette = c("transparent", "forestgreen"),
  domain = c(0, maxValue))

Tab_L_S %>% 
  flextable() %>% 
    bold(part = "header") %>% 
    bg(bg = "black", part = "header") %>% 
    color(color = "white", part = "header") %>% 
    rotate(j = 2:ncols, rotation = "tbrl", align = "center", part = "header") %>% 
    height_all(height = 2.3, part = "header") %>% 
    hrule(rule = "exact", part = "header") %>%
    bg(bg = colourer, j = 2:ncols, part = "body") %>%
    width(j = 2:ncols, width = 0.4) %>% 
    width(j = 1, width = 0.8)
Table 7.7: Matrix of Location by Species

Description

Barbastella barbastellus

Eptesicus serotinus

Myotis nattereri

Myotis spp.

Nyctalus leisleri

Nyctalus noctula

Nyctalus spp.

Pipistrellus nathusii

Pipistrellus pipistrellus

Pipistrellus pygmaeus

Pipistrellus spp.

Plecotus spp.

Rhinolophus ferrumequinum

Rhinolophus hipposideros

Static 1

4

0

0

67

0

35

0

2

147

3

4

0

0

0

Static 2

28

4

1

168

0

201

0

10

758

22

36

95

60

28

Static 3

10

0

1

41

3

35

0

18

263

15

19

20

81

36

Static 4

304

8

0

85

0

28

0

2

3,569

39

136

20

14

8

Static 5

73

0

0

74

0

16

2

0

235

7

88

1

5

1

In multivariate analysis the pre-treament of data (sometimes in more than one way) is usually desirable (Glossary B.7.1). For assemblage data, transformations will reduce the dominant contribution of abundant species (i.e. all those pips). Transformations include (None, Square root, Fourth root, Log(X+1), Presence/absence). Table 7.8 shows Table 7.7 transformed with the species \(\sqrt[2]{count}\).

Show the code
SpeciesC <- Tab_L_S %>% 
  ungroup() %>% 
  select(2:ncol(.)) %>% 
  mutate_all(funs(sqrt(.)))

Table <- SpeciesC %>% 
  mutate_all(funs(round(., 2)))

Matrix <- Tab_L_S %>%
  ungroup() %>% 
  select(Description) %>% 
  bind_cols(SpeciesC)


MatrixTable <- Tab_L_S %>%
  ungroup() %>% 
  select(Description) %>% 
  bind_cols(Table)

Matrix <- data.matrix(Matrix[, 2:ncol(Matrix)])

r_names <- Tab_L_S %>%
  ungroup() %>% 
  pull(Description)

rownames(Matrix) <- r_names

# Number of columns in table
ncols <- ncol(MatrixTable)

#Make coloured palette
colourer <- scales::col_numeric(
  palette = c("transparent", "forestgreen"),
  domain = c(0, maxValue))

MatrixTable %>% 
  flextable() %>% 
    bold(part = "header") %>% 
    bg(bg = "black", part = "header") %>% 
    color(color = "white", part = "header") %>% 
    rotate(j = 2:ncols, rotation = "tbrl", align = "center", part = "header") %>% 
    height_all(height = 2.3, part = "header") %>% 
    hrule(rule = "exact", part = "header") %>%
    bg(bg = colourer, j = 2:ncols, part = "body") %>%
    width(j = 2:ncols, width = 0.4) %>% 
    width(j = 1, width = 0.8)
Table 7.8: Transformed Matrix

Description

Barbastella barbastellus

Eptesicus serotinus

Myotis nattereri

Myotis spp.

Nyctalus leisleri

Nyctalus noctula

Nyctalus spp.

Pipistrellus nathusii

Pipistrellus pipistrellus

Pipistrellus pygmaeus

Pipistrellus spp.

Plecotus spp.

Rhinolophus ferrumequinum

Rhinolophus hipposideros

Static 1

2.00

0.00

0

8.19

0.00

5.92

0.00

1.41

12.12

1.73

2.00

0.00

0.00

0.00

Static 2

5.29

2.00

1

12.96

0.00

14.18

0.00

3.16

27.53

4.69

6.00

9.75

7.75

5.29

Static 3

3.16

0.00

1

6.40

1.73

5.92

0.00

4.24

16.22

3.87

4.36

4.47

9.00

6.00

Static 4

17.44

2.83

0

9.22

0.00

5.29

0.00

1.41

59.74

6.24

11.66

4.47

3.74

2.83

Static 5

8.54

0.00

0

8.60

0.00

4.00

1.41

0.00

15.33

2.65

9.38

1.00

2.24

1.00

To assess the differences of activity (in the bat assemblage) between locations (or months, habitat, etc.) we measure the distance between every point of activity with every other point. So activity (e.g. a median or count) at one location for a species is compared(measured) with every other activity for all the species and locations.

There are many distance measurements to investigate species similarity/dissimilarity3 for more information see the Glossary B.7.5

In this example the Bray-Curtis distance measurement (Glossary B.7.10) has been applied to Table 7.8 to produce the Bray-Curtis dissimilarity matrix in Table 7.9.

(For species similarities it may be worth considering the removal of the rarer species (i.e. the less observed) and repeating the analysis.)

Show the code
library(vegan)

dist_bray <- vegdist(Matrix, method="bray")

dist_df <- tibble(dist_bray)

fmt2 <- function(dist_num) {
  
  temp <- as.character(dist_num)
  
  stringr::str_sub(temp, 1L, 7L)
  
}

c1 <- c("Static 2", "Static 3", "Static 4", "Static 5")
c2 <- c(fmt2(dist_df[1,1]), fmt2(dist_df[2,1]), fmt2(dist_df[3,1]), fmt2(dist_df[4,1]))
c3 <- c(" ", fmt2(dist_df[5,1]), fmt2(dist_df[6,1]), fmt2(dist_df[7,1]))
c4 <- c(" ", " ", fmt2(dist_df[8,1]), fmt2(dist_df[9,1]))
c5 <- c(" ", " ", " ", fmt2(dist_df[10,1]))

dist_tbl <- tibble(c1, c2, c3, c4, c5)

colnames(dist_tbl) <- c(" ", "Static 1", "Static 2", "Static 3", "Static 4")

ft <- dist_tbl
ncols <- ncol(ft)

ft %>% 
  flextable() %>% 
    bold(part = "header") %>% 
    bold(j=1, part = "body") %>% 
    bg(bg = "black", part = "header") %>% 
    bg(j=1, bg = "black", part = "body") %>% 
    color(color = "white", part = "header") %>% 
    color(j=1, color = "white", part = "body")
Table 7.9: Bray-Curtis Dissimilarity Matrix

Static 1

Static 2

Static 3

Static 4

Static 2

0.49805

Static 3

0.36661

0.25769

Static 4

0.58613

0.35422

0.45871

Static 5

0.31352

0.40026

0.33401

0.41084

The Bray-Curtis dissimilarity matrix can be used in many multivariate methods; here the matix is used in Hierarchical Clustering (Glossary B.7.13). Hierarchical clustering is useful because it can create a tree-based representation of the observations called a dendrogram (Glossary B.7.15); that is easy to interpret visually. The dendrogram from the Bray-Curtis dissimilarity matrix in Table 7.9 is shown in Figure 7.6.

Show the code
# Hierarchical cluster function hclust
hc_bray <- hclust(dist_bray, method = "complete")

# Always visualse
library(ggdendro)
#Creates Dendrogram Plot Using ggplot.
ggdendrogram(hc_bray, rotate = TRUE, size = 2) +
  labs(title = "Static Location - Complete Cluster - Bray-Curtis")

dhc <- as.dendrogram(hc_bray)
Figure 7.6: Dendrogram Between Static Locations

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. Clustering involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Referring to the dendrogram in Figure 7.6 the lower in the tree fusions occur, the more similar the groups of observations are to each other. While observations that fuse later, near the top of the tree, can be considered less similar.

For Figure 7.6 we can see that for the bat assemblage:

  • Statics 3 & 2 are similar
  • Statics 1 & 5 are similar
  • Static 4 is more similar to Statics 2 & 3 than Statics 1 & 5
  • Statics 2 & 3 and Statics 1 & 5 are more similar to Static 4 - than Statics 2 & 3 and Statics 1 & 5 are to each other

Note, the dendrogram is illustrative and we cannot draw conclusions about the degree of similarity of two observations based on their proximity along the vertical axis.

This is just an introduction to Hierarchical Clustering, there are many variations and other types of clustering (e.g. k-means). Key variations for hierarchical clustering are:

  • distance measurement used e.g. Bray-Curtis, Euclidean, etc.
  • type of linkage e.g. complete, average, single or centroid.

7.6.2 K-Means Clustering

K-means clustering (Glossary B.7.17) is a simple approach for partitioning a data set into K distinct non-overlapping clusters. To perform K means clustering we must specify the desired number of clusters first. The code below applies k-means clustering to the Bray-Curtis dissimilarity matrix in Table 7.9 with K=2.

Show the code
result_km <- kmeans(dist_bray, centers = 2)
Show the code
clusters <- tibble(names(result_km$cluster), result_km$cluster)

colnames(clusters) <- c("Location", "Cluster")

clusters %>% 
  gt() %>% 
  tab_style(
    style = list(
      cell_fill(color = "black"),
      cell_text(color = "white", weight = "bold")
      ),
    locations = cells_column_labels(
      columns = c(everything())
    )
  ) %>% 
  cols_width(everything() ~ px(150))
Table 7.10: k-means Cluster and Location
Location Cluster
Static 1 2
Static 2 1
Static 3 2
Static 4 1
Static 5 2
Show the code
tidy(result_km) %>% 
  gt() %>% 
  tab_style(
    style = list(
      cell_fill(color = "black"),
      cell_text(color = "white", weight = "bold")
      ),
    locations = cells_column_labels(
      columns = c(everything())
    )
  )
Table 7.11: Summary on a Per-Cluster Level
Static 1 Static 2 Static 3 Static 4 Static 5 size withinss cluster
0.5420950 0.1771132 0.3582046 0.1771132 0.4055556 2 0.1496150 1
0.2267147 0.3853388 0.2335433 0.4852314 0.2158467 3 0.2765864 2
Show the code
glance(result_km) %>% 
  gt() %>% 
  tab_style(
    style = list(
      cell_fill(color = "black"),
      cell_text(color = "white", weight = "bold")
      ),
    locations = cells_column_labels(
      columns = c(everything())
    )
  )
Table 7.12: k-means Overall Summary Statistics
totss tot.withinss betweenss iter
0.7733486 0.4262014 0.3471472 1

The output of kmeans is a list of information; the most important:

  • cluster: Table 7.10 & Table 7.11 A vector of integers (from 1:K) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centers.
  • totss: Table 7.12 The total sum of squares.
  • withinss: Table 7.11 Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Table 7.12 Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: Table 7.12 The between-cluster sum of squares, i.e. \(totss-tot.withinss\).
  • size: The number of points in each cluster.
  • iter: Table 7.12 The number of (outer) iterations.

7.6.3 Non-metric Multidimensional Scaling (NMDS)

Non-metric Multidimensional Scaling (Glossary B.7.20) is an ordination technique that reduces dimensionality in multivariate data sets such as the count of bats in Table 7.13. Dimensionality reduction simply refers to the process of reducing the number of attributes in a dataset while keeping as much of the variation in the original dataset as possible. The goal of NMDS is to collapse information from multiple dimensions (e.g, from multiple communities, sites, etc.) into just a few, so that they can be visualized, (e.g. as a 2D or 3D plot), and interpreted.

Unlike other ordination techniques that rely on (primarily Euclidean) distances, such as Principal Component Analysis (PCA), NMDS uses rank orders, and thus is an extremely flexible technique that can accommodate a variety of different kinds of data (K. R. Clarke 1993), (K. Robert Clarke 1999); NMDS does require at least one observation per factor (e.g. site or month). In R, the metaMDS() function of the vegan package can execute a NMDS. As input, it expects a community matrix with the factors as rows and the species as columns; Table 7.13 has two factors, static location and month, that are treated separately.

Before undertaking ordination it is best to transform the data; this helps reduce the influence of large counts from dominant species (i.e. Common pipistrelles). Table 7.14 is Table 7.13 transformed with \(\sqrt[4]{count}\).

A measure of how successful the NMDS reduction is stress4. The lower the stress value, the better the ordination. A rule of thumb for stress values:

  • stress < 0.05 provides an excellent representation in reduced dimensions
  • stress < 0.1 is great,
  • stress < 0.2 is good,
  • stress > 2.0 < 0.3 interpret with caution
  • stress > 0.3 provides a poor representation.

Figure 7.7 (a) shows the two dimensional plot after the NMDS is applied through metaMDS() function to the transformed data shown in Table 7.14; the NMDS produced a good stress of 0.14. Figure 7.7 (b) is the Shepard plot (Glossary B.7.20.1) showing the scatter around the regression between the interpoint distances in the final configuration (i.e., the distances between each pair of communities) against their original dissimilarities; the desirable outcome from the Shepard plot is a diagonal line with limited scatter in this respect Figure 7.7 (b) looks reasonable.

Figure 7.8 shows the 2D Figure 7.7 (a): with Figure 7.8 (a) labelled for static location and Figure 7.8 (b) labelled for month (Figure 7.8 (b)).

Show the code
library(tidyverse)
library(glue)
library(ggthemes) # for colour pallet "Tableau 10"

Attaching package: 'ggthemes'
The following object is masked from 'package:mosaic':

    theme_map
Show the code
library(flextable)
library(iBats)
library(vegan)

# Add data and time information to the statics data using the iBats::date_time_info
statics_plus <- iBats::date_time_info(statics)

# Count of bat species by location and month
description_month_count <- statics_plus %>% # Location - Species
  group_by(Species, Description, Month) %>% 
  count() %>% 
  pivot_wider(names_from = Species, values_from = n) %>% 
  replace(is.na(.), 0)

# Number of columns in table
ncols <- ncol(description_month_count)

description_month_count %>% 
  flextable() %>% 
    bold(part = "header") %>% 
    bg(bg = "black", part = "header") %>% 
    color(color = "white", part = "header") %>% 
    rotate(j = 3:ncols, rotation = "tbrl", align = "center", part = "header") %>% 
    height_all(height = 2.3, part = "header") %>% 
    hrule(rule = "exact", part = "header") %>%
    width(j = 3:ncols, width = 0.4) %>% 
    width(j = 1:2, width = 0.8)
Table 7.13: Count of Species by the Meta Data; Location and month

Description

Month

Barbastella barbastellus

Eptesicus serotinus

Myotis nattereri

Myotis spp.

Nyctalus leisleri

Nyctalus noctula

Nyctalus spp.

Pipistrellus nathusii

Pipistrellus pipistrellus

Pipistrellus pygmaeus

Pipistrellus spp.

Plecotus spp.

Rhinolophus ferrumequinum

Rhinolophus hipposideros

Static 1

Aug

4

0

0

58

0

12

0

0

20

0

1

0

0

0

Static 2

May

7

1

0

1

0

4

0

2

128

0

6

2

2

2

Static 2

Jun

6

2

1

3

0

32

0

0

163

0

5

0

23

6

Static 2

Aug

10

1

0

103

0

137

0

0

260

17

6

14

18

4

Static 2

Sep

3

0

0

35

0

4

0

8

108

5

16

32

9

7

Static 2

Oct

2

0

0

9

0

2

0

0

20

0

2

47

1

7

Static 3

Jun

2

0

1

2

0

8

0

0

33

0

1

0

13

3

Static 3

Aug

2

0

0

22

3

19

0

0

153

5

5

6

6

15

Static 3

Sep

6

0

0

17

0

6

0

17

40

9

10

13

55

10

Static 4

Jun

161

0

0

9

0

7

0

0

379

1

9

0

7

2

Static 4

Jul

22

2

0

20

0

1

0

0

1,644

6

48

1

0

1

Static 4

Aug

83

6

0

37

0

20

0

0

1,318

29

48

19

6

4

Static 4

Oct

38

0

0

19

0

0

0

2

228

3

31

0

1

1

Static 5

Jun

29

0

0

2

0

12

2

0

66

0

9

1

4

0

Static 5

Jul

27

0

0

25

0

0

0

0

65

1

22

0

0

1

Static 5

Aug

10

0

0

37

0

4

0

0

72

6

42

0

1

0

Static 5

Oct

7

0

0

10

0

0

0

0

32

0

15

0

0

0

Static 1

May

0

0

0

1

0

4

0

0

4

0

0

0

0

0

Static 1

Sep

0

0

0

8

0

3

0

2

11

3

2

0

0

0

Static 2

Jul

0

0

0

17

0

22

0

0

79

0

1

0

7

2

Static 1

Jun

0

0

0

0

0

16

0

0

112

0

1

0

0

0

Static 3

May

0

0

0

0

0

2

0

1

37

1

3

1

7

8

Show the code
species_count <- description_month_count[, 3:ncol(description_month_count)]
meta_data <- description_month_count[, 1:2]

Transformed_Data <- (species_count)^(1/4)

# Number of columns in table
ncols <- ncol(Transformed_Data)

Transformed_Data %>% 
  flextable() %>% 
    bold(part = "header") %>% 
    bg(bg = "black", part = "header") %>% 
    color(color = "white", part = "header") %>% 
    rotate(j = 1:ncols, rotation = "tbrl", align = "center", part = "header") %>% 
    height_all(height = 2.3, part = "header") %>% 
    hrule(rule = "exact", part = "header") %>%
    width(j = 1:ncols, width = 0.4) %>% 
    colformat_double(digits = 3)
Table 7.14: Count of Species with Data Transformed by 4th Root

Barbastella barbastellus

Eptesicus serotinus

Myotis nattereri

Myotis spp.

Nyctalus leisleri

Nyctalus noctula

Nyctalus spp.

Pipistrellus nathusii

Pipistrellus pipistrellus

Pipistrellus pygmaeus

Pipistrellus spp.

Plecotus spp.

Rhinolophus ferrumequinum

Rhinolophus hipposideros

1.414

0.000

0.000

2.760

0.000

1.861

0.000

0.000

2.115

0.000

1.000

0.000

0.000

0.000

1.627

1.000

0.000

1.000

0.000

1.414

0.000

1.189

3.364

0.000

1.565

1.189

1.189

1.189

1.565

1.189

1.000

1.316

0.000

2.378

0.000

0.000

3.573

0.000

1.495

0.000

2.190

1.565

1.778

1.000

0.000

3.186

0.000

3.421

0.000

0.000

4.016

2.031

1.565

1.934

2.060

1.414

1.316

0.000

0.000

2.432

0.000

1.414

0.000

1.682

3.224

1.495

2.000

2.378

1.732

1.627

1.189

0.000

0.000

1.732

0.000

1.189

0.000

0.000

2.115

0.000

1.189

2.618

1.000

1.627

1.189

0.000

1.000

1.189

0.000

1.682

0.000

0.000

2.397

0.000

1.000

0.000

1.899

1.316

1.189

0.000

0.000

2.166

1.316

2.088

0.000

0.000

3.517

1.495

1.495

1.565

1.565

1.968

1.565

0.000

0.000

2.031

0.000

1.565

0.000

2.031

2.515

1.732

1.778

1.899

2.723

1.778

3.562

0.000

0.000

1.732

0.000

1.627

0.000

0.000

4.412

1.000

1.732

0.000

1.627

1.189

2.166

1.189

0.000

2.115

0.000

1.000

0.000

0.000

6.368

1.565

2.632

1.000

0.000

1.000

3.018

1.565

0.000

2.466

0.000

2.115

0.000

0.000

6.025

2.321

2.632

2.088

1.565

1.414

2.483

0.000

0.000

2.088

0.000

0.000

0.000

1.189

3.886

1.316

2.360

0.000

1.000

1.000

2.321

0.000

0.000

1.189

0.000

1.861

1.189

0.000

2.850

0.000

1.732

1.000

1.414

0.000

2.280

0.000

0.000

2.236

0.000

0.000

0.000

0.000

2.839

1.000

2.166

0.000

0.000

1.000

1.778

0.000

0.000

2.466

0.000

1.414

0.000

0.000

2.913

1.565

2.546

0.000

1.000

0.000

1.627

0.000

0.000

1.778

0.000

0.000

0.000

0.000

2.378

0.000

1.968

0.000

0.000

0.000

0.000

0.000

0.000

1.000

0.000

1.414

0.000

0.000

1.414

0.000

0.000

0.000

0.000

0.000

0.000

0.000

0.000

1.682

0.000

1.316

0.000

1.189

1.821

1.316

1.189

0.000

0.000

0.000

0.000

0.000

0.000

2.031

0.000

2.166

0.000

0.000

2.981

0.000

1.000

0.000

1.627

1.189

0.000

0.000

0.000

0.000

0.000

2.000

0.000

0.000

3.253

0.000

1.000

0.000

0.000

0.000

0.000

0.000

0.000

0.000

0.000

1.189

0.000

1.000

2.466

1.000

1.316

1.000

1.627

1.682

Show the code
# Nonmetric Multidimensional Scaling (NMDS), 
# and tries to find a stable solution 
# using several random starts
nMDS_results <- metaMDS(Transformed_Data, distance = "bray")

# extract NMDS scores (x and y coordinates) for plotting
nMDS_coords = as_tibble(scores(nMDS_results)$sites)

# bind the factors (static location and month back)
graph_data <- bind_cols(meta_data, nMDS_coords)
Show the code
graph_style <- theme_bw() + 
  theme(legend.position = "none", # No legend
  axis.text.x = element_text(size=12, angle = 270, face="bold"), # text bold and vertical
  axis.text.y = element_text(size=12, face="bold.italic"), # bat names italic
  strip.text = element_text(size=12, face="bold"), # Bold facet names
  axis.title.x = element_blank(), # currently not used
  axis.title.y = element_blank(), # no y title (just bat names)
  panel.grid.major = element_blank(), #remove grid lines
  panel.grid.minor = element_blank(),
  plot.caption = element_text())

g1 <- ggplot(graph_data, aes(x = NMDS1, y = NMDS2)) + 
    geom_point(size = 9, colour = "dodgerblue", alpha = 0.8) +
    scale_x_continuous(expand = c(0.05, 0.05)) +
    scale_y_continuous(expand = c(0.05, 0.05)) +
    labs(caption = glue("Stress {round(nMDS_results$stress, digits = 2)}")) +
    graph_style

g1
stressplot(nMDS_results)
(a) Unlabeled Solution of NMDS (2D)
(b) Stress Plot
Figure 7.7: Non-metric Multidimensional Scaling Solution and Stress Plot
Show the code
graph_style <- theme_bw() + 
  theme(legend.position = "none", 
  axis.text.x = element_text(size=12, face="bold"),
  axis.text.y = element_text(size=12, face="bold"), 
  strip.text = element_text(size=12, face="bold"), 
  axis.title.x = element_blank(), 
  axis.title.y = element_blank(), 
  panel.grid.major = element_blank(), 
  panel.grid.minor = element_blank(),
  plot.title = element_text(size=16, face="bold"))

p1 <- ggplot(graph_data, aes(x = NMDS1, y = NMDS2, label=Description)) + 
    geom_point(size = 9, aes(colour = Description), alpha = 0.8) +
  geom_text(size = 5) +
  scale_color_brewer(palette = "Set2") +
  scale_x_continuous(expand = c(0.05, 0.05)) +
  scale_y_continuous(expand = c(0.05, 0.05)) +
  labs(title = "Static Location") +
  graph_style

p2 <- ggplot(graph_data, aes(x = NMDS1, y = NMDS2, label=Month)) + 
    geom_point(size = 9, aes(colour = Month), alpha = 0.8) +
  labs(title = "Month") +
  geom_text(size = 5) +
 scale_fill_tableau(palette = "Tableau 10") + 
  scale_x_continuous(expand = c(0.05, 0.05)) +
  scale_y_continuous(expand = c(0.05, 0.05)) +
  graph_style


p1
p2
(a) Labelled by Static Location
(b) Labelled by Month
Figure 7.8: Non-metric Multidimensional Scaling Solution and Stress Plot

7.6.3.1 ANOSIM

The ANOSIM test (Glossary B.7.21) is similar to an ANOVA hypothesis test; it uses a dissimilarity matrix as input instead of raw data. It is also non-parametric, meaning it doesn’t assume much about the data, a useful attribute for the often-skewed ecological data. As a non-parametric test, ANOSIM uses ranked dissimilarities instead of actual distances, and for this reason complements an NMDS plot. The main point of the ANOSIM test is to determine if the differences between two or more groups are significant.

The function anosim from the package vegan(Glossary B.8.6) can test whether there is a statistical difference between groups (e.g. static locations or months)

  • null hypothesis - no differnce
  • alternative - difference

The higher the R value (it has a range 0 to 1), the more dissimilar your groups5 are in terms of bat assemblage. Figure 7.9 (a) shows the ANOSIM result for the Static Locations, there is a statistical difference of the bat species assemblage for the static location. Figure 7.9 (b) shows the ANOSIM result for the Month, there is no statistical difference of the bat species assemblage for the Months.

Show the code
graph_style <- theme_bw() + 
  theme(legend.position = "none", 
  axis.text.x = element_text(size=12, face="bold"),
  axis.text.y = element_text(size=12, face="bold"), 
  strip.text = element_text(size=12, face="bold"), 
  axis.title.x = element_blank(), 
  axis.title.y = element_blank(), 
  panel.grid.major = element_blank(), 
  panel.grid.minor = element_blank(),
  plot.caption = element_text(size = 12),
  plot.title = element_text(size=16, face="bold"))

Factor <- "Static Location"

ano = anosim(Transformed_Data, 
             meta_data$Description, 
             distance = "bray", 
             permutations = 9999)

#Tjis is how PrimerE views ANOSIM Results
Rperms<- tibble(ano$perm)

Rstat <- ano$statistic
sigValue <- ano$signif

if(sigValue < 0.05) {
  
  SigText <- glue("ANOSIM significamce value ({sigValue}) for {Factor} is statistically significant at P<0.05")
  
} else {
  
  SigText <- glue("ANOSIM significamce value ({sigValue}) for {Factor} is not statistically significant at P<0.05")
  
}

colnames(Rperms) <- c("Rano")

p1_description <- ggplot(Rperms, aes(Rano)) +
  geom_histogram(fill = "dodgerblue") +
  geom_vline(xintercept = Rstat, colour = "gold3", linewidth = 2, linetype = "dashed") +
  annotate("text", x = Rstat, y = 300, label = glue("R statistic \n {round(Rstat, digits = 3)}")) +
  labs(title = Factor,
       y = "Frequency",
       x = "R",
       caption = SigText) +
  graph_style


Factor <- "Month"

ano = anosim(Transformed_Data, 
             meta_data$Month, 
             distance = "bray", 
             permutations = 9999)

#Tjis is how PrimerE views ANOSIM Results
Rperms<- tibble(ano$perm)

Rstat <- ano$statistic
sigValue <- ano$signif

if(sigValue < 0.05) {
  
  SigText <- glue("ANOSIM significamce value ({sigValue}) for {Factor} is statistically significant at P<0.05")
  
} else {
  
  SigText <- glue("ANOSIM significamce value ({sigValue}) for {Factor} is not statistically significant at P<0.05")
  
}

colnames(Rperms) <- c("Rano")

p2_month <- ggplot(Rperms, aes(Rano)) +
  geom_histogram(fill = "dodgerblue") +
  geom_vline(xintercept = Rstat, colour = "gold3", linewidth = 2, linetype = "dashed") +
  annotate("text", x = Rstat, y = 300, label = glue("R statistic \n {round(Rstat, digits = 3)}")) +
  labs(title = Factor,
       y = "Frequency",
       x = "R",
       caption = SigText) +
  graph_style

p1_description
p2_month
(a) ANOSIM Static Location
(b) ANOSIM Month
Figure 7.9: ANOSIM: Test is there a Statistical Difference Between Groups

7.6.4 Diversity Indices

The diversity indexes (Glossary B.7.23) are a set of metrics that quantify the diversity of a community (Magurran 2004). The most common diversity indexes are the Shannon index Shannon and Weaver (1949), the Simpson index (Simpson 1949), and the Pielou index (Simpson 1949). The Shannon index is a measure of the uncertainty in predicting the species identity of an individual chosen at random from the community. The Simpson index is a measure of the probability that two individuals randomly selected from the community will belong to the same species. The Pielou index is a measure of the evenness of the species distribution in the community. These indices go beyond simply counting the number of species (species richness) and consider how evenly distributed those species are.

It is normal for no transformation (pre-treatment) to be applied to the abundance data before calculating. However, in some cases, a log transformation (e.g. ln(x+1)) on the abundance data, could be undertaken. This can downweight the influence of highly abundant species and give more weight to rare ones, offering a more even weighting across species.

Diversity can be measured through the number of bat passes recorded on ultrasonic bat detectors, but they do introduce certain biases that can affect data interpretation. Understanding these biases is crucial for accurate data analysis and interpretation.

  • Bats vary their echolocation calls based on factors like prey type, hunting strategy, and social interactions.
  • Multiple bat species can share similar call characteristics, making it difficult to distinguish between them based on echolocation calls alone.
  • Different bat species use different frequency ranges for echolocation, lower frequency calls travel further than higher frequency.
  • Background noise from wind, rain, traffic, and other sources can interfere with call detection, especially for faint or distant calls.

7.6.4.1 Shannon Diversity Index

Figure 7.10 shows the Shannon diversity index of static locations. The Shannon index is calculated with the diversity function from the vegan package. The Shannon index is calculated for each static location and month and the results are plotted on a box plot by the static location. Figure 7.11 shows the Shannon diversity index of static locations with a log of counts and Figure 7.12 shows the Shannon diversity index of months.

Show the code
statics_with_night <- iBats::date_time_info(statics)

static_count <- statics_with_night |> 
  group_by(Species, Description, Month) |>
  summarise(n = n()) |>
  #Merge the Description and Month columns
  unite("Description", c("Description", "Month"), sep = "_") |>
  pivot_wider(names_from = "Species", values_from = "n", values_fill = 0) |> 
  #make first column the description (row names)
  column_to_rownames(var = "Description") 
 
#Divesity index of static locations
Sh <- diversity(static_count, index = "shannon")

#class(Sh)

# unname the Sh vector
Index_Sh <- unname(Sh)
# extract the row names from Sh
Names_Sh <- names(Sh)

Sh_data <- tibble(Names_Sh, Index_Sh) |> 
  separate(Names_Sh, c("Description", "Month"), sep = "_") 


ggplot(Sh_data, aes(x = Description, y = Index_Sh, fill = Description))+
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.9, shape = 21, size = 4) +
  labs(title = "Shannon Index for the Static Locations",
       x = "Location",
       y = "Shannon Index") +
  scale_fill_npg() +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        axis.title.x = element_blank())
Figure 7.10: Shannon diversity index of static locations
Show the code
# Log values of counts

static_count_log <- static_count |> 
  mutate(across(everything(), ~log(. + 1))) 
#Divesity index of static locations
Sh_log <- diversity(static_count_log, index = "shannon")

#class(Sh)

# unname the Sh vector
Index_Sh_log <- unname(Sh_log)
# extract the row names from Sh
Names_Sh_log <- names(Sh_log)

Sh_data_log <- tibble(Names_Sh_log, Index_Sh_log) |> 
  separate(Names_Sh_log, c("Description", "Month"), sep = "_") 



ggplot(Sh_data_log, aes(x = Description, y = Index_Sh_log, fill = Description))+
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.9, shape = 21, size = 4) +
  labs(title = "Shannon Index for the Static Locations (log of counts)",
       x = "Location",
       y = "Shannon Index") +
  scale_fill_npg() +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        axis.title.x = element_blank())
Figure 7.11: Shannon diversity index of static locations (Log of counts)
Show the code
Names_Sh <- names(Sh)

Sh_data <- tibble(Names_Sh, Index_Sh) |> 
  separate(Names_Sh, c("Description", "Month"), sep = "_") 

ggplot(Sh_data, aes(x = Month, y = Index_Sh, fill = Month))+
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.9, shape = 21, size = 4) +
  labs(title = "Shannon Index of Months",
       x = "Location",
       y = "Shannon Index") +
  scale_fill_jama() +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        axis.title.x = element_blank())
Figure 7.12: Shannon diversity index of Months

7.6.4.2 Simpson Diversity Index

Figure 7.13 shows the Simpson diversity index of static locations. The Simpson index is calculated with the diversity function from the vegan package. The Simpson index is calculated for each static location and month and the results are plotted on a box plot by the static location.

Show the code
#Divesity index of static locations
Si <- diversity(static_count, index = "simpson")

#class(Sh)

# unname the Sh vector
Index_Si <- unname(Si)
# extract the row names from Sh
Names_Si <- names(Si)

Si_data <- tibble(Names_Si, Index_Si) |> 
  separate(Names_Si, c("Description", "Month"), sep = "_") 


ggplot(Sh_data, aes(x = Description, y = Index_Si, fill = Description))+
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.9, shape = 21, size = 4) +
  labs(title = "Simpson Index for the Static Locations",
       x = "Location",
       y = "Simpson Index") +
  scale_fill_npg() +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        axis.title.x = element_blank())
Figure 7.13: Simpson diversity index of static locations

7.6.4.3 Inverse Simpson Diversity Index

Figure 7.14 shows the Inverse Simpson diversity index (\(1/D\)) of static locations. The Simpson index is calculated with the diversity function from the vegan package. The Simpson index is calculated for each static location and month and the results are plotted on a box plot by the static location.

Show the code
#Divesity index of static locations
Si <- diversity(static_count, index = "simpson")
Si <- 1/Si


# unname the Sh vector
Index_Si <- unname(Si)
# extract the row names from Sh
Names_Si <- names(Si)

Si_data <- tibble(Names_Si, Index_Si) |> 
  separate(Names_Si, c("Description", "Month"), sep = "_") 


ggplot(Sh_data, aes(x = Description, y = Index_Si, fill = Description))+
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.9, shape = 21, size = 4) +
  labs(title = "Inverse Simpson Index for the Static Locations",
       x = "Location",
       y = "Simpson Index") +
  scale_fill_npg() +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        axis.title.x = element_blank())
Figure 7.14: Inverse Simpson diversity index of static locations

7.6.4.4 Pielou Diversity Index

The Pielou index is calculated by the equation:

\[J = \frac{H}{\ln S}\] where \(H\) is the Shannon index and \(S\) is the total number of species in the community.

The Shannon index is calculated with the diversity function from the vegan package. The total number of species can be calculated with the specnumber function from the vegan package. Figure 7.15 shows the Pielou diversity index of static locations.

Show the code
Sh <- diversity(static_count, index = "shannon")

Spec_number <- specnumber(static_count)

J <- Sh/log(Spec_number)
#class(Sh)

# unname the Sh vector
Index_J <- unname(J)
# extract the row names from Sh
Names_J <- names(J)

J_data <- tibble(Names_J, Index_J) |> 
  separate(Names_J, c("Description", "Month"), sep = "_")

ggplot(J_data, aes(x = Description, y = Index_J, fill = Description))+
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.9, shape = 21, size = 4) +
  labs(title = "Pielou's Index for the Static Locations",
       x = "Location",
       y = "Pielou's Index") +
  scale_fill_npg() +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        axis.title.x = element_blank())
Figure 7.15: Pielou’s diversity index of static locations

  1. The significance level is always chosen before the test is undertaken.↩︎

  2. A p-value is a measure of the probability that the observed results of a statistical test occurred by chance, given that the null hypothesis is true. In other words, a p-value helps you determine whether the results of your experiment are statistically significant. If the p-value is low (generally less than 0.05; here 0.01 is chosen), you can reject the null hypothesis and conclude that the observed differences between the groups you are studying are statistically significant.↩︎

  3. note it is the dissimilarity that is used, 1- similarity↩︎

  4. NMDS reduces the rank-based differences between the distances between objects in the original matrix and distances between the ordinated objects. The difference is expressed as stress.↩︎

  5. The test to determine which individual group is different (e.g. Static 1, Static 2 …), is not available in R; it can be undertaken in the Primer-E software https://www.primer-e.com/.↩︎