5 Introduction to Statistics in R

install.packages("ggplot2",  repos = "https://cran.us.r-project.org")
install.packages("dplyr",  repos = "https://cran.us.r-project.org")
install.packages("ggfortify",  repos = "https://cran.us.r-project.org")
urlfile05a="https://raw.githubusercontent.com/apicellap/data/main/ladybirds_morph_colour.csv"
lady<-read.csv(url(urlfile05a))
head(lady)
##   Habitat Site morph_colour number
## 1   Rural   R1        black     10
## 2   Rural   R2        black      3
## 3   Rural   R3        black      4
## 4   Rural   R4        black      7
## 5   Rural   R5        black      6
## 6   Rural   R1          red     15

View data structure:

str(lady)
## 'data.frame':    20 obs. of  4 variables:
##  $ Habitat     : chr  "Rural" "Rural" "Rural" "Rural" ...
##  $ Site        : chr  "R1" "R2" "R3" "R4" ...
##  $ morph_colour: chr  "black" "black" "black" "black" ...
##  $ number      : int  10 3 4 7 6 15 18 9 12 16 ...

5.1 \(\chi\)\(^{2}\) contingency table analysis

Create new dataframe, totals:

totals <- lady %>% #working with this dataframe
  group_by(Habitat, morph_colour) %>% #we want to represent these groups in the dataframe
  summarise(total.number = sum(number)) #add up the numbers of each group using the new object total.number
## `summarise()` has grouped output by 'Habitat'. You can override using the
## `.groups` argument.
totals
## # A tibble: 4 × 3
## # Groups:   Habitat [2]
##   Habitat    morph_colour total.number
##   <chr>      <chr>               <int>
## 1 Industrial black                 115
## 2 Industrial red                    85
## 3 Rural      black                  30
## 4 Rural      red                    70

Create bar chart:

base_plot<-ggplot(totals, aes(x=Habitat, y=total.number,
                  fill = morph_colour)) + #fill is used when there is something like a bar that can be filled with color 
                                          #if this were color = morph_colour, then the argument would affect the bar's outline
  geom_bar(
    stat = 'identity', #this tells ggplot not to calculate anything from the data and just display the data as they are in the dataframe
    position = 'dodge' #this is request to put the two bars in each Habitat group next to each other 
                       #if it's not used, a stacked barplot would be printed 
  ) 
base_plot
base_plot +   
  scale_fill_manual(values = c(black = "black", red = "red")) #the text in "" are the colors we are instructing R to fill the bars with 
  • Null hypothesis: there is no association between the color of the birds and their habitat
    • My opinion: data suggest that there is a higher proportion of colored birds in the industrial habitat. We should reject the null hypothesis (pre-stats)
    • to rigorously test this, a chi-squared test must be performed

5.2 Making the \(\chi\)\(^{2}\) Test

Use xtabs() function to generate a contingency table:

lady.mat <-xtabs(                     #function converts dataframe into a matrix, which is different from a dataframe
                  number ~ Habitat + morph_colour, #cross-tabulate the number of column counts in the dataframe by the Habitat and morph_colour variables
                 data = lady) 
lady.mat
##             morph_colour
## Habitat      black red
##   Industrial   115  85
##   Rural         30  70

Perform a \(\chi\)\(^{2}\) test on the matrix:

lady.chi<-chisq.test(lady.mat) #chi squared test function performed on matrix 
lady.chi
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lady.mat
## X-squared = 19.103, df = 1, p-value = 1.239e-05
  • the p value is 0.00001239 - this is the probability that the pattern arose by chance.
    • It is lower than 0.05, so we can reject the null hypothesis
names(lady.chi) #can examine all of the parts of the test mechanics 
## [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
## [7] "expected"  "residuals" "stdres"

5.3 Two-sample t-test

  • The two-sample t-test compares the means of two groups of numeric values
  • It is appropriate when the sample sizes of these two groups is small
  • The analysis makes the following assumptions about the data:
    • The data are normally distributed
    • The variances of the data are equivalent
urlfile05b="https://raw.githubusercontent.com/apicellap/data/main/ozone.csv"
ozone<-read.csv(url(urlfile05b))
glimpse(ozone)
## Rows: 20
## Columns: 3
## $ Ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8, …
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West"…
## $ Garden.ID       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", …

5.3.1 The first step: Plot your data

Create histograms of the data, by Garden.location variable:

ggplot(ozone, aes(x=Ozone))+ #since this is a histogram, x must be a continuous variable; cannot be categorical 
  geom_histogram(binwidth=10) + 
  facet_wrap(~Garden.location, #Divide data up into groups by this variable 
            ncol=1 ) +  #stacks the graphs on top of each other - 1 column 
  theme_bw()
  • graph shows that assumptions of normality and equality of variance are met
    • R has functions to evaluate these aspects more rigorously too

Count the number of observations when variable is ‘East’:

x<-sum(with(ozone, Garden.location == "East")) 
x
## [1] 10
summary<-ozone %>% 
  group_by(Garden.location) %>% 
  summarise(mean = mean(Ozone), SE = (sd(Ozone))/sqrt(x))
summary
## # A tibble: 2 × 3
##   Garden.location  mean    SE
##   <chr>           <dbl> <dbl>
## 1 East             77.3  2.49
## 2 West             61.3  2.87

Plot the ozone data as a barplot:

OzBrP<-ggplot(summary, aes(x = Garden.location, y = mean)) + #x defined as grazing treatment, y as fruit production
  geom_col() + 
  geom_errorbar(aes(x=Garden.location, ymin=mean -SE, ymax=mean*1+SE, width=.2),
                                            position = 'dodge')
OzBrP

Plot the ozone data as a boxplot:

OzBxP <-ggplot(ozone, aes(x = Garden.location, y = Ozone)) + 
  geom_boxplot()
OzBxP
  • null hypothesis: there is no difference in ozone levels between the East and West garden locations
  • my opinion based on the boxplot - reject the null hypothesis

5.3.2 Two sample t-test analysis

Perform two sample t-test on the ozone dataset:

t.test(Ozone ~ Garden.location, #expression means do ozone levels vary as a function of location? That's how the ~ reads 
       data = ozone)
## 
##  Welch Two Sample t-test
## 
## data:  Ozone by Garden.location
## t = 4.2363, df = 17.656, p-value = 0.0005159
## alternative hypothesis: true difference in means between group East and group West is not equal to 0
## 95 percent confidence interval:
##   8.094171 24.065829
## sample estimates:
## mean in group East mean in group West 
##              77.34              61.26
  • the default of this two sample t-test is the Welch’s version
    • In Welch version, the assumption about equal variance is relaxed and allows that you do not have to test for equal variance
  • The GSwR authors suggest not testing for equal variance
  • note regarding the Confidence interval (CI): the line about the 95% CI is 8.1-24.1. Since this range does not include 0 says that the means are statistically different from each other. This is congruent with the p value of the t-test

5.4 Linear models

  • Linear models are a class of analyses that include regression, multiple regression, ANOVA, and ANCOVA
    • they are centered around a similiar framework of ideas such as a common set of assumptions about the data like the idea that the data are normally distributed

5.5 Simple linear regression

  • This section focuses on a dataset that compares plant growth rates to soil moisture content
  • The response (dependent) variable is that of plant growth rate
  • Plant growth rate is plotted against the explanatory (independent) variable, soil moisture content
    • This variable is a continuous, numeric variable which does not have categories

Read and view structure of the data:

urlfile05c="https://raw.githubusercontent.com/apicellap/data/main/plant.growth.rate.csv"
plant_gr<-read.csv(url(urlfile05c))
glimpse(plant_gr) #two continuous (numeric) variables. they have no categories 
## Rows: 50
## Columns: 2
## $ soil.moisture.content <dbl> 0.4696876, 0.5413106, 1.6979915, 0.8255799, 0.85…
## $ plant.growth.rate     <dbl> 21.31695, 27.03072, 38.98937, 30.19529, 37.06547…

5.5.1 Getting and plotting the data

Plot the plant growth (pg) data:

ggplot(plant_gr,
       aes(x=soil.moisture.content, y=plant.growth.rate)) +
  geom_point()+ 
  ylab("Plant Growth Rate (mm/week)") + 
  theme_bw()
  • Plot shows that there is a probably a positive linear relationship between the two variables
    • There is a slope of about 15 mm/week
    • preliminary analysis: probably will reject the null hypothesis that soil moisture does not affect plant growth

5.5.2 Making a simple linear regression happen

Create a linear model:

model_pgr <- lm(plant.growth.rate ~ soil.moisture.content, data = plant_gr) #plant growth rate is a function of soil moisture content
model_pgr
## 
## Call:
## lm(formula = plant.growth.rate ~ soil.moisture.content, data = plant_gr)
## 
## Coefficients:
##           (Intercept)  soil.moisture.content  
##                 19.35                  12.75

5.5.3 Assumptions first

Check assumptions with ggfortify:

autoplot(model_pgr, 
         smooth.colour = NA)
  • ggfortify plots:
    • Top left - informs us on whether the line is an appropriate fit to the data; flat line means model is a good fit. Humps/valleys -> poor fit
    • Top right - dots = residuals; dashes line = expectation under normal distribution. Better tool than histogram to assess normal distribution
    • Bottom left - evaluates assumption of equal variance. linear models assume that variance is constant over all predicted values of the response variable. There should be no pattern
    • Bottom right - evaluates leverage. Used to detect influential datapoints that will shift the gradient more than expected + also for outliers
  • null hypothesis: soil moisture has no effect on plant growth

5.5.4 Now the interpretation

Produce a sum of squares table:

anova(model_pgr)
## Analysis of Variance Table
## 
## Response: plant.growth.rate
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## soil.moisture.content  1 2521.15 2521.15  156.08 < 2.2e-16 ***
## Residuals             48  775.35   16.15                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Output:
  • Large F value indicates that error variance is small relative to the variance attributed to the explanatory variable
    • This leads to the tiny p value.
      • Both are good indications that the effect seen in the data isn’t the result of chance

Produce a summary table:

summary(model_pgr) 
## 
## Call:
## lm(formula = plant.growth.rate ~ soil.moisture.content, data = plant_gr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9089 -3.0747  0.2261  2.6567  8.9406 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             19.348      1.283   15.08   <2e-16 ***
## soil.moisture.content   12.750      1.021   12.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.019 on 48 degrees of freedom
## Multiple R-squared:  0.7648, Adjusted R-squared:  0.7599 
## F-statistic: 156.1 on 1 and 48 DF,  p-value: < 2.2e-16
  • produces table of estimates of the coefficients of the line that is the model
  • the slope that is associated with the explanatory variable (soil moisture) - the values of which are associated with the differences in plant growth rate

Superimpose linear model onto plot:

ggplot(plant_gr,
       aes(x=soil.moisture.content, y=plant.growth.rate)) +
  geom_point()+ 
  geom_smooth(method = 'lm') + #put a linear-model fitted line and the standard error of the fit using flash transparent gray onto graph 
  ylab("Plant Growth Rate (mm/week)") + 
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

5.6 Analysis of variance: the one-way ANOVA

  • the one-way ANOVA is similiar to the previous example and the two-sample t-test with one key difference:
    • the explanatory variable in the one-way ANOVA is a categorical variable (factor)
  • waterflea dataset - we ask:
    • whether parasites alter waterflea growth rates
    • whether each of three parasittes reduces growth rates relative to a control in which there was no parasite
urlfile05d="https://raw.githubusercontent.com/apicellap/data/main/Daphniagrowth.csv"
daphnia<-read.csv(url(urlfile05d))
glimpse(daphnia)
## Rows: 40
## Columns: 3
## $ parasite    <chr> "control", "control", "control", "control", "control", "co…
## $ rep         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, …
## $ growth.rate <dbl> 1.0747092, 1.2659016, 1.3151563, 1.0757519, 1.1967619, 1.3…

Plot daphnia data:

ggplot(daphnia, aes(x = parasite, y=growth.rate)) + 
  geom_boxplot() +
  theme_bw() + 
  coord_flip() #flips the orientation of the graph 90˚ to the right 
  • Visualization takeaways:
    • There is substantial variation among the parasite groups
    • The control group has the highest growth rate
    • There is likely to be an overall effect of parasites on growth rate
    • There may be an order in the impacts different parasites have on growth rates: P. ramosa < M. bicuspidata < P. perplexa

Create a model:

model_grow <- lm(growth.rate ~ parasite, data = daphnia)
model_grow
## 
## Call:
## lm(formula = growth.rate ~ parasite, data = daphnia)
## 
## Coefficients:
##                       (Intercept)  parasiteMetschnikowia bicuspidata  
##                            1.2139                            -0.4128  
##      parasitePansporella perplexa           parasitePasteuria ramosa  
##                           -0.1376                            -0.7317

Check the assumptions:

autoplot(model_grow, 
         smooth.colour = NA)
  • assumption-checking plots:
    • The figures suggest that the assumptions are fine
    • Even the upper right plot, the Q-Q plot, is within the bounds of expected variation

Produce anova table for model:

anova(model_grow)
## Analysis of Variance Table
## 
## Response: growth.rate
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## parasite   3 3.1379 1.04597  32.325 2.571e-10 ***
## Residuals 36 1.1649 0.03236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • For a one-way anova, the null hypothesis is that all of the groups come from populations with (statistically) the same mean
    • The F-value quantifies how likely this is to be true
      • F value is the ratio between the between group variation: within group variation
        • A large F value means that the between group variation is much larger
summary(model_grow)
## 
## Call:
## lm(formula = growth.rate ~ parasite, data = daphnia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41930 -0.09696  0.01408  0.12267  0.31790 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.21391    0.05688  21.340  < 2e-16 ***
## parasiteMetschnikowia bicuspidata -0.41275    0.08045  -5.131 1.01e-05 ***
## parasitePansporella perplexa      -0.13755    0.08045  -1.710   0.0959 .  
## parasitePasteuria ramosa          -0.73171    0.08045  -9.096 7.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1799 on 36 degrees of freedom
## Multiple R-squared:  0.7293, Adjusted R-squared:  0.7067 
## F-statistic: 32.33 on 3 and 36 DF,  p-value: 2.571e-10
  • Output:
  • table of coefficients:
    • What is labelled as ‘(Intercept)’ is the control
    • R tends to list things in alphabetical order
      • Of the levels of the explanatory variable, ‘control’ is the first alphabetically
      • In this context, we can assume that (Intercept) refers to the first level in alphabetical order - control - in this case
    • Treatment contrasts report the differences between the reference level (the control in this case) and the other levels
      • So -0.41275 is the difference between the control level and the parasiteMetschnikowia bicuspidata level
      • In other words, the differences are the distances between the colored diamonds and the dotted black line in figure 6.1 below

Determine the means for each level:

sumDat <- daphnia %>% 
  group_by(parasite) %>%
  summarise(meanGR = mean(growth.rate))
sumDat
## # A tibble: 4 × 2
##   parasite                  meanGR
##   <chr>                      <dbl>
## 1 control                    1.21 
## 2 Metschnikowia bicuspidata  0.801
## 3 Pansporella perplexa       1.08 
## 4 Pasteuria ramosa           0.482

Manually determine the difference between control and other levels:

0.8011541   -1.2139088  
## [1] -0.4127547
1.0763551 - 1.2139088
## [1] -0.1375537
0.4822030 - 1.2139088
## [1] -0.7317058
ggplot(daphnia, aes(x = parasite, y=growth.rate, color = parasite)) + 
  geom_point(size =2) + 
  geom_point(data = sumDat, aes( x = parasite, y = meanGR), shape = 18, size = 5) +
  geom_hline(yintercept = sumDat$meanGR[1],
             lwd=1,
             linetype = 'dotted',
             colour="black") + 
  geom_hline(yintercept = sumDat$meanGR[1],
             lwd=1,
             linetype = 'dotted',
             colour="black") + 
  xlab("")+
  theme_bw() + 
  coord_flip() 
\label{fig:figs}Daphnia Treatment Differences

Figure 5.1: Daphnia Treatment Differences