Introduction

This document provides worked answers for all of the exercises in the Introduction to R with Tidyverse course.

Exercise 1

Numbers

We start by doing some simple calculations.

31 * 78
## [1] 2418
697 / 41
## [1] 17

We next look at how to assign data to named variables and then use those variables in calculations.

We make assignments using arrows and they can point to the right or the left depending on the ordering of our data and variable name.

39 -> x

y <- 22

We can then use these variables in calculations instead of re-entering the data.

x - y
## [1] 17

We can also save the results directly into a new variable.

x - y -> z

z
## [1] 17
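
As a quick check that the direction of the arrow makes no difference to the result, the sketch below assigns the same value both ways and compares the two variables. The names a and b are illustrative and are not part of the exercise.

```r
# A right-pointing arrow puts the value on the left into the name on the right
39 -> a

# A left-pointing arrow puts the value on the right into the name on the left
b <- 39

# identical() returns TRUE when the two objects are exactly the same
identical(a, b)
```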

Text

We can also store text. The difference with text is that we need to tell R to treat it as literal text rather than as the name of a variable or function. We do this by putting it in quotes.

"simon" -> my.name

We can use the nchar function to find out how many characters my name has in it.

nchar(my.name)
## [1] 5

We can also use the substr function to get the first letter of my name.

substr(my.name,1,1)
## [1] "s"
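
The same two functions can be combined to pull out other parts of the text. This is only a small sketch; the names middle.letters and last.letter are made up for the illustration.

```r
"simon" -> my.name

# substr takes the text, a start position and a stop position
substr(my.name, 2, 4) -> middle.letters
middle.letters
## [1] "imo"

# Using nchar for both positions selects the last letter
substr(my.name, nchar(my.name), nchar(my.name)) -> last.letter
last.letter
## [1] "n"
```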

Exercise 2

Making vectors manually

We’re going to manually make some vectors to work with. For the first one there is no pattern to the numbers, so we’re going to make it completely manually with the c() function.

c(2,5,8,12,16) -> some.numbers

For the second one we’re making an integer series, so we can use the colon notation to enter this more quickly.

5:9 -> number.range

Now we can do some maths using the two vectors.

some.numbers - number.range
## [1] -3 -1  1  4  7

Because the two vectors are the same length, the values at equivalent positions are matched together. Thus the final answer is:

(2-5), (5-6), (8-7), (12-8), (16-9)
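
One way to convince yourself of the position-by-position matching is to compare a single position calculated by hand with the same position in the vector result. This is a sketch only; the names by.hand and from.vector are illustrative.

```r
c(2,5,8,12,16) -> some.numbers
5:9 -> number.range

# Both vectors hold five values, so every position has a partner
length(some.numbers) == length(number.range)

# Subtract the third values directly...
some.numbers[3] - number.range[3] -> by.hand

# ...and compare with the third value of the vector subtraction
(some.numbers - number.range)[3] -> from.vector

by.hand == from.vector
```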

Vector functions and subsetting

We’re going to use some functions which return vectors and then use the subsetting functionality on them.

First we’re going to make a numerical sequence with the seq function.

seq(
  from=2,
  by=3,
  length.out = 100
) -> number.series

number.series
##   [1]   2   5   8  11  14  17  20  23  26  29  32  35  38  41  44  47  50
##  [18]  53  56  59  62  65  68  71  74  77  80  83  86  89  92  95  98 101
##  [35] 104 107 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152
##  [52] 155 158 161 164 167 170 173 176 179 182 185 188 191 194 197 200 203
##  [69] 206 209 212 215 218 221 224 227 230 233 236 239 242 245 248 251 254
##  [86] 257 260 263 266 269 272 275 278 281 284 287 290 293 296 299

We now want to extract the values at positions 5, 10, 15 and 20. This means that we need a vector with these values in it. It’s short enough that we can just enter the values manually, but we can also see that they form an arithmetic progression, so we could also use seq to create the vector.

c(5,10,15,20)
## [1]  5 10 15 20
seq(from=5, by=5, to=20)
## [1]  5 10 15 20
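
A short sketch to confirm that the two approaches give the same positions. The third line, which multiplies an integer range by 5, is an extra alternative that is not part of the original answer, and the variable names are illustrative.

```r
c(5,10,15,20) -> manual.positions
seq(from=5, by=5, to=20) -> seq.positions

# A third option (not in the exercise): scale the integer series 1 to 4
(1:4) * 5 -> scaled.positions

# all() is TRUE only if every pairwise comparison is TRUE
all(manual.positions == seq.positions)
all(manual.positions == scaled.positions)
```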

We can now use either of these methods to select the corresponding values at those positions in the number.series vector.

number.series[c(5,10,15,20)]
## [1] 14 29 44 59
number.series[seq(from=5,by=5, to=20)]
## [1] 14 29 44 59

Finally we’re going to extract all values from positions 10 to 30. For this we’ll use the colon operator as we did in the last exercise, but now it’s inside a selector.

number.series[10:30]
##  [1] 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89
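
To check a selection like this without counting the printed values, we can store it and ask for its length and its first and last values. This is a minimal sketch; the name selected is illustrative.

```r
seq(from=2, by=3, length.out=100) -> number.series

number.series[10:30] -> selected

# Positions 10 to 30 inclusive should give 21 values
length(selected)
## [1] 21

# The first and last values of the selection
selected[1]
## [1] 29
selected[length(selected)]
## [1] 89
```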

Statistical functions and vectors

Since R is a language built around data manipulation and statistics, we can use some of its built-in statistical functions.

We can use rnorm to generate a set of values sampled from a normal distribution.

rnorm(20) -> normal.numbers

Note that if you run this multiple times you’ll get different results each time, because the values are sampled randomly.
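
If you need the same random values every time, for example so that someone else can reproduce your output, one option is to fix the random seed with set.seed before sampling. This is a sketch rather than part of the exercise; the seed value 42 is arbitrary and the variable names are illustrative.

```r
# Fix the seed, then sample
set.seed(42)
rnorm(20) -> first.draw

# Reset to the same seed and sample again
set.seed(42)
rnorm(20) -> second.draw

# The two draws are the same because they started from the same seed
identical(first.draw, second.draw)
```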

We can now use the t.test function to test whether this vector of numbers has a mean which is significantly different from zero.

t.test(normal.numbers)
## 
##  One Sample t-test
## 
## data:  normal.numbers
## t = -0.15238, df = 19, p-value = 0.8805
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.3423456  0.2958809
## sample estimates:
##   mean of x 
## -0.02323238

Not surprisingly, the mean isn’t significantly different from zero (p-value = 0.8805).

If we do the same thing again but this time use a distribution with a mean of 1 we should see a difference in the statistical results we get.

t.test(rnorm(20, mean=1))
## 
##  One Sample t-test
## 
## data:  rnorm(20, mean = 1)
## t = 5.7454, df = 19, p-value = 1.548e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.8540882 1.8329653
## sample estimates:
## mean of x 
##  1.343527

This time the result is significant.
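
Rather than reading values off the printed report, we can store the result of t.test and pull out individual parts with the $ notation. The sketch below assumes a fixed seed (the value 1 is arbitrary) so that it is repeatable; the name result is illustrative, and the numbers you get will not match the output above, which came from a different random sample.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

t.test(rnorm(20, mean=1)) -> result

# The p-value on its own
result$p.value

# The estimated mean and its 95 percent confidence interval
result$estimate
result$conf.int

# A TRUE/FALSE answer at the conventional 0.05 threshold
result$p.value < 0.05
```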

Exercise 3

We’re going to read some data from a file straight into R. To do this we’re going to use the tidyverse read_ functions. We therefore need to load tidyverse into our script.

library(tidyverse)
## -- Attaching packages ------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Reading a small file

We’ll start off by reading in a small file.

read_tsv("small_file.txt") -> small
## Parsed with column specification:
## cols(
##   Sample = col_character(),
##   Length = col_double(),
##   Category = col_character()
## )
small

Note that the only relevant name from now on is small, which is the name we saved the data under. The original file name is irrelevant once the data is loaded.

We can see that Sample and Category have the ‘character’ type because they are text. Length has the ‘double’ type because it is a number.

We want to find the median of the log2 transformed lengths.

To start with, we need to extract the Length column using the $ notation.

small$Length
##  [1]  45  82  81  56  96  85  65  96  60  62  80  63  50  64  43  98  78
## [18]  53 100  79  84  68  99  65  55  98  56  83  81  69  50  72  54  56
## [35]  87  84  80  68  95  93

Now we can log2 transform this.

log2(small$Length)
##  [1] 5.491853 6.357552 6.339850 5.807355 6.584963 6.409391 6.022368
##  [8] 6.584963 5.906891 5.954196 6.321928 5.977280 5.643856 6.000000
## [15] 5.426265 6.614710 6.285402 5.727920 6.643856 6.303781 6.392317
## [22] 6.087463 6.629357 6.022368 5.781360 6.614710 5.807355 6.375039
## [29] 6.339850 6.108524 5.643856 6.169925 5.754888 5.807355 6.442943
## [36] 6.392317 6.321928 6.087463 6.569856 6.539159

Finally we can find the median of this.

median(log2(small$Length))
## [1] 6.227664
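
The nested call can also be written one step at a time, saving each intermediate result, which some people find easier to read. The sketch below is self-contained: the Length values are copied from the output shown earlier so that it runs without small_file.txt, and the names lengths, log.lengths and median.log.length are illustrative.

```r
# The 40 Length values, copied from the printed small$Length output
c(45, 82, 81, 56, 96, 85, 65, 96, 60, 62, 80, 63, 50, 64, 43, 98, 78,
  53, 100, 79, 84, 68, 99, 65, 55, 98, 56, 83, 81, 69, 50, 72, 54, 56,
  87, 84, 80, 68, 95, 93) -> lengths

# Step 1: log2 transform every value
log2(lengths) -> log.lengths

# Step 2: take the median of the transformed values
median(log.lengths) -> median.log.length

# This gives the same answer as the nested version
median.log.length == median(log2(lengths))
```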

Reading a larger variants file

The second file we’re going to read is a CSV file of variant data. We therefore need to use read_csv to read it in.

read_csv("Child_Variants.csv") -> child
## Parsed with column specification:
## cols(
##   CHR = col_character(),
##   POS = col_double(),
##   dbSNP = col_character(),
##   REF = col_character(),
##   ALT = col_character(),
##   QUAL = col_double(),
##   GENE = col_character(),
##   ENST = col_character(),
##   MutantReads = col_double(),
##   COVERAGE = col_double(),
##   MutantReadPercent = col_double()
## )
child