
R Vocabulary - Part 4

This is the fourth and final part in the series of articles on R vocabulary. In this series, we explore most of the functions mentioned in Chapter 2 of the book Advanced R. The first, second and third parts of the series can be read here, here and here.

In this article, we explore most of the functions mentioned under the heading Statistics in the chapter.

The duplicated function returns a logical vector indicating which elements of a vector duplicate an earlier element; the first occurrence of a value is not flagged. It can also be used to test whether a data frame has duplicate rows, in which case it returns one logical value per row. In the examples below, the later occurrences of 1 and 2 are flagged in the vector, and the second occurrence of the row a, p, 1 is flagged in the data frame.

duplicated(c(1, 2, 1, 3, 2, 2))
## [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE
d <- data.frame(x = c("a", "b", "a"), y = c("p", "p", "p"), z = c(1, 2, 1))
duplicated(d)
## [1] FALSE FALSE  TRUE
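
As a small sketch of how this logical vector is typically used, it can drop repeated elements or locate the first repeat. The vector x and the use of fromLast and anyDuplicated are illustrative additions here, not part of the examples above.

```r
x <- c(1, 2, 1, 3, 2, 2)

# Keep only the first occurrence of each value.
first_only <- x[!duplicated(x)]
first_only
## [1] 1 2 3

# Scan from the end instead, so the last occurrence is the one not flagged.
from_last <- duplicated(x, fromLast = TRUE)
from_last
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

# Index of the first duplicated element (0 if there are none).
first_dup <- anyDuplicated(x)
first_dup
## [1] 3
```

Subsetting with !duplicated(x) gives the same result as unique(x), which is described next.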

unique removes the duplicate elements from a vector, or the duplicate rows from a data frame. NA and NaN are treated as distinct values. Also, any values supplied via the incomparables argument are never marked as duplicates, so every occurrence of them is kept.

x <- sample(1:3, 5, replace = TRUE)
x
## [1] 1 2 1 3 1
unique(x)
## [1] 1 2 3
unique(c(1, 2, NA, 3, 3, 1))
## [1]  1  2 NA  3
unique(c(1, NA, NaN, NA, 3))
## [1]   1  NA NaN   3
unique(c(1, NA, NaN, NA, 3, 1), incomparables = NA)
## [1]   1  NA NaN  NA   3
unique(d)
##   x y z
## 1 a p 1
## 2 b p 2

merge is used to perform join operations between two data frames. By using the by and all arguments, different types of joins can be implemented. By default, all the column names common to the two data frames are used as join keys and an inner join is performed. In the example below, the data frames are merged using the column x. As there are three rows with the value 3 in the first data frame and two rows with the value 3 in the second data frame, the result has six rows with the value 3, because every matching row of one data frame is paired with every matching row of the other (a Cartesian product within each key value).

d1 <- data.frame(
  x = c(3, 3, 3, 1, 1),
  y = rnorm(5)
)
d1
##   x           y
## 1 3 -0.93804483
## 2 3  0.36256382
## 3 3  0.05134892
## 4 1  0.33818822
## 5 1  0.31147624
d2 <- data.frame(
  x = c(1, 2, 3, 3, 2),
  z = rnorm(5)
)
d2
##   x         z
## 1 1 3.2746670
## 2 2 1.4201264
## 3 3 2.4367997
## 4 3 0.1864403
## 5 2 0.3210642
merge(d1, d2)
##   x           y         z
## 1 1  0.33818822 3.2746670
## 2 1  0.31147624 3.2746670
## 3 3 -0.93804483 2.4367997
## 4 3 -0.93804483 0.1864403
## 5 3  0.36256382 2.4367997
## 6 3  0.36256382 0.1864403
## 7 3  0.05134892 2.4367997
## 8 3  0.05134892 0.1864403

The remaining examples demonstrate a few of the arguments that can be used with merge. In the last example, the value 2, which appears in d2 but not in d1, is included, and the variables from d1 are set to NA for these rows.

merge(d1, d2, by = "x") # same as above
##   x           y         z
## 1 1  0.33818822 3.2746670
## 2 1  0.31147624 3.2746670
## 3 3 -0.93804483 2.4367997
## 4 3 -0.93804483 0.1864403
## 5 3  0.36256382 2.4367997
## 6 3  0.36256382 0.1864403
## 7 3  0.05134892 2.4367997
## 8 3  0.05134892 0.1864403
names(d2) <- c("x2", "z")
merge(d1, d2, by.x = "x", by.y = "x2") # specify the join keys
##   x           y         z
## 1 1  0.33818822 3.2746670
## 2 1  0.31147624 3.2746670
## 3 3 -0.93804483 2.4367997
## 4 3 -0.93804483 0.1864403
## 5 3  0.36256382 2.4367997
## 6 3  0.36256382 0.1864403
## 7 3  0.05134892 2.4367997
## 8 3  0.05134892 0.1864403
merge(d1, d2, by.x = "x", by.y = "x2", all.y = TRUE) # right join
##    x           y         z
## 1  1  0.33818822 3.2746670
## 2  1  0.31147624 3.2746670
## 3  2          NA 1.4201264
## 4  2          NA 0.3210642
## 5  3 -0.93804483 2.4367997
## 6  3 -0.93804483 0.1864403
## 7  3  0.36256382 2.4367997
## 8  3  0.36256382 0.1864403
## 9  3  0.05134892 2.4367997
## 10 3  0.05134892 0.1864403
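
The join types can be compared side by side on a small deterministic example. This is only a sketch: the data frames left and right and their columns are made up for illustration.

```r
left  <- data.frame(id = c(1, 2, 3), a = c("x", "y", "z"))
right <- data.frame(id = c(2, 3, 4), b = c(TRUE, FALSE, TRUE))

inner_rows <- merge(left, right, by = "id")                # inner join
left_rows  <- merge(left, right, by = "id", all.x = TRUE)  # left join
full_rows  <- merge(left, right, by = "id", all = TRUE)    # full outer join

inner_rows$id
## [1] 2 3
left_rows$b   # id 1 has no match in right, so b is NA
## [1]    NA  TRUE FALSE
full_rows$id
## [1] 1 2 3 4
```

Setting all.x, all.y or all to TRUE decides which unmatched rows are kept; the columns that have no counterpart in the other data frame are filled with NA.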

order takes a vector of values and returns a vector of indices: the positions in the original vector of its elements, listed in sorted order. For example, the value -1.27 is the lowest value in the example below, and it appears as the 8th element in the original vector. So the first element in the vector returned by order will be 8. Similarly, 0.13 is the 7th smallest value and appears in the 5th position in the original vector - so the 7th element in the vector returned by order will be 5.

set.seed(123)
x <- round(rnorm(10), 2)
x
##  [1] -0.56 -0.23  1.56  0.07  0.13  1.72  0.46 -1.27 -0.69 -0.45
sort(x)
##  [1] -1.27 -0.69 -0.56 -0.45 -0.23  0.07  0.13  0.46  1.56  1.72
order(x)
##  [1]  8  9  1 10  2  4  5  7  3  6

It is easy to see that this can be used to sort a data frame by one variable. In the example below, we re-arrange the rows of d in the sorted order of d$x, thus sorting the data frame by x.

d <- data.frame(x = rnorm(10), y = rnorm(10))
d
##             x          y
## 1   1.2240818 -1.0678237
## 2   0.3598138 -0.2179749
## 3   0.4007715 -1.0260044
## 4   0.1106827 -0.7288912
## 5  -0.5558411 -0.6250393
## 6   1.7869131 -1.6866933
## 7   0.4978505  0.8377870
## 8  -1.9666172  0.1533731
## 9   0.7013559 -1.1381369
## 10 -0.4727914  1.2538149
d[order(d$x), ]
##             x          y
## 8  -1.9666172  0.1533731
## 5  -0.5558411 -0.6250393
## 10 -0.4727914  1.2538149
## 4   0.1106827 -0.7288912
## 2   0.3598138 -0.2179749
## 3   0.4007715 -1.0260044
## 7   0.4978505  0.8377870
## 9   0.7013559 -1.1381369
## 1   1.2240818 -1.0678237
## 6   1.7869131 -1.6866933
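
order also accepts several vectors, using the later ones to break ties in the earlier ones. The following sketch, with a made-up scores data frame, sorts by one column ascending and another descending.

```r
scores <- data.frame(
  team = c("b", "a", "b", "a"),
  pts  = c(10, 7, 12, 9)
)

# Sort by team (ascending), breaking ties by pts (descending).
idx <- order(scores$team, -scores$pts)
idx
## [1] 4 2 3 1
scores[idx, "pts"]
## [1]  9  7 12 10

# Indexing a vector by its own order() gives the same result as sort().
identical(scores$pts[order(scores$pts)], sort(scores$pts))
## [1] TRUE
```

Negating a numeric column is a common shortcut for a descending key; the decreasing argument of order is an alternative when all keys should go in the same direction.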

rank is used to calculate the sample ranks of the values in a vector. The argument ties.method controls how duplicate values are handled. In the example below, the value 1 gets the rank 1.0. There are three elements with the value 2, occupying ranks 2, 3 and 4; the default method ("average") replaces these by their average, 3.0. Likewise, the four elements with the value 3 occupy ranks 5, 6, 7 and 8, which are averaged to 6.5. The value 4 then gets the rank 9.0 and the value 5 the rank 10.0. Using ties.method = "min" results in a ranking similar to sports competitions, where everyone with the same value gets the same (lowest) rank and the next rank starts at a point depending on the number of people with the same value.

set.seed(123)
x <- sample(1:5, 10, replace = TRUE)
x
##  [1] 3 3 2 2 3 5 4 1 2 3
rank(x)
##  [1]  6.5  6.5  3.0  3.0  6.5 10.0  9.0  1.0  3.0  6.5
rank(x, ties.method = "min")
##  [1]  5  5  2  2  5 10  9  1  2  5
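
Two other tie-breaking methods are sketched below on the same values, hard-coded here so the example does not depend on the random seed. The comparison with order(order(x)) is an illustrative addition.

```r
x <- c(3, 3, 2, 2, 3, 5, 4, 1, 2, 3)

# "max": every tied element gets the highest rank of its group.
rank_max <- rank(x, ties.method = "max")
rank_max
##  [1]  8  8  4  4  8 10  9  1  4  8

# "first": ties are broken by position, so all ranks are distinct.
rank_first <- rank(x, ties.method = "first")
rank_first
##  [1]  5  6  2  3  7 10  9  1  4  8

# With ties broken by position, rank is the inverse permutation of order.
all(rank_first == order(order(x)))
## [1] TRUE
```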

quantile calculates the quantiles of a vector, with the probs argument specifying which quantiles should be calculated. For example, to obtain the deciles of a vector, we set probs to run from 0 to 1 in steps of 0.1. In the example below, the sample median (the 50% quantile) is close to 0 but not exactly 0, because the data are a random sample from the standard normal distribution.

x <- rnorm(1000)
hist(x)

quantile(x, probs = seq(0, 1, by = 0.1))
##           0%          10%          20%          30%          40% 
## -2.809774679 -1.284947198 -0.792953894 -0.490076550 -0.214854912 
##          50%          60%          70%          80%          90% 
##  0.004310117  0.242040647  0.513088142  0.841412710  1.250304606 
##         100% 
##  3.241039935
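
On a small fixed vector the result is easy to check by hand. This sketch uses made-up inputs and relies on the default interpolation method of quantile.

```r
x <- 1:9

# Quartiles of 1..9 fall exactly on data points.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
unname(q)
## [1] 3 5 7

# The 50% quantile agrees with median().
unname(quantile(x, probs = 0.5)) == median(x)
## [1] TRUE

# With an even number of values, the default method interpolates.
mid <- unname(quantile(c(1, 2, 3, 4), probs = 0.5))
mid
## [1] 2.5
```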

sort, as the name suggests, is used to sort a vector. Complex numbers are first sorted by the real part, and then the imaginary part.

sort(rnorm(10))
##  [1] -1.2619005 -1.2187118 -0.4469593 -0.2112469  0.1870511  0.2275427
##  [7]  0.4690320  0.6851982  2.7973911  2.8322260
sort(sample(letters[1:3], 10, replace = TRUE))
##  [1] "a" "b" "b" "b" "b" "b" "c" "c" "c" "c"
sort(sample(c(TRUE, FALSE), 10, replace = TRUE))
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
sort(rnorm(10) + rnorm(10) * 1i)
##  [1] -2.3159235+0.3641461i -0.6137728+2.3428030i -0.2526091-1.9288377i
##  [4]  0.4470217-1.6547705i  0.7309635+0.6508826i  0.7859170-0.2675899i
##  [7]  0.9798161-1.2351926i  1.0420316+0.5153308i  1.3615980-0.5208038i
## [10]  1.5036305-0.4808306i
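
Two arguments worth knowing are decreasing and na.last. The short sketch below uses a made-up vector to show that sort drops missing values unless told otherwise.

```r
x <- c(3, 1, NA, 2)

# By default, NA values are removed from the result.
sort(x)
## [1] 1 2 3

# Reverse the sort direction.
sort(x, decreasing = TRUE)
## [1] 3 2 1

# Keep missing values and place them at the end.
sort(x, na.last = TRUE)
## [1]  1  2  3 NA
```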

table and ftable are used to build contingency tables. They accept one or more objects which can be interpreted as factors and create a contingency table of the counts at each combination of levels. The key difference between the two functions is that ftable creates a 'flat' table, a single matrix whose rows and columns correspond to combinations of the levels. Note that the two calls below each draw fresh random samples, so their counts are not expected to agree.

table(sample(letters[1:2], 10, replace = TRUE), sample(letters[3:4], 10, replace = TRUE), sample(letters[5:6], 10, replace = TRUE))
## , ,  = e
## 
##    
##     c d
##   a 0 3
##   b 2 2
## 
## , ,  = f
## 
##    
##     c d
##   a 1 0
##   b 0 2
ftable(sample(letters[1:2], 10, replace = TRUE), sample(letters[3:4], 10, replace = TRUE), sample(letters[5:6], 10, replace = TRUE))
##      e f
##         
## a c  2 1
##   d  3 0
## b c  0 0
##   d  2 2
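
To compare the two shapes on identical data, both functions can be applied to the same data frame. The survey data frame below is made up for this sketch.

```r
survey <- data.frame(
  group  = c("a", "a", "b", "b", "b", "a"),
  answer = c("yes", "no", "yes", "yes", "no", "yes"),
  wave   = c("w1", "w1", "w1", "w2", "w2", "w2")
)

counts <- table(survey)   # three-dimensional array of counts
flat   <- ftable(survey)  # same counts laid out as one matrix

dim(counts)
## [1] 2 2 2
dim(flat)   # group x answer combinations in rows, wave in columns
## [1] 4 2

# Cells of a table can be looked up by level names.
counts["b", "yes", "w1"]
## [1] 1
sum(flat)
## [1] 6
```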

We now look at some functions related to building statistical models. Many of these functions are generic functions - so they behave differently based on the type of statistical model being fitted. Let us fit a linear regression in which we try to predict mpg (miles per gallon) using disp (displacement) and wt (weight of the car). This is accomplished using the lm function. lm is used to fit a variety of linear models, but in this simple example we are doing a multiple linear regression.

car_mod <- lm(mpg ~ disp + wt, data = mtcars)
class(car_mod)
## [1] "lm"
summary(car_mod)
## 
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4087 -2.3243 -0.7683  1.7721  6.3484 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
## disp        -0.01773    0.00919  -1.929  0.06362 .  
## wt          -3.35082    1.16413  -2.878  0.00743 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658 
## F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10

Recall that when we run the function summary on a data frame, it generates summary statistics of the columns in the data frame. When we pass it an object of class lm, it provides a summary of the model which was built using the lm function. The function fitted returns the fitted values for the training data, while predict can be used to apply the model to new data. resid is used to extract the model residuals. rstandard and rstudent calculate the standardised and Studentised residuals respectively. The help page on influence.measures provides the list of diagnostic functions for regression models. We have already looked at the lm function above. The function glm is used to build generalised linear models. We are not going to cover the details of such models in this article.

head(fitted(car_mod))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##          23.34543          22.49097          25.27237          19.61467 
## Hornet Sportabout           Valiant 
##          17.05281          19.37863
predict(car_mod, newdata = data.frame(
  disp = rnorm(5, mean = 230, sd = 25),
  wt = rnorm(5, mean = 3.5, sd = 2)
))
##         1         2         3         4         5 
##  3.136114 20.332581 20.528321 20.159592 28.958952
head(resid(car_mod))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##         -2.345433         -1.490972         -2.472367          1.785333 
## Hornet Sportabout           Valiant 
##          1.647193         -1.278631
head(rstandard(car_mod))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##        -0.8222164        -0.5232550        -0.8757799         0.6243627 
## Hornet Sportabout           Valiant 
##         0.6092882        -0.4483953
head(rstudent(car_mod))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##        -0.8175008        -0.5165987        -0.8721585         0.6176689 
## Hornet Sportabout           Valiant 
##         0.6025603        -0.4421318
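
The relationships between these functions can be checked directly. The sketch below refits the same model and uses a hypothetical new_cars data frame with arbitrary displacement and weight values.

```r
car_mod <- lm(mpg ~ disp + wt, data = mtcars)

# Observed value = fitted value + residual, for every training row.
all.equal(unname(fitted(car_mod) + resid(car_mod)), mtcars$mpg)
## [1] TRUE

# predict() on new data matches the regression equation written out by hand.
new_cars <- data.frame(disp = c(150, 300), wt = c(2.5, 3.5))  # hypothetical cars
b <- coef(car_mod)
by_hand <- b[["(Intercept)"]] + b[["disp"]] * new_cars$disp + b[["wt"]] * new_cars$wt
all.equal(unname(predict(car_mod, newdata = new_cars)), by_hand)
## [1] TRUE
```

coef extracts the estimated coefficients that appear in the Estimate column of the summary output.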

The next set of functions are the ones related to probability distributions. In general, given a probability distribution, the d function computes the density (or probability mass), p the cumulative distribution function, q the quantile function (the inverse of p), and r generates random numbers from that distribution. For example, the corresponding functions for the normal distribution are dnorm, pnorm, qnorm and rnorm. If the distribution has parameters, they are accepted as arguments to these functions, often with default values. For example, for the normal distribution, the default mean is 0 and the default standard deviation is 1.

dnorm(0)
## [1] 0.3989423
pnorm(0)
## [1] 0.5
qnorm(0.5)
## [1] 0
rnorm(5, mean = 2, sd = 0.5)
## [1] 2.502210 2.406469 2.207488 1.957070 1.045896
qnorm(pnorm(c(0.2, 0.8, 3)))
## [1] 0.2 0.8 3.0
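
The same naming pattern applies to discrete distributions. As a sketch, here are the four functions for a binomial distribution with 4 trials and success probability 0.5; these parameter values are chosen only because the results are easy to verify by hand.

```r
# P(X = 2) = choose(4, 2) / 2^4
dbinom(2, size = 4, prob = 0.5)
## [1] 0.375

# P(X <= 2) = (1 + 4 + 6) / 16
pbinom(2, size = 4, prob = 0.5)
## [1] 0.6875

# Smallest x such that P(X <= x) >= 0.5
qbinom(0.5, size = 4, prob = 0.5)
## [1] 2

# Random draws always lie between 0 and size.
set.seed(1)
draws <- rbinom(5, size = 4, prob = 0.5)
all(draws >= 0 & draws <= 4)
## [1] TRUE
```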

The last set of functions we look at in this article relate to matrix algebra. crossprod and tcrossprod are used to calculate matrix cross-products. They are equivalent to t(x) %*% y and x %*% t(y) respectively, where t transposes a matrix while %*% is the matrix multiplication operator. eigen computes the eigenvalues and eigenvectors of a matrix and returns them as a list. qr and svd compute the QR decomposition and the singular value decomposition of a matrix respectively.

m1 <- matrix(c(1, 2, 3, 4), nrow = 2)
m2 <- matrix(c(5, 6, 7, 8), nrow = 2)
m1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
m2
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
m1 %*% m2
##      [,1] [,2]
## [1,]   23   31
## [2,]   34   46
crossprod(m1, m2)
##      [,1] [,2]
## [1,]   17   23
## [2,]   39   53
tcrossprod(m1, m2)
##      [,1] [,2]
## [1,]   26   30
## [2,]   38   44
eigen(m1)
## eigen() decomposition
## $values
## [1]  5.3722813 -0.3722813
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.5657675 -0.9093767
## [2,] -0.8245648  0.4159736
qr(m1)
## $qr
##            [,1]       [,2]
## [1,] -2.2360680 -4.9193496
## [2,]  0.8944272 -0.8944272
## 
## $rank
## [1] 2
## 
## $qraux
## [1] 1.4472136 0.8944272
## 
## $pivot
## [1] 1 2
## 
## attr(,"class")
## [1] "qr"
svd(m2)
## $d
## [1] 13.1900344  0.1516296
## 
## $u
##            [,1]       [,2]
## [1,] -0.6521255 -0.7581111
## [2,] -0.7581111  0.6521255
## 
## $v
##            [,1]       [,2]
## [1,] -0.5920601  0.8058938
## [2,] -0.8058938 -0.5920601

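solve is used to solve a system of linear equations. The first argument provides the coefficients of the linear system in matrix form, while the second argument provides the right-hand side of the system. In the example below, the system is x + 3y = 10 and 2x + 4y = 30, whose solution is x = 25 and y = -5.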

m <- matrix(c(1:4), nrow = 2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
solve(m, c(10, 30))
## [1] 25 -5
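
The solution can be checked by substituting it back into the system. The sketch below also calls solve with a single argument, which returns the inverse of the matrix; the names rhs, sol and m_inv are introduced here for illustration.

```r
m <- matrix(1:4, nrow = 2)
rhs <- c(10, 30)

# Substituting the solution back reproduces the right-hand side.
sol <- solve(m, rhs)
all.equal(as.vector(m %*% sol), rhs)
## [1] TRUE

# With only one argument, solve() returns the matrix inverse.
m_inv <- solve(m)
all.equal(m %*% m_inv, diag(2))
## [1] TRUE

# Multiplying the inverse by the right-hand side gives the same solution.
all.equal(as.vector(m_inv %*% rhs), sol)
## [1] TRUE
```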