Selecting elements with square brackets

By providing a number within square brackets, the respective element is selected from a vector:

names <- c("Sheldon", "Leonard", "Penny", "Amy")
names[1]
[1] "Sheldon"

When you provide a vector of numbers, multiple elements are selected:

names[c(1,4)]
[1] "Sheldon" "Amy"    

You can even change the order or repeat elements:

names[c(4, 1, 1)]
[1] "Amy"     "Sheldon" "Sheldon"

With negative numbers, elements are dropped:

names[-1]
[1] "Leonard" "Penny"   "Amy"    
names[c(-1, -3)]
[1] "Leonard" "Amy"    
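
As a small sketch of how negative indices behave: they drop elements, so the result is shorter than the original, and they cannot be combined with positive indices. The object names `cast`, `kept`, and `mixed` are made up for this illustration.

```r
cast <- c("Sheldon", "Leonard", "Penny", "Amy")

# Dropping elements 1 and 3 leaves the other two in their original order
kept <- cast[c(-1, -3)]
kept
length(kept)

# -c(1, 3) is shorthand for c(-1, -3)
identical(cast[-c(1, 3)], kept)

# Mixing positive and negative indices is an error; tryCatch() captures it
mixed <- tryCatch(cast[c(-1, 2)], error = function(e) e)
inherits(mixed, "error")
```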

Subsetting data frames

Firstly, we create an example data frame:

study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)
study
sen gender age IQ
0 M 12 90
1 M 13 85
0 F 11 90
1 M 10 87
0 F 11 99
1 F 14 89

Square brackets select a column of a data frame either by its number or by its name:

study[3]
age
12
13
11
10
11
14
study["age"]
age
12
13
11
10
11
14

The result is still a data frame, now with a single column.
This is different from extracting a variable with [[ or $:

study[["age"]]
[1] 12 13 11 10 11 14
study$age
[1] 12 13 11 10 11 14

Both return a vector (!), not a data frame.

While this works:

median(study[["age"]])
[1] 11.5

this throws an error:

median(study["age"])
Error in median.default(study["age"]) : need numeric data
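
The difference can be checked directly by looking at the class of each result. This is a minimal sketch using a reduced copy of the example data; `one_col` and `vec` are names chosen only for illustration.

```r
study <- data.frame(
  age = c(12, 13, 11, 10, 11, 14),
  IQ  = c(90, 85, 90, 87, 99, 89)
)

one_col <- study["age"]    # single brackets: still a data frame
vec     <- study[["age"]]  # double brackets: the column itself

class(one_col)       # "data.frame"
class(vec)           # "numeric"
is.numeric(one_col)  # FALSE, which is why median() refuses it
median(vec)          # 11.5
```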

Providing a vector will select multiple columns:

study[c(1,3)]
sen age
0 12
1 13
0 11
1 10
0 11
1 14
study[c("sen", "age")]
sen age
0 12
1 13
0 11
1 10
0 11
1 14

Extraction and subsetting

The extraction of a vector and the selection of elements can be combined:

age <- study[["age"]]
age[c(2,4)]
[1] 13 10

Or within one step:

study$age[c(2,4)]
[1] 13 10
study[["age"]][c(2,4)]
[1] 13 10

Selecting rows and columns

Rows and columns are selected together within square brackets: object_name[rows, columns]. Leaving one side of the comma empty selects all rows or all columns.

study[5, ]  # filter a row
sen gender age IQ
5 0 F 11 99
study[c(2, 6), ] # filter two rows
sen gender age IQ
2 1 M 13 85
6 1 F 14 89
study[c(2, 6), "IQ"]
[1] 85 89
study[c(2, 6), c("sen", "IQ")]
sen IQ
2 1 85
6 1 89

You could also use numbers to address the columns:

study[, 2]
[1] "M" "M" "F" "M" "F" "F"
study[c(2, 6), c(1, 3)]
sen age
2 1 13
6 1 14

Relational operators

Relational operators compare two values and return a logical value (TRUE or FALSE).

Operator   Relation                   Example
==         is identical               x == y
!=         is not identical           x != y
>          is greater                 x > y
>=         is greater or identical    x >= y
<          is less                    x < y
<=         is less or identical       x <= y

Examples

7 > 2
[1] TRUE
7 <=  10
[1] TRUE
5 == 4
[1] FALSE
5 != 6
[1] TRUE

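Relational operators and characters

For character values, == and != are the comparisons you will usually need (the ordering operators also accept strings, but compare them alphabetically according to the locale):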

"Hamster" == "Mouse"
[1] FALSE
"Hamster" != "Mouse"
[1] TRUE

Relational operators and vectors

age <- c(12, 4, 3, 8, 4, 2, 1)
age < 5
[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

This behavior relies on recycling, which is implemented in many (but not all!) R functions.

recycling: The shorter operand (here the single value 5) is reused for every element of the longer vector, so the operation is applied to each element and a vector is returned.

age   age < 5
12    FALSE
4     TRUE
3     TRUE
8     FALSE
4     TRUE
2     TRUE
1     TRUE
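
To make the recycling idea more concrete, here is a small sketch in which the shorter operand is reused until it matches the longer one. The length-2 vector c(5, 10) is an arbitrary choice for illustration.

```r
age <- c(12, 4, 3, 8)

# The single value 5 is recycled, so these two comparisons are the same
identical(age < 5, age < c(5, 5, 5, 5))

# A length-2 vector is recycled twice to match the four elements of age:
# 12 < 5, 4 < 10, 3 < 5, 8 < 10
res <- age < c(5, 10)
res
# [1] FALSE  TRUE  TRUE  TRUE
```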

Using logical vectors to select values

When you put a logical vector within square brackets [ ] after an object, all elements of that object with a TRUE in the logical vector are selected:

age <- c(12, 4, 3, 8)
x <- age > 5
x
[1]  TRUE FALSE FALSE  TRUE
age[x]
[1] 12  8

age  x <- age > 5  Select?  Result
12   TRUE          select   12
4    FALSE         drop
3    FALSE         drop
8    TRUE          select   8

which()

The which() function returns the indices of the elements that are TRUE.
It takes a logical vector as an argument.

x <- c(TRUE, FALSE, FALSE, TRUE)
which(x)
[1] 1 4

which() can handle missing values (NA is skipped, just like FALSE):

x <- c(TRUE, FALSE, NA, FALSE, TRUE, NA)
which(x)
[1] 1 5

which() is typically applied to the result of a comparison:

age <- c(12, 4, 3, 8)
x <- age < 5
x
[1] FALSE  TRUE  TRUE FALSE
which(x)
[1] 2 3
age[which(x)]
[1] 4 3

Index  age  x <- age < 5  which(x)  age[which(x)]
1      12   FALSE
2      4    TRUE          2         4
3      3    TRUE          3         3
4      8    FALSE

Why use which?

age <- c(NA, 12, 4, 3, NA, 8, 7, 4, 3, 6, 4, 3)
x <- age < 6
x
 [1]    NA FALSE  TRUE  TRUE    NA FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
age[x]
[1] NA  4  3 NA  4  3  4  3
mean(age[x])
[1] NA
mean(age[which(x)])
[1] 3.5
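
The example above can be unpacked a little further. The sketch below counts the NA values that each indexing style lets through and shows two alternatives that should give the same mean as which(); whether which() or na.rm = TRUE reads better is a matter of taste.

```r
age <- c(NA, 12, 4, 3, NA, 8, 7, 4, 3, 6, 4, 3)
x <- age < 6

# An NA in a logical index produces an NA element in the result ...
sum(is.na(age[x]))         # 2
# ... whereas which() keeps only the positions that are TRUE
sum(is.na(age[which(x)]))  # 0

# Two alternatives that give the same mean as mean(age[which(x)])
mean(age[x], na.rm = TRUE)
mean(age[!is.na(x) & x])
```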

Selecting cases with logical vectors

Logical vectors can also be applied to data frames for selecting cases.

Let us take an example data frame:

study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

Select with bracket subsetting or the which() function:

study_no_sen <- study[study[["sen"]] == 0, ]
study_no_sen
sen gender age IQ
1 0 M 12 90
3 0 F 11 90
5 0 F 11 99
# Or using the which() function
filter <- which(study[["sen"]] == 0)
study_no_sen <- study[filter, ]
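
One practical difference between the two approaches shows up when the filter variable contains missing values. The data frame `study_na` below is a made-up variant of the example data, used only to illustrate this point.

```r
study_na <- data.frame(
  sen = c(0, NA, 0, 1),
  age = c(12, 13, 11, 10)
)

# Logical index: the NA in sen becomes a row filled with NA
by_logical <- study_na[study_na[["sen"]] == 0, ]
# which(): only the rows where the comparison is TRUE
by_which <- study_na[which(study_na[["sen"]] == 0), ]

nrow(by_logical)  # 3
nrow(by_which)    # 2
```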

Logical Operations

Logical operations are applied to logical values.

Operator  Operation  Example  Result
!         NOT        !x       TRUE when x is FALSE, and FALSE when x is TRUE
&         AND        x & y    TRUE when both x and y are TRUE, else FALSE
|         OR         x | y    TRUE when x or y (or both) is TRUE, else FALSE

Note: To get the | sign:
On a German Mac keyboard, press option + 7.
On a German Windows keyboard, press AltGr + <.

Example

x <- TRUE
y <- FALSE


!x
[1] FALSE
!y
[1] TRUE
x & y
[1] FALSE
x | y
[1] TRUE

Logical operators with vectors

When applied to vectors, logical operations result in a new vector.
Operations are applied to each element one by one.

x <- c(TRUE, FALSE, TRUE,  FALSE)
y <- c(TRUE, FALSE, FALSE, TRUE)
!x
[1] FALSE  TRUE FALSE  TRUE
x & y
[1]  TRUE FALSE FALSE FALSE
x | y
[1]  TRUE FALSE  TRUE  TRUE

glasses  hyperintelligent  glasses & hyperintelligent
TRUE     TRUE              TRUE
TRUE     FALSE             FALSE
FALSE    FALSE             FALSE
TRUE     TRUE              TRUE
FALSE    FALSE             FALSE
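
The glasses table can be reproduced in a few lines. This is only a sketch: the vector names are taken from the table header, and `both` is a name introduced here for the combined condition.

```r
glasses          <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Element-wise AND of the two logical vectors
both <- glasses & hyperintelligent
data.frame(glasses, hyperintelligent, both)

# Positions of the cases where both conditions hold
which(both)
# [1] 1 4
```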

sum() and mean() with logical vectors

When a logical vector is passed to a numeric function (e.g. mean() or sum()), TRUE is counted as 1 and FALSE as 0:

sum() then gives the number of elements that are TRUE.
mean() gives the proportion of elements that are TRUE.

# e.g.:
sum(c(TRUE, FALSE, TRUE))
[1] 2
mean(c(TRUE, FALSE, TRUE, FALSE))
[1] 0.5
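
In practice the logical vector usually comes from a comparison. A minimal sketch, reusing the age vector from the relational operators example:

```r
age <- c(12, 4, 3, 8, 4, 2, 1)

n_below    <- sum(age < 5)   # number of values below 5
prop_below <- mean(age < 5)  # proportion of values below 5

n_below
# [1] 5
prop_below
```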

Combining logical and relational operators

age <- c(12, 4, 3, 8, 4, 2, 1, 7, 4)
gender <- c(0, 1, 0, 1, 0, 0, 0, 0, 1)
age > 4
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
gender == 0
[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
age > 4 & gender == 0
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

income <- c(5000, 4000, 3000, 2000, 1000)
happiness <- c(20, 35, 30, 10, 50)
income > 2500 & happiness > 25
[1] FALSE  TRUE  TRUE FALSE FALSE

income  happiness  income > 2500  happiness > 25  income > 2500 & happiness > 25
5000    20         TRUE           FALSE           FALSE
4000    35         TRUE           TRUE            TRUE
3000    30         TRUE           TRUE            TRUE
2000    10         FALSE          FALSE           FALSE
1000    50         FALSE          TRUE            FALSE

mean() gives the proportion of cases that meet both conditions:

mean(income > 2500 & happiness > 25)
[1] 0.4

Subsetting data frames with logical and relational operators

study
sen gender age IQ
0 M 12 90
1 M 13 85
0 F 11 90
1 M 10 87
0 F 11 99
1 F 14 89
filter <- study[["sen"]] == 1 & study[["gender"]] == "M"
study[filter, ]
sen gender age IQ
2 1 M 13 85
4 1 M 10 87

The same approach works for the built-in ChickWeight data set:

filter <- ChickWeight[["Diet"]] ==  1 & ChickWeight[["Time"]] < 16
diet1 <- ChickWeight[filter,]
cor(diet1[["weight"]], diet1[["Time"]])
[1] 0.8109772


filter <- ChickWeight[["Diet"]] ==  4 & ChickWeight[["Time"]] < 16
diet4 <- ChickWeight[filter,]
cor(diet4[["weight"]], diet4[["Time"]])
[1] 0.9720822

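The correlation between weight and time is larger for Diet 4. This may indicate that Diet 4 has a more consistent impact on the chickens' weight, although a correlation measures how closely weight follows time, not how large the weight gain is.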
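
The filter-then-correlate pattern can be wrapped in a small function so that it does not have to be repeated for each diet. This is a sketch, not part of the original analysis: `cor_for_diet()` and its `max_time` argument are hypothetical names, and the cut-off of 16 is copied from the example above.

```r
# Hypothetical helper: correlation of weight and Time for one diet,
# restricted to measurements taken before day `max_time`
cor_for_diet <- function(diet, max_time = 16) {
  keep <- ChickWeight[["Diet"]] == diet & ChickWeight[["Time"]] < max_time
  d <- ChickWeight[keep, ]
  cor(d[["weight"]], d[["Time"]])
}

# One correlation per diet level (ChickWeight has the diets 1 to 4)
cors <- sapply(levels(ChickWeight[["Diet"]]), cor_for_diet)
cors
```

Looking at all four diets side by side gives more context than comparing only two of them.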

The subset() function

R comes with a function to make subsetting a bit more straightforward.

subset() has the main arguments:

  • x : A data.frame
  • subset : A logical expression for filtering rows
  • select : An expression indicating the columns to select from the data frame

and returns a data.frame.

subset(study, gender == "F" & IQ > 89, c(sen, gender, IQ))
sen gender IQ
3 0 F 90
5 0 F 99

Variable names must be provided without quotes and without the name of the data.frame.
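
To see that subset() is a convenience wrapper around the same idea, the sketch below compares it with the equivalent bracket expression on the example data. The names `by_subset` and `by_bracket` are chosen only for this illustration.

```r
study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

# Unquoted variable names, no data frame prefix
by_subset <- subset(study, gender == "F" & IQ > 89, c(sen, gender, IQ))

# The same selection with brackets: prefix and quoted column names needed
by_bracket <- study[study$gender == "F" & study$IQ > 89,
                    c("sen", "gender", "IQ")]

all.equal(by_subset, by_bracket)
```

One difference to keep in mind: subset() treats missing values in the condition as FALSE, so it does not produce the NA rows that a plain logical index can.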

So many ways of subsetting … an overview

Subset a data frame (and get a new data frame)

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, 
       c("mpg", "am", "gear", "cyl")]

mtcars[mtcars$cyl == 6 & mtcars$am == 1, c("mpg", "am", "gear", "cyl")]

subset(mtcars, cyl == 6 & am == 1, c(mpg, am, gear, cyl))

with(mtcars, 
  mtcars[cyl == 6 & am == 1, c("mpg", "am", "gear", "cyl")]
)

Extract a variable from a data frame (and get a numeric or character vector)

mtcars[["mpg"]][mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1]

mtcars$mpg[mtcars$cyl == 6 & mtcars$am == 1]

subset(mtcars, cyl == 6 & am == 1, mpg, drop = TRUE)

with(mtcars, mpg[cyl == 6 & am == 1])

Odd behaviour:

For base R data frames this creates a vector:

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, "mpg"]
[1] 21.0 21.0 19.7

One might expect a data frame with one variable here, but the result is automatically reduced to a vector.
Add drop = FALSE to keep the data frame.

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, "mpg", drop = FALSE]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Ferrari Dino 19.7

Some modern implementations of data frames (like tibbles) changed this behavior: subsetting a tibble with [ always returns a tibble.
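
For base R data frames, the effect of drop can be verified with is.data.frame(). A minimal sketch using mtcars; `keep`, `as_vector`, and `as_frame` are names introduced here.

```r
keep <- mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1

as_vector <- mtcars[keep, "mpg"]                # reduced to a vector
as_frame  <- mtcars[keep, "mpg", drop = FALSE]  # stays a data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE

# Without a row index, single brackets never drop to a vector
is.data.frame(mtcars["mpg"])  # TRUE
```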