R Basic Concepts: Subsetting

Selecting elements with square brackets

By providing a number within square brackets, the respective element is selected from a vector:

names <- c("Sheldon", "Leonard", "Penny", "Amy")
names[1]

[1] "Sheldon"

When you provide a vector of numbers, multiple elements are selected:

names[c(1,4)]

[1] "Sheldon" "Amy"

You can even change the order or repeat elements:

names[c(4, 1, 1)]

[1] "Amy"     "Sheldon" "Sheldon"

With negative numbers, elements are dropped:

names[-1]

[1] "Leonard" "Penny"   "Amy"

names[c(-1, -3)]

[1] "Leonard" "Amy"
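One detail that follows from the examples above: positive and negative indices cannot be combined in one selection. The following is a minimal sketch; the variable names without_last and result are only used for illustration, and tryCatch() is there just so the script keeps running after the error.

```r
names <- c("Sheldon", "Leonard", "Penny", "Amy")

# Drop the last element without hard-coding its position
without_last <- names[-length(names)]
without_last

# Mixing positive and negative indices is not allowed; R signals an error.
# tryCatch() turns that error into the string "error" for this illustration.
result <- tryCatch(names[c(-1, 2)], error = function(e) "error")
result
```

Dropping by position with length() keeps working when the vector grows or shrinks.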

Task

Take the vector
names <- c("Sheldon", "Leonard", "Penny", "Amy")
and reorder it to get the following result:
[1] "Sheldon" "Amy" "Sheldon" "Amy" "Leonard" "Penny"

Task - solution

Take the vector
names <- c("Sheldon", "Leonard", "Penny", "Amy")
and reorder it to get the following result:
[1] "Sheldon" "Amy" "Sheldon" "Amy" "Leonard" "Penny"

x <- c(1, 4, 1, 4, 2, 3)
new_order <- names[x]
new_order

[1] "Sheldon" "Amy"     "Sheldon" "Amy"     "Leonard" "Penny"

Subsetting data frames

Firstly, we create an example data frame:

study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)
study

sen	gender	age	IQ
0	M	12	90
1	M	13	85
0	F	11	90
1	M	10	87
0	F	11	99
1	F	14	89

Square brackets select a column of a data frame either by number or by column name:

study[3]

age
12
13
11
10
11
14

study["age"]

age
12
13
11
10
11
14

The subsetted object is a data frame with one column.
This is different from extracting a variable with the $ or [[ operators:

study[["age"]]

[1] 12 13 11 10 11 14

study$age

[1] 12 13 11 10 11 14

which returns a vector (!)

While this works:

median(study[["age"]])

[1] 11.5

this throws an error:

median(study["age"])

Error in median.default(study["age"]) : need numeric data
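To check which kind of object a selection returns, the type can be inspected directly. This is a small sketch using a shortened version of the study data frame; the names one_column and age_vector are chosen here for illustration only.

```r
# Shortened example data (only two of the original columns)
study <- data.frame(
  age = c(12, 13, 11, 10, 11, 14),
  IQ  = c(90, 85, 90, 87, 99, 89)
)

one_column <- study["age"]    # single brackets keep the data frame
age_vector <- study[["age"]]  # double brackets extract the column itself

is.data.frame(one_column)  # TRUE
is.numeric(age_vector)     # TRUE

# Extracting the column from the one-column data frame makes median() work
median(one_column[["age"]])
```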

Providing a vector will select multiple columns:

study[c(1,3)]

sen	age
0	12
1	13
0	11
1	10
0	11
1	14

study[c("sen", "age")]

sen	age
0	12
1	13
0	11
1	10
0	11
1	14

Extraction and subsetting

The extraction of a vector and the selection of elements can be combined:

age <- study[["age"]]
age[c(2,4)]

[1] 13 10

Or within one step:

study$age[c(2,4)]

[1] 13 10

study[["age"]][c(2,4)]

[1] 13 10

Selecting rows and columns

Specific cases are selected within square brackets: object_name[rows, columns].

study[5, ]  # filter a row

	sen	gender	age	IQ
5	0	F	11	99

study[c(2, 6), ] # filter two rows

	sen	gender	age	IQ
2	1	M	13	85
6	1	F	14	89

study[c(2, 6), "IQ"]

[1] 85 89

study[c(2, 6), c("sen", "IQ")]

	sen	IQ
2	1	85
6	1	89

You could also use numbers to address the columns:

study[, 2]

[1] "M" "M" "F" "M" "F" "F"

study[c(2, 6), c(1, 3)]

	sen	age
2	1	13
6	1	14

Task

Please create a new data frame (study2) comprising the gender and age variables for the cases 1, 3, and 5 of the study data frame.

Task - solution

Please create a new data frame (study2) comprising the gender and age variables for the cases 1, 3, and 5 of the study data frame.

study2 <- study[c(1, 3, 5), c("gender", "age")]
study2

	gender	age
1	M	12
3	F	11
5	F	11

Sophisticated subsetting

Subsetting becomes most powerful when it is combined with conditional selections.

For example:

Select all students with special educational needs.
Select all male students between the ages of 6 and 10.

To apply such selections, we have to know about relational and logical operators.

Relational operators

Relational operators compare two values and return a logical value (TRUE or FALSE).

Operator	Relation	Example
`==`	is equal to	x == y
`!=`	is not equal to	x != y
`>`	is greater than	x > y
`>=`	is greater than or equal to	x >= y
`<`	is less than	x < y
`<=`	is less than or equal to	x <= y

Examples

7 > 2

[1] TRUE

7 <=  10

[1] TRUE

5 == 4

[1] FALSE

5 != 6

[1] TRUE

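Relational operators and characters

Relational operators can also be applied to non-numeric objects such as character strings. == and != test for equality, while < and > compare strings in alphabetical (locale-dependent) order: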

"Hamster" == "Mouse"

[1] FALSE

"Hamster" != "Mouse"

[1] TRUE

Relational operators and vectors

age <- c(12, 4, 3, 8, 4, 2, 1)
age < 5

[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

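This behavior is called vectorization and is implemented in many (but not all!) R functions.

vectorization: An operation is applied to each element of a vector and a vector is returned. The single value 5 is recycled, that is, reused for the comparison with every element of age.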

age	age < 5
12	FALSE
4	TRUE
3	TRUE
8	FALSE
4	TRUE
2	TRUE
1	TRUE
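What happens in the comparison above can be made explicit. In this minimal sketch, the single value 5 is written out as a vector of the same length as age (the name fives is only illustrative), and both comparisons give the same result.

```r
age <- c(12, 4, 3, 8, 4, 2, 1)

# Write out the reused value as a full-length vector
fives <- rep(5, length(age))

# Both comparisons return the same logical vector
identical(age < 5, age < fives)
```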

Using logical vectors to select values

When you put a logical vector within square brackets [ ] after an object, all elements of that object with a TRUE in the logical vector are selected:

age <- c(12, 4, 3, 8)
x <- age > 5
x

[1]  TRUE FALSE FALSE  TRUE

age[x]

[1] 12  8

Using logical vectors to select values

age <- c(12, 4, 3, 8)
x <- age > 5
age[x]

age	x <- age > 5	Select?	Result
12	TRUE	select	12
4	FALSE	drop
3	FALSE	drop
8	TRUE	select	8

Task

Create a new vector friends <- c(4, 5, 6, 3, 7, 2, 3).
Show all values of that vector >= 4.

Task - solution

Create a new vector friends <- c(4, 5, 6, 3, 7, 2, 3).
Show all values of that vector >= 4.

friends <- c(4, 5, 6, 3, 7, 2, 3)
friends[friends >= 4]

[1] 4 5 6 7

which()

The which() function returns the indices of the elements that are TRUE.
It takes a logical vector as an argument.

x <- c(TRUE, FALSE, FALSE, TRUE)
which(x)

[1] 1 4

which() ignores missing values (NA):

x <- c(TRUE, FALSE, NA, FALSE, TRUE, NA)
which(x)

[1] 1 5

age <- c(12, 4, 3, 8)
x <- age < 5
x

[1] FALSE  TRUE  TRUE FALSE

which(x)

[1] 2 3

age <- c(12, 4, 3, 8)
x <- age < 5
x
which(x)
age[which(x)]

Index	age	x <- age < 5	which(x)	age[which(x)]
1	12	FALSE
2	4	TRUE	2	4
3	3	TRUE	3	3
4	8	FALSE

Why use which?

age <- c(NA, 12, 4, 3, NA, 8, 7, 4, 3, 6, 4, 3)
x <- age < 6
x

 [1]    NA FALSE  TRUE  TRUE    NA FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

age[x]

[1] NA  4  3 NA  4  3  4  3

mean(age[x])

[1] NA

mean(age[which(x)])

[1] 3.5
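which() is not the only way to deal with the missing values here. As a sketch of an alternative, the na.rm argument of mean() skips the NA values that the logical selection leaves in place; both options give the same mean for this example.

```r
age <- c(NA, 12, 4, 3, NA, 8, 7, 4, 3, 6, 4, 3)
x <- age < 6

# Option 1: which() keeps only the positions that are TRUE
mean(age[which(x)])

# Option 2: keep the NAs in the selection and let mean() skip them
mean(age[x], na.rm = TRUE)
```

The which() approach has the advantage that it also works with functions that have no na.rm argument.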

Task

Create a vector x <- c(1, 4, 5, 3, 4, 5) and identify:
1. Which elements are greater than or equal to three?
2. Create a new vector from x containing all elements that are not four. Note: Use the which() function for this task.

Task - solution

Create a vector x <- c(1, 4, 5, 3, 4, 5) and identify:
1. Which elements are greater than or equal to three?
2. Create a new vector from x containing all elements that are not four. Note: Use the which() function for this task.

x <- c(1, 4, 5, 3, 4, 5)
which(x >= 3)

[1] 2 3 4 5 6

y <- x[which(x != 4)]
y

[1] 1 5 3 5

Selecting cases with logical vectors

Logical vectors can also be applied to data frames for selecting cases.

Let us take an example data frame:

study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

Select with bracket subsetting or the which() function:

study_no_sen <- study[study[["sen"]] == 0, ]
study_no_sen

	sen	gender	age	IQ
1	0	M	12	90
3	0	F	11	90
5	0	F	11	99

# Or using the which() function
filter <- which(study[["sen"]] == 0)
study_no_sen <- study[filter, ]
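The two approaches give the same result for the study data, but they can differ when the filter variable contains missing values. The following sketch uses a small hypothetical data frame (study_na, not part of the original example) to show the difference.

```r
# Hypothetical data with one missing sen value
study_na <- data.frame(
  sen = c(0, NA, 1, 0),
  IQ  = c(90, 85, 87, 99)
)

# Logical indexing keeps a row for the NA case, filled with NA values
by_logical <- study_na[study_na[["sen"]] == 0, ]
nrow(by_logical)

# which() returns only the positions that are TRUE, so the NA case is dropped
by_which <- study_na[which(study_na[["sen"]] == 0), ]
nrow(by_which)
```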

Task

Calculate the mean of IQ for students with and without sen.

Task - solution

Calculate the mean of IQ for students with and without sen.

filter <- which(study[["sen"]] == 0)
mean(study[["IQ"]][filter])

[1] 93

filter <- which(study[["sen"]] == 1)
mean(study[["IQ"]][filter])

[1] 87

Logical Operations

Logical operations are applied to logical values.

Operator	Operation	Example	Result
`!`	NOT	`!x`	TRUE when x is FALSE, and FALSE when x is TRUE
`&`	AND	`x & y`	TRUE when both x and y are TRUE, else FALSE
`|`	OR	`x | y`	TRUE when x or y (or both) is TRUE, else FALSE

Note: To get the | sign:
On a German Mac keyboard, press option + 7.
On a German Windows keyboard, press AltGr + <.

Example

x <- TRUE
y <- FALSE

!x

[1] FALSE

!y

[1] TRUE

x & y

[1] FALSE

x | y

[1] TRUE

Logical operators with vectors

When applied to vectors, logical operations result in a new vector.
Operations are applied to each element one by one.

x <- c(TRUE, FALSE, TRUE,  FALSE)
y <- c(TRUE, FALSE, FALSE, TRUE)

!x

[1] FALSE  TRUE FALSE  TRUE

x & y

[1]  TRUE FALSE FALSE FALSE

x | y

[1]  TRUE FALSE  TRUE  TRUE

Task

Create two vectors:

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)  
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

Determine for each element whether glasses and hyperintelligent are TRUE at the same time.

Task - solutions

Create two vectors:

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)  
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

Determine for each element whether glasses and hyperintelligent are TRUE at the same time.

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
glasses & hyperintelligent

[1]  TRUE FALSE FALSE  TRUE FALSE

glasses	hyperintelligent	glasses & hyperintelligent
TRUE	TRUE	TRUE
TRUE	FALSE	FALSE
FALSE	FALSE	FALSE
TRUE	TRUE	TRUE
FALSE	FALSE	FALSE

`sum()` and `mean()` with logical vectors:

When a logical vector is passed to a numeric function (e.g. mean() or sum()), TRUE is counted as 1 and FALSE as 0:

sum() then gives the number of elements that are TRUE.
mean() gives the proportion of elements that are TRUE.

# e.g.:
sum(c(TRUE, FALSE, TRUE))

[1] 2

mean(c(TRUE, FALSE, TRUE, FALSE))

[1] 0.5
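This becomes useful in combination with relational operators, because a comparison returns a logical vector. A short sketch, reusing the age values from the earlier example:

```r
age <- c(12, 4, 3, 8, 4, 2, 1)

# Number of elements below 5
sum(age < 5)

# Proportion of elements below 5
mean(age < 5)
```

Here five of the seven values are below 5, so sum() returns 5 and mean() returns 5/7.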

Task

Take the data from the last example and calculate the sum and proportion of cases that wear glasses and are hyperintelligent.

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

Task - solutions

Take the data from the last example and calculate the sum and proportion of cases that wear glasses and are hyperintelligent.

sum(glasses & hyperintelligent)

[1] 2

mean(glasses & hyperintelligent)

[1] 0.4

Combining logical and relational operators

age <- c(12, 4, 3, 8, 4, 2, 1, 7, 4)
gender <- c(0, 1, 0, 1, 0, 0, 0, 0, 1)
age > 4

[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

gender == 0

[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE

age > 4 & gender == 0

[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
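The same pattern covers range conditions such as "between 6 and 10": two comparisons on the same vector are joined with &. The comparisons are evaluated before &, so no parentheses are needed. In this sketch, the name in_range is only illustrative.

```r
age <- c(12, 4, 3, 8, 4, 2, 1, 7, 4)

# TRUE where age lies between 6 and 10 (inclusive)
in_range <- age >= 6 & age <= 10
in_range

# Positions and values of the matching elements
which(in_range)
age[in_range]
```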

Task

Create a vector
income <- c(5000, 4000, 3000, 2000, 1000) and a vector
happiness <- c(20, 35, 30, 10, 50).
Use relational and logical operations to determine for each element whether the income is larger than 2500 and at the same time happiness is above 25.
Calculate the proportion.

Task - solution

Use relational and logical operations to determine for each element whether the income is larger than 2500 and at the same time happiness is above 25.
Calculate the proportion.

income <- c(5000, 4000, 3000, 2000, 1000)
happiness <- c(20, 35, 30, 10, 50)
income > 2500 & happiness > 25

income	happiness	income > 2500	happiness > 25	income > 2500 & happiness > 25
5000	20	TRUE	FALSE	FALSE
4000	35	TRUE	TRUE	TRUE
3000	30	TRUE	TRUE	TRUE
2000	10	FALSE	FALSE	FALSE
1000	50	FALSE	TRUE	FALSE

… and the proportion

mean(income > 2500 & happiness > 25)

[1] 0.4

Subsetting data frames with logical and relational operators

study

sen	gender	age	IQ
0	M	12	90
1	M	13	85
0	F	11	90
1	M	10	87
0	F	11	99
1	F	14	89

filter <- study[["sen"]] == 1 & study[["gender"]] == "M"
study[filter, ]

	sen	gender	age	IQ
2	1	M	13	85
4	1	M	10	87
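Longer conditions are easier to read when they are built up in named steps. The sketch below selects male students aged 10 to 12; the age range and the helper names is_male, age_10_12 and selected are chosen here for illustration only.

```r
study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

# Build each condition separately, then combine them with &
is_male   <- study[["gender"]] == "M"
age_10_12 <- study[["age"]] >= 10 & study[["age"]] <= 12

# Rows 1 and 4 satisfy both conditions
selected <- study[is_male & age_10_12, ]
selected
```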

Task

Use the ChickWeight data frame for the following task.
The data set is already included in R.
1. Look into the data set with ?ChickWeight.
2. Get all variable names of the data frame with the names() function (names(ChickWeight)).
3. Select cases from ChickWeight with Diet == 1 and Time < 16.
4. For these cases, calculate the correlation between weight and Time. Note: Use the cor() function (e.g., cor(x, y)).
5. Repeat steps 3 and 4 for Diet == 4.
6. What can you see?

filter <- ChickWeight[["Diet"]] ==  1 & ChickWeight[["Time"]] < 16
diet1 <- ChickWeight[filter,]
cor(diet1[["weight"]], diet1[["Time"]])

[1] 0.8109772

filter <- ChickWeight[["Diet"]] ==  4 & ChickWeight[["Time"]] < 16
diet4 <- ChickWeight[filter,]
cor(diet4[["weight"]], diet4[["Time"]])

[1] 0.9720822

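The correlation is larger for Diet 4. This suggests that the chickens' weight follows time more closely under Diet 4. Note that a correlation describes how consistent the linear relationship is, not how large the weight gain is.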

The `subset()` function

R comes with a function to make subsetting a bit more straightforward.

subset() has the main arguments:

x : A data.frame
subset : A logical expression for filtering rows
select : An expression indicating the columns to select from the data frame

and returns a data.frame.

subset(study, gender == "F" & IQ > 89, c(sen, gender, IQ))

	sen	gender	IQ
3	0	F	90
5	0	F	99

Variable names must be provided without quotes and without the name of the data.frame.
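For comparison, the same selection can be written with bracket subsetting. The following sketch checks that both versions return the same data frame for the study data; with_subset and with_brackets are illustrative names.

```r
study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

# subset(): unquoted variable names, no data frame prefix
with_subset <- subset(study, gender == "F" & IQ > 89, c(sen, gender, IQ))

# Brackets: quoted column names and explicit references to study
with_brackets <- study[
  study[["gender"]] == "F" & study[["IQ"]] > 89,
  c("sen", "gender", "IQ")
]

# Both approaches give the same rows and columns
all.equal(with_subset, with_brackets)
```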

Task

Take the mtcars dataset and filter cases (here: car models) with 6 cylinders (variable cyl) and manual transmission (value 1 in variable am).
Select the variables mpg, am, gear, cyl.
Use the subset function.

Task - solutions

Take the mtcars dataset and filter cases (here: car models) with 6 cylinders (variable cyl) and manual transmission (value 1 in variable am).
Select the variables mpg, am, gear, cyl.
Use the subset function.

subset(mtcars, cyl == 6 & am == 1, c(mpg, am, gear, cyl))

	mpg	am	gear	cyl
Mazda RX4	21.0	1	4	6
Mazda RX4 Wag	21.0	1	4	6
Ferrari Dino	19.7	1	5	6

So many ways of subsetting … an overview

Subset a data frame (and get a new data frame)

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, 
       c("mpg", "am", "gear", "cyl")]

mtcars[mtcars$cyl == 6 & mtcars$am == 1, c("mpg", "am", "gear", "cyl")]

subset(mtcars, cyl == 6 & am == 1, c(mpg, am, gear, cyl))

with(mtcars, 
  mtcars[cyl == 6 & am == 1, c("mpg", "am", "gear", "cyl")]
)

So many ways of subsetting … an overview

Extract a variable from a data frame (and get a numeric or character vector)

mtcars[["mpg"]][mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1]

mtcars$mpg[mtcars$cyl == 6 & mtcars$am == 1]

subset(mtcars, cyl == 6 & am == 1, mpg, drop = TRUE)

with(mtcars, mpg[cyl == 6 & am == 1])

Odd behaviour:

For base R data frames this creates a vector:

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, "mpg"]

[1] 21.0 21.0 19.7

One might expect a data frame with one variable, but the result is automatically reduced to a vector.
Add drop = FALSE to keep the data frame:

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, "mpg", drop = FALSE]

	mpg
Mazda RX4	21.0
Mazda RX4 Wag	21.0
Ferrari Dino	19.7

Some modern implementations of data frames (like tibbles) changed this behavior.
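The difference between the two base R results can be checked directly. This is a minimal sketch; the names rows, as_vector and as_frame are only used for illustration.

```r
rows <- mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1

as_vector <- mtcars[rows, "mpg"]                # reduced to a vector
as_frame  <- mtcars[rows, "mpg", drop = FALSE]  # stays a data frame

is.numeric(as_vector)    # TRUE
is.data.frame(as_frame)  # TRUE
dim(as_frame)            # 3 rows, 1 column
```

Setting drop = FALSE explicitly is a safe habit in code that expects a data frame regardless of how many columns are selected.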
