University of Münster
2025-10-30
Selecting elements of a data structure.
By providing a number within square brackets, the respective element is selected from a vector:
When you provide a vector of numbers, multiple elements are selected:
You can even change the order or repeat elements:
With negative numbers, elements are dropped:
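The four cases above can be sketched with the names vector that is also used in the task on this page. This is only an illustrative sketch; the index values are chosen freely.

```r
names <- c("Sheldon", "Leonard", "Penny", "Amy")

names[1]           # one element:        "Sheldon"
names[c(1, 3)]     # several elements:   "Sheldon" "Penny"
names[c(4, 1, 1)]  # reorder and repeat: "Amy" "Sheldon" "Sheldon"
names[-2]          # drop element 2:     "Sheldon" "Penny" "Amy"
```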
Take the vector
names <- c("Sheldon", "Leonard", "Penny", "Amy")
and reorder it to get the following result:
[1] "Sheldon" "Amy" "Sheldon" "Amy" "Leonard" "Penny"
Firstly, we create an example data frame:
Square brackets select a column of a data frame either by its number or by its name:
The subsetted object is a data frame with one column.
This is different from extracting a variable with $ or [[,
which returns a vector (!).
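A minimal sketch of the difference. The original study data frame is not shown here, so the one below, with its variables id, gender and age and all of its values, is a hypothetical stand-in.

```r
# Hypothetical example data (an assumption, not the original study data)
study <- data.frame(
  id     = 1:5,
  gender = c("f", "m", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38)
)

by_name   <- study["age"]    # data frame with one column
by_number <- study[3]        # the same column, addressed by its position
as_vector <- study[["age"]]  # numeric vector
with_dollar <- study$age     # the same vector, extracted with $

class(by_name)    # "data.frame"
class(as_vector)  # "numeric"
```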
While this works:
this throws an error:
Error in median.default(study["age"]) : need numeric data
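The two calls can be sketched as follows. The age values are hypothetical, and the working call is assumed to use [[ (it could equally use $). tryCatch() is only used so that the sketch runs through without stopping at the error.

```r
study <- data.frame(age = c(23, 31, 27, 45, 38))  # hypothetical values

# Works: [[ returns a numeric vector
median(study[["age"]])

# Fails: [ returns a data frame, which median() does not accept
msg <- tryCatch(
  median(study["age"]),
  error = function(e) conditionMessage(e)
)
msg  # the captured error message
```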
Providing a vector will select multiple columns:
The extraction of a vector and the selection of elements can be combined:
Or within one step:
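A short sketch of these three steps, again assuming a hypothetical study data frame (names and values are illustrative only):

```r
# Hypothetical example data (an assumption, not the original study data)
study <- data.frame(
  id     = 1:5,
  gender = c("f", "m", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38)
)

# A character vector selects several columns; the result is a data frame
two_cols <- study[c("gender", "age")]

# Extract a vector first, then select elements from it
ages <- study[["age"]]
ages[c(1, 3)]

# The same within one step
study[["age"]][c(1, 3)]
```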
Specific cases are selected within square brackets: object_name[rows, columns].
You could also use numbers to address the columns:
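The object_name[rows, columns] pattern might look like this. The study data frame below is a hypothetical stand-in with freely chosen values.

```r
# Hypothetical example data (an assumption, not the original study data)
study <- data.frame(
  id     = 1:5,
  gender = c("f", "m", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38)
)

by_name   <- study[c(2, 4), c("gender", "age")]  # rows 2 and 4, columns by name
by_number <- study[c(2, 4), 2:3]                 # the same columns by position
third_row <- study[3, ]                          # empty column slot: all columns
```

Leaving the part before or after the comma empty selects all rows or all columns, respectively.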
Please create a new data frame (study2) comprising the gender and age variables for the cases 1, 3, and 5 of the study data frame.
Subsetting becomes most powerful when it is combined with conditional selections.
For example:
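The original example is not shown here. A conditional selection could look like the sketch below, which keeps all cases older than 30 from a hypothetical study data frame (the data and the threshold are assumptions).

```r
# Hypothetical example data (an assumption, not the original study data)
study <- data.frame(
  id     = 1:5,
  gender = c("f", "m", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38)
)

# Keep only the rows for which the condition is TRUE
older <- study[study[["age"]] > 30, ]
older
```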
To apply such selections, we have to know about relational and logical operators.
Relational operators compare two values and return a logical value (TRUE or FALSE).
| Operator | Relation | Example |
|---|---|---|
| == | is equal to | x == y |
| != | is not equal to | x != y |
| > | is greater than | x > y |
| >= | is greater than or equal to | x >= y |
| < | is less than | x < y |
| <= | is less than or equal to | x <= y |
For non-numerical objects, usually only == and != are meaningful (for character strings, < and > compare alphabetical order):
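When a vector is compared with a single value, the comparison is applied to every element of the vector.
This behavior is called recycling and is implemented in many (but not all!) R functions.
Recycling: The shorter operand is reused for each element of the longer vector, so the operation is applied element by element and a vector is returned.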
| age | age < 5 |
|---|---|
| 12 | FALSE |
| 4 | TRUE |
| 3 | TRUE |
| 8 | FALSE |
| 4 | TRUE |
| 2 | TRUE |
| 1 | TRUE |
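The table above corresponds to the following sketch. The character comparison at the end is an added illustration with freely chosen values.

```r
age <- c(12, 4, 3, 8, 4, 2, 1)

# The single value 5 is recycled and compared with every element of age
age < 5
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# == also works element by element on character vectors
friends <- c("Sheldon", "Penny", "Amy")
friends == "Penny"
# [1] FALSE  TRUE FALSE
```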
When you put a logical vector within square brackets [ ] after an object, all elements of that object with a TRUE in the logical vector are selected:
| age | x <- age > 5 | Select? | Result |
|---|---|---|---|
| 12 | TRUE | select | 12 |
| 4 | FALSE | drop | |
| 3 | FALSE | drop | |
| 8 | TRUE | select | 8 |
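In code, the selection shown in the table might be written like this (a minimal sketch using the values from the table):

```r
age <- c(12, 4, 3, 8)

x <- age > 5   # logical vector: TRUE FALSE FALSE TRUE
age[x]         # elements with TRUE are kept: 12 8

# The comparison can also be written directly inside the brackets
age[age > 5]
```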
Create a new vector friends <- c(4, 5, 6, 3, 7, 2, 3).
Show all values of that vector >= 4.
The which() function returns the indices of the elements that are TRUE.
It takes a logical vector as an argument.
which() can handle missing values, because NA elements are ignored instead of being returned as NA:
| Index | age | x <- age < 5 | which(x) | age[which(x)] |
|---|---|---|---|---|
| 1 | 12 | FALSE | | |
| 2 | 4 | TRUE | 2 | 4 |
| 3 | 3 | TRUE | 3 | 3 |
| 4 | 8 | FALSE | | |
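A small sketch of which(), first with the values from the table and then with a missing value. The vector age_na is a hypothetical addition to show how NA is treated.

```r
age <- c(12, 4, 3, 8)
x <- age < 5

which(x)       # indices of the TRUE elements: 2 3
age[which(x)]  # the selected values: 4 3

# With a missing value, plain logical subsetting keeps an NA ...
age_na <- c(12, NA, 3, 8)
age_na[age_na < 5]         # NA 3

# ... while which() ignores it
age_na[which(age_na < 5)]  # 3
```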
Create a vector x <- c(1, 4, 5, 3, 4, 5) and identify:
1. Which elements are greater than or equal to three?
2. Create a new vector from x containing all elements that are not four. Note: Use the which() function for this task.
Logical vectors can also be applied to data frames for selecting cases.
Let us take an example data frame:
Select with bracket subsetting or the which() function:
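Both variants can be sketched as follows. The example data frame is not shown here, so the pupils data frame below is hypothetical; only the variable names IQ and sen are taken from the task on this page, and coding sen as a logical variable is an assumption.

```r
# Hypothetical example data (names of the object and all values are assumptions)
pupils <- data.frame(
  name = c("A", "B", "C", "D"),
  IQ   = c(95, 110, 88, 102),
  sen  = c(TRUE, FALSE, TRUE, FALSE)
)

# Bracket subsetting with a logical vector in the row position
high_iq <- pupils[pupils[["IQ"]] > 100, ]

# The same selection with which(), which passes row indices instead
high_iq_which <- pupils[which(pupils[["IQ"]] > 100), ]
```

The two results only differ when the condition contains missing values: which() skips them, whereas plain logical subsetting returns rows filled with NA.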
Calculate the mean of IQ for students with and without sen.
Logical operations are applied to logical values.
| Operator | Operation | Example | Results |
|---|---|---|---|
| ! | NOT | !x | TRUE when x is FALSE and FALSE when x is TRUE |
| & | AND | x & y | TRUE when x and y are both TRUE, else FALSE |
| \| | OR | x \| y | TRUE when x or y (or both) is TRUE, else FALSE |
Note: To get the | sign:
On a German Mac keyboard press: option + 7
On a German Windows keyboard press: AltGr + <
When applied to vectors, logical operations result in a new vector.
Operations are applied to each element one by one.
In the following example, the result is TRUE only where glasses and hyperintelligent are TRUE at the same time.
| glasses | hyperintelligent | glasses & hyperintelligent |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE |
| FALSE | FALSE | FALSE |
| TRUE | TRUE | TRUE |
| FALSE | FALSE | FALSE |
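The table can be reproduced with the sketch below. The | and ! lines are added for comparison and are not part of the table.

```r
glasses          <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

glasses & hyperintelligent  # TRUE FALSE FALSE TRUE FALSE
glasses | hyperintelligent  # TRUE TRUE FALSE TRUE FALSE
!glasses                    # FALSE FALSE TRUE FALSE TRUE
```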
sum() and mean() with logical vectors:
When a logical vector is passed to a numeric function (e.g. mean() or sum()), TRUE is counted as 1 and FALSE as 0:
sum() then gives the number of elements that are TRUE.
mean() gives the proportion of elements that are TRUE.
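A brief sketch, reusing the age values from the relational operator example:

```r
age <- c(12, 4, 3, 8, 4, 2, 1)

n_young    <- sum(age < 5)   # number of elements below 5: 5
prop_young <- mean(age < 5)  # proportion of elements below 5: 5 out of 7
```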
Create a vector income <- c(5000, 4000, 3000, 2000, 1000) and a vector happiness <- c(20, 35, 30, 10, 50).
Determine the number of cases in which income is larger than 2500 and at the same time happiness is above 25.
| income | happiness | income > 2500 | happiness > 25 | income > 2500 & happiness > 25 |
|---|---|---|---|---|
| 5000 | 20 | TRUE | FALSE | FALSE |
| 4000 | 35 | TRUE | TRUE | TRUE |
| 3000 | 30 | TRUE | TRUE | TRUE |
| 2000 | 10 | FALSE | FALSE | FALSE |
| 1000 | 50 | FALSE | TRUE | FALSE |
… and the proportion
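One possible solution, sketched with sum() and mean() on the combined logical vector. The object names both, n_cases and proportion are chosen here for illustration.

```r
income    <- c(5000, 4000, 3000, 2000, 1000)
happiness <- c(20, 35, 30, 10, 50)

# TRUE where both conditions hold at the same time
both <- income > 2500 & happiness > 25

n_cases    <- sum(both)   # number of such cases: 2
proportion <- mean(both)  # proportion of such cases: 0.4
```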
Use the ChickWeight data frame for the following task.
1. Get information on the data frame with ?ChickWeight.
2. Show the variable names with the names() function (names(ChickWeight)).
3. Select the cases with Diet == 1 and Time < 16.
4. Calculate the correlation between weight and Time. Note: Use the cor() function (e.g., cor(x, y)).
5. Repeat this for Diet == 4.
filter <- ChickWeight[["Diet"]] == 1 & ChickWeight[["Time"]] < 16
diet1 <- ChickWeight[filter,]
cor(diet1[["weight"]], diet1[["Time"]])
[1] 0.8109772
filter <- ChickWeight[["Diet"]] == 4 & ChickWeight[["Time"]] < 16
diet4 <- ChickWeight[filter,]
cor(diet4[["weight"]], diet4[["Time"]])
[1] 0.9720822
The correlation is larger for Diet 4. This suggests that Diet 4 has a stronger impact on the chickens' weight.
The subset() function
R comes with a function that makes subsetting a bit more straightforward.
subset() has the main arguments:
x: a data.frame
subset: a logical vector for filtering rows
select: an expression indicating the columns to select from the data frame
It returns a data.frame.
Variable names must be provided without quotes and without the name of the data.frame.
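A minimal sketch of subset() next to the equivalent bracket notation, using the built-in mtcars data. The condition mpg > 30 and the selected columns are chosen freely for illustration.

```r
# With subset(): variable names without quotes and without mtcars$
efficient <- subset(mtcars, mpg > 30, select = c(mpg, hp))

# The same selection with bracket subsetting
efficient_brackets <- mtcars[mtcars[["mpg"]] > 30, c("mpg", "hp")]
```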
Take the mtcars dataset and filter cases (here: car models) with 6 cylinders (variable cyl) and manual transmission (value 1 in variable am).
Select the variables mpg, am, gear, cyl.
Use the subset() function.
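One possible solution, sketched with subset(). According to ?mtcars, am is coded 0 = automatic and 1 = manual, so am == 1 picks the cars with manual transmission. The object name six_cyl is chosen here for illustration.

```r
six_cyl <- subset(
  mtcars,
  cyl == 6 & am == 1,             # rows: 6 cylinders and am equal to 1
  select = c(mpg, am, gear, cyl)  # columns to keep
)
six_cyl
```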
Subset a data frame (and get a new data frame)
Extract a variable from a data frame (and get a numeric or character vector)
For base R data frames, selecting a single column with [rows, column] creates a vector:
This should have resulted in a data frame with one variable but is automatically reduced to a vector.
Add drop = FALSE to keep the result as a data frame:
| | mpg |
|---|---|
| Mazda RX4 | 21.0 |
| Mazda RX4 Wag | 21.0 |
| Ferrari Dino | 19.7 |
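The behavior can be sketched as follows. The row filter is an assumption chosen so that the result matches the three cars in the table above; the original call is not shown here.

```r
# Assumed filter: 6 cylinders and am equal to 1
rows <- mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1

as_vector <- mtcars[rows, "mpg"]                # reduced to a numeric vector
as_frame  <- mtcars[rows, "mpg", drop = FALSE]  # stays a one-column data frame

class(as_vector)  # "numeric"
class(as_frame)   # "data.frame"
```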
Some modern implementations of data frames (like tibbles) changed this behavior.
Jürgen Wilbert - Introduction to R