Homework generously provided by the Data Science in Business Analytics class.
This homework is optional and is not graded.
In this optional tutorial you will get an overview of the basic programming concepts in R and main data types. Just enough to get you up and running essential R code. However, for true “beginners”, we highly recommend going through Advanced R - Chapter ‘Foundations’ from which the content of this assignment is mainly (mostly) inspired by.
To follow the tutorial, you can start R Studio and execute statements from the code chuncks in the R Console.
R uses different libraries or packages to load specific functions (read excel files, talk to Twitter, generate plots).
# install package from RStudio console
# ** Note the quotation marks!
# install.packages("name_of_package")
# load package in enviromnet
library(blogdown)
In R, we assign values (numbers, characters, data frames) to objects (vectors, matrices, variables).
To do so, we use the <-
operator:
# name_of_object <- value
an_object <- 2
another_object <- "some string"
# inspect object's value
an_object
print(another_object)
R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types).
Homogeneous | Heterogeneous | |
---|---|---|
1d | Atomic vector | List |
2d | Matrix | Data frame |
nd | Array |
The basic data structure in R is the vector. Vectors can be of two kinds: atomic vectors and lists. They have three common properties:
typeof()
, what it is.length()
, how many elements it contains.attributes()
, additional arbitrary metadata.
They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.There are four common types of atomic vectors:
Atomic vectors are usually created with c()
, short for combine:
dbl_var <- c(1, 2.5, 4.5)
# with the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")
int_var <- c(1L, 6L, 10L)
typeof(int_var)
is.integer(int_var)
List objects can hold elements of any type, including lists. You construct lists by using list()
instead of c()
:
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). They can be accessed individually with attr()
or all at once (as a list) with attributes().
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
# inspect the attribute of y
attr(y, "my_attribute")
Adding a dim
attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions.
Matrices and arrays are created with matrix()
and array()
, or by using the assignment form of dim()
:
# two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# one vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
# you can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c
length()
and names()
have high-dimensional generalisations:
length()
generalises to nrow()
and ncol()
for matrices, and dim()
for arrays.
names()
generalises to rownames()
and colnames()
for matrices, and dimnames()
, a list of character vectors, for arrays.
c()
generalises to cbind()
and rbind()
for matrices, and to abind()
(provided by the abind package) for arrays. You can transpose a matrix with t()
; the generalised equivalent for arrays is aperm()
.
You can test if an object is a matrix or array using is.matrix()
and is.array()
, or by looking at the length of the dim()
. as.matrix()
and as.array()
make it easy to turn an existing vector into a matrix or array.
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names()
, colnames()
, and rownames()
, although names()
and colnames()
are the same thing. The length()
of a data frame is the length of the underlying list and so is the same as ncol()
; nrow()
gives the number of rows.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
You can combine data frames using cbind() and rbind():
cbind(df, data.frame(z = 3:1))
Let’s explore the different types of subsetting with a simple vector, x
.
x <- c(2, 4, 3, 5)
Positive integers return elements at the specified positions:
x[c(3, 1)]
Duplicated indices yield duplicated values
x[c(1, 1)]
Real numbers are silently truncated to integers
x[c(2, 9)]
Negative integers omit elements at the specified positions:
x[-c(3, 1)]
You can’t mix positive and negative integers in a single subset: x[c(-1, 2)]
Logical vectors select elements where the corresponding logical value is TRUE. This is probably the most useful type of subsetting because you write the expression that creates the logical vector:
x[c(TRUE, TRUE, FALSE, FALSE)]
x[x > 3]
A missing value in the index always yields a missing value in the output:
x[c(TRUE, TRUE, NA, FALSE)]
Nothing returns the original vector. This is not useful for vectors but is very useful for matrices, data frames, and arrays. It can also be useful in conjunction with assignment.
x[]
Zero returns a zero-length vector. This is not something you usually do on purpose, but it can be helpful for generating test data.
x[0]
If the vector is named, you can also use character vectors to return elements with matching names:
(y <- setNames(x, letters[1:4]))
y[c("d", "c", "a")]
Like integer indices, you can repeat indices:
y[c("a", "a", "a")]
Subsetting with names returns always the match exactly, if any.
z <- c(abc = 1, def = 2)
z[c("a", "d")]
Subsetting a list works in the same way as subsetting an atomic vector. Using [
will always return a list;
[[
and $
, as described below, let you pull out the components of the list.
You can subset higher-dimensional structures in three ways:
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
# multiple vectors
a[1:2, ]
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
# selecting by value of certain vector
df[df$x == 2, ]
The following checklist makes it easier to import data correctly into R:
library(readr)
# import data from .txt file
df <- read_table(
"https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
col_names = FALSE)
df
# import data from .csv file
df <- read.table(
"https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
header = TRUE,
sep = ",")
Standard format for defining a function in R:
my_function_name <- function(arg1 = "default", arg2 = "default") {
# 'cat' is used for concatenating strings
merged_string <- cat(arg1, arg2)
# if not specified, last evaluated object is returned
return(merged_string)
}
# call a function elsewhere from code
arg1 <- "Hello"
arg2 <- "World!"
a_greeting <- my_function_name(arg1, arg2)
print(a_greeting)
CRAN - the curated repository of R packages provides millions of functions that you could use to tackle data. You simply need to install a package, and then call the function from your R code function_name(somearguments)
. For example, the package stats
helps you in fitting linear model through the function lm()
:
library(stats)
x <- rnorm(500)
y <- x*4 + rnorm(500)
lm.fit <- lm(y~x, data = data.frame(x, y))
print(lm.fit)
How many functions have been used in the example? What does rnorm
mean? You can get informed about any R function by using its documentation ?function_name
or ?packageName::function_name
.
Try to figure out the answers without executing the code. Check your answers in R Studio.
a) Given the vector: x <- c("ww", "ee", "ff", "uu", "kk")
, what will be the output for x[c(2,3)]
?
b) Let a <- c(2, 4, 6, 8)
and b <- c(TRUE, FALSE, TRUE, FALSE)
, what will be the output for the R expression max(a[b])
?
c) Is it possible to apply the function my_function_name
using x
and a
as arguments?
Consider a vector x
such that:
x=c(1, 3, 4, 7, 11, 18, 29)
Write an R statement that will return a list X2
with components of value:
x * 2
, x / 2
, sqrt(x)
and names 'x*2'
, 'x/2'
, 'sqrt(x)'
.
Read the file Table0.txt into an object DS.
a) What is the data type for the object DS?
b) Change the names of the columns to Name, Age, Height, Weight and Sex.
c) Change the row names so that they are the same as Name, and remove the variable Name.
d) Get the number of rows and columns of the data.
a) Convert DS from the previous exercise to a data frame DF.
b) Add an additional column “zeros” in DF with all elements 0.
c) Remove the Weight comuln from DF.