*Homework generously provided by the Data Science in Business Analytics class.*

This homework is optional and is not graded.

In this optional tutorial you will get an overview of the basic programming concepts in R and its main data types: just enough to get you up and running with essential R code. For true “beginners”, however, we highly recommend going through the ‘Foundations’ chapter of Advanced R, which inspired most of the content of this assignment.

- **R** is a programming language for statistical analysis.
- **RStudio** is the integrated development environment (IDE) for R in which we write and execute R code, plot things and write reports.
- Installation guidelines and details (Mac, Windows, Linux)

To follow the tutorial, you can start RStudio and execute statements from the code chunks in the R Console.

R uses different libraries or packages to load specific functions (read excel files, talk to Twitter, generate plots).

```
# install package from RStudio console
# ** Note the quotation marks!
# install.packages("name_of_package")
# load package in environment
library(blogdown)
```
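As a minimal sketch, a package can be installed only when it is missing and then loaded. The helper name `load_or_install` is made up for this illustration, and `stats` is used because it ships with R, so nothing is downloaded here.

```r
# hypothetical helper: install a package only if it is not yet available
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" is part of every R installation, so the install branch is skipped
load_or_install("stats")
```

Checking first avoids re-downloading a package every time a script runs.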

In R, we assign values (numbers, characters, data frames) to objects (vectors, matrices, variables).
To do so, we use the `<-` operator:

```
# name_of_object <- value
an_object <- 2
another_object <- "some string"
# inspect object's value
an_object
print(another_object)
```

R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types).

| | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
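To make the table concrete, the sketch below builds one object of each kind. The object names (`vec`, `lst`, `mat`, `dfr`, `arr`) are chosen only for this illustration.

```r
vec <- c(1, 2, 3)                                 # 1d, homogeneous: atomic vector
lst <- list(1, "a", TRUE)                         # 1d, heterogeneous: list
mat <- matrix(1:6, nrow = 2)                      # 2d, homogeneous: matrix
dfr <- data.frame(id = 1:2, name = c("a", "b"))   # 2d, heterogeneous: data frame
arr <- array(1:24, dim = c(2, 3, 4))              # nd, homogeneous: array

# dim() is NULL for 1d structures and has one entry per dimension otherwise
dim(vec)
dim(mat)
dim(dfr)
dim(arr)
```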

The basic data structure in R is the vector. Vectors can be of two kinds: atomic vectors and lists. They have three common properties:

- Type, `typeof()`, what it is.
- Length, `length()`, how many elements it contains.
- Attributes, `attributes()`, additional arbitrary metadata.

They differ in the types of their elements: **all elements** of an atomic vector must be the **same type**, whereas the elements of a list can have different types.
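The three properties can be inspected on any vector. The following sketch uses two throwaway objects, `v` and `l`, which are not part of the original material.

```r
v <- c(a = 1.5, b = 2, c = 3.25)   # a named double vector
typeof(v)       # "double"
length(v)       # 3
attributes(v)   # a list holding the names "a", "b", "c"

l <- list(1L, "a", TRUE)           # a list with elements of different types
typeof(l)       # "list"
length(l)       # 3
attributes(l)   # NULL, no metadata attached
```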

There are four common types of atomic vectors:

- logical
- integer
- double (often called numeric)
- character

Atomic vectors are usually created with `c()`, short for combine:

```
dbl_var <- c(1, 2.5, 4.5)
# with the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")
# check the type of a vector
typeof(int_var)
is.integer(int_var)
```
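Because every element of an atomic vector must have the same type, `c()` converts mixed input to the most flexible type involved. The short sketch below illustrates this behaviour with a few throwaway values.

```r
typeof(c(TRUE, 1L))   # "integer": logical is converted to integer
typeof(c(1L, 2.5))    # "double": integer is converted to double
typeof(c(1, "a"))     # "character": numbers are converted to strings
c(1, "a")             # "1" "a"

# c() also flattens nested atomic vectors
length(c(1, c(2, c(3, 4))))   # 4
```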

List objects can hold elements of any type, including lists. You construct lists by using `list()` instead of `c()`:

```
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
```
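A small sketch of a list that contains another list, and of how `list()` differs from `c()` when given the same input. The name `nested` is only illustrative.

```r
nested <- list(1, list(2, "b"))
str(nested)
is.list(nested[[2]])     # TRUE: the second element is itself a list

# c() merges atomic vectors into one, list() keeps them as separate elements
length(c(1:3, 4:5))      # 5
length(list(1:3, 4:5))   # 2
```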

All objects can have arbitrary additional attributes, used to store **metadata** about the object. Attributes can be thought of as a named list (with unique names). They can be accessed individually with `attr()` or all at once (as a list) with `attributes()`.

```
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
# inspect the attribute of y
attr(y, "my_attribute")
```
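To see all attributes at once, `attributes()` can be called on the same object. The sketch also uses `structure()`, a base R function that sets attributes while creating an object; the attribute name `unit` is made up for this example.

```r
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
# all attributes, returned as a named list
attributes(y)

# structure() returns the object with the given attributes attached
z <- structure(1:5, unit = "cm")
attr(z, "unit")   # "cm"
```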

Adding a `dim` attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions.
Matrices and arrays are created with `matrix()` and `array()`, or by using the assignment form of `dim()`:

```
# two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# one vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
# you can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c
```

`length()` and `names()` have high-dimensional generalisations:

- `length()` generalises to `nrow()` and `ncol()` for matrices, and `dim()` for arrays.
- `names()` generalises to `rownames()` and `colnames()` for matrices, and `dimnames()`, a list of character vectors, for arrays.
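These generalisations can be tried on a small matrix and array. The row, column and dimension labels below are arbitrary choices for the illustration.

```r
a <- matrix(1:6, nrow = 2, ncol = 3)
nrow(a)     # 2
ncol(a)     # 3
length(a)   # 6, the total number of elements
rownames(a) <- c("r1", "r2")
colnames(a) <- c("A", "B", "C")
a["r2", "B"]   # 4

b <- array(1:12, c(2, 3, 2))
dim(b)      # 2 3 2
# dimnames() takes a list with one character vector per dimension
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b["two", "c", "B"]   # 12
```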

`c()` generalises to `cbind()` and `rbind()` for matrices, and to `abind()` (provided by the abind package) for arrays. You can transpose a matrix with `t()`; the generalised equivalent for arrays is `aperm()`.
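A short sketch of the base R functions mentioned here (`abind()` is left out because it needs an extra package). The objects `m` and `arr` are illustrative.

```r
m <- matrix(1:4, nrow = 2)
dim(cbind(m, c(9, 9)))   # 2 3: one column added
dim(rbind(m, c(9, 9)))   # 3 2: one row added
t(m)                     # rows and columns swapped

arr <- array(1:24, c(2, 3, 4))
dim(aperm(arr))               # 4 3 2: by default the dimensions are reversed
dim(aperm(arr, c(2, 1, 3)))   # 3 2 4: only the first two dimensions swapped
```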

You can test if an object is a matrix or array using `is.matrix()` and `is.array()`, or by looking at the length of the `dim()`. `as.matrix()` and `as.array()` make it easy to turn an existing vector into a matrix or array.
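The tests and conversions can be sketched as follows, using a plain integer vector `v` as the starting point.

```r
v <- 1:6
is.matrix(v)      # FALSE
length(dim(v))    # 0, because dim(v) is NULL

m <- as.matrix(v) # a matrix with a single column
dim(m)            # 6 1
is.matrix(m)      # TRUE
is.array(m)       # TRUE, a matrix is a 2d array

arr <- array(v, c(1, 2, 3))
is.matrix(arr)    # FALSE, it has three dimensions
is.array(arr)     # TRUE
length(dim(arr))  # 3
```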

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a **list of equal-length vectors**. This makes it a **2-dimensional** structure, so it shares properties of both the matrix and the list. This means that a data frame has `names()`, `colnames()`, and `rownames()`, although `names()` and `colnames()` are the same thing. The `length()` of a data frame is the length of the underlying list and so is the same as `ncol()`; `nrow()` gives the number of rows.

```
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
```

You can combine data frames using `cbind()` and `rbind()`:

```
cbind(df, data.frame(z = 3:1))
```
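The properties described above, and the row-wise counterpart of the `cbind()` call, can be checked on the same small data frame. The extra row added with `rbind()` is made-up data.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
names(df)      # "x" "y"
colnames(df)   # "x" "y", the same as names()
length(df)     # 2, the number of columns
ncol(df)       # 2
nrow(df)       # 3

# rbind() needs matching column names in both data frames
more_rows <- rbind(df, data.frame(x = 10, y = "z"))
nrow(more_rows)   # 4
```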

Let’s explore the different types of subsetting with a simple vector, `x`.

```
x <- c(2, 4, 3, 5)
```

Positive integers return elements at the specified positions:

```
x[c(3, 1)]
```

Duplicated indices yield duplicated values:

```
x[c(1, 1)]
```

Real numbers are silently truncated to integers:

```
x[c(2.1, 2.9)]
```

Negative integers omit elements at the specified positions:

```
x[-c(3, 1)]
```

You can’t mix positive and negative integers in a single subset; `x[c(-1, 2)]` raises an error.
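If you want to see the error without stopping a script, one option is to wrap the call in `tryCatch()`. This is only a sketch; the variable name `result` is illustrative and the exact wording of the message depends on your R version.

```r
x <- c(2, 4, 3, 5)
# capture the error message instead of halting execution
result <- tryCatch(
  x[c(-1, 2)],
  error = function(e) conditionMessage(e)
)
result
```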

Logical vectors select elements where the corresponding logical value is TRUE. This is probably the most useful type of subsetting because you write the expression that creates the logical vector:

```
x[c(TRUE, TRUE, FALSE, FALSE)]
x[x > 3]
```

A missing value in the index always yields a missing value in the output:

```
x[c(TRUE, TRUE, NA, FALSE)]
```

Nothing returns the original vector. This is not useful for vectors but is very useful for matrices, data frames, and arrays. It can also be useful in conjunction with assignment.

```
x[]
```
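The sketch below shows why empty subsetting matters for assignment and for matrices. The names `kept`, `replaced` and `m` are chosen for this illustration.

```r
x <- c(2, 4, 3, 5)

kept <- x
kept[] <- 0      # every element is overwritten, the length stays 4
kept             # 0 0 0 0

replaced <- x
replaced <- 0    # the whole object is replaced by a single value
replaced         # 0

m <- matrix(1:6, nrow = 2)
m[1, ]           # leaving the column index empty selects all columns: 1 3 5
m[, 2]           # leaving the row index empty selects all rows: 3 4
```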

Zero returns a zero-length vector. This is not something you usually do on purpose, but it can be helpful for generating test data.

```
x[0]
```

If the vector is named, you can also use character vectors to return elements with matching names:

```
(y <- setNames(x, letters[1:4]))
y[c("d", "c", "a")]
```

Like integer indices, you can repeat indices:

```
y[c("a", "a", "a")]
```

When subsetting with `[`, names are always matched exactly; a partial match returns `NA`:

```
z <- c(abc = 1, def = 2)
z[c("a", "d")]
```

Subsetting a list works in the same way as subsetting an atomic vector. Using `[` will always return a list; `[[` and `$`, as described below, let you pull out the components of the list.
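A minimal sketch of the difference between the three operators, using a small illustrative list `l`:

```r
l <- list(a = 1:3, b = "text")

l["a"]      # a list of length 1 that still contains the element a
l[["a"]]    # the integer vector 1 2 3 itself
l$b         # "text", shorthand for l[["b"]]

is.list(l["a"])     # TRUE
is.list(l[["a"]])   # FALSE
```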

You can subset higher-dimensional structures in three ways:

- With multiple vectors.
- With a single vector.
- With a matrix.

```
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
# multiple vectors
a[1:2, ]
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
# selecting by value of certain vector
df[df$x == 2, ]
```
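The other two ways from the list above, a single vector and a matrix of indices, can be sketched on the same matrix. The index matrix `idx` is an illustrative name.

```r
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")

# single vector: the matrix is treated as one long vector, filled column by column
a[c(1, 5, 9)]   # 1 5 9

# matrix: each row of idx gives the (row, column) position of one element
idx <- rbind(c(1, 1), c(3, 2))
a[idx]          # 1 6

# multiple vectors can also use column names
a[, c("A", "C")]
```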

The following checklist makes it easier to import data correctly into R:

- The first row is usually reserved for the header, while the first column is used to identify the sampling unit;
- Avoid names, values or fields with blank spaces, otherwise each word will be interpreted as a separate variable, resulting in errors that are related to the number of elements per line in your data set;
- Short names are preferred over longer names;
- Try to avoid using names that contain symbols such as ? $ % ^ & * ( ) - # , ; . < > / | \ [ ] { };
- Make sure that any missing values in your data set are indicated with NA.
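To illustrate the checklist, the sketch below reads a tiny made-up data set that follows it: a header row, an identifier in the first column, short names without spaces or symbols, and `NA` for the missing value. It uses the `text` argument of base R's `read.table()` so that no file is needed; the column names and values are invented.

```r
raw_text <- "id height_cm group
s1 170 a
s2 NA b
s3 165 a"

# header = TRUE uses the first row as column names;
# na.strings tells R which strings mark missing values
ds <- read.table(text = raw_text, header = TRUE, na.strings = "NA")
str(ds)
sum(is.na(ds$height_cm))   # 1 missing value
```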

```
library(readr)
# import data from .txt file
df <- read_table(
  "https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
  col_names = FALSE)
df
# import data from .csv file
df <- read.table(
  "https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
  header = TRUE,
  sep = ",")
```

Standard format for defining a function in R:

```
my_function_name <- function(arg1 = "default", arg2 = "default") {
  # 'paste' concatenates strings and returns the result
  # ('cat' would only print them and return NULL)
  merged_string <- paste(arg1, arg2)
  # if return() is omitted, the last evaluated expression is returned
  return(merged_string)
}
# call the function elsewhere in the code
arg1 <- "Hello"
arg2 <- "World!"
a_greeting <- my_function_name(arg1, arg2)
print(a_greeting)
```
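Default values and the implicit return value can be seen in a second, smaller sketch. The function `add_numbers` is hypothetical and exists only for this illustration.

```r
# hypothetical function: b has a default value, a does not
add_numbers <- function(a, b = 10) {
  a + b   # no return(): the last evaluated expression is returned
}

add_numbers(1)              # 11, b falls back to its default
add_numbers(1, b = 2)       # 3
add_numbers(b = 1, a = 5)   # 6, named arguments can come in any order
```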

CRAN, the curated repository of R packages, provides thousands of packages whose functions you can use to work with data. You simply need to install a package, load it, and then call the function from your R code as `function_name(somearguments)`. For example, the package `stats`, which ships with R and needs no installation, lets you fit a linear model through the function `lm()`:

```
library(stats)
x <- rnorm(500)
y <- x*4 + rnorm(500)
lm.fit <- lm(y~x, data = data.frame(x, y))
print(lm.fit)
```

How many functions have been used in the example? What does `rnorm` mean? You can get informed about any R function by using its documentation `?function_name` or `?packageName::function_name`.
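Besides the help pages, a function's arguments can be inspected directly from the console. The sketch below uses base R tools on `rnorm`; the help calls are left as comments because they open the documentation viewer.

```r
# open the help page (interactive use)
# ?rnorm
# ?stats::rnorm

# show the argument list with default values
args(rnorm)

# the argument names as a character vector
names(formals(rnorm))           # "n" "mean" "sd"

# the package a function comes from
environmentName(environment(rnorm))   # "stats"
```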

Try to figure out the answers without executing the code. Check your answers in RStudio.

a) Given the vector `x <- c("ww", "ee", "ff", "uu", "kk")`, what will be the output for `x[c(2,3)]`?

b) Let `a <- c(2, 4, 6, 8)` and `b <- c(TRUE, FALSE, TRUE, FALSE)`. What will be the output for the R expression `max(a[b])`?

c) Is it possible to apply the function `my_function_name` using `x` and `a` as arguments?

Consider a vector `x` such that: `x=c(1, 3, 4, 7, 11, 18, 29)`

Write an R statement that will return a list `X2` with components of value `x * 2`, `x / 2`, `sqrt(x)` and names `'x*2'`, `'x/2'`, `'sqrt(x)'`.

Read the file Table0.txt into an object DS.

a) What is the data type for the object DS?

b) Change the names of the columns to Name, Age, Height, Weight and Sex.

c) Change the row names so that they are the same as Name, and remove the variable Name.

d) Get the number of rows and columns of the data.

a) Convert DS from the previous exercise to a data frame DF.

b) Add an additional column “zeros” in DF with all elements 0.

c) Remove the Weight column from DF.