# Data Analysis and R: A (Continuously Updated) Personal Reference

## APPLY:

### SAPPLY:

`sapply(dataframe$col, FUN = sum/mean/etc)`

sapply: Applies a function to each element of a vector or list (each column, if given a whole data frame) and simplifies the results to a vector or matrix when possible.
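
A minimal sketch of `sapply`, assuming the built-in `mtcars` dataset as a stand-in for `dataframe` (the column names are just for illustration):

```r
# mtcars stands in for `dataframe` (an assumption for illustration)
col_means <- sapply(mtcars[, c("mpg", "hp", "wt")], mean)  # one mean per column
col_means                                                   # named numeric vector

# Over a plain vector, the function is applied element by element
squares <- sapply(1:3, function(x) x^2)
```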

### LAPPLY:

`lapply(dataframe$col, FUN = sum/mean/etc)`

lapply: Applies a function to each element of a vector or list (each column, if given a whole data frame) and always returns a list with the results.
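
A small sketch of `lapply`, again assuming `mtcars` as a stand-in for `dataframe`. Because `range` returns two values per column, keeping the result as a list is convenient here:

```r
# mtcars stands in for `dataframe` (an assumption for illustration)
col_ranges <- lapply(mtcars[, c("mpg", "hp")], range)  # list with one entry per column
col_ranges$mpg                                          # min and max of mpg
```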

### TAPPLY:

`tapply(dataframe$col1, dataframe$col2, FUN = sum/mean/etc)`

tapply: Splits the data in the first column into groups defined by the factor in the second column and applies the function to each group.
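
A sketch of `tapply`, assuming `mtcars` as the data frame, with `mpg` as the quantitative column and `cyl` as the grouping column:

```r
# Mean mpg within each cylinder group (cyl is treated as a factor)
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl  # one value per level of cyl, named by level
```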

## AGGREGATE

```
aggregate(cbind(col1, col2) ~ col3 + col4,
          data = dataframe, FUN = sum/mean/etc)
```

aggregate: Builds a pivot-style summary table of the quantitative data in col1 and col2, grouped by each combination of col3 and col4, applying the given function to each group.
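
A sketch of the formula interface, assuming `mtcars` as the data frame, with `mpg` and `hp` as the quantitative columns and `cyl` and `am` as the grouping columns:

```r
# Mean mpg and hp for every combination of cyl and am
agg <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
agg  # data frame with one row per (cyl, am) group
```

With `data =` supplied, the formula can use bare column names, so the `dataframe$` prefix is not needed.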

## CUT:

`cut2(dataframe$column, g=numberofgroups)`

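cut2: Cuts a continuous variable into a factor with g quantile groups. `cut2` comes from the Hmisc package; base R's `cut` takes explicit `breaks` instead.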
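
If Hmisc is not installed, a rough base-R stand-in is to cut at quantile breakpoints. This is only a sketch of the same idea, not `cut2` itself, and the labels and tie handling may differ. `mtcars$mpg` and `g = 4` are assumptions for illustration:

```r
x <- mtcars$mpg
g <- 4  # number of groups

# Breakpoints at the 0%, 25%, 50%, 75% and 100% quantiles
breaks <- quantile(x, probs = seq(0, 1, length.out = g + 1))

# include.lowest = TRUE keeps the minimum value inside the first group
groups <- cut(x, breaks = breaks, include.lowest = TRUE)
table(groups)
```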

## SAMPLE:

`sample(1:rows, size=number, replace=T/F)`

sample: Returns a vector of `size` row numbers, drawn with or without replacement. Useful for generating smaller random subsets.
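
A sketch of drawing a random subset of rows, assuming `mtcars` as the data frame and an arbitrary subset size of 10:

```r
set.seed(42)  # any fixed seed makes the draw reproducible

# 10 distinct row numbers out of all rows
rows <- sample(1:nrow(mtcars), size = 10, replace = FALSE)
subset_df <- mtcars[rows, ]
```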

## RANDOMLY SUBSETTING TRAIN AND TEST DATA:

```
set.seed(1234) # set a random seed (any fixed integer) for reproducibility
i <- rbinom(nrow(dataframe), size = 1, prob = 0.5) # flip a coin for each row
train <- dataframe[i == 1, ] # rows that came up 1 go to train
test <- dataframe[i == 0, ] # rows that came up 0 go to test
```
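
The coin-flip approach gives train and test sets of random size. As a variation (not the method above), a fixed-proportion split can be sketched with `sample`. The 70/30 ratio and the use of `mtcars` are assumptions for illustration:

```r
set.seed(123)
n <- nrow(mtcars)

# Pick 70% of the row numbers for training, without replacement
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- mtcars[train_idx, ]   # selected rows
test  <- mtcars[-train_idx, ]  # everything else
```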

## PLOT:

```
plot(dataframe$col1, dataframe$col2, pch = bullettype,
     col = color_or_factor, cex = size) # single scatter plot
plot(dataframe[, 1:4]) # plot the first 4 columns against each other
```

`col` accepts a factor, so each level of the factor is drawn in a different color. `cex` can likewise be given a numeric vector computed from the data to draw dots of different sizes.

Plotting multiple columns creates a matrix of pairwise scatter plots.

`pch` takes integer values that select the plotting symbol.

`cex` scales the size of the plotting symbols relative to the default of 1.
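
A sketch of these options together, assuming the built-in `iris` dataset as the data frame and `Species` as the descriptive factor:

```r
# Scatter plot colored by a factor: each species gets its own palette color
plot(iris$Sepal.Length, iris$Petal.Length,
     pch = 19,             # solid circle
     col = iris$Species,   # factor levels map to colors 1, 2, 3
     cex = 0.8)            # slightly smaller than the default size
species_levels <- levels(iris$Species)
legend("topleft", legend = species_levels,
       col = seq_along(species_levels), pch = 19)

# Matrix of pairwise scatter plots for the first 4 columns
plot(iris[, 1:4])
```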

## OTHER SCATTER PLOTS:

```
smoothScatter(x, y) # note the capital S
plot(hexbin(x, y)) # requires the hexbin package
qqplot(x, y)
```

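smoothScatter shades the plot with a color gradient for point density, and hexbin (from the hexbin package) bins points into hexagons with a legend mapping color to counts. qqplot plots the quantiles of x against the quantiles of y (samples from the same distribution lie close to a 45 degree line).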
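
A sketch of the 45 degree line behavior, using two simulated samples from the same normal distribution (the sample size and seed are arbitrary assumptions):

```r
set.seed(1)
x <- rnorm(500)
y <- rnorm(500)

# qqplot draws the plot and invisibly returns the sorted values it plotted
qq <- qqplot(x, y)
abline(0, 1, col = "red")  # 45 degree reference line through the origin
```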

## REGRESSION ON FACTORS:

```
lm(quantitative ~ as.factor(factor), data = dataframe)
lm(quantitative ~ relevel(as.factor(factor), ref = "reference level"),
   data = dataframe) # sets a different reference level for the lm
```

Fits a linear regression of a quantitative variable on a factor variable. The first level of the factor is the reference level. Use the second example to choose a different reference level.
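
A sketch of the effect of the reference level, assuming `mtcars` with `mpg` as the quantitative variable and `cyl` converted to a factor (the helper column `cyl_f` is introduced only for illustration):

```r
df <- transform(mtcars, cyl_f = factor(cyl))  # levels "4", "6", "8"

# Default: the first level ("4") is the reference
fit_default <- lm(mpg ~ cyl_f, data = df)

# Same model with "8" as the reference level
fit_ref8 <- lm(mpg ~ relevel(cyl_f, ref = "8"), data = df)

coef(fit_default)  # intercept is the mean mpg of the 4-cylinder group
coef(fit_ref8)     # intercept is the mean mpg of the 8-cylinder group
```

The other coefficients are differences from the reference group, so changing the reference changes what each coefficient compares against, not the fitted values.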