4.1 Data Preprocessing
-We will now discuss some methods for data manipulation to clean a dataset, combine various datasets or extract a variable from a data frame before we jump into some programming basics.
4.1.1 Extracting Data
-Data frames are the most used data structures in R as they offer more flexibility in the way they can handle data.
Lets see some methods to extract data from a data frame. We will use the example dataset called \(\mathtt{us\_stocks.csv}\).
Lets import it using \(\mathtt{read.csv}\)
Date MSFT IBM AAPL MCD PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00 NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62 NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22 NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78 NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88 NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60 NA
The function \(\mathtt{names}\) or \(\mathtt{colnames}\) are used to access the names of the columns (or variables) in the data set as shows below.
The function \(\mathtt{row.names}\) can be used to access row names (if any) from a dataset
[1] "Date" "MSFT" "IBM" "AAPL" "MCD" "PG" "GOOG"
[1] "Date" "MSFT" "IBM" "AAPL" "MCD" "PG" "GOOG"
A specific data variable can be accessed using its name or index (column number) in the data frame.
To select any column use \(\mathtt{\$}\) symbol followed by the column name or its name in square brackets as shown in the example below
[1] 33.52 34.62 34.45 34.28 34.69 34.36
[1] 33.52 34.62 34.45 34.28 34.69 34.36
# the following returns data as a data frame
msft_prices3 = data_stocks["MSFT"] #can also be used to access multiple columns
head(msft_prices3)
MSFT
1 33.52
2 34.62
3 34.45
4 34.28
5 34.69
6 34.36
These data columns can also be accessed like a matrix, using a matrix index.
This method can return a complete row, a complete column or just an element from the dataset.
# MSFT is in the second column and leaving the row index blank returns all the
# rows for the particular column
msft_prices4 = data_stocks[, 2]
head(msft_prices4)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
Date MSFT IBM AAPL MCD PG GOOG
4 7/01/2002 34.28 124.05 11.45 27.2 38.78 NA
4.1.2 Combining Data Frames
It may be required to combine two data frames during a data processing.
This can be done by stacking them row by row or combining them by columns using \(\mathtt{rbind}\) and \(\mathtt{cbind}\) respectively.
When using \(\mathtt{cbind}\) the number of rows in the columns combined must be of equal length likewise in \(\mathtt{rbind}\) the number of columns of the datasets combined should be equal. Lets see an example
# First create a vector having the returns for msft
msft_ret = 100 * diff(log(data_stocks$MSFT))
# combine the vector with the data
data_stocks_r = cbind(data_stocks, MSFT_RET = msft_ret) #this will generate an error message
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2784, 2783
[1] 2783
[1] 2784
# add one more value to vector msft_ret
msft_ret = c(0, msft_ret)
# check the length
length(msft_ret)
[1] 2784
# lets combine now (it should work)
data_stocks_r = cbind(data_stocks, MSFT_RET = msft_ret)
head(data_stocks_r) #shows one more column added to the data
Date MSFT IBM AAPL MCD PG GOOG MSFT_RET
1 2/01/2002 33.52 121.50 11.65 26.49 40.00 NA 0.0000000
2 3/01/2002 34.62 123.66 11.79 26.79 39.62 NA 3.2289274
3 4/01/2002 34.45 125.60 11.84 26.99 39.22 NA -0.4922552
4 7/01/2002 34.28 124.05 11.45 27.20 38.78 NA -0.4946904
5 8/01/2002 34.69 124.70 11.30 27.36 38.88 NA 1.1889367
6 9/01/2002 34.36 124.49 10.82 26.88 38.60 NA -0.9558364
- The following example adds a row to the data frame.
# create two dataframes from data_stocks
data_r1 = data_stocks[1:10, ] #first 10 rows
data_r2 = data_stocks[2775:2784, ] #last 10 rows
data_stocks_rbind = rbind(data_r1, data_r2)
print(data_stocks_rbind)
Date MSFT IBM AAPL MCD PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00 NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62 NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22 NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78 NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88 NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60 NA
7 10/01/2002 34.64 122.14 10.62 26.81 38.46 NA
8 11/01/2002 34.30 120.31 10.52 26.34 38.60 NA
9 14/01/2002 34.24 118.05 10.58 26.02 39.35 NA
10 15/01/2002 34.78 118.85 10.85 26.20 39.82 NA
2775 17/12/2012 27.10 193.62 518.83 89.91 69.93 720.78
2776 18/12/2012 27.56 195.69 533.90 90.52 69.97 721.07
2777 19/12/2012 27.31 195.08 526.31 89.71 69.34 720.11
2778 20/12/2012 27.68 194.77 521.73 90.04 69.82 722.36
2779 21/12/2012 27.45 193.42 519.33 90.18 68.72 715.63
2780 24/12/2012 27.06 192.40 520.17 89.29 68.52 709.50
2781 26/12/2012 26.86 191.95 513.00 88.74 68.00 708.87
2782 27/12/2012 26.96 192.71 515.06 88.72 67.97 706.29
2783 28/12/2012 26.55 189.83 509.59 87.58 67.15 700.01
2784 31/12/2012 26.71 191.55 532.17 88.21 67.89 707.38
4.1.3 Sub setting and Logical Data Selection
Suppose we want to extract data with particular characteristics like values ranges etc.
This can be accomplished using logical statements in bracket notations.
The following example illustrates. See \(\mathtt{help(">")}\) to see more comparison operators.
# select all rows with Apple prices above 100
data_aaplgr100 = data_stocks[data_stocks$AAPL > 100, ]
head(data_aaplgr100)
Date MSFT IBM AAPL MCD PG GOOG
1342 2/05/2007 30.61 102.22 100.39 50.02 62.37 465.78
1343 3/05/2007 30.97 102.80 100.40 49.91 62.00 473.23
1344 4/05/2007 30.56 102.96 100.81 49.92 62.41 471.12
1345 7/05/2007 30.71 103.16 103.92 49.50 62.18 467.27
1346 8/05/2007 30.75 103.29 105.06 49.32 61.75 466.81
1347 9/05/2007 30.78 104.38 106.88 49.84 62.01 469.25
[1] NA
# this give NA as the minimum which indicates that data frame has NA lets remove
# NAs from data_aaplgr100 using na.omit function
data_aaplgr100 = na.omit(data_aaplgr100)
# now check the minimum again
min(data_aaplgr100$AAPL)
[1] 100.06
- The \(\mathtt{na.omit}\) function used in the example above can be used to remove all the empty values in the dataset.
Date MSFT IBM AAPL MCD PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00 NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62 NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22 NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78 NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88 NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60 NA
data_stocks_googlena = data_stocks[!is.na(data_stocks$GOOG), ]
head(data_stocks_googlena) #after removing NAs
Date MSFT IBM AAPL MCD PG GOOG
663 19/08/2004 27.12 84.89 15.36 26.60 54.48 100.34
664 20/08/2004 27.20 85.25 15.40 27.07 54.85 108.31
665 23/08/2004 27.24 84.65 15.54 26.64 54.75 109.40
666 24/08/2004 27.24 84.71 15.98 26.87 54.95 104.87
667 25/08/2004 27.55 85.07 16.52 26.95 55.30 106.00
668 26/08/2004 27.44 84.69 17.33 27.10 55.70 107.91
# the above can still leave NAs in other columns use na.omit to remove all the
# blank data
data_stocks_naomit = na.omit(data_stocks)
There can be a requirement in data pre processing where one might have to select data in a range.
The following example selects data where MSFT prices lie between 20 and 30.
\(\mathtt{\&}\) is a Logic operator in R see help(“&”) to see more details and other Logic operators.
data_msft = data_stocks_naomit[data_stocks_naomit$MSFT <= 30 & data_stocks_naomit$MSFT >
20, ]
min(data_msft$MSFT) #check
[1] 20.06
These selections can also be performed using the function \(\mathtt{subset}\).
The following example uses \(\mathtt{subset}\) function to select rows with AAPL>100. The arguments to the function are also shows in the example
function (x, subset, select, drop = FALSE, ...)
NULL
Date MSFT IBM AAPL MCD PG GOOG
1342 2/05/2007 30.61 102.22 100.39 50.02 62.37 465.78
1343 3/05/2007 30.97 102.80 100.40 49.91 62.00 473.23
1344 4/05/2007 30.56 102.96 100.81 49.92 62.41 471.12
1345 7/05/2007 30.71 103.16 103.92 49.50 62.18 467.27
1346 8/05/2007 30.75 103.29 105.06 49.32 61.75 466.81
1347 9/05/2007 30.78 104.38 106.88 49.84 62.01 469.25
[1] 100.06