4.1 Data Preprocessing

-We will now discuss some methods for data manipulation to clean a dataset, combine various datasets or extract a variable from a data frame before we jump into some programming basics.

4.1.1 Extracting Data

-Data frames are the most used data structures in R as they offer more flexibility in the way they can handle data.

  • Lets see some methods to extract data from a data frame. We will use the example dataset called \(\mathtt{us\_stocks.csv}\).

  • Lets import it using \(\mathtt{read.csv}\)

data_stocks = read.csv(file = "data/us_stocks.csv", header = TRUE)
head(data_stocks)
       Date  MSFT    IBM  AAPL   MCD    PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00   NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62   NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22   NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78   NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88   NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60   NA
  • The function \(\mathtt{names}\) or \(\mathtt{colnames}\) are used to access the names of the columns (or variables) in the data set as shows below.

  • The function \(\mathtt{row.names}\) can be used to access row names (if any) from a dataset

names(data_stocks)
[1] "Date" "MSFT" "IBM"  "AAPL" "MCD"  "PG"   "GOOG"
colnames(data_stocks)
[1] "Date" "MSFT" "IBM"  "AAPL" "MCD"  "PG"   "GOOG"
  • A specific data variable can be accessed using its name or index (column number) in the data frame.

  • To select any column use \(\mathtt{\$}\) symbol followed by the column name or its name in square brackets as shown in the example below

msft_prices1 = data_stocks$MSFT  #the data is returned as a vector
head(msft_prices1)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
msft_prices2 = data_stocks[["MSFT"]]  #the data is returned as a vector
head(msft_prices2)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
# the following returns data as a data frame
msft_prices3 = data_stocks["MSFT"]  #can also be used to access multiple columns
head(msft_prices3)
   MSFT
1 33.52
2 34.62
3 34.45
4 34.28
5 34.69
6 34.36
  • These data columns can also be accessed like a matrix, using a matrix index.

  • This method can return a complete row, a complete column or just an element from the dataset.

# MSFT is in the second column and leaving the row index blank returns all the
# rows for the particular column

msft_prices4 = data_stocks[, 2]

head(msft_prices4)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
# all the elements in row 4
data_stocks[4, ]
       Date  MSFT    IBM  AAPL  MCD    PG GOOG
4 7/01/2002 34.28 124.05 11.45 27.2 38.78   NA

4.1.2 Combining Data Frames

  • It may be required to combine two data frames during a data processing.

  • This can be done by stacking them row by row or combining them by columns using \(\mathtt{rbind}\) and \(\mathtt{cbind}\) respectively.

  • When using \(\mathtt{cbind}\) the number of rows in the columns combined must be of equal length likewise in \(\mathtt{rbind}\) the number of columns of the datasets combined should be equal. Lets see an example

# First create a vector having the returns for msft
msft_ret = 100 * diff(log(data_stocks$MSFT))
# combine the vector with the data
data_stocks_r = cbind(data_stocks, MSFT_RET = msft_ret)  #this will generate an error message 
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2784, 2783
# different length
length(msft_ret)
[1] 2783
length(data_stocks$MSFT)
[1] 2784
# add one more value to vector msft_ret
msft_ret = c(0, msft_ret)
# check the length
length(msft_ret)
[1] 2784
# lets combine now (it should work)
data_stocks_r = cbind(data_stocks, MSFT_RET = msft_ret)
head(data_stocks_r)  #shows one more column added to the data
       Date  MSFT    IBM  AAPL   MCD    PG GOOG   MSFT_RET
1 2/01/2002 33.52 121.50 11.65 26.49 40.00   NA  0.0000000
2 3/01/2002 34.62 123.66 11.79 26.79 39.62   NA  3.2289274
3 4/01/2002 34.45 125.60 11.84 26.99 39.22   NA -0.4922552
4 7/01/2002 34.28 124.05 11.45 27.20 38.78   NA -0.4946904
5 8/01/2002 34.69 124.70 11.30 27.36 38.88   NA  1.1889367
6 9/01/2002 34.36 124.49 10.82 26.88 38.60   NA -0.9558364
  • The following example adds a row to the data frame.
# create two dataframes from data_stocks
data_r1 = data_stocks[1:10, ]  #first 10 rows
data_r2 = data_stocks[2775:2784, ]  #last 10 rows
data_stocks_rbind = rbind(data_r1, data_r2)
print(data_stocks_rbind)
           Date  MSFT    IBM   AAPL   MCD    PG   GOOG
1     2/01/2002 33.52 121.50  11.65 26.49 40.00     NA
2     3/01/2002 34.62 123.66  11.79 26.79 39.62     NA
3     4/01/2002 34.45 125.60  11.84 26.99 39.22     NA
4     7/01/2002 34.28 124.05  11.45 27.20 38.78     NA
5     8/01/2002 34.69 124.70  11.30 27.36 38.88     NA
6     9/01/2002 34.36 124.49  10.82 26.88 38.60     NA
7    10/01/2002 34.64 122.14  10.62 26.81 38.46     NA
8    11/01/2002 34.30 120.31  10.52 26.34 38.60     NA
9    14/01/2002 34.24 118.05  10.58 26.02 39.35     NA
10   15/01/2002 34.78 118.85  10.85 26.20 39.82     NA
2775 17/12/2012 27.10 193.62 518.83 89.91 69.93 720.78
2776 18/12/2012 27.56 195.69 533.90 90.52 69.97 721.07
2777 19/12/2012 27.31 195.08 526.31 89.71 69.34 720.11
2778 20/12/2012 27.68 194.77 521.73 90.04 69.82 722.36
2779 21/12/2012 27.45 193.42 519.33 90.18 68.72 715.63
2780 24/12/2012 27.06 192.40 520.17 89.29 68.52 709.50
2781 26/12/2012 26.86 191.95 513.00 88.74 68.00 708.87
2782 27/12/2012 26.96 192.71 515.06 88.72 67.97 706.29
2783 28/12/2012 26.55 189.83 509.59 87.58 67.15 700.01
2784 31/12/2012 26.71 191.55 532.17 88.21 67.89 707.38

4.1.3 Sub setting and Logical Data Selection

  • Suppose we want to extract data with particular characteristics like values ranges etc.

  • This can be accomplished using logical statements in bracket notations.

  • The following example illustrates. See \(\mathtt{help(">")}\) to see more comparison operators.

# select all rows with Apple prices above 100
data_aaplgr100 = data_stocks[data_stocks$AAPL > 100, ]
head(data_aaplgr100)
          Date  MSFT    IBM   AAPL   MCD    PG   GOOG
1342 2/05/2007 30.61 102.22 100.39 50.02 62.37 465.78
1343 3/05/2007 30.97 102.80 100.40 49.91 62.00 473.23
1344 4/05/2007 30.56 102.96 100.81 49.92 62.41 471.12
1345 7/05/2007 30.71 103.16 103.92 49.50 62.18 467.27
1346 8/05/2007 30.75 103.29 105.06 49.32 61.75 466.81
1347 9/05/2007 30.78 104.38 106.88 49.84 62.01 469.25
min(data_aaplgr100$AAPL)  #check if the prices are above 100
[1] NA
# this give NA as the minimum which indicates that data frame has NA lets remove
# NAs from data_aaplgr100 using na.omit function
data_aaplgr100 = na.omit(data_aaplgr100)
# now check the minimum again
min(data_aaplgr100$AAPL)
[1] 100.06
  • The \(\mathtt{na.omit}\) function used in the example above can be used to remove all the empty values in the dataset.
head(data_stocks)  #notice NAs in GOOG
       Date  MSFT    IBM  AAPL   MCD    PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00   NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62   NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22   NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78   NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88   NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60   NA
data_stocks_googlena = data_stocks[!is.na(data_stocks$GOOG), ]
head(data_stocks_googlena)  #after removing NAs
          Date  MSFT   IBM  AAPL   MCD    PG   GOOG
663 19/08/2004 27.12 84.89 15.36 26.60 54.48 100.34
664 20/08/2004 27.20 85.25 15.40 27.07 54.85 108.31
665 23/08/2004 27.24 84.65 15.54 26.64 54.75 109.40
666 24/08/2004 27.24 84.71 15.98 26.87 54.95 104.87
667 25/08/2004 27.55 85.07 16.52 26.95 55.30 106.00
668 26/08/2004 27.44 84.69 17.33 27.10 55.70 107.91
# the above can still leave NAs in other columns use na.omit to remove all the
# blank data
data_stocks_naomit = na.omit(data_stocks)
  • There can be a requirement in data pre processing where one might have to select data in a range.

  • The following example selects data where MSFT prices lie between 20 and 30.

  • \(\mathtt{\&}\) is a Logic operator in R see help(“&”) to see more details and other Logic operators.

data_msft = data_stocks_naomit[data_stocks_naomit$MSFT <= 30 & data_stocks_naomit$MSFT > 
    20, ]
min(data_msft$MSFT)  #check 
[1] 20.06
  • These selections can also be performed using the function \(\mathtt{subset}\).

  • The following example uses \(\mathtt{subset}\) function to select rows with AAPL>100. The arguments to the function are also shows in the example

args(subset.data.frame)
function (x, subset, select, drop = FALSE, ...) 
NULL
aaplgr100 = subset(data_stocks_naomit, AAPL > 100)
head(aaplgr100)
          Date  MSFT    IBM   AAPL   MCD    PG   GOOG
1342 2/05/2007 30.61 102.22 100.39 50.02 62.37 465.78
1343 3/05/2007 30.97 102.80 100.40 49.91 62.00 473.23
1344 4/05/2007 30.56 102.96 100.81 49.92 62.41 471.12
1345 7/05/2007 30.71 103.16 103.92 49.50 62.18 467.27
1346 8/05/2007 30.75 103.29 105.06 49.32 61.75 466.81
1347 9/05/2007 30.78 104.38 106.88 49.84 62.01 469.25
min(aaplgr100$AAPL)
[1] 100.06