Appendix: A quick intro to R and RStudio

In this appendix we will cover all you need to know about R in order to benefit from the contents of this book. You will be introduced to the programming language, R and the RStudio interface. We assume you have successfully managed to install these two applications. Here we will acquaint you with them and will teach you three R basics: operators, objects, and packages.

Exploring RStudio

You can use R without using RStudio, but RStudio is an app that makes it easier to work with R. RStudio is what we call an IDE, an integrated development environment. It is a fancy way of saying that it is a cool interface designed to write programming code. Every time you open up RStudio, you are in fact starting an R session. RStudio automatically runs R in the background.

A screen capture of the desktop application RStudio. A menu bar runs along the top, along with icons for quick functions. The main pane of the app is divided into three parts, most of which is taken up by the R console, with a tab for Terminal near its top. To the top right, the smallest section contains tabs for Environment, History, Connections, and Git, and underneath is the last section which contains tabs for Files, Plots, Packages, Help, and Viewer.

FIGURE 13.1: Screenshot of RStudio IDE

When you first open RStudio, you will see (as in the image above) that there are three main windows. The bigger one to your left is the console. If you read the text in the console you will see that RStudio is indeed opening R, and you can see what version of R you are running. Since R is constantly being updated, the version you installed is likely more recent than the one we used at time of writing.

Opening up the script pane

The view in RStudio is structured so that you have four open windows in a regular session. However, when you first open, you might be starting with only three. To open the script pane (the one missing), click in the File drop-down Menu, select New File, then R Script.

The RStudio desktop application now has a drop-down menu from the top bar open. The 'File' menu has been selected, and among the options from it, the 'New File' submenu has been opened. Within this submenu, 'R Script' is highlighted.

FIGURE 13.2: Navigate to File > New File> R Script to open script pane

You will now see the four window areas in display. In each of these areas you can shift between different views and panels. You can also use your mouse to re-size the different windows if that is convenient.

A screen capture of the RStudio desktop app, now with four panes dividing the main window. Compared to before, the pane containing 'R Console' has been squished down, and an equal size pane has appeared above it. That pane contains tabs for two open files, along with some buttons running along the top of the the pane, followed by a large text entry area. The four panes are labeled one through four, going left to right, such the new pane has the number one, the one with the 'Environment' tab two, 'Console' three, and the one with the 'Files' tab number four.

FIGURE 13.3: The 4 panes of RStudio

The four panes of RStudio

The purposes of the four panes in the figure above are the following:

Script and data view: where you type your code that tells R what you want to do. These are essentially instructions that you type and save as a script, so that you can return to it later to remember what you did and to share it with others so that they can reproduce what you did.
Environment and history view:

Environment tab - gives you the names of all the (data) objects that you have defined during your current R session, including number of observations and rows in those objects. We learn more about objects later.
History tab - shows you a history of all the code you have previously evaluated in the main console. One of the key advantages of doing data analysis this way, with code versus with a point and click interface like Excel or ArcGIS, is that you are producing a written record of every step you take in the analysis. First time around, it will take you time to write these instructions; it may be slower than pointing and clicking. And unlike with pointing and clicking you need to know the “words” and “grammar” of this language.

Main console: this is considered R's heart, and it is where R evaluates the codes that you run. You can type your code directly in the console, but for the sake of good habits, type them in the script and data view so you can save a record of them. Only type and run code from here if you want to debug or do some quick analysis.
File Directory, Plots, Packages, Help:

Files tab- allows you to see the files in the folder that is currently set as your working directory.
Plots tab- you will see any data visualizations that you produce here. You have not produced any yet, so it is empty now.
Packages tab- you will see the packages that are currently available to install. We will explain what these are soon, but know that they are an essential feature when working with R.
Help tab- you can access further information on the various packages.

Interacting with the four panes

In the previous section, you opened up the ‘script’ pane. We now write some code in it, and see what happens.

To do this, go to your open version of RStudio, and type in the script pane the following:

print("Hello world!")

When you have typed this, you will have typed your first bit of code. Yet nothing is happening? That is because you also have to run the code.

You can do this by highlighting the code you wish to run, and clicking on ‘run’ in the top right-hand corner. When you ‘run’ the code, it will print the text ‘Hello World!’ in the bottom pane, which is the console. That means you have written and executed your first line of code.

In the rest of the appendix, we will be unpacking how this all works, and getting more familiar and comfortable with using RStudio.

To recap: the script is where you write your programming code. A script is nothing but a text file with some code on it. Unlike other programs for data analysis you may have used in the past (Excel, SPSS), you need to interact with R by means of writing down instructions and asking R to evaluate those instructions. R is an interpreted programming language: you write instructions (code) that the R engine has to interpret in order to do something. And all the instructions we write can and should be saved in a script, so that you can return later to what you did.

As mentioned earlier, one of the key advantages of doing spatial data analysis with code versus with a point-and-click interface like ArcGIS or MapInfo (or even QGIS) is that you are producing a written record of every step you take in the analysis. First time around it may be slower than pointing and clicking, however over time you can re-run and re-use your code and will save lots of time (as well as build transparency and reproducibility into your analysis). Once you have written your instructions and saved them in a file, you will be able to share this file with others and run your every time you want in a matter of seconds. This creates a reproducible record of your analysis: something that your collaborators or someone else anywhere (including your future self, the one that will have forgotten how to do the stuff) could run and get the same results than you did at some point earlier. This makes science more transparent, and transparency brings with it many advantages. For example, it makes your research more trustworthy. Don’t underestimate how critical this is. Reproducibility is becoming a key criteria to assess good quality research. You can read up on reproducibility and its importance in Leyser, Kingsley, and Grange (2015) or Ritchie (2020) for more detail.

Customising the RStudio look

RStudio allows you to customise the way it looks. Working with white backgrounds is not generally a good idea if you care about your eyesight. If you don’t want to end up with dry eyes, not only it is good that you follow the 20-20-20 rule (every 20 minutes, look for 20 seconds to an object located 20 feet away from you), but it may also be a good idea to use more eye-friendly screen displays.

Click in the Tools menu and select Global Options. This will open up a pop-up window with various options. Select Appearance. In this section you can change the font type and size, but also the kind of theme background that R will use in the various windows.

A partial screen capture of the RStudio desktop application, with a drop-down menu from the top bar open. The 'Tools' menu has been selected, and among the options from it, 'Global Options...' is highlighted.

FIGURE 13.4: Find Tools > Global options

You can make any changes you’d like to here, including the theme background that R will use as the interface. For example, you may choose a darker theme like ‘tomorrow night bright’.

A screen capture of the 'Options' screen in the app, with categories for options running down the entire left side of the window. The 'Appearance' category has been selected, and three selectable options appear to the right, 'Rstudio theme', 'Editor font size', and 'Editor theme'. This last one has a long list of options, of which 'Tomorrow Night Bright' has been selected. A preview of the theme appears to the right.

FIGURE 13.5: Select tomorrow night bright theme

Elements of R grammar

R is a programming language, and as a language it has a number of elements that are basic to its understanding. In this section we discuss functions (the “verbs” in the R grammar) and objects (the “nouns” in the R grammar).

Functions

Functions do things. They are called by a certain name, usually a name which represents what they do, and they are followed by brackets (). Within the brackets, you can put whatever it is that you want the function to work with. For example, the code we wrote in previously was the print() function. This function told R to print into the console whatever we put in the brackets (“Hello World!”).

It is the same idea with a personalised greeting: if you want to print ‘Hello Reka’, you will need to have “Hello Reka” inside the brackets:

print("Hello Reka")

## [1] "Hello Reka"

There are so many functions in R. We will be learning many of them throughout the book. Print is fun, but most of the time, we will be using functions to help us with our data analysis. For example, getting the minimum, maximum, or mean of a list of numbers. R does this using functions in a very similar way.

For example, if we have a bunch of numbers, we just find the appropriate function to get the summary we want:

mean(c(10, 34, 5, 3, 77))

## [1] 25.8

min(c(10, 34, 5, 3, 77))

## [1] 3

max(c(10, 34, 5, 3, 77))

## [1] 77

How can you find the function you need? Throughout this book, you will learn a list that appears at the top of each lesson. A recommendation when you are starting with R is to also create a ‘function cookbook’, where you write down a list of functions, what the functions do, and some examples. Here is an example:

A screen capture of the Google Documents web app, with a file titled 'My R Function Cookbook' open. The document contains a table with three columns, labeled 'Function', 'What it does', and 'An example'. The first three rows are visible, with entries for the functions 'print', 'mean', and 'min', along with their descriptions and an example usage.

FIGURE 13.6: An example of an R function cookbook

You can use Google to make your cookbook; and the website stackoverflow, in particular, can help you find the function you need. But be wary, especially in the beginning, that you understand what the function does. There can be several different functions for the same action.

NOTE: R is case-sensitive! For example:

# Calculating the logarithm 
Log(100)

# ERROR!

# Instead, it should be:
log(100)

## [1] 4.60517

So make sure that you are specific about upper and lower case letters in your collection of relevant functions. In general, almost all functions will start with lower case letter. One exception is the View() function.

Objects

You can think of objects as boxes where you put things. Imagine a big, empty cardboard box. We can create this big empty box in R by simply giving it a name. Usually, you want your object/box to have a good descriptive name, which will tell people what is in it. Imagine moving house. If you have a cardboard box full of places, you might want to label it “plates”. That way, when carrying, you know to be careful, and when unpacking, you know its contents will go in the kitchen. On the other hand, if you named it “box1”, then this is a lot less helpful when it comes to unpacking.

Creating an object

Let us create an object called ‘plates’. To do this, you go to your script, and type ‘plates’.

But if you run this code, you will get an error. Try it and you will see the error ‘Error! Object plates not found’. This is because you have not yet put anything inside the plates ‘box’. Remember objects are like boxes,so there must be something inside our object ‘plates’. In order for this object to exist, you have to put something inside it, or in R-speak assign it some value.

Therefore, we make an object by using an assignment operator ( <- ). In other words, we assign something to an object (i.e., put something in the box). For example:

plates <- "yellow plate"

Now if we run this, we will see no error message, but instead, we will see the plates object appear in our environment pane.

Here are some more examples to illustrate:

# Putting '10' in the 'a' box
a <- 10

# Putting 'Hello!' in the 'abc123' box
abc123 <- "Hello!"

In these examples, we are putting the value of 10 into the object a, and the value of ‘Hello!’ into the object abc123.

Earlier, we introduced you to the Environment and History pane. We mentioned that it lists objects you defined. After making the ‘a’ and ‘abc123’ objects, they should appear in that pane under the Environment tab.

Types of objects

Why are objects important? We will be storing everything in our data analysis process in these objects. Depending on what is inside them, they can become a different type of object. Here are some examples:

Data structures are important objects that store your data, and there are five main types but we focus on three for this course:

(atomic) vector: an ordered set of elements that are of the same class. Vectors are a basic data structure in R. Below are five different classes of vectors:

# 1. numeric vector with three elements
my_1st_vector <- c(0.5, 8.9, 0.6) 

# 2. integer vector with addition of L at the end of the value
my_2nd_vector <- c(1L, 2L, 3L)  

# 3. logical vector
my_3rd_vector <- c(TRUE, FALSE, FALSE) 
# 'my_4th_vector' creates a logical vector using abbreviations of True and False, 
#   but you should use the full words instead
my_4th_vector <- c(T, F) 

# 4. character vector
my_5th_vector <- c("a", "b", "c") 

# 5. complex vector (we will not use this for our class)
my_6th_vector <- c(1+0i, 2+4i)

lists: technically they, too, are vectors but they are more complex because they are not restricted on the length, structure, or class of the included elements. For example, to create a list containing strings, numbers, vectors and a logical, use the list() function, and inside the brackets, put everything you want to combine into a list:

list_data <- list("teal", "sky blue", c(10, 5, 10), TRUE, 68.26, 95.46, 99.7)

Above, we created list_data, an object that contains all those things that we put inside the list() function. This function serves to create a list from combining everything that is put inside its brackets.

Use the class() function to confirm that the objects have been defined as a list

class(list_data)

## [1] "list"

data frames: also store elements but differ from lists because they are defined by their number of columns and rows; the vectors (columns) must be of the same length. Data frames can contain different classes, but each column must be of the same class. For example, if you want to combine some related vectors to make a data frame on violent American cities, use the function data.frame():

# Making some relevant vectors
TopVioCities <- c("St. Louis", "Detroit", "Baltimore") # some violent US cities
VioRatePer1k = c(20.8, 20.6, 20.3) # their violence rates per 1,000 persons
State <- c("Missouri", "Michigan", "Maryland") # in what states are these cities found

#Join them to make a data frame called 'df'
df<-data.frame(TopVioCities, VioRatePer1k, State)

We can then view the data frame, ‘df’, with the View() function:

View(df)

Doing things to objects

We have learned what functions are (i.e., things that do things) and what are objects (i.e., the boxes that hold things). We also saw some functions which helped us create objects. Functions can also do things to objects. For example, we saw the function class() that told us about what kind of object list_data was, and View() which allowed us to have a look at our dataframe we called df.

Let us look back at our plates object. Remember, it was the object that held our kitchen items. We added ‘yellow plate’ to it. Now let us add some more items and let us use the concatenate c() function for this again:

plates <- c("yellow plate", "purple plate", "silver plate", "orange bowl")

Let us say that we suddenly forgot what was in our object called ‘plates’. Like what we learned earlier, we use the function print() to see what is inside this object:

print(plates)

## [1] "yellow plate" "purple plate" "silver plate" "orange bowl"

This can apply to obtaining the mean, minimum, and maximum. You could assign those statistics to an object this time:

nums <- c(10, 34, 5, 3, 77)

Now if we want to know the mean, we can take the mean of the object nums, which we just created:

mean(nums)

## [1] 25.8

The object we will use most frequently though is data frames. These hold your data in a format whereby each column represents a variable, and each row an observation.

Just earlier, we had created a dataframe called df previously. If you have not yet copied this over into your own RStudio, do this now. You should have the object df in your environment. When you run View(df), you should see this dataset:

A screen capture of the RStudio desktop app. Some work is present in the four panes, with two areas of importance circled in orange, with arrows drawn pointing towards them. The top left pane is circled, and has the tab 'df' open, which shows a table of the contents of the dataframe, with columns labeled 'TopVioCities', 'VioRatePer1k', and 'State', and numbered columns. The top right frame with the 'Environment' tab has an entry circled under the 'Data' heading. The entry is labeled 'df', and is described as 'three obs of three variables'.

FIGURE 13.7: View your dataframes

To do something to an entire dataframe, we would use the name of the object (df) to refer to it. In the case of the View() function, we want to see the whole thing, so we will call View(df). On the other hand, if we want to refer to only one variable in the data, (remember back to term 1 - each variable is held in each column) there is a special notation to do this.

To refer to a variable (column) inside a dataframe, you use:

$\text{dataframe name} + \$ + \text{variable name}$

For example, to refer to the variable VioRatePer1k, we use the notation df$VioRatePer1k.

And if we wanted to View only that column, we use:

View(df$VioRatePer1k)

You should see one column of the dataframe appear in the top left pane.

Say, we wanted to know the mean violence rate across our units of analysis, the cities, for example, we would take the numeric column to calculate this:

mean(df$VioRatePer1k)

## [1] 20.56667

Packages

Packages are a very important element of R. Packages are elements that add the functionality of R. What most packages do is they introduce new functions that allow you to ask R to do new different things. Anybody can write a package, so consequently R packages vary on quality and complexity. You can find packages in different places, as well, from official repositories (which means they have passed a minimum of quality control), something called GitHub (a webpage where software developers post work in progress), to personal webpages (danger danger!).

Throughout the book, and hopefully afterwards, you will find yourself installing numerous open-source software packages that allow R to do new and different things. There are loads of packages out there. In early 2020, there were over 150,000 packages available. Anyone can write one, so you will need to be careful on which ones you use, as the quality can vary. Official repositories, like CRAN, are your best bet for packages, as they will have passed some quality controls.

You can see what packages are available in your local install by looking at the packages tab in the File Directory, Plots, Packages pane.

A number of the packages we will use belong to a set of packages called tidyverse. These packages help make your data tidy. According to Statistician and Chief Scientist at RStudio, Hadley Wickham, transforming your data into tidy data is one of the most important steps of the data analysis process. It will ensure your data are in the format you need to conduct your analyses. We will also be using the simple features package sf and many more associated with spatial data analysis.

Packages can be installed using the install.packages() function. Remember that while you only need to install packages once, they need to be loaded with the library()function each time you open up RStudio. Let us install the package dplyr from tidyverse and load it:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

A lot of code and activity appears in the console. Warnings may manifest. Most of the time, the warnings explain what is being loaded and confirm that the package is successfully loaded. If there is an error, you will have to figure out what the warnings are telling you to successfully load the package. This happens and is normal.

To double-check that you have actually installed dplyr, go to that File Directory, Plots, Packages pane and click on the Packages tab. The list of packages is in alphabetical order and dplyr should be there. If there is a tick in its box, it means that this package is currently loaded and you can use it; if there is no tick, it means that it is inactive, and you will have to bring it up with library(), or just tick its box.

On masking: sometimes packages introduce functions that have the same name as those that are already loaded into your session. When that happens, the newly loaded ones will override the previous ones. You can still use them, but you will have to refer to them explicitly by bringing them up and specifying to which package they belong with library().

How do you find out what a package does? You look at the relevant documentation. In the Packages window, scroll down until you find listed the new package we installed. Here you will see the name of the package (dplyr), a brief description of what the program is about, and the version you have installed (an indication that a package is a good package is that it has gone through several versions, that means that someone is making sure the package gets regular updates and improvements).

Click on the name dplyr. You will see that RStudio has now brought you to the Help tab. Here is where you find the help files for this package, including all the available documentation.

Every beginner in R will find these Help files a bit confusing. But after a while, their format and structure will begin to make sense to you. Click where it says User Guides, Package Vignettes, and Other Documentation. Documentation in R has become much better since people started to write vignettes for their packages. They are little tutorials that explain with examples what each package does.

Exploring data with R

Now that we know the basic component, let’s play around with using R as we will throughout the book, for some data analysis.

We will get some data by installing a package which has data in it as well as functions, and then go on to produce some basic summaries. This should give some practice!

Playing around with data

We are going to look at some data that are part of the fivethirtyeight package. This package contains datasets and code behind the stories in this particular online magazine (http://fivethirtyeight.com/). This package is not part of the base installation of R, so you will need to install it first.

Remember, first we have to load the package if we want to use it:

library("fivethirtyeight")
data(package="fivethirtyeight") #Show all data frames available in named package

Notice that this package has some datasets that relate to stories covered in this newspaper that had a criminological angle. Let’s look for example at the hate_crimes dataset. How do you that? First, we have to load the data frame into our global environment. To do so use the following code:

data("hate_crimes")

This function will search among all the loaded packages and locate the hate_crimes dataset. Notice that it now appears in the global environment, although it also says “promise” next to it. To see the data in full, you need to do something to it first. So let’s do that.

Every object in R can have attributes. These are names, dimensions (for matrices and arrays: number of rows and columns) and dimension names, class of object (numeric, character, etc.), length (for a vector this will be the number of elements in the vector), and other user-defined ones. You can access the attributes of an object using the attributes() function. Let’s query R for the attributes of this data frame.

attributes(hate_crimes)

## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $names
##  [1] "state"                       "state_abbrev"               
##  [3] "median_house_inc"            "share_unemp_seas"           
##  [5] "share_pop_metro"             "share_pop_hs"               
##  [7] "share_non_citizen"           "share_white_poverty"        
##  [9] "gini_index"                  "share_non_white"            
## [11] "share_vote_trump"            "hate_crimes_per_100k_splc"  
## [13] "avg_hatecrimes_per_100k_fbi"

This prints out the row names (not very exciting here…) the class (see above when we used class() function) and the names, which are the column headers - or the names of the variables within this dataset. You can see there are things like state, and share who voted for Trump in the 2016 election.

Now use the View() function to glance at your data frame. What you get there is a spreadsheet with 12 variables and 51 observations. Each variable in this case is providing you with information (demographics, voting patterns, and hate crime) about each of the US states.

Ok, let’s now have a quick look at the data. There are so many different ways of producing summary stats for data stored in R that is impossible to cover them all! We will just introduce a few functions that you may find useful for summarising data. Before we do any of that, it is important you get a sense for what is available in this dataset. Go to the help tab and in the search box input the name of the data frame, this will take you to the documentation for this data frame. Here you can see a list of the available variables.

Let’s start with the mean. This function takes as an argument the numeric variable for which you want to obtain the mean. If you want to obtain the mean of the variable that gives us the proportion of people that voted for Donald Trump, you can use the following expression:

mean(hate_crimes$share_vote_trump)

## [1] 0.49

Another function you may want to use with numeric variables is summary():

summary(hate_crimes$share_vote_trump)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.040   0.415   0.490   0.490   0.575   0.700

This gives you the five number summary (minimum, first quartile, median, third quartile, and maximum, plus the mean and the count of missing values if there are any).

You don’t have to specify a variable; you can ask for these summaries from the whole data frame:

summary(hate_crimes)

##     state           state_abbrev       median_house_inc share_unemp_seas 
##  Length:51          Length:51          Min.   :35521    Min.   :0.02800  
##  Class :character   Class :character   1st Qu.:48657    1st Qu.:0.04200  
##  Mode  :character   Mode  :character   Median :54916    Median :0.05100  
##                                        Mean   :55224    Mean   :0.04957  
##                                        3rd Qu.:60719    3rd Qu.:0.05750  
##                                        Max.   :76165    Max.   :0.07300  
##                                                                          
##  share_pop_metro   share_pop_hs    share_non_citizen share_white_poverty
##  Min.   :0.3100   Min.   :0.7990   Min.   :0.01000   Min.   :0.04000    
##  1st Qu.:0.6300   1st Qu.:0.8405   1st Qu.:0.03000   1st Qu.:0.07500    
##  Median :0.7900   Median :0.8740   Median :0.04500   Median :0.09000    
##  Mean   :0.7502   Mean   :0.8691   Mean   :0.05458   Mean   :0.09176    
##  3rd Qu.:0.8950   3rd Qu.:0.8980   3rd Qu.:0.08000   3rd Qu.:0.10000    
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.13000   Max.   :0.17000    
##                                    NA's   :3                            
##    gini_index     share_non_white  share_vote_trump hate_crimes_per_100k_splc
##  Min.   :0.4190   Min.   :0.0600   Min.   :0.040    Min.   :0.06745          
##  1st Qu.:0.4400   1st Qu.:0.1950   1st Qu.:0.415    1st Qu.:0.14271          
##  Median :0.4540   Median :0.2800   Median :0.490    Median :0.22620          
##  Mean   :0.4538   Mean   :0.3157   Mean   :0.490    Mean   :0.30409          
##  3rd Qu.:0.4665   3rd Qu.:0.4200   3rd Qu.:0.575    3rd Qu.:0.35694          
##  Max.   :0.5320   Max.   :0.8100   Max.   :0.700    Max.   :1.52230          
##                                                     NA's   :4                
##  avg_hatecrimes_per_100k_fbi
##  Min.   : 0.2669            
##  1st Qu.: 1.2931            
##  Median : 1.9871            
##  Mean   : 2.3676            
##  3rd Qu.: 3.1843            
##  Max.   :10.9535            
##  NA's   :1

There are multiple ways of getting results in R. Particularly for basic- and intermediate-level statistical analysis many core functions and packages can give you the answer that you are looking for. For example, there are a variety of packages that allow you to look at summary statistics using functions defined within those packages. You will need to install these packages before you can use them.

We are only going to introduce one of them here: skimr. You will need to install it before anything else.

Once you have loaded the skimr package you can use it. Its main function is skim. Like summary for data frames, skim() presents results for all the columns, and the statistics will depend on the class of the variable.

skim(hate_crimes)

Hopefully in your statistical modules you had taken previously, you have learned some things about how to graphically display variables. So you may have some memory about the amount of work involved with this. Hopefully R will offer some respite. Of course, there are many different ways of producing graphics in R. In this course we rely on a package called ggplot2, which is part of the tidyverse set of packages mentioned earlier.

library(ggplot2)

Then we will use one of its functions to create a scatterplot.

ggplot(hate_crimes, aes(x=share_vote_trump, y=avg_hatecrimes_per_100k_fbi)) +
    geom_point(shape=1) +
     geom_smooth(method=lm)

Graphing is very powerful in R, and much of the spatial visualisation we will produce throughout the book will build on this. If you are not already familiar with this, we recommend a read of the data visualisation chapter of Wickham and Grolemund (2017).

Getting organised: R Projects

One thing that can help you tremendously is keeping your code organised. RStudio helps with this by virtue of something called R Projects.

Technically, a RStudio project is just a directory with the name of the project, and a few files and folders created by RStudio for internal purposes. This is where you should hold your scripts, your data, and reports. You can manage this folder with your own operating system manager (discussed earlier, e.g., Windows) or through the RStudio file manager (that you access in the bottom-right corner set of windows in RStudio).

When a project is reopened, RStudio opens every file and data view that was open when the project was closed last time around. Trust me, this is a really helpful thing! If you create a project for this module, you can keep everything in one place, and refer back to your old code and your learnings throughout the module.

Saving your work and projects

First things first, hopefully you have already created a separate folder on your computer. Now save the script you’ve been working on into this folder. By clicking on “File” and “Save as…”.

Then navigate to your folder for this module, and for your script make sure to give it some meaningful name like appendixlab.R or something like this. Then click ‘save’.

Now, go back to “File” and select “New project…”

Then in the options that appear, choose “Existing Directory”. This is because you already have a folder for this work; this is where you saved your script just before. For me this was my folder called ‘crime_mapping’.

So select ‘Existing Directory’, and on the next page use the “Browse” button to select this folder (directory) where you saved the script earlier. Once you have done this, click on ‘Create Project’ on the bottom.

This will now open up a new RStudio window with your project. In the future, you can start right back up where you finished last time by navigating to the .Rproj file, and double clicking it. It helps you keep everything in one place, and lets R read everything from that folder.

With simple projects, a single script file and a data file is all you may have. But with more complex projects, things can rapidly become messy. So you may want to create sub-directories within this project folder. You could use the following structure to put all files of a certain type in the same sub-directory:

Scripts and code: Here you put all the text files with the analytic code, including Rmarkdown files which is something we don’t introduce here.
Source data: Here you can put the original data. I would tend not to touch this once I have obtained the original data.
Documentation: This is the sub-directory where you can place all the data documentation (e.g., codebooks, questionnaires, etc.)
Modified data: All analysis involves doing transformations and changing things in the original data files. You don’t want to mess up the original data files, so what you should do is create new data files as soon as you start changing your source data. I would go so far as to place them in a different sub-directory.
Literature: Analysis is all about answering research questions. There is always a literature about these questions. I would place the relevant literature for the analytic project I am conducting in this sub-directory.
Reports and write-up: Here is where I would file all the reports and data visualisations that are associated with my analysis.

You can read up on why projects are useful in Chan (2020).

Summary and further reading

This has been a very superficial overview of using R and RStudio to write code. If you were able to follow along, then you should be OK to follow along with the contents of this book as well. If you are not yet feeling confident, and are looking to further develop your R skills, we highly recommend Wickham and Grolemund (2017). Throughout the book we use Tidyverse packages and syntax wherever possible. For an overview of this, check out Wickham et al. (2019). And for those who want to hone their R skills even further, there is Wickham (2019) for generally more advanced R skills, and of course, Bivand, Pebesma, and Gómez-Rubio (2013) and Lovelace, Nowosad, and Muenchow (2019) are important reference texts.

References

Bivand, Roger, Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with r. 2nd ed. New York: Springer.

Chan, Martin. 2020. “RStudio Projects and Working Directories: A Beginner’s Guide.” https://www.r-bloggers.com/2020/01/rstudio-projects-and-working-directories-a-beginners-guide/.

Leyser, Ottoline, Danny Kingsley, and Jim Grange. 2015. “The Science ‘Reproducibility Crisis’ – and What Can Be Done about It.” https://theconversation.com/the-science-reproducibility-crisis-and-what-can-be-done-about-it-74198.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with r. Boca Raton, FL: Chapman & Hall/CRC Press.

Ritchie, Stuart. 2020. Science Fictions: Exposing Fraud, Bias, Negligence and Hype in Science. Random House.

Wickham, Hadley. 2019. Advanced r. CRC Press.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy McGowan, Roman François, Garrett Grolemund, Alez Hayes, Lionel Henry, and Jim Hester. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Boston, MA: O’Reilly.