Setting Up A Data Science Environment on OSX

Setting up software for data analysis and statistical computing can be a major pain. Although setup on OSX is substantially less troublesome than on Windows (a lesson I’ve had to learn the hard way at the office), the process is nonetheless time-consuming and can be difficult to get right. I have finally arrived at a clean, maintainable setup that seems worth sharing, so that others can avoid spending as much time as I have configuring a new machine. This is intended to be a living document, so I will update this post as technologies or my processes change.

Here is a general overview of the tools I use and what will need to be installed:

  • XCode

  • Homebrew

  • Git

  • Unix Shell (get bash profile with aliases, prompt string, etc from Git)

  • Vim

  • Python (Anaconda or not to Anaconda)

  • R (with RStudio)

  • LaTeX

XCode

The first thing you need to do is install Apple’s command line tools (or XCode). This can be done by installing XCode from the App Store, going to Preferences > Downloads, and then installing the command line tools.

On OS X 10.9 and later, you can also install the command line tools directly from the command line, without installing all of XCode, with

xcode-select --install

Homebrew

Now we can install Homebrew, a free package management system that simplifies the installation of software on the macOS operating system. To do this, open your Terminal app or whatever terminal emulator you use and enter:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Follow the command line prompts and enter your user password when instructed. By default, Homebrew is installed so that we can use the brew command without having to type sudo or provide a password.

Run brew doctor afterward to make sure the installation was successful and that Homebrew is working properly.

Claires-MacBook-Pro-2:Code clairesaint-donat$ brew doctor
Your system is ready to brew.

To make sure we have the latest Homebrew version and the latest formulas (which we should), we can run brew update && brew upgrade.

Now we are ready to install some software!

We can install Git with:

brew install git

Now we’re ready with Git and Homebrew!

Unix Shell

I like to have htop for monitoring system usage, so I install it with:

brew install htop

It’s also helpful to have wget which can be installed with:

brew install wget
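
A quick way to confirm that everything installed so far is reachable from the shell is to look each command up on the PATH. The sketch below is only an illustration: the list of command names is assumed from the steps above, and it needs a Python 3 interpreter (if you do not have one yet, come back to it after the Python section).

```python
import shutil

# Command names assumed from the steps above; adjust to match what you installed.
TOOLS = ["xcode-select", "brew", "git", "htop", "wget"]


def locate_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools(TOOLS).items():
        print(f"{name:<14}{path or 'NOT FOUND'}")
```

Any command reported as NOT FOUND either was not installed or lives in a directory that is missing from your PATH.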

At the moment, I do not have a particular Unix shell setup or set of dotfiles that I am attached to, so I will move on for now. Watch this space, as more steps will be added here shortly.

Vim

Vim is a highly configurable text editor built to make creating and changing any kind of text very efficient. It is already included with Apple OS X but we will make sure we have the most recent version installed and configure the editor to our tastes.

To install the latest version, use Homebrew with the following command:

brew install vim && brew install macvim
brew link macvim

Now we should have the most recent version of Vim installed (to check, type vim --version). We have also installed MacVim, a port of Vim to Mac OS X that is meant to look better and integrate more seamlessly with your Mac. The Homebrew command above installs both the mvim command line tool and the Mac application (which point to the same thing).

If you don’t already have a vimrc file that you like to use, I recommend the Ultimate Vim configuration. On the other hand, if you are just learning Vim and starting to incorporate it into your workflow, it might be a good idea to start with a blank vimrc and get comfortable with the basic functionality of the editor without any plugins. In my first few weeks of using Vim, I found it useful to add packages and new features to my workflow gradually.

If you would like to use the Ultimate vimrc, you can do so by cloning the repo and running the install script:

git clone --depth=1 https://github.com/amix/vimrc.git ~/.vim_runtime
sh ~/.vim_runtime/install_awesome_vimrc.sh

Python

I debated a lot on the best way to install Python on a system and ultimately landed on the Anaconda distribution. Anaconda is probably the most popular distribution for Data Science since it abstracts away many of the complexities associated with package management, helps keep dependencies updated and comes with many useful Python tools such as Jupyter Notebooks.

Install Anaconda

The easiest way to install the distribution is through the graphical installer.

  1. Go to the Anaconda website and choose a Python 3.x graphical installer (at the time of writing, I am installing Python 3.7). Select only one version of Python and do not install both! If you need to run a program in Python 2, virtual environments let you use a different version of Python for each project you are working on.

  2. Locate the download and double-click it.

  3. Click through by hitting “Continue” to install with default settings. Give your admin password when prompted.

  4. I chose not to install Microsoft VS Code, an editor with Python support that the installer offers, but feel free to install it if you would like to use it.

Note that when you install Anaconda this way, the installer automatically updates your bash profile to add anaconda3 to your PATH.

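You can also install Anaconda from the command line with the following commands:

# Go to the home directory
cd ~
# You can choose a different Anaconda version at
# https://repo.continuum.io/archive/
# -L follows redirects in case the archive URL has moved
curl -L https://repo.continuum.io/archive/Anaconda3-2018.12-MacOSX-x86_64.sh -o anaconda3.sh
# -b runs the installer non-interactively, -p sets the install location
bash anaconda3.sh -b -p ~/anaconda3
rm anaconda3.sh
# Use $HOME rather than ~ here, because ~ is not expanded inside double quotes
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bash_profile
# Reload the profile so the new PATH takes effect in this shell
source ~/.bash_profile
conda update conda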

Test your Installation

Note that you need to open a new Terminal window for the changes in your environment variables to take effect.

  1. Run python --version in a Terminal window to make sure you have installed the correct version of Python and that your PATH variable has updated correctly. You should get output like:

    Claires-MBP-2:~ clairesaint-donat$ python --version
    Python 3.7.1
  2. Type conda update conda to make sure that the conda command is working properly and that you are up to date.

  3. Another helpful test is to confirm the installation of Jupyter. Run the command jupyter notebook to see if a notebook instance launches.

If you have multiple versions of Python installed on your computer, these tests may pick up the wrong one, and you will need to update your .bash_profile so that the PATH points to the correct installation of Python.
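
To see which installations are competing, it can help to list every python executable on the PATH in the order the shell searches it. The following is a minimal sketch of that idea, not part of the Anaconda tooling; the helper name find_all_on_path is made up for this example.

```python
import os


def find_all_on_path(command, path=None):
    """Return every executable named `command` found on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            # Skip empty entries rather than treating them as the current directory.
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    # The first entry is the one the shell runs when you type `python`.
    for position, executable in enumerate(find_all_on_path("python"), start=1):
        print(position, executable)
```

If the Anaconda interpreter is not first in the list, its bin directory needs to appear earlier in the PATH set in your .bash_profile.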

Python “Hello World”
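
As a first script, something like the following can be saved as hello.py and run with python hello.py. It is a small sketch rather than a required step: it prints a greeting, shows which interpreter is running, and reports whether a few packages that Anaconda normally bundles can be imported. The package list is illustrative, and the helper name is_installed is made up for this example.

```python
import importlib.util
import sys

# Packages the Anaconda distribution normally bundles; this list is illustrative.
EXPECTED_PACKAGES = ["numpy", "pandas", "matplotlib"]


def is_installed(package):
    """Return True if `package` can be imported by this interpreter."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("Hello, World!")
    print("Python", sys.version.split()[0], "at", sys.executable)
    for package in EXPECTED_PACKAGES:
        status = "ok" if is_installed(package) else "missing"
        print(f"{package:<12}{status}")
```

If the interpreter path does not point into your anaconda3 directory, or a package shows as missing, the PATH is probably still picking up a different Python.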

R

Most people, myself included, install RStudio alongside R. RStudio is the most widely used IDE for R, and it is generally considered the easiest way to work with the language.

Install R and RStudio

There are several ways to do this but I choose to install using the command line and our newly-minted Homebrew.

  1. Type brew install r in the Terminal

  2. Set the locale in your R profile (the setting is R code, so it goes in ~/.Rprofile) with the following command: echo 'Sys.setlocale(category="LC_ALL", locale = "en_US.UTF-8")' >> ~/.Rprofile

  3. Install RStudio by entering brew install --cask rstudio (on older Homebrew versions, brew cask install rstudio)

Install R Packages & Change the RStudio environment

Open RStudio. In the left panel, you should have an R console and a terminal. In the console, you can type an R command and press Enter, and R will execute the command for you.

  1. To install the tidyverse, run install.packages("tidyverse", repos = 'https://cran.us.r-project.org')

  2. You can install a few more useful packages using the syntax

    install.packages("<package_name>")

    Some useful packages are:

    • XML: Read and write XML documents with R

    • jsonlite: Read and create JSON data tables with R

    • httr: A set of useful tools for working with HTTP connections

    • rvest: Very useful tool for web scraping

  3. I also like to change the editing colors of RStudio. You can do this by going to Tools > Global Options > Appearance. I personally like the Cobalt theme with Monaco font to reduce eye strain.

R “Hello, World!”

To check that everything works, try creating a simple plot in RStudio.

In the same console panel, load the ggplot2 library by typing library(ggplot2). Then enter the command:

ggplot(airquality, aes(x = Day, y = Ozone)) +
  geom_point()

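This tells R to use airquality, a built-in dataset, and plot Ozone against Day. The result should be a scatter plot with Day on the x-axis and Ozone on the y-axis. R will warn that rows with missing Ozone values were removed, which is expected.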

LaTeX

Prepare to set aside a good amount of time (about an hour) to install LaTeX. Since you will have to download a large file, a high-speed internet connection is advisable.

To install LaTeX applications on your Mac:

  1. Visit http://tug.org/mactex/ and click on the MacTeX download link. The file is about 3.2 GB, so it may take a little while (about 20 minutes) to download.

  2. Once the file has downloaded, double-click mactex.pkg to begin the installation.

  3. Read and accept the conditions and follow the on-screen instructions to install. The installation may take a few minutes.

  4. After the installation is complete, you can delete the mactex.pkg file.

Since I don’t write very advanced LaTeX anymore, I have found that the editors that come with the MacTeX download are sufficient. TeXShop is my go-to editor for most LaTeX documents.
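
In the same spirit as the Hello World checks above, a small test document can confirm that LaTeX compiles. The sketch below is only an illustration and makes a few assumptions: that MacTeX put pdflatex on your PATH, that a temporary directory is an acceptable place for the output, and that the file name hello.tex is as good as any. If pdflatex is not found, the generated file can be opened and typeset in TeXShop instead.

```python
import pathlib
import shutil
import subprocess
import tempfile

MINIMAL_DOCUMENT = r"""\documentclass{article}
\begin{document}
Hello, World! Inline math works too: $e^{i\pi} + 1 = 0$.
\end{document}
"""


def write_test_document(directory):
    """Write hello.tex into `directory` and return its path."""
    path = pathlib.Path(directory) / "hello.tex"
    path.write_text(MINIMAL_DOCUMENT)
    return path


if __name__ == "__main__":
    # Work in a fresh temporary directory so no existing files are touched.
    tex_file = write_test_document(tempfile.mkdtemp())
    if shutil.which("pdflatex") is None:
        print("pdflatex not found on PATH; open", tex_file, "in TeXShop instead")
    else:
        # -interaction=nonstopmode keeps pdflatex from pausing on errors.
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", tex_file.name],
            cwd=tex_file.parent,
            check=True,
        )
        print("PDF written to", tex_file.with_suffix(".pdf"))
```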