Intro to Analytics Learning Guide
Last Updated: December 3, 2020
This is an evolving document. I regularly add new links, training, books, etc. to it, and new sections will be added over time. PLEASE share freely with others. If you have a question, suggestion or request, please reach out to me at [email protected]. Thank you.
This intro to analytics learning guide is a collection of training resources, links, books and practical advice in the field of analytics and data science. It is meant for people interested in learning more about analytics or data science as a career. The goal is to help people looking to transition from an analyst to a data scientist position, or from a different field altogether into analytics.
The field of data science is broad and complicated. Even people who have worked with data for years can get overwhelmed. This guide is not exhaustive but it attempts to introduce you to the variety of tools and techniques out there. This guide is based on my experience and skills which are admittedly pretty narrow.
The training guide is broken into two major sections: tools and techniques.
1. Tools - This section focuses on the technologies and programming languages that are required for a career in data science. Expertise or experience in all of these tools is not required or typically expected. Some are more valuable for basic analytics of small data (Excel), while others are more complex and scale to larger volumes of data and harder problems (Python, R). This list is of course incomplete.
2. Techniques - This section focuses on the analytical techniques that are available to derive insight from data regardless of the toolset. Some careers emphasize deep knowledge of one or two of these techniques, whereas others require a more diverse set of skills across many techniques. This includes different disciplines and careers within the analytics community. This list is also incomplete.
Learning tools without techniques is appealing but often incredibly limiting, especially as your career progresses. If you know how to do something (e.g., build a regression model in Python) without knowing why or when to do it, you may be in trouble. It would be like using a power saw to hammer in a nail. You might get the job done, but there are better tools and techniques for your problem. Understanding your problem and how to solve it is key.
There are also sections on general books about analytics, free data sets and a simple FAQs section. If you have anything to contribute to this guide please email me.
Tools
This is an intro to some of the key tools in data science. Some are open source programming languages (R, Python) and others are solutions that cost money to implement at scale (Tableau). Some of these tools are simple to learn while others have a steep learning curve. In the data science context, Excel is crawling, SQL is walking and R/Python is running. This list is not exhaustive.
Excel & Google Sheets
Excel and its friend Google Sheets are good introductory tools for analytics. Some people scoff at these tools but most analysis in the world is probably done in Excel. Both Excel and Google Sheets are very simple to learn and use. You can easily view, manipulate, aggregate, and visualize smaller volumes of data. You can even do some basic statistical analysis. The major limitations of both are scalability, reproducibility, quality control, and the ability to perform complex analysis.
* Chandoo - The best Excel learning resource. I used to visit it every day
* Policy Viz - Data viz best practices with an Excel focus
* Getting Started with Google Sheets - Thanks Andy R for the suggestion
* Data Camp - Data analysis in Excel (Data Camp was involved in a sexual assault scandal)
SQL
Structured Query Language (SQL) is used to manipulate data in order to create new tables/views, build reports or conduct analysis. It is fundamental to the ETL (extract, transform, load) process. Almost every analyst has some experience with SQL, depending on their role. The language is used in a variety of software applications and integrates with other languages like Python and R. It is foundational to performing most data analysis.
* Codecademy - Basic SQL
* Data Camp - SQL for data science, this is my favorite SQL course but... (Data Camp was involved in a sexual assault scandal)
* YouTube - SQL beginners to intermediate course
* YouTube - SQL for beginners
* Coursera - SQL for data science
* Lynda - SQL
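As a rough illustration of how SQL builds tables, views and reports, and how it integrates with Python, here is a minimal sketch that uses Python's built-in sqlite3 module. The sales table, the sales_by_region view and the sample rows are all hypothetical, invented for this example. Real projects usually run against a database server rather than an in-memory SQLite file.

```python
import sqlite3

# Hypothetical sample data: (region, amount) rows invented for illustration.
SALES = [
    ("East", 120.0),
    ("East", 80.0),
    ("West", 200.0),
    ("West", 50.0),
    ("North", 75.0),
]


def build_sales_report(rows):
    """Load rows into an in-memory SQLite database and return totals per region."""
    conn = sqlite3.connect(":memory:")
    try:
        # Load step: create a table and insert the raw rows.
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)

        # Transform step: a view that aggregates the raw table by region.
        conn.execute(
            """
            CREATE VIEW sales_by_region AS
            SELECT region, SUM(amount) AS total, COUNT(*) AS orders
            FROM sales
            GROUP BY region
            """
        )

        # Report step: query the view, largest total first.
        cursor = conn.execute(
            "SELECT region, total, orders FROM sales_by_region ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for region, total, orders in build_sales_report(SALES):
        print(f"{region}: {total:.2f} across {orders} orders")
```

The same CREATE TABLE, CREATE VIEW and GROUP BY statements carry over to most other SQL databases with little or no change, which is part of why SQL is worth learning early.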
R
R is a programming language used by statisticians and data scientists to conduct data transformation, analysis and visualization. R is awesome. It is very flexible and powerful. The learning curve is generally steeper than for SQL. R vs Python is a common discussion in the analytics community. They have similar capabilities from a data science perspective. See comparison here. I love R.
* R and RStudio - Download both to begin using. RStudio is a user interface (an IDE) for R.
* R Packages – R runs on open source packages. This is where you download them.
* DataCamp - Intro to R (Data Camp was involved in a sexual assault scandal).
* Coursera Data Science Course – U Washington data science course.
* Coursera Data Analysis Course – JHU course. I hear it’s really good.
* Coursera Computing for Data Analysis Course – JHU course. I don’t know how this one is.
* R YouTube Series – This is similar to the Coursera course from Roger Peng
* Khan Academy - Great site for free courses on a variety of topics not just analytics
* Udacity – Another good site for courses
* Stat Methods - Great site I use all the time for R code snippets
* MIT R site - Another good one
* R-bloggers - Good general blog for R users.
* Flowing Data – one of my favorite blogs with R code
* Naïve Bayes in R and another one
* Text mining in R – Haven’t used yet but looks interesting
* Data mining examples in R
* Some general R scripts with actual code
Python
Python is a general-purpose programming language that data scientists use for data transformation, analysis and visualization. It is very robust and used in diverse areas by software engineers, data scientists, data engineers, etc. Python is more code heavy compared to R. Thank you Kiruthika Sankaran for contributing to this section.
* Installation - I usually prefer Jupyter Notebooks for coding in Python because they combine code, comments and any ideas you have in one nicely formatted document.
* Install the Anaconda distribution, which includes Jupyter Notebook - Windows guide, Mac guide
* Make sure pip is installed, then install every Python package with pip install package_name. DO NOT use the conda install method (it causes a lot of nightmares!)
* If your Mac ships with Python 2 as the default, Python 3 needs to be set as the default. You can do that by opening a terminal and typing vi ~/.bash_profile (or vi ~/.zshrc if your terminal uses zsh)
* Leave whatever was already written in the file during the Anaconda installation
* At the bottom of the file, add the line alias python='python3'
* Save and close the file by pressing ESC and typing :wq
* Refer here
* Helpful guide on installing pyenv and managing versions on a Mac. Thanks Alayna G for the recommendation.
* Udemy Python Bootcamp - Most of the Udemy courses are not that good, but this one is the best introductory Python data science course. It starts with installation, goes step by step through data types, data analysis and data visualization, and then introduces machine learning concepts. The instructor also has a similar bootcamp in R (which I have not tried)
* Python Data Camp - I never took this course but Data Camp in general is a great learning resource.
* Data Camp was involved in a sexual assault scandal
* YouTube Data School - Tons of great introductory and advanced python courses. Thx Rob G for the recommendation.
* Try practicing questions from w3resource to get the hang of Python programming - choose a topic and practice 10-15 problems a day. (It has SQL exercises also, really helpful for interviews)
* Useful packages in Python
* NumPy, pandas - for data analysis
* Matplotlib, Seaborn - for data visualization
* scikit-learn (imported as sklearn) - for machine learning
* NLTK - for text mining
* Other things to learn include lists, dictionaries, sets, tuples, apply and lambda functions
* GeeksforGeeks - my go-to website after Stack Overflow
* https://towardsdatascience.com/collecting-data-science-cheat-sheets-d2cdff092855 - Cheatsheets are super helpful (R cheat sheets are super handy)
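To make the built-in pieces from the list above more concrete, here is a small sketch using only the Python standard library. It checks that the script is running under Python 3 (the point of the alias step in the installation notes) and then uses a list of tuples, a set, a dictionary and a lambda function on some made-up records. The names RECORDS, running_python3 and summarize are hypothetical and exist only for this example. The pandas apply method is not shown because it needs pandas installed.

```python
import sys

# Hypothetical records invented for illustration: (name, team, score) tuples.
RECORDS = [
    ("Ana", "blue", 82),
    ("Ben", "red", 91),
    ("Cy", "blue", 77),
    ("Dee", "red", 69),
]


def running_python3():
    """Return True when the interpreter executing this script is Python 3 or later."""
    return sys.version_info.major >= 3


def summarize(records):
    """Return (set of teams, average score per team, names of the top two scorers)."""
    # Set comprehension: each team appears once, however many records mention it.
    teams = {team for _name, team, _score in records}

    # Dictionary: map each team to the list of its scores.
    scores_by_team = {}
    for _name, team, score in records:
        scores_by_team.setdefault(team, []).append(score)
    averages = {
        team: sum(scores) / len(scores) for team, scores in scores_by_team.items()
    }

    # Lambda function as a sort key: order the tuples by score, highest first.
    ranked = sorted(records, key=lambda record: record[2], reverse=True)
    top_names = [name for name, _team, _score in ranked[:2]]

    return teams, averages, top_names


if __name__ == "__main__":
    print("Python 3 interpreter:", running_python3())
    teams, averages, top_names = summarize(RECORDS)
    print("Teams:", sorted(teams))
    print("Average score by team:", averages)
    print("Top two scorers:", top_names)
```

Once these basics feel comfortable, the same ideas reappear in pandas, where a DataFrame column plays the role of the list and apply takes a lambda much like sorted does here.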
Tableau
Tableau is a simple, drag and drop tool for creating data visualizations, maps and dashboards. It is easy to learn and driven by a great community. Alternatives include Qlik, Power BI (Microsoft) and Looker (Google). But I love Tableau.
In my opinion the best way to learn Tableau is by doing it. And the easiest way to learn by doing is to participate in Makeover Monday. It is a free and open source learning community. Every week a dataset is shared and people rebuild a visualization in Tableau (or any tool really). People share their vizzes, which you can download and reverse engineer. Download Tableau Public, which is free, and use the free videos to get started. Create a Tableau Public profile. Mine here.
* Tableau Public (free) - download
* Similar in most ways to Tableau Desktop. Free. But workbooks can only be saved online publicly, so it is not a good fit for professional purposes.
* Download Tableau public workbooks and reverse engineer
* Makeover Monday - Free community collaboration with clean data
* This is arguably the best way to get started using Tableau. Data sets are provided. A community is built in. A lot of positive support. Tons of examples.
* How to get s