To become a data scientist, you have a formidable challenge ahead. You’ll need to master a variety of skills, ranging from machine learning to business analytics.
However, the rewards are worth it. Organizations will prize alchemists who can turn raw data into smarter decisions, better products, happier customers, and ultimately more profit. Plus, you’ll get to solve interesting problems and master new, impactful technologies.
If that sounds like a career you’d enjoy, then bookmark this page and read on because we compiled this list just for you.
1 Foundational Skills:
• Programming and Data Wrangling
• Statistics and Probability
2 Technical Skills:
• Data Collection
• Data Visualization
• Applied Machine Learning
3 Business Skills:
• Creativity and Innovation
• Operations and Strategy
• Business Analytics
4 Supplementary Skills:
• Natural Language Processing
• Recommendation Systems
• Time Series Analysis
• Problem Solving Challenges
*Note: Advanced, Niche, or Industry-Specific Skills
Certain roles might require other skills, such as:
Deep Learning, Big Data, Optimization, Anomaly Detection, Graph and Network Models, Quantitative Finance, Research Leadership, Project Management, Product Design, Software Engineering, Spatial Data Analysis, etc…
1. Foundational Skills
Foundational skills form the basis of true understanding, which will in turn allow you to discover novel solutions, build more accurate models, and make better decisions.
1.1. Programming and Data Wrangling
First, you’ll need to know at least one scripting language well enough to wrangle datasets, prototype models, and perform analyses.
We strongly recommend choosing between Python and R, as they are both open-source (free), widely adopted, and supported by active communities. Each has its own strengths, but it’s best to pick just one at the start.
• Python is more common in software startups, large tech firms, and adTech. Python tends to be more flexible because it’s a general purpose programming language. It’s also better for deep learning and processing data.
• R / RStudio is popular in research, finance, and analytics. R is a statistical programming language that has mature libraries for econometrics, statistics, and machine learning.
• We also provide a more detailed comparison of Python vs. R for data science.
If you’re still on the fence, we would recommend starting with Python due to its breadth and flexibility (and it’s a bit more beginner-friendly).
Tip: Each resource link below opens in a new tab, so you won’t lose your place.
• Learn Python the Hard Way (Online Book) – Recommended for beginners who want a complete course in programming with Python.
• LearnPython.org (Interactive Tutorial) – Short, interactive tutorial for those who just need a quick way to pick up Python syntax.
• How to Think Like a Computer Scientist (Interactive Book) – Interactive “CS 101” course taught in Python that really focuses on the art of problem solving. This goes beyond the bare minimum needed to get started, but it’s such a wonderful gem that we had to include it here.
• PythonChallenge.com (Online Puzzle) – Fun puzzle with 33 levels that you can solve with Python programming.
• How to Learn Python for Data Science, The Self-Starter Way – The guide that covers these resources in more detail.
R / RStudio Resources:
• R for Data Science (Online Book) – Recommended for beginners who want a complete course in data science with R.
• Swirl (Interactive R Package) – Very cool R package that you can install and learn the language directly from inside RStudio (the most common interface used to run R).
• Introduction to Data Science with R (Video Series) – For those who learn better by watching someone else walk through the steps.
1.2. Statistics and Probability
A strong statistics foundation helps you fully understand machine learning, conditional probability, A/B testing, and many other core skills. It also helps you “think like a data scientist,” which includes spotting biases, efficiently iterating on predictive models, and knowing how to extract insights from data.
Plus, learning the common probability distributions (especially Gaussian, Binomial, Uniform, Exponential, Poisson) is critical for implementing many real-world applications, such as multi-armed bandits, market-basket analyses, and anomaly detection programs.
• Statistics and Probability (Khan Academy) – Practical introduction to statistics and probability from Khan Academy. Recommended for getting up to speed quickly.
• Harvard Stats 110: Probability (Video Series) – Rigorous treatment of probability theory from Harvard. Recommended for building deeper mastery.
• Think Stats: Probability and Statistics for Programmers (PDF) – Excellent resource for those with programming backgrounds. Quote: “The thesis of this book is that if you know how to program, you can use that skill to help you understand probability and statistics.”
• Crash Course on Basic Statistics (PDF) – Short PDF that covers a whirlwind review of key topics. We like this review sheet because it has simple intuitive explanations for each concept.
• How to Learn Statistics for Data Science, The Self-Starter Way – The guide that covers these resources in more detail.
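To make the common distributions mentioned above more concrete, here is a minimal sketch that draws samples from each one using only Python's standard library and compares the sample mean with the theoretical mean. The parameter values, the sample size, and the helper names `binomial` and `poisson` are illustrative assumptions, not part of any of the resources listed.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same draws
N = 20_000       # draws per distribution (an arbitrary choice)


def binomial(n, p):
    """One Binomial(n, p) draw: count successes in n Bernoulli trials."""
    return sum(random.random() < p for _ in range(n))


def poisson(lam):
    """One Poisson(lam) draw using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, random.random()
    while product > threshold:
        k += 1
        product *= random.random()
    return k


# name -> (function that returns one draw, theoretical mean)
distributions = {
    "gaussian": (lambda: random.gauss(10, 2), 10.0),
    "binomial": (lambda: binomial(20, 0.3), 6.0),
    "uniform": (lambda: random.uniform(0, 1), 0.5),
    "exponential": (lambda: random.expovariate(2.0), 0.5),
    "poisson": (lambda: poisson(4.0), 4.0),
}

sample_means = {}
for name, (draw, expected) in distributions.items():
    sample_means[name] = statistics.fmean(draw() for _ in range(N))
    print(f"{name:12s} sample mean = {sample_means[name]:.3f} (theory: {expected})")
```

With enough draws, each sample mean should land close to its theoretical value, which is a quick way to check your intuition about a distribution before using it in a model.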
2. Technical Skills
Data science is all about converting raw data into insights, predictions, software, and so on. Therefore, you’ll need to be comfortable working with data.
Core technical skills include collecting, cleaning, managing, and visualizing data, plus the big umbrella of applied machine learning.
2.1. Data Collection
Everything hinges on the quality and quantity of your data. Just as a chemist needs the right chemicals, you’ll need relevant data.
There are 4 common ways to collect data:
1 Internal Data. This is proprietary data that your company collects through its operations or through partnerships with other providers. This is usually the most relevant data.
2 Searching Online. Need a labeled set of 8 million videos? There’s a webpage for that… Seriously, you’d be surprised at what you can find out there. Online datasets allow you to prototype before investing in proprietary data.
3 APIs. APIs allow you to programmatically (and legally) access datasets that other companies collect. You can find anything from Twitter feeds to weather data to financial data.
4 Web Scraping. Web crawling and scraping is a powerful tool that you must use responsibly. It opens a whole new world, but make sure to respect terms of service.
• Python: requests Quickstart Guide (Tutorial) – How to use the requests library to request data from APIs.
• R: httr Quickstart Guide (Tutorial) – How to use the httr library to request data from APIs.
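As a rough illustration of what working with an API involves, the sketch below builds a query URL and parses a JSON response body using only the standard library. The endpoint, the parameter names, and the response payload are all made up for illustration; a real API will document its own URL and fields, and in practice you would fetch the response over the network (for example with the requests library) instead of hard-coding it.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/v1/weather"


def build_url(base_url, **params):
    """Attach query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_readings(payload):
    """Turn a JSON response body into a list of (date, temperature) tuples."""
    data = json.loads(payload)
    return [(row["date"], row["temp_c"]) for row in data["readings"]]


url = build_url(BASE_URL, city="Boston", days=2)

# Made-up response body standing in for what such an API might return.
sample_response = (
    '{"readings": ['
    '{"date": "2024-01-01", "temp_c": -1.5}, '
    '{"date": "2024-01-02", "temp_c": 2.0}]}'
)

readings = parse_readings(sample_response)
print(url)
print(readings)
```

Most of the work after the request is the same regardless of the API: decode the JSON, pull out the fields you need, and reshape them into rows you can analyze.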
Web Scraping Resources:
• R: rvest (Tutorial) – Basic web scraping with the rvest library.
• Python Web Scraping Libraries – Our overview of the Python web scraping landscape.
SQL is the lingua franca for database management and querying, and you should be able to write complex queries.
Learning SQL also gives a better understanding of relational data in general (i.e. data in “table” format), which will improve your data analysis skills in any language.
• Intro to SQL by Khan Academy (Course) – Comprehensive video series that covers every important SQL topic.
• sqlcourse.com (Interactive Tutorial) – Great to use for review or as a quick crash course.
• SQL Fundamentals (Course) – Course that covers the basics of SQL. Includes quizzes along the way to test your understanding.
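For a feel of what querying relational data looks like, here is a small sketch that uses Python's built-in sqlite3 module with an in-memory database. The `customers` and `orders` tables and their contents are invented for this example; the query itself shows a join plus an aggregation, two building blocks of more complex SQL.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 50.0);
""")

# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL total into 0.
query = """
    SELECT c.name,
           COUNT(o.id) AS n_orders,
           COALESCE(SUM(o.amount), 0.0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC, c.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

The same join-then-aggregate pattern carries over to data frames in Python or R, which is part of why SQL practice pays off in other languages too.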
2.3. Data Visualization
Data visualization is important for exploratory analysis and for communicating your insights, and no list of data science resources would be complete without this topic.
Raw data can be difficult to interpret, so you’ll need to investigate trends and distributions with plots and charts.
• Data Visualization in Python (Video Series) – Tutorial on using the matplotlib library in Python.
• Data Visualization in R (Video Series) – Tutorial on using the ggplot2 library in R.
• Python Seaborn Tutorial – This is a tutorial for the seaborn library in Python, which we strongly recommend for beginners.
2.4. Applied Machine Learning
Machine learning is a broad umbrella term that contains many sub-tasks. In a nutshell, it’s about teaching computers how to learn patterns and models from data.
To some people, machine learning is synonymous with data science, but we consider it a separate field that heavily overlaps with data science. There’s no doubt that machine learning is a powerful toolset, and it’s the meatiest skill on this list.
• Machine Learning by Andrew Ng (Video Series) – This is the gold standard when it comes to learning the theory behind machine learning.
• Elements of Statistical Learning (PDF) – Reference text. This is one of the classic textbooks of the industry, but it requires a solid math background.
• An Introduction to Statistical Learning in R (PDF) – Reference text. Another classic textbook that has gentler math requirements.
• How to Learn Machine Learning, the Self-Starter Way – Our beginner-friendly overview of the machine learning landscape.
• Modern Machine Learning Algorithms: Strengths and Weaknesses – This is a concise tour of machine learning algorithms.
• Python Machine Learning Tutorial – This is an end-to-end tutorial for training your first model using Python’s Scikit-Learn library.
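To show what "learning a pattern from data" can mean in its simplest form, the sketch below fits a straight line to a handful of points with ordinary least squares, written in plain Python. This is just one basic technique chosen for illustration; the data is made up (it follows y = 2x + 1 exactly), and `fit_line` is a hypothetical helper, not a function from any library mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

# Use the learned parameters to predict y for an unseen x.
prediction = slope * 6 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction for x=6: {prediction:.2f}")
```

Real datasets are noisy and have many features, which is where libraries such as Scikit-Learn come in, but the workflow is the same: fit parameters on known examples, then predict on new ones.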
3. Business Skills
Business skills and soft skills are sometimes overlooked in data science curricula, but they are supremely important, and employers will look out for them.
Data science is never performed in a vacuum. You’ll need to anticipate business needs, think creatively about solutions, and communicate your insights clearly.
As machine learning libraries mature and algorithms become easier to use “out-of-the-box,” businesses will value people who can work with data and work with people. This section of our list of data science resources will help you stand out.
If a tree falls in a forest but no one is around to hear it, does it make a sound? If data is analyzed but no one can explain the results, does it really matter?
Effective communication skills are universal, but data scientists have the added challenge of discussing highly technical or mathematical topics.
During data scientist interviews, you’ll often be asked to “explain a technical concept to a layperson” or “describe a previous project you’ve worked on.” Employers will specifically look for clarity, conciseness, and organization.
• The best stats you’ve ever seen (TED Talk) – This is an iconic TED talk and a fun display of storytelling with data.
• Think Fast, Talk Smart (Video) – This is a workshop at the Stanford Graduate School of Business on how to overcome anxiety and speak spontaneously. Not only will this help you for the rest of your career, but it will also allow you to stand out during your interview.
• 7 Tips for Improving Communication (Video) – Simple, practical tips on how to communicate effectively on a daily basis.
• How to Win Friends and Influence People (PDF), (Free Audiobook Version) – This is a book we’d recommend for anyone, data scientist or not. While some of the verbiage is a bit dated, the teachings about interpersonal relationships are timeless.
• Practice teaching a technical concept to a friend – This will help you solidify your understanding of the concept while getting valuable communication practice. Try explaining an interesting machine learning algorithm, including its strengths, weaknesses, and proper use cases.
• Practice describing projects that you’ve completed – This will help you practice organizing the many moving parts of data science into coherent narratives.
3.2. Creativity & Innovation
Data scientists are hired to build new products, perform complex analyses, and invent valuable ways to use data.
In fact, they rarely solve the same problem twice. Even if you can apply the same methods to an adjacent dataset, you’ll need to be creative about feature engineering, supplemental data, and business implications.
You’ll naturally become a better creative thinker as you gain more experience, but the following resources can help jumpstart your problem-solving and innovation skills.
• Machine Intelligence and Data Products (Video) – Future-looking discussion of data products and data science.
• Machine Intelligence Landscape (Chart) – Venture capitalist’s perspective on the landscape of machine intelligence applications.
• The art of innovation (TED Talk) – Great TED talk on innovation by Guy Kawasaki.
• 7 steps of creative thinking (TED Talk) – Creative thinking tips from the perspective of a serial artist and entrepreneur.
• Working backwards to solve a problem (TED Talk) – Chess grandmaster Maurice Ashley on how to see the endgame and work backwards.
3.3. Business Operations and Strategy
Here’s a question you should ask yourself every day: “What are some ways I can improve this business?”
At the end of the day, companies don’t hire you to analyze data… they hire you to help them grow or become more profitable. This means that you should understand how data can help make better decisions and build better products.
• Data Driven Decisions (Video) – How to take business objectives, extract testable hypotheses from them, and then design experiments to evaluate.
• How to be data driven and build great products by DJ Patil (Video) – Lecture by DJ Patil before he became Chief Data Scientist of the USA.
• Big Data: New Tricks for Econometrics by Hal Varian (PDF) – Hal Varian, Chief Economist at Google, gives an excellent overview of the technology and methodology landscape for data analysis.
• How data will transform business (TED Talk) – Thought-provoking discussion of the relationship between business strategy and technology. Explains why the two long-standing theories of business strategy have become invalidated by the rise of big data.
• Victor Cheng’s Case Interview Workshop (Video Series) – Some employers like to ask consulting-style “case” questions during the interview. This is more common for Data Scientists in business operations, strategy, or analytics roles. This is an excellent crash course on tackling case interviews.
3.4. Business Analytics
Business analytic skills are critical for data scientists in operational roles. Python and R will allow you to perform more complex analyses than Excel can, thanks to the flexibility of programming languages.
After you master the technical tools, building strong domain knowledge will lead to greater business impact.
• Introduction to Business Analytics (Video) – Short and sweet intro to how businesses use analytics, including case studies.
• Marketing Metrics and Analytics (Video) – Introduction to common metrics and analytics methods used in marketing.
• Effective Cross-Selling using Market Basket Analysis (Tutorial) – How to do smarter cross-selling.
• An Intuitive Guide to A/B Testing (Video) – Overview of A/B testing and interpretation.
• 25 Examples of Business KPIs (Examples) – “What gets measured gets managed.” Here are 25 examples of business Key Performance Indicators (KPIs).
• Analytics Academy by Google (Courses) – Practical courses on digital analytics, e-commerce analytics, and other topics.
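A/B testing comes up repeatedly in business analytics, so here is a minimal sketch of one common way to interpret a test: a two-sided two-proportion z-test on conversion rates. The visitor and conversion counts are invented, `two_proportion_z_test` is a hypothetical helper name, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up results: variant A converts 200 of 5,000 visitors, variant B 250 of 5,000.
z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be chance alone, but the business decision still depends on effect size, cost, and how the experiment was run.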
4. Supplementary Skills
Supplementary Skills are more situational depending on the role, but they help you become a well-rounded data scientist. Here are data science resources for NLP, recommender systems, and time series analysis.
4.1. Natural Language Processing (NLP)
Natural Language Processing (NLP), or Text Mining, is an exciting sub-field of machine learning for extracting structure, grammar, and insights from text.
Famous applications include Sentiment Analysis, Article Classification, and even teaching a Neural Network to write Shakespeare.
• Stanford NLP (Video Series) – Full course on “traditional” Natural Language Processing, including sentiment analysis, Naive Bayes models, n-grams, etc.
• CS224D: Deep Learning for Natural Language Processing (Course), (Course materials here) – Introduction to the theory behind deep learning for NLP.
• Python NLP Libraries – Our overview of Python libraries for NLP. Once you have basic programming skills and a solid understanding of applied machine learning, you can actually jump straight here.
4.2. Recommendation Systems
Recommendation Systems, or Collaborative Filters, are one of the great success stories of data science, especially in e-Commerce.
They power many amazing websites and apps, including Amazon, Yelp, Netflix, and Spotify. In a nutshell, recommendation systems find other users who have similar tastes to you to make better recommendations for you. This produces a huge win-win by improving user experience while driving up revenue.
• Recommendation engine tutorial (Video Series) – Introduction to collaborative filters using Python. Does a very nice job of explaining the intuition behind the algorithm.
• Recommender Systems (Video Series) – Discussion of the theory and math behind collaborative filters by Andrew Ng. More math-heavy, and it’ll be easier to follow if you have some background with Linear Algebra.
• Collaborative Filtering with Python (Tutorial) – Reference tutorial that implements a music recommender system in Python.
• Collaborative Filtering with R (Tutorial) – The same tutorial as the previous one, except in R.
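The idea of finding users with similar tastes can be sketched in a few lines. The example below is a simplified user-based collaborative filter: it measures cosine similarity between users' ratings, then scores each item the target user has not rated by summing similarity × rating over the other users. The ratings, user names, and the `recommend` helper are all made up for illustration, and production systems use considerably more refined methods.

```python
from math import sqrt

# Made-up ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 4},
    "carol": {"A": 1, "C": 5, "E": 4},
}


def cosine_similarity(u, v):
    """Cosine similarity between two rating dicts (unrated items count as 0)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=2):
    """Rank unseen items by the similarity-weighted sum of other users' ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


print(recommend("alice", ratings))
```

Here alice's ratings line up much more closely with bob's than with carol's, so the item only bob has rated ends up ranked above the item only carol has rated.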
4.3. Time Series Analysis
Time Series Analysis deals with data series that are indexed by time. For example, stock prices, precipitation amounts, and Twitter hashtags by hour would all be considered time series. Time series analysis is commonly used in Finance, Forecasting, and Econometrics.
While much of machine learning deals with “cross-sectional data” (data without regard to differences in time), there are also models specifically designed to handle time series.
• Time Series (Course Material) – Lecture slides, homework, and R Code for the Time Series course at Oregon State University.
• The Little Book of R for Time Series (Online Book) – Very practical step-by-step introduction to using R for time series analysis. Includes code and outputs for each step.
• Time Series Forecasting with Python (Tutorial) – Tutorial on performing time series visualization, analysis, and forecasting with Python.
• Seasonal ARIMA with Python (Tutorial) – Introduction to ARIMA models in Python. Includes all code.
• Statistical forecasting, Fuqua School of Business (Online Book) – Course notes from the statistical forecasting course taught at the Fuqua School of Business at Duke University.
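As a small taste of working with time-indexed data, the sketch below smooths a series with a trailing moving average and uses the last smoothed value as a naive one-step forecast. This is a deliberately simple baseline, not one of the ARIMA models covered in the tutorials above; the series values and the window size are made up.

```python
def moving_average(series, window):
    """Trailing moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Made-up daily observations, ordered by time.
series = [10, 12, 14, 13, 15, 17, 16]

smoothed = moving_average(series, window=3)

# Naive baseline: predict the next value as the most recent smoothed value.
forecast = smoothed[-1]

print("smoothed:", smoothed)
print("next-step forecast:", forecast)
```

Unlike cross-sectional data, the order of the observations matters here: shuffling the series would change both the smoothed values and the forecast.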
Practice projects have two main purposes:
1 They help you solidify concepts and practice pulling together all the moving pieces of data science.
2 They arm you with something tangible to show employers. If a picture is worth 1000 words, a project is worth a million…
By nature, projects are personal undertakings, and you should pick topics you’re interested in. Here are a few places to find project ideas:
• 6 Fun Machine Learning Projects for Beginners – Our list of 6 fun machine learning project ideas for beginners.
• Predict Titanic Survival (Kaggle Competition) – Kaggle is a site that hosts data science competitions, many of which are beginner-friendly. The Titanic Survival Prediction challenge is a classic, with detailed tutorials for both Python and R.
• Hacker Rank (Programming Challenges) – Short programming challenges that are good for sharpening your skills without committing to a longer project.
And that’s a wrap! Hope you have gained some useful knowledge from this post.