Should I get a data science degree? (Part III) — The road-maps

This is the third and last article in the “Should I get a data science degree?” series. In the first two articles (check out part I and part II), I gave my perspective on self-taught versus classically trained data science. I tried to answer the question through the lens of my own experience: first self-learning through MOOCs and various online resources, then actually getting a data science degree. You can go back to those two articles if you want more details, but I promised that this article wouldn’t be about me or my experiences. Instead, it is a concrete roadmap to get what I would call a “good data science base”.

Again, to each their own journey. This roadmap is just a backbone of subjects you can take a look at, swap out for similar content, or completely dismiss. I tried to include various resource types for the same subject to account for different learning styles. I’m more of a visual/auditory learner: I like to watch videos and MOOCs and work on implementations, exercises, etc. I know that I’m a slow reader, and I often catch myself reading the same paragraph three times because I couldn’t focus the first two times, so I keep book reading for the ‘hard stuff’, i.e. proofs, math, theory, or paper implementations.

Ok, let’s get to the good stuff!


Two roadmaps

I wrote two separate roadmaps :

  • The KISS (keep it simple, stupid): The first one is for a total beginner or someone who wants to check what all the data science hype is about. Maybe you are an electrical engineer who wants to learn about this stuff, or a developer who wants to understand ML/DL before implementing a model. For beginners, I tried to focus on the basics, keep the math-y stuff to a minimum, and provide enough depth to understand all the sexy subjects like Neural Networks, Clustering, or Reinforcement Learning at a deep (enough) level (although a full understanding of RL might need more theory).
  • The ALL-IN: The second one is for those going all in on data science. I tried to gather all the resources and structure them into a coherent roadmap that gets you to a master’s degree level in data science. I probably missed some important stuff, so please feel free to contact me if you see a glaring hole in this roadmap and I’ll update it accordingly.

THE KISS:

Data science is basically a combination of three fields: Maths, Programming, and Statistical learning. The third can be viewed as the combination of the first two (as in computational statistical learning), but I keep it separate in my view. Let’s start with the fundamentals :

Foundations

Math prerequisite

Linear Algebra

  • Essence of Linear Algebra — 3Blue1Brown: I highly recommend this youtube channel, Grant Sanderson has that amazing gift where he can show elegantly the beauty of maths with extreme simplicity. It feels like you could have created Linear algebra yourself!

Calculus:

Statistics & Probability:

Books :

  • I recommend one book (and it’s freely available). It has all the mathematical foundation you need in a concise and beautifully linked manner: https://mml-book.github.io/

Programming

I find that a hands-on approach is the best way to learn programming. Python is a simple language that grew out of the ABC programming language, which was designed for teaching programming. I don’t think you need to spend time learning programming language design, algorithms, and data structures at this stage (check out the ALL-IN roadmap if you want those). You can check out these resources for basic Python :

  1. Password Generator: In this project you will create a program to generate passwords for you, using the random module in Python3.
  2. Guess Game: In this project, you will create a guessing game, in which the user makes attempts to identify the number chosen by the computer.
  3. Tic Tac Toe Game: This is a two-player-based “tic tac toe game”, using various Python modules.
  4. Credit Card Validator: Simple implementation of the “Luhn Algorithm” or Mod 10 Algorithm, verifies a valid credit card number.
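To give a feel for the size of these projects, here is a minimal sketch of the Luhn (Mod 10) check from the last item. The function name `luhn_is_valid` and the choice to ignore spaces and dashes are my own assumptions for illustration, not part of any particular tutorial.

```python
def luhn_is_valid(card_number: str) -> bool:
    """Return True if card_number passes the Luhn (mod 10) check."""
    # Assumption: spaces and dashes are allowed as separators and ignored.
    cleaned = card_number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_is_valid("4111 1111 1111 1111"))
```

Note that this only checks the checksum: a number can pass the Luhn test without belonging to a real card.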

Machine Learning and Deep learning

Now you can move on to the juicy stuff: Deep learning. I recommend these MOOCs and books:

  • Deep Learning Specialization: You will learn to implement neural networks from scratch using only NumPy as a dependency. This is a very powerful approach, and I would like to see it used more widely to teach neural networks, moving away from ready-made frameworks like TensorFlow or PyTorch when teaching this stuff for the first time. You also learn about CNNs, LSTMs, language models, and a lot of fun stuff …
  • The second course I highly recommend is Practical Deep Learning for Coders: https://course.fast.ai/. If you have a good background in programming, take this course. The only downside is that it uses the Fast.ai DL library for teaching which can be a little bit confusing when first learning this stuff.
  • I found that deep learning books either don’t add much understanding beyond the technical implementations or are a bit more involved on the theory side. I can’t recommend any books here (I didn’t read ALL the deep learning books out there, so please hit me up if you know some good ones). Feel free to check out the ALL-IN roadmap for book recommendations, but strap in when reading them.

At this stage, you can look at papers and try implementing some classical ones like Inception Net, LSTMs, or GANs. Try GANs, they are fun…

BONUS: Introduction to Deep Reinforcement learning

  • Look at this article by Andrej Karpathy: an amazing Deep RL implementation for the game of Pong using only NumPy!

THE ALL-IN :

Welcome to the ALL-IN roadmap, you are in for quite a ride. These resources are gathered to form a pretty strong theoretical and practical foundation for data scientists, and we are not taking shortcuts! It will take time to read, watch, and understand all these resources, so don’t get discouraged! It takes a decade to be good at something, and we are all on a journey! You can either follow the links in order or choose a subject that interests you more, although I highly recommend sticking with the foundational stuff first, before picking a subject.

Foundations

Math prerequisite

If you just need a refresher on some of these concepts, I highly recommend this book (and it’s freely available). It has all the mathematical foundation you need in a concise and beautifully linked manner: https://mml-book.github.io/

Linear Algebra

Statistics & Probability:

Calculus and Optimization:

Information Theory:

Coming from a computer science background to statistical learning, I think that information theory is a good tool to have in your bag. It will give you a lot of AHA moments when you see a loss defined in terms of something like the KL divergence.
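One of those AHA moments can be sketched in a few lines of NumPy: the cross-entropy loss used for classification is the entropy of the target distribution plus the KL divergence from the target to the prediction. The function names below are mine, and the sketch assumes discrete distributions given as arrays that sum to 1, with the prediction strictly positive wherever the target is.

```python
import numpy as np


def entropy(p):
    """Shannon entropy H(p) in nats; terms with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]
    prediction = [0.5, 0.3, 0.2]
    # H(p, q) = H(p) + KL(p || q): the two printed values should agree.
    print(cross_entropy(target, prediction))
    print(entropy(target) + kl_divergence(target, prediction))
```

With a one-hot target, the entropy term is zero, so minimizing cross-entropy is the same as minimizing the KL divergence to the label distribution.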

Time Series

I don’t like time series. I was coerced into moving this topic into the fundamentals section; doing anything with 1D data is absolute witchcraft. Learning the fundamentals of time series wrangling before working with deep learning and ML prediction frameworks is probably the right thing to do.

Programming :

There are basically two views on programming for data scientists. The first one is that you just need a somewhat good understanding of a programming language (specifically Python in 99.9% of cases) so you can focus on the research part of data science. The second view is that a data scientist writes code, period! If you write code, you need to understand how to write efficient, low-overhead code. You need to understand the tradeoffs you make when writing your ML pipeline. You need to understand how to leverage hardware acceleration, why GPUs are used so widely in data science, and all the distributed programming stuff. Ok, I know that this takes time and you probably don’t need all of it. Skip it if you want or check it out: you might be an engineer at heart or a pure scientist!

Either way, here are the foundations for programming:

You need to learn Python, there is no way around this :

Learn SQL and NoSQL : Working with structured, real-life data usually requires querying databases, so you’ll need a declarative language like SQL as well as NoSQL query languages (like the MongoDB syntax style) :
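As a small taste of what “declarative” means, here is a sketch that uses Python’s built-in `sqlite3` module with an in-memory database. The `orders` table, its columns, and the function name are made up for illustration.

```python
import sqlite3


def total_per_customer(orders):
    """Sum order amounts per customer with a SQL GROUP BY query.

    `orders` is a list of (customer, amount) tuples (hypothetical data).
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Parameterized inserts: never build SQL by string concatenation.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
        query = """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            ORDER BY total DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)]
    print(total_per_customer(sample))
```

You describe the result you want (totals per customer, largest first) and the database engine decides how to compute it.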

Algorithms and data structures: there are a lot of resources about this subject. I’m just trying to keep it as focused as possible. Spend your time working on problems directly using specific data structures and algorithms and less time worrying about the esoteric stuff. You’ll probably need a double-ended queue in your code at some point, learning that it exists is probably a good start :
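For instance, Python ships a double-ended queue as `collections.deque`. The sketch below uses it for a classic exercise, the sliding-window maximum, which needs cheap pushes and pops at both ends. The function name and the choice of exercise are mine.

```python
from collections import deque


def sliding_window_max(values, window):
    """Return the maximum of each contiguous window of the given size."""
    if window <= 0:
        raise ValueError("window must be positive")

    candidates = deque()  # indices whose values are in decreasing order
    result = []
    for i, value in enumerate(values):
        # Drop candidates from the right that can never be a maximum again.
        while candidates and values[candidates[-1]] <= value:
            candidates.pop()
        candidates.append(i)
        # Drop the leftmost candidate once it falls out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        if i >= window - 1:
            result.append(values[candidates[0]])
    return result


if __name__ == "__main__":
    print(sliding_window_max([1, 3, -1, -3, 5, 3, 6, 7], 3))
```

Each index is appended and removed at most once, so the whole pass is linear in the input length, which a plain list with `pop(0)` would not give you.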

Statistical Learning

Why can we learn from data? How much data do we need before getting accurate results? Can we learn from all data distributions? These are not trivial questions to answer and you’ll need some pretty advanced maths to have a solid framework to answer these types of questions.

Machine learning

Theory :

Practice :

  • I highly recommend the Machine Learning course by Andrew Ng. It was the first MOOC I did, and it instantly hooked me on the beautiful field of machine learning. It’s taught in Matlab, but I think you :
  • To move from Matlab to Python, I would recommend this MOOC: Applied Data Science with Python Specialization. It is somewhat slow, but it gives you a good base for using Python libraries like NumPy, Pandas, Matplotlib, scikit-learn, and some NLP libraries.

Deep Learning

  • Deep Learning Specialization: You will learn to implement neural networks from scratch using only NumPy as a dependency. This is a very powerful approach, and I would like to see it used more widely to teach neural networks, moving away from ready-made frameworks like TensorFlow or PyTorch when teaching this stuff for the first time. You also learn about CNNs, LSTMs, language models, and a lot of fun stuff …
  • The second course I highly recommend is Practical Deep Learning for Coders: https://course.fast.ai/. If you have a good background in programming, take this course. The only downside is that it uses Fast.ai.
  • Deep Learning by Ian Goodfellow: I usually read chapters of this book like blog articles. It’s a very dense book, but it gives some pretty cool nuggets about deep learning research.

Sadly, there isn’t a lot of theory about deep learning and neural networks :/. You should probably spend your time learning the basics of neural networks (forward and backpropagation, Jacobian-vector product implementation, CNNs, LSTMs, transformers…) and working on implementing widely used architectures and seminal papers in each field.
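To make “forward and backpropagation” concrete, here is a minimal sketch of a one-hidden-layer network in plain NumPy. The architecture (tanh hidden layer, mean squared error), the function names, and the hyperparameters are my own illustrative choices, not taken from any of the courses above.

```python
import numpy as np


def init_params(n_in, n_hidden, n_out, rng):
    """Small random weights and zero biases for a one-hidden-layer net."""
    W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, n_out))
    b2 = np.zeros(n_out)
    return [W1, b1, W2, b2]


def loss_and_grads(params, x, y):
    """Forward pass, then backpropagation of the mean squared error."""
    W1, b1, W2, b2 = params
    n = x.shape[0]

    # Forward pass
    h = np.tanh(x @ W1 + b1)       # hidden activations, shape (n, n_hidden)
    y_hat = h @ W2 + b2            # predictions, shape (n, n_out)
    diff = y_hat - y
    loss = 0.5 * np.sum(diff ** 2) / n

    # Backward pass: apply the chain rule layer by layer
    d_y_hat = diff / n
    dW2 = h.T @ d_y_hat
    db2 = d_y_hat.sum(axis=0)
    dh = d_y_hat @ W2.T
    dz = dh * (1.0 - h ** 2)       # derivative of tanh is 1 - tanh^2
    dW1 = x.T @ dz
    db1 = dz.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]


def train(params, x, y, lr=0.05, steps=200):
    """Plain gradient descent; returns final params and the loss history."""
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, x, y)
        losses.append(loss)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 3))
    y = rng.normal(size=(5, 2))
    _, losses = train(init_params(3, 4, 2, rng), x, y)
    print(losses[0], losses[-1])
```

A good habit when writing backpropagation by hand is to compare a few analytic gradient entries against finite differences before trusting the training loop.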

Bandit Algorithms /Reinforcement Learning

Let’s get something straight: Reinforcement learning is pretty hard. In fact, I found that deep learning is the easiest of the data science fields, whereas working through the upper-bound proofs for bandit algorithms is pretty tricky. You should have a good mathematical base for this stuff, but it is the most rewarding of all! You truly feel something amazing happening when your OpenAI model starts ‘learning’ how to play some weird game!

Bandits Algorithms:
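To get a feel for the topic before diving into the proofs, here is a minimal sketch of UCB1, a classic bandit algorithm, on simulated Bernoulli arms. I picked UCB1 as the example; the function name, the arm probabilities, and the seed are assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled.

    arm_means holds the (hidden) success probability of each arm; the
    algorithm only ever sees the sampled 0/1 rewards.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # total reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # pull every arm once to initialize
        else:
            # Empirical mean plus an exploration bonus that shrinks
            # as an arm gets pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(ucb1([0.1, 0.9], horizon=2000))
```

The exploration bonus is what the upper-bound proofs reason about: it guarantees that a bad arm is only pulled a logarithmic number of times in expectation.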

Reinforcement learning

Advanced Subjects

These subjects are either tangential to the data science field or viewed as ‘advanced’, but I think that all this programming-related stuff is still very important to learn. Parallel/distributed programming is single-handedly the field that accelerated the deep learning revolution, so take a look at this stuff just to get a feel for what happens under the hood when you pass n_jobs=-1 or use PyTorch distributed.

Learn C :

  • C is “The” low-level language to learn in my opinion (I hear you, all you Rust fans screaming right now). Above all, it helps you understand how memory is laid out. I recommend the one and only book: The C Programming Language

C++ (originally C with classes) is basically a bloated version of C with some nice abstractions, classes, and a ton of features I couldn’t explain if my life depended on it. It is still the gold standard in Finance, Embedded Systems, and Graphics … Check out :

  • A tour of C++
  • These lectures are pretty good if you prefer video style.
  • C++ tooling is a complete nightmare. I like CMake, check out this YouTube playlist. I hate that you have to learn tools just to compile code correctly, but you can’t do much about that … (switch to Rust maybe?)
  • Effective Modern C++: 42 Specific Ways to Improve Your Use of C++11 and C++14
  • You probably found out that the Python you have been using is not so great (slow, …), but why? Only a handful of the data scientists I met could answer this question, even though they all repeated the same complaint. You should probably first understand how Python and the CPython interpreter work. This series of articles is amazing: Python behind the scenes.

CUDA / GPU programming:

Now that we have learned some “real” programming languages, we can start using some hardware-accelerated stuff and understand the power of parallelism.

CUDA is (sadly) the only player in town when it comes to ML GPU programming. Nvidia is leaps and bounds ahead of the competition, and alternatives like OpenCL are not (yet) well supported by high-level frameworks like TF or PyTorch.

  • Start with this guide.
  • At the time of writing, in early 2022, you can either sell a kidney (if you still have two) and buy a GPU, or use Google Colab’s free GPU for testing CUDA code. It’s your choice.

Distributed programming

To learn data-parallel and model-parallel computing in machine learning, you’ll need some key concepts in parallel and distributed computing: the main trade-offs, the performance choices, and the issues that come up when designing a data-intensive, scalable machine learning solution.

  • Scaling Up Machine Learning: Parallel and Distributed Approaches is “THE” book to read.
  • Check out MPI and OpenMP, and books on them like this one.
  • Check out machine learning specific libraries like Dask and Ray.

I think I’ll stop here, there are probably tens of thousands of resources out there. I really tried to focus on the ones I liked and the ones that will give you the most bang for your buck in terms of fundamental understanding of each field, breadth and depth of knowledge, and relevance to data science.

I’ll try to keep this post updated with the most recent resources I stumble upon or that some of you share with me. This is the last article of this series. I tried to depict my own data science journey, from telling my story to gathering, in a condensed format, the resources I read and watched during the last 3 years. I really hope this humble effort can be a beacon of inspiration for those who gave their precious time to read these words. Thank you!