Should I get a data science degree? (Part III) — The road-maps
This is the third and last article in the “Should I get a data science degree?” series. In the first two articles (check out part I and part II), I compared the self-taught and classically trained routes into data science through the lens of my own experience: first self-learning through MOOCs and various online resources, then actually getting a data science degree. You can go back to those two articles if you want more details, but I promised that this one wouldn’t be about me or my experiences. Instead, it is a concrete roadmap for getting what I would call a “good data science base”.
Again, to each his or her own journey. This roadmap is just a backbone of subjects you can take a look at, swap out for similar content, or dismiss completely. I tried to include various resource types for each subject to account for different learning styles. I’m more of a visual/auditory learner: I like to watch videos and MOOCs and work on implementations, exercises, etc. I know that I’m a slow reader, and I often catch myself reading the same paragraph three times because I couldn’t focus the first two times, so I keep book reading for the ‘hard stuff’, i.e. proofs, math, theory, or paper implementations.
Ok, let’s get to the good stuff!
unsplash picture, this is javascript code… Nothing to do here….
Two roadmaps
I wrote two separate roadmaps :
- The KISS (keep it simple, stupid): The first one is for a total beginner or someone who wants to check what all the data science hype is about. Maybe you are an electrical engineer who wants to learn about this stuff, or a developer who wants to understand ML/DL before implementing a model. For beginners, I tried to focus on the basics, keep the math-y stuff to a minimum, and provide enough depth to understand all the sexy subjects like neural networks, clustering, or reinforcement learning (although RL might need more theory for a full understanding) at a deep (enough) level.
- The ALL-IN: The second one is for those going all in on data science. I tried to gather all the resources and structure them into a coherent roadmap that gets you to a master’s degree level in data science. I probably missed some important stuff, so please feel free to contact me if you see a glaring hole in this roadmap and I’ll update it accordingly.
THE KISS:
Data science is basically a combination of three fields: maths, programming, and statistical learning. The third can be viewed as the combination of the first two (as in computational statistical learning), but I keep it separate in my view. Let’s start with the fundamentals:
Foundations
Math prerequisite
Linear Algebra
- Essence of Linear Algebra — 3Blue1Brown: I highly recommend this youtube channel, Grant Sanderson has that amazing gift where he can show elegantly the beauty of maths with extreme simplicity. It feels like you could have created Linear algebra yourself!
Calculus:
- Essence of Calculus — 3Blue1Brown: again checkout this channel, Grant Sanderson is the man!
- Differential Calculus — Khan Academy
- Multivariable Calculus — Khan Academy
Statistics & Probability:
- Statistics and Probability
- Although it is not strictly necessary, I would also recommend diving a little deeper into probability and statistics:
- Probability Harvard 110
Books :
- I recommend one book (and it’s freely available) that covers all the mathematical foundations you need in a concise and beautifully linked manner: https://mml-book.github.io/
Programming
I find that a hands-on approach is the best way to learn programming. Python is a simple language whose design was influenced by ABC, a language designed for teaching programming. I don’t think you need to spend time learning programming language design, algorithms, and data structures at this point (if you want to, check out the ALL-IN roadmap). You can check out these resources for basic Python:
- LearnXinYminutes: https://learnxinyminutes.com/docs/python/
- Learn python the hard way: http://www.accorsi.net/docs/LearnPythonTheHardWay.pdf
- Get some understanding of object-oriented programming; at this stage I think you should get familiar with these concepts.
- Try to implement some basic python projects :
- Password Generator: In this project you will create a program to generate passwords for you, using the random module in Python3.
- Guess Game: In this project, you will create a guessing game, in which the user makes attempts to identify the number chosen by the computer.
- Tic Tac Toe Game: This is a two-player-based “tic tac toe game”, using various Python modules.
- Credit Card Validator: Simple implementation of the “Luhn Algorithm” or Mod 10 Algorithm, verifies a valid credit card number.
- A tiny bit of Matlab to get used to the syntax: https://learnxinyminutes.com/docs/matlab/
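To give an idea of the size of these beginner projects, here is a minimal sketch of the Credit Card Validator from the list above. The function name `luhn_is_valid` is my own choice, and the sketch only checks the Luhn (mod 10) checksum; it says nothing about whether a card actually exists.

```python
def luhn_is_valid(card_number: str) -> bool:
    """Return True if the digits in `card_number` pass the Luhn (mod 10) check."""
    # Keep only the digits so that spaces or dashes in the input are ignored.
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

You could call it as `luhn_is_valid("4111 1111 1111 1111")` on a test number and then wrap it in a small command-line prompt as an exercise.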
Machine Learning and Deep learning
- I highly recommend the Machine Learning course by Andrew Ng. It was the first MOOC I did, and it instantly hooked me on the beautiful field of machine learning. It is taught in Matlab.
- To move from Matlab to Python, I would recommend this MOOC: Applied Data Science with Python Specialization. It is somewhat slow, but it gives you a good base for using Python libraries like NumPy, Pandas, Matplotlib, scikit-learn, and some NLP libraries.
- I would also recommend this book: An Introduction to Statistical Learning
- Check out this 100-page book: The Hundred-Page Machine Learning Book
Now you can move on to the juicy stuff: deep learning. I recommend these MOOCs and books:
- Deep Learning Specialization: You will learn to implement neural networks from scratch using only NumPy as a dependency. This is a very powerful way to learn, and I would like to see it used more widely to teach neural networks, moving away from ready-made frameworks like TensorFlow or PyTorch when teaching this stuff for the first time. You also learn about CNNs, LSTMs, language models, and a lot of fun stuff …
- The second course I highly recommend is Practical Deep Learning for Coders: https://course.fast.ai/. If you have a good background in programming, take this course. The only downside is that it uses the Fast.ai DL library for teaching which can be a little bit confusing when first learning this stuff.
- I found that deep learning books either don’t add much understanding for technical implementations or are a bit too involved in the theory. I can’t recommend any books here (I didn’t read ALL the deep learning books out there, so please hit me up if you know some good ones). Feel free to check out the ALL-IN roadmap for book recommendations, but strap in when reading them.
At this stage, you can look at papers and try implementing some classical ones like Inception Net, LSTMs, or GANs. Try GANs, they are fun…
BONUS: Introduction to Deep Reinforcement learning
- Look at this article by Andrej Karpathy: Amazing Deep RL implementation of the game of pong using only Numpy!
THE ALL-IN :
Welcome to the ALL-IN roadmap, you are in for quite a ride. These resources are gathered to form a pretty strong theoretical and practical foundation for data scientists, and we are not taking shortcuts! It will take time to read, watch, and understand all these resources, so don’t get discouraged! It takes a decade to be good at something; we are all on a journey! You can either follow the links in order or choose a subject that interests you more, although I highly recommend sticking with the foundational stuff first before picking a subject.
Foundations
Math prerequisite
If you just need a refresher on some of these concepts, I highly recommend this book (and it’s freely available). It covers all the mathematical foundations you need in a concise and beautifully linked manner: https://mml-book.github.io/
Linear Algebra
- Essence of Linear Algebra — 3Blue1Brown:
- Linear Algebra by Gilbert Strang: try to check out the assignment problem set and work your way into difficult exercises. A good linear algebra understanding is very important in the ML curriculum.
Statistics & Probability:
- Probability Harvard 110 and the book Introduction to Probability. I sometimes find myself going back to refresh my memory on Adam’s or Eve’s law …
- The probability primer by mathematicalmonk on youtube is amazing for all you need-a-video learners out there. Probability taught the right way starting from measure theory
- If you need a Measure theory primer
- Statistics for Applications: taught by a French maths teacher at MIT, so you know you’ll do proofs … Good, that’s what you need at this stage!
- Concentration Inequalities: A Nonasymptotic Theory of Independence: You’ll thank me when you work on bandit proofs and want to understand where some obscure inequality step came from.
Calculus and Optimization:
- Essence of Calculus — 3Blue1Brown: again checkout this channel, Grant Sanderson is the man!
- Multivariable Calculus — Khan Academy: This is still a pretty good base
- I found this book written by Garret Thomas with a pretty good calculus and optimization chapter
- The Convex Optimization bible, deep breaths for this one …
Information Theory:
Coming to statistical learning from a computer science background, I think that information theory is a good tool to have in your bag. It will give you a lot of AHA moments when you see a KL-divergence-like loss definition.
- Information Theory mathematicalmonk
- Information Theory, Inference, and Learning Algorithms: Conventional courses on information theory cover theoretical ideas of Shannon and practical solutions to communication problems. This book goes further, bringing in Bayesian data modeling, Monte Carlo methods, variational methods, clustering algorithms, and neural networks…
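As a small taste of those AHA moments, here is a sketch of why a cross-entropy loss is a “KL-divergence-like” loss: for discrete distributions, cross-entropy equals the entropy of the target plus the KL divergence, so minimizing one over the model distribution minimizes the other. The function names and the example distributions are my own, and the sketch assumes the model assigns non-zero probability wherever the target does.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target distribution and model prediction over three classes.
target = [0.7, 0.2, 0.1]
model = [0.5, 0.3, 0.2]

# Identity: H(p, q) = H(p) + KL(p || q). The entropy term does not depend
# on the model, so the two losses differ only by a constant.
gap = cross_entropy(target, model) - (entropy(target) + kl_divergence(target, model))
```

Here `gap` should be zero up to floating-point error, and `kl_divergence(target, target)` is exactly zero.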
Time Series
I don’t like time series; I was coerced into moving this topic to the fundamentals section. Doing anything with 1D data is absolute witchcraft. Learning the fundamentals of time series wrangling before working with deep learning and ML prediction frameworks is probably the right thing to do.
- Time Series Analysis by James Hamilton is a THICCCK book (3 C here) but has everything from a theory standpoint
- I still have this open book on my bucket list for a more practical approach to time series analysis
Programming :
There are basically two views on programming for data scientists. The first is that you just need a reasonably good understanding of a programming language (Python in 99.9% of cases) and should focus on the research part of data science. The second is that a data scientist writes code, period! If you write code, you need to understand how to write efficient, low-overhead code. You need to understand the trade-offs you make when writing your ML pipeline. You need to understand how to leverage hardware acceleration, why GPUs are used so widely in data science, and all the distributed programming stuff. OK, I know that this takes time and you probably don’t need all of it. Skip it if you want or check it out; you might be an engineer at heart or a pure scientist!
Either way, here are the foundations for programming:
You need to learn Python, there is no way around this :
- LearnXinYminutes : https://learnxinyminutes.com/docs/python/
- Learn python the hard way: http://www.accorsi.net/docs/LearnPythonTheHardWay.pdf
- Now you can dive a little bit deeper into Python and understand subjects like Duck typing, internal data structures implementation, Functions as first-class objects, async programming and more: Fluent Python is a good resource
- Work on projects!! Check out this link for ideas.
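If terms like duck typing and functions as first-class objects sound abstract, here is a tiny sketch of both. All class and function names are made up for the illustration.

```python
class Duck:
    def speak(self):
        return "Quack"

class Robot:
    def speak(self):
        return "Beep"

def make_it_speak(thing):
    # Duck typing: no isinstance check, any object with a speak() method works.
    return thing.speak()

def apply_twice(func, value):
    # Functions are first-class objects: they can be passed around like any value.
    return func(func(value))

def add_three(x):
    return x + 3

sounds = [make_it_speak(thing) for thing in (Duck(), Robot())]
result = apply_twice(add_three, 1)
```

`Duck` and `Robot` share no base class, yet both work with `make_it_speak`, and `add_three` is handed to `apply_twice` just like a number would be.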
Learn SQL and NoSQL: Working with structured, real-life data usually requires querying databases, so you’ll need declarative query languages like SQL, as well as NoSQL query styles (like MongoDB’s syntax):
- Learning SQL: Generate, Manipulate, and Retrieve Data
- For practical data stuff: Practical SQL: A Beginner’s Guide to Storytelling with Data
- For MongoDB NoSQL, I only read the MongoDB online docs. I also found this book; to be completely honest, I never read it. It explains the NoSQL data modeling paradigm, which is pretty useful to have in mind if you ever find yourself making database design decisions.
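To see what “declarative” means in practice, here is a minimal sketch that uses Python’s built-in `sqlite3` module with an in-memory database. The table, columns, and values are invented for the example. The query states what result is wanted (an average per city), not how to loop over the rows.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("Paris", 12.0), ("Paris", 14.0), ("Oslo", 3.0), ("Oslo", 5.0)],
)

# Declarative: describe the result, let the database decide how to compute it.
rows = conn.execute(
    "SELECT city, AVG(temperature) "
    "FROM measurements "
    "GROUP BY city "
    "ORDER BY city"
).fetchall()
conn.close()
```

`rows` comes back as a list of `(city, average)` tuples, one per city, which you can feed straight into Pandas or plain Python.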
Algorithms and data structures: there are a lot of resources about this subject. I’m just trying to keep it as focused as possible. Spend your time working on problems directly using specific data structures and algorithms and less time worrying about the esoteric stuff. You’ll probably need a double-ended queue in your code at some point, learning that it exists is probably a good start :
- Algorithms Specialization: learn about O() notation, divide and conquer approaches, graph algorithms, dynamic programming …
- If you are a psycho you can read the bible: Introduction to Algorithms
- Try practicing these newly learned algorithms using online sites like: https://leetcode.com/
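Since the double-ended queue came up, here is a sketch of one classic practice problem where it shines: the maximum of every sliding window over a list, in linear time. It uses `collections.deque` from the standard library; the function name and the sample input are my own.

```python
from collections import deque

def sliding_window_max(values, window):
    """Return the maximum of each length-`window` slice of `values`."""
    result = []
    candidates = deque()  # indices whose values are in decreasing order
    for i, value in enumerate(values):
        # Drop candidates from the right that can never be a maximum again.
        while candidates and values[candidates[-1]] <= value:
            candidates.pop()
        candidates.append(i)
        # Drop the leftmost candidate once it has slid out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        # Start reporting once the first full window has been seen.
        if i >= window - 1:
            result.append(values[candidates[0]])
    return result

maxima = sliding_window_max([1, 3, -1, -3, 5, 3, 6, 7], 3)
```

Each index is appended and removed at most once, which is why popping cheaply from both ends matters here; a plain list would make `pop(0)` linear.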
Statistical Learning
Why can we learn from data? How much data do we need before getting accurate results? Can we learn from all data distributions? These are not trivial questions to answer and you’ll need some pretty advanced maths to have a solid framework to answer these types of questions.
- Highly recommend the lecture Machine Learning Theory by Shai Ben-David and his book Understanding Machine Learning
- Foundations of Machine Learning is a great book that treats the same subject of PAC learning with a different structure. I prefer Understanding Machine Learning, but to each his own…
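To make the “how much data do we need?” question concrete, here is a sketch of the simplest answer you will meet in PAC learning: for a finite hypothesis class, assuming some hypothesis in the class is perfect (the realizable case), about ln(|H|/δ)/ε i.i.d. examples are enough for empirical risk minimization to reach error at most ε with probability at least 1 − δ. The function name is mine, and this covers only that textbook special case, not general learning problems.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Sufficient sample size for a finite class in the realizable PAC setting.

    hypothesis_count: number of hypotheses in the class, |H|
    epsilon: target error, 0 < epsilon < 1
    delta: allowed failure probability, 0 < delta < 1
    """
    return math.ceil(math.log(hypothesis_count / delta) / epsilon)

# Hypothetical numbers: 1000 candidate models, 10% error, 95% confidence.
needed = pac_sample_size(1000, epsilon=0.1, delta=0.05)
```

Notice that the class size and the confidence only enter through a logarithm, while the accuracy enters as 1/ε, so halving the target error is far more expensive than doubling the number of hypotheses.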
Machine learning
Theory :
- Machine Learning youtube playlist is amazing!
- I recommend The Elements of Statistical Learning: I usually used it as a reference book for checking the details on various ML algorithms
Practice :
- I highly recommend the Machine Learning course by Andrew Ng. It was the first MOOC I did, and it instantly hooked me on the beautiful field of machine learning. It is taught in Matlab.
- To move from Matlab to Python, I would recommend this MOOC: Applied Data Science with Python Specialization. It is somewhat slow, but it gives you a good base for using Python libraries like NumPy, Pandas, Matplotlib, scikit-learn, and some NLP libraries.
Deep Learning
- Deep Learning Specialization: You will learn to implement neural networks from scratch using only NumPy as a dependency. This is a very powerful way to learn, and I would like to see it used more widely to teach neural networks, moving away from ready-made frameworks like TensorFlow or PyTorch when teaching this stuff for the first time. You also learn about CNNs, LSTMs, language models, and a lot of fun stuff …
- The second course I highly recommend is Practical Deep Learning for Coders: https://course.fast.ai/. If you have a good background in programming, take this course. The only downside is that it uses Fast.ai …
- Deep Learning by Ian Goodfellow: I usually read chapters from this book like blog articles. It’s a very dense book, but it gives some pretty cool nuggets about deep learning research.
Sadly, there isn’t a lot of theory about deep learning and neural networks :/. You should probably spend your time learning the basics of neural networks (forward and backpropagation, Jacobian-vector product implementation, CNNs, LSTMs, transformers…) and work on implementing widely used architectures and seminal papers in each field.
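To show what “forward and backpropagation from scratch” can look like, here is a minimal sketch of a two-layer network trained on XOR with NumPy only. Everything about it is an assumption for illustration: the toy dataset, the tanh hidden layer, the layer size, the learning rate, and the function names. It is not taken from any of the courses above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(hidden=8, steps=2000, lr=0.5, seed=0):
    """Train a tiny 2-layer network on XOR with hand-written backprop."""
    rng = np.random.default_rng(seed)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([[0.0], [1.0], [1.0], [0.0]])

    W1 = rng.normal(0.0, 1.0, size=(2, hidden))
    b1 = np.zeros((1, hidden))
    W2 = rng.normal(0.0, 1.0, size=(hidden, 1))
    b2 = np.zeros((1, 1))

    losses = []
    for _ in range(steps):
        # Forward pass: tanh hidden layer, sigmoid output.
        h = np.tanh(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)

        # Binary cross-entropy, clipped only to keep log() finite.
        p_safe = np.clip(p, 1e-12, 1.0 - 1e-12)
        loss = -np.mean(y * np.log(p_safe) + (1.0 - y) * np.log(1.0 - p_safe))
        losses.append(float(loss))

        # Backward pass: chain rule, layer by layer.
        dz2 = (p - y) / len(X)            # gradient w.r.t. output logits
        dW2 = h.T @ dz2
        db2 = dz2.sum(axis=0, keepdims=True)
        dh = dz2 @ W2.T                   # push the gradient through W2
        dz1 = dh * (1.0 - h ** 2)         # derivative of tanh
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0, keepdims=True)

        # Plain gradient descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    return losses, p
```

Calling `losses, predictions = train_xor()` returns the loss history and the predictions from the last forward pass. If the backward pass is right, the loss should go down over training; a good follow-up exercise is to check each gradient against a finite-difference estimate.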
Bandit Algorithms / Reinforcement Learning
Let’s get something straight: reinforcement learning is pretty hard. In fact, I found deep learning to be the easiest of the data science fields, whereas working through the upper-bound proofs of bandit algorithms is pretty tricky. You should have a good mathematical base for this stuff, but it is the most rewarding of all! You truly feel something amazing happening when your OpenAI model starts ‘learning’ how to play some weird game!
Bandit Algorithms:
- Bandit Algorithms by Tor Lattimore and Csaba Szepesvari
- All the lectures from Csaba Szepesvári are really good!
- Aleksandrs Slivkins paper: Introduction to Multi-Armed Bandits
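Before diving into the proofs, it helps to see how small a bandit algorithm is in code. Here is a sketch of UCB1 on simulated Bernoulli arms, using only the standard library. The arm means, the horizon, and the function name are made up for the example, and this is the basic textbook variant (index = empirical mean + sqrt(2 ln t / pulls)), not a tuned implementation.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total reward per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: pull every arm once
        else:
            # Optimism in the face of uncertainty: mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

pulls = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

The exploration bonus shrinks as an arm gets pulled more, so the algorithm gradually concentrates on the best arm. The upper-bound proofs in the resources above quantify how many pulls the suboptimal arms can receive.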
Reinforcement learning
- David Silver’s RL course, truly amazing!
- John Schulman’s YouTube lecture: Deep Reinforcement Learning. The video and audio quality are bad, but grow up and deal with it; the quality of the content is very good.
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto: I actually hate this book, but it is the gold standard in RL, so I read it.
Advanced Subjects
These subjects are either tangential to the data science field or viewed as ‘advanced’. I think that all this programming-related stuff is still very important to learn. Parallel/distributed programming is single-handedly the field that accelerated the deep learning revolution, so take a look at this stuff just to get a feel for what happens under the hood when you set n_jobs=-1 or use PyTorch distributed.
Learn C :
- C is “the” low-level language to learn, in my opinion (I hear you, all you Rust fans screaming right now). Above all, it helps you understand the underlying architecture and how memory is laid out. I recommend the one and only book: The C Programming Language
C++ (originally C with classes) is basically a bloated version of C with some nice abstractions, classes, and a ton of features I couldn’t explain if my life depended on it. It is still the gold standard in finance, embedded systems, and graphics … Check out:
- A tour of C++
- These lectures are pretty good if you prefer video style.
- C++ tooling is a complete nightmare. I like CMake; check out this YouTube playlist. I hate that you have to learn tools just to compile code correctly, but you can’t do much about that … (switch to Rust maybe?)
- Effective Modern C++: 42 Specific Ways to Improve Your Use of C++11 and C++14
- You probably found out that the Python you have been using is not so great (slow, …), but why? Only a handful of the data scientists I met could answer this question, even though they all repeated the same complaint. You should probably first understand how Python and the CPython interpreter work. This series of articles is amazing: Python behind the scenes.
CUDA / GPU programming:
Now that we have learned some “real” programming languages, we can start using some hardware accelerated stuff, and understand the power of parallelism.
CUDA is (sadly) the only player in town when it comes to GPU programming for ML. Nvidia is leaps and bounds ahead of the competition, and alternatives like OpenCL are not (yet) well supported by high-level frameworks like TensorFlow or PyTorch.
- Start with this guide.
- At the time of writing, in early 2022, you can either sell a kidney (if you still have two) and buy a GPU, or use Google Colab’s free GPU for testing CUDA code. It’s your choice.
Distributed programming
To learn data-parallel and model-parallel computing in machine learning, you’ll need some key concepts from parallel and distributed computing: the main trade-offs, the performance choices, and the issues that arise when designing a data-intensive, scalable machine learning solution.
- Scaling Up Machine Learning: Parallel and Distributed Approaches is “THE” book to read.
- Check out libraries like MPI and books on MPI and OpenMP like this one.
- Check out machine learning specific libraries like Dask and Ray.
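To make the data-parallel idea less abstract, here is a toy sketch in plain Python. Each “worker” computes the gradient on its own shard of the data, and the results are averaged, which for equal-sized shards matches the gradient on the full dataset. The workers are simulated with an ordinary loop; in a real setup that step would be handled by something like MPI, Ray, or PyTorch distributed. The one-parameter model, the data, and the function names are all made up for the illustration.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error for the model y ≈ w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_gradient(w, xs, ys, workers):
    """Average of per-shard gradients; assumes len(xs) is divisible by workers."""
    shard = len(xs) // workers
    partials = []
    for k in range(workers):          # stand-in for work done on separate machines
        lo, hi = k * shard, (k + 1) * shard
        partials.append(gradient(w, xs[lo:hi], ys[lo:hi]))
    # Averaging the partial results plays the role of an all-reduce step.
    return sum(partials) / workers

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]

full = gradient(0.5, xs, ys)
sharded = data_parallel_gradient(0.5, xs, ys, workers=4)
```

`full` and `sharded` agree up to floating-point error. What this sketch hides is exactly what the resources above cover: the communication cost of that averaging step is where the interesting trade-offs live.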
I think I’ll stop here; there are probably tens of thousands of resources out there. I really tried to focus on those that I liked, the ones that will give you the most bang for your buck in terms of fundamental understanding of each field, breadth and depth of knowledge, and links to the data science field.
I’ll try to keep this post updated with the most recent resources I stumble upon or that some of you share with me. This is the last article of the series. I really hope that my humble effort to depict my own data science journey, from telling my story to condensing the resources I read and watched during the last 3 years, can be a beacon of inspiration for those who gave their precious time to read these words. Thank you!