Should I get a data science degree? (Part II)
This article is the second part of a three-part series on self-taught vs “classically” taught data scientists, from my perspective. In the first part, I wrote about my journey from hating my job to starting a self-taught data science journey. I gave some brief insight into the online resources that helped me along the way, and I also took some time to explain my background and why I’m writing this series. I won’t repeat those things, so if you stumbled upon this article, feel free to read Part I or just guess most of my life up to this point, whichever is easiest.
Another unsplash image for medium article points
So, I joined the Master in Data Science at ENSAE after spending two years on my self-composed data science curriculum, 3 years of writing PowerPoint presentations and going to useless meetings as a consultant, and 4 years of ‘general’ engineering studies at Telecom. I recall being extremely skeptical and anxious about going back to the classroom. Did I make the best choice? Would I have been better off continuing to work on my startup full time? Or just continuing with my personal data science curriculum? The thing is, I had made that choice and I was going to find out. You win or you learn.
The first month of the curriculum was dedicated to the ‘basics’. It was a pretty intense month where you have to attend condensed versions of the fundamental courses they teach in the second year at ENSAE, the year they call ‘The marathon’. You have to take Probability and Statistics, Time Series, Econometrics, and Statistical Learning. There were also two bullshit courses on the Python and R programming languages.
Let’s start with the obvious: the bullshit courses were exactly that. Clearly, you can’t go from 0 coding experience to George Hotz coding level in 6 hours of Python courses. You just can’t. It takes time to become proficient at writing code: you need to learn the basics of algorithms and data structures, then write shit code for years, then gradually get better over a long series of shitty code. Those 6 hours weren’t going to cut it, and I was lucky to have dedicated some time to getting better at coding beforehand. I had spent some time trying to understand Python at a deeper level; for example, I tried to read the source code of some ML libraries like PyTorch (I highly recommend this practice) to actually get a sense of what production-level Python code looks like. I do understand why they felt the need to put these courses in, and maybe some of my classmates did learn some basic Python while attending them. The thing is, I think you get a much better ROI if you learn coding on your own with projects, open-source participation, and hours of staring at a dark-themed IDE. This was a bit of a letdown in the curriculum, but let’s get to the meaty part.
Now to the less obvious, well, to me at least. Remember how I told you in Part I that I graduated as some kind of ‘general’ engineer? Well, it was exactly that: engineering school courses were a bunch of guidelines and keywords they spat out on projected slides for 300 students. The exam was where you had to vomit back strings of characters you had seen the day before to pass the class. To my big surprise (and joy), ENSAE is not that kind of school. On the first day, we got handouts for the statistics course and the professor was actually writing proofs on a blackboard with chalk!! Craziness! I remember thinking: this is what real scientific education looks like. I could actually query the lecturer IRL, think about the steps of the proof, and try to solve complex exercises. I did not expect that, but I loved it. In my humble view, no technical subject should be taught on slides in a lecture hall of 300 students. It’s either blackboard and chalk, or projects and hands-on work. I’ll come back to this at length in an article I’ll probably name: ‘why slides suck …’
As an aside, I wanted to talk about the maths part of the self-learning journey. I, for instance, watched Gilbert Strang’s Linear Algebra course and wished I’d had him as a teacher when learning all this the first time. The quality of the content and the pedagogy of MIT professors (and those of other top schools) are otherworldly. The only issue with this form of learning is the depth of understanding you acquire, the kind that comes only from trying, hours at a time, to solve some complex exercise before looking at the solution. I didn’t have an exam to pass so I didn’t do that (obviously, I’m not a psycho), but I thought: hmmm, do I reaaaally understand this? I mean, yes, I understand how singular value decomposition works, but how many perspective shifts do I have in mind if I wanted to solve a very hard problem or rewrite the proof of some intricate theorem?
The thing is, I always felt like I messed up my education by taking the path of least resistance. Some of it was choices I made, some was circumstantial, but I’ve always felt that I didn’t learn the hard stuff when I needed to or had time to. I was not going to make the same mistakes again, or at least I was going to try hard this time. I was going into this master with the thought of probably half-assing it because of the ‘system bro’, but I had to check myself. A month to study in-depth subjects like statistics, time series, econometrics, and statistical learning was very challenging. Thankfully, I had already done a pass on this material: I had brushed up on my maths skills and spent time watching the Harvard course Statistics 110 and MIT statistics, and rereading Introduction to Probability (this time doing all the exercises). I really consider these resources, combined with linear algebra and multivariate calculus, the best math base you can have for data science.
ENSAE is really special in that they invest so much time in teaching the theory; it is one of the few schools that still teach mathematics at a very high level. After this special month, each of the two semesters consists of a mix of mandatory courses and self-chosen ones. The mandatory ones were machine learning, deep learning, databases, and some other bullshit-filled courses where they teach you about satan in slides, something of that sort, I never paid attention. From the list of optional courses, I took some pretty difficult ones. My thinking was that these courses are the real plus academia can offer. I took advanced optimization, where we basically did proofs. This course gave me very deep insight into convex optimization and a solid base from which I can now actually read some very technical papers about convex, non-convex, and stochastic optimization. I would never have had the chance to develop these skills on my own. First of all, the lecturer was absolutely amazing: he took the time to provide a very good course and put immense dedication into answering each question about the assignments (keep in mind we were studying remotely due to the pandemic). Second, unless there is a structured environment for studying, reading maths textbooks for hours at a time is not that fun (unless you’re some kind of psycho, again), especially when you have time to do the fun, sexy stuff. I’m still unsure how important deep maths skills are in a technical field like data science, but hey, it can’t hurt to have them. To hammer this point home, one of the other courses I took was advanced statistical learning, which provides a very good theoretical foundation for the learning-from-data stuff. It changed the whole data science paradigm in my head and gave me a deep insight into the fundamental problem of trying to learn anything from data.
I basically read Understanding Machine Learning cover to cover and watched Shai Ben-David’s series of lectures on YouTube. Together they covered more than enough material to ace the tests and to get some pretty good insight into the field.
Having easy access to great lecturers and professors is also one of those things I didn’t think about beforehand. I got instant access to some of the best minds in academia, and this was truly amazing. I probably was the person who interrupted the lectures the most (sorry, classmates!). I already had the details of the implementations and the big-picture stuff down, but I could now ask brilliant people who had spent years developing their skills, as much as I wanted, about their intuition, what matters to them, and how they see the field evolving. So I took advantage of this.
Now, for all that math glory, the curriculum was absolutely horrible on the coding/practical side of the program. Basically, every MOOC I took was leaps and bounds better than anything we did there, except the projects where we could choose what we wanted to work on.
I want to emphasize one discovery I made while attending this master. It kind of took me time because I’m slow, but it is very important: ALL LEARNING IS SELF-LEARNING. Your education is your responsibility. Period.
Whether the classroom education you have access to is amazing or sucky, going to school may give you a good learning roadmap, good teachers, and a good overall environment to learn in. You can’t stop there if you want to be good at what you do. Some parts of the education may be harder or not fun, others may be time-consuming, but it’s all the same: you need to give time and effort to dive deeper, work on projects, practice, and learn from multiple sources. I recall having to finish 4 projects at the same time, and I kept thinking that this kind of concurrency is not optimal. You have just enough time to hand in something that doesn’t suck, barely scratching the surface of each of the four subjects.
Academic studies are a breadth-first search; your role is to work on the depth.
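For the non-programmers who made it this far, here is a minimal sketch of what that metaphor means. The topic tree is hypothetical (I made it up from subjects mentioned in this article, it is not the actual ENSAE syllabus), and the function names are mine. Breadth-first visits every topic at one level before going deeper; depth-first follows one branch all the way down before moving to the next.

```python
from collections import deque

# Hypothetical topic tree, loosely built from subjects named in this article.
CURRICULUM = {
    "data science": ["statistics", "optimization", "machine learning"],
    "statistics": ["probability", "time series"],
    "optimization": ["convex optimization"],
    "machine learning": ["statistical learning", "deep learning"],
}


def breadth_first(tree, root):
    """Visit topics level by level, like a survey-style curriculum."""
    order, queue = [], deque([root])
    while queue:
        topic = queue.popleft()             # oldest discovered topic first
        order.append(topic)
        queue.extend(tree.get(topic, []))   # subtopics wait their turn
    return order


def depth_first(tree, root):
    """Follow one branch to the bottom before starting the next one."""
    order, stack = [], [root]
    while stack:
        topic = stack.pop()                 # most recently discovered topic first
        order.append(topic)
        # Push in reverse so the first subtopic is explored first.
        stack.extend(reversed(tree.get(topic, [])))
    return order


if __name__ == "__main__":
    print("school:", breadth_first(CURRICULUM, "data science"))
    print("you:   ", depth_first(CURRICULUM, "data science"))
```

Both traversals end up visiting the same topics; the only difference is the order, which is kind of the point.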
If I have to sum up this year of ‘structured’ data science learning, I think that my yearning for mathematics took over the first part of the year before getting regulated by the reality of doing data science projects. Theory is only one piece of the puzzle of solving real-world problems with data science tools. It is a very important one, but all in all, the actual skill of writing code and spending time on projects, paper implementations, and using libraries may be more importanter (importer? importanterer?).
Putting all these skills into a comprehensive curriculum is the subject of my next article. I promise it will be a more down-to-earth, roadmap-style curriculum where I will try to distill these three years of trial and error into one or two roadmaps that you can follow as a backbone for your data science learning journey. It is a journey. I loved every aha moment when something clicked and every long day of rereading the same paragraph 30 times to understand some obscure bandit algorithm detail. I feel lucky to have found this path, and I still have a lot to learn and a lot to discover.
My verdict on self-taught vs classically trained? I believe that you can learn everything yourself, but I couldn’t have gotten this insight before taking that master’s course. If you have the chance to take a course in data science, nothing stops you from deep diving on your own, and if you can’t or don’t want to follow the classical route, there is nothing that can’t be learned on your own. Google is your friend; discipline and ambition are all you need (and a minimum IQ of 100, let’s be real here).