Julien Beaulieu Data Scientist

Comprehensive Project Based Data Science Curriculum

Aug 2022 update: This blogpost is now outdated. The content has moved to and is now part of the following project: Open Curriculai. Visit the curriculum section for the most recent version.

updated June 2022: curriculum v4.0

Summary

This curriculum offers a mix of best-in-class resources and a suggested path to become a data scientist. It is intended to be a complete education in data science using online materials. I have researched all of the resources heavily and used them myself on my journey to becoming a developer and a data scientist.

Why I wrote this curriculum

There is a lack of curated online resources that organize the material found online into a long-term learning plan covering all aspects of data science. Most curriculums only suggest content from their own platform, or suggest too many options to choose from. Bootcamps inaccurately promise that you’ll get hired after graduating. In most cases, a bootcamp will only help you get started if you are new to the field.

Who is this for?

The following curriculum is intended for anyone who wants to learn data science and programming, irrespective of what their current background is (I used to work in digital marketing). It therefore assumes no prior knowledge of data science or coding, and only a basic knowledge of high school math.

This curriculum does, however, assume that you are eager to learn data science and are self-driven and motivated, because many of the resources are self-paced. Completing the curriculum end to end can easily take a year and a half. That said, the programme goes much deeper than a bootcamp and will give you more hands-on experience than most master’s degrees.

The resources are chosen to prepare you to be up and running for an industry role. If you’re interested in academia and research, enroll in a university degree instead.

If you already have experience with machine learning and are looking to refine your skills, I encourage you to look at the different modules and hand-pick what you find relevant.

This initiative is inspired by OSSU’s amazing self-taught, open source education in Computer Science.

Why choose a self-taught education?

  • There is an abundance of high-quality resources available online. This curriculum includes many courses from top universities (MIT, Stanford, University of San Francisco), MOOCs (massive open online courses) and bootcamps with outstanding reviews (Deeplearning.ai, Fastai, Le Wagon), and content from world-class creators in the form of blog posts, videos, and books.
  • Focus on state-of-the-art techniques that you can apply in industry. There aren’t many university courses or bootcamps that teach the latest techniques such as those found in Fastai or in Stanford’s CS224n: Natural Language Processing with Deep Learning.
  • Have the flexibility to learn from anywhere around the world, and to continue pursuing your path part time if you find a job during the process.
  • Become proficient in continuous learning. Following this path will teach you how to be autonomous in how you acquire new skills and rigorous in how you choose new learning material. The resources listed here will also be your go-to references at work as you progress throughout your career.
  • It is an opportunity to put yourself out there. Since you won’t rely on a college degree as a signal to get hired, you’ll be incentivised to create, share and communicate your best work. These are essential skills when working at a company.

Objectives with this coursework

  • Work on real world practical projects that you are passionate about.
  • Become a good developer with solid software engineering and computer science abilities.
  • Develop a strong foundation in math - this includes linear algebra, calculus, statistics and probability.
  • Be able to read and apply scientific papers or redo the experiments on your own.
  • Deploy models with elegant and reusable code.
  • Get hired as a data analyst, data scientist, or machine learning engineer.

Visual Overview

[Infographic: visual overview of the curriculum]

Curriculum

Data Science Intro & Learning How to Learn

I believe that the best way to get into data science is to first learn how to program and to gain some familiarity with computer science foundations. Also, since you are about to engage in a lifetime of learning new concepts and skills, I highly recommend viewing the resources related to becoming a better learner.

Note: ❤️s represent material I particularly enjoyed and recommend.

Topics covered: Introduction to AI, Introduction to computer science and Python, Learning how to learn

Resources | Source | Format
CS50 Introduction to Programming with Python | Harvard | Videos & coding exercises
CS50 Introduction to Computer Science | Harvard | Videos & coding exercises
Data Analyst Nanodegree | Udacity | Videos & coding projects
Learning How to Learn ❤️ | Coursera | Videos & quizzes
A Mind for Numbers | Barbara Oakley | Book
Pragmatic Thinking and Learning ❤️ | Andy Hunt | Book

Overview

  • If you are absolutely new to programming, start with Harvard’s CS50 Introduction to Programming with Python. David Malan is one of the pioneers of online teaching and his content is always top notch. This class was offered for the first time in 2022.
  • If you’re up for a real challenge, or if you’re already somewhat familiar with programming, look at CS50’s introduction to Computer Science from the same professor. The exercises are more involved and there is a focus on the C programming language instead of Python. That said, it covers more material and computer science concepts. If you’re not taking this course, and you probably shouldn’t be if you’re a total beginner, at least watch lecture 0 - Scratch for its marvelous introduction to computer science in general.
  • Since you’re about to start an epic learning journey, make sure you know how to apply the best techniques for efficient learning. Taking “Learning How to Learn” on Coursera will yield immense benefits long term and give you a competitive advantage over your peers. The book on which the course is based, “A Mind for Numbers”, and the brilliant “Pragmatic Thinking and Learning” are good complementary options too. For a quick summary of all three resources, refer to this blogpost or listen to this podcast by Dr Paul Pen for an excellent overview of the subject.
  • Start your data science journey with Udacity’s Data Analyst Nanodegree. This is a great place to start to learn about statistics, probability, data wrangling, data visualization, etc.

Core Data Science

If you can afford a bootcamp, I recommend taking what I consider to be the best data science one out there: Le Wagon**. There is a heavy emphasis on practical exercises, the curriculum is constantly evolving with the latest libraries and technologies, and the final project involves deploying your own machine learning model. This intensive bootcamp, also available part time, will give you the big picture of what data science encompasses end to end: math theory, data wrangling, data visualization, programming inside an IDE, Git, machine learning, deep learning, and data engineering.

If you’re not willing to spend too much money, you’re still in luck. Andrew Ng’s machine learning course on Coursera, which was originally created in 2012 and has been taken by millions, just got updated! This is a unique opportunity to get a quality education for cheap. Take this course as well as Udacity’s Data Analyst Nanodegree, and you’ll start off with solid foundations.

Next, check out Fastai’s Introduction to Machine Learning for Coders. I recommend only watching the first 6 lectures which focus on tree-based models. The rest of the videos are on deep learning which is better covered in their more recent course listed below. Although this content is almost 5 years old, don’t let that discourage you from watching it: it is taught by one of the most respected data scientists in the world - Jeremy Howard - and is full of gems. Case in point: I have impressed many colleagues even today with techniques taken directly from this course.

Pair the above with Andriy Burkov’s famous and succinct The Hundred-Page Machine Learning Book, as well as Wes McKinney’s Python for Data Analysis. The first is more focused on ML theory, while the second is hands-on with Python. They are complementary and will solidify your understanding of all concepts covered so far. As a testament to the quality of the people I am referencing, Wes McKinney is the creator of the widely used open-source pandas package.

Topics covered: Data wrangling, Data collection with an API, SQL, Statistical tests & experiments, Data visualization, Machine Learning, Deep Learning, Random Forests, Model interpretation techniques

Resources | Source | Format
Data Science Bootcamp ❤️ | Le Wagon | In person / remote lectures - 9 weeks
Machine Learning Course | Coursera | Online videos & coding projects
Introduction to Machine Learning for Coders - Fastai ❤️ | U of San Francisco | Online videos & coding projects
The Hundred-Page Machine Learning Book ❤️ | Andriy Burkov | Book
Python for Data Analysis, 2nd Edition | Wes McKinney | Book

Overview

  • Gain experience in the most important data science tasks by taking Le Wagon’s bootcamp. For a cheaper alternative, combine Udacity’s Data Analyst Nanodegree with Coursera’s Machine Learning Course.
  • Get a practical approach to machine learning with tree-based models and model interpretation with Fastai.
  • Complement your learning with the very well written and concise Hundred-Page Machine Learning Book.
  • The Python for Data Analysis book is a practical, modern introduction to manipulating, processing, cleaning, and crunching datasets in Python. It is ideal for beginners and is a great way to get better at pandas, NumPy, and IPython.

Core Programming

The following resources will help you become a good programmer, understand some core software engineering principles, and give you the tools to pass the technical tests that most employers send you during recruitment. I suggest reviewing this material early in your education because being a good programmer pays off very quickly. You don’t need to go through all of this material in a linear way. Review it on an as-needed basis, but make sure you come back to it regularly.

Resources | Source | Format
Python with Corey Schafer ❤️ | YouTube | Videos
Fluent Python ❤️ | Luciano Ramalho | Book
Coding Exercises | HackerRank | Coding exercises
Intro to Data Structures and Algorithms | Udacity | Self-paced videos & coding environment
Missing Semester | MIT | Self-paced videos & coding exercises

Overview

  • If you’re struggling with any programming concept in Python, make sure you search for videos of Corey Schafer explaining the subject. His videos are always well-built, clear and enlightening.
  • Familiarize yourself with common data structures and algorithms in Python with Udacity’s course, which features practice exercises.
  • Complete HackerRank exercises to refine your Python skills with interview-style questions.
  • Once you’re comfortable with the basics of Python - that is to say, after at least 1 year of coding experience - you can slowly start reading Fluent Python for a deep dive on Python core language features and libraries. Note 1: Some of the later chapters are very advanced and optional. Note 2: Keep an eye out for the updated edition of the book which is coming soon.
  • If you still aren’t comfortable with the shell, version control (Git) and debugging, watch the lectures from MIT’s Missing Semester and do the exercises. Seriously, don’t neglect the exercises!

Core Math

Machine learning is a mix of Statistics, Linear Algebra, Probability, and Calculus. Some say that it’s not strictly necessary to go deep into mathematical theory and that it’s better to focus on coding. While there is some truth to this, if your end goal is to read, write, and implement papers, and to be a true expert in data science, then do not neglect math. The following list of resources will either help you get started if you’re a beginner, or let you go deep down the math rabbit hole if you’re advanced. Remember to practice solving exercises if you want what you’re learning to stick.

Topics covered: Linear algebra, Statistics, Vector calculus, Probability, and more

Resources | Source | Format
Essence of Linear Algebra ❤️ | YouTube | Videos
StatQuest - Machine Learning ❤️ | YouTube | Videos
Linear Algebra | Khan Academy | Videos & exercises
Linear Algebra 18.06 with Gilbert Strang ❤️ | MIT | Videos & homework
Calculus 1 & 2 | Khan Academy | Videos & math exercises
Mathematics for Machine Learning | Marc Peter Deisenroth | Book

Overview

  • Get a great intuition for linear algebra with this fantastic resource: Essence of Linear Algebra by 3Blue1Brown.
  • Learn all things statistics and machine learning with StatQuest. Josh Starmer has a gift for breaking down complex ideas into some of the simplest and best explanations on the Web. He also recently published a book which I encourage you to check out.
  • Delve deep into linear algebra with Prof. Gilbert Strang’s amazing lectures, which have been viewed by millions. Complement them with the exercises in his book (which includes solutions). For a less in-depth alternative, refer to Khan Academy.
  • Learn the math required for machine learning with Marc Peter Deisenroth (and co.)’s book (advanced). Choose this option if you’re a warrior and are interested in fundamental research.
  • Don’t forget to actually do the exercises and work on assignments. This is the only way you’ll become good at math.

Deep Learning

After completing the courses in Core Data Science, and with more solid foundations in programming and machine learning theory, you can move on to deep learning if that’s an area that interests you. It can be tempting to jump straight to this section when you’re starting because there are really cool applications to work on. My view is that if this is what really motivates you, try it out and see what you get out of it. Remember to come back to the sections above, however, or you will have gaping holes in your fundamentals that will come back to haunt you down the line.

Topics covered: Loss functions and optimization, Convolutional neural networks, Recurrent neural networks, Deep learning hardware and software, Deep learning for tabular data, NLP, Computer vision, Generative models

Resources | Source | Format
Practical Deep Learning for Coders - Part 1 ❤️ | U of San Francisco | Online videos & recommended projects
Fastai Book | Jeremy Howard, Sylvain Gugger | Book
Deep Learning Specialization ❤️ | Coursera - Andrew Ng | Online videos & assignments
CS224n: Natural Language Processing with Deep Learning ❤️ | Stanford - Chris Manning | Online videos, assignments & final project
EECS 498-007 / 598-005 - Deep Learning for Computer Vision ❤️ | U of Michigan - Justin Johnson | Online videos, assignments & final project
Jay Alammar’s blog | Jay Alammar | Blogposts

Overview

  • Learn how to create state of the art models using the Fastai Library with Part 1 of their course. I suggest taking both the Fastai and Deep Learning Specialization courses together since one is more focused on coding while the other is more focused on the theory and math behind it. While you’re at it, follow Fastai’s course with their book.
  • Both Chris Manning’s and Justin Johnson’s (he used to teach the very popular CS231n at Stanford) courses are world class and will give you deep insights into the worlds of Natural Language Processing (NLP) and computer vision. Be sure to do the assignments, since they have you code algorithms from scratch and give you a solid foundation to progress further. Both have updated YouTube videos of their 2021 course.
  • The transformer architecture is widely used these days. To get a solid grasp of how it works, be sure to read some of Jay Alammar’s blog posts on the subject.

Data Engineering & MLOps

Working on machine learning for a company is usually a lot more involved than just running models inside a Jupyter notebook. The resources below will familiarize you with the whole life cycle of a machine learning system. You’ll learn about formulating a problem, ingesting, labeling & cleaning data, building reusable pipelines for each step, deploying a model online and monitoring it, and much more. You’ll gain preliminary notions about what it takes to put a model in production. As the field matures, knowing about these steps is no longer optional for anyone doing machine learning, unless you’re only doing R&D.

Topics covered:

Resources | Source | Format
Full Stack Deep Learning ❤️ | UC Berkeley | Online videos & coding project
Made With ML | Made With ML | MLOps course
Machine Learning Engineering for Production (MLOps) Specialization | Deeplearning.ai | Online videos & coding projects
Machine Learning Engineering ❤️ | Andriy Burkov | Book

Overview

  • Learn about experiment management scripts, unit tests, labelling, linting scripts, continuous integration/continuous deployment with CircleCI, model versioning, Docker and web deployment with the Full Stack Deep Learning course. The labs walk you through how to build a fully fledged handwritten text recognizer using PyTorch.
  • Design an ML production system end-to-end with Deeplearning.ai’s Machine Learning Engineering for Production (MLOps) specialization.
  • Read Goku Mohandas’ multi-part blog series/course on deploying machine learning models in an automated, reproducible, and auditable manner.
  • Complement these courses with Andriy Burkov’s amazing Machine Learning Engineering book, which will teach you about the whole life cycle of a machine learning project.
  • Go back to some of the models you have built for your projects and deploy them!

Optional Courses

The following courses should be taken depending on the outcome you want to achieve as a data scientist.

Resources | Source | Format
Practical Deep Learning for Coders - Part 2 ❤️ | U of San Francisco | Online videos & coding projects
CS229 - Machine Learning | Stanford - Andrew Ng | Online videos & assignments
SQL | Mode Analytics | Coding environment exercises

Overview

  • Learn to rebuild some PyTorch modules as well as part of the Fastai library from scratch with Part 2 of the course. This is also a great lesson in API design and software engineering.
  • If you wish to specialize in machine learning more so than deep learning, look no further than Andrew Ng’s famous machine learning lectures at Stanford.
  • If SQL is important for your projects and current/future job, become an expert with this SQL tutorial.

Extras

In addition to all of the above, I suggest doing the following:

  • Subscribe to these newsletters for a constant flow of curated blog posts to stay up to date in the field: Andriy Burkov, Deeplearning.ai’s The Batch, DataScienceWeekly.
  • Regularly explore Meetup.com to see if there are meetups on topics you are interested in. Since more meetups are currently happening online, you have access to meetups across the entire world.
  • Attend conferences. One I highly suggest going to is PyCon, even if that means spending a bit of money to attend and travelling to a host city. In my experience, the value you’ll get from it is worth it. It’s a way for you to be inspired by all that is happening in the world of Python, engineering, and machine learning. Otherwise, you can get an online-only ticket for less.
  • Participate in Hackathons. Keep an eye out for these events happening in your city, or look on meetup.com to find them.
  • Actively look for and join communities on Reddit, Discord, and Slack. For instance, subscribe to Reddit’s /r/learnmachinelearning and /r/machinelearning subreddits. Join Discord servers to find study groups so that you’re not learning alone. Fastai has a great community. Ask questions there, and answer threads when you know the answer to help solidify your understanding of a subject.
  • Put yourself out there and start writing! Create a personal blog and write articles about what you learned. Your target audience should be people in the same situation you were in 6 months / 1 year ago.
  • Find a meetup group and ask if you can present a subject you’ve been working on. This will help your oral presentation skills.

Final Notes

While I update resources found in this curriculum quite regularly, some will inevitably become outdated. As a rule of thumb, you can be sure to trust the quality of the following content if you come across their material:

  • All new and old courses from Deeplearning.ai
  • All computer science / machine learning courses at Stanford Online
  • All courses from Fastai and Jeremy Howard specifically
  • Andrew Ng for machine learning
  • Justin Johnson for computer vision
  • Chris Manning for NLP
  • All of Andriy Burkov’s content
  • StatQuest for statistics/ML explanations
  • 3Blue1Brown for math

Please feel free to send me any resources, materials, courses that I have not included that you particularly enjoyed, or to send me a message if you want to chat about my experience learning this material.

**Disclaimer: I am a freelance teacher at Le Wagon’s data science bootcamp. That said, they are not paying me to be included here. I decided to add the bootcamp to the curriculum because of how valuable I think it is.


