Data is used in essentially every field: to study DNA to cure diseases, to analyze the position and orientation of stars, to track political campaigns and voter opinions, to help businesses respond to consumer feedback, to determine standardized test scores, and much, much more. Even in the arts and creative fields, data science is becoming a critical skill. The importance of data science education for high school students is increasing. In this post, we will discuss the basics of how to prepare for data science while in high school.
How to Learn Data Science from Scratch in High School (Part 1)
Co-authored by Tanmoy Ray
Why Should You Prepare for Data Science in High School?
The Gap in Data Science Education at the High School Level
Even if you are not aiming to become a data science professional, being a data-literate person is extremely important for everyday life. Data helps us be well-informed citizens and make decisions, from choosing a career path or college to understanding the news, to knowing how we receive our music, movie, and product recommendations, and even understanding how social media news reaffirms our political beliefs.
However, there is a clear gap in training at the high school level.
In the US, one large public school offers nearly 200 courses, yet only two of them focus on statistics or data science: an introductory-level class and an Advanced Placement (AP) class. And this is in one of the best public school districts in the country. Read Why Data Science Education Should Be Reformed – A Perspective by a High School Student.
Current Challenge for the Recruiters?
Presently, most students are only exposed to computer science or computational thinking during high school and first interact with Data Science after entering college or in the workforce.
As the ability to mine large amounts of data becomes more feasible, it also becomes more critical for the next generation of students to learn how to analytically and practically interact with larger data sets.
Updating current curricula to reflect these technological shifts and to include data science gives high school students an initial toolbox that they can build on throughout college and their careers.
Current State of Data Science Learning in High Schools
Most high schools teach introductory computer processing and computer science, and some have also incorporated lessons on the basics of newer technologies. Unfortunately, few high schools possess curricula dedicated to learning data science.
Technology, and the way society interacts with it, is continuously evolving. Updating high school curricula to reflect these changes is necessary to prepare the next generation to work in the global economy. Read more about data science learning in high school.
In 2020, more than 3,000 high schoolers in 51 high schools across Southern California took data science courses as part of their curriculum. The introductory data science course was designed by UCLA.
Computer Science (or Math/Physics) vs Data Science: Which is Better after Class 12?
According to Prof. Arun Kumar (UC San Diego), a Data Science program will offer more statistics/math skills and hands-on experience with data-driven applications (say, in a domain science with messy, real-world datasets) than most CS programs. All that can give you a headstart for careers as a data scientist, ML engineer, etc.
A CS program can also lead to such career pathways but it will likely need a lot more conscious independent effort on the student’s part to fill in gaps in their statistical/math knowledge and avenues to obtain hands-on experiences.
“I’d be wary of universities that simply tack on some CS courses to a statistics degree or vice versa to jump on the bandwagon without deeper pedagogical thought on the curriculum.” – Prof. Arun Kumar, Associate Professor, UC San Diego
Now, we will move on to the main agenda of this article – how to prepare for data science in high school.
Top Skills to Learn in High School to Prepare for Data Science
Data science is all about storytelling and making sense of numbers, which in turn helps us understand a situation and enables businesses to make more accurate, data-driven decisions.
In a single sentence, we can sum it up as extracting meaningful insights from raw alphanumeric data.
Programming is one of the most critical components of the data science skill set, as it is used in every aspect of the job: automating tasks, organizing raw data, and implementing, modifying, and using machine learning algorithms. A strong foundation in programming is a key requirement for all of these tasks, so job seekers are expected to have these skills by default.
The next important question is which language to choose.
Both R and Python are useful foundational programming languages that are widely used in the industry; the right choice depends on your experience and on the requirements of the specific job.
For example, Python is a general-purpose, versatile, and multifaceted programming language that is used in almost every kind of computational work. It is an open-source, community-driven language that combines flexibility with specificity.
It has hundreds of libraries for domain-specific work. Python is also comparatively easy to learn, because its syntax reads much like plain English and is easy to grasp for users at any level.
On the other hand, R is a more statistics-oriented language with intuitive visualization. It is mostly used by people who do statistical analysis, and those with a statistical background tend to find it easy to use.
R's syntax is slightly more complex than Python's, but it favors statisticians, and its powerful visualizations communicate results effectively. Check out the best online courses on R programming for Data Science.
So, choose the language based on your requirements.
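To give a flavor of the Python syntax described above, here is a minimal sketch of two of the tasks mentioned earlier: organizing raw data and automating a repetitive count. The survey responses and the `clean` helper are made up for illustration.

```python
# Hypothetical raw survey responses with inconsistent spacing and capitalization.
raw_responses = ["  Python", "r ", "python", "R", " PYTHON ", "sql"]


def clean(response):
    """Strip stray whitespace and normalize capitalization."""
    return response.strip().lower()


# Count how often each language appears once the data is cleaned.
counts = {}
for response in raw_responses:
    language = clean(response)
    counts[language] = counts.get(language, 0) + 1

# Print the languages from most to least popular.
for language, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(language, count)
```

Even without prior programming experience, most of these lines can be read almost like sentences, which is a large part of Python's appeal for beginners.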
Integrated Development Environment (IDE)
An IDE is a software application, either standalone or web-based, that combines common developer tools into a single graphical user interface for building applications.
An IDE generally consists of the following:
- Source code editor: A text editor that highlights code for better visual cues and may offer extra functionality such as code auto-completion and easier commenting.
- Local build automation: Utilities that make development easier by automating repetitive tasks, such as compiling and running code.
- Debugger: A program that helps identify bugs or mistakes in code so that they can be corrected easily.
Data science is all about data and data manipulation, which is the prime reason to learn about databases and a database language like SQL.
SQL stands for Structured Query Language, and it is pivotal for data manipulation. Large datasets with millions of rows and columns are often hard to manage with traditional techniques; SQL offers a precise way to access, locate, adjust, and check massive datasets.
“Scripting with Python, fundamental statistics, and SQL are critically important regardless of which direction you go in data.” – Gwen Britton, Associate Vice President of Southern New Hampshire University (SNHU) Global Campus STEM & Business Programs
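As a rough illustration of how SQL lets you locate and summarize records, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `scores` table, the student names, and the numbers are all hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("Asha", "math", 91),
        ("Asha", "physics", 84),
        ("Ben", "math", 78),
        ("Ben", "physics", 88),
        ("Chloe", "math", 95),
    ],
)

# Locate: which students scored above 90 in math?
top_math = conn.execute(
    "SELECT student FROM scores WHERE subject = 'math' AND score > 90 ORDER BY student"
).fetchall()

# Summarize: what is the average score in each subject?
averages = conn.execute(
    "SELECT subject, AVG(score) FROM scores GROUP BY subject ORDER BY subject"
).fetchall()

print(top_math)
print(averages)
conn.close()
```

The same `SELECT ... WHERE ... GROUP BY` pattern applies whether the table has five rows or five million, which is why SQL scales to large datasets.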
Math and Statistics
Statistics is one of the key components on the path to becoming a data scientist, since drawing insights from data is fundamentally a statistical exercise. Descriptive statistics summarize the nature of the data, such as its center and spread. A strong foundation in statistics is crucial because it enables the data scientist to choose the algorithms that suit a given analysis.
Statistics helps to decipher the story hidden within the numbers and to gain deeper insights into the connections and patterns which can be drawn from the numbers.
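A small, hedged example of that "hidden story": the sketch below uses Python's built-in `statistics` module on a made-up list of test scores. Comparing the mean with the median hints that one unusually low score is distorting the average.

```python
import statistics

# Hypothetical test scores for one class; the single low score (20) is an outlier.
scores = [72, 85, 90, 66, 88, 95, 20, 79]

mean_score = statistics.mean(scores)      # pulled down by the outlier
median_score = statistics.median(scores)  # more robust to the outlier
spread = statistics.stdev(scores)         # sample standard deviation

print("mean:", mean_score)
print("median:", median_score)
print("standard deviation:", round(spread, 2))
```

When the median sits well above the mean, as it does here, that is a cue to look for low outliers before drawing conclusions from the average alone.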
Along with stats, one needs to have a sound knowledge of linear algebra and differential calculus which are the building blocks of machine learning algorithms. To know the working of algorithms, a basic foundation in maths is extremely important as this would enable the student to unlock the full potential of the algorithms instead of treating them as black boxes.
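To make the link between calculus and machine learning concrete, here is a minimal sketch of gradient descent fitting a straight line in plain Python. The data, the learning rate, and the number of steps are assumptions chosen for illustration; real libraries use more refined versions of the same idea.

```python
# Hypothetical data generated from y = 2x + 1, so the right answer is known in advance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = 0.0, 0.0  # initial guess
learning_rate = 0.05         # step size (an arbitrary choice for this sketch)
n = len(xs)

for _ in range(5000):
    # Prediction errors for the current line.
    errors = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
    # Partial derivatives of the mean squared error with respect to each parameter.
    grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
    grad_intercept = 2 / n * sum(errors)
    # Step downhill, against the gradient.
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(round(slope, 3), round(intercept, 3))
```

The two gradient lines are exactly where differential calculus enters: they are the derivatives of the error, and following them is how the algorithm "learns" the slope and intercept.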
We all know the saying, “A picture is worth a thousand words.” It applies especially well to data science, which makes visualization another key skill in this field.
Each programming language has its own way of visualizing data, and both R and Python offer widely used libraries for it. These libraries are powerful enough to produce both static and interactive graphics.
Data Visualization Libraries for Python
Matplotlib is a 2D plotting library that can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
It makes visualization easy and can render the following types of graphical output with just a few lines of code:
- Bar charts,
- Power spectra,
- Stem plots,
- Error charts,
- Pie charts, and a lot more.
The best part is that the code for these can easily be found in the Matplotlib documentation, which is freely available on its website.
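As an illustration of "just a few lines of code", the following sketch draws a bar chart with Matplotlib. It assumes Matplotlib is installed, and the course names and enrollment numbers are made up.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window needs to open
import matplotlib.pyplot as plt

# Hypothetical data: number of students enrolled in each course.
courses = ["Statistics", "Computer Science", "Data Science"]
enrolled = [40, 120, 25]

fig, ax = plt.subplots()
ax.bar(courses, enrolled)
ax.set_xlabel("Course")
ax.set_ylabel("Students enrolled")
ax.set_title("Enrollment by course (made-up numbers)")

# Render the chart to PNG bytes in memory.
# fig.savefig("chart.png") would write an image file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a Jupyter notebook, the backend line and the in-memory buffer are usually unnecessary, because the chart is displayed directly below the cell.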
Plotly is a web-based toolkit for 2D and 3D plotting, used to explore data and create insightful visualizations.
It can be accessed from any Python notebook and has a useful API (Application Programming Interface) that can be used freely. It also offers a variety of plotting functions and capabilities, such as:
- scatter plots,
- line charts,
- bar charts,
- error bars,
- box plots,
- multiple axes,
- subplots, and a lot more.
The best part is that the code for these can easily be found in the Plotly documentation, which is freely available on its website.
Data Visualization Libraries for R
ggplot2 is a well-documented and popular R package based on the grammar of graphics: the idea that any plot can be created from a few basic building blocks. Those building blocks include a dataset, the axes, and the labels, which are sufficient to create a graph with this library. See its official documentation for details.
This is another R package for data visualization. It can create graphical stories using the same or similar syntax as ggplot2, and it can be used as a standalone application from RStudio or from a web browser. The package is extremely useful for exploratory data analysis.