Data science and machine learning are the hot fields of tech: both are growing, highly paid specialties with persistent talent shortages and legions of people who want in.
They’re also connected by a shared history. That history began with ENIAC, the first general-purpose electronic computer, which was commissioned in 1943 for weapons research and completed in 1945. In the post-war 1950s and 1960s, as more computers were built, scientists began to theorize about other tasks computers could accomplish, if only they were powerful enough.
Some guessed incorrectly. The 1956 Dartmouth conference on artificial intelligence, considered by some the birthplace of the field, included the prediction that a “significant advance” could be made in 2 months of work on the core problems of AI, such as machines using human language or understanding abstract concepts. Needless to say, the researchers underestimated the scale of the problem.
But when it came to data science and machine learning, some theorists hit the nail on the head: the storage and processing capabilities of computers would create new fields centered on the analysis and manipulation of data.
It took over five decades for those predictions to pan out, but when they did, the two fields exploded in popularity. It’s no accident that data science and machine learning became huge in tandem: both rely on processing massive volumes of data. Data science uses that data to produce insights, while machine learning uses it to teach computers to predict which actions to take. Although experts in the two fields have different skill sets, successful practitioners in both share a deep understanding of statistics and quantitative modeling. (In fact, one subfield of machine learning is called statistical learning, which shows how closely data and learning algorithms are linked.)
The inflection point for both fields came in the latter half of the 2000s — and while there are no exact measurements of how large each field is today, both have grown exceptionally quickly. In data science, for instance, KDnuggets estimated over 200,000 data scientists globally, and LinkedIn searches revealed 120,000 data scientists and machine learning / AI engineers.
With data science and machine learning having quickly grown into major careers for hundreds of thousands, a look at their history can help show why the two fields are so significant — and where they might end up in the future.
A brief history of data science
The field of data science began with statistics — and, depending on your perspective, it has never fully separated from its parent.
John Tukey, often regarded as the founding father of data science, also founded Princeton University’s statistics department. In 1962, Tukey published “The Future of Data Analysis”, which made a distinction between two goals in statistics. The first goal, confirmatory analysis, represented the typical objective of statistics: confirming the numbers behind an existing theory or conclusion. But Tukey favored the second goal, which he named “exploratory data analysis”. With exploratory analysis, Tukey said, statisticians would be able to find new, unexpected insights in data.
During the next thirty years, the idea of exploratory analysis slowly spread through academia and research institutions, such as Tukey’s employer, Bell Labs. By 1993, John Chambers, one of Tukey’s colleagues at Bell, rearticulated the idea of confirmation versus exploration as a crossroads for statistics:
“The statistics profession faces a choice in its future research between continuing concentration on traditional topics – based largely on data analysis supported by mathematical statistics – and a broader viewpoint – based on an inclusive concept of learning from data.”
Chambers’ remark marks both the beginning of the divergence of data science as a new technical field, and an acceleration in its development.
Another accelerant came with the development of new tools. In 1995, the R programming language was released, aimed at statistical computation. Earlier languages were also used by statisticians, including MATLAB, SAS, and S. But R grew beyond its predecessors, giving birth to a flourishing community that built open-source tools atop the language. Soon, a parallel community grew around the Python programming language, and its data analysis libraries like NumPy and Pandas.
The rise of data science tools was paralleled and complemented by rapid growth in storage and computing power. In 1996, digital storage became cheaper than paper; over the 21 years from 1986 to 2007, the world's data went from 99.2 percent analog storage (primarily paper and videotape) to 94 percent digital. In 2000, a paper titled How Much Information? found that yearly production of information totalled about 1.5 exabytes (1.5 billion gigabytes). An update just two years later found that number had more than tripled, to 5 exabytes. And just as data was making the leap to nearly pure digital storage, a new paradigm of cloud storage and processing emerged with Amazon Web Services, which launched in 2006. Leaps in both storage and processing power enabled the large-scale logging and mining of data that data science is now known for.
Data science grew massively through the 2000s, driven by the press around rapidly growing companies like Amazon, LinkedIn and Facebook. Suddenly, everyone wanted in on the latest trend: big data. “It was the perfect storm of technology for logging and storing massive amounts of data, the rise of data intensive web services, every company getting caught up in the FOMO on data so that they invested massively in collecting an absurd amount, as well as the low barrier to entry due to the rise of services like Amazon Web Services,” says Hai Guan, a data science veteran and a former data science lead at LinkedIn.
The consensus on “data science” as the field’s name came in 2011, when DJ Patil, who was best known for his work at LinkedIn, published Building Data Science Teams. Patil noted that data scientists came from backgrounds as diverse as experimental physics, computational chemistry and neurosurgery. The unifying thread, he wrote, is an innate sense of curiosity, the skills to work with a tremendous volume of data, and storytelling, with statistics forming the grammar of that storytelling.
Four years later, in 2015, data science got the ultimate stamp of approval: Patil was announced as the United States’ first Chief Data Scientist. In half a century, data science had gone from a tiny niche of statistics to a full-fledged field in its own right.
A brief history of machine learning
Like data science, the field of machine learning got its start under the wing of a bigger sibling. Artificial intelligence, while still a new field, was in full swing by the mid-1950s, and brought machine learning along as a subset, albeit one that used methods dating back decades or even centuries, such as Markov chains and Bayes' theorem.
The term “machine learning” is credited to Arthur Samuel, who in 1952 created a checkers-playing program for IBM's first commercial computer, the 701, and who popularized the term itself in a 1959 paper. Samuel's checkers program used a rote learning technique, storing previous board positions for use in a search tree, which looked ahead to discover possible results of a move.
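The idea of combining stored positions with look-ahead can be sketched in a few lines of Python. This is only a toy illustration, not Samuel's program: it swaps checkers for a simple take-away game (an assumption made to keep the example short), and the names `seen_positions` and `evaluate` are hypothetical.

```python
# Toy illustration of rote learning inside a look-ahead search.
# Game (an assumption, standing in for checkers): players alternate removing
# 1, 2 or 3 stones from a pile; whoever takes the last stone wins.

seen_positions = {}  # position -> (score, best move), kept between searches


def evaluate(stones):
    """Return (score, move) for the player to move.

    score is +1 if that player can force a win, -1 otherwise.
    """
    if stones == 0:
        return -1, None  # the previous player took the last stone and won
    if stones in seen_positions:
        return seen_positions[stones]  # rote recall of a stored position
    best_score, best_move = -2, None
    for take in (1, 2, 3):
        if take > stones:
            break
        opponent_score, _ = evaluate(stones - take)  # look one move ahead
        score = -opponent_score  # the opponent's gain is our loss
        if score > best_score:
            best_score, best_move = score, take
    seen_positions[stones] = (best_score, best_move)  # remember for next time
    return best_score, best_move


score, move = evaluate(10)
print(score, move)
```

Because every evaluated position is stored, a later search that reaches the same pile size returns immediately instead of repeating the look-ahead, which is the essence of the rote approach described above.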
Six years later, in 1958, the perceptron algorithm was invented for use in image recognition by the psychologist Frank Rosenblatt. The Perceptron Project was one of the first artificial neural networks in existence.
But perceptrons represented more than just a technical advance: in a significant way, they also began a history of overhyped claims, pushed forward more by public imagination than scientific realism. The Office of Naval Research, which funded the Perceptron Project, boldly claimed that future neural networks would be able to see and talk with humans. These statements attracted strong pushback from the wider AI community and prompted two AI researchers, Marvin Minsky and Seymour Papert, to publish Perceptrons (1969), a book taking aim at the legitimacy of neural networks. As a history of the controversy notes:
Minsky and Papert were worried by the fact that many researchers were being attracted by neural nets. Their motivating force was... to try to stop what for them was an unjustified diversion of resources to an area of dubious scientific and practical value, and to push the balance of AI funding and research towards the symbol-processing side.
The attack largely worked: while a core of researchers continued their work, the overall field of machine learning experienced periods of stagnation between the 1960s and 1980s, along with the periods of AI winter that slowed the wider field. Still, some notable projects emerged during this time. In 1979, the Stanford Cart, a self-driving vehicle, slowly navigated through cluttered rooms; in 1985, NETtalk learned to correctly pronounce 20,000 English words; and in 1986, the backpropagation algorithm was popularized as a way to train neural networks.
A major boost to AI came in 1997, when IBM's Deep Blue beat world chess champion Garry Kasparov. While Deep Blue primarily relied on an algorithmic approach to chess, its use of AI techniques such as alpha-beta pruning and heuristic evaluation sparked new interest from other developers. Deep Blue was a bit of ML, a bit of AI and a lot of brute computing power. Even so, its win triggered a new narrative of machine versus human, which has since played out through headline-grabbing demonstrations of ML techniques, including IBM Watson's win on Jeopardy! in 2011, Google AlphaGo's wins against Go world champions in 2016, and the release of Google’s AlphaZero in 2017. Each of these major stories helped fire the imaginations of both specialists and the general public.
In 2006, the phrase “deep learning” was popularized to describe machine learning algorithms with multiple layers (most of which are neural networks). Five years later, in 2011, Google launched Google Brain, a deep learning project that would go on to make advances in image recognition, translation and robotics. Today, TensorFlow, an open source library developed by the Google Brain team, is widely used for machine learning.
One additional trend accelerated machine learning: the proliferation of large, high-quality training datasets, such as the 16-million-item Reuters News Wire Headline database and ImageNet, containing 14 million images, over a million of which were labeled by humans to help machines learn more easily. Open datasets such as these have driven major advances: in 2012, a convolutional neural network named AlexNet achieved a stunningly low top-5 error rate of about 15 percent in a competition to identify images on ImageNet, transforming the field of computer vision. Five years later, top performers in the competition had pushed the error rate below 5 percent.
The last twenty years have brought a nonstop parade of significant advancements and good press for machine learning, leading to the current prominence of the field. Just as data science outgrew its parent field of statistics, machine learning has come to dominate its own parent: nearly all of today’s major “artificial intelligence” projects and commercial frameworks are, in fact, driven by machine learning techniques.
A shared future
In many ways, data science and machine learning have had parallel histories. Both began in the aftermath of World War II, as commercial computers entered use and theorists began to devise uses for them. But theory ran ahead of capability: both fields needed more powerful computers than existed at the time. More recently, our growing abilities to store, compute and share massive amounts of data have enabled the growth of both data science and machine learning, while trends such as cloud computing, ubiquitous computing and the open source movement have accelerated their spread.
These similarities suggest that data science and machine learning may also follow similar paths in the future.
For job seekers, both fields have historically presented an intimidating exterior: the most visible professionals are masters of many disconnected disciplines. But more recently, tools have appeared that simplify common processes and make less experienced employees more effective. Python and R have packages and libraries that abstract away many complexities of data science. In machine learning, the major cloud platforms (AWS, Google Cloud and Azure) all provide ML/AI services with pre-trained models. Such tools are likely to become even more important in the future, allowing people who specialize in a given tool to find employment.
Similarly, the two fields are spreading into more niches. Most major consumer-facing companies already capitalize on their massive databases of customer and user data, using data science to discover future trends or machine learning to choose which product to market to an individual and which variation of an advertisement to show them. Payment processors such as Visa, Mastercard and PayPal employ machine learning and data science to identify suspicious transactions. Triplebyte’s software engineering quizzes and candidate-company matching rely heavily on advanced machine learning models.
Beyond the enterprise, some theorists predict that data science and machine learning will have a massive impact on the scientific research that drives our society forward. In the future, says Stanford professor David Donoho,
“Individual computational results reported in a paper, and the code and the data underlying those results, will be universally citable and programmatically retrievable… Code sharing and data sharing will allow large numbers of datasets and analysis workflows to be derived from studies science-wide”
These changes would reinforce the iterative nature of science. Analysis of our growing stockpiles of scientific data with data science and machine learning techniques could help surface patterns that would never be found by starting with hypotheses.
Finally, machine learning has potential applications across society. In a recent talk, Google’s Jeff Dean noted that an ML algorithm on a smartphone could identify diabetic retinopathy, an eye disease, more accurately than ophthalmologists. Dean believes that the field of machine learning could eventually produce a multi-purpose model: “We shouldn’t be training a thousand or a million different models. We should be training one model so when the million-and-oneth task comes along, we can leverage knowledge learned from previous tasks.”
For more on how Triplebyte helps candidates get hired for machine learning and data science roles, take a look at our past post on building our machine learning track.