THE GENESIS OF DATA ANALYSIS
The coupling of the very mature discipline of STATISTICS with the very young discipline of COMPUTER SCIENCE gave rise to the intriguing story of how the data scientist came to be known as “sexy”. “Data Science” is now the term for a profession dedicated to making sense of vast volumes of data. Trying to understand data is a science with a long history, and it has been a subject of discussion among scientists, statisticians, librarians, computer scientists and many other professionals for years. Let us take a journey through time to explore the evolution of the term “Data Science”: its use, attempts at defining it, and other related terms.
In 1962, John W. Tukey wrote in “The Future of Data Analysis”: “For a long time I thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and doubt… I have come to feel that my central interest is in data analysis… Data analysis, and the parts of statistics which adhere to it, must…take on the characteristics of science rather than those of mathematics… data analysis is intrinsically an empirical science… How vital and how important… is the rise of the stored-program electronic computer? In many instances the answer may surprise many by being ‘important but not vital,’ although in others there is no doubt but what the computer has been ‘vital.’” Tukey had coined the term “bit” in 1947, a term Claude Shannon used a year later in his paper “A Mathematical Theory of Communication”. In 1977, Tukey published “Exploratory Data Analysis”, arguing that more emphasis is needed on using data to suggest hypotheses to test, and that Exploratory Data Analysis and Confirmatory Data Analysis “can—and should—proceed side by side.”
In 1974, Peter Naur published “Concise Survey of Computer Methods” in both Sweden and the United States. The book is a survey of contemporary data processing methods used in a wide range of applications, organised around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing: “[Data is] a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.” The Preface tells the reader that a course plan was presented at the IFIP Congress in 1968 under the title “Datalogy, the science of data and of data processes and its place in education”, and that in the text of the book the term ‘data science’ has been used freely. Naur defines data science as “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
In 1977, the International Association for Statistical Computing (IASC) was founded as a section of the International Statistical Institute (ISI). “It is the mission of the IASC to link traditional statistical methodology, modern computer technology and the knowledge of domain experts in order to convert data into information and knowledge.”
In 1989, Gregory Piatetsky-Shapiro organised and chaired the first Knowledge Discovery in Databases (KDD) workshop. In 1995, it became the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
In September 1994, Business Week published a cover story on “Database Marketing”, stating: “Companies are collecting mountains of information about you, crunching it to predict how likely you are to buy a product, and using that knowledge to craft a marketing message precisely calibrated to get you to do so… An earlier flush of enthusiasm prompted by the spread of checkout scanners in the 1980s ended in widespread disappointment: Many companies were too overwhelmed by the sheer quantity of data to do anything useful with the information… Still, many companies believe they have no choice but to brave the database-marketing frontier.”
In 1996, members of the International Federation of Classification Societies (IFCS) met in Kobe, Japan, for their biennial conference. For the first time, the term ‘data science’ appeared in the conference’s title (“Data science, classification, and related methods”). The IFCS was founded in 1985 by six country- and language-specific classification societies. One of them, The Classification Society, was founded in 1964, and its publications made use of terms such as data analysis, data mining and data science.
Also in 1996, Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth wrote in their publication “From Data Mining to Knowledge Discovery in Databases”: “Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing… In our view, KDD [Knowledge Discovery in Databases] refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data… the additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.”
In 1997, Professor C.F. Jeff Wu (currently at the Georgia Institute of Technology) called for statistics to be renamed data science and for statisticians to be called data scientists. He made the call during his inaugural lecture for the H. C. Carver Chair in Statistics at the University of Michigan.
Also in 1997, the journal Data Mining and Knowledge Discovery was launched. The order of the two terms in its title, with data mining first, reflects how data mining came to be the more popular way of designating “extracting information from large databases”.
In December 1999, Jacob Zahavi was quoted by Knowledge@Wharton in “Mining Data for Nuggets of Knowledge” as saying: “Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Scalability is a huge issue in data mining. Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.”