"ex-libris" of a Data Scientist, part II: Model

"En l'an 2000, a l'ecole" - Jean-Marc Cote, 1899n
abstract: I will cover some of the essential books for data science in a 6-part series. After an overview, this part II covers models (see part I: data and databases, part III: technology, part IV: code, part V: visualization and part VI: communication).

"Since all models are wrong, the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena - George Box

In the first part of this series, I covered books and many essential papers related to data and databases. 

Although that required a good amount of work to put together, it was nothing compared to selecting the important books on the whole world of data science models (given how long this document already is, I will publish a list of important papers as a separate document, if there is interest).
Contrary to popular belief, only a fraction of this relates to building machine learning models. The part most often forgotten is setting goals and metrics and testing hypotheses. Another is that there is life beyond (and before) ML: econometrics, statistics, math and signal processing are but a few of the areas one should investigate when building models (signal processing will not be covered in this list, nor will the whole field of computer vision).

But before we even get into all of that, there is a minimum mathematical foundation that is needed.

Math


I studied a lot of this while in school, in French, so those books would probably not be best suited to the majority of my audience. But over the years I did encounter a few additional interesting books on the subject.

First, the hard reality: there is a minimum amount of basic math that is required for building, measuring and understanding models, such as high school geometry and college level linear algebra, analysis (calculus, in the US) and probability.

My suggestion here would be to find an online course on the subject and use the book that is recommended by the teacher. However, if you want to get into some interesting stuff, here are some related books from my bookshelf, ordered by date published:

  • Calculus Made Easy, 2nd (1914), Macmillan, Silvanus P. Thompson (the tagline: what one fool can do, another can)
  • Descriptive Geometry, 4th (1946), Charles Schumann (gives a visual tour of geometry)
  • Lectures on Linear Algebra (1961), Interscience, I.M. Gelfand (translated from Russian)
  • Thinking Mathematically (1982), Addison-Wesley, John Mason (highly recommended, there is also a more recent edition available)
  • Probability: Theory and Examples (1991), Wadsworth, Richard Durrett (the current edition is the 4th, 2010, and a PDF is available from the author)
  • Geometry: Euclid and Beyond (2000), Springer, Robin Hartshorne
  • Convex Optimization (2004), Cambridge University Press, Stephen Boyd, Lieven Vandenberghe
  • Handbook of Linear Algebra (2007), Chapman & Hall, Leslie Hogben editor


More advanced subjects worth looking into include combinatorics, and more specifically topological combinatorics, analytic combinatorics, graph theory and combinatorics on words. There is a lot of overlap between these areas and computer science (e.g. automata theory and languages; these will be covered in part IV of this series).

Speaking of videos and self-study guides, you definitely want to take a look at the machine learning self-study resources compiled by my colleague Rob Agle. The first section talks about math resources and online classes.

Metrics



I cannot overemphasize the importance of this section. If you do not evaluate and measure, you have nothing. Without metrics, who is to say that a coin toss is not a better model? (In other words, you are doing worse than random).
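To make the coin-toss point concrete, here is a minimal sketch (my own illustration, not from any of the books below) using scikit-learn's DummyClassifier as the "coin toss" baseline; the synthetic data set and logistic regression model are just stand-ins for whatever you are actually evaluating:

```python
# Compare a real model against a "coin toss" baseline before trusting it.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: 1000 samples, binary target
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The "coin toss": predicts classes uniformly at random
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

base_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
print(f"coin toss: {base_acc:.2f}  model: {model_acc:.2f}")
```

If your model does not clearly beat this kind of baseline on a held-out set, you have no evidence it learned anything.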

Books covering data science metrics specifically and in detail are few and far between. This is due in part to the field of data science being relatively recent, with most publications focusing more on the data than on the science part. That said, in each of the subsequent sections some of the books touch on building experiments, testing and measuring, but I feel it is often glossed over. The following books and publications are worth reviewing:


  • Residuals and Influence in Regression (1982), Chapman and Hall, Sanford Weisberg (good starting point when evaluating regression through residuals, pdf is available through the author)
  • Evaluating Natural Language Processing Systems (1996), Springer, Karen Sparck Jones, Julia R. Galliers (read this some years back, would love to get a paper copy)
  • Evaluation and Analysis of Supervised Learning Algorithms and Classifiers (2006), Blekinge Institute of Technology, Niklas Lavesson (thesis, PDF is available)
  • Evaluating Learning Algorithms: A Classification Perspective (2011), Cambridge University Press, Nathalie Japkowicz, Mohak Shah (it does have an emphasis on R code and should have been covered in part IV of this series, but it is the only relatively comprehensive book on the subject so I put it here)
  • Evaluating Machine Learning Models (2015), O'Reilly, Alice Zheng (more of a basic intro in booklet form, PDF available for free)


There are also metrics that are domain specific, and you might be asked to estimate or measure the effects of business decisions in terms of these. For example, when dealing with marketing or finance:


  • Measuring Marketing: 103 metrics every marketer needs (2007), Wiley, John Davis
  • Impact Evaluation in Practice (2011), The World Bank, Paul J. Gertler et al. (a bit more specialized, from a macroeconomic perspective but worth a read: free pdf)
  • Designing Qualitative Research, 5th (2011), SAGE Publications, Catherine Marshall, Gretchen B. Rossman (if surveys are part of your project, either as input or somewhere in the pipeline, perhaps as user-provided ratings, you should read this book, now in its 6th edition)
  • Marketing Metrics 2nd edition (2011), Pearson, Paul W. Farris et al.


Note: Papers, on the other hand, are a lot more common, but the challenge is then to find those relevant to your problem. I typically start with a survey of the specific topic I want to research, if one exists; surveys give many pointers to related publications.

Operations Research

Linear Programming, Game and Decision Theory, Markov models, Stochastic Programming? All are considered core to operations research (with an overlap of applied math, probability and statistics). If this field is completely alien to you, here's some reading material (some could arguably be found in other sections of this document):


  • Theory of Games and Economic Behavior (1944), Princeton University Press, John von Neumann, Oskar Morgenstern
  • Linear Programming and Extensions (1963), Princeton University Press, George Dantzig
  • Operations Research: An Introduction, 5th (1992), Macmillan, Hamdy A. Taha (highly recommended; the latest edition available is the 10th, published by Pearson)
  • Markov Chain Monte Carlo in Practice (1996), Chapman and Hall, W. R. Gilks, S. Richardson, D. J. Spiegelhalter
  • Fundamentals of Queueing Theory, 3rd (1998), Wiley, Donald Gross, Carl M. Harris
  • Introduction to Stochastic Programming (2000), Springer, John R. Birge, Francois Louveaux
  • In Pursuit of the Traveling Salesman: Mathematics at the Limits of Computation (2012), Princeton University Press, William Cook


Over the years, the term has come to mean different things to different people, but the subjects covered in the above OR: An Introduction book should be familiar to a data scientist.

Econometrics and time series

Many of the same techniques appear in Operations Research, Signal Processing, Statistics and Machine Learning, so the field is worth exploring, particularly when it comes to speaking the same language. If you deal at all with financial data, forecasting or any time series problem, it is well worth your time to investigate what has been done in this space.


  • Theory and Practice of Econometrics (1980), Wiley, George G. Judge, R. Carter Hill, William E. Griffiths et al. (2nd edition 1985, I do not know a more recent equivalent that is as precisely presented)
  • Introduction to the Theory and Practice of Econometrics (1982), Wiley, George G. Judge, R. Carter Hill et al (2nd edition in 1988, undergraduate level)
  • Studies in Econometrics, Time Series and Multivariate Statistics (1983), Academic Press, S. Karlin, T. Amemiya, L.A. Goodman
  • The Collected Works of John W. Tukey Vol. I: Time series 1949-1965 (1985), Wadsworth, John Tukey
  • The Collected Works of John W. Tukey Vol. II: Time series 1965-1985 (1985), Wadsworth, John Tukey
  • Forecasting and Time Series: An Applied Approach (1993), Duxbury, Bruce L. Bowerman (there is a 2000 edition and there are more recent books on time series by Bowerman)
  • Mostly Harmless Econometrics (2009), Princeton University Press, Joshua D. Angrist, Jorn-Steffen Pischke


Statistics



First, if you need an introduction to the subject, start with the OpenIntro stat course and book (OpenIntro Statistics, 3rd edition). My suggestion from there is to pick some of the books below as use cases surface, except for The Lady Tasting Tea which you should read right away (to get historical context and a healthy dose of skepticism), and perhaps the bright green book (Data Analysis and Regression: A Second course in statistics). The orange book (Tukey's EDA) will be covered in part V, Visualization (coincidentally).


  • Elements of Statistics (1901), P.S. King, Arthur L. Bowley (already the main ideas of EDA showing up...)
  • The Foundations of Statistics (1958), Leonard J. Savage (another book included for the historical perspective and "controversies")
  • Data Analysis and Regression: A Second Course in Statistics (1977), Addison Wesley, Frederick Mosteller, John W. Tukey
  • Outliers in Statistical Data (1978), Wiley, Vic Barnett, Toby Lewis (for the longest time, the only book on the subject)
  • Robust Statistics (1981), Wiley, Peter J. Huber (insightful)
  • Bayes Theory (1983), Springer, John A. Hartigan
  • Statistical Decision Theory and Bayesian Analysis (1985), Springer, Jim Berger (another key reading, 2nd edition is augmented on the Bayesian side)
  • Empirical Model-Building and Response Surfaces (1987), Wiley, George E.P. Box, Norman R. Draper
  • Finding Groups in Data: An Introduction to Cluster Analysis (1990), Wiley, Leonard Kaufman, Peter J. Rousseeuw
  • Linear Models: Least Squares and Alternatives (1995), Springer, C. R. Rao, Helge Toutenburg
  • Survival Analysis: A Self-Learning Text (1996), Springer, David G. Kleinbaum (fairly easy to follow)
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2001), Springer, Trevor Hastie, Robert Tibshirani, Jerome Friedman (I hesitated between putting this under Statistics or ML, since it also covers trees, boosting, neural networks, etc.)
  • The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (2001), Henry Holt & Co, David Salsburg


And don't forget to review Durrett's book on Probability (in the math section towards the beginning) and MCMC in Practice (in the OR section).

Machine Learning



I think the order in which I covered each section makes it obvious that Machine Learning builds on and shares many components with other fields of research, yet many "data science" programs focus only on machine learning algorithms.

Here, I am concentrating more on books covering the theory, so as to cover a broad range of algorithms. It is important to learn which types of algorithms tend to be better suited to certain problems. At the same time, an algorithm that does extremely well on one data set might do extremely poorly on another (do read up on the "no free lunch" theorem when you get a minute). If you are more hands-on and less fond of theory, the bulk of language-specific machine learning books will be covered in part IV of this series.
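The practical upshot of "no free lunch" is that you compare algorithm families empirically on your data rather than assuming one is universally best. A minimal sketch (the data set and the two classifiers are arbitrary choices for illustration):

```python
# Compare two algorithm families on the same data with cross-validation,
# instead of assuming one is better a priori ("no free lunch").
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                  ("knn", KNeighborsClassifier())]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

On a different data set the ranking can easily flip, which is exactly why the comparison has to be rerun per problem.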


  • Machine Intelligence and Related Topics (1982), Gordon and Breach Science Publishers, Donald Michie
  • Machine Learning: An Artificial Intelligence Approach I (1983), R.S. Michalski, J.G. Carbonell, T.M. Mitchell
  • Machine Learning: An Artificial Intelligence Approach II (1986), R.S. Michalski, J.G. Carbonell, T.M. Mitchell
  • The Computational Complexity of Machine Learning (1989), MIT Press, Michael J. Kearns
  • Algorithmic Learning (1994), Oxford University Press, Alan Hutchinson (unfortunately hard to find, I hope you have access to a great library)
  • Analogical Natural Language Processing (1996), UCL Press, Daniel Jones
  • Graphical Models for Machine Learning and Digital Communication (1998), MIT Press, Brendan J. Frey (unusual approach, worth the read)
  • Information Theory, Inference, and Learning Algorithms (2003), Cambridge University Press, David J.C. MacKay (there is a free pdf available from the author)
  • Survey of Text Mining: Clustering, Classification, and Retrieval (2004), Springer, Michael Berry editor
  • Pattern Recognition and Machine Learning (2006), Springer, Christopher M. Bishop (I'd probably choose between this and David MacKay's book)
  • The Top Ten Algorithms in Data Mining (2009), CRC Press, Xindong Wu, Vipin Kumar (there is also a paper covering those ten algorithms in less detail)
  • Boosting: Foundations and Algorithms (2012), MIT Press, Robert E. Schapire, Yoav Freund (there is also a more affordable paperback edition published in 2014)
  • Computer Age Statistical Inference (2016), Cambridge University Press, Bradley Efron, Trevor Hastie (could have put this under statistics, but heavy on machine learning, also available for free as pdf)
  • Deep Learning (2016), MIT Press, Ian Goodfellow, Yoshua Bengio, Aaron Courville (the book to get if you want to learn about deep learning)

If you are wondering where the scikit-learn and caret related books are, you'll have to wait until part IV (Code). The next article in the series (part III) will cover Technology.

Francois Dion

Chief Data Scientist, Dion Research LLC

@f_dion

NB: Also published on LinkedIn at: https://www.linkedin.com/pulse/ex-libris-data-scientist-part-ii-model-francois-dion/

