"ex-libris" of a Data Scientist, part IV: Code

Underwood No. 5 - Francois Dion
abstract: I will cover some of the essential books for data science in a 6 part series. This part IV covers code (see part I for the introduction and for data and databases, part II: model, part III: technology, part V: visualization and part VI: communication).

"Those who have learned to walk on the threshold of the unknown worlds, by means of what are commonly termed par excellence the exact sciences, may then with the fair white wings of Imagination hope to soar further into the unexplored amidst which we live." - Augusta Ada King, Countess of Lovelace

This was again a challenge, due to the number of titles I had to ignore. I feel like I'm missing several important books, but it is what it is. So let's get right into it.

Algorithms and Theory

It is sometimes said that coding is more an art than a science. Interestingly enough, in a circular reference, Donald Knuth gave his series of books (on algorithms and computer science theory) the title of "The Art of Computer Programming". Perhaps we should start with some of that, then... (picture from Algorithms and Automatic Computing Machines by B. A. Trakhtenbrot)

  • Graph Algorithms (1979), Computer Science Press, Shimon Even (There is a 2012 2nd edition, edited by Guy Even)
  • Art of Computer Programming, Vol. 1: Fundamental Algorithms (1969), Donald Knuth (I did a pretty good job of reading the bulk of this one)
  • Art of Computer Programming, Vol. 2: Seminumerical Algorithms (1981), Donald Knuth (this one mostly when I needed something specific)
  • Art of Computer Programming, Vol. 3: Sorting and Searching (1973), Donald Knuth (there is also a Vol. 4ish, in a perpetual state of completion... I own two of those fascicles)
  • Data Structures and Algorithms (1983), Pearson, A. Aho, J. Ullman, J. Hopcroft (a good introduction to the subject)
  • Compilers: Principles, Techniques, and Tools (1986), Addison Wesley, Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman (known as the new dragon book; the original dragon book is Principles of Compiler Design)
  • Algorithms and Applications on Vector and Parallel Computers (1987), North-Holland, HJJ te Riele, Th. J. Dekker, H.A. van der Vorst editors
  • Designing Efficient Algorithms for Parallel Computers (1987), McGraw Hill, Michael J. Quinn
  • Combinatorial Algorithms (1990), Adam Hilger, Ludek Kucera
  • Introduction to Automata Theory, Languages and Computation, 2nd ed (2001), John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman

Approaches

Lean. Agile. Object Oriented. Functional. Test Driven. Domain Driven. Discipline. Design. Different paradigms of programming and of software design, different approaches to writing and shipping code.

Perhaps with a touch of Alice in Wonderland.

A few are must-reads, a few you might want to read, and a few... well... not so much to read but, as in the previous section, perhaps to consult from time to time. (sheet music of Sei Solo a Violino senza Basso accompagnato by Johann Sebastian Bach)


  • Gödel, Escher, Bach: An Eternal Golden Braid (1979), Basic Books, Douglas R. Hofstadter (Pulitzer Prize winner; not exactly a book on approach, algorithms, cognition, history, puzzles or storytelling. An unusual book, to say the least)
  • Introduction to Functional Programming (1988), Prentice Hall, Richard Bird, Philip Wadler (early edition taught functional programming using Miranda, later edition of the book with just Bird as author uses Haskell)
  • Object-Oriented Analysis, 2nd ed. (1991), Yourdon Press, Peter Coad (the real beginning of my Object Oriented trip - these days I prefer functional programming, as it is easier to build pipelines)
  • Object-Oriented Design (1991), Yourdon Press, Peter Coad, Edward Yourdon
  • Object-Oriented Programming (1993), Yourdon Press, Peter Coad, Jill Nicola
  • The Mythical Man-Month, 1995 edition (1995), Addison-Wesley, Fred Brooks (the first edition was published in 1975, a classic)
  • Domain Driven Design: Tackling Complexity in the Heart of Software (2003), Addison-Wesley, Eric Evans (if you've just started, this might be a hard read)
  • Balancing Agility and Discipline: A Guide for the Perplexed (2003), Addison-Wesley, Barry Boehm, Richard Turner (some pragmatism)
  • Universal Principles of Design: 100 Ways to Enhance Usability, Influence Perception, Increase Appeal, Make Better Design Decisions, and Teach Through Design (2003), Rockport, William Lidwell, Jill Butler, Kritina Holden (yes, a book on design, not specific to programming)
  • Masterminds of Programming: Conversations with the Creators of Major Programming Languages (2009), O'Reilly, Federico Biancuzzi (the title says it all)
  • 97 Things Every Programmer Should Know (2010), O'Reilly, Kevlin Henney ed. (the content is available online)
  • Learning Agile: Understanding Scrum, XP, Lean and Kanban (2014), O'Reilly, Andrew Stellman, Jennifer Greene (for a data science/analytics or software engineering manager or lead, I would complement this with something like Agile Management for Software Engineering by D. Anderson on Prentice Hall)

Languages

In a recent poll by KDnuggets, the top tool used for analytics, data science and machine learning by respondents turned out to also be a programming language: Python. We will cover it below. Other popular languages were R (also covered below), SQL (covered in part I) and Java (in the Java / Scala section below). C/C++ were a bit lower down the list, but are still foundational for many high-performance applications. Finally, I conclude with a potpourri of other languages, from Perl (also included in the poll with about 1.7%, while Julia concluded the list of languages with at least 1% use) to JavaScript, to Pearls of Haskell.

Do not make the mistake of being monolingual. Do learn multiple languages and programming paradigms. Else you'll be like the person who, having learned logistic regression, thinks that everything can be solved with that (if all you have is a hammer, everything starts to look like a nail).

Python

Python's popularity in data science has been climbing rapidly since 2012 or so (there is an interesting chart on the number of downloads of Pandas and NumPy in a presentation I did several years back). But it has been around much, much longer. The first time I heard about it, I was studying computer science at UofM, and a friend who was studying at Ecole Polytechnique brought up the subject. "What a weird language", said I. "Significant spaces? No ++? I don't know about that. Gimme C." I had no idea at that time just how important this would become. Following are just some of the Python books I've acquired over the years. I have also bought many in epub and pdf formats, including some that are only available in electronic format (e.g. Python: Deeper Insights into Machine Learning on PacktPub). I still much prefer paper books, and in hardcover, please! (comic: xkcd.com # 353)
  • Internet Programming with Python (1996), M&T Books, Aaron Watters, Guido van Rossum, James C. Ahlstrom (I was working some with Perl and Python for web applications back in these early days, but mostly worked with C/C++)
  • Python Annotated Archives (1999), Osborne, Martin C. Brown (there was a whole series of Annotated Archives, from Linux to C++, and of course for Python. I'm not aware of any similar recent series)
  • Learning Python, 2nd Ed. Covers Python 2.3 (2004), O'Reilly, Mark Lutz & David Ascher (as a reference, see current 5th edition further down this list)
  • Python Scripting for Computational Science 2nd ed. (2006), Springer, Hans Petter Langtangen (a 3rd edition came out in 2008. Langtangen also wrote A Primer on Scientific Programming with Python, 4th edition in 2014. He passed away in 2016)
  • Programming Collective Intelligence: Building Smart Web 2.0 Applications (2007), O'Reilly, Toby Segaran (going back to it now, I see that at the time I didn't really realize the significance of all the chapters of this book. It did help me with decision trees, pricing models and SVMs, and all the topics are still hot today)
  • Computational Modeling and Complexity Science, Python Edition (2008), Allen Downey (was thinking of using that as a textbook for my development team back then - for a more up to date version, see Think Complexity 2nd ed., made available for download by the author)
  • Python Visual Quickstart, 2nd Edition (2009), Toby Donaldson (I also considered using this one as a textbook. In the end I taught a Python class using my own material)
  • Beginning Python Visualization: Crafting Visual Transformation Scripts (2009), Apress, Shai Vaingast (visualizations have come a long way since)
  • Python Testing Cookbook (2011), PacktPub, Greg L. Turnquist (this landscape has changed a good bit, see (2017) "Python Testing with pytest" further down, and see also the automation and configuration management section from part III)
  • Python For Data Analysis (2012), O'Reilly, Wes McKinney (from the author of Pandas, a basic building block for data science)
  • Learning Python, 5th Ed. Updated for Python 2.7 and 3.3 (2013), O'Reilly, Mark Lutz (I am currently using Python 3.6; there are a few minor differences)
  • Learning IPython for Interactive Computing and Data Visualization (2014), PacktPub, Cyrille Rossant (there is a second edition from 2015 that is available. Covers not just but also , both fundamental building blocks in the data science ecosystem)
  • Python Machine Learning (2015), PacktPub, Sebastian Raschka (I've been suggesting this book to interns)
  • Mastering Matplotlib (2015), PacktPub, Duncan McGregor (Matplotlib is the basic building block for many other visualization tools in the Python ecosystem, including my own stemgraphic)
  • Think Python: How to Think Like a Computer Scientist, 2nd ed (2015), Green Tea Press, Allen Downey (made available for free by the author. He also has a few other free books)
  • Fluent Python (2015), O'Reilly, Luciano Ramalho (when you've gotten your skills to a decent level, get this book and climb higher)
  • Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference (2015), Addison-Wesley, Cameron Davidson-Pilon (the author not only put up a lot of examples on his GitHub but also most of the content of the book)
  • Python Geospatial Development, 3rd Edition (2016), PacktPub, Erik Westra (just updated from 2nd edition which I had gotten in 2013)
  • Python Testing with pytest (2017), Pragmatic Programmers, Brian Okken
For the latest language specification see the Python 3 language reference. See also the Python 3 standard library and each individual module's documentation (sometimes on github, or readthedocs). There are also many books suggested on the python.org wiki.

R

I started looking into R around 2006 because I wanted to use it for forecasting. I also started taking statistics, data analysis and data science classes and many of them were taught using R. Plus I could do box plots! (comic: xkcd.com # 539)
  • The Basics of S and S-PLUS (1997), Springer, Andreas Krause, Melvin Olson (bought relatively recently, the roots of R)
  • Introductory Statistics with R (2002), Springer, Peter Dalgaard
  • Data Analysis and Graphics using R (2006), Cambridge University Press, John Maindonald, John Braun (first R book I bought, along with the one above)
  • R Graphics (2006), Chapman and Hall, Paul Murrell
  • R in a Nutshell (2010), O'Reilly, Joseph Adler (a bit more advanced)
  • The R Inferno (2011), self-published, Patrick Burns (made available for free by the author. If you find you are scratching your head often, you should read this)
  • An Introduction to Statistical Learning with Applications in R (2013), Springer, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani (the book is also available online at Gareth's USC website)
  • R for Everyone: Advanced Analytics and Graphics (2014), Addison Wesley, Jared P. Lander (suggested this book to teammates who didn't have experience with R)
  • Computational Actuarial Science with R (2015), CRC Press, Arthur Charpentier, ed. (It does say Actuarial Science, but there is a ton of overlap with data science. Charpentier also has an incredibly chock-full blog: https://freakonometrics.hypotheses.org/)
  • Simulation for Data Science with R (2016), PacktPub, Matthias Templ (I've been meaning to read this for a while. It looks promising, but has been sitting in my backlog. I'll update this comment once I've made some progress in reading it)
  • R for Data Science (2017), O'Reilly, Garrett Grolemund, Hadley Wickham (if you have just started with R, check it out: made available for free by the authors)

Part II (model) also included one book with some R content, in the metrics section. For the R 3.4.0 language specification see this document. See the vignettes (ggplot2 vignette) and pdf documentation (such as this one for ggplot2) for individual packages on CRAN. Furthermore, there are many more suggested R books on r-project.

C/C++

"Writing in C or C++ is like running a chainsaw with all the safety guards off" - Bob Gray

C, what a great language! Full control of everything! Pointers! I love pointers. More seriously, learning C and C++ comes in handy to squeeze extra performance out of Python and R. (comic: xkcd.com # 138)

  • The C Programming Language (1978), Dennis M. Ritchie, Brian W. Kernighan (got my original copy as a teenager, in a yard sale. Long lost, but got another copy of the same vintage)
  • The C++ Programming Language (1986), Addison-Wesley, Bjarne Stroustrup (now in its 4th edition, 2013)
  • Neural Networks in C++ (1992), Wiley, Adam Blum
  • Thinking in C++ (1995), Prentice Hall, Bruce Eckel
  • Algorithms, Data Structures, And Problem Solving With C++ (1996), Addison Wesley, Mark Allen Weiss
  • Analog and Digital Filter Design in C (1996), Prentice Hall, Les Thede
  • C/C++ Annotated Archives: Code with Commentary (1999), Osborne, Art Friedman, Lars Klander, Mark Michaelis

Part III (technology) included some related books in the Operating System section (e.g. Linux System Programming). See also the GPU and Graphics section further down, as it's pretty much all C/C++. For the latest C++ language specification (C++17) see the current working draft.

Java/Scala

After spending much time learning the ins and outs of Object-Oriented Programming and learning C++, I got exposed to this language called Java. I was working for an R&D company in the field of telecom (and a customer of Sun Micro), and was tasked with evaluating Java for a set-top box (interactive tv/cable). It really wouldn't fit in the RAM we had, but I did learn to write applets. And JSP, and J2EE and so on and so forth. I now use it mostly to debug Hadoop issues. I've also worked some with Scala (JVM based) and Scalding (Cascading API for Scala), and I picked up many books in the process. (comic: xkcd.com # 801)
  • Thinking in Java (1998), Prentice Hall, Bruce Eckel (the last edition was the 4th, from 2006)
  • Beginning Java 2 JDK 5 (2005), Murach, Doug Lowe, Joel Murach, Andrea Steelman (Murach is still around and Java Programming SE 9 is about to be out)
  • Programming Scala (2009), Pragmatic, Venkat Subramaniam (Scala is also covered briefly in 7 languages in 7 weeks in the next section, and in the Pragmatic series you'll also find Clojure, Ruby and a few other languages similarly covered)
  • Scala for the Impatient (2012), Addison Wesley, Cay Horstmann (we are all impatient)
  • Machine Learning in Java (2016), PacktPub, Bostjan Kaluza (covers Weka, deeplearning4j, spark etc all at a very high level)
  • Think Java: How to Think Like a Computer Scientist (2016), Green Tea Press, Allen Downey, Chris Mayfield (made available for free by the authors, might be a good first book if you are just getting into Java)

And of course, the official Java Language Specification (SE 8) and Scala Language Specification (v.2.9 at this time)

Other/Multiple Languages

"When someone says: 'I want a programming language in which I need only say what I wish done', give him a lollipop" - Alan Jay Perlis
  • Apple II-6502: Assembly Language Tutor (1983), Prentice Hall, Richard Haskell (the book that helped me get really good at 6502 assembly language - the MOS Technology 6502 processor is not of much use nowadays, but learning assembly language can help you understand algorithms and CPUs better)
  • Programming for Artificial Intelligence: Methods, Tools and Applications (1991), Addison Wesley, Wolfgang Kreutzer, Bruce McKenzie (LISP, Scheme, Prolog, and Smalltalk)
  • A Guide to VHDL, 2nd edition (1993), Patricia Langstraat, Stanley Mazor (a hardware description language)
  • Programming Perl (1996), O'Reilly, Larry Wall, Tom Christiansen, Randal L. Schwartz (was evaluating Perl and Python for web applications back in these early days, and also had a copy of Effective Perl Programming)
  • Numerical Analysis and Graphic Visualization with MATLAB (1996), Prentice Hall, Shoichiro Nakamura (works for Octave too)
  • ASP: Active Server Pages (1997), Wiley, A. Fedorchek, D. Rensin
  • Murach C# (2004), Murach & Associates, Joel Murach, Doug Lowe
  • Data Crunching: Solve Everyday Problems Using Java, Python and more (2005), Pragmatic Programmers, Greg Wilson (sowed the seed of me switching from mostly C++ and Java to Python)
  • Web Standards Programmer's Reference: HTML, CSS, Javascript, Perl, Python (2005), WROX, Steven Schafer (all in one, although you'll probably need more recent references)
  • Mapping Hacks: Tips & Tools for Electronic Cartography (2005), O'Reilly, Schuyler Erle, Rich Gibson, Jo Walsh (a bit of everything in the world of cartography)
  • JavaScript: The Good Parts (2008), O'Reilly, Douglas Crockford (there is no going around this, you will have to learn some of it)
  • 7 languages in 7 weeks (2010), The Pragmatic Programmers, Bruce Tate (a last minute addition, suggested in a discussion with my ex co-workers Shannon, Don, and Corey. It briefly covers Scala, Haskell and Prolog, which are already covered elsewhere, and also Ruby, Io, Erlang and Clojure)
  • Pearls of Functional Algorithm Design (2010), Cambridge University Press, Richard Bird (Haskell, fairly advanced book, IMHO)
Sorry, I do not own a book on Julia, yet. (Clarification: in a dead tree format. In electronic format, I've read Mastering Julia, and started going through Julia Programming for Operations Research)

GPU and Graphics

  • Computer Graphics, C version, 2nd ed (1997), Prentice Hall, Donald Hearn, M. Pauline Baker
  • OpenGL SuperBible 2nd ed (2000), Waite Group, Richard S. Wright, Michael Sweet
  • CUDA: Application Design and Development (2011), Morgan Kaufmann, Rob Farber
  • Data Visualization with D3.js Cookbook (2013), PacktPub, Nick Qi Zhu
  • Professional CUDA C Programming (2014), WROX, John Cheng, Max Grossman, Ty McKercher

If all of these are not enough (well really, if you don't mind reading on a computer screen or tablet), you can find many related books that are available for free in this list: free-data-science-books.

This is a wrap for books related to code. Next article (Part V) in the series will cover Visualization.

Francois Dion
Chief Data Scientist
@f_dion
