"ex-libris" of a Data Scientist, part III: Technology

Patent No. 395,783
abstract: I will cover some of the essential books for data science in a six-part series. This part III covers technology (see part I: data and databases, part II: model, part IV: code, part V: visualization and part VI: communication).

"Les inventions qui ne sont pas connues ont toujours plus de censeurs que d'approbateurs" ("Inventions that are not known always have more critics than supporters") - Blaise Pascal


Continuing this series, I will now focus on technology. Depending on the size of the company, it is possible you have no real opportunity to architect complete solutions. Or it could be that within your data science team, one (or more) individual is already a strong technologist, so you might say, why bother learning this stuff? In order to provide solutions that are scalable, a data scientist needs to understand this area in more than a superficial way.

A Bit of History

As always, let's take some time to learn from history, because many cycles in technology repeat themselves. Furthermore, the novelty or age of something has nothing to do with how useful or how good it is.

(Picture: small detail of front panel of my CES Industries Ed-Lab Microcomputer Lab model # 804)



  • On Computable Numbers, with an Application to the Entscheidungsproblem (1936), London Mathematical Society, Alan Turing (available at the Turing archive, mathematical description of an imaginary universal computing machine, the Turing machine)
  • A Symbolic Analysis of Relay and Switching Circuits (1938), MIT, Claude Shannon (Shannon's thesis, available here, analysis and synthesis of circuits)
  • First Draft of a Report on the EDVAC (1945), University of Pennsylvania, John von Neumann (food for thought: the foundation of computing, at least from a logical standpoint, is based on a design that was a first draft... It can be found in many places, for example, as an appendix to a book on computer architecture and security here)
  • The Mathematical Theory of Communication (1949), Univ. of Illinois Press, Claude Shannon, Warren Weaver (one of the most important texts in technology; it contains many interesting points on redundancy, entropy, n-grams in languages, etc. The original Bell System paper, A Mathematical Theory of Communication, from 1948, is available here)
  • Machine Intelligence 5 (1969), Edinburgh University Press, B. Meltzer and D. Michie, Eds. (includes the previously unpublished Intelligent Machinery, written in 1948 by Alan Turing, a precursor to the famous Turing test paper published in 1950 in the journal Mind)
  • The Soul of a New Machine (1981), Little, Brown & Co., Tracy Kidder (Pulitzer prize, some amusing anecdotes)
  • Bit by Bit: An Illustrated History of Computers (1984), Ticknor & Fields, Stan Augarten (the history of computing goes back further than you think...)
  • Hackers: Heroes of the Computer Revolution (1984), Doubleday, Steven Levy (a classic, from the early MIT stuff to Woz and beyond)
  • The Cuckoo's Egg (1989), Doubleday, Clifford Stoll (the other type of hackers, about computer security and problem-solving)

Hardware

One generation ago, hardware meant a slide rule, and two generations ago, an acetate ruler (for nomography).

A modern data science hardware stack relies on servers using one or multiple multicore processors, various bus speeds, various PCI standards, various types of network interconnections, various types of disk drives, protocols, serial and parallel operations and many more specialized components such as GPUs and FPGAs. They are also often part of a group of such machines, with data nodes, head nodes, controllers, switches, file servers, all interacting with each other in very complex ways.

(Picture: Sun Hemmi slide rule and acetate ruler on a nomography chart, from my collection)



  • Computer Organization 2nd edition (1984), McGraw-Hill, V. Carl Hamacher, Zvonko G. Vranesic, Safwat G. Zaky (now in its 6th edition, 2012. If you want to really understand how computers work, down to the logic gates, that's one book to get)
  • Internetworking with TCP/IP Vol. I (1988), Prentice Hall, Douglas Comer (all about protocols and architectures)
  • Internetworking with TCP/IP Vol. II 2nd Ed. (1994) Douglas Comer (on design, implementation internals, optional for most people)
  • The Irwin Handbook of Telecommunications, 3rd ed. (1997), McGraw Hill, James Harry Green (I worked in Telecom and Broadcasting at the time, a bit on the specialized side of things...)
  • Principles and Practices of Interconnection Networks (2003), Morgan Kaufmann, William James Dally, Brian Patrick Towles (everything you want to know about the hardware side of interconnections, from things like the PCI bus to InfiniBand to unusual network architectures, down to bits and signals)


Reading about hardware is useful, but this is one area where hands-on experience is critical. Even if your budget is very small, you can build a mini cluster with a few Raspberry Pi 3 Model B boards, as they cost only $35 each (or even less for the Pi 2). Add standoffs to stack them, some power supplies, SD cards, network cables, a switch, a keyboard, a mouse, and a monitor. Then figure out how to network them, how to automate the installation, how to deploy distributed code, how to monitor them, etc. Do this before you even start using cloud-based virtual servers or services. You can also use a single Raspberry Pi as a sidekick to your main computer or laptop.
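As a starting point for the networking and automation steps, here is a minimal sketch in Python that generates name-to-address mappings for such a mini cluster. The hostnames (pi01, pi02, ...), the 192.168.1.x addresses, the group name, and all function names are assumptions made up for illustration; adapt them to your own switch and router setup. It only produces text: an /etc/hosts fragment and an INI-style inventory of the kind Ansible reads.

```python
import ipaddress


def cluster_nodes(count, first_ip="192.168.1.101", prefix="pi"):
    """Return (hostname, ip) pairs for `count` nodes on consecutive addresses.

    The address range and the hostname prefix are hypothetical defaults.
    """
    start = ipaddress.IPv4Address(first_ip)
    return [(f"{prefix}{i + 1:02d}", str(start + i)) for i in range(count)]


def hosts_file(nodes):
    """Render /etc/hosts lines so the nodes can reach each other by name."""
    return "\n".join(f"{ip}\t{name}" for name, ip in nodes) + "\n"


def ansible_inventory(nodes, group="picluster"):
    """Render an INI-style inventory listing every node in a single group."""
    lines = [f"[{group}]"]
    lines += [f"{name} ansible_host={ip}" for name, ip in nodes]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    nodes = cluster_nodes(4)
    print(hosts_file(nodes))
    print(ansible_inventory(nodes))
```

Keeping the node list in one place and deriving every config file from it is one simple way to make the installation repeatable when you add a fifth or sixth board.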

Storage

I will keep this brief because I never bought many books covering this. Instead, I've downloaded many manuals over the years, provided by hardware vendors and hard disk and solid state device manufacturers.

(Picture: Storage for compute nodes #1 and #2 of the NSC-01 High Altitude Balloon, the first amateur computer cluster launched into near-space)








  • The RAIDbook: A storage system technology handbook, 6th ed. (1997), Paul Massiglia (everything one needs to know about hardware RAID, but also read up on RAID as implemented in ZFS and Btrfs)
  • Building Storage Networks (2003), Osborne, Marc Farley (everything and the kitchen sink on SANs and alternatives)
  • Solaris 10 ZFS Essentials (2010), Sun Microsystems Press, Scott Watanabe (although the book is for Solaris, ZFS is also used on Linux, Mac, and BSD)

GPU and Graphics

None of these are Visualization books, as they will be covered in their own section (part V). This is more about the technology side of things. CUDA programming and OpenGL will be in the next section (part IV). I'm including two older publications in case someone is interested in what was state of the art for desktop computer graphics programming in the 80s.

(Picture: Apple ][ compatible, color mixed mode of low-resolution graphics and scrollable text on a CRT television)



  • Microcomputer Graphics for the Apple ][ (1982), Addison-Wesley, Roy Myers
  • IBM PC and PS/2 Graphics Handbook (1989), Lance Leventhal Microtrend, Edward Teja
  • Computer Graphics: Principles and Practice, 2nd Ed. (1993), Foley, van Dam, Feiner, Hughes
  • Point-Based Graphics (2007), Morgan Kaufmann, Markus Gross editor (this goes wider than the title would suggest)
  • An Introduction to Computer Graphics and Creative 3-D environments (2008), Springer, Barry G. Blundell (not sure how easy it is to find, but it covers a broad range of subjects)
  • Information Theory Tools for Computer Graphics (2009), Morgan & Claypool, Mateu Sbert, Miquel Feixas, Jaume Rigau, Miguel Chover, Ivan Viola
  • GPU-based acceleration of selected clustering techniques (2010), Politechnika Slaska, Grzegorz Karch (thesis, available here, a specific application of GPU)
  • Analyzing General-Purpose Computing Performance On GPU (2015), California Polytechnic State University, Fanfu Meng (thesis, available here, a more general introduction to the subject)

HPC

(Picture: Cray-2 liquid-cooled supercomputer, (c) 2013 Steve Hatle)


  • SCI: Scalable Coherent Interface Architecture and Software for High-Performance Compute Clusters (1999), Springer, Hermann Hellwagner, Alexander Reinefeld (including this as an example of early low-latency interconnects)
  • Parallel I/O for High-Performance Computing (2000), Morgan Kaufmann, John M. May (sure, we have the feather format for R and Python, but that's not all there is)
  • Distributed Systems: Concepts and Design, 3rd Ed. (2001), Addison-Wesley, George Coulouris, Tim Kindberg, Jean Dollimore (covers a lot of material, was the best broad coverage book on the subject in 2001. The 5th edition is from 2012)
  • Tools and Environments for Parallel and Distributed Computing (2004), Wiley, M. Parashar, Salim Hariri

Operating Systems


Although it is entirely possible to do every data science task in Windows, the fact is that the majority of cloud-based solutions are based on a Linux system. If you are a Windows-only person, then you could jump to the automation section. Having said that, even if you control your environment, you will definitely interact with many *nix systems (not just Linux, but all kinds of different systems). And of course, you might be using a laptop with OS X, which is very much Unix-like.

(Picture: OpenIndiana 151 dual Xeon workstation with VNC to Mac, circa 2012)


  • The Unix System (1982), Addison-Wesley, S. R. Bourne (the classic original reference for Unix)
  • Unix System Programming (1987), Addison-Wesley, Keith Haviland, Ben Salama (although about programming, this gives insight into how Unix and Unix-like systems work)
  • Red Hat Linux Unleashed, 2nd ed. (1998), Sams, David Pitts et al. (replace with your OS flavor and get as recent a publication as you can)
  • Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture, 2nd edition (2006), Sun Microsystems Press, Jim Mauro, Richard McDougall (unless you are using Solaris/OpenSolaris/SmartOS, I wouldn't get this, but it is an amazing book, along with Solaris Performance and Tools. I wish there was a combo of this caliber available for a Linux distro)
  • Linux System Programming, 2nd edition (2013), O'Reilly, Robert Love (again, similar thought to reading Unix System Programming)


And for Linux, don't forget about the Linux Documentation Project at tldp.org

Performance

"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming." - Donald E. Knuth

  • Performance Tuning for Linux Servers (2005), IBM Press, Sandra K. Johnson, Gerrit Huizenga, Badari Pulavarty (Gregg's Systems Performance is a more recent book that covers a lot more ground)
  • Solaris Performance and Tools (2007), Sun Microsystems Press, Jim Mauro, Richard McDougall, Brendan Gregg (same comment as for Solaris Internals, better to get Systems Performance by Gregg)
  • DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD (2011), Pearson, Brendan Gregg, Jim Mauro (and as of this year, for Linux)
  • Systems Performance: Enterprise and the Cloud (2013), Prentice Hall, Brendan Gregg (the reference on systems performance for data scientists, system administrators, data engineers, etc. See also Brendan's Linux Performance Analysis in 60,000 Milliseconds)

Automation & Configuration Management


  • Open Source Development with CVS (1999), Coriolis, Karl Fogel
  • Continuous Integration: Improving Software Quality and Reducing Risk (2007), Addison Wesley, Paul M. Duvall
  • Pro Git (2009), Apress, Scott Chacon (A 2nd edition is available here)
  • Jenkins: The Definitive Guide (2011), O'Reilly, John Ferguson Smart (can automate a lot of things and scales quite a bit)
  • Software Build Systems, Principles and Experience (2011), Addison-Wesley, Peter Smith
  • Continuous Delivery and DevOps: A Quickstart Guide (2012), Packt Pub, Paul Swartout
  • Ansible: Up and Running (2014), O'Reilly, Lorin Hochstein (my personal favorite, but other options include Chef, Salt, etc.)
  • Docker: Up and Running (2015), O'Reilly, Karl Matthias and Sean P. Kane (pretty common for development instances)

Web, Cloud, Security

(Picture: Enigma I cipher machine, Museo Scienza e Tecnologia, Milano)

  • Foundations of Security (2007), Apress, Neil Daswani, Christoph Kern, Anita Kesavan
  • Security Engineering, 2nd Ed (2008), Ross Anderson (content available here)
  • Web 2.0 Architecture (2009), O'Reilly, James Governor, Dion Hinchcliffe, Duane Nickull (has it been that long already?)
  • Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance (2009), O'Reilly, Tim Mather, Subra Kumaraswamy, Shahed Latif
  • The Cloud at your Service (2010), Manning, Jothy Rosenberg, Arthur Mateos
  • Creating the Infrastructures for Cloud Computing (2011), Intel Press, Enrique Castro-Leon
  • Mastering Nginx (2013), Packt Pub, Dimitri Aivaliotis (a fairly popular web server. If you use a different one, get the documentation for that)
  • Abusing the Internet of Things (2015), O'Reilly, Nitesh Dhanjani
Add to this cloud computing vendor-specific documentation.

Command Line


  • GNU Make (1996), Free Software Foundation, Richard Stallman
  • Sed & Awk (1997), O'Reilly, Dale Dougherty, Arnold Robbins
  • Data Science at the Command Line (2014), O'Reilly, Jeroen Janssens (a sort of recap of many of the tools discussed in the other books and how to combine them - UPDATE: the author has made the content available for free)
  • Pro Bash Programming 2nd Ed (2015), Apress, Chris Johnson, Jayant Varma (substitute with a reference for your favorite shell, be it PowerShell, zsh, csh, etc.)


This, once again, turned out to be longer than I anticipated, so I limited the papers to the History section. Next article (part IV) in the series will cover Code.

Francois Dion

Chief Data Scientist, Dion Research LLC

@f_dion
