"ex-libris" of a Data Scientist, part I: Data

Francois Dion - Recent readings

abstract: I will cover some of the essential books for data science in a six-part series. After an overview, this one covers data and databases. (Also available: Part II: model, Part III: technology, Part IV: code, Part V: visualization and Part VI: communication)


Local user group presentations, speaking engagements at conferences and keynotes all have one thing in common for me: books.

Books on statistics, art, design, mathematics, artificial intelligence, biology, visualization, computer science, databases, etc. I bring between one and half a dozen books as props, to make a point by reading directly from them, or to show some unusual visualization (works best with large books, obviously). And in some cases, I bring bound booklets (32 or so pages) I wrote, to be given away as a companion to the presentation.

Some books I have brought in recent presentations include the 1900 Statistical Atlas (US Census Bureau), Calvin Schmid's Handbook of Graphic Presentation, Claude Shannon's The Mathematical Theory of Communication, John Tukey's EDA, a few volumes from the Machine Intelligence series edited by Donald Michie (some of them including previously unpublished papers by authors such as Alan Turing), Norbert Wiener's Cybernetics, Edward Tufte's Visual Display of Quantitative Information and many more classic and not so classic books, published from the early 1800s to the present.

Conversation


Perhaps it is unsurprising then to be asked for book recommendations or book lists. The conversation usually goes like this:

"someone: So, what books would you recommend for someone like me who wants to get into data science?  
me: I do not know your background, what you already know. Is there an area you feel you really need to get better at?  
someone: Not really sure... I don't really know anything about data science.  
me: Ah, well let me see if I still have an extra copy of the Hitchhiker's Guide to the Open Source Data Science Galaxy. There is this diagram I need to show you..."
And then I proceed to show the above (Figure 1). After a few minutes, a question that was really about suggestions for a list of machine learning books expands to a much longer list... Data science is not a single thing. It is a tightly coupled interaction of multiple fields coming together to formulate a question and answer it to some degree of confidence.

Communication

At the top is communication. That is important because that is the contact between the data scientist and the "customer". The primary goal here is to define a question that needs answering. As John Tukey once said:
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise".
It would be tempting to start with a list of books tied to communication, but instead, I will jump directly to Data for this first part. We will cover communication last. I am planning to publish five more articles to cover all areas.

Data

Data handling, data munging, data wrangling. It is all about ingesting structured and unstructured data from multiple sources, cleaning it up, and vouching for its integrity, all the while ensuring adherence to policies, laws, and regulations.
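To make the ingest, clean, and vouch-for-integrity steps concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the two sources (`CSV_SOURCE`, `API_SOURCE`), the field names, and the integrity rules (a numeric id is required, ids must be unique) are hypothetical, and real pipelines will need rules dictated by the data and by the applicable policies.

```python
import csv
import io

# Hypothetical raw inputs standing in for two different sources.
CSV_SOURCE = "id,name,age\n1, Ada ,36\n2,Grace,\n2,Grace,\n,Nobody,20\n"
API_SOURCE = [{"id": "3", "name": "alan", "age": "41"}]


def clean(record):
    """Trim whitespace and coerce types; return None if the record is unusable."""
    raw_id = (record.get("id") or "").strip()
    if not raw_id.isdigit():
        return None  # assumed integrity rule: every record needs a numeric id
    age = (record.get("age") or "").strip()
    return {
        "id": int(raw_id),
        "name": (record.get("name") or "").strip().title(),
        "age": int(age) if age.isdigit() else None,  # keep missing ages explicit
    }


def ingest(csv_text, api_rows):
    """Merge both sources, dropping unusable rows and duplicate ids."""
    rows = list(csv.DictReader(io.StringIO(csv_text))) + list(api_rows)
    seen, cleaned, rejected = set(), [], 0
    for row in rows:
        rec = clean(row)
        if rec is None or rec["id"] in seen:
            rejected += 1  # count what was thrown away so it can be audited
            continue
        seen.add(rec["id"])
        cleaned.append(rec)
    return cleaned, rejected


cleaned, rejected = ingest(CSV_SOURCE, API_SOURCE)
print(cleaned, rejected)
```

Counting the rejected rows, rather than silently dropping them, is a deliberate choice here: being able to say how much was discarded and why is part of vouching for the data.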

The domain-specific language (DSL) for data is still SQL (read my article from 2015 "The report of SQL's death"). Hence, my list here will include at least one book focusing on that subject. It is also important to understand how to design databases and tables, and how indexing works. Computer science books on database design and implementation are thus very useful.
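As a small illustration of why indexing is worth understanding, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `readings` table, its columns, and the index name are hypothetical, chosen only for this example. It asks SQLite for its query plan before and after creating an index on the filtered column. The exact wording of the plan varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER PRIMARY KEY, sensor TEXT NOT NULL, value REAL NOT NULL)"
)
# 1000 rows spread over ten hypothetical sensors named s0 to s9.
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [(f"s{i % 10}", float(i)) for i in range(1000)],
)

QUERY = "SELECT AVG(value) FROM readings WHERE sensor = ?"


def plan(connection):
    """Return SQLite's description of how it would run QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("s3",)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn)  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
plan_after = plan(conn)   # the planner can now search via the index

average = conn.execute(QUERY, ("s3",)).fetchone()[0]
print(plan_before)
print(plan_after)
print(average)
```

The query and its result are identical in both cases; only the access path changes. That is the point of the database design books below: the same SQL can be cheap or expensive depending on how the tables and indexes were designed.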

Finally, it pays to know what makes a specific database tick. I will include a few books on PostgreSQL. If you use a different database technology, complement this list with books on whatever database system you use. You will soon learn that there are far more database systems in use among your customers than you ever thought possible.

The selection, in the order in which I read them (roughly):
  • Handbook of Relational Database Design (1989), Candace Fleming, Barbara von Halle
  • A First Course in Database Systems (1997), Jeffrey D. Ullman, Jennifer Widom (note: I've been told Database System Concepts by Silberschatz is a similar introductory book)
  • Transaction Processing (1993), Morgan Kaufmann, Jim Gray, Andreas Reuter
  • Distributed Databases (1984), McGraw-Hill, Ceri & Pelagatti
  • Bulletin of the Technical Committee on Data Engineering, Vol. 23 No.4, Special issue on Data Cleaning (2000).
  • Database System Implementation (2000), Prentice Hall, Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom (note: see also Database Systems: The Complete Book)
  • SQLite (2004), Sams, Chris Newman
  • SQL in a Nutshell: A Desktop Quick Reference (2008), O'Reilly, Kline, Kline & Hunt
  • Database Systems: The Complete Book (2002), Prentice Hall, Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom (note: Includes most of A First Course and of Database System Implementation, a solid book)
  • Information Quality Applied: Best Practices for Improving Business Information, Processes, and Systems (2009), Wiley, Larry P. English (a gentle, tip of the iceberg, 800-page intro to the subject)
  • Graph Databases (2013), O'Reilly, I. Robinson, J. Webber & E. Eifrem


O'Reilly also has a SQL Pocket Guide, 3rd ed. and a MySQL Pocket Reference that might be of interest. About 15 years ago, I got several of their pocket guides for Oracle PL/SQL and SQL*Plus, as well as tuning guides, and I used them quite often.

Next up, Hadoop:
  • Hadoop: The Definitive Guide (2009), O'Reilly, T. White.
  • Practical Hadoop Security (2014), Apress, Bhushan
  • Apache Hadoop YARN, Addison-Wesley, Arun Murthy et al.
  • HBase Design Patterns (2014), Packt, M. Kerzner, S. Maniyam


Hadoop books will get stale quickly due to the ever-changing landscape. You've been warned. I would suggest online resources for now.

Next, PostgreSQL specific books:
  • Practical PostgreSQL (2002), O'Reilly, Worsley & Drake (note: this was published 15 years ago, there are better options now)
  • PostgreSQL 9 Administration Cookbook (2010), Packt Pub., Riggs & Krosing
  • PostgreSQL 9 High Performance (2010), Packt Pub., Gregory Smith
  • PostgreSQL Replication (2013), Packt Pub., Boszormenyi & Schonig
  • PostgreSQL Up & Running (2014), O'Reilly
  • PostGIS in Action, 2nd ed. (2015), Manning, Regina Obe & Leo Hsu
  • The official PostgreSQL documentation at www.postgresql.org/

The items in the next list might be slightly more difficult to find, as they are database research papers.

Some are collected in volumes or are in proceedings from conferences, such as VLDB (the Very Large Data Bases conference), but if your local university has a decent library, you should be able to find them. There is enough detail here to search for and find them.


  • Relational Completeness of Database Sublanguages (1972), in Database Systems, Prentice Hall
  • The Design and Implementation of INGRES (1976), ACM Trans. on Database Systems Vol.1 No.3, M. Stonebraker, E. Wong, P. Kreps, G. Held
  • The Ubiquitous B-tree (1979), ACM Computing Surveys Vol. 11, No.2, Douglas Comer
  • Benchmarking Database Systems: A Systematic Approach (1983), Proc. 1983 VLDB Conference, D. Bitton et al.
  • The Design of Postgres (1986), Proc. 1986 ACM-SIGMOD Conference, M. Stonebraker
  • The Postgres Data Model (1987), Proc. 1987 VLDB Conference, L. Rowe, M. Stonebraker
  • Readings in Object-Oriented Database Systems (1989), Morgan Kaufmann, S. Zdonik, D. Maier (a collection of around 40 papers on the subject, for the ultra-adventurous... this type of DBMS is not exactly common nowadays)
  • The Implementation of Postgres (1990), IEEE Transactions on Knowledge and Data Engineering, M. Stonebraker
  • Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases (1990), ACM Computing Surveys, Vol. 22, No. 3, Sept. 1990, A. Sheth, J. Larson
  • Main Memory Database Systems: An Overview (1992), IEEE Transactions on Knowledge and Data Engineering vol.4 no.6, Hector Garcia-Molina, Kenneth Salem
  • Data Cube: a Relational Aggregation Operator Generalizing Group-by, cross-tab, and sub-totals (1996), Proc. Intl. Conf. on Data Engineering, J. N. Gray, A. Bosworth
  • Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations (1998), Proceedings of EDBT '98, Berchtold et al.
  • A Product Perspective on Total Data Quality Management (1998), Communications of the ACM vol.41, no.2, Richard Y. Wang
  • Query Processing Techniques for Arrays (2002), VLDB Journal 11, Marathe & Salem
  • The Google File System (2003), SOSP'03 ACM, S. Ghemawat, H. Gobioff, S. Leung
  • Data Stream Management Issues - A Survey (2003), University of Waterloo Technical Report CS-2003-08, L. Golab, M. Ozsu
  • Column-Stores vs. Row-Stores: How different are they really? (2008), SIGMOD'08 Vancouver, Abadi, D. et al.
  • The Hadoop Distributed File System (2010), MSST Conf. 2010, K. Shvachko et al.
  • A survey of B-tree locking techniques (2010), ACM Trans. Database Systems 35, G. Graefe
  • Security and Privacy in the Era of Big Data (2013), RENCI/NCDS, C. Schmitt et al.
  • A Review Paper on Big Data and Hadoop (2014), H. Bhosale, D. Gadekar
  • SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures (2014), 40th Conf. VLDB, Avrilia Floratou, U. F. Minhas, F. Ozcan
  • Time Series Databases (2015), Proc. of XVII Intl. Conf. /RCDL, Dimitry Namiot
  • Readings in Database Systems (1988, 2015) aka "Red Book", Morgan Kaufmann. Now in its 5th edition, this is a collection of many essential papers on the subject.

There are a lot more database papers out there, but that is a start. Next time, we will cover Models.

Suggestion: when you pull a bound volume from the library shelf to read one of the above, flip through the rest of the volume; you might find another paper of interest.

Francois Dion

Chief Data Scientist, Dion Research LLC

@f_dion

NB: Also published on LinkedIn at: https://www.linkedin.com/pulse/ex-libris-data-scientist-part-i-francois-dion/
