Data Infrastructures for the Rest of Us

Once upon a tweet

Having built HPC solutions and data infrastructures for several companies over the years, the following recently caught my attention:

This tweet by Angel D'az brought up a very good point: lots of over-engineering and little of keep it simple. It was also a theme I explored in a tutorial I did at the IEEE SouthEastCon in 2017: "Getting to know your data at scale".

Keep It Simple

"Do you even have a baseline of your usage? Probably not"
When we talk about building a data infrastructure, we are talking about hardware and technology as much as we are talking about software. If you feel like filling up your list of readings on the subject, I'll point you to my "ex-libris" of a Data Scientist, part III (Technology).

Your first reaction might be "let's build a cluster". But that's not exactly keeping it simple, especially if you don't need to, at least not right away.

It reminds me of a discussion I had with a system administrator, almost 20 years ago. His lab (at Virginia Tech in Blacksburg, VA) was replacing Sun Microsystems "big iron" servers with large scale clusters, and he was lamenting how much maintenance work that was. I had built Linux clusters, and had acquired a Sun Enterprise 5000 (I liked the 5000 cabinet better as it was not as tall, but used E5500 center plane and internals for better performance) and could definitely relate.

Moving a Sun Enterprise 5000 off a pallet

Managing a single server with many processors (14 in my case) and enough memory for running simulations (28GB, in an era where your typical desktop had about half a GB) was a lot easier than managing a large number of small Linux servers in a Beowulf cluster, as the software and automation was not mature yet. But the individual servers were cheaper to buy, and easier to move around. (Picture: Don't be fooled by the flat paper people in the diagram from the Sun installation manual, these servers were not lightweight).

But, since we are almost in 2020, does it still hold true? For one thing, servers have been packing more and more performance, memory, storage and IOs in smaller and smaller packages. Prices have also gone down drastically. What this means is that, quite often, you can do a lot of data science and data work with a single commodity server, believe it or not.

As to why not start right away with a cluster, set everything up in distributed fashion and work every project like it needs a gigantic data infrastructure? Do you even have a baseline of your usage? Probably not. In this post, we will assume a bare metal install Linux server, but this could also be provisioned from metal as a service, a virtual server in your data center or a public cloud provider. In a future post we will address some of these more directly.

Memory Lane

"La memoire est necessaire pour toutes les operations de la raison" -- Blaise Pascal

If you have been working on some data analysis, data engineering or data science task in general, you've no doubt faced a scenario where you ran out of memory. RAM, short for Random Access Memory is a lot smaller than your disk space, but unfortunately, this is where you want to hold your data (we will come back to the why and to caches in a future post). If you've searched the web for a solution to this problem, you've become acquainted with the expression "out of core" (see the paper "The input/output complexity of sorting and relatedproblems" for one of the early treatment on the subject, from 1987). And suddenly, an in memory operation, something that was very straightforward, is now something altogether different.

two 32 slot mezzanine memory boards for a 2U server

Once more, it doesn't have to be the only way. After all, you could go and buy a server with 4 x Intel Xeon E7, each having 24 memory slots. Using 128GB memory DIMMs, we get 4 x 24 x 128GB, or 12 TB in a single server. That might be more than enough. Or it might be totally overkill... Also, top capacity DIMMs will command top dollars, and cloud based servers with high memory capacity are unrealistically priced. Leasing or buying is usually cheaper.

So, how much memory does one really need? That is a very tricky question to answer. Today, we will look at this from a data ingestion, data operation and cores/threads perspectives. As this is about data infrastructures, we'll use Python as the main language, and we will use Pandas dataframes to keep things simple (memory usage will only grow as you use other tools like Airflow, scikit-learn or in memory databases)

Ingest, Digest

If we start at the data ingestion end, we might reason that with 16GB, we should be able to load a data file that is 12GB in size, no problem. But what format is that data file? Is it compressed? What type of variables? A lot of category codes or 0/1 flags? Is it dense or sparse data? Some factors will contribute to less memory than disk space (loading a dense representation on disk of a sparse matrix into a sparse matrix in memory), but the majority will contribute to using more memory.

Let's start with an example. I will load a file that only takes 200MB or so on disk:

In [1]: import os, sys

In [2]: import pandas as pd

In [3]: os.path.getsize('/data/acq/Acquisition_2003Q3.txt')/1024/1024
Out[3]: 194.37187385559082

Let's load this into a Pandas dataframe:

In [4]: df = pd.read_csv('/data/acq/Acquisition_2003Q3.txt', header=None, delimi
   ...: ter='|')

In [5]:'deep')                                           

RangeIndex: 1738620 entries, 0 to 1738619
Data columns (total 26 columns):
0     int64
1     object
2     object
3     float64
4     int64
5     int64
6     object
7     object
8     int64
9     float64
10    float64
11    float64
12    float64
13    object
14    object
15    object
16    int64
17    object
18    object
19    int64
20    float64
21    object
22    float64
23    float64
24    object
25    int64
dtypes: float64(8), int64(7), object(11)
memory usage: 1.3 GB

So we have 26 columns in all, a mix of decimal numbers (float64), integers (int64) and strings (object). The plain text csv went from 195MB all the way to 1.3GB, more than 6x increase (without memory_usage='deep', pandas reports only 344MB used). So, realistically, if we only have 16GB we will want to stick to a maximum of about 2GB-2.5GB, just to have the data in memory.

Even when "digesting" the data, removing a column here and there, or converting data types to use smaller representations (float32 instead of float64, int32 instead of int64, categorical integers to category etc), we still have to load the file in memory. Sure, you can use tricks, like read the csv in chunk and convert to smaller data types in some cases. Maybe this will allow you to read a slightly larger file, but realistically, on real world data, you might not get as much of a boost as you would think. Examples on the web will show you 50% reduction in memory consumption when switching from 64 to 32 bit floats and ints. Yet, let's try it here:

In [6]: float_cols = df.select_dtypes('float64').columns                       

In [7]: df[float_cols] = df[float_cols].astype('float32')

In [8]: int_cols = df.select_dtypes('int64').columns                       

In [9]: df[int_cols] = df[int_cols].astype('int32')                            

In [10]:'deep')                                           

RangeIndex: 1738620 entries, 0 to 1738619
Data columns (total 26 columns):
0     int32
1     object
2     object
3     float32
4     int32
5     int32
6     object
7     object
8     int32
9     float32
10    float32
11    float32
12    float32
13    object
14    object
15    object
16    int32
17    object
18    object
19    int32
20    float32
21    object
22    float32
23    float32
24    object
25    int32
dtypes: float32(8), int32(7), object(11)
memory usage: 1.2 GB

We saved 100MB, and still using 1.2GB, disappointingly. [1]

Rule of thumb: roughly 2 to 7x a csv file size (if you are instead using a query, you would need to run that query in a client that provides client side statistics, such as size of returned result set, or save the query to a csv)

Data Operations

The most obvious data operation is a copy. Obviously, each time you do an actual copy of your data, you add to your memory usage. Any method in Pandas that has an in_place attribute will make a copy if this attribute is False (default). So, although you might not do a copy explicitly, you can easily do one implicitly. This can quickly contribute to doubling or tripling your memory usage. Sometimes you have no choice, especially if you need to reference the original data later on in the processing.

Another operation that is pretty common in machine learning and feature engineering (often part of a data pipeline) is that of "one hot" encoding (called dummy or indicator variables in Pandas). Let's say you have a column that has 3 categories (B, C and R) and you also allow null values (or NaN). This means you would want 4 new columns. Every row where B is equal to 1 (and hence the other 3 values are 0) would indicate that the row is of category B:

In [11]: pd.get_dummies(df[1], dummy_na=True)                                   
         B  C  R  NaN
0        0  1  0    0
1        0  1  0    0
2        1  0  0    0
3        0  0  1    0
4        0  0  1    0
5        0  1  0    0
6        0  0  1    0
7        1  0  0    0
8        1  0  0    0
9        1  0  0    0

So let's have a look at our data, using describe (include='all' will include non numerical fields), looking at the unique column (I used .T to transpose the table to fit the blog) :

In [12]: df.describe(include='all').T                                           
          count unique     ...               75%          max
0   1.73862e+06    NaN     ...       1.07496e+09  2.14748e+09
1       1738620      3     ...               NaN          NaN
2       1738620     16     ...               NaN          NaN
3   1.73862e+06    NaN     ...             5.625          9.5
4   1.73862e+06    NaN     ...            201000       621000
5   1.73862e+06    NaN     ...               360          360
6       1738620     53     ...               NaN          NaN
7       1738620     53     ...               NaN          NaN
8   1.73862e+06    NaN     ...                80           97
9   1.73851e+06    NaN     ...                80          199
10  1.73853e+06    NaN     ...                 2            9
11  1.68378e+06    NaN     ...                40           64
12  1.73289e+06    NaN     ...               774          850
13      1738620      3     ...               NaN          NaN
14      1738620      4     ...               NaN          NaN
15      1738620      5     ...               NaN          NaN
16  1.73862e+06    NaN     ...                 1            4
17      1738620      3     ...               NaN          NaN
18      1738620     54     ...               NaN          NaN
19  1.73862e+06    NaN     ...               852          999
20       203065    NaN     ...                25           50
21      1738620      1     ...               NaN          NaN
22       766129    NaN     ...               779          850
23       203065    NaN     ...                 1            2
24      1738620      2     ...               NaN          NaN
25  1.73862e+06    NaN     ...            200303       200303

[26 rows x 11 columns]

It appears we have 11 columns that are probably categorical or ordinal. Those that are categorical would have to be encoded as new columns, depending on what needs to be done down the pipe, and what algorithm is used. Let's encode columns 1, 2, 6 and 7:

In [13]: df = pd.concat([df, pd.get_dummies(df[[1,2,6,7]])], axis=1)            

In [14]:'deep')                                           

RangeIndex: 1738620 entries, 0 to 1738619
Columns: 151 entries, 0 to 7_12/2002
dtypes: float32(8), int32(7), object(11), uint8(125)
memory usage: 1.5 GB

The dummy encoding of 4 columns expanded our 26 original columns to 151 columns. Even though these 125 new columns are uint8 (a byte or 8 bits), we have now grown our memory usage by 300MB (more than the whole data set on disk!), for a total of 1.5GB (it would have been 1.6GB if we didn't convert 64 bit variables to 32 bit variables). Note that we haven't removed the original columns, because we want to see the high water mark of memory usage, for capacity planning.

Another operation we might have to do is on objects. Text, or strings of characters, might need to be vectorized based on things like counts and on units such as words, ngrams of characters, ngrams of words, or other similar transformations that will once more add columns to our dataframe in memory. Whereas categoricall values typically add less than 100 columns each, text vectorizers can easily create millions of new features. If you work mostly with large corpus of text, you will likely need as much memory as you can afford.

The last operation we will mention here is a plain in memory pairwise comparison. Let's say we want to calculate a metric between every pair of rows so we can group them. This means we will need to store this computed metric in a matrix that is number of rows x number of rows in size. For our 1.7 million rows example, that would mean 3022799504400 values. Even if we are able to store each result in a byte, that will require 2.7TB of memory. You are approaching the limit of what is reasonable to do purely in memory, on a single server.

Rule of thumb: choosing the right encoding or the right algorithm can make or break a project, and brute force is not always possible.

Cores / Threads

Intel Xeon Gold 5217 specifications

Here, we will use the Intel Xeon Gold 5217 as an example of a CPU that you might use in a data science server. This is a low cost server CPU ($1522 suggested retail). Yet, as seen on the specs above, it provides 8 cores, each capable of 2 threads, for a total of 16 threads. We note also that each CPU can address up to 1 TB of memory. If we buy a dual socket server, that makes for a total of 2 TB of RAM, and 32 total threads. As we will see, that is more than enough RAM capacity, but this will come at a price, as each 128GB DDR4-2667 DIMM costs more than the CPU, and you would need 16 of them (about $32,000...).  64 GB modules are half the capacity and about a quarter the price. This would give 1TB of RAM (about $8,000) while 32GB DIMMs would provide 1/2 TB (512GB), so on and so forth.

Given 32 threads, do we have to have 2TB? What do threads have to do with memory usage?

HTOP, or the Matrix?

Let's say you run a single process, and it takes 32 minutes to run on a single core. Everything is processed sequentially. You monitored the memory usage while it was running and it was around 20GB. Reading the documentation, you notice a parameter to this function (say, a grid search for cross validation) named n_jobs. It states it will run the process in n_jobs parallel tasks. You are so eager to run this on your newly acquired dual Xeon Gold server. You installed 16GB DIMMs in it for 256GB. You run the process and within a minute, you get an out of memory error.

A serial task comprised of hundreds of individual sub-tasks consuming 20GB of RAM will only consume as much memory as the maximum an individual sub-task consumes, but take the full amount of time (in our fictitious example, 32 minutes). Running this in parallel on our 32 threads would in theory run up to 32 times faster (1 minute!), but can potentially use 20GB x 32 threads, or 640GB which is more than is available.

Rule of thumb: a process consuming at most Y GB can end up using Y GB x n_jobs.


We have seen that memory consumption of data in memory can easily be 6x the disk usage of the same data. Some copies might multiply this further by 2x or 3x, while encoding can further add 10 to 50% to this. The number or cores / threads run in parallel also will multiply the memory usage in a manner that is directly proportional to the number of threads used. Depending if each thread copies all the data (such as in a grid search) or splits the data across threads will impact the multiplier, but going by the rule of thumb, assume an n_jobs multiplier. And of course, some algorithms just cannot scale and you'll run out of memory in all but trivial cases.

We'll conclude with an example of calculation using all the estimated multipliers (in memory representation of 6x, copy at 2x, 30% for encoding, 32 threads, worst case scenario):

File on disk: 1GB
Memory usage = 1GB x 6 x 2 x 1.3 x 32 = ~500GB  [2]

Until next time

In practice, I've used a server running Jupyter Hub and the R and Python kernels with 512GB of RAM on much larger data files, but it does illustrate the many ways we can run out of memory, even when we think we have enough.

We still have a few more points to discuss before we complete our hardware configuration and start installing software, but that will have to wait for a future blog post.

Francois Dion
Chief Data Scientist

[1] In this example, if we convert objects to category, we would reduce memory usage by a significant amount, down to 117MB due to the low number of categories. Dummy encoding from that generates 197 new columns, for a total of 444MB. That's a lot better, of course and represent a best case scenario. For capacity planning, we err on the cautious side and look at a typical, slightly worse case scenario.

[2] Assuming a scenario on the other end that 1GB on disk takes only 2GB in memory (2x), we would see a memory usage = 1GB x 2 x 2 x 1.3 x 32 = ~ 167GB