Data Infrastructures for the Rest of Us - II

L'empereur Hou-Pi-Lai, Chine, L'univers, Histoire et Description, Didot Frères, Paris 1837

"Elephants never forget", as the saying goes. Whether computers forget depends on the type of memory. Non-volatile memory (e.g. a hard disk drive) is larger and keeps its contents through power cycles, but is the slower type of memory. Volatile memory (i.e. RAM) is smaller and loses its contents when the computer is power cycled, but is typically a lot faster than non-volatile memory. That is why we put such an emphasis on RAM for data infrastructure in our previous post, and in this one.

How much is enough?

In the previous post, we covered just how much RAM usage can vary for data infrastructures, especially if many parallel processes are involved. To discover your own memory usage baseline, an initial approach might be to use a cloud service and see how much memory gets used. Or perhaps you can requisition a large virtual server from your IT department on a short-term basis.

To track memory usage while your jobs are running, you can open a console (there is even one in the Jupyter Notebook: New -> Terminal). From there, you can start htop (you might have to install it) or another similar tool.
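If the pipeline itself is in Python, you can also get a baseline from inside the process, with no extra tools. A minimal sketch using only the standard library (note that ru_maxrss is reported in KiB on Linux but in bytes on macOS):

```python
import resource


def peak_memory_mib():
    """Return this process's peak resident set size in MiB (Linux units)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # On Linux, ru_maxrss is in KiB; on macOS it is in bytes.
    return usage.ru_maxrss / 1024


# Allocate something sizable, then check the high water mark.
data = [0] * 10_000_000
print(f"peak RSS: {peak_memory_mib():.1f} MiB")
```

Sprinkling a call like this after each stage of a pipeline gives a rough per-stage memory profile for capacity planning.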

As an example, on this particular Linux server, I can type the command "free":

[user@machina003 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:      296988168    14711832   265781624     1271340    16494712   279852508
Swap:     285474812           0   285474812


The Mem: line tells me I have 265781624 KiB free (about 253GiB). For capacity planning, I would watch the used column. If I typed "watch free", the values would refresh automatically.
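The numbers free reports come from /proc/meminfo, so they are easy to pull into a script. A minimal sketch (Linux only; the MemAvailable field requires a reasonably recent kernel):

```python
def meminfo_gib():
    """Parse /proc/meminfo (values are in KiB) into a dict of GiB floats."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            # Values look like "265781624 kB"; keep just the number.
            fields[key] = int(value.split()[0]) / 1024**2
    return fields


info = meminfo_gib()
print(f"total: {info['MemTotal']:.1f} GiB, available: {info['MemAvailable']:.1f} GiB")
```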


Byte, Vol.8, No.1, January 1983, p.27
What if ... I run out of memory?

More, more, give me more!

Memory Management Units (MMU) for Intel processors allow for a 48 bit virtual address (2^48 = 256TiB), and the Linux kernel splits that space in half (as can be seen in this kernel memory map): 128TiB for user space and 128TiB for the kernel, of which half is a direct map of physical memory (RAM). While 64TiB is a fairly sizable amount of physical memory, practically, 96 DIMM slots is what is found in large servers; a more humble 12 or 24 is much more common. Even with 128GB DIMMs, that's only 1.5TiB to 12TiB. And once budget constraints kick in, maybe you end up with 256GiB of memory on the server.
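The arithmetic above is easy to sanity check in a couple of lines of Python:

```python
TIB = 2**40
GIB = 2**30

# 48-bit virtual addresses cover 256 TiB; Linux splits that in half.
assert 2**48 == 256 * TIB
assert 2**48 // 2 == 128 * TIB

# 12 to 96 slots populated with 128 GiB DIMMs.
print(12 * 128 * GIB / TIB)   # 1.5 (TiB)
print(96 * 128 * GIB / TIB)   # 12.0 (TiB)
```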

Let's Swap!

What if you run out of memory, but only on occasion? It can be frustrating to see a process run for several hours only to fail because you were missing 30GB of RAM. There is an alternative to RAM that might be just right, depending on the circumstances. I mentioned that the Linux memory map dedicates half of 128TiB to physical memory. We also used the "free" command above to obtain free and used Mem. The output also included a line for Swap.

With Linux, the whole memory is divided into pages. Those directly available are in RAM, so the processor can access them relatively fast (we will talk about cache in the next section). If memory is under pressure, the operating system can "swap" pages from physical memory to a slower, alternate "bank". While a page is in this alternate address space, it cannot be accessed directly by the processor; it has to be recalled from the swap space into physical memory first.

How does that help us? Often, there are multiple programs running alongside our data pipeline. Perhaps tools like redis, graylog, postgres, etc. are also deployed on the server. It is also quite possible we have interim results holding large amounts of memory that we don't use further down our pipeline. When the OS sees memory under pressure, if we have swap available, it will start moving currently unused pages to swap, freeing our valuable physical memory for current operations.

The downside is that swap is much slower, there is a heavy penalty every time we have to move pages back to memory, and in a worst case scenario, it is possible to end up swap thrashing (a problem since the early days of computing, as can be seen in this 1968 paper). On the plus side, it is built (transparently) into the operating system, and all it requires is some type of disk device. And by disk device, it could be anything from a conventional SAS or SATA "spinning rust" hard disk, to an SSD or even an NVMe device.

Typically, I match swap to RAM. 256GB of RAM? I try to use a drive that is about the same size (240, 250, 256, 288GB or something along those lines), given how cheap drives are, even SSD and NVMe. Maybe you were not able to get a signoff for the extra $4000 ([1]) for the additional memory, but $200 for an SSD is a no brainer.
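How aggressively the kernel moves pages to swap is tunable through vm.swappiness (0-100, default 60 on most distributions); lower values keep pages in RAM longer. A sketch of how you might inspect and tune it (requires root, and the right value is workload dependent):

```shell
# Current value
sysctl vm.swappiness

# Lower it for the running system (reverts at reboot)
sudo sysctl vm.swappiness=10

# Make it persistent by adding a line to /etc/sysctl.conf
# (or a file under /etc/sysctl.d/):
#   vm.swappiness = 10
```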

Adding swap space typically involves setting aside a drive in the server for that purpose and formatting it as swap during the install process; it will then automatically be made available to the operating system as swap. If the server is already built, then the device (i.e. /dev/sdX) will have to be set up using fdisk (partition id of 82), mkswap and swapon.

Example, assuming the disk I set aside for swap is /dev/sda (make sure you have the right device; once you type w, the partition table is written to disk!):

[user@machina003 ~]$ sudo fdisk /dev/sda
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x0d42f19f.

Command (m for help): p

Disk /dev/sda: 292.3 GB, 292326211584 bytes, 570949632 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0d42f19f

   Device Boot      Start         End      Blocks   Id  System

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 
First sector (2048-570949631, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-570949631, default 570949631): 
Using default value 570949631
Partition 1 of type Linux and of size 272.3 GiB is set

Command (m for help): t
Selected partition 1
Hex code (type L to list all codes): 82
Changed type of partition 'Linux' to 'Linux swap / Solaris'

Command (m for help): p

Disk /dev/sda: 292.3 GB, 292326211584 bytes, 570949632 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0d42f19f

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048   570949631   285473792   82  Linux swap / Solaris

Command (m for help): w

[user@machina003 ~]$ sudo mkswap /dev/sda1

[user@machina003 ~]$ sudo swapon /dev/sda1
[user@machina003 ~]$ swapon --show
NAME      TYPE        SIZE USED PRIO
/dev/sda1 partition 272.3G   0B   -2
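If you cannot set aside a whole drive, a swap file on an existing filesystem is an alternative (a bit slower, and not suitable for every filesystem). A sketch, assuming you want a 32GB file at /swapfile:

```shell
# fallocate is fast; fall back to dd on filesystems where
# fallocate'd files can't be used for swap:
#   dd if=/dev/zero of=/swapfile bs=1M count=32768
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile   # swap must not be world readable
sudo mkswap /swapfile
sudo swapon /swapfile
```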


So that it is persistent across reboots and power cycles, an entry for the device is needed in /etc/fstab:

#
# /etc/fstab
# Created by anaconda on Fri Dec  2 22:30:21 2016
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/sda1 swap swap defaults 0 0

Or you could use a UUID instead of the device path. To find out a device's UUID:

[user@machina003 ~]$ sudo blkid
/dev/sda1: UUID="98896cbe-57af-4e9b-b2ef-b9cdd62fc846" TYPE="swap" 
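With that UUID (taken from the blkid output above), the /etc/fstab entry becomes:

```
UUID=98896cbe-57af-4e9b-b2ef-b9cdd62fc846 swap swap defaults 0 0
```

The advantage is that the entry keeps working even if the kernel enumerates the drives in a different order and the device gets a different /dev/sdX name.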

Let's Cache!

As fast as RAM is (ranging from 6400MB/s for DDR3-800 type DIMMs to 25600MB/s for DDR4-3200 type DIMMs), bandwidth is not the only thing that matters. Latency is also crucial. Imagine the processor is executing, in a tight loop, an instruction that takes 2 nanoseconds. If we can feed this loop from a memory with a latency of a similar order of magnitude, we can keep the CPU busy. Otherwise, it will mostly be waiting on data to do the work.

Sounds good, right? Except that you decide to measure your RAM latency, and it is just a tad bit over 100 ns. Is there anything faster? But of course. Cache.

Pulling up the CPU information on my travel laptop (from 2017), lscpu gives me the following:

[user@laptop ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:              9
CPU MHz:               917.553
CPU max MHz:           3800.0000
CPU min MHz:           800.0000
BogoMIPS:              5600.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7

This particular processor (an Intel i7) has 4 entries for cache:
  • L1d cache is the Level 1 cache for data. It is, along with L1i, physically closest to the processing units
  • L1i cache is the Level 1 cache for instructions
  • L2 cache is still at the core level, but a little further away and slower, though larger
  • L3 cache is off core but on die, and is shared by all cores/threads

Logical representation of cache on an i7 4 core processor

On this particular CPU, the latency of L1d is about 1.46 nanoseconds (ns) (almost two orders of magnitude faster than RAM), about 3.6 ns for L2 cache, and about 10 ns for L3 cache (about one order of magnitude faster than RAM at 100 ns).

There is no configuration at the OS level to take advantage of L1-L3 caches; instead, it is essential to use the right techniques when developing pipelines (vectorized operations vs for loops, for example - learn more from Python for Data Analysis) and the right data structures (example).
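As a quick illustration of the vectorized vs for loop point, here is a sketch that sums a million doubles both ways (it assumes numpy is installed; the exact speedup will vary with CPU and cache sizes):

```python
import timeit

import numpy as np

data = np.random.rand(1_000_000)


def loop_sum(a):
    """Touch each element from the Python interpreter, one at a time."""
    total = 0.0
    for x in a:
        total += x
    return total


# np.sum runs in C over a contiguous buffer, streaming through
# cache lines instead of bouncing through the interpreter.
t_loop = timeit.timeit(lambda: loop_sum(data), number=3)
t_vec = timeit.timeit(lambda: np.sum(data), number=3)
print(f"for loop: {t_loop:.4f}s")
print(f"np.sum:   {t_vec:.4f}s")
```

Both produce the same result, but the vectorized version keeps the data flowing through the caches at full speed.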

Other types of Cache

There are other types of cache, most of them caching a slower device in RAM. Although these will improve performance, keep in mind that they will use some of your precious RAM. Others use extremely fast non-volatile memory, such as Intel Optane, NVMe, or battery backed RAID controllers. These come into play as you select storage based on the speed / cost / capacity tradeoff. [2]

Until next time

For now, as we continue this series, we will assume we have a server with enough RAM, with a swap device on an SSD matching our RAM, and a pair of mirrored drives to install the OS (when running a single machine vs a cluster, a boot drive failure would put us out of business, unless we have mirrored it on a second drive, through software like LVM or ZFS, or hardware RAID). We'll also assume we have one or more drives dedicated to our data storage (for multiple drives, ZFS is a pretty flexible option [2]). We'll address networking when we get to clusters, and GPUs when we get to model building.

In the next article we'll dive deep into software.

Francois Dion
Chief Data Scientist
@f_dion

[1] As a side note, if a $4000 expense for additional memory is hard to justify, it is quite likely that your data is either worthless or extremely undervalued...

[2] I'll probably cover ZFS in a future post, as it combines all kinds of techniques and hardware options for caching (intent log or ZIL, ARC and L2ARC).

