Data Infrastructures for the rest of us - III - software

Abstract: In part I, we highlighted how memory and CPU requirements impact your data infrastructure, your data engineering efforts and how the right algorithm and the right technology makes or break a data science project. In part II, we continued on hardware and caching, and mentioned ZFS and also the ZFS survival guide. In this latest installment, we briefly touch on virtualization with LXD and go on to install some really useful software to have in your data infrastructure, like PostgreSQL.

It's All Virtual

Why do people like cloud computing? No, it's not the price. Most developer give no attention to questions of operating costs, ROI, support etc. What they do care about is getting a brand new "server", whenever they need one. Sure, the server might only have 2 CPU cores and 4GB of RAM, but all they had to do is click on provision, or run a command line tool, and their server is up and running. Then it is time to experiment. Get onto the new server, try things, mess up, no problem! Delete the VM and restart!

In reality, this concept has been around for many years, and doesn't require the cloud. As we are continuing the series "Data Infrastuctures for the rest of us", and we started with a single (although premium) physical server, we will briefly cover virtualization, so that, if you so choose, you can install each software, or group of software on its own virtual "server", but yet stay on a single physical machine.

Heavy is the Head that Wears the Crown

Hardware Virtualizer - Honeywell Patent No. 4253145 (1981)
When the first business computers (mainframes, big iron and Unix systems) started using virtualization, some were hardware based (subdividing hardware resources), such as what IBM later offered with LPARs (and other similar technologies like Sun LDOMs, HP nPar etc) while at the other end of the spectrum, Unix V7 (1978) introduced the chroot() system call (under the chdir(2) heading), changing the apparent root directory of the current running process, and children.

Neither solutions were optimal. On one end, you needed to pre-allocate RAM, CPUs, serial ports, network interfaces, disks etc to the heavy hard partitioning, and on the other end, with chroot, you couldn't restrict much anything, and everything was on the same network, serial connection, so on and so forth. Virtual machines with a full hypervisor (think Xen, kvm, VMWare, Virtualbox) replicated in software what computers of old did with hard partitions. Once more, sub-optimal use of resources (each VM uses dedicated memory, that it uses it or not).

Then, in 2004, came a lighter type of hypervisor in the form of (Sun Micro) Solaris Zones. This allowed some control over what resources could be used by each zone, and at the same time allowed sharing of storage pools, memory, CPU. Eventually it also supported multiple versions of Solaris and of Linux, and full software defined networking. It was now possible to replicate a whole data center architecture on a laptop...

Containers like it's 2020

Of course, since we are working on a Linux server, Solaris Zones / Containers are not an option, but in 2008, LXC appeared on Linux, and later was made simpler to use with LXD, ZFS (storage) and bridge-utils (networking), replicating in large part what Solaris Containers did, but for Linux. This is not a full how-to on LXD. You should get comfortable and learn this on your own, not just follow a quick guide. Or hire my company or something.

As far as installation goes, I would highly recommend to install ZFS (see the survival guide) and create a pool for LXD first.

If you only have a single network interface card to your main network, and use Ubuntu 18.04 LTS, you can follow this: LXD in 4 easy steps.

It's complicated

However, if your networking is anything but basic (ie. it incorporates infiniband, bonded interfaces, MAC vlans or other advanced features), proceed instead to install the Linux bridge interface and configure and test it, before installing LXD. On Ubuntu or debian:

[user@machina003 ~]$ sudo apt install bridge-utils

And see one of the Wikis for configuration instructions. The main trick with bonded interfaces is to set them as manual instead of static or dhcp, and configure your br0 interface to manage bond0 (bridge-ifaces bond0 and bridge-ports bond0 or whatever your bonded interface name is). You would also set br0 to dhcp (iface br0 inet dhcp)

Then proceed to install LXD and configure it (see the appropriate guide for your version of Linux). Something like (assuming a bridge named br0 and a zfs file system named lxd on a zpool named mpool):

[user@machina003 ~]$ sudo apt install lxd
[user@machina003 ~]$ sudo lxd init
Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]:
Name of the storage backend to use (btrfs, dir, lvm, zfs) [default=zfs]:
Create a new ZFS pool? (yes/no) [default=yes]: no
Name of the existing ZFS pool or dataset: mpool/lxd
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]: no
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]: yes
Name of the existing bridge or host interface: br0
Would you like LXD to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:

You now have the basic building block to create virtual servers.

But what about Docker?

Docker is perfect for deploying and scaling web apps, single page apps and the like. But it has limitations and is not suited for the types of tasks that are needed for data infrastructures. LXD adds full networking capabilities and persistent storage. It also behaves exactly like a real server. You can still deploy docker based apps inside an LXC container, if you want to.

But what about Virtualenv?

Virtualenv is like chroot for Python. Works fine for that application, but it is not anything like a virtual server. You can use virtualenv inside an LXC container, if you want to. Or use conda envs, or any of the other tools designed for Python.

Installing Postgresql

I've introduced PostgreSQL in my "ex-libris" of a Data Scientist part I. It is the most flexible open source database system. It can be customized to handle pretty much any workload, given the proper extensions (ranging from time series to GIS). I've also covered it in my NACIS presentation on open source cartographic tools for data science. So we won't go into the details as to why it is so great. Bottom line, you will need a database system for various data engineering and data science tasks, so we need to install it. It is also a recommended backend for airflow. (Image: IBM RAMAC, from patent filed 1970, original 1954).

Assuming you've gone ahead and decided LXD was for you, you'll need to create a server on which to install PostgreSQL. If you decided on something else for virtualization, we'll see you next time, when we go into airflow and jupyter notebooks.

[user@machina003 ~]$ sudo lxc launch ubuntu:18.04 pg-server
[user@machina003 ~]$ sudo lxc list
|     NAME     |  STATE  |       IPV4       | IPV6 |    TYPE    | SNAPSHOTS |
| pg-server    | RUNNING | (eth0) |      | PERSISTENT | 0         |

We just created a server. If you want to automate everything with a tool like ansible, you will have to create a user to ssh into the server, copy the public key and set the user to do passwordless sudo. However, that level of automation is for another time.

Today, we will just connect to the server as root and install and configure PostgreSQL. To do that, we will simply remote execute a bash shell on pg-server, from our "head" server, where we installed the LXD package.

[user@machina003 ~]$ sudo lxc exec pg-server -- /bin/bash
root@pg-server:~# apt install postgresql postgresql-contrib -y
root@pg-server:~# systemctl status postgresql

The status should say Active. The installation process automatically created a postgres user. To become that user:

root@pg-server:~# su - postgres
postgres@pg-server:~$ psql
psql (10.12 (Ubuntu 10.12-0ubuntu0.18.04.1))
Type "help" for help.

postgres=# \list
                              List of databases
   Name    |  Owner   | Encoding | Collate |  Ctype  |   Access privileges   
 postgres  | postgres | UTF8     | C.UTF-8 | C.UTF-8 | 
 template0 | postgres | UTF8     | C.UTF-8 | C.UTF-8 | =c/postgres          +
           |          |          |         |         | postgres=CTc/postgres
 template1 | postgres | UTF8     | C.UTF-8 | C.UTF-8 | =c/postgres          +
           |          |          |         |         | postgres=CTc/postgres
(3 rows)

postgres=# \q
postgres@pg-server:~$ exit

We've confirmed that the postgres user can connect locally and query the database. Depending on the type of application we will be using, we will have to configure remote access to the database. We will do that when the time comes.

Fine Tuning

For now, however, let's do a basic benchmark of our database, so we know what to expect. We will create a database for running pgbench, load data (with a scale of 100x the size the base test [1]), confirm we created the tables and run pgbench for 10 minutes in SELECT (reads) and then in TPC-B mode.

root@pg-server:~# su - postgres
postgres@pg-server:~$ createdb pgbench
postgres@pg-server:~$ pgbench -i -s 100 pgbench
postgres@pg-server:~$ psql -d pgbench
psql (10.12 (Ubuntu 10.12-0ubuntu0.18.04.1))
Type "help" for help.

pgbench=# \dt
              List of relations
 Schema |       Name       | Type  |  Owner   
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history  | table | postgres
 public | pgbench_tellers  | table | postgres
(4 rows)

pgbench=# \q

SELECT mode, 24 concurrent connections, 2 threads, 600 seconds (10 minutes) which is a minimum amount of time to get a stable measure. If we want to see our disk activity while the benchmark is running, we can open a new connection to our pg-server (with lxc exec) and then run: zpool iostat 1.

postgres@pg-server:~$ pgbench -c 24 -j 2 -T 600 -S pgbench
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 24
number of threads: 2
duration: 600 s
number of transactions actually processed: 43543511
latency average = 0.331 ms
tps = 72572.454834 (including connections establishing)
tps = 72573.118889 (excluding connections establishing)

That's over 72000 select transactions per second, on an array of spinning rust (regular 2.5" SAS rotating hard drives). In reality, thanks to compression and caching at the ZFS and PostgreSQL layers, the drives a mostly idle on reads. Now let's see on the full read/write benchmark:

postgres@pg-server:~$ pgbench -c 32 -j 2 -T 600 pgbench
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 32
number of threads: 2
duration: 600 s
number of transactions actually processed: 3822033
latency average = 5.024 ms
tps = 6369.896896 (including connections establishing)
tps = 6369.959330 (excluding connections establishing)

Out of the box, PostgreSQL has very conservative settings. Since our box has quite a bit of memory, let's bump a few memory settings. We'll also turn off the full page writes (we can do that because and only because we use ZFS as filesystem backend).

postgres@pg-server:~$ pg_conftool 10 main set shared_buffers 4GB
postgres@pg-server:~$ pg_conftool 10 main set work_mem 12MB
postgres@pg-server:~$ pg_conftool 10 main set maintenance_work_mem 1GB
postgres@pg-server:~$ pg_conftool 10 main set effective_cache_size 6GB
postgres@pg-server:~$ pg_conftool 10 main set full_page_writes off
postgres@pg-server:~$ exit
root@pg-server:~# systemctl restart postgresql

Will we do better than 72500 tps on SELECT statements?

root@pg-server:~# su - postgres
postgres@pg-server:~$ pgbench -c 24 -j 2 -T 600 -S pgbench
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 24
number of threads: 2
duration: 600 s
number of transactions actually processed: 109059919
latency average = 0.132 ms
tps = 181766.424785 (including connections establishing)
tps = 181768.306634 (excluding connections establishing)

Over 180,000 tps, a huge improvement with such a simple change to the PostgreSQL configuration. Now let's see on the full read/write benchmark:

postgres@pg-server:~$ pgbench -c 32 -j 2 -T 600 pgbench
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 32
number of threads: 2
duration: 600 s
number of transactions actually processed: 8041905
latency average = 2.388 ms
tps = 13403.132790 (including connections establishing)
tps = 13403.254404 (excluding connections establishing)

This is a little over twice as fast at 13403 tps, and a 2.388ms latency... Again, let's not forget we are running inside a virtual "server", with a bunch of old school 2.5" hard disks with spinning platters (striped and mirrored). As a reference point, I typically get around 2000 tps on a cloud server, even with NVME storage.

So, congratulations, you've now tweaked PostgreSQL to perform well on TPC-B benchmark. Of course, your workload might be different, so you will want to keep an eye on your server. How do we do this?

Log Analysis

We are now going to install pgbadger. According to the developer, it is:
"a PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering. It outperform any other PostgreSQL log analyzer."
 The installation from the github repo is pretty straightforward:

postgres@pg-server:~$ exit
root@pg-server:~# apt-get install libjson-xs-perl
root@pg-server:~# git clone
Cloning into 'pgbadger'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 4155 (delta 39), reused 41 (delta 18), pack-reused 4088
Receiving objects: 100% (4155/4155), 7.90 MiB | 687.00 KiB/s, done.
Resolving deltas: 100% (2554/2554), done.
root@pg-server:~# cd pgbadger/
root@pg-server:~/pgbadger# perl Makefile.PL INSTALLDIRS=vendor
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for pgBadger
Writing MYMETA.yml and MYMETA.json
root@pg-server:~/pgbadger# apt install make
root@pg-server:~/pgbadger# make && sudo make install
Installing /usr/share/man/man1/pgbadger.1p
Installing /usr/bin/pgbadger

We now have to configure postgresql to log long running queries. We will set it to log anything that is taking more than 1 second (or whatever makes sense for your application).

root@pg-server:~/pgbadger# su - postgres
postgres@pg-server:~$ pg_conftool 10 main set log_min_duration_statement 1000
postgres@pg-server:~$ pg_conftool 10 main set log_checkpoints on
postgres@pg-server:~$ pg_conftool 10 main set log_connections on
postgres@pg-server:~$ pg_conftool 10 main set log_disconnections on
postgres@pg-server:~$ pg_conftool 10 main set log_lock_waits on
postgres@pg-server:~$ pg_conftool 10 main set log_temp_files 0
postgres@pg-server:~$ pg_conftool 10 main set log_autovacuum_min_duration 0
postgres@pg-server:~$ pg_conftool 10 main set log_error_verbosity default
postgres@pg-server:~$ exit
root@pg-server:~/pgbadger# systemctl restart postgresql

You will now have logs that are usable by pgbadger. In my case, to show an example of the report, I set the log_min_duration_statement above to only 10ms instead of 1000ms, then ran the pgbench benchmark again. This populated the postgres log with quite a bit. To generate the report:

root@pg-server:~/pgbadger# cd
root@pg-server:~# pwd
root@pg-server:~# pgbadger /var/log/postgresql/postgresql-10-main.log 
[>                        ] Parsed        0 bytes of 10754158 (0.00%), queries: 
[>                        ] Parsed      962 bytes of 10754158 (0.01%), queries: 
[>                        ] Parsed    10160 bytes of 10754158 (0.09%), queries: 
[>                        ] Parsed   105112 bytes of 10754158 (0.98%), queries: 
[==>                      ] Parsed  1052740 bytes of 10754158 (9.79%), queries: 
[====>                    ] Parsed  2107921 bytes of 10754158 (19.60%), queries:
[=======>                 ] Parsed  3164601 bytes of 10754158 (29.43%), queries:
[=========>               ] Parsed  4216749 bytes of 10754158 (39.21%), queries:
[===========>             ] Parsed  5272034 bytes of 10754158 (49.02%), queries:
[==============>          ] Parsed  6329477 bytes of 10754158 (58.86%), queries:
[================>        ] Parsed  7385347 bytes of 10754158 (68.67%), queries:
[==================>      ] Parsed  8441076 bytes of 10754158 (78.49%), queries:
[=====================>   ] Parsed  9496409 bytes of 10754158 (88.30%), queries:
[=======================> ] Parsed 10553858 bytes of 10754158 (98.14%), queries:
[========================>] Parsed 10754158 bytes of 10754158 (100.00%), queries: 543065, events: 11
LOG: Ok, generating html report...
root@pg-server:~# ls -al out.html
-rw-r--r-- 1 root root 951657 May 18 02:33 out.html
root@pg-server:~# exit
[user@machina003 ~]$ sudo lxc file pull pg-server/root/out.html .

Opening out.html in a web browser:

out.html output of pgbadger
 These are the basic building blocks. The github repo for pgbadger goes into a lot more details, including setting up cron jobs and setting up a web server, but that's outside the scope of this article.

Until next time

In the next article in this series, we'll continue with the installation of more software.

Francois Dion

Chief Data Scientist


[1] In this case, we used a scale of x 100 which means a database size of about 1.6GB, so the database creation doesn't take too long. Using a factor of x 10000 will generate a database of 160GB. Running the tests, we would probably bump the number of connections and number of threads. Having done that, we see that we maintain a mixed transactions per seconds above 10000, even at that scale, on a modest number of regular hard drives, and running both the database and benchmark on the same container.

About Dion Research LLC: We are a boutique Data Science consultancy . established 2011. As we do end-to-end Data Science, we can help you solve business problems every step of the way. Get in touch for more information.