Signing data with Signethic

F. Dion - Wax seal

abstract: Data integrity is a critical part of data governance and data science. Dion Research's open source module Signethic helps streamline the process for Jupyter notebooks and regular Python scripts.

Data tampering

When setting up a data pipeline, it is important to make sure the data is not modified by third parties (internal or external, with malicious intent or not). This check should be part of the normal auditing process, and it should not be difficult to add at any stage: data ingestion, data engineering, data modelling or an end-to-end data science workflow.

Signethic

Last year, Dion Research published a simple-to-use open source Public Key Cryptographic Signature (PKCS) module to help with that: Signethic. It is available on GitHub. Although the README includes instructions, we present here a more complete scenario using on-disk storage of numpy arrays. The same procedure works for any string or binary object in Python.

Installation


From source:

You will need git to clone the repository; alternatively, download the repository as a zip file and unzip it.
git clone https://github.com/dionresearch/signethic.git
cd signethic
python setup.py install

From pypi:

pip install signethic

Requirements:

Additional requirements for the example in this article can be installed with:

pip install numpy

The key


Since this package uses a PKCS approach, it needs some keys. How can you generate a private and public key for this?

After installing signethic, go into the folder where you want to store the keys (e.g. ~/.ssh), open Python, and generate them:


(signethic) user@desktop:~$ python
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from signethic import gen_key_pair
>>> gen_key_pair()
>>> quit()

It is important to safeguard the private key.
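
One common way to safeguard a private key on a POSIX system is to make the file readable and writable by its owner only. The sketch below uses only the standard library; the helper name `restrict_to_owner` is hypothetical, and the key path in the usage comment is taken from the signing example in this article rather than from anything Signethic guarantees.

```python
import os
import stat


def restrict_to_owner(path):
    """Set the file mode to 0o600 (owner read/write only) and return the new mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Read the permission bits back so the caller can confirm the change
    return stat.S_IMODE(os.stat(path).st_mode)


# Hypothetical usage, assuming the private key was written to this path:
# restrict_to_owner(os.path.expanduser("~/.ssh/signing_key.pem"))
```

This does not replace proper key management (backups, restricted hosts, rotation), but it prevents other local users from reading the key by accident.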

Signing the data


We will now create a data "pipeline". This could be a Jupyter notebook or a Python script. In this simple form, we only generate a two-dimensional array of random values, but typically this step would be part of some ingestion (e.g. reading a CSV file, an HDF5 file, the result of a SQL query, etc.). We will then sign the data and persist the signed data to disk:


import io
import numpy as np
from signethic import sign_and_persist


# np.random is a module, not a function; rand() returns uniform samples in [0, 1)
data = np.random.rand(300, 8)

# file like stream
f = io.BytesIO()
# save data to stream
np.save(f, data)

_ = f.seek(0)
signature = sign_and_persist(f.getbuffer(), 'numpy_data.signed', private_key_path='/home/user/.ssh/signing_key.pem')



Verifying the data

Now, suppose you are building the model in a different script, perhaps on a different server or even in a different department. The person in charge of signing the data would then tell you not only where to get the data, but also provide the public key to verify it (perhaps through a data catalog or portal). That way, you can do something like this:



import numpy as np
from signethic import verify_file

data = verify_file('numpy_data.signed', public_key_path='/path/to/signing_key.pem.pub')

if data:
    print("Good to go")
else:
    print("Somebody tampered with the file")
    exit(-1)
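
Once the file is verified, the bytes still have to be turned back into an array. The sketch below shows that round trip on its own, without Signethic: it assumes the verified payload is the same byte string that `np.save` produced on the signing side. Whether `verify_file` returns exactly those bytes is an assumption here, and the helper names `array_to_bytes` and `bytes_to_array` are hypothetical.

```python
import io

import numpy as np


def array_to_bytes(arr):
    """Serialize an array to the .npy format in memory."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


def bytes_to_array(payload):
    """Rebuild the array from .npy bytes; pickled objects are refused."""
    return np.load(io.BytesIO(payload), allow_pickle=False)


original = np.arange(6, dtype=np.float64).reshape(2, 3)
restored = bytes_to_array(array_to_bytes(original))
```

Keeping `allow_pickle=False` means that a payload can only ever be decoded as plain array data, which is a sensible default even for data that has passed verification.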


Conclusion


There you go: one line to sign a binary object and save it to disk, and one line to verify and load a signed binary object. The same can be done with any other binary data, including pickle dumps or even complete file system dumps.
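
As a minimal sketch of the pickle case, the standard library can turn an arbitrary Python object into bytes that could then be signed in place of the numpy buffer used earlier. The helper names and the sample dictionary are hypothetical, and passing the resulting bytes to `sign_and_persist` is assumed to work the same way as in the numpy example.

```python
import pickle


def to_signable_bytes(obj):
    """Serialize a Python object to bytes suitable for signing."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def from_verified_bytes(payload):
    """Rebuild the object. Call this only on bytes whose signature was verified."""
    return pickle.loads(payload)


model_params = {"weights": [0.1, 0.2, 0.7], "bias": -1.5}
payload = to_signable_bytes(model_params)
restored = from_verified_bytes(payload)
```

The order matters with pickle: unpickling can execute arbitrary code, so the signature should be checked before the payload is ever loaded.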


Francois Dion
Chief Data Scientist
@f_dion
