Saturday, June 8, 2024

The simplest data pipeline ever

As software engineers what we care most about is getting stuff done. The simplest approach is probably the best, as long as it doesn't cause long term issues.

I was working on a project where the creation and initialization of a few PostgreSQL databases required data migration and transformation from another set of databases. The first step in this process involved loading a few hundreds of millions of records from the source databases into the destination databases.
I wanted to implement this in a way that is fully automated, and at the same time, I wanted to simplify this process as much as possible. So, no fancy tools. No cloud infrastructure. No nothing. Is that even possible?

The approach we followed was simply using Linux pipes! Linux pipes can transfer data from one process to another. So how could this help with this data import process ?

Postgres provides useful command line utilities that can be used to export and import data. For example this command exports data from a table called mydata to the standard output:

psql -h source -U postgres -d test -c "\copy mydata TO STDOUT"

On the other hand you can import data from standard input to a database table using this command:

psql -h destination -U postgres -d test -c "\copy mydata FROM STDIN"

With the power of Linux pipelines it's possible to stitch these commands together so that the output of the first command feeds into the input of the next and data flows from one database to another.

psql -h source -U postgres -d test -c "\copy mydata TO STDOUT" | psql -h destination -U postgres -d test -c "\copy mydata FROM STDIN"

Pretty simple. Isn't it? However since this operation may take a few hours, wouldn't it be nice to have some sort of progress indication ?
This is where another handy Linux utility comes to play: The pv utility. From the man page:

pv - monitor the progress of data through a pipe

By default this tool shows the number of bytes flowing through the pipe, however in the process of data import, the number of imported records is a better representative of the progress of the operation. The good thing is that pv has switches that enable counting the number of lines instead of bytes. So the final solution would look like:

psql -h source -U postgres -d test -c "\copy mydata TO STDOUT" | \ 
pv --line-mode --size 100000000 | \ 
psql -h destination -U postgres -d test -c "\copy mydata FROM STDIN"

Note that the --size parameter assumes knowledge of the total number of records, which can be retrieved using a simple select count(*), or just omitted.

When I run this in terminal, the progress looks like:

Transferring 100m records from one database to another both running as containers on my local machine took about 8 minutes. In case of transferring data over the network, it's expected to be slower.

Needless to say that in the real life implementation other steps were required like retrieving the credentials for the source and destination databases and hooking the scripts into a CI pipeline.

Surely, this isn't the most efficient way to transfer a lot of data. Note that the data is transferred as text which is far less efficient than the binary transfer that proper data import tools would use. As with any decision we make as software engineers it's all about tradeoffs. My priorities were clear: We need the simplest possible repeatable solution.

If you're interested in trying this on your machine, this is how I prepared the above screenshots:
Let's start with a docker compose file which instantiates 3 containers:

  1. A source database
  2. A destination database
  3. And a client where there data import / export process is executed.
version: "3.8"

    driver: bridge

    image: postgres:16.1
      - type: volume
        source: source-data
        target: /var/lib/postgresql/data
      - 9432:5432

    image: postgres:16.1
      - type: volume
        source: destination-data
        target: /var/lib/postgresql/data
      - 9433:5432

    container_name: postgres_client
    build: .
    entrypoint: [ "sleep", "infinity" ]

    external: true
    name: source-data

    external: true
    name: destination-data


Note that for the client container a Dockerfile is used and that is to ensure that the required utilities for this process -in particular pv- are installed. Additionally to copy the .pgpass file which contains the database passwords. 

FROM postgres:16.1

RUN apt-get update

RUN apt-get install pv

COPY pgpass /root/.pgpass

RUN chmod 0600 /root/.pgpass



Then start this docker compose stack using:

docker compose up -d --build

Connect to the databases using your favorite tool (mine is Azure data studio) and create the test table by executing this query:

CREATE TABLE public.mydata (
	id int NOT NULL,
    firstname varchar NULL,
	lastname varchar NULL,
	email varchar NULL,

The next step would be to populate some test data into the source database:

INSERT INTO public.mydata
(firstname, lastname, email, id)
select concat('firstname', counter), concat('lastname', counter), concat('firstname', counter, '.', 'lastname', counter, ''), counter
	from pg_catalog.generate_series(1, 100000000) as counter

Then connect to the client container using:

docker compose exec client bash

And execute the script to start the data migration:

psql -h source -U postgres -d test -c "\copy mydata TO STDOUT" | \ 
pv --line-mode --size 100000000 | \ 
psql -h destination -U postgres -d test -c "\copy mydata FROM STDIN"


I hope this helps.

No comments: