Jython

pygrametl supports running ETL flows on Jython, an implementation of Python that runs on the JVM. Using Jython instead of CPython allows an ETL flow to be parallelized using multiple threads instead of multiple processes. This is because Jython does not have a global interpreter lock (GIL), which in CPython ensures that only a single thread is running per process at a given time. For more information about the GIL see the Python wiki's page on the GIL.
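As a minimal sketch of thread-based parallelism, the snippet below splits a CPU-bound computation across threads using only the standard library. The names count_primes and run_in_threads are hypothetical and not part of pygrametl. The result is the same on both implementations; the expectation is only that the threads can execute simultaneously on Jython, while CPython's GIL interleaves them.

```python
import platform
import threading


def count_primes(start, stop):
    """Count the primes in [start, stop) by trial division (CPU-bound)."""
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_threads(ranges):
    """Run count_primes on each (start, stop) range in its own thread."""
    results = [None] * len(ranges)

    def worker(index, start, stop):
        # Each thread writes to its own slot, so no lock is needed
        results[index] = count_primes(start, stop)

    threads = [threading.Thread(target=worker, args=(i, a, b))
               for i, (a, b) in enumerate(ranges)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Reports "Jython" or "CPython", which determines whether the threads
# above can execute simultaneously
implementation = platform.python_implementation()
counts = run_in_threads([(0, 50), (50, 100)])
```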

To make switching between CPython and Jython as simple as possible, pygrametl provides two abstractions. Firstly, the module JDBCConnectionWrapper provides two connection wrappers for JDBC connections with the same interface as the connection wrappers for PEP 249 connections. As all the connection wrappers share the same interface, the user usually only has to change the connection type (JDBC or PEP 249) and the connection wrapper when switching between CPython and Jython. For more information about database access in pygrametl see Database. Secondly, Jython currently has no support for multiprocessing, as threads are more lightweight than processes and multiple threads can run in parallel on the JVM. pygrametl therefore includes the module jythonmultiprocessing, which wraps Python’s threading module and provides a very small part of the interface of Python’s multiprocessing module. Thus, pygrametl exposes the same interface for creating parallel ETL flows regardless of whether a user is using CPython or Jython.
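To illustrate the idea behind the second abstraction, the following sketch layers a small multiprocessing-style interface on top of threads. It is an assumption-laden illustration and not the source of jythonmultiprocessing: the names Process, Queue, and JoinableQueue mirror those in Python's multiprocessing module, while double_all is a hypothetical worker function.

```python
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2.7, the language level of Jython


class Process(threading.Thread):
    """A thread exposing the target/args/start/join subset of
    multiprocessing.Process."""

    def __init__(self, target=None, name=None, args=(), kwargs=None):
        threading.Thread.__init__(self, target=target, name=name,
                                  args=args, kwargs=kwargs or {})


# queue.Queue already offers put, get, task_done, and join, so within a
# single process it can stand in for both multiprocessing queue types
Queue = queue.Queue
JoinableQueue = queue.Queue


def double_all(inbox, outbox):
    """Double every number from inbox until the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            break
        outbox.put(item * 2)


inbox, outbox = Queue(), Queue()
worker = Process(target=double_all, args=(inbox, outbox))
worker.start()
for value in (1, 2, 3, None):
    inbox.put(value)
worker.join()
doubled = [outbox.get() for _ in range(3)]
```

Code written against this subset does not need to know whether a Process is backed by a thread or by an operating system process.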

While both Jython and CPython are capable of executing the same language, the two platforms are implemented differently, so optimizations suitable for one platform may be less effective on the other. One aspect to be aware of when running high-performance pygrametl-based ETL flows on Jython is memory management. For example, Oracle’s HotSpot JVM implements a generational garbage collector that uses a much slower garbage collection strategy for the old generation than for the young generation. Thus, allowing too many objects to be promoted to the old generation can reduce the throughput of an ETL flow significantly. Unfortunately, this can easily occur if the values controlling caching, such as Decoupled.batchsize, are set too high. Conversely, if the value for Decoupled.batchsize is set too low, the overhead of transferring data between threads increases as smaller batches are used. Many tools for profiling programs running on the JVM exist: JFR and JConsole are bundled with the JDK, while tools such as VisualVM must be installed separately but often provide additional functionality.
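The trade-off for the batch size can be made concrete with a small standard-library sketch. It does not reproduce how Decoupled is implemented; the function transfer and its parameters are hypothetical and only count how many hand-offs between threads are needed for a given batch size.

```python
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2.7, the language level of Jython


def transfer(rows, batchsize):
    """Send rows to a consumer thread in batches of batchsize.

    Returns the rows received and the number of queue transfers used.
    """
    channel = queue.Queue()
    received = []

    def consumer():
        while True:
            batch = channel.get()
            if batch is None:  # Sentinel: the producer is done
                break
            received.extend(batch)

    thread = threading.Thread(target=consumer)
    thread.start()

    transfers = 0
    for i in range(0, len(rows), batchsize):
        channel.put(rows[i:i + batchsize])
        transfers += 1
    channel.put(None)
    thread.join()
    return received, transfers


rows = list(range(1000))
_, small_batches = transfer(rows, batchsize=10)   # 100 transfers
_, large_batches = transfer(rows, batchsize=500)  # 2 transfers
```

Small batches pay the synchronization cost of the queue many times, while large batches keep more rows alive at once, which on a generational collector plausibly increases the chance that they are promoted before they can be collected. Profiling the actual flow remains the reliable way to pick a value.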

Setup

Using pygrametl with Jython requires an extra step compared to CPython, as Jython is less integrated with Python’s package management system. Firstly, install pygrametl from PyPI or by downloading the development version from GitHub. For more information about installing pygrametl for use with CPython see Install Guide.

After pygrametl has been installed, the location it has been installed to must be added to the environment variable JYTHONPATH, as Jython purposely does not import modules from CPython by default. The default directory used by CPython for packages depends on the operating system and whether a package is installed locally or globally. Check the output of the pip install command or its log for precise information about where the package has been installed. The method for setting JYTHONPATH depends on the operating system. On most Unix-like systems, the variable can be set in ~/.profile, which will be sourced on login. On Windows, environment variables can be changed through the System setting in the Control Panel. Python’s module search path can also be extended on a per-program basis by adding a path to sys.path at the start of a Python program.
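The per-program alternative can be sketched as follows. The path shown is hypothetical and must be replaced with the directory reported by pip on the system in question.

```python
import sys

# Hypothetical location; replace it with the directory reported by pip
PYGRAMETL_LOCATION = "/home/user/.local/lib/python2.7/site-packages"

# Extend the module search path before pygrametl is imported
if PYGRAMETL_LOCATION not in sys.path:
    sys.path.append(PYGRAMETL_LOCATION)
```

Any import of pygrametl must come after these lines, as the search path is only consulted when a module is imported.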

Usage

Jython can in most cases be used as a direct replacement for CPython, unless a program depends on CPython’s C API. While Jython does not implement CPython’s C API, it can use libraries implemented in JVM-based languages like Java, Scala, Clojure, and Kotlin. To use such libraries, they must be added to the JVM classpath by using the -J-cp command-line option. For more information about Jython’s command-line flags run the command jython -h.

from pygrametl.tables import FactTable
from pygrametl.JDBCConnectionWrapper import JDBCConnectionWrapper

# The Java classes used must be explicitly imported into the program
import java.sql.DriverManager

# The actual database connection is handled by a JDBC connection
jconn = java.sql.DriverManager.getConnection(
    "jdbc:postgresql://localhost/dw?user=dwuser&password=dwpass")

# As PEP 249 and JDBC connections provide different interfaces, it is
# necessary to use a JDBCConnectionWrapper instead of a ConnectionWrapper.
# Both provide the same interface, thus pygrametl can execute queries
# without taking into account how the connection is implemented
conn = JDBCConnectionWrapper(jdbcconn=jconn)

# This instance of FactTable manages the table "testresults" in the
# database using the default connection wrapper created above
factTable = FactTable(
    name='testresults',
    measures=['errors'],
    keyrefs=['pageid', 'testid', 'dateid'])

The above example demonstrates how few changes are needed to port the first example in Fact Tables from CPython to Jython. The database connection is changed from a PEP 249 connection to a JDBC connection, and ConnectionWrapper is changed to JDBCConnectionWrapper.JDBCConnectionWrapper. The creation of the FactTable object does not need to be changed to run on Jython, as the connection wrappers abstract away the differences between JDBC and PEP 249. The other Jython module, jythonmultiprocessing, is even simpler to use, as pygrametl’s parallel module imports either it or CPython’s built-in multiprocessing module depending on whether Jython or CPython is used.
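A selection of this kind can be sketched with a conditional import. This is an illustration and not necessarily how the parallel module is written; the import path pygrametl.jythonmultiprocessing is assumed from the module name given above, and the check relies on Jython reporting a sys.platform value that starts with "java".

```python
import sys

if sys.platform.startswith("java"):
    # Running on Jython: use the thread-based replacement
    # (assumed import path, taken only when running on Jython)
    import pygrametl.jythonmultiprocessing as multiprocessing
else:
    # Running on CPython: use the standard library module
    import multiprocessing

# The rest of the program refers to one name on both platforms
Process = multiprocessing.Process
```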