Jython
pygrametl supports running ETL flows on Jython, an implementation of Python that runs on the JVM. Using Jython instead of CPython allows an ETL flow to be parallelized using multiple threads instead of multiple processes. This is because Jython does not have a global interpreter lock (GIL), which in CPython ensures that only a single thread executes Python bytecode per process at a given time. For more information about the GIL see the Python wiki page on the GIL.
To make switching between CPython and Jython as simple as possible, pygrametl
provides two abstractions. Firstly, the module JDBCConnectionWrapper provides
two connection wrappers for JDBC connections with the same interface as the
connection wrappers for PEP 249 connections. As all of the connection wrappers
share the same interface, the user usually only has to change the connection
type (JDBC or PEP 249) and the connection wrapper when switching between
CPython and Jython. For more information about database access in pygrametl
see Database. Secondly, Jython currently has no support for multiprocessing,
as threads are more lightweight than processes and multiple threads can run in
parallel on the JVM. pygrametl therefore includes the module
jythonmultiprocessing, which wraps Python’s threading module and provides a
very small part of the interface of Python’s multiprocessing module. Thus,
pygrametl exposes the same interface for creating parallel ETL flows no matter
whether a user is using CPython or Jython.
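The idea of wrapping threading behind a multiprocessing-like interface can be illustrated with a minimal sketch. It is not pygrametl's actual jythonmultiprocessing code: the class name ThreadProcess and the helper double_all are hypothetical and exist only for this example, which assumes nothing beyond Python's standard library.

```python
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2, for example Jython 2.7


class ThreadProcess(threading.Thread):
    """Hypothetical stand-in that accepts the same target/args/kwargs
    arguments as multiprocessing.Process but runs the target in a thread."""

    def __init__(self, target=None, args=(), kwargs=None):
        threading.Thread.__init__(
            self, target=target, args=args, kwargs=kwargs or {})
        # Daemon threads do not keep the interpreter alive on exit
        self.daemon = True


def double_all(numbers, results):
    # The worker communicates through a thread-safe queue, as a process
    # would communicate through a multiprocessing queue
    for n in numbers:
        results.put(n * 2)


results = queue.Queue()
worker = ThreadProcess(target=double_all, args=([1, 2, 3], results))
worker.start()
worker.join()
doubled = sorted(results.get() for _ in range(3))
```

Code written against start(), join(), and a queue works unchanged whether the worker is a process or a thread, which is the property the text above relies on.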
While both Jython and CPython are capable of executing the same language, the
two platforms are implemented differently, so optimizations suitable for one
platform may be less effective on the other. One aspect to be aware of when
running high-performance pygrametl-based ETL flows on Jython is memory
management. For example, Oracle’s HotSpot JVM implements a generational garbage
collector that uses a much slower garbage collection strategy for the old
generations than for the young. Thus, allowing too many objects to be promoted
to the old generations can reduce the throughput of an ETL flow significantly.
Unfortunately, this can easily occur if the values controlling caching, such as
Decoupled.batchsize, are set too high. Conversely, if the value for
Decoupled.batchsize is set too low, the overhead of transferring data
between threads increases as smaller batches are used. Many tools for profiling
programs running on the JVM exist: JFR and JConsole are bundled with the JDK,
while tools such as VisualVM must be installed separately but often provide
additional functionality.
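The trade-off behind the batch size can be made concrete with a small sketch. It does not show how Decoupled batches data internally; the generator batches is a hypothetical helper that only counts how many transfers a given batch size causes and how many rows each transfer keeps alive.

```python
def batches(rows, batchsize):
    """Yield lists of at most `batchsize` rows from an iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batchsize:
            yield batch
            batch = []
    if batch:
        # Hand over the final, possibly smaller, batch
        yield batch


rows = [{"id": i} for i in range(10)]

# Small batches: many hand-overs between threads, few rows alive per batch
small = list(batches(rows, 2))

# Large batches: few hand-overs, but more rows stay referenced for longer
large = list(batches(rows, 5))
```

With ten rows, a batch size of 2 gives five hand-overs and a batch size of 5 gives two. A suitable value for a real flow is best found by profiling with one of the tools mentioned above.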
Setup
Using pygrametl with Jython requires an extra step compared to CPython, as Jython is less integrated with Python’s package management system. Firstly, install pygrametl from PyPI or by downloading the development version from GitHub. For more information about installing pygrametl for use with CPython see Install Guide.
After pygrametl has been installed, the location it has been installed to must
be added to the environment variable JYTHONPATH, as Jython purposely does
not import modules from CPython by default. The default directory used by
CPython for packages depends on the operating system and whether a package is
installed locally or globally. Check the output of the pip install command
or its log for precise information about where the package has been installed.
The method for setting this variable depends on the operating system. On most
Unix-like systems, the variable can be set in ~/.profile, which will be
sourced on login. On Windows, environment variables can be changed through the
System setting in the Control Panel. Python’s module search path can also be
extended on a per-program basis by adding a path to sys.path at the
start of a Python program.
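A minimal sketch of the per-program approach follows. The directory /path/to/site-packages is a placeholder, not a real location, and must be replaced with the directory reported by pip. The helper add_to_module_path is hypothetical and only guards against adding the same entry twice.

```python
import os
import sys


def add_to_module_path(directory):
    """Append a directory to sys.path unless it is already present."""
    directory = os.path.abspath(os.path.expanduser(directory))
    if directory not in sys.path:
        # Appending keeps the interpreter's own modules first in the search
        sys.path.append(directory)
    return directory


# Placeholder path: use the location pip installed pygrametl to
site_packages = add_to_module_path("/path/to/site-packages")

# From this point on, "import pygrametl" searches the added directory too
```

Unlike JYTHONPATH, this change only affects the program that performs it, so it has to run before the first import of pygrametl.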
Usage
Jython can in most cases be used as a direct replacement for CPython unless
CPython’s C API is being used. While Jython does not implement CPython’s C API,
it can use libraries implemented in other JVM-based languages like Java, Scala,
Clojure, and Kotlin. To use such libraries, they must be added to the JVM
classpath by using the -J-cp command-line option. For more information about
Jython’s command-line flags run the command jython -h.
from pygrametl.tables import FactTable
from pygrametl.JDBCConnectionWrapper import JDBCConnectionWrapper

# The Java classes used must be explicitly imported into the program
import java.sql.DriverManager

# The actual database connection is handled by a JDBC connection
jconn = java.sql.DriverManager.getConnection(
    "jdbc:postgresql://localhost/dw?user=dwuser&password=dwpass")

# As PEP 249 and JDBC connections provide different interfaces, it is
# necessary to use a JDBCConnectionWrapper instead of a ConnectionWrapper.
# Both provide the same interface, thus pygrametl can execute queries
# without taking into account how the connection is implemented
conn = JDBCConnectionWrapper(jdbcconn=jconn)

# This instance of FactTable manages the table "testresults" in the
# database using the default connection wrapper created above
factTable = FactTable(
    name='testresults',
    measures=['errors'],
    keyrefs=['pageid', 'testid', 'dateid'])
The above example demonstrates how few changes are needed to change the first
example from Fact Tables from using CPython to Jython. The database
connection is changed from a PEP 249 connection to a JDBC connection, and
ConnectionWrapper is changed to JDBCConnectionWrapper.JDBCConnectionWrapper.
The creation of the FactTable object does not need to be changed to run on
Jython, as the connection wrappers abstract away the differences between JDBC
and PEP 249. The other Jython module, jythonmultiprocessing, is even simpler
to use, as pygrametl’s module parallel imports either it or CPython’s built-in
multiprocessing module depending on whether Jython or CPython is used.
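Selecting a module based on the interpreter can be sketched as below. This is an illustration of the technique and not a copy of the code in parallel: the functions running_on_jython and load_parallel_backend are hypothetical names. The sketch assumes that sys.platform starts with "java" on Jython, which is how Jython reports its platform.

```python
import sys


def running_on_jython():
    """Return True if the program is executed by Jython."""
    return sys.platform.startswith("java")


def load_parallel_backend():
    """Return a module offering a multiprocessing-like interface."""
    if running_on_jython():
        # Thread-based replacement shipped with pygrametl
        import pygrametl.jythonmultiprocessing as backend
    else:
        # CPython's process-based standard library module
        import multiprocessing as backend
    return backend


backend = load_parallel_backend()
```

Because the choice is made in one place, the rest of an ETL flow can refer to backend.Process without knowing which interpreter is in use.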