Jython
pygrametl contains additional support for running ETL programs on Jython, the Java Virtual Machine (JVM) implementation of Python. Using Jython instead of CPython allows threads to be used instead of processes for performing operations in parallel. This is possible because the JVM has no global interpreter lock (GIL), which in CPython prevents more than one thread in the same process from executing Python code at the same time. For more information about the GIL, see the GIL page on the Python wiki.
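As a minimal sketch of thread-based parallelism, the following runs a CPU-bound function in one thread per input using only the standard library. The function names count_primes and run_in_threads are made up for this illustration and are not part of pygrametl. The code runs on both platforms, but only on Jython can the threads be expected to execute the computation in parallel.

```python
import threading


def count_primes(limit):
    """CPU-bound helper: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_threads(limits):
    """Run count_primes for each limit in its own thread, keeping input order."""
    results = [None] * len(limits)

    def worker(index, limit):
        # Each thread writes to its own slot, so no locking is needed
        results[index] = count_primes(limit)

    threads = [threading.Thread(target=worker, args=(i, limit))
               for i, limit in enumerate(limits)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

On CPython the same call works, but the GIL serialises the computation, so processes would be needed to get a speed-up.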
To make the switch between CPython and Jython as simple as possible, two
abstractions are provided by pygrametl. Firstly, JDBCConnectionWrapper
provides two wrappers for database connections following the Java Database
Connectivity standard (JDBC), and as pygrametl uses a similar wrapper for
PEP 249 database connections, they can be used without any changes to the
program code. For more information about database accesses in pygrametl see
Database. Secondly, as Jython currently has no support for multiprocessing
(threads on the JVM are more lightweight and capable of running in parallel),
another abstraction is provided by pygrametl in the form of the module
jythonmultiprocessing. This module wraps threading to implement a very small
part of Python's multiprocessing module, so the same library interface can be
used on both Jython and CPython.
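To illustrate the idea of wrapping threading behind a multiprocessing-style interface, here is a minimal sketch. The classes Process and JoinableQueue below only borrow their names from multiprocessing; they are hypothetical stand-ins written for this example and are not the source of jythonmultiprocessing. The helper run_doubler is likewise invented to show the classes in use.

```python
import threading

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2.7, the language version Jython implements


class Process(threading.Thread):
    """Thread-backed stand-in exposing a multiprocessing.Process-like interface."""

    def __init__(self, target=None, name=None, args=(), kwargs=None):
        threading.Thread.__init__(self, target=target, name=name,
                                  args=args, kwargs=kwargs or {})
        # Do not keep the interpreter alive if the main thread exits
        self.daemon = True


class JoinableQueue(queue.Queue):
    """queue.Queue already offers put, get, task_done and join."""


def run_doubler(values):
    """Double each value in a worker and return the results in input order."""
    tasks = JoinableQueue()
    results = JoinableQueue()

    def worker():
        while True:
            value = tasks.get()
            if value is None:  # sentinel: no more work
                tasks.task_done()
                break
            results.put(value * 2)
            tasks.task_done()

    process = Process(target=worker)
    process.start()
    for value in values:
        tasks.put(value)
    tasks.put(None)
    tasks.join()    # wait until every task has been marked as done
    process.join()
    return [results.get() for _ in values]
```

Code written against start, join, put, get and task_done does not need to know whether a thread or a process is doing the work, which is the property the abstraction relies on.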
Note that while both Jython and CPython are capable of executing the same
language, the two platforms are implemented differently, so optimisations
suitable for one platform may be less effective for the other. One aspect to
be aware of when running high-performance pygrametl flows on Jython is memory
management. The JVM, which Jython runs on, uses a generational garbage
collector with more expensive garbage collection strategies used for the old
generation part of the heap. Allowing too many objects to be moved to this part
of memory can reduce the throughput of an ETL flow significantly, which can
easily occur if values controlling caching, such as Decoupled.batchsize, are
set too high. Conversely, setting Decoupled.batchsize too low increases the
overhead of transferring data between threads, as smaller batches are used.
Multiple tools for profiling JVM-based programs exist: HPROF and JConsole are
bundled with the JDK, while tools such as VisualVM must be installed
separately but often provide additional functionality.
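The batch size trade-off can be illustrated with a small, generic batching helper. This is a sketch of the general technique only, not pygrametl's implementation of Decoupled; the function name batches is an assumption made for this example.

```python
def batches(rows, batchsize):
    """Group rows into lists of at most batchsize items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batchsize:
            yield batch
            batch = []
    if batch:
        # Hand over the final, possibly smaller, batch
        yield batch


# Small batches mean many hand-overs between threads
small = list(batches(range(1000), 10))
# Large batches mean few hand-overs, but each batch keeps more objects alive
large = list(batches(range(1000), 500))
```

With 1000 rows, a batch size of 10 results in 100 hand-overs, while a batch size of 500 results in only 2, at the cost of holding 500 rows in memory per batch.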
Setup
Using pygrametl with Jython requires a few extra steps compared to CPython, as Jython is less integrated with Python's package management system, and the JVM needs access to the necessary libraries. First, install pygrametl either through PyPI or by downloading the latest development version from GitHub. For more information about installing pygrametl for CPython, see the Install Guide.
After pygrametl has been installed, the install location must be added to the
environment variable JYTHONPATH, as Jython purposely does not read the
locations used by CPython by default. The default pip install directory
depends on the operating system and on whether packages are installed locally
or globally. Check the output of the pip install command or its log for precise
information about where the packages are installed. The method for setting
this variable depends on your operating system. On most Unix-like systems, the
variable can be set in ~/.profile, which will be sourced on login. On Windows,
environment variables can be changed through the System setting in the Control
Panel. The module path can also be set programmatically through sys.path.
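Setting the module path programmatically could look like the sketch below. The directory in PYGRAMETL_DIR is a placeholder chosen for this example and must be replaced with the location reported by pip; the helper add_to_module_path is also just an illustration.

```python
import sys

# Hypothetical install location; replace it with the directory reported by pip
PYGRAMETL_DIR = "/home/user/.local/lib/python2.7/site-packages"


def add_to_module_path(directory):
    """Append directory to sys.path unless it is already present."""
    if directory not in sys.path:
        sys.path.append(directory)


# Must run before the first "import pygrametl" in the program
add_to_module_path(PYGRAMETL_DIR)
```

Unlike JYTHONPATH, this only affects the program that executes it, so it must be repeated in every program that needs the path.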
Usage
Jython can in most cases be used as a replacement for CPython, with the
exception of C extensions, which Jython replaces with the capability to use
libraries from languages targeting the JVM, such as Java, Scala or Clojure. To
access JVM libraries, they must be added to the JVM classpath by using the
-J-cp command line option. For more information about Jython's command line
flags, see Jython CLI.
import pygrametl
from pygrametl.tables import FactTable
from pygrametl.JDBCConnectionWrapper import JDBCConnectionWrapper
# Java classes used must be imported into the program
import java.sql.DriverManager
# The actual database connection is handled using a JDBC connection
jconn = java.sql.DriverManager.getConnection(
    "jdbc:postgresql://localhost/dw?user=dwuser&password=dwpass")
# As PEP 249 and JDBC connections are different, JDBCConnectionWrapper must be
# used instead of ConnectionWrapper. The class has the same interface, and a
# reference to the wrapper is also saved to allow for easy access to it
conn = JDBCConnectionWrapper(jdbcconn=jconn)
# The instance of FactTable connects to the table "testresults" in the
# database using the default connection wrapper we just created
factTable = FactTable(
    name='testresults',
    measures=['errors'],
    keyrefs=['pageid', 'testid', 'dateid'])
The above example demonstrates how few changes are needed in order to change
the first example from Fact Tables from using CPython to Jython. The
database connection is changed to use a JDBC connection, and ConnectionWrapper
is changed to JDBCConnectionWrapper.JDBCConnectionWrapper. The creation of the
fact table does not need to be changed in any way to run on Jython, as the
connection wrappers abstract away the differences between JDBC and PEP 249.
The other Jython module, jythonmultiprocessing, is even simpler to use, as
pygrametl's parallel module, parallel, imports either it or CPython's built-in
multiprocessing module depending on whether Jython or CPython is used.
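Such a platform-dependent import can be sketched as follows. This illustrates the idea and is not necessarily the exact code used in parallel. It relies on sys.platform starting with "java" when running on Jython, and the name RUNNING_ON_JYTHON is introduced only for this example.

```python
import sys

# On Jython, sys.platform starts with "java"
RUNNING_ON_JYTHON = sys.platform.startswith("java")

if RUNNING_ON_JYTHON:
    # Thread-based replacement provided by pygrametl
    import pygrametl.jythonmultiprocessing as multiprocessing
else:
    # Process-based module from CPython's standard library
    import multiprocessing
```

The rest of the program can then refer to multiprocessing without caring which of the two modules was imported.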