Database

Database access in pygrametl is done through either a PEP 249 connection if CPython is used, or with a JDBC connection when pygrametl is running on Jython. pygrametl provides multiple abstractions on top of these connections and direct usage of these to manipulate the database should generally not be necessary. As an abstraction for database rows Python’s dict type is used, where the keys the names of the columns in the table and the values are the data stored in that row.

Connection Wrappers

Multiple connection wrappers are provided by the pygrametl framework to allow PEP 249 connections and JDBC connections to be used uniformly, and to allow multiple threads and process to use the connection safely. In addition, the connection wrappers for PEP 249 connections also automatically convert from the pyformat parameter style used by pygrametl to any of the other parameter styles defined in PEP 249#paramstyle. To simplify the use of database connections, the first connection wrapper created is set as the default. The default connection wrapper can be used by abstractions such as tables.FactTable and tables.Dimension without the user having to pass the connection wrapper to them explicitly. If another database connection should be used, for example, if data is read from one database and written to another, a specific connection can be explicitly passed as an argument to all pygrametl abstractions that can read to and/or write from a database.

ConnectionWrapper and JDBCConnectionWrapper.JDBCConnectionWrapper are the two main connection wrappers provided by pygrametl. The interface provided by these two classes is just an abstraction on top of database operations, and provides methods, among others, for executing statements, iterating over returned rows, and committing transactions. Note however that these connection wrappers cannot be used by multiple threads or processes in parallel. To ensure that database access is performed correctly in a parallel ETL program without burdening the user with the task, the class parallel.SharedConnectionWrapperClient is provided. This class can be created from an existing connection wrapper using the function parallel.shareconnectionwrapper(). Each separate process can then be given a unique copy of the shared connection to access the database safely in parallel. For more information about the parallel capabilities of pygrametl see Parallel.

Experimental Connection Wrappers

pygrametl also provides two very experimental connection wrappers: BackgroundConnectionWrapper and JDBCConnectionWrapper.BackgroundJDBCConnectionWrapper. They are provided as alternatives to ConnectionWrapper and JDBCConnectionWrapper.JDBCConnectionWrapper and perform the database operations in a separate thread instead of the same thread as the ETL program. As they are considered experimental, they are not set as default upon creation, and must thus manually be set as the default with the method setasdefault(), available on all connection wrappers, or be manually passed around the program.

For most usage the classes ConnectionWrapper and JDBCConnectionWrapper.JDBCConnectionWrapper will likely provide better performance compared to the background versions. Furthermore, a connection wrapper used in a parallel ETL program should always be wrapped using parallel.shareconnectionwrapper() to ensure safe parallel database access, which itself runs the connection wrapper in a separate process or thread depending on the implementation. As the two implementations are very similar and provide an identical interface, either set of implementations might be removed in a future release.