pygrametl

A package for creating Extract-Transform-Load (ETL) programs in Python.

The package contains a number of classes for filling fact tables and dimensions (including snowflaked and slowly changing dimensions), classes for extracting data from different sources, classes for defining ‘steps’ in an ETL flow, and convenient functions for often-needed ETL functionality.

The package’s modules are:

  • datasources for access to different data sources

  • tables for giving easy and abstracted access to dimension and fact tables

  • parallel for parallelizing ETL operations

  • JDBCConnectionWrapper and jythonmultiprocessing for support of Jython

  • aggregators for aggregating data

  • steps for defining steps in an ETL flow

  • FIFODict for providing a dict with a limited size and where elements are removed in first-in first-out order

class pygrametl.ConnectionWrapper(connection, stmtcachesize=1000, paramstyle=None, copyintonew=False)

Bases: object

Provide a uniform representation of different database connection types.

All Dimensions and FactTables communicate with the data warehouse using a ConnectionWrapper. In this way, the code for loading the DW does not have to care about which parameter format is used.

pygrametl’s code uses the ‘pyformat’ but the ConnectionWrapper performs translations of the SQL to use ‘named’, ‘qmark’, ‘format’, or ‘numeric’ if the user’s database connection needs this. Note that the translations are simple and naive. Escaping as in %%(name)s is not taken into consideration. These simple translations are enough for pygrametl’s code which is the important thing here; we’re not trying to make a generic, all-purpose tool to get rid of the problems with different parameter formats. It is, however, possible to disable the translation of a statement to execute such that ‘problematic’ statements can be executed anyway.

Create a ConnectionWrapper around the given PEP 249 connection

If no default ConnectionWrapper already exists, the new ConnectionWrapper is set as the default.

Arguments:

  • connection: An open PEP 249 connection to the database

  • stmtcachesize: A number deciding how many translated statements to cache. A statement needs to be translated when the connection does not use ‘pyformat’ to specify parameters. When ‘pyformat’ is used and copyintonew == False, stmtcachesize is ignored as no statements need to be translated.

  • paramstyle: A string holding the name of the PEP 249 connection’s paramstyle. If None, pygrametl will try to find the paramstyle automatically (an AttributeError can be raised if that fails).

  • copyintonew: A boolean deciding if a new mapping only holding the needed arguments should be created when a statement is executed. Some drivers require this.

close()

Close the connection to the database,

commit()

Commit the transaction.

cursor()

Return a cursor object. Optional method.

execute(stmt, arguments=None, namemapping=None, translate=True)

Execute a statement.

Arguments:

  • stmt: the statement to execute

  • arguments: a mapping with the arguments (default: None)

  • namemapping: a mapping of names such that if stmt uses %(arg)s and namemapping[arg]=arg2, the value arguments[arg2] is used instead of arguments[arg]

  • translate: decides if translation from ‘pyformat’ to the undlying connection’s format should take place. Default: True

executemany(stmt, params, translate=True)

Execute a sequence of statements.

fetchalltuples()

Return all result tuples

fetchmanytuples(cnt)

Return cnt result tuples.

fetchone(names=None)

Return one result row (i.e. dict).

fetchonetuple()

Return one result tuple.

getunderlyingmodule()

Return a reference to the underlying connection’s module.

This is done by considering the connection’s __class__’s __module__ string from right to left (e.g., ‘a.b.c’, ‘a.b’, ‘a’) and looking for the attributes ‘paramstyle’ and ‘connect’ in the possible modules

resultnames()
rollback()

Rollback the transaction.

rowcount()

Return the size of the result.

rowfactory(names=None)

Return a generator object returning result rows (i.e. dicts).

setasdefault()

Set this ConnectionWrapper as the default connection.

pygrametl.copy(row, **renaming)

Create a copy of a dictionary, but allow renamings.

Arguments:

  • row the dictionary to copy

  • **renaming allows renamings to be specified in the form newname=oldname meaning that in the result, oldname will be renamed to newname. The key oldname must exist in the row argument, but it can be assigned to several newnames in the result as in x=’repeated’, y=’repeated’.

pygrametl.datereader(dateattribute, parsingfunction=ymdparser)

Return a function that converts a certain dict member to a datetime.date

When setting, fromfinder for a tables.SlowlyChangingDimension, this method can be used for generating a function that picks the relevant dictionary member from each row and converts it.

Arguments:

  • dateattribute: the attribute the generated function should read

  • parsingfunction: the parsing function that converts the string to a datetime.date

pygrametl.datespan(fromdate, todate, fromdateincl=True, todateincl=True, key='dateid', strings={'date': '%Y-%m-%d', 'monthname': '%B', 'weekday': '%A'}, ints={'day': '%d', 'month': '%m', 'year': '%Y'}, expander=None)

Return a generator yielding dicts for all dates in an interval.

Arguments:

  • fromdate: The lower bound for the date interval. Should be a datetime.date or a YYYY-MM-DD formatted string.

  • todate: The upper bound for the date interval. Should be a datetime.date or a YYYY-MM-DD formatted string.

  • fromdateincl: Decides if fromdate is included. Default: True

  • todateincl: Decides if todate is included. Default: True

  • key: The name of the attribute where an int (YYYYMMDD) that uniquely identifies the date is stored. Default: ‘dateid’.

  • strings: A dict mapping attribute names to formatting directives (as those used by strftime). The returned dicts will have the specified attributes as strings. Default: {‘date’:’%Y-%m-%d’, ‘monthname’:’%B’, ‘weekday’:’%A’}

  • ints: A dict mapping attribute names to formatting directives (as those used by strftime). The returned dicts will have the specified attributes as ints. Default: {‘year’:’%Y’, ‘month’:’%m’, ‘day’:’%d’}

  • expander: A callable f(date, dict) that is invoked on each created dict. Not invoked if None. Default: None

pygrametl.datetimereader(datetimeattribute, parsingfunction=ymdhmsparser)

Return a function that converts a certain dict member to a datetime

When setting, fromfinder for a tables.SlowlyChangingDimension, this method can be used for generating a function that picks the relevant dictionary member from each row and converts it.

Arguments:

  • datetimeattribute: the attribute the generated function should read

  • parsingfunction: the parsing function that converts the string to a datetime.datetime

pygrametl.endload()

Signal to all Dimension and FactTable objects that all data is loaded.

pygrametl.getbool(value, default=None, truevalues={'1', 'True', 't', 'true', True}, falsevalues={'0', 'False', 'f', 'false', False})

Convert a given value to True, False, or a default value.

If the given value is in the given truevalues, True is returned. If the given value is in the given falsevalues, False is returned. Otherwise, the default value is returned.

pygrametl.getdate(targetconnection, ymdstr, default=None)

Convert a string of the form ‘yyyy-MM-dd’ to a Date object.

The returned Date is in the given targetconnection’s format.

Arguments:

  • targetconnection: a ConnectionWrapper whose underlying module’s Date format is used

  • ymdstr: the string to convert

  • default: The value to return if the conversion fails

pygrametl.getdbfriendlystr(value, nullvalue='None')

Covert a value into a string that can be accepted by a DBMS.

None values are converted into the value of the argument nullvalues (default: ‘None’). Bools are converted into ‘1’ or ‘0’ (instead of ‘True’ or ‘False’ as str would do). Other values are currently just converted by means of str.

pygrametl.getdefaulttargetconnection()

Return the default target connection

pygrametl.getfloat(value[, default]) float(value) if possible, else default.
pygrametl.getint(value[, default]) int(value) if possible, else default.
pygrametl.getlong(value[, default]) long(value) if possible, else default.
pygrametl.getstr(value[, default]) str(value) if possible, else default.
pygrametl.getstrippedstr(value, default=None)

Convert given value to a string and use .strip() on the result.

If the conversion fails, the given default value is returned.

pygrametl.getstrornullvalue(value, nullvalue='None')

Convert a given value different from None to a string.

If the given value is None, nullvalue (default: ‘None’) is returned.

pygrametl.gettimestamp(targetconnection, ymdhmsstr, default=None)

Converts a string of the form ‘yyyy-MM-dd HH:mm:ss’ to a Timestamp.

The returned Timestamp is in the given targetconnection’s format.

Arguments:

  • targetconnection: a ConnectionWrapper whose underlying module’s Timestamp format is used

  • ymdhmsstr: the string to convert

  • default: The value to return if the conversion fails

pygrametl.getvalue(row, name, mapping={})

If name in mapping, return row[mapping[name]], else return row[name].

pygrametl.getvalueor(row, name, mapping={}, default=None)

Return the value of name from row using a mapping and a default value.

pygrametl.keepasis(s)
pygrametl.now(ignoredtargetconn=None, ignoredrow=None, ignorednamemapping=None)

Return the time of the first call this method as a datetime.datetime.

pygrametl.project(atts, row, renaming={})

Create a new dictionary with a subset of the attributes.

Arguments:

  • atts is a sequence of attributes in row that should be copied to the new result row.

  • row is the original dictionary to copy data from.

  • renaming is a mapping of names such that for each k in atts, the following holds:

    • If k in renaming then result[k] = row[renaming[k]].

    • If k not in renaming then result[k] = row[k].

    • renaming defaults to {}

pygrametl.rename(row, renaming)

Rename keys in a dictionary.

For each (oldname, newname) in renaming.items(): rename row[oldname] to row[newname].

pygrametl.renamefromto(row, renaming)

Rename keys in a dictionary.

For each (oldname, newname) in renaming.items(): rename row[oldname] to row[newname].

pygrametl.renametofrom(row, renaming)

Rename keys in a dictionary.

For each (newname, oldname) in renaming.items(): rename row[oldname] to row[newname].

pygrametl.rowfactory(source, names, close=True)

Generate dicts with key values from names and data values from source.

The given source should provide A) fetchmany returning the next set of rows, B) next() or fetchone() returning a tuple, or C) fetchall() returning a sequence of tuples. For each tuple, a dict is constructed such that the i’th element in names maps to the i’th value in the tuple.

If close=True (the default), close will be called on source after fetching all tuples.

pygrametl.setdefaults(row, attributes, defaults=None)

Set default values for attributes not present in a dictionary.

Default values are set for “missing” values, existing values are not updated.

Arguments:

  • row is the dictionary to set default values in

  • attributes is either
    1. a sequence of attribute names in which case defaults must be an equally long sequence of these attributes default values or

    2. a sequence of pairs of the form (attribute, defaultvalue) in which case the defaults argument should be None

  • defaults is a sequence of default values (see above)

pygrametl.today(ignoredtargetconn=None, ignoredrow=None, ignorednamemapping=None)

Return the date of the first call this method as a datetime.date object.

pygrametl.tolower(s)
pygrametl.toupper(s)
pygrametl.ymdhmsparser(ymdhmsstr)

Convert an input with a string representation of the form ‘yyyy-MM-dd HH:mm:ss’ or a datetime.datetime to a datetime.datetime.

If the input is None, the return value is also None. If the input is a datetime.datetime, it is returned. Else the input is cast to str and quotes are stripped before it is split into the different parts needed to create a datetime.datetime.

pygrametl.ymdparser(ymdstr)

Convert an input with a string representation of the form ‘yyyy-MM-dd’ or a datetime.date or a datetime.datetime to a datetime.date.

If the input is None, the return value is also None. If the input is a datetime.date, it is returned. If the input is a datetime.datetime, its date is returned. Else the input is cast to str and quotes are stripped before it is split into the different parts needed to create a datetime.date.