tables¶
This module contains classes for looking up rows, inserting rows and updating rows in dimensions and fact tables. Rows are represented as dictionaries mapping between attribute names and attribute values.
Note that named arguments should be used when instantiating classes. This improves readability and guards against errors in case of future API changes.
Many of the methods take an optional ‘namemapping’ argument which is explained here, but not repeated in the documentation for the individual methods: Consider a method m which is given a row r and a namemapping n. Assume that the method m uses the attribute a in r (i.e., r[a]). If the attribute a is not in the namemapping, m will just use r[a] as expected. But if the attribute a is in the namemapping, the name a is mapped to another name and the other name is used. That means that m then uses r[n[a]]. This is practical if attribute names in the considered rows and DW tables differ. If, for example, data is inserted into an order dimension in the DW that has the attribute order_date, but the source data uses the attribte name date, we can use a name mapping from order_date to date: dim.insert(row=…, namemapping={‘order_date’:’date’})
- class pygrametl.tables.AccumulatingSnapshotFactTable(name, keyrefs, otherrefs, measures=(), ignorenonerefs=True, ignorenonemeasures=True, factexpander=None, targetconnection=None)¶
Bases:
FactTable
A class for accessing and updating an accumulating fact table in the DW. Facts in an accumulating fact table can be updated. The class does no caching and all lookups and updates are sent to the DBMS (and an index on the keyrefs should thus be considered).
Arguments:
name: the name of the fact table in the DW
keyrefs: a sequence of attribute names that constitute the primary key of the fact tables. This is a subset of the dimension references and these references are not allowed to be updated.
otherrefs: a sequence of dimension references that can be updated.
measures: a possibly empty sequence of measure names. Default: ()
ignorenonerefs: A flag deciding if None values for attributes in otherrefs are ignored when doing an update. If True (default), the existing value in the database will not be overwritten by a None.
ignorenonemeasures: A flag deciding if None values for attributes in measures are ignored when doing an update. If True (default), the existing value in the database will not be overwritten by a None.
factexpander: A function(row, namemapping, set of names of updated attributes). This function is called by the ensure method before it calls the update method if a row has been changed. This is, e.g., practical if lag measures should be computed before the row in the fact table is updated. The function should make its changes directly on the passed row.
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- ensure(row, namemapping={})¶
Lookup the given row. If that fails, insert it. If found, see if values for attributes in otherrefs or measures have changed and update the found row if necessary (note that values for attributes in keyrefs are not allowed to change). If an update is necessary and a factexpander is defined, the row will first be updated with any missing otherrefs/measures and the factexpander will be run on it. Return nothing.
Arguments:
row: the row to insert or update if needed. Must contain the keyrefs attributes. For missing attributes from otherrefs and measures, the value is set to None if the row has to be inserted.
namemapping: an optional namemapping (see module’s documentation)
- update(row, namemapping={})¶
- class pygrametl.tables.BasePartitioner(parts)¶
Bases:
object
A base class for partitioning between several parts.
See also DimensionPartitioner and FactTablePartitioner.
- addpart(part)¶
Add a part
- droppart(part=None)¶
Drop a part. If an argument is given, it must be a part of the patitioner and it will then be removed. If no argument is given, the first part is removed.
- endload()¶
Call endload on all parts
- getpart(row, namemapping={})¶
Find the part that should handle the given row. The provided implementation in BasePartitioner does only use round robin partitioning, but subclasses apply other methods
- parts()¶
Return the parts the partitioner works on
- class pygrametl.tables.BatchFactTable(name, keyrefs, measures=(), batchsize=10000, usemultirow=False, targetconnection=None)¶
Bases:
FactTable
A class for accessing a fact table in the DW. This class performs performs insertions in batches.
Arguments:
name: the name of the fact table in the DW
keyrefs: a sequence of attribute names that constitute the primary key of the fact tables (i.e., the dimension references)
measures: a possibly empty sequence of measure names. Default: ()
batchsize: an int deciding how many insert operations should be done in one batch. Default: 10000
usemultirow: load batches with an INSERT INTO name VALUES statement instead of executemany(). WARNING: single quotes are automatically escaped. Other forms of sanitization must be manually performed.
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- property awaitingrows¶
Return the amount of rows awaiting to be loaded into the table
- endload()¶
Finalize the load.
- class pygrametl.tables.BulkDimension(name, key, attributes, bulkloader, lookupatts=(), idfinder=None, defaultidvalue=None, rowexpander=None, cachefullrows=False, fieldsep='\\t', rowsep='\\n', nullsubst=None, tempdest=None, bulksize=500000, usefilename=False, strconverter=pygrametl.getdbfriendlystr, encoding=None, dependson=(), targetconnection=None)¶
Bases:
_BaseBulkloadable
,CachedDimension
A class for accessing a dimension table. Does caching and bulk loading.
Unlike CachedBulkDimension, this class always caches all dimension data.
The class caches all dimension members in memory. Newly inserted dimension members are also put into the cache. The class does not INSERT new dimension members into the underlying database table immediately when insert or ensure is invoked. Instead, the class does bulk loading of new members. When a certain amount of new dimension members have been inserted (configurable through __init__’s bulksize argument), a user-provided bulkloader method is called.
Calls of lookup and ensure will only use the cache and does not invoke any database operations. It is also possible to use the update and getbyvals methods, but calls of these will invoke the bulkloader first (and performance can degrade). If the dimension table’s full rows are cached (by setting __init__’s cachefullrow argument to True), a call of getbykey will only use the cache, but if cachefullrows==False (which is the default), the bulkloader is again invoked first.
We assume that the DB doesn’t change or add any attribute values that are cached. For example, a DEFAULT value in the DB or automatic type coercion can break this assumption.
Arguments:
name: the name of the dimension table in the DW
key: the name of the primary key in the DW
attributes: a sequence of the attribute names in the dimension table. Should not include the name of the primary key which is given in the key argument.
bulkloader: A method m(name, attributes, fieldsep, rowsep, nullsubst, tempdest) that is called to load data from a temporary file into the DW. The argument “attributes” is a list of the names of the columns to insert values into and show the order in which the attribute values appear in the temporary file. The rest of the arguments are similar to those arguments with identical names that are described below. The argument “tempdest” can, however, be 1) a string with a filename or 2) a file object. This is determined by the usefilename argument (see below).
lookupatts: A subset of the attributes that uniquely identify a dimension members. These attributes are thus used for looking up members. If not given, it is assumed that lookupatts = attributes
idfinder: A function(row, namemapping) -> key value that assigns a value to the primary key attribute based on the content of the row and namemapping. If not given, it is assumed that the primary key is an integer, and the assigned key value is then the current maximum plus one.
defaultidvalue: An optional value to return when a lookup fails. This should thus be the ID for a preloaded “Unknown” member.
rowexpander: A function(row, namemapping) -> row. This function is called by ensure before insertion if a lookup of the row fails. This is practical if expensive calculations only have to be done for rows that are not already present. For example, for a date dimension where the full date is used for looking up rows, a rowexpander can be set such that week day, week number, season, year, etc. are only calculated for dates that are not already represented. If not given, no automatic expansion of rows is done.
cachefullrows: a flag deciding if full rows should be cached. If not, the cache only holds a mapping from lookupattributes to key values. Default: False.
fieldsep: a string used to separate fields in the temporary file. Default: ‘\t’
rowsep: a string used to separate rows in the temporary file. Default: ‘\n’
nullsubst: an optional string used to replace None values. If nullsubst=None, no substitution takes place. Default: None
tempdest: a file object or None. If None a named temporary file is used. Default: None
bulksize: an int deciding the number of rows to load in one bulk operation. Default: 500000
usefilename: a value deciding if the file should be passed to the bulkloader by its name instead of as a file-like object. This is, e.g., necessary when the bulk loading is invoked through SQL (instead of directly via a method on the PEP249 driver). It is also necessary if the bulkloader runs in another process. Default: False
strconverter: a method m(value, nullsubst) -> str to convert values into strings that can be written to the temporary file and eventually bulkloaded. Default: pygrametl.getdbfriendlystr
encoding: a string with the encoding to use. If None, locale.getpreferredencoding() is used. This argument is ignored under Python 2! Default: None
dependson: a sequence of other bulkloadble tables that should be loaded before this instance does bulkloading. Default: ()
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- getbykey(keyvalue)¶
Lookup and return the row with the given key value.
If no row is found in the dimension table, the function returns a row where all values (including the key) are None.
- insert(row, namemapping={})¶
Insert the given row. Return the new key value.
Arguments:
row: the row to insert. The dict is not updated. It must contain all attributes, and is allowed to contain more attributes than that. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.BulkFactTable(name, keyrefs, measures, bulkloader, fieldsep='\\t', rowsep='\\n', nullsubst=None, tempdest=None, bulksize=500000, usefilename=False, strconverter=pygrametl.getdbfriendlystr, encoding=None, dependson=())¶
Bases:
_BaseBulkloadable
Class for addition of facts to a fact table. Reads are not supported.
Arguments:
name: the name of the fact table in the DW
keyrefs: a sequence of attribute names that constitute the primary key of the fact tables (i.e., the dimension references)
measures: a possibly empty sequence of measure names.
bulkloader: A method m(name, attributes, fieldsep, rowsep, nullsubst, tempdest) that is called to load data from a temporary file into the DW. The argument “attributes” is the combination of keyrefs and measures (i.e., a list of the names of the columns to insert values into) and show the order in which the attribute values appear in the temporary file. The rest of the arguments are similar to those arguments with identical names that are given to BulkFactTable.__init__ as described here. The argument “tempdest” can, however, be 1) a string with a filename or 2) a file object. This is determined by the usefilename argument to BulkFactTable.__init__ (see below).
fieldsep: a string used to separate fields in the temporary file. Default: ‘\t’
rowsep: a string used to separate rows in the temporary file. Default: ‘\n’
nullsubst: an optional string used to replace None values. If nullsubst=None, no substitution takes place. Default: None
tempdest: a file object or None. If None a named temporary file is used. Default: None
bulksize: an int deciding the number of rows to load in one bulk operation. Default: 500000
usefilename: a value deciding if the file should be passed to the bulkloader by its name instead of as a file-like object. This is, e.g., necessary when the bulk loading is invoked through SQL (instead of directly via a method on the PEP249 driver). It is also necessary if the bulkloader runs in another process (for example, when if the BulkFactTable is wrapped by a DecoupledFactTable and invokes the bulkloader on a shared connection wrapper). Default: False
strconverter: a method m(value, nullsubst) -> str to convert values into strings that can be written to the temporary file and eventually bulkloaded. Default: pygrametl.getdbfriendlystr
encoding: a string with the encoding to use. If None, locale.getpreferredencoding() is used. This argument is ignored under Python 2! Default: None
dependson: a sequence of other bulkloadble tables that should be bulkloaded before this instance does bulkloading (e.g., if the fact table has foreign keys to some bulk-loaded dimension table). Default: ()
- class pygrametl.tables.CachedBulkDimension(name, key, attributes, bulkloader, lookupatts=(), idfinder=None, defaultidvalue=None, rowexpander=None, usefetchfirst=False, cachefullrows=False, fieldsep='\\t', rowsep='\\n', nullsubst=None, tempdest=None, bulksize=5000, cachesize=10000, usefilename=False, strconverter=pygrametl.getdbfriendlystr, encoding=None, dependson=(), targetconnection=None)¶
Bases:
_BaseBulkloadable
,CachedDimension
A class for accessing a dimension table. Does caching and bulk loading.
Unlike BulkDimension, the cache size is configurable and lookups may thus lead to database operations.
The class caches dimension members in memory. Newly inserted dimension members are also put into the cache. The class does not INSERT new dimension members into the underlying database table immediately when insert or ensure is invoked. Instead, the class does bulk loading of new members. When a certain amount of new dimension members have been inserted (configurable through __init__’s bulksize argument), a user-provided bulkloader method is called.
It is also possible to use the update and getbyvals methods, but calls of these will invoke the bulkloader first (and performance can degrade). If the dimension table’s full rows are cached (by setting __init__’s cachefullrow argument to True), a call of getbykey will only use the cache, but if cachefullrows==False (which is the default), the bulkloader is again invoked first.
We assume that the DB doesn’t change or add any attribute values that are cached. For example, a DEFAULT value in the DB or automatic type coercion can break this assumption.
Arguments:
name: the name of the dimension table in the DW
key: the name of the primary key in the DW
attributes: a sequence of the attribute names in the dimension table. Should not include the name of the primary key which is given in the key argument.
bulkloader: A method m(name, attributes, fieldsep, rowsep, nullsubst, tempdest) that is called to load data from a temporary file into the DW. The argument “attributes” is a list of the names of the columns to insert values into and show the order in which the attribute values appear in the temporary file. The rest of the arguments are similar to those arguments with identical names that are described below. The argument “tempdest” can, however, be 1) a string with a filename or 2) a file object. This is determined by the usefilename argument (see below).
lookupatts: A subset of the attributes that uniquely identify a dimension members. These attributes are thus used for looking up members. If not given, it is assumed that lookupatts = attributes
idfinder: A function(row, namemapping) -> key value that assigns a value to the primary key attribute based on the content of the row and namemapping. If not given, it is assumed that the primary key is an integer, and the assigned key value is then the current maximum plus one.
defaultidvalue: An optional value to return when a lookup fails. This should thus be the ID for a preloaded “Unknown” member.
rowexpander: A function(row, namemapping) -> row. This function is called by ensure before insertion if a lookup of the row fails. This is practical if expensive calculations only have to be done for rows that are not already present. For example, for a date dimension where the full date is used for looking up rows, a rowexpander can be set such that week day, week number, season, year, etc. are only calculated for dates that are not already represented. If not given, no automatic expansion of rows is done.
usefetchfirst: a flag deciding if the SQL:2008 FETCH FIRST clause is used when prefil is True. Depending on the used DBMS and DB driver, this can give significant savings wrt. to time and memory. Not all DBMSs support this clause yet. Default: False
cachefullrows: a flag deciding if full rows should be cached. If not, the cache only holds a mapping from lookupattributes to key values. Default: False.
fieldsep: a string used to separate fields in the temporary file. Default: ‘\t’
rowsep: a string used to separate rows in the temporary file. Default: ‘\n’
nullsubst: an optional string used to replace None values. If nullsubst=None, no substitution takes place. Default: None
tempdest: a file object or None. If None a named temporary file is used. Default: None
bulksize: an int deciding the number of rows to load in one bulk operation. Default: 5000
cachesize: the maximum number of rows to cache. If less than or equal to 0, unlimited caching is used. Default: 10000
usefilename: a value deciding if the file should be passed to the bulkloader by its name instead of as a file-like object. This is, e.g., necessary when the bulk loading is invoked through SQL (instead of directly via a method on the PEP249 driver). It is also necessary if the bulkloader runs in another process. Default: False
strconverter: a method m(value, nullsubst) -> str to convert values into strings that can be written to the temporary file and eventually bulkloaded. Default: pygrametl.getdbfriendlystr
encoding: a string with the encoding to use. If None, locale.getpreferredencoding() is used. This argument is ignored under Python 2! Default: None
dependson: a sequence of other bulkloadble tables that should be loaded before this instance does bulkloading. Default: ()
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- getbykey(keyvalue)¶
Lookup and return the row with the given key value.
If no row is found in the dimension table, the function returns a row where all values (including the key) are None.
- insert(row, namemapping={})¶
Insert the given row. Return the new key value.
Arguments:
row: the row to insert. The dict is not updated. It must contain all attributes, and is allowed to contain more attributes than that. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- lookup(row, namemapping={})¶
Find the key for the row with the given values.
Arguments:
row: a dict which must contain at least the lookup attributes
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.CachedDimension(name, key, attributes, lookupatts=(), idfinder=None, defaultidvalue=None, rowexpander=None, size=10000, prefill=False, cachefullrows=False, cacheoninsert=True, usefetchfirst=False, targetconnection=None)¶
Bases:
Dimension
A class for accessing a dimension. Does caching.
We assume that the DB doesn’t change or add any attribute values that are cached. For example, a DEFAULT value in the DB or automatic type coercion can break this assumption.
Arguments:
name: the name of the dimension table in the DW
key: the name of the primary key in the DW
attributes: a sequence of the attribute names in the dimension table. Should not include the name of the primary key which is given in the key argument.
lookupatts: A subset of the attributes that uniquely identify a dimension members. These attributes are thus used for looking up members. If not given, it is assumed that lookupatts = attributes
idfinder: A function(row, namemapping) -> key value that assigns a value to the primary key attribute based on the content of the row and namemapping. If not given, it is assumed that the primary key is an integer, and the assigned key value is then the current maximum plus one.
defaultidvalue: An optional value to return when a lookup fails. This should thus be the ID for a preloaded “Unknown” member.
rowexpander: A function(row, namemapping) -> row. This function is called by ensure before insertion if a lookup of the row fails. This is practical if expensive calculations only have to be done for rows that are not already present. For example, for a date dimension where the full date is used for looking up rows, a rowexpander can be set such that week day, week number, season, year, etc. are only calculated for dates that are not already represented. If not given, no automatic expansion of rows is done.
size: the maximum number of rows to cache. If less than or equal to 0, unlimited caching is used. Default: 10000
prefill: a flag deciding if the cache should be filled when initialized. Default: False
cachefullrows: a flag deciding if full rows should be cached. If not, the cache only holds a mapping from lookupattributes to key values. Default: False.
cacheoninsert: a flag deciding if the cache should be updated when insertions are done. Default: True
usefetchfirst: a flag deciding if the SQL:2008 FETCH FIRST clause is used when prefil is True. Depending on the used DBMS and DB driver, this can give significant savings wrt. to time and memory. Not all DBMSs support this clause yet. Default: False
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- lookup(row, namemapping={})¶
Find the key for the row with the given values.
Arguments:
row: a dict which must contain at least the lookup attributes
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.DecoupledDimension(dim, returnvalues=True, consumes=(), attstoconsume=(), batchsize=500, queuesize=200)¶
Bases:
Decoupled
A Dimension-like class that enables parallelism by executing all operations on a given Dimension in a separate, dedicated process (that Dimension is said to be “decoupled”).
Arguments:
dim: the Dimension object to use in a separate process
returnvalues: decides if return values from method calls on dim should be kept such that they can be fetched by the caller or another Decoupled instance
consumes: a sequence of Decoupled objects from which to fetch returnvalues (that are used to replace FutureResults in arguments). Default: ()
attstoconsume: a sequence of the attribute names in rows that should have FutureResults replaced by actual return values. Does not have to be given, but may improve performance when given. Default: ()
batchsize: the size of batches (grouped method calls) transferred between the processes. NB: Large values do not necessarily give good performance Default: 500
queuesize: the maximum amount of waiting batches. Infinite if less than or equal to 0. NB: Large values do not necessarily give good performance. Default: 200
- endload()¶
Invoke endload on the decoupled Dimension in the separate process and return when all waiting method calls have been executed
- ensure(row, namemapping={})¶
Invoke ensure on the decoupled Dimension in the separate process
- getbykey(keyvalue)¶
Invoke getbykey on the decoupled Dimension in the separate process
- getbyvals(row, namemapping={})¶
Invoke betbycals on the decoupled Dimension in the separate process
- insert(row, namemapping={})¶
Invoke insert on the decoupled Dimension in the separate process
- lookup(row, namemapping={})¶
Invoke lookup on the decoupled Dimension in the separate process
- scdensure(row, namemapping={})¶
Invoke scdensure on the decoupled Dimension in the separate process
- class pygrametl.tables.DecoupledFactTable(facttbl, returnvalues=True, consumes=(), attstoconsume=(), batchsize=500, queuesize=200)¶
Bases:
Decoupled
A FactTable-like class that enables parallelism by executing all operations on a given FactTable in a separate, dedicated process (that FactTable is said to be “decoupled”).
Arguments:
facttbl: the FactTable object to use in a separate process
returnvalues: decides if return values from method calls on facttbl should be kept such that they can be fetched by the caller or another Decoupled instance
consumes: a sequence of Decoupled objects from which to fetch returnvalues (that are used to replace FutureResults in arguments). Default: ()
attstoconsume: a sequence of the attribute names in rows that should have FutureResults replaced by actual return values. Does not have to be given, but may improve performance when given. Default: ()
batchsize: the size of batches (grouped method calls) transferred between the processes. NB: Large values do not necessarily give good performance Default: 500
queuesize: the maximum amount of waiting batches. Infinite if less than or equal to 0. NB: Large values do not necessarily give good performance. Default: 200
- endload()¶
Invoke endload on the decoupled FactTable in the separate process and return when all waiting method calls have been executed
- ensure(row, namemapping={})¶
Invoke ensure on the decoupled FactTable in the separate process
- insert(row, namemapping={})¶
Invoke insert on the decoupled FactTable in the separate process
- lookup(row, namemapping={})¶
Invoke lookup on the decoupled FactTable in the separate process
- class pygrametl.tables.Dimension(name, key, attributes, lookupatts=(), idfinder=None, defaultidvalue=None, rowexpander=None, targetconnection=None)¶
Bases:
object
A class for accessing a dimension. Does no caching.
Arguments:
name: the name of the dimension table in the DW
key: the name of the primary key in the DW
attributes: a sequence of the attribute names in the dimension table. Should not include the name of the primary key which is given in the key argument.
lookupatts: A subset of the attributes that uniquely identify a dimension members. These attributes are thus used for looking up members. If not given, it is assumed that lookupatts = attributes
idfinder: A function(row, namemapping) -> key value that assigns a value to the primary key attribute based on the content of the row and namemapping. If not given, it is assumed that the primary key is an integer, and the assigned key value is then the current maximum plus one.
defaultidvalue: An optional value to return when a lookup fails. This should thus be the ID for a preloaded “Unknown” member.
rowexpander: A function(row, namemapping) -> row. This function is called by ensure before insertion if a lookup of the row fails. This is practical if expensive calculations only have to be done for rows that are not already present. For example, for a date dimension where the full date is used for looking up rows, a rowexpander can be set such that week day, week number, season, year, etc. are only calculated for dates that are not already represented. If not given, no automatic expansion of rows is done.
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- endload()¶
Finalize the load.
- ensure(row, namemapping={})¶
Lookup the given row. If that fails, insert it. Return the key value.
If the lookup fails and a rowexpander was set when creating the instance, this rowexpander is called before the insert takes place.
Arguments:
row: the row to lookup or insert. Must contain the lookup attributes. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- getbykey(keyvalue)¶
Lookup and return the row with the given key value.
If no row is found in the dimension table, the function returns a row where all values (including the key) are None.
- getbyvals(values, namemapping={})¶
Return a list of all rows with values identical to the given.
Arguments:
values: a dict which must hold a subset of the tables attributes. All rows that have identical values for all attributes in this dict are returned.
namemapping: an optional namemapping (see module’s documentation)
- insert(row, namemapping={})¶
Insert the given row. Return the new key value.
Arguments:
row: the row to insert. The dict is not updated. It must contain all attributes, and is allowed to contain more attributes than that. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- lookup(row, namemapping={})¶
Find the key for the row with the given values.
Arguments:
row: a dict which must contain at least the lookup attributes
namemapping: an optional namemapping (see module’s documentation)
- update(row, namemapping={})¶
Update a single row in the dimension table.
Arguments:
row: a dict which must contain the key for the dimension. The row with this key value is updated such that it takes the value of row[att] for each attribute att which is also in row.
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.DimensionPartitioner(parts, getbyvalsfromall=False, partitioner=None)¶
Bases:
BasePartitioner
A Dimension-like class that handles partitioning.
Partitioning is done between a number of Dimension objects called the parts. The class offers the interface of Dimensions (incl. scdensure from SlowlyChangingDimension). When a method is called, the corresponding method on one of the parts (chosen by a user-definable partitioner function) will be invoked. The parts can operate on a single physical dimension table or different physical tables.
Arguments:
parts: a sequence of Dimension objects.
getbyvalsfromall: determines if getbyvals should be answered by means of all parts (when getbyvalsfromall = True) or only the first part, i.e., parts[0] (when getbybalsfromall = False). Default: False
partitioner: None or a callable p(dict) -> int where the argument is a dict mapping from the names of the lookupatts to the values of the lookupatts. The resulting int is used to determine which part a given row should be handled by. When partitioner is None, a default partitioner is used. This partitioner computes the hash value of each value of the lookupatts and adds them together.
- ensure(row, namemapping={})¶
Invoke ensure on the relevant Dimension part
- getbykey(keyvalue)¶
Invoke getbykey on the relevant Dimension part
- getbyvals(values, namemapping={})¶
Invoke getbyvals on the first part or all parts (depending on the value of the instance’s getbyvalsfromall)
- getpart(row, namemapping={})¶
Return the part that should handle the given row
- insert(row, namemapping={})¶
Invoke insert on the relevant Dimension part
- lookup(row, namemapping={})¶
Invoke lookup on the relevant Dimension part
- scdensure(row, namemapping={})¶
Invoke scdensure on the relevant Dimension part
- update(row, namemapping={})¶
Invoke update on the relevant Dimension part
- class pygrametl.tables.FactTable(name, keyrefs, measures=(), targetconnection=None)¶
Bases:
object
A class for accessing a fact table in the DW.
Arguments:
name: the name of the fact table in the DW
keyrefs: a sequence of attribute names that constitute the primary key of the fact tables (i.e., the dimension references)
measures: a possibly empty sequence of measure names. Default: ()
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- endload()¶
Finalize the load.
- ensure(row, compare=False, namemapping={})¶
Ensure that a fact is present (insert it if it is not already there).
Return True if a fact with identical values for keyrefs attributes was already present in the fact table; False if not.
Arguments:
row: a dict at least containing the attributes of the fact table
compare: a flag deciding if measure vales from a fact that was looked up are compared to those in the given row. If True and differences are found, a ValueError is raised. Default: False
namemapping: an optional namemapping (see module’s documentation)
- insert(row, namemapping={})¶
Insert a fact into the fact table.
Arguments:
row: a dict at least containing values for all the fact table’s attributes (both keys/references and measures).
namemapping: an optional namemapping (see module’s documentation)
- lookup(keyvalues, namemapping={})¶
Lookup a fact from the given key values. Return key and measure vals.
Return None if no fact is found.
Arguments:
keyvalues: a dict at least containing values for all keys
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.FactTablePartitioner(parts, partitioner=None)¶
Bases:
BasePartitioner
A FactTable-like class that handles partitioning.
Partitioning is done between a number of FactTable objects called the parts. The class offers the interface of FactTable. When a method is called, the corresponding method on one of the parts (chosen by a user-definable partitioner function) will be invoked. The parts can operate on a single physical fact table or different physical tables.
Arguments:
parts: a sequence of FactTable objects.
partitioner: None or a callable p(dict) -> int where the argument is a dict mapping from the names of the keyrefs to the values of the keyrefs. The resulting int is used to determine which part a given row should be handled by. When partitioner is None, a default partitioner is used. This partitioner computes the sum of all the keyrefs values.
- ensure(row, namemapping={})¶
Invoke ensure on the relevant part
- getpart(row, namemapping={})¶
Return the relevant part for the given row
- insert(row, namemapping={})¶
Invoke insert on the relevant part
- lookup(row, namemapping={})¶
Invoke lookup on the relevant part
- class pygrametl.tables.SlowlyChangingDimension(name, key, attributes, lookupatts, orderingatt=None, versionatt=None, fromatt=None, fromfinder=None, toatt=None, tofinder=None, minfrom=None, maxto=None, srcdateatt=None, srcdateparser=pygrametl.ymdparser, type1atts=(), cachesize=10000, prefill=False, idfinder=None, usefetchfirst=False, useorderby=True, targetconnection=None)¶
Bases:
Dimension
A class for accessing a slowly changing dimension of “type 2”.
“Type 1” updates can also be applied for a subset of the attributes.
Caching is used. We assume that the DB doesn’t change or add any attribute values that are cached. For example, a DEFAULT value in the DB or automatic type coercion can break this assumption.
Arguments:
name: the name of the dimension table in the DW
key: the name of the primary key in the DW
attributes: a sequence of the attribute names in the dimension table. Should not include the name of the primary key which is given in the key argument, but should include versionatt, fromatt, and toatt.
lookupatts: a sequence with a subset of the attributes that uniquely identify a dimension members. These attributes are thus used for looking up members.
orderingatt: the name of the attribute used to identify the newest version. The version holding the greatest value is considered to be the newest. If orderingatt is None, versionatt is used. If versionatt is also None, toatt is used and NULL is considered as the greatest value. If toatt is also None, fromatt is used and NULL is considered as the smallest values. If orderingatt, versionatt, toatt, and fromatt are all None, an error is raised.
versionatt: the name of the attribute holding the version number
fromatt: the name of the attribute telling from when the version becomes valid. Not used if None. Default: None
fromfinder: a function(targetconnection, row, namemapping) returning a value for the fromatt for a new version (the function is first used when it is determined that a new version must be added; it is not applied to determine this). If fromfinder is None and srcdateatt is also None, pygrametl.today is used as fromfinder. If fromfinder is None and srcdateatt is not None, pygrametl.datereader(srcdateatt, srcdateparser) is used. In other words, if no date attribute and no special date function are given, new versions get the date of the current day. If a date attribute is given (but no date function), the date attribute’s value is converted (by means of srcdateparser) and a new version gets the result of this as the date it is valid from. Default: None
toatt: the name of the attribute telling until when the version is valid. Not used if None. Default: None
tofinder: a function(targetconnection, row, namemapping) returning a value for the toatt. If not set, fromfinder is used (note that if fromfinder is None, it is set to a default function – see the comments about fromfinder. The possibly modified value is used here.) Default: None
minfrom: the value to use for fromatt for the 1st version of a member if fromatt is not already set. If None, the value is found in the same way as for other new versions, i.e., as described for fromfinder. If fromatt should take the value NULL for the 1st version, set minfrom to a tuple holding a single element which is None: (None,). Note that minto affects the 1st version, not any following versions. Note also that if the member to insert already contains a value for fromatt, minfrom is ignored. Default: None.
maxto: the value to use for toatt for new members. Default: None
srcdateatt: the name of the attribute in the source data that holds a date showing when a version is valid from. The data is converted to a datetime by applying srcdateparser on it. If not None, the date attribute is also used when comparing a potential new version to the newest version in the DB. If None, the date fields are not compared. Default: None
srcdateparser: a function that takes one argument (a date in the format scrdateatt has) and returns a datetime.datetime. If srcdateatt is None, srcdateparser is not used. Default: pygrametl.ymdparser (i.e., the default value is a function that parses a string of the form ‘yyyy-MM-dd’)
type1atts: a sequence of attributes that should have type1 updates applied. Default: ()
cachesize: the maximum size of the cache. 0 disables caching and values smaller than 0 allows unlimited caching
prefill: decides if the cache should be prefilled with the newest versions. Default: False.
idfinder: a function(row, namemapping) -> key value that assigns a value to the primary key attribute based on the content of the row and namemapping. If not given, it is assumed that the primary key is an integer, and the assigned key value is then the current maximum plus one.
usefetchfirst: a flag deciding if the SQL:2008 FETCH FIRST clause is used when prefil is True. Depending on the used DBMS and DB driver, this can give significant savings wrt. to time and memory. Not all DBMSs support this clause yet. Default: False
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
useorderby: a flag deciding if ORDER BY is used in the SQL to select the newest version. If True, the DBMS thus does the sorting. If False, all versions are fetched and the highest version is found in Python. For some systems, this can lead to significant performance improvements. Default: True
- closecurrent(row, namemapping={}, end=datetime.date(2023, 9, 7))¶
Close the current version by setting its toatt if it is maxto.
The newest version will have its toatt set to the given end argument (default pygrametl.today(), i.e., the current date) only if the value for the newest row’s toatt currently is maxto. Otherwise no update will be done. If toatt is not defined an exception is raised.
Arguments:
row: a dict which must contain at least the lookup attributes
namemapping: an optional namemapping (see module’s documentation)
end: the value to set for the newest version. Default: The current date as given by pygrametl.today()
- lookup(row, namemapping={})¶
Find the key for the newest version with the given values.
Arguments:
row: a dict which must contain at least the lookup attributes
namemapping: an optional namemapping (see module’s documentation)
- lookupasof(row, when, inclusive, namemapping={})¶
Find the key of the version that was valid at a given time.
If both fromatt and toatt have been set, the method returns the key of the version where the given time is between them. If only toatt has been defined, the method returns the key of the first version where toatt is after the given time. If only fromatt is defined, the method returns the key of the most recent version where fromatt is before the given time. See also the description of the argument inclusive. For this to be possible, fromatt and/or toatt must have been set. If this is not the case, a RuntimeError is raised. If no valid version is found for the given time, None is returned.
Note that this function cannot exploit the cache and always results in the execution of a SQL query.
Arguments:
row: a dict which must contain at least the lookup attributes.
when: the time when the version should be valid. This argument must be of a type that can be compared (using <, <=, ==, =>, >) to values in fromatt and/or toatt.
inclusive: decides if the values of fromatt and/or toatt are allowed to be equal to the value of when in the version to find. If only one of fromatt and toatt has been set, the argument should be a single Boolean. If both fromatt and toatt have been set, the argument should be a tuple of two Booleans where the first element decides if the fromatt value can be equal to the the value of when and the second element decides the same for toatt. This tuple must not be (False, False).
namemapping: an optional namemapping (see module’s documentation)
- scdensure(row, namemapping={})¶
Lookup or insert a version of a slowly changing dimension member.
Note
Has side-effects on the given row.
Arguments:
row: a dict containing the attributes for the member. key, versionatt, fromatt, and toatt are not required to be present but will be added (if defined).
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.SnowflakedDimension(references, expectboguskeyvalues=False)¶
Bases:
object
A class for accessing a snowflaked dimension spanning several tables in the underlying database. Lookups and inserts are then automatically spread out over the relevant tables while the programmer only needs to interact with a single SnowflakedDimension instance.
Arguments:
references: a sequence of pairs of Dimension objects [(a1,a2), (b1,b2), …] meaning that a1 has a foreign key to a2 etc. a2 may itself be a sequence of Dimensions: [(a1, [a21, a22, …]), (b1, [b21, b22, …]), …].
The first element of the first pair (a1 in the example above) must be the dimension table representing the lowest level in the hierarchy (i.e., the dimension table the closest to the fact table).
Each dimension must be reachable in a unique way (i.e., the given dimensions form a tree).
A foreign key must have the same name as the primary key it references.
expectboguskeyvalues: If expectboguskeyvalues is True, we allow a key that is used as lookup attribute in a lower level to hold a wrong value (which would typically be None). When ensure or insert is called, we find the correct value for the key in the higher level. If expectboguskeyvalues, we again try a lookup on the lower level after this. If expectboguskeyvalues is False, we move directly on to do an insert. Default: False
- endload()¶
Finalize the load.
- ensure(row, namemapping={})¶
Lookup the given member. If that fails, insert it. Return key value.
If the member must be inserted, data is automatically inserted in all participating tables where (part of) the member is not already represented.
Key values for different levels may be added to the row. It is NOT guaranteed that key values for all levels exist in row afterwards.
Arguments:
row: the row to lookup or insert. Must contain the lookup attributes. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- getbykey(keyvalue, fullrow=False)¶
Lookup and return the row with the given key value.
If no row is found in the dimension table, the function returns a row where all values (including the key) are None.
Arguments:
keyvalue: the key value of the row to lookup
fullrow: a flag deciding if the full row (with data from all tables in the snowflake) should be returned. If False, only data from the lowest level in the hierarchy (i.e., the table the closest to the fact table) is returned. Default: False
- getbyvals(values, namemapping={}, fullrow=False)¶
Return a list of all rows with values identical to the given.
Arguments:
values: a dict which must hold a subset of the tables attributes. All rows that have identical values for all attributes in this dict are returned.
namemapping: an optional namemapping (see module’s documentation)
fullrow: a flag deciding if the full row (with data from all tables in the snowflake) should be returned. If False, only data from the lowest level in the hierarchy (i.e., the table the closest to the fact table) is returned. Default: False
- insert(row, namemapping={})¶
Insert the given member. If that fails, insert it. Return key value.
Data is automatically inserted in all participating tables where (part of) the member is not already represented. If nothing is inserted at all, a ValueError is raised.
Key values for different levels may be added to the row. It is NOT guaranteed that key values for all levels exist in row afterwards.
Arguments:
row: the row to lookup or insert. Must contain the lookup attributes. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- lookup(row, namemapping={})¶
Find the key for the row with the given values.
Arguments:
row: a dict which must contain at least the lookup attributes which all must come from the root (the table closest to the fact table).
namemapping: an optional namemapping (see module’s documentation)
- scdensure(row, namemapping={})¶
Lookup or insert a version of a slowly changing dimension member.
Warning
Still experimental!!! For now we require that only the root is a SlowlyChangingDimension.
Note
Has side-effects on the given row.
Arguments:
row: a dict containing the attributes for the member. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- update(row, namemapping={})¶
Update rows in the participating dimension tables.
If the key of a participating dimension D is in the given row, D.update(…) is invoked.
Note that this function is not good to use for updating a foreign key which here has the same name as the referenced primary key: The referenced table could then also get updated unless it is ensured that none of its attributes are present in the given row.
In other words, it is often better to use the update function directly on the Dimensions that should be updated.
Arguments:
row: a dict. If the key of a participating dimension D is in the dict, D.update(…) is invoked.
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.SubprocessFactTable(keyrefs, measures, executable, initcommand=None, endcommand=None, terminateafter=-1, fieldsep='\\t', rowsep='\\n', nullsubst=None, strconverter=pygrametl.getdbfriendlystr, buffersize=16384)¶
Bases:
object
Class for addition of facts to a subprocess.
The subprocess can, e.g., be a logger or bulkloader. Reads are not supported.
Note that a created instance can not be used when endload() has been called (and endload() is called from pygrametl.commit()).
Arguments:
keyrefs: a sequence of attribute names that constitute the primary key of the fact table (i.e., the dimension references)
measures: a possibly empty sequence of measure names. Default: ()
executable: The subprocess to start.
initcommand: If not None, this command is written to the subprocess before any data.
endcommand: If not None, this command is written to the subprocess after all data has been written.
terminateafter: If greater than or equal to 0, the subprocess is terminated after this amount of seconds after the pipe to the subprocess is closed.
fieldsep: a string used to separate fields in the output sent to the subprocess. Default: ‘\t
rowsep: a string used to separate rows in the output sent to the subprocess. Default: ‘\n’
nullsubst: an optional string used to replace None values. If nullsubst=None, no substitution takes place. Default: None
strconverter: a method m(value, nullsubst) -> str to convert values into strings that can be written to the subprocess. Default: pygrametl.getdbfriendlystr
- endload()¶
Finalize the load.
- insert(row, namemapping={})¶
Insert a fact into the fact table.
Arguments:
row: a dict at least containing values for the keys and measures.
namemapping: an optional namemapping (see module’s documentation)
- class pygrametl.tables.TypeOneSlowlyChangingDimension(name, key, attributes, lookupatts, type1atts=(), cachesize=10000, prefill=False, idfinder=None, usefetchfirst=False, cachefullrows=False, targetconnection=None)¶
Bases:
CachedDimension
A class for accessing a slowly changing dimension of “type 1”.
Caching is used. We assume that the DB doesn’t change or add any attribute values that are cached. For example, a DEFAULT value in the DB or automatic type coercion can break this assumption.
Arguments:
name: the name of the dimension table in the DW
key: the name of the primary key in the DW
attributes: a sequence of the attribute names in the dimension table. Should not include the name of the primary key which is given in the key argument.
lookupatts: A subset of the attributes that uniquely identify a dimension members. These attributes are thus used for looking up members.
type1atts: A sequence of attributes that should have type1 updates applied, it cannot intersect with lookupatts. If not given, it is assumed that type1atts = attributes - lookupatts
cachesize: the maximum number of rows to cache. If less than or equal to 0, unlimited caching is used. Default: 10000
prefill: a flag deciding if the cache should be filled when initialized. Default: False
idfinder: A function(row, namemapping) -> key value that assigns a value to the primary key attribute based on the content of the row and namemapping. If not given, it is assumed that the primary key is an integer, and the assigned key value is then the current maximum plus one.
usefetchfirst: a flag deciding if the SQL:2008 FETCH FIRST clause is used when prefil is True. Depending on the used DBMS and DB driver, this can give significant savings wrt. to time and memory. Not all DBMSs support this clause yet. Default: False
cachefullrows: a flag deciding if full rows should be cached. If not, the cache only holds a mapping from lookupattributes to key values, and from key to the type 1 slowly changing attributes. Default: False.
targetconnection: The ConnectionWrapper to use. If not given, the default target connection is used.
- scdensure(row, namemapping={})¶
Lookup or insert a version of a slowly changing dimension member.
Note
Has side-effects on the given row.
Arguments:
row: a dict containing the attributes for the table. It must contain all attributes if it is the first version of the row to be inserted, updates of existing rows need only contain lookupatts and a subset of type1atts as a missing type1atts is ignored and the existing value left as is in the database. Key is not required to be present but will be added using idfinder if missing.
namemapping: an optional namemapping (see module’s documentation)
- pygrametl.tables.definequote(quotechar)¶
Defines the global quote function, for wrapping identifiers with quotes.
Arguments:
quotechar: If None, do not wrap identifier. If a string, prepend and append quotechar to identifier. If a tuple of two strings, prepend with first element and append with last.