.. _dimensions:

Dimensions
==========

pygrametl provides multiple abstractions for representing dimensions in a data warehouse. These abstractions allow for simple modeling of both star and snowflake schemas as well as type 1, type 2, and combined type 1 and type 2 slowly changing dimensions. The abstractions can be used for both sequential and parallel loading of data into dimensions. For more information about the parallel capabilities of pygrametl, see :ref:`parallel`.

The following examples use PostgreSQL as the RDBMS and psycopg2 as the database driver. All of the following classes are currently implemented in the :mod:`.pygrametl.tables` module.

Dimension
---------

:class:`.Dimension` is the simplest abstraction pygrametl provides for interacting with dimensions in a data warehouse. It provides an interface for performing operations on the underlying table, such as inserting rows or looking up keys, while abstracting away the database connection and queries.

Using :class:`.Dimension` is a two-step process. Firstly, an instance of :class:`.ConnectionWrapper` must be created with a :PEP:`249` database connection. The first :class:`.ConnectionWrapper` created is automatically set as the default and used by :class:`.Dimension` for interacting with the database. For more information about database connections, see :ref:`database`. Secondly, an instance of :class:`.Dimension` must be created for each dimension in the data warehouse. To create an instance of :class:`.Dimension`, the name of the table, its primary key, and its non-key columns must be specified. In addition to these required parameters, a subset of columns to be used for looking up keys can also be specified, as well as a function for computing the primary key, a default return value if a lookup fails, and a function for expanding a row automatically.

.. code-block:: python

    import psycopg2
    import pygrametl
    from pygrametl.tables import Dimension

    # Input is a list of rows which in pygrametl is modeled as dicts
    products = [
        {'name': 'Calvin and Hobbes 1', 'category': 'Comic', 'price': '10'},
        {'name': 'Calvin and Hobbes 2', 'category': 'Comic', 'price': '10'},
        {'name': 'Calvin and Hobbes 3', 'category': 'Comic', 'price': '10'},
        {'name': 'Cake and Me', 'category': 'Cookbook', 'price': '15'},
        {'name': 'French Cooking', 'category': 'Cookbook', 'price': '50'},
        {'name': 'Sushi', 'category': 'Cookbook', 'price': '30'},
        {'name': 'Nineteen Eighty-Four', 'category': 'Novel', 'price': '15'},
        {'name': 'The Lord of the Rings', 'category': 'Novel', 'price': '60'}
    ]

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(connection=pgconn)

    # The instance of Dimension connects to the table product in the
    # database using the default connection wrapper created above. The
    # argument lookupatts specifies the columns which need to match
    # when doing a lookup for the key from this dimension
    productDimension = Dimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price'],
        lookupatts=['name'])

    # Filling a dimension is simply done by using the insert method
    for row in products:
        productDimension.insert(row)

    # Ensures that the data is committed and the connection is closed correctly
    conn.commit()
    conn.close()

In the above example, a set of rows with product information is loaded into the product dimension using an instance of :class:`.Dimension` created with information about the table in the database. The list of product information is inserted into the database one row at a time using the method :meth:`.Dimension.insert`. Afterward, the transaction must be committed and the connection closed to ensure that the data is written to the database.
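
The example above only uses the optional parameter :attr:`.lookupatts`. The sketch below illustrates how the other optional parameters mentioned above might be used. It is a non-authoritative sketch that assumes :attr:`.defaultidvalue` is returned by :meth:`.Dimension.lookup` when no matching row is found and that :attr:`.rowexpander` is called as ``rowexpander(row, namemapping)`` by :meth:`.Dimension.ensure` before a missing row is inserted; the row values used are purely illustrative.

.. code-block:: python

    # A sketch of the optional parameters described above, reusing the
    # product table from the example. The behavior described in the comments
    # is an assumption and should be checked against the documentation of
    # pygrametl.tables
    def setdefaultcategory(row, namemapping):
        # Add a category if the source row does not already contain one
        if 'category' not in row:
            row['category'] = 'Unknown'
        return row

    productDimension = Dimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price'],
        lookupatts=['name'],
        defaultidvalue=-1,              # returned by lookup() instead of None
        rowexpander=setdefaultcategory)

    # lookup() uses only the lookupatts; with defaultidvalue set, a failed
    # lookup is assumed to return -1 instead of None
    productid = productDimension.lookup({'name': 'Calvin and Hobbes 4'})

    # ensure() first performs a lookup and only inserts the row, after
    # expanding it with setdefaultcategory, if no row with the same name
    # already exists in the dimension
    productid = productDimension.ensure(
        {'name': 'Calvin and Hobbes 4', 'price': '10'})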

CachedDimension
---------------

:class:`.CachedDimension` extends :class:`.Dimension` with a cache that reduces the latency of lookups by reducing the number of round trips to the database. To control what is cached, three additional parameters have been added to its constructor. The parameter :attr:`.prefill` indicates that the cache should be filled with data from the database when the object is created, while :attr:`.cachefullrows` determines whether entire rows should be cached or only the primary key and the columns defined by :attr:`.lookupatts`. Lastly, the parameter :attr:`.cacheoninsert` specifies whether newly inserted rows should be cached. To keep the cache consistent, the RDBMS is not allowed to modify the rows in any way; a default value set by the RDBMS is an easy-to-miss violation of this requirement.

.. code-block:: python

    import psycopg2
    import pygrametl
    from pygrametl.datasources import CSVSource
    from pygrametl.tables import CachedDimension, FactTable

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(pgconn)

    # The cached dimension is initialized with data from the product table in
    # the database, allowing for more efficient lookups of keys for the fact
    # table, at the cost of requiring it to already contain the necessary data
    productDimension = CachedDimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price'],
        lookupatts=['name'],
        prefill=True)

    # A similar abstraction is created for the data warehouse fact table
    factTable = FactTable(
        name='facttable',
        measures=['sales'],
        keyrefs=['storeid', 'productid', 'dateid'])

    # A CSV file contains information about each product sold by a store
    sales = CSVSource(f=open('sales.csv', 'r', 16384), delimiter='\t')

    # Looking up keys from the product dimension is done using the lookup
    # method with the information read from the sales.csv file. The second
    # argument renames the column product_name from the CSV file to name
    for row in sales:
        # Using only the attributes defined as lookupatts, lookup() checks if
        # a row with matching values is present in the cache. Only if a match
        # cannot be found in the cache does lookup() check the database table
        row['productid'] = productDimension.lookup(row, {'name': 'product_name'})
        factTable.insert(row)

    # Ensures that the data is committed and the connection is closed correctly
    conn.commit()
    conn.close()

The example shows how to use :class:`.CachedDimension` to automatically improve the performance of :meth:`.CachedDimension.lookup`. The :class:`.CachedDimension` caches the values from the product dimension locally, increasing the performance of key lookups as fewer round trips, or none at all if every row is cached, are made to the database.

BulkDimension
-------------

:class:`.BulkDimension` is a dimension class optimized for high insertion throughput and fast lookups. Rows are inserted in bulk from a temporary file, while an in-memory cache is used for lookups. To support this, the RDBMS is not allowed to modify the rows in any way, as this would make the cache and the database table inconsistent. Another aspect of :class:`.BulkDimension` is that :meth:`.BulkDimension.update` and :meth:`.BulkDimension.getbyvals` call :meth:`.BulkDimension.endload`, which inserts all rows stored in the file into the database using a user-defined bulk loading function. Thus, calling these methods often will negate the benefit of bulk loading. The method :meth:`.BulkDimension.getbykey` also forces :class:`.BulkDimension` to bulk load by default, but it can use the cache if :attr:`.cachefullrows` is enabled at the cost of additional memory. :meth:`.BulkDimension.lookup` and :meth:`.BulkDimension.ensure` always use the cache and do not invoke any database operations, as :class:`.BulkDimension` never evicts rows from its cache. If the dimension is too large to be cached in memory, the class :class:`.CachedBulkDimension` should be used instead as it supports bulk loading using a finite cache.

To support bulk loading from a file on disk, multiple additional parameters have been added to :class:`.BulkDimension`'s constructor. These provide control over the temporary file used to store rows, such as the delimiters to use and the number of rows to be bulk loaded. All of these parameters have a default value except for :attr:`.bulkloader`. This parameter must be passed a function that is called for each batch of rows to be bulk loaded. This is necessary as the exact way to perform bulk loading differs from RDBMS to RDBMS.

.. py:function:: func(name, attributes, fieldsep, rowsep, nullval, filehandle)

    Required signature of a function bulk loading data from a file into an RDBMS in pygrametl. For more information about bulk loading, see :ref:`bulkloading`.

    **Arguments:**

    - name: the name of the dimension table in the data warehouse.
    - attributes: a list containing the sequence of attributes in the dimension table.
    - fieldsep: the string used to separate fields in the temporary file.
    - rowsep: the string used to separate rows in the temporary file.
    - nullval: if the :class:`.BulkDimension` was given a string to substitute None values with, that string is passed; otherwise, None is passed.
    - filehandle: either the name of the file or the file object itself, depending on the value of :attr:`.BulkDimension.usefilename`. Using the filename is necessary if the bulk loading is invoked through SQL (instead of directly via a method on the :PEP:`249` driver). It is also necessary if the bulkloader runs in another process.

.. code-block:: python

    import sqlite3
    import psycopg2
    import pygrametl
    from pygrametl.datasources import SQLSource
    from pygrametl.tables import BulkDimension

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(connection=pgconn)

    # This function bulk loads a file into PostgreSQL using psycopg2
    def pgbulkloader(name, attributes, fieldsep, rowsep, nullval, filehandle):
        cursor = conn.cursor()
        # psycopg2 does not accept the default value used to represent NULL
        # by BulkDimension, which is None. Here this is ignored as we have no
        # NULL values that we wish to substitute for a more descriptive value
        cursor.copy_from(file=filehandle, table=name, sep=fieldsep,
                         columns=attributes)

    # In addition to the arguments needed for a Dimension, a reference to the
    # bulk loader defined above must also be passed to BulkDimension
    productDimension = BulkDimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price'],
        lookupatts=['name'],
        bulkloader=pgbulkloader)

    # A PEP 249 connection is sufficient for an SQLSource, so we do not need
    # to create a new instance of ConnectionWrapper to read from the database
    sqconn = sqlite3.connect('product_catalog.db')

    # Encapsulating a database query in an SQLSource allows it to be used as a
    # normal iterator, making it very simple to load the data into another table
    sqlSource = SQLSource(connection=sqconn, query='SELECT * FROM product')

    # Inserting data from a data source into a BulkDimension is performed just
    # like for any other dimension type in pygrametl, as the interface is the same
    for row in sqlSource:
        productDimension.insert(row)

    # Ensures that the data is committed and the connections are closed correctly
    conn.commit()
    conn.close()
    sqconn.close()

The example above shows how to use :class:`.BulkDimension` to efficiently load the contents of a local SQLite database into a data warehouse dimension. This process is a good use case for :class:`.BulkDimension` as :meth:`.BulkDimension.update`, :meth:`.BulkDimension.getbykey`, and :meth:`.BulkDimension.getbyvals` are not used, so no additional calls to :meth:`.BulkDimension.endload` are made. By bulk loading the rows from a file using :meth:`copy_from` instead of inserting them one at a time, the time required to load the dimension is significantly reduced. However, it is important that :meth:`.ConnectionWrapper.commit` is executed after all rows have been inserted into the :class:`.BulkDimension`, as it ensures that the last batch of rows is bulk loaded by calling :meth:`.BulkDimension.endload` on all tables. A downside of :class:`.BulkDimension` is that it caches the entire dimension in memory. If the dimension can be bulk loaded but is too large to cache in memory, :class:`.CachedBulkDimension` should be used instead of :class:`.BulkDimension`.

CachedBulkDimension
-------------------

:class:`.CachedBulkDimension` is very similar to :class:`.BulkDimension` and is also intended for bulk loading a dimension, so only their differences are described here. Unlike :class:`.BulkDimension`, the size of :class:`.CachedBulkDimension`'s cache is limited by the parameter :attr:`.cachesize`. This allows it to be used with a dataset too large to be cached entirely in memory. The trade-off is that :meth:`.CachedBulkDimension.lookup` and :meth:`.CachedBulkDimension.ensure` sometimes have to look up keys in the database instead of always using the cache. However, the method :meth:`.CachedBulkDimension.getbykey` no longer needs to call :meth:`.CachedBulkDimension.endload` if :attr:`.cachefullrows` is not enabled. This is because :class:`.CachedBulkDimension` caches the rows currently in the file in a separate cache. All rows in the file are cached because there is no guarantee that the cache storing rows from the database does not evict rows that are in the file but not yet in the database when the cache is full. An additional cache is therefore needed to ensure that :meth:`.CachedBulkDimension.lookup` and :meth:`.CachedBulkDimension.getbykey` can locate rows before they are loaded into the database.

:meth:`.CachedBulkDimension.insert` stores rows in the file cache, and only when the rows in the file are loaded into the database are they moved to the cache of database rows, in which :meth:`.CachedBulkDimension.lookup` also stores rows if it had to query the database for them. Due to the use of two caches, caching in :class:`.CachedBulkDimension` is controlled by two parameters. The parameter :attr:`.cachesize` controls the size of the cache for rows loaded into the database, while the parameter :attr:`.bulksize` controls the number of rows stored in the file before the dimension bulk loads them. As the rows in the file are all cached in a separate cache, the memory consumption depends on both of these values.

.. note::
    If rows with matching :attr:`lookupatts` are passed to :meth:`insert` without bulk loading occurring between the calls, only the first row will be loaded into the dimension. All calls to :meth:`insert` after the first one will simply return the key of the first row, as it is stored in the file cache.
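
As this section does not have an example of its own, the following sketch shows how a :class:`.CachedBulkDimension` might be configured. It assumes the product table and the ``pgbulkloader`` function from the :class:`.BulkDimension` example above, the values chosen for :attr:`.cachesize` and :attr:`.bulksize` are only illustrative, and the last two calls assume that :meth:`insert` returns the row's key as described in the note.

.. code-block:: python

    from pygrametl.tables import CachedBulkDimension

    # A sketch of a CachedBulkDimension reusing the product table and the
    # pgbulkloader function from the BulkDimension example. The sizes of the
    # database row cache and of the file are only illustrative
    productDimension = CachedBulkDimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price'],
        lookupatts=['name'],
        bulkloader=pgbulkloader,
        cachesize=100000,  # the maximum number of database rows cached
        bulksize=10000)    # rows are bulk loaded after 10,000 inserts

    # As described in the note above, inserting two rows with matching
    # lookupatts without a bulk load in between only loads the first row;
    # the second call is assumed to return the key cached for the first row
    keyone = productDimension.insert(
        {'name': 'Calvin and Hobbes 4', 'category': 'Comic', 'price': '10'})
    keytwo = productDimension.insert(
        {'name': 'Calvin and Hobbes 4', 'category': 'Comic', 'price': '20'})
    assert keyone == keytwo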

.. The space in the header is intentional so the two parts can fit in the toc

TypeOneSlowlyChanging Dimension
-------------------------------

:class:`.TypeOneSlowlyChangingDimension` allows the creation of a type 1 slowly changing dimension. The class is based on :class:`.CachedDimension`, albeit with a few differences. The primary difference between the two classes, besides the additional method :meth:`.TypeOneSlowlyChangingDimension.scdensure`, is that :class:`.TypeOneSlowlyChangingDimension` always caches rows when they are inserted. This is done to minimize the number of round trips to the database needed by :meth:`.TypeOneSlowlyChangingDimension.scdensure`, which increases its throughput. The user must also specify the sequence of attributes to use for looking up keys as :attr:`.lookupatts`, and can optionally specify the sequence of type 1 slowly changing attributes as :attr:`.type1atts`. If :attr:`.type1atts` is not given, it defaults to all attributes minus :attr:`.lookupatts`. The sequences :attr:`.lookupatts` and :attr:`.type1atts` must be disjoint, and an error will be raised if they are not. As caching is used to speed up lookups, it is assumed that the database does not change or add any attribute values in the rows. For example, a default value set by the RDBMS or automatic type coercion can break this assumption.

.. code-block:: python

    import psycopg2
    import pygrametl
    from pygrametl.tables import TypeOneSlowlyChangingDimension

    # Input is a list of rows which in pygrametl is modeled as dicts
    products = [
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '10'},
        {'name': 'Cake and Me', 'category': 'Cookbook', 'price': '15'},
        {'name': 'French Cooking', 'category': 'Cookbook', 'price': '50'},
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '20'},
        {'name': 'Sushi', 'category': 'Cookbook', 'price': '30'},
        {'name': 'Nineteen Eighty-Four', 'category': 'Novel', 'price': '15'},
        {'name': 'The Lord of the Rings', 'category': 'Novel', 'price': '60'},
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '10'}
    ]

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(connection=pgconn)

    # TypeOneSlowlyChangingDimension is created with price as a changing attribute
    productDimension = TypeOneSlowlyChangingDimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price'],
        lookupatts=['name'],
        type1atts=['price'])

    # scdensure determines whether the row already exists in the database
    # and inserts a new row or updates any type1atts that have changed
    for row in products:
        productDimension.scdensure(row)

    # Ensures that the data is committed and the connection is closed correctly
    conn.commit()
    conn.close()

The above example shows a scenario where the price of a product changes over time. The instance of :class:`.TypeOneSlowlyChangingDimension` automatically checks if a product already exists and, if it does, updates the price if the old and new values differ. As a type 1 slowly changing dimension does not store the history of the changes, only the value of the last row to be inserted is kept, so the rows must be loaded in chronological order. If the history of the changes must be stored, a type 2 slowly changing dimension should be created, and :class:`.SlowlyChangingDimension` should be used instead of :class:`.TypeOneSlowlyChangingDimension`.

SlowlyChangingDimension
-----------------------

:class:`.SlowlyChangingDimension` allows for the creation of either a type 2 slowly changing dimension, or a combined type 1 and type 2 slowly changing dimension. To support this functionality, multiple additional parameters have been added to :class:`.SlowlyChangingDimension` compared to :class:`.Dimension`. However, only the additional parameter :attr:`.versionatt` is required when creating a :class:`.SlowlyChangingDimension`. This parameter indicates which of the dimension's attributes stores the row's version number. The method :meth:`.SlowlyChangingDimension.scdensure` updates the table while taking the slowly changing aspect of the dimension into account. If the row is already present, its primary key is returned; if the row is not present, it is inserted into the dimension; and if any attributes have changed, a new version of the row is created. The method :meth:`.SlowlyChangingDimension.lookup` is also changed slightly as it returns the key of the latest version of a row. To improve the performance of lookups for a slowly changing dimension, caching is used, which assumes that the database does not modify any values in the inserted rows; an assumption that the use of default values can break.

It is also possible to look up the version of a member that was valid at a given time by using :meth:`.SlowlyChangingDimension.lookupasof`. This method, however, requires :attr:`.toatt` and/or :attr:`.fromatt` to be given when the :class:`.SlowlyChangingDimension` is created.

.. code-block:: python

    import psycopg2
    import pygrametl
    from pygrametl.tables import SlowlyChangingDimension

    # Input is a list of rows which in pygrametl is modeled as dicts
    products = [
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '20',
         'date': '1990-10-01'},
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '10',
         'date': '1990-12-10'},
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '20',
         'date': '1991-02-01'},
        {'name': 'Cake and Me', 'category': 'Cookbook', 'price': '15',
         'date': '1990-05-01'},
        {'name': 'French Cooking', 'category': 'Cookbook', 'price': '50',
         'date': '1990-05-01'},
        {'name': 'Sushi', 'category': 'Cookbook', 'price': '30',
         'date': '1990-05-01'},
        {'name': 'Nineteen Eighty-Four', 'category': 'Novel', 'price': '15',
         'date': '1990-05-01'},
        {'name': 'The Lord of the Rings', 'category': 'Novel', 'price': '60',
         'date': '1990-05-01'}
    ]

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(connection=pgconn)

    # This slowly changing dimension is created as type 2 only, meaning that a
    # new row only changes the validto attribute in the previous row. validto
    # is a timestamp indicating when the row is no longer valid. As additional
    # parameters, the object is initialized with information about which
    # attributes hold a timestamp for when the row's validity starts and ends.
    # The parameter fromfinder is also given, which is set to a function that
    # computes the timestamp for when the row becomes valid. In this example,
    # the function datereader from pygrametl is used, which converts a
    # timestamp from a str to a datetime.date that PostgreSQL stores as the
    # Date type
    productDimension = SlowlyChangingDimension(
        name='product',
        key='productid',
        attributes=['name', 'category', 'price', 'validfrom', 'validto',
                    'version'],
        lookupatts=['name'],
        fromatt='validfrom',
        fromfinder=pygrametl.datereader('date'),
        toatt='validto',
        versionatt='version')

    # scdensure extends the ensure method with support for updating slowly
    # changing attributes of rows where the lookupatts match. This is done by
    # incrementing the version attribute for the new row, and assigning the
    # new row's fromatt to the old row's toatt, indicating that the old row
    # is no longer valid
    for row in products:
        productDimension.scdensure(row)

    # This finds the key value of the newest version
    newest = productDimension.lookup({'name': 'Calvin and Hobbes'})

    # This finds the key value of the version that was valid on 1990-12-31
    older = productDimension.lookupasof({'name': 'Calvin and Hobbes'},
                                        '1990-12-31', True)

    # Ensures that the data is committed and the connection is closed correctly
    conn.commit()
    conn.close()

As the products in this example change price over time, a :class:`.SlowlyChangingDimension` is used to automate the changes a new row might incur on an existing row. The product information itself is also extended with timestamps indicating at what time a particular product had a certain price. When creating the instance of :class:`.SlowlyChangingDimension`, information about how these timestamps should be interpreted is provided to the constructor. In this case, it is fairly simple as the timestamps provided in the input data can be converted directly to :class:`datetime.date` objects, which can then be inserted into a column of type Date. To automate this conversion, the parameter :attr:`fromfinder` is set to the function returned by :func:`.pygrametl.datereader`, which constructs :class:`datetime.date` objects from the :class:`str` stored in date. However, a user-defined function with the same interface as the function generated by :func:`.pygrametl.datereader` could also be used.
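
The sketch below shows what such a user-defined fromfinder might look like. It assumes, as with the functions created by :func:`.pygrametl.datereader`, that the fromfinder is called with the target connection, the row, and a namemapping, and that the date column of the input rows holds timestamps formatted as yyyy-MM-dd strings; the function name is only illustrative.

.. code-block:: python

    from datetime import datetime

    # A sketch of a user-defined fromfinder. It is assumed to be called with
    # the target connection, the row, and a namemapping like the functions
    # returned by pygrametl.datereader, and the date column is assumed to
    # hold timestamps formatted as yyyy-MM-dd strings
    def dayfromfinder(targetconnection, row, namemapping=None):
        namemapping = namemapping or {}
        datestring = row[namemapping.get('date', 'date')]
        return datetime.strptime(datestring, '%Y-%m-%d').date()

    # The function can then be passed to SlowlyChangingDimension instead of
    # the function returned by pygrametl.datereader('date'), for example:
    # productDimension = SlowlyChangingDimension(..., fromfinder=dayfromfinder)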

When inserting the rows, the method :meth:`.SlowlyChangingDimension.scdensure` is used instead of :meth:`.SlowlyChangingDimension.insert` as it first performs a lookup to check whether a version of the row is already present. If a version is already present, that row is updated by writing the new row's from timestamp into its to attribute, indicating when this version of the row became obsolete, and an incremented version number is assigned to the new row, indicating that it is a newer version of an existing row.

SnowflakedDimension
-------------------

:class:`.SnowflakedDimension` represents a snowflaked dimension as a single object with the same interface as :class:`.Dimension`. Instantiating a :class:`.SnowflakedDimension` is, however, different. Instead of all arguments being passed to the constructor of :class:`.SnowflakedDimension` itself, a dimension object must be created for each table in the snowflaked dimension. These objects are then passed to :class:`.SnowflakedDimension`'s constructor. The dimension objects must be given as pairs that indicate their foreign key relationships, e.g., (a1, a2) should be passed if a1 has a foreign key to a2. Currently, it is a requirement that a foreign key must have the same name as the primary key it references. The only additional configuration supported by :class:`.SnowflakedDimension` is :attr:`.expectboguskeyvalues`, which indicates that a key used as a lookup attribute in a lower level of the hierarchy might not have a matching primary key. Support for slowly changing dimensions of type 2, or a combination of type 1 and type 2, is provided if an instance of :class:`.SlowlyChangingDimension` is at the root of the snowflaked dimension. Currently, only the root dimension can be an instance of :class:`.SlowlyChangingDimension`, and the feature should be considered experimental.

.. code-block:: python

    import psycopg2
    import pygrametl
    from pygrametl.tables import CachedDimension, SnowflakedDimension

    # Input is a list of rows which in pygrametl is modeled as dicts
    products = [
        {'name': 'Calvin and Hobbes 1', 'category': 'Comic',
         'type': 'Fiction', 'price': '10'},
        {'name': 'Calvin and Hobbes 2', 'category': 'Comic',
         'type': 'Fiction', 'price': '10'},
        {'name': 'Calvin and Hobbes 3', 'category': 'Comic',
         'type': 'Fiction', 'price': '10'},
        {'name': 'Cake and Me', 'category': 'Cookbook',
         'type': 'Non-Fiction', 'price': '15'},
        {'name': 'French Cooking', 'category': 'Cookbook',
         'type': 'Non-Fiction', 'price': '50'},
        {'name': 'Sushi', 'category': 'Cookbook',
         'type': 'Non-Fiction', 'price': '30'},
        {'name': 'Nineteen Eighty-Four', 'category': 'Novel',
         'type': 'Fiction', 'price': '15'},
        {'name': 'The Lord of the Rings', 'category': 'Novel',
         'type': 'Fiction', 'price': '60'}
    ]

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(connection=pgconn)

    # The product dimension is in the database represented as a snowflaked
    # dimension, so a CachedDimension object is created for each table
    productTable = CachedDimension(
        name='product',
        key='productid',
        attributes=['name', 'price', 'categoryid'],
        lookupatts=['name'])

    categoryTable = CachedDimension(
        name='category',
        key='categoryid',
        attributes=['category', 'typeid'],
        lookupatts=['category'])

    typeTable = CachedDimension(
        name='type',
        key='typeid',
        attributes=['type'])

    # An instance of SnowflakedDimension is initialized with the created
    # dimensions as input, allowing a snowflaked dimension to be used in the
    # same manner as a dimension consisting of a single table. The dimension's
    # tables are passed as pairs based on their foreign key relationships,
    # meaning the arguments indicate that productTable has a foreign key
    # relationship with categoryTable, and categoryTable has a foreign key
    # relationship with typeTable. If a table has multiple foreign key
    # relationships to tables in the snowflaked dimension, a list must be
    # passed as the second part of the tuple with a dimension object for each
    # table the first argument references through its foreign keys
    productDimension = SnowflakedDimension(references=[
        (productTable, categoryTable),
        (categoryTable, typeTable)])

    # SnowflakedDimension provides the same interface as the other dimension
    # classes; however, some of its methods store keys in the row when they
    # are computed
    for row in products:
        productDimension.insert(row)

    # Ensures that the data is committed and the connection is closed correctly
    conn.commit()
    conn.close()

In the above example, the product dimension is not represented by a single table like in the examples shown for the other dimension classes provided by pygrametl. It is instead represented as a snowflake consisting of multiple tables. To represent this in the ETL flow, a combination of :class:`.SnowflakedDimension` and :class:`.CachedDimension` is used. As multiple tables need to be represented, an instance of :class:`.CachedDimension` is created for each. An instance of :class:`.SnowflakedDimension` is then created to represent the many instances of :class:`.CachedDimension` as a single object.

Interacting with a snowflaked dimension is done through the same interface as provided by the other dimension classes in pygrametl. However, some of :class:`.SnowflakedDimension`'s methods modify the provided rows, as foreign keys need to be computed based on the contents of the rows the object operates on.

.. code-block:: python

    import psycopg2
    import pygrametl
    from pygrametl.tables import CachedDimension, SnowflakedDimension, \
        SlowlyChangingDimension

    # Input is a list of rows which in pygrametl is modeled as dicts
    products = [
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '20',
         'date': '1990-10-01'},
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '10',
         'date': '1990-12-10'},
        {'name': 'Calvin and Hobbes', 'category': 'Comic', 'price': '20',
         'date': '1991-02-01'},
        {'name': 'Cake and Me', 'category': 'Cookbook', 'price': '15',
         'date': '1990-05-01'},
        {'name': 'French Cooking', 'category': 'Cookbook', 'price': '50',
         'date': '1990-05-01'},
        {'name': 'Sushi', 'category': 'Cookbook', 'price': '30',
         'date': '1990-05-01'},
        {'name': 'Nineteen Eighty-Four', 'category': 'Novel', 'price': '15',
         'date': '1990-05-01'},
        {'name': 'The Lord of the Rings', 'category': 'Novel', 'price': '60',
         'date': '1990-05-01'}
    ]

    # The actual database connection is handled by a PEP 249 connection
    pgconn = psycopg2.connect("""host='localhost' dbname='dw' user='dwuser'
                              password='dwpass'""")

    # This ConnectionWrapper will be set as a default and is then implicitly
    # used, but it is stored in conn so transactions can be committed and the
    # connection closed
    conn = pygrametl.ConnectionWrapper(connection=pgconn)

    # The dimension is snowflaked into two tables, one with categories, and
    # the other at the root with name and the slowly changing attribute price
    productTable = SlowlyChangingDimension(
        name='product',
        key='productid',
        attributes=['name', 'price', 'validfrom', 'validto', 'version',
                    'categoryid'],
        lookupatts=['name'],
        fromatt='validfrom',
        fromfinder=pygrametl.datereader('date'),
        toatt='validto',
        versionatt='version')

    categoryTable = CachedDimension(
        name='category',
        key='categoryid',
        attributes=['category'])

    productDimension = SnowflakedDimension(
        references=[(productTable, categoryTable)])

    # Using a SlowlyChangingDimension with a SnowflakedDimension is done in
    # the same manner as a normal SlowlyChangingDimension using scdensure
    for row in products:
        productDimension.scdensure(row)

    # Ensures that the data is committed and the connection is closed correctly
    conn.commit()
    conn.close()

A :class:`.SlowlyChangingDimension` and a :class:`.SnowflakedDimension` can be combined if necessary, with the restriction that all slowly changing attributes must be placed in the root table. To do this, the :class:`.CachedDimension` instance connecting to the root table has to be changed to an instance of :class:`.SlowlyChangingDimension` and the necessary attributes added to the database table. Afterward, :meth:`.SnowflakedDimension.scdensure` can be used to insert and look up rows while ensuring that the slowly changing attributes are updated correctly.