matrixb package

Submodules

matrixb.columnnormalizer module

class matrixb.columnnormalizer.ColumnNormalizer(shorthand_name='default', **kwargs)[source]

Bases: datacleaner.snakecase.SnakeCase

default_translations = {None: '<blank>'}

matrixb.iterator module

class matrixb.iterator.MatrixIterator(matrix)[source]

Bases: object

Defines the iterator class for a Matrix.

top()[source]

Look at the next row, but do not actually iterate over it. This is an extension of the required Python iterator definition.
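
The peek behaviour described for top() can be pictured with a small stand-alone iterator. This is only a sketch: the class name PeekableRows and its one-row buffer are hypothetical and are not taken from the matrixb source.

```python
class PeekableRows:
    """Hypothetical stand-in for an iterator that supports top()."""

    _EMPTY = object()  # sentinel meaning "no row is buffered"

    def __init__(self, rows):
        self._rows = iter(rows)
        self._buffer = self._EMPTY

    def __iter__(self):
        return self

    def top(self):
        # Fetch the next row once and hold it until __next__ consumes it.
        if self._buffer is self._EMPTY:
            self._buffer = next(self._rows)  # StopIteration when exhausted
        return self._buffer

    def __next__(self):
        if self._buffer is not self._EMPTY:
            row, self._buffer = self._buffer, self._EMPTY
            return row
        return next(self._rows)


rows = PeekableRows([[1, 2], [3, 4]])
peeked = rows.top()   # look at the first row without consuming it
first = next(rows)    # normal iteration still yields that same row
second = next(rows)
```

Buffering a single row keeps the peek cheap and leaves the standard iterator protocol untouched.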

matrixb.matrix module

class matrixb.matrix.Matrix(source=None, columns=None, null_values=False, extra_null_values=None, indexed_columns=None, column_normalizer='default', columns_from_source=None, column_types=None, duplicate_column_policy='warning', recode_duplicates=True, rowcount_policy='error', autoclean=True, load_policy='lazy', data_cleaner=False)[source]

Bases: object

add_columns(n=None, columns=None)[source]

Add one or more columns to the end of the current matrix. It will retroactively add null values to existing rows.

Parameters:
  • n (int, optional) – The number of columns to add. If n is used and the ‘columns’ arg is not, the columns will have no name. Either ‘n’ or ‘columns’ is required.
  • columns (list[str], optional) – The names of the new columns to add. Either ‘n’ or ‘columns’ is required.
append(row)[source]

Adds the row to the end of the matrix (entire source will be loaded if not already done)

autoclean

Indicates whether data rows will be cleaned as they are loaded.

Type:bool
clean(row)[source]
colmap

A hash that maps each normalized column name to the 0-based index of the column.

Raises:Exception if a duplicate column is found and the recode_duplicates property is True. Duplicates should already have been translated in the _render_columns function, or the programmer should have taken steps to add or change to a non-duplicated column name. If this exception is thrown, it probably means the programmer changed the column name manually.
Type:dict
column_normalizer

The column normalizer for this object.

column_types

Column Types map the Column Index to the type that the column must be. At present, None is coded as null and is valid, regardless of the type.

Type:dict(int => type)
column_validators

An array of length(columns) that provides closures to validate the values when rows are added, processed, or appended AND either autoclean is on or clean is stated explicitly.

This uses the column_types parameter to create column-specific closures, and uses the default_datacleaner for anything else. Typechecking happens after null_values are processed, which may have implications for custom or sophisticated checking.

Type:list
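
One way to picture per-column validation closures is the sketch below. It is an illustration under assumptions, not the library's implementation: make_validator is a hypothetical helper, and columns without a declared type are simply passed through here, whereas the text above says the default data cleaner handles them.

```python
def make_validator(column_type=None):
    """Hypothetical factory returning a per-column validation closure."""
    def validate(value):
        # None is treated as null and is valid regardless of the type.
        if value is None or column_type is None:
            return value
        if not isinstance(value, column_type):
            raise TypeError(
                f"expected {column_type.__name__}, got {type(value).__name__}"
            )
        return value
    return validate


# Column index -> required type, as described for column_types.
column_types = {0: int, 2: str}
validators = [make_validator(column_types.get(i)) for i in range(3)]

# Apply each column's closure to the matching cell of a row.
checked = [check(cell) for check, cell in zip(validators, [7, 3.5, None])]
```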
columns

The column names for the matrix, after they have been normalized.

Type:list|none
copy(conditions=None)[source]

Copy the current matrix by value, so that changes to the copy do not affect the data structure of the present matrix. Note that this will be very memory intensive for very large matrices.

Parameters:conditions (list[dict], optional) – An optional set of conditions or functions to apply during the copy. For each row, if any condition is false, the row will not be included. Each condition may be defined with the following key-value arguments:
  • column (str): the column name to apply the condition to.
  • value: a value that will be compared to the value in the listed column.
  • function (str, optional): a specific type of function for the comparison. At present, only ‘re’ and ‘eq’ are supported. For ‘eq’, it is a test of equality. For ‘re’, it is a regular expression search based on value.
Returns:The copy of the matrix.
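
The condition format can be illustrated with a stand-alone filter. This is a sketch under stated assumptions: row_matches is a hypothetical helper, rows are plain lists, and an omitted function key is assumed to mean ‘eq’ (the source does not state the default).

```python
import re


def row_matches(row, colmap, conditions):
    """Hypothetical filter: True only if every condition holds for the row."""
    for cond in conditions or []:
        cell = row[colmap[cond["column"]]]
        function = cond.get("function", "eq")  # assumption: 'eq' when omitted
        if function == "eq":
            ok = cell == cond["value"]
        elif function == "re":
            ok = cell is not None and re.search(cond["value"], str(cell)) is not None
        else:
            raise ValueError(f"unsupported function: {function!r}")
        if not ok:
            return False
    return True


colmap = {"name": 0, "city": 1}
rows = [["Ann", "Oslo"], ["Bob", "Bergen"], ["Abe", "Oslo"]]
conditions = [
    {"column": "city", "value": "Oslo"},
    {"column": "name", "value": "^A", "function": "re"},
]

# Copy by value: each kept row is a new list, independent of the original.
kept = [list(r) for r in rows if row_matches(r, colmap, conditions)]
```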
data_cleaner

The data-cleaner object used to clean cell values.

dataframe()[source]
classmethod default_column_normalizer(shorthand_name=False)[source]

Provides an interface to a class-level column-normalizer, allowing for subclasses for common processes. Defaults to matrixb.ColumnNormalizer, which translates to snake_case and replaces almost all special characters with underscores (except for # or %).

Parameters:shorthand_name – passed to the constructor of the matrixb.ColumnNormalizer to create a default set of normalization rules. Values are ‘default’ (removes all special characters except for [-, %, #] and converts to snake_case); ‘conservative’ (converts to snake_case, preserving all special characters); ‘ascii’ (removes all non-alphanumeric characters and converts to snake_case). None indicates there should be no class-level column normalizer. Defaults to False, which should be interpreted as ‘default’.
Returns:The default column normalizer for the class, filtered by the shorthand name.
default_null_values()[source]

list: The default null values - this is kept in function format that can be called in either class or object context, and can be easily extended in subclasses.

delete_column(*column)[source]

Deletes the column from the matrix. Also deletes any related index and updates the colmap appropriately.

Parameters:column (int|str) – The reference to the column to delete. If an integer, uses it as the column index. If a string, looks up the index in the colmap. For convenience, will take multiple column values if provided.
delete_index(column)[source]

Deletes the index associated with the column.

Parameters:column (str|int) – The column to delete. This can be one of the following:
  • int: the index of the column.
  • str: uses the colmap to determine the column location.
export(filename, topmatter=None, autosize=False)[source]

Export to file. The output format is ascertained from the filename extension.

Parameters:
  • filename (str) – The filename.
  • topmatter (scalar|list) – Either details to be placed in the top cell or a matrix to be placed above the table. Empty rows of the matrix will be interpreted as blank lines.
  • autosize (bool) – Attempt to autosize columns for formats that support it (Excel). This can be time-consuming. Defaults to False. NOT IMPLEMENTED YET.
extend(rows)[source]

Extends the matrix with the rows (entire source will be loaded if not already done)

has_next()[source]
Returns:True if there is a next row to be loaded. False if there is no next row to be loaded. None if there is no source, so that implicit tests will evaluate the same as False for Resident matrices.

Note

If no rows have been loaded, this will load the first row and place it in the private variable representing the next row. This may therefore create a side effect if there is a possibility that the source is not initialized to the point of serving out the first row.

Development Note:
  1. I’m not certain that this shouldn’t be a function of the source class and be delegated there.
index_column(column, ignore_null=False, unique=False)[source]

create and maintain an index to specific columns to be used with lookup() later. Note that this will greatly increase the speed to look up specific values in the columns, but it does require the entire matrix to be loaded before a lookup() can occur. So index/lookup and deferred load/storage are essentially mutually exclusive.

Parameters:column (str|int|dict) – The column definition about which to maintain the index. This can be one of the following:
  • str: uses the colmap to determine the column location.
  • int: expected to be the index of the column.
  • dict of {‘type’: ‘column_name’, ‘value’: column_name_var}: if you have non-string column names, this should be your format.
indexed_columns

A list of indexed columns. Indexed columns maintain a hash of the values in the columns to all of the row indices that the value relates to, to facilitate fast lookups when the dataset is large enough that lookup time is a performance concern. Indexing requires the entire dataset to be loaded into memory (and will do so on the first lookup), so it is inconsistent with delayed load or non-resident implementations. The user cannot set the indexed_columns after initialization because of the level of processing involved. Instead, they should use the index functions to manipulate this list (e.g., index_column).

Type:list
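
The value-to-row-indices hash described above can be sketched with a plain dictionary. The helper build_index is hypothetical, and the treatment of ignore_null (skipping None cells) is an assumption about what that argument of index_column() means.

```python
from collections import defaultdict


def build_index(rows, column_index, ignore_null=False):
    """Hypothetical index: maps each cell value to the row numbers holding it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        value = row[column_index]
        if ignore_null and value is None:
            continue  # assumption: null cells are left out of the index
        index[value].append(row_number)
    return dict(index)


rows = [["a", 1], ["b", 2], ["a", 3], [None, 4]]
index = build_index(rows, 0, ignore_null=True)

# A lookup is then a dictionary access instead of a scan over all rows.
matches = [rows[i] for i in index.get("a", [])]
```

Because the index stores row numbers, every row must already be resident, which is why indexing and deferred loading do not combine well.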
insert(key, row)[source]

Inserts the row parameter into the matrix at location ‘key’ (source will be loaded to the ith position if not already done).

Parameters:
  • key (int) – the index at which point to insert the new row.
  • row (array) – the new row to insert. This needs to conform to all other row expectations - it must be the same width as the matrix (or the rowcount_policy needs to be set accordingly)
iter()[source]

returns an iterator of the rows of the matrix

load(automated=False)[source]

Immediately loads all data from the matrix source into the internal data.

Parameters:automated (bool) – Indicates whether this is an ‘automated’ load, i.e., one that occurs as a side effect of another call (such as nrows). The automated argument is only important if the load_policy is ‘manual’, in which case an automated load will throw an error. Defaults to False (so that the user calling matrixb.load() manually works as expected).
Raises:Exception when the load_policy is ‘manual’ and ‘automated’ is True. This is provided so a user can explicitly prevent a load of the full matrix from the source, such as when the user wishes to prevent a side effect of a full load when the source is too large for the available memory or would take too long for the intended time-to-process of the broader program.
load_next()[source]

Loads the next row from source and processes it, updates the ‘loaded’ property when complete.

Returns:The parsed, next row.
Raises:StopIteration if the source has been exhausted / fully loaded.
load_policy

When data is loaded from the source. Defaults to ‘lazy’.
  • ‘lazy’: Each row of data is loaded when the next row is referenced or when a function that requires all data to return the correct result is called. All matrix sources are assumed to be serial data sources (as with a file), and so if a future row is referenced before the intermediary rows have been loaded, all rows between the presently loaded row and the referenced row will be loaded.
  • ‘init’: All data is loaded when the object is initialized or when the source is added.
  • ‘manual’: Each row is loaded only when the programmer explicitly calls load() or load_next(). If a function is called that requires all data, an exception is thrown.

Type:enum(‘lazy’, ‘deferred’, ‘init’)
load_to(key)[source]

Loads the source to the ith element.

Parameters:key (int) – The index to which the source should be loaded.
Returns:The row of the matrix at location key (base-0).
Raises:IndexError if key is greater than the length of the source.
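
The serial loading behaviour behind load_next() and load_to() can be sketched as follows. LazyRows is a hypothetical stand-in written for illustration; it is not the Matrix class and it ignores cleaning, indexing, and load policies.

```python
class LazyRows:
    """Hypothetical serial loader: rows are pulled from the source on demand."""

    def __init__(self, source):
        self._source = iter(source)
        self.data = []
        self.loaded = False

    def load_next(self):
        try:
            row = next(self._source)
        except StopIteration:
            self.loaded = True  # the source has been exhausted
            raise
        self.data.append(row)
        return row

    def load_to(self, key):
        # A serial source cannot skip ahead, so every row up to key is loaded.
        while len(self.data) <= key:
            try:
                self.load_next()
            except StopIteration:
                raise IndexError(key) from None
        return self.data[key]


lazy = LazyRows([[0], [1], [2], [3]])
third = lazy.load_to(2)          # loads rows 0, 1 and 2, but not row 3
loaded_so_far = len(lazy.data)
```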
loaded

indicates whether the MatrixSource has been processed through to completion. If the source is empty or null, loaded is true.

Type:bool
lookup(column, value)[source]

Lookup a value in a column index. This column must have been previously indexed.

Params:
column {str|int|dict}: the column, column-name, or column-definition for the index to be searched. See the index_column() function for a description of the column-definition dict.
make_hashmap(row)[source]
make_rowmap(row)[source]
ncolumns

the number of columns in the matrix. Returns 0 if the matrix has no data and columns have not been defined.

Type:int
nrows

The number of rows in the matrix. Raises an exception if the matrix has no data. Note that this requires the matrix to be entirely loaded and calls load implicitly.

Type:int
null_values

A list of values that should be translated to None if encountered wholly in a cell, especially used to automatically convert variants of null/blank/empty cells to consistently be None when processing via Python. To add null values after object creation, use the function add_null_values(). Delegates to the data cleaner; if there is no data cleaner, an empty list is returned.

Type:null_values (list)
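
The phrase “encountered wholly in a cell” can be illustrated with a short sketch. The helper clean_nulls and the sample null values are hypothetical; the actual defaults come from default_null_values() and are not reproduced here.

```python
def clean_nulls(row, null_values):
    """Hypothetical cleaner: a cell becomes None only on a whole-cell match."""
    return [None if cell in null_values else cell for cell in row]


null_values = ["", "NULL", "N/A"]  # illustrative values, not the library defaults

# "NULL value" merely contains a null marker, so it is left unchanged.
cleaned = clean_nulls(["x", "", "N/A", "NULL value", 0], null_values)
```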
pop(key=None)[source]

An implementation of array pop for the matrix.

print_csv()[source]

Print data in a comma separated format to standard output. Quote strings iff they contain questionable characters

rebuild_indices()[source]
recode_duplicates

Returns whether duplicate column names are recoded or not. If True, duplicate column names will have the actual names recoded with an iterator (e.g., columns [‘A’, ‘B’, ‘B’] -> [‘A’, ‘B.1’, ‘B.2’] and the colmap is {‘A’: 0, ‘B.1’: 1, ‘B.2’: 2, }). If False, the colmap property will point to an array of column indexes for columns with duplicate names, and any calls to rowmap() will include an ordered array of all of the values, but all non-duplicated columns will continue to be a direct map of column name to value (e.g., columns [‘A’, ‘B’, ‘B’] remain as they are, but colmap becomes {‘A’: 0, ‘B’: [1,2], }). Defaults to True.

Type:bool
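
The two colmap shapes can be reproduced with a small stand-alone function. build_colmap is a hypothetical helper written for this illustration; it follows the ‘B.1’, ‘B.2’ naming from the example above but is not the library's _render_columns logic.

```python
def build_colmap(columns, recode_duplicates=True):
    """Hypothetical helper returning (column names, colmap) for either setting."""
    if recode_duplicates:
        counts = {}
        for name in columns:
            counts[name] = counts.get(name, 0) + 1
        seen = {}
        recoded = []
        for name in columns:
            if counts[name] > 1:
                # Duplicated names get a running suffix: B.1, B.2, ...
                seen[name] = seen.get(name, 0) + 1
                recoded.append(f"{name}.{seen[name]}")
            else:
                recoded.append(name)
        return recoded, {name: i for i, name in enumerate(recoded)}

    colmap = {}
    for i, name in enumerate(columns):
        if name not in colmap:
            colmap[name] = i
        elif isinstance(colmap[name], list):
            colmap[name].append(i)
        else:
            colmap[name] = [colmap[name], i]  # second occurrence: switch to a list
    return list(columns), colmap


recoded_cols, recoded_map = build_colmap(["A", "B", "B"])
plain_cols, plain_map = build_colmap(["A", "B", "B"], recode_duplicates=False)
```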
rename_column(column, column_name)[source]

Renames one or more columns and updates secondary data structures appropriately. Safer than manually renaming the column using self.columns[i] = X.

Parameters:
  • column (int|str) – The column to rename. If an integer, uses it as the column index. If a string, looks up the index in the colmap.
  • column_name (int|str|None) – The new name of the column.
Raises:

Exception if column_name already exists in the list of columns and recode_duplicates is True. This function does not auto-recode duplicate names.

rowcount_policy

enum of {‘error’, ‘warning’, ‘ignore’/None, ‘accommodate’}. Policy for any matrix data line (that is not empty) that exceeds the number of elements in the header. Defaults to ‘error’.
  • None/‘ignore’: the row is truncated to the column length, appended to the matrix, and processing continues.
  • ‘error’: raises an exception to note that the row has more values than the number of columns.
  • ‘warning’: calls warnings.warn for any row that exceeds the column length, chops off the extraneous values, and continues.
  • ‘accommodate’: dynamically extends the matrix to accommodate extra columns if discovered during processing. This will add new column elements that are blank and extend the existing matrix to account for them. This is not an efficient process.

Type:rowcount_policy
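
The four policies can be compared side by side in a stand-alone sketch. apply_rowcount_policy is a hypothetical function, rows and columns are plain lists, and the use of ValueError and of None for blank column names are assumptions made for the illustration.

```python
import warnings


def apply_rowcount_policy(row, columns, data, policy="error"):
    """Hypothetical handler for a row that is longer than the header."""
    extra = len(row) - len(columns)
    if extra <= 0:
        return list(row)
    if policy == "error":
        raise ValueError(
            f"row has {len(row)} values but there are {len(columns)} columns"
        )
    if policy == "warning":
        warnings.warn("row exceeds the column count; extra values dropped")
        return list(row[:len(columns)])
    if policy in (None, "ignore"):
        return list(row[:len(columns)])
    if policy == "accommodate":
        # Widen the header and every existing row before keeping the full row.
        columns.extend([None] * extra)
        for existing in data:
            existing.extend([None] * extra)
        return list(row)
    raise ValueError(f"unknown rowcount_policy: {policy!r}")


columns = ["a", "b"]
data = [[1, 2]]
trimmed = apply_rowcount_policy([3, 4, 5], columns, data, policy="ignore")
data.append(apply_rowcount_policy([3, 4, 5], columns, data, policy="accommodate"))
```

The ‘accommodate’ branch touches every row already stored, which is why the description calls it inefficient.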
rowmap()[source]

returns a MatrixRowmap object that is a derivative of this matrix

rowmap_list()[source]

returns a list of rowmaps

set_column_type(column, column_type)[source]

Sets the column type for a listed column, after the instantiation of the object but before any rows have been loaded. At present, it does not retroactively check column values (it throws an error instead).

Parameters:
  • column (int|str) – the column index or name in question.
  • column_type (type|class|'maybeint') – the required type for the column.

Todo

Update to retroactively apply to values after some rows have been loaded.

sort(key=None, reverse=None)[source]

An implementation of array sort for the matrix.

to_dataframe()[source]

Export the matrix data into a Pandas DataFrame.

matrixb.rowmap module

class matrixb.rowmap.MatrixRowmap(matrix)[source]

Bases: matrixb.iterator.MatrixIterator

Defines a subclass of the MatrixIterator when the rowmap is called instead of a strict matrix iteration.

top()[source]

Look at the next row, but do not actually iterate over it. This is an extension of the required Python iterator definition.

Module contents