matrixb package¶
Subpackages¶
Submodules¶
matrixb.columnnormalizer module¶
matrixb.iterator module¶
matrixb.matrix module¶
-
class
matrixb.matrix.Matrix(source=None, columns=None, null_values=False, extra_null_values=None, indexed_columns=None, column_normalizer='default', columns_from_source=None, column_types=None, duplicate_column_policy='warning', recode_duplicates=True, rowcount_policy='error', autoclean=True, load_policy='lazy', data_cleaner=False)[source]¶ Bases:
object
-
add_columns(n=None, columns=None)[source]¶ Add one or more columns to the end of the current matrix. It will retroactively add null values to existing rows.
Parameters:
-
append(row)[source]¶ Adds the row to the end of the matrix (entire source will be loaded if not already done)
-
colmap¶ a hash that maps the normalized column name to the 0-based index of the column.
Raises: Exception if a duplicate column is found and the recode_duplicates property is True, as this should have been translated in the _render_columns function, or the programmer should have taken strides to add/change a non-duplicated column name. If this exception is thrown, it probably means the programmer changed the column name manually. Type: dict
-
column_normalizer¶ The column normalizer for this object.
-
column_types¶ Column Types map the Column Index to the type that the column must be. At present, None is coded as null and is valid, regardless of the type.
Type: dict(int => type)
-
column_validators¶ An array of length(columns) that provides closures to validate the values when rows are added, processed, or appended AND either autoclean is on or clean is stated explicitly.
This uses the column_types parameter to create column-specific closures, and uses the default_datacleaner for anything else. Typechecking happens after null_values are processed, which may have implications for custom or sophisticated checking.
Type: list
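As an illustration of the closure-per-column idea described above, here is a minimal sketch. It is not the library's implementation: make_validator and build_validators are hypothetical names, and columns without a declared type are simply passed through unchanged here, whereas the real class is described as delegating them to a default data cleaner.

```python
def make_validator(column_type=None):
    """Return a closure that validates one cell value (hypothetical helper)."""
    def validate(value):
        # None is coded as null and is accepted regardless of the type.
        if value is None or column_type is None:
            return value
        if not isinstance(value, column_type):
            raise TypeError(
                f"expected {column_type.__name__}, got {type(value).__name__}"
            )
        return value
    return validate


def build_validators(ncolumns, column_types):
    # column_types maps column index -> required type, as in dict(int => type).
    return [make_validator(column_types.get(i)) for i in range(ncolumns)]


validators = build_validators(3, {0: int, 2: str})
row = [1, 2.5, None]
checked = [validate(value) for validate, value in zip(validators, row)]
```

Because the list has one closure per column, validating a row is a single zip over the validators and the cell values.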
-
columns¶ The column names for the matrix, after they have been normalized.
Type: list|none
-
copy(conditions=None)[source]¶ Copy the current matrix by value, so that changes to the copy do not affect the data structure of the original. Note that this will be very memory intensive for very large matrices.
Parameters: conditions (list[dict], optional) – An optional set of conditions or functions to apply during the copy. For each row, if any condition is false, the row will not be included. Each condition may be defined with the following key-value arguments:
- column (str): the column name to apply the condition to.
- value: a value that will be compared to the value in the listed column.
- function (str, optional): a specific type of function for the comparison. At present, only ‘re’ and ‘eq’ are supported. For ‘eq’, it is a test of equality. For ‘re’, it is a regular expression search based on value.
Returns: The copy of the matrix.
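The filtering rule for conditions can be sketched in plain Python. This is an assumption-laden illustration rather than the library's code: row_matches is a hypothetical helper, and treating a missing function key as ‘eq’ is an assumption, since the default is not stated above.

```python
import re


def row_matches(row, colmap, conditions):
    """Return True only if every condition holds for the row (hypothetical helper)."""
    for condition in conditions:
        cell = row[colmap[condition["column"]]]
        # Assumption: a condition without 'function' is an equality test.
        function = condition.get("function", "eq")
        if function == "eq":
            ok = cell == condition["value"]
        elif function == "re":
            ok = cell is not None and re.search(condition["value"], str(cell)) is not None
        else:
            raise ValueError(f"unsupported function: {function!r}")
        if not ok:
            return False
    return True


colmap = {"name": 0, "city": 1}
rows = [["Ada", "London"], ["Alan", "London"], ["Grace", "New York"]]
conditions = [
    {"column": "city", "value": "London"},
    {"column": "name", "value": "^Al", "function": "re"},
]
# Copy each surviving row by value so the result is independent of the input.
kept = [list(row) for row in rows if row_matches(row, colmap, conditions)]
```

Only the row for Alan survives, because it is the only one that satisfies both the equality test and the regular expression.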
-
data_cleaner¶ The data-cleaner object used to clean cell values.
-
classmethod
default_column_normalizer(shorthand_name=False)[source]¶ Provides an interface to a class-level column-normalizer, allowing for subclasses for common processes. Defaults to matrixb.ColumnNormalizer, which translates to snake_case and replaces almost all special characters with underscores (except for # or %).
Parameters: shorthand_name – passed to the constructor of the matrixb.ColumnNormalizer to create a default set of normalization rules. Values are ‘default’ (removes all special characters except for [-, %, #] and converts to snake_case); ‘conservative’ (converts to snake_case, preserving all special characters); ‘ascii’ (removes all non-alphanumeric characters and converts to snake_case). None indicates there should be no class-level column normalizer. Defaults to False, which should be interpreted as ‘default’. Returns: The default column normalizer for the class, filtered by the shorthand name.
-
default_null_values()[source]¶ list: The default null values - this is kept in function format that can be called in either class or object context, and can be easily extended in subclasses.
-
delete_column(*column)[source]¶ Deletes the column from the matrix. Also deletes any related index and updates the colmap appropriately.
Parameters: column (int|str) – The reference to the column to delete. If an integer, uses it as the column index. If a string, looks up the index in the colmap. For convenience, will take multiple column values if provided.
-
delete_index(column)[source]¶ Deletes the index associated with the column.
Parameters: column (str|int) – The column whose index should be deleted. This can be one of the following:
- int: the index of the column
- str: uses the colmap to determine the column location
-
export(filename, topmatter=None, autosize=False)[source]¶ Export to file. The format is ascertained from the filename extension.
Parameters: - filename (str) – The filename.
- topmatter (scalar|list) – either details to be placed in the top cell or a matrix to be placed above the table. Empty rows of the matrix will be interpreted as blank lines.
- autosize (bool) – Attempt to autosize columns for formats that support it (Excel). This can be time-consuming. Defaults to False. NOT IMPLEMENTED YET.
-
extend(rows)[source]¶ Extends the matrix with the rows (entire source will be loaded if not already done)
-
has_next()[source]¶ Returns: True if there is a next row to be loaded. False if there is no next row to be loaded. None if there is no source, so that implicit tests will evaluate the same as False for Resident matrices.
Note
If no rows have been loaded, this will load the first row and place it in the private variable representing the next row. This may therefore create a side effect if there is a possibility that the source is not initialized to the point of serving out the first row.
- Development Note:
- I’m not certain that this shouldn’t be a function of the source class and be delegated there.
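The look-ahead behaviour and its side effect can be illustrated with a small stand-alone sketch. PeekingSource is a hypothetical class written for this example and is not part of matrixb; it only mirrors the three return values and the one-row buffer described above.

```python
class PeekingSource:
    """Hypothetical wrapper that buffers one row so has_next() can answer safely."""

    _EMPTY = object()  # sentinel meaning "no row buffered yet"

    def __init__(self, source=None):
        self._iterator = iter(source) if source is not None else None
        self._next_row = self._EMPTY

    def has_next(self):
        if self._iterator is None:
            return None  # no source: evaluates like False in implicit tests
        if self._next_row is self._EMPTY:
            try:
                # Side effect: this reads one row from the source.
                self._next_row = next(self._iterator)
            except StopIteration:
                return False
        return True

    def load_next(self):
        if not self.has_next():
            raise StopIteration
        row, self._next_row = self._next_row, self._EMPTY
        return row


source = PeekingSource([["a", 1], ["b", 2]])
first = source.load_next() if source.has_next() else None
```

Calling has_next() repeatedly is harmless in this sketch because the buffered row is returned by the next load_next() call rather than discarded.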
-
index_column(column, ignore_null=False, unique=False)[source]¶ Creates and maintains an index on specific columns to be used with lookup() later. Note that this will greatly increase the speed of looking up specific values in the columns, but it does require the entire matrix to be loaded before a lookup() can occur. So index/lookup and deferred load/storage are essentially mutually exclusive.
Parameters: column (str|int|dict) – The column definition about which to maintain the index. This can be one of the following:
- str: uses the colmap to determine the column location
- int: expected to be the index of the column
- dict of {‘type’:’column_name’, ‘value’: column_name_var}: if you have non-string column names, this should be your format
-
indexed_columns¶ A list of indexed columns. Indexed columns maintain a hash of the values in the columns to all of the row indices that the value relates to, to facilitate fast lookups when the dataset is large enough that lookup time is a performance concern. Indexing requires the entire dataset to be loaded into memory (and will do so on the first lookup) and so it is inconsistent with delayed load or non-resident implementations. The user cannot set the indexed_columns after initialization because of the level of processing involved. Instead, they should use the index functions to manipulate this list (e.g., index_column).
Type: list
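The hash from cell values to row indices can be sketched as follows. This is only an illustration under stated assumptions: build_index is a hypothetical function, and the meanings given here to ignore_null (skip None cells) and unique (reject repeated values) are guesses based on the parameter names, which are not documented above.

```python
from collections import defaultdict


def build_index(rows, column_index, ignore_null=False, unique=False):
    """Map each value in one column to the list of row numbers holding it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        value = row[column_index]
        if value is None and ignore_null:
            continue  # assumed meaning of ignore_null
        if unique and value in index:
            # assumed meaning of unique
            raise ValueError(f"duplicate value {value!r} in unique index")
        index[value].append(row_number)
    return dict(index)


rows = [["a", 1], ["b", 2], ["a", 3], [None, 4]]
index = build_index(rows, 0, ignore_null=True)
# A lookup is then a dictionary access instead of a scan over every row.
matches = [rows[i] for i in index.get("a", [])]
```

Building the index needs every row, which is why indexing and deferred loading do not combine well.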
-
insert(key, row)[source]¶ Inserts the row parameter into the matrix at location ‘key’ (source will be loaded to the ith position if not already done).
Parameters: - key (int) – the index at which point to insert the new row.
- row (array) – the new row to insert. This needs to conform to all other row expectations - it must be the same width as the matrix (or the rowcount_policy needs to be set accordingly)
-
load(automated=False)[source]¶ Immediately loads all data from the matrix source into the internal data.
Parameters: automated (bool) – Indicates whether this is an ‘automated’ load, i.e., one that occurs as a side effect of another call (such as nrows). The automated argument only matters if the load_policy is ‘manual’, in which case an automated load will throw an error. Defaults to False (so that the user calling matrixb.load() manually works as expected). Raises: Exception when the load_policy is ‘manual’ and ‘automated’ is True. This is provided so a user can explicitly prevent a load of the full matrix from the source, such as when the user wishes to prevent a side effect of a full load when the source is too large for the available memory or would take too long for the intended time-to-process of the broader program.
-
load_next()[source]¶ Loads the next row from source and processes it, updates the ‘loaded’ property when complete.
Returns: The parsed, next row. Raises: StopIteration if the source has been exhausted / fully loaded.
-
load_policy¶ When data is loaded from the source. Defaults to ‘lazy’.
- ‘lazy’: Each row of data is loaded when the next row is referenced or when a function that requires all data to return the correct result is called. All matrix sources are assumed to be serial data sources (as with a file), and so if a future row is referenced before the intermediary rows have been loaded, all rows between the presently loaded row and the referenced row will be loaded.
- ‘init’: All data is loaded when the object is initialized or when the source is added.
- ‘manual’: Each row is loaded only when the programmer explicitly calls load() or load_next(). If a function is called that requires all data, an exception is thrown.
Type: enum(‘lazy’, ‘deferred’, ‘init’)
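The difference between the policies can be sketched with a small stand-alone class. LazyRows is hypothetical and greatly simplified; it reuses the method names from this page only for readability, and using RuntimeError for the refused load is an assumption, since the page says only that an exception is thrown.

```python
class LazyRows:
    """Hypothetical container that reads a serial source only as far as needed."""

    def __init__(self, source, load_policy="lazy"):
        self._iterator = iter(source)
        self._rows = []
        self.load_policy = load_policy
        self.loaded = False
        if load_policy == "init":
            self.load()

    def load_next(self):
        try:
            row = next(self._iterator)
        except StopIteration:
            self.loaded = True
            raise
        self._rows.append(row)
        return row

    def load(self, automated=False):
        # An implicit (automated) full load is refused under 'manual'.
        if automated and self.load_policy == "manual":
            raise RuntimeError("implicit full load refused under 'manual'")
        for row in self._iterator:
            self._rows.append(row)
        self.loaded = True

    def load_to(self, key):
        # A serial source must pass through every row before the requested one.
        while len(self._rows) <= key:
            try:
                self.load_next()
            except StopIteration:
                raise IndexError(key) from None
        return self._rows[key]

    def __len__(self):
        self.load(automated=True)  # a row count needs the whole source
        return len(self._rows)


lazy = LazyRows([[0], [1], [2], [3]])
third = lazy.load_to(2)          # reads rows 0, 1 and 2 only
rows_read_so_far = len(lazy._rows)
```

Asking for the length afterwards reads the remaining row, while the same call on an instance created with load_policy="manual" raises instead of loading.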
-
load_to(key)[source]¶ Loads the source to the ith element.
Parameters: key (int) – The index to which the source should be loaded. Returns: The row of the matrix at location key (base-0). Raises: IndexError if key is greater than the length of the source.
-
loaded¶ indicates whether the MatrixSource has been processed through to completion. If the source is empty or null, loaded is true.
Type: bool
-
lookup(column, value)[source]¶ Lookup a value in a column index. This column must have been previously indexed.
- Params:
- column {str|int|dict}: the column, column-name, or column-definition for the index to be searched. See the index_column() function for a description of the column-definition dict.
-
ncolumns¶ the number of columns in the matrix. Returns 0 if the matrix has no data and columns have not been defined.
Type: int
-
nrows¶ The number of rows in the matrix. Raises an exception if the matrix has no data. Note that this requires the matrix to be entirely loaded and calls load implicitly.
Type: int
-
null_values¶ A list of values that should be translated to None if encountered wholly in a cell, especially used to automatically convert variants of null/blank/empty cells to consistently be None when processing via Python. To add null values after object creation, use the function add_null_values(). Delegates to the data cleaner; if there is no data cleaner, an empty list is returned.
Type: null_values (list)
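The phrase "encountered wholly in a cell" means the entire cell must equal a null value; a cell that merely contains one is left alone. A minimal sketch, assuming a hypothetical list of null values and a hypothetical clean_row helper:

```python
# Hypothetical null values chosen for the example; not the library's defaults.
NULL_VALUES = ["", "NA", "null"]


def clean_row(row, null_values=NULL_VALUES):
    """Replace cells that exactly equal a null value with None."""
    return [None if cell in null_values else cell for cell in row]


cleaned = clean_row(["Ada", "NA", "", "NASA", 0])
```

Here "NASA" is kept even though it contains "NA", and the integer 0 is kept because it does not equal any listed value.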
-
print_csv()[source]¶ Print data in a comma separated format to standard output. Quote strings iff they contain questionable characters
-
recode_duplicates¶ 0, ‘B.1’: 1, ‘B.2’: 2, }). If false, the colmap property will point to an array of column indexes for columns with duplicate names, and any calls to rowmap() will include an ordered array of all of the values, but all non-duplicated columns will continue to be a direct map of column name to value (e.g., columns [‘A’, ‘B’, ‘B’] remain as they are, but colmap becomes {‘A’: 0, ‘B’: [1,2], }). Defaults to True.
Type: bool Type: Returns whether duplicate column names are recoded or not. If True, duplicate column names will have the actual names recoded with an iterator (e.g., columns [‘A’, ‘B’, ‘B’] -> [‘A’, ‘B.1’, ‘B.2’] and the colmap is {‘A’
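The two colmap shapes from the example above can be reproduced with a short sketch. build_colmap is a hypothetical function written for illustration; it does not handle the case where a recoded name such as ‘B.1’ collides with an existing column name.

```python
from collections import Counter


def build_colmap(columns, recode_duplicates=True):
    """Return (columns, colmap) for either duplicate-handling mode."""
    counts = Counter(columns)
    if recode_duplicates:
        seen = Counter()
        recoded = []
        for name in columns:
            if counts[name] > 1:
                seen[name] += 1
                recoded.append(f"{name}.{seen[name]}")  # B -> B.1, B.2, ...
            else:
                recoded.append(name)
        return recoded, {name: i for i, name in enumerate(recoded)}
    colmap = {}
    for i, name in enumerate(columns):
        if counts[name] > 1:
            colmap.setdefault(name, []).append(i)  # duplicates share a list
        else:
            colmap[name] = i
    return list(columns), colmap


recoded_columns, recoded_colmap = build_colmap(["A", "B", "B"])
plain_columns, grouped_colmap = build_colmap(["A", "B", "B"], recode_duplicates=False)
```

With recoding, every name maps to exactly one index; without it, callers must be prepared for a colmap value that is a list.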
-
rename_column(column, column_name)[source]¶ Renames one or more columns and updates secondary data structures appropriately. Safer than manually renaming the column using self.columns[i] = X.
Parameters: - column (int|str) – The column to rename. If an integer, uses it as the column index. If a string, looks up the index in the colmap.
- column_name (int|str|None) – The new name of the column.
Raises: Exception if column_name already exists in the list of columns and recode_duplicates is True. This function does not auto-recode duplicate names.
-
rowcount_policy¶ enum of {‘error’, ‘warning’, ‘ignore’/None, ‘accommodate’}. Policy for any matrix data line (that is not empty) that exceeds the number of elements in the header. Defaults to ‘error’.
- None/‘ignore’: the row is truncated to the column length, appended to the matrix, and processing continues.
- ‘error’: raises an exception to note that the row has more values than the number of columns.
- ‘warning’: calls warnings.warn for any row that exceeds the column length, chops off the extraneous values, and continues.
- ‘accommodate’: dynamically extends the matrix to accommodate extra columns if discovered during processing. This will add new column elements that are blank and extend the existing matrix to account for them. This is not an efficient process.
Type: rowcount_policy
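The four policies can be sketched as one function that decides what to do with an over-long row. apply_rowcount_policy is a hypothetical helper, ValueError stands in for the unspecified exception type, and representing the new blank column names and cells as None is an assumption.

```python
import warnings


def apply_rowcount_policy(row, columns, rows, policy="error"):
    """Fit an over-long row to the header; 'accommodate' mutates columns and rows."""
    extra = len(row) - len(columns)
    if extra <= 0:
        return row
    if policy == "error":
        raise ValueError(
            f"row has {len(row)} values but there are {len(columns)} columns"
        )
    if policy == "warning":
        warnings.warn(f"row truncated from {len(row)} to {len(columns)} values")
        return row[:len(columns)]
    if policy in (None, "ignore"):
        return row[:len(columns)]
    if policy == "accommodate":
        columns.extend([None] * extra)       # blank names for the new columns
        for existing in rows:
            existing.extend([None] * extra)  # pad every row already stored
        return row
    raise ValueError(f"unknown rowcount_policy: {policy!r}")


columns = ["a", "b"]
rows = [[1, 2]]
rows.append(apply_rowcount_policy([3, 4, 5], columns, rows, "accommodate"))
```

The padding loop touches every stored row each time a wider row appears, which illustrates why the ‘accommodate’ policy is described as inefficient.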
-
set_column_type(column, column_type)[source]¶ Sets the column type for a listed column, after the instantiation of the object but before any rows have been loaded. At present, it does not retroactively check column values (and throws an error).
Parameters: - column (int|str) – the column index or name in question.
- column_type (type|class|'maybeint') – the required type for the column.
Todo
Update to retroactively apply to values after some rows have been loaded.
-
matrixb.rowmap module¶
-
class
matrixb.rowmap.MatrixRowmap(matrix)[source]¶ Bases:
matrixb.iterator.MatrixIterator
Defines a subclass of the MatrixIterator, used when the rowmap is called instead of a strict matrix iteration.