hypermorph package¶
Submodules¶
hypermorph.clients module¶
-
class
hypermorph.clients.
ConnectionPool
(db_client, dialect=None, host=None, port=None, user=None, password=None, database=None, path=None, trace=0)¶ Bases:
object
ConnectionPool manages connections to a DBMS and extends database API clients:
i) with useful introspective properties (api, database, sqlalchemy_engine, last_query, query_stats)
ii) with a uniform SQL command interface (the sql method)
iii) with common methods to access database metadata (get_tables_metadata, get_columns_metadata)
- HyperMorph currently supports the following three database API clients (self._api_name)
Clickhouse-Driver
MySQL-Connector
- SQLAlchemy with the following three dialects (self._sqlalchemy_dialect)
pymysql
clickhouse
sqlite
- Consequently various database APIs are categorized as (self._api_category)
MYSQL CLICKHOUSE SQLite
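A minimal usage sketch. The db_client string 'mysql-connector' and the connection values are illustrative assumptions, not documented constants:

from hypermorph.clients import ConnectionPool

# Hypothetical connection values; the db_client string is assumed to select
# the MySQL-Connector API client listed above
pool = ConnectionPool(db_client='mysql-connector', host='localhost', port=3306,
                      user='demo', password='demo', database='northwind')
tables_meta = pool.get_tables_metadata()                  # metadata for the tables of the database
columns_meta = pool.get_columns_metadata(table='customers')
result = pool.sql('SELECT COUNT(*) FROM customers')       # uniform SQL interface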
-
property
api_category
¶
-
property
api_name
¶
-
clickhouse_connections
= 0¶
-
property
connector
¶
-
property
database
¶
-
get_columns_metadata
(table=None, columns=None, fields=None, aggr=None, **kwargs)¶ - Parameters
table – name of the table in database
columns – list of ClickHouse column names
fields – select specific metadata fields for the columns of a table in the database; dictionary metadata field names are dependent on the specific DBMS, e.g. MySQL, SQLite, ClickHouse, etc.
aggr – aggregate metadata results for the columns of a clickhouse table
kwargs – pass extra parameters to sql() method
- Returns
metadata for the columns of a table(s) in a database e.g. name of column, default value, nullable, etc
-
get_tables_metadata
(fields=None, clickhouse_engine=None, name=None, **kwargs)¶ - Parameters
clickhouse_engine – type of storage engine for clickhouse database
fields – select specific metadata fields for a table in the database; dictionary metadata field names are dependent on the specific DBMS, e.g. MySQL, SQLite, ClickHouse, etc.
name – table name regular expression e.g. name=’%200%’
kwargs – parameters passed to sql() method
- Returns
metadata for the tables of a database e.g. name of table, number of rows, row length, collation, etc..
-
mysql_connections
= 0¶
-
sql
(query, **kwargs)¶ For kwargs and specific implementation details see the connector class for the specific API, e.g. SQLAlchemy.sql() for the SQLAlchemy database API. :param query: SQL query string that will be sent to the server :param kwargs: pass other parameters to the sql() method of the connector class :return: result set represented with a pandas dataframe, tuples, etc.
-
property
sqlalchemy_dialect
¶
-
sqlite_connections
= 0¶
hypermorph.connector_clickhouse_driver module¶
-
class
hypermorph.connector_clickhouse_driver.
ClickHouse
(host, port, user, password, database, trace=0)¶ Bases:
object
ClickHouse class is based on clickhouse-driver python API for ClickHouse DBMS
-
property
api_category
¶
-
property
connection
¶
-
create_engine
(table, engine, heading, partition_key=None, order_key=None, settings=None, execute=True)¶ - Parameters
table – name of the ClickHouse table
engine – the type of clickhouse engine
settings – clickhouse engine settings
heading – list of field names paired with ClickHouse data types, e.g. [('fld1_name', 'dtype1'), ('fld2_name', 'dtype2'), ..., ('fldN_name', 'dtypeN')]
partition_key –
order_key –
execute –
- Returns
-
property
cursor
¶
-
disconnect
()¶
-
get_columns
(table=None, columns=None, fields=None, aggr=None, **kwargs)¶ - Parameters
table – ClickHouse table name
columns – list of ClickHouse column names
fields – Metadata fields for columns
aggr – aggregate metadata results for the columns of a clickhouse table
- Returns
metadata for clickhouse columns
-
get_mutations
(table, limit=None, group_by=None, execute=True)¶ - Parameters
table – clickhouse table
group_by –
limit – SQL limit
execute –
- Returns
-
get_parts
(table, hb2=None, active=True, execute=True)¶ - Parameters
table – clickhouse table
hb2 – select parts with a specific hb2 dimension (hb2 is the dim2 of the Entity/ASET key) default hb2=’%’
active – select only active parts
execute – Execute the command only if execute=True
- Returns
information about parts of MergeTree tables
-
get_query_log
(execute=True)¶
-
property
last_query_statistics
¶
-
optimize_engine
(table, execute=True)¶
-
property
print_query_statistics
¶
-
sql
(sql, out='dataframe', as_columns=None, index=None, partition_size=None, arrow_encoding=True, params=None, qid=None, execute=True, trace=None)¶ This method calls the clickhouse-driver execute() method to execute an SQL query; the connection has already been established. :param sql: ClickHouse SQL query string that will be sent to the server :param out: output format, i.e. the Python data structure that will represent the result set (dataframe, tuples, json_rows)
- Parameters
as_columns – user specified column names for pandas dataframe, (list of strings, or comma separated string)
index – pandas dataframe columns
arrow_encoding – PyArrow columnar dictionary encoding
arrow_table – Output is PyArrow Table, otherwise it is a PyArrow RecordBatch
partition_size – ToDo number of records to use for each partition or target size of each partition, in bytes
params – clickhouse-client execute parameters
qid – query identifier. If no query id specified ClickHouse server will generate it
execute – execute SQL commands only if execute=True
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
- Returns
result set formatted according to the out parameter
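A hedged sketch of the sql() call; the connection values are illustrative and system.tables is a standard ClickHouse system table:

from hypermorph.connector_clickhouse_driver import ClickHouse

# Hypothetical connection values for illustration only
ch = ClickHouse(host='localhost', port=9000, user='demo',
                password='demo', database='default', trace=0)
df = ch.sql('SELECT name, engine FROM system.tables LIMIT 5')   # default out='dataframe'
rows = ch.sql('SELECT 1', out='tuples')                         # alternative output format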
hypermorph.connector_mysql module¶
-
class
hypermorph.connector_mysql.
MySQL
(host, port, user, password, database, trace=0)¶ Bases:
object
-
property
api_category
¶
-
close
()¶
-
property
connection
¶
-
property
cursor
¶
-
property
last_query
¶
-
set_cursor
(buffered=True, raw=None, dictionary=None, named_tuple=None)¶
-
sql
(sql, out='dataframe', as_columns=None, index=None, partition_size=None, arrow_encoding=True, execute=True, buffered=True, trace=None)¶ This method calls the cursor.execute() method of mysql.connector to execute an SQL query; the connection has already been established. :param sql: MySQL query string that will be sent to the server :param out: output format, i.e. the Python data structure that will represent the result set (dataframe, tuples, named_tuples, json_rows)
- Parameters
partition_size – ToDo number of records to use for each partition or target size of each partition, in bytes
arrow_encoding – PyArrow columnar dictionary encoding
as_columns – user specified column names for pandas dataframe (list of strings, or comma separated string)
index – column names to be used in pandas dataframe index
execute – execute SQL commands only if execute=True
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
buffered – MySQLCursorBuffered cursor fetches the entire result set from the server and buffers the rows. For nonbuffered cursors, rows are not fetched from the server until a row-fetching method is called.
- Returns
result set formatted according to the out parameter
For more details about MySQLCursor class execution see https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor.html
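A hedged sketch of the sql() call; connection values and table names are illustrative:

from hypermorph.connector_mysql import MySQL

# Hypothetical connection values for illustration only
my = MySQL(host='localhost', port=3306, user='demo',
           password='demo', database='northwind', trace=0)
df = my.sql('SELECT * FROM customers LIMIT 10')              # default out='dataframe'
tuples = my.sql('SELECT * FROM customers', out='tuples')     # plain tuples instead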
hypermorph.connector_sqlalchemy module¶
-
class
hypermorph.connector_sqlalchemy.
SQLAlchemy
(dialect=None, host=None, port=None, user=None, password=None, database=None, path=None, trace=0)¶ Bases:
object
-
property
api_category
¶
-
property
connection
¶
-
property
cursor
¶
-
property
engine
¶
-
property
last_query
¶
-
property
last_query_stats
¶
-
sql
(sql, out='dataframe', execute=True, trace=None, arrow_encoding=True, as_columns=None, index=None, partition_size=None, **kwargs)¶ - Parameters
sql – SQL query string that will be sent to the server
out – output format e.g. dataframe, tuples, ….
execute – flag to enable execution of SQL statement
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
partition_size – number of records to use for each partition or target size of each partition, in bytes
arrow_encoding – PyArrow columnar dictionary encoding
as_columns – user specified column names for pandas dataframe (list of strings, or comma separated string)
index – column names to be used in pandas dataframe index
kwargs – parameters passed to pandas.read_sql() method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
- Returns
result formatted according to the out parameter
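A hedged sketch using the sqlite dialect listed in the ConnectionPool documentation; the database path is illustrative:

from hypermorph.connector_sqlalchemy import SQLAlchemy

# Hypothetical SQLite connection; `path` is assumed to point at the database file
sa = SQLAlchemy(dialect='sqlite', path='/tmp/demo.sqlite', trace=0)
df = sa.sql('SELECT name FROM sqlite_master', out='dataframe')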
hypermorph.data_graph module¶
-
class
hypermorph.data_graph.
GData
(rebuild=False, load=False, **graph_properties)¶ Bases:
object
GData class represents ABox “assertion components”, i.e. facts associated with a terminological vocabulary. Such a fact is an instance of an HB-HAs association. ABox statements are TBox-compliant statements about that vocabulary. Each instance of an HB-HAs association is compliant with the model (schema) of Entity-Attributes.
GData of HyperMorph is represented with a directed graph that is based on the graph_tool python module. GData is composed of DataNodes and DataEdges. Each DataEdge links two DataNodes and we define a direction convention from a tail DataNode to a head DataNode.
HyperAtom, HyperBond classes are derived from DataNode class
GData of HyperMorph is a hypergraph defined by two sets of objects (a.k.a. hyper-atoms HAs and hyper-bonds HBs). If we have ‘hyper-bonds’ HB={hb1, hb2, hb3} and ‘hyper-atoms’ HA={ha1, ha2, ha3}, then we can make a map such as d = {hb1: (ha1, ha2), hb2: (ha2,), hb3: (ha1, ha2, ha3)}. G(HB, HA, d) is the hypergraph
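Written as a plain Python dict, with purely symbolic names, the map above is:

# Each hyper-bond maps to the tuple of hyper-atoms it connects;
# G(HB, HA, d) is the hypergraph
d = {'hb1': ('ha1', 'ha2'),
     'hb2': ('ha2',),
     'hb3': ('ha1', 'ha2', 'ha3')}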
-
add_edge
(from_vertex, to_vertex)¶ Used in GDataLink to create a new instance of an edge :param from_vertex: tail vertex :param to_vertex: head vertex :return: an edge of GData Graph
-
add_hyperlinks
(hlinks, hb2=10000)¶ Method updates the GData graph with vertices (nodes) and edges (links) that are related to the parameter hlinks. It also adds dim2, dim1 and ntype vertex properties
- Parameters
hb2 – dim2 value for hyperbonds, it is set at a high enough value to filter them later on in the graph of data
:param hlinks: a list of hyperlinks in the form [((hb2, hb1), (ha2, ha1)), ..., ((hb2, hb1), (ha2, ha1))]. A hyperlink is defined as the edge that connects a HyperBond with a HyperAtom, i.e. HB(hb2, hb1) ---> HA(ha2, ha1)
A table of data (hb2) with pk=hb1 is associated (linked) to a column of data (ha2) with indices (ha1):
hb2 (uint16 > 10000) represents a data table
hb1 (uint32) represents a data table row or pk index
ha2 (uint16 < 10000) represents a column of the data table
ha1 (uint32) represents a unique value, i.e. a secondary index value of the specific column (ha2)
Therefore the set of hyperlinks (HBi —> HA1, HBi —> HA2, HBi —> HAn) transforms the tuple of a Relation to an association between the table row and the column values, indices
This association is graphically represented on a hypergraph with a hyperedge (HB) that connects many hypernodes (HAs)
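An illustrative hyperlink list for add_hyperlinks(); the numeric keys are made up but follow the dimension conventions above, and `gdata` is assumed to be an existing GData instance:

# Each pair is HB(hb2, hb1) ---> HA(ha2, ha1): a table row linked to a column value
hlinks = [((10001, 1), (501, 1)),
          ((10001, 1), (502, 3)),
          ((10001, 2), (501, 2))]
gdata.add_hyperlinks(hlinks)   # hb2 keeps its default threshold value 10000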
-
add_link
(from_node, to_node)¶ - Parameters
from_node – tail node is a GDataNode object or node ID
to_node – head node is a GDataNode object or node ID
If there isn’t a link from the tail node to the head node, it will try to create a new one; otherwise it will return an existing GDataLink instance
- Returns
GDataLink object, i.e. an edge of the GData graph
-
add_node
(**nprops)¶ - Parameters
nprops – GData node (vertex) properties
- Returns
HyperBond object
-
add_values
(string_values, hb2=10000)¶ Create and set a value vertex property :param hb2: dim2 value for hyperbonds,
it is set at a high enough value to filter them later on in the graph of data
- Parameters
string_values – string representation of the ha2 column UNIQUE data values, given as a NumPy array of dtype=str
- Returns
-
add_vertex
(**vprops)¶ Used in GDataNode to create a new instance of a node :param vprops: GData vertex properties :return: a vertex of GData Graph
-
add_vertices
(n)¶
-
at
(dim2, dim1)¶ - Parameters
dim2 – ha2 dimension of hyperatom or hb2 dimension of hyperbond
dim1 – ha1 dimension of hyperatom or hb1 dimension of hyperbond
- Returns
the node of the graph with the specific dimensions
-
property
dim1
¶
-
property
dim2
¶
-
get
(nid)¶ - Parameters
nid – Node ID (vertex id)
- Returns
GDataNode object
-
get_node_by_id
(nid)¶ - Parameters
nid – node ID (vertex id)
- Returns
GDataNode object from the derived class, i.e. HyperAtom, HyperBond object see class_dict
-
get_node_by_key
(dim2, dim1)¶ - Parameters
dim2 –
dim1 –
- Returns
object with the specific key
-
get_vp
(vp_name)¶ - Parameters
vp_name – vertex property name
- Returns
VertexPropertyMap object
-
get_vp_value
(vp_name, vid)¶ - Parameters
vp_name – vertex property name
vid – either vertex object or vertex index (node id)
- Returns
the value of vertex property on the specific vertex of the graph
-
get_vp_values
(vp_name, filtered=False)¶
-
property
graph
¶
-
property
graph_properties
¶
-
property
graph_view
¶
-
property
is_filtered
¶
-
property
is_view_filtered
¶
-
property
list_properties
¶
-
property
net_alias
¶
-
property
net_descr
¶
-
property
net_edges
¶
-
property
net_format
¶
-
property
net_name
¶
-
property
net_path
¶
-
property
net_tool
¶
-
property
net_type
¶
-
property
ntype
¶
-
save_graph
()¶ Save HyperMorph GData._graph using the self._net_name, self._net_path and self._net_format
-
set_filter
(vmask, inverted=False)¶ This filters the DGraph._graph instance. Only the vertices with a value different from False are kept in the filtered graph
- Parameters
vmask – boolean mask for the vertices of the graph
inverted – if it is set to TRUE only the vertices with value FALSE are kept.
- Returns
the filtered state of the graph
-
set_filter_view
(vmask)¶ DGraph._graph_view is a filtered view of the DGraph._graph, in that case the state of the DGraph is not affected by the filtering operation, i.e. after filtering DGraph._graph has the same vertices and edges as before filtering :param vmask: boolean mask for the vertices of the graph :return: filtered state of the graph view
-
unset_filter
()¶ Reset the filtering of the DGraph._graph instance :return: the filtered state
-
unset_filter_view
()¶
-
property
value
¶
-
property
vertex_properties
¶
-
property
vertices
¶
-
property
vertices_view
¶
-
property
vid
¶
-
property
vids
¶
-
property
vids_view
¶
-
property
vmask
¶
-
hypermorph.data_graph.
int_to_class
(class_id)¶ - Parameters
class_id – (0 - ‘HyperAtom’) or (1 - ‘HyperBond’)
- Returns
a class that is used in get(), get_node_by_id() methods
hypermorph.data_graph_hyperatom module¶
-
class
hypermorph.data_graph_hyperatom.
HyperAtom
(gdata, vid=None, **node_properties)¶
hypermorph.data_graph_hyperbond module¶
-
class
hypermorph.data_graph_hyperbond.
HyperBond
(gdata, vid=None, **node_properties)¶
hypermorph.data_graph_link module¶
-
class
hypermorph.data_graph_link.
GDataLink
(gdata, from_node, to_node)¶ Bases:
object
Each instance of GDataLink links a tail node with a head node, i.e. HyperBond —> HyperAtom
Each GDataLink has two connectors (bidirectional edges): An outgoing edge from the tail An incoming edge to the head
In the case of a HyperBond (HB) node there are <Many> Outgoing Edges that start < From One > HB In the case of a HyperAtom (HA) node there are <Many> Incoming Edges that end < To One > HA
GDataLink type represents a DIRECTED MANY TO MANY RELATIONSHIP
Important Notice: Do not confuse the DIRECTION OF RELATIONSHIP with the DIRECTION OF TRAVERSING THE BIDIRECTIONAL EDGES of the GDataLink
Many-to-Many Relationship is defined as a (Many-to-One) and (One-to-Many)
MANY side (tail node) --- ONE side (outgoing edge) --- ONE side (incoming edge) --- MANY side (head node)
FROM Node (fromID) == an outgoing edge ==> GDataLink == an incoming edge ==> TO Node (toID)
-
property
edge
¶
-
property
gdata
¶
hypermorph.data_graph_node module¶
-
class
hypermorph.data_graph_node.
GDataNode
(gdata, vid=None, **vprops)¶ Bases:
object
- The GDataNode class:
- if vid is None
create a NEW node, i.e. a new vertex on the graph with properties
- if vid is not None
initialize a node that is represented with an existing vertex with vid
-
property
all
¶
-
property
all_edges_ids
¶
-
property
all_links
¶
-
property
all_nids
¶
-
property
all_nodes
¶
-
property
all_vertices
¶
-
property
gdata
¶
-
get_value
(prop_name)¶ - Parameters
prop_name – Vertex property name (vp_names) or @property function name (calculated_properties) or data type properties (field_meta)
- Returns
the value of property for the specific node
-
property
in_edges_ids
¶
-
property
in_links
¶
-
property
in_nids
¶
-
property
in_nodes
¶
-
property
in_vertices
¶
-
property
key
¶
-
property
out_edges_ids
¶
-
property
out_links
¶
-
property
out_nids
¶
-
property
out_nodes
¶
-
property
out_vertices
¶
-
property
vertex
¶
hypermorph.data_pipe module¶
-
class
hypermorph.data_pipe.
DataPipe
(schema_node, result=None)¶ Bases:
hypermorph.utils.GenerativeBase
Implements method chaining: A query operation, e.g. projection, counting, filtering can invoke multiple method calls. Each method corresponds to a query operator such as: get_components.over().to_dataframe().out()
out() method is always at the end of the chained generative methods to return the final result
Each one of these operators returns an intermediate result (self.fetch), allowing the calls to be chained together in a single statement.
DataPipe methods such as get_rows() are wrapped inside methods of other classes, e.g. get_rows() of Table(SchemaNode), so that when they are called from these methods the result can be chained to other methods of DataPipe. In that way we implement transformations and conversions to multiple output formats easily and intuitively.
- Notice: we distinguish between two different execution types according to the evaluation of the result
Lazy evaluation, see for example to_***() methods
Eager evaluation
- This module has a dual combined purpose:
perform transformations from one data structure to another data structure
load data into volatile memory (RAM, DRAM, SDRAM, SRAM, GDDR) or import data into non-volatile storage (NVRAM, SSD, HDD, Database) with a specific format e.g. parquet, JSON, ClickHouse MergeTree engine table, MYSQL table, etc…
Transformation, importing and loading operations are based on pyarrow/numpy library and ClickHouse columnar DBMS
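A hedged sketch of method chaining, assuming `table` is a Table(SchemaNode) object obtained from the Schema and that the selected column names exist in it:

# get_rows() fetches records, over() projects columns, slice() limits them,
# to_dataframe() converts the result and out() terminates the chain
df = (table.get_rows()
           .over(select='npi, city, state')
           .slice(limit=100)
           .to_dataframe()
           .out())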
-
exclude
(select=None)¶ - Parameters
select – Exclude columns in projection
- Returns
-
get_columns
()¶ Wrapped in Table(SchemaNode) class :return: pass self.fetch to the next chainable operation
-
get_rows
(npartitions=None, partition_size=None)¶ Wrapped in the Table(SchemaNode) class. Fetch either records of an SQL table or rows of a flat file. Notice: specify either the npartitions or the partition_size parameter, or neither of them
- Parameters
npartitions – split the values of the index column linearly; slice() will have the effect of modifying the split accordingly
partition_size – number of records to use for each partition or target size of each partition, in bytes
Notice: npartitions or partition_size will perform a lazy evaluation and it will return a generator object
- Returns
pass self.fetch to the next chainable operation
-
order_by
(columns)¶ - Parameters
columns – comma separated string column names to sort by
- Returns
-
out
(lazy=False)¶ We distinguish between two cases, eager vs lazy evaluation. This is particularly useful when we deal with very large dataframes that do not fit in memory
- Parameters
lazy –
- Returns
use the out() method at the end of the chained generative methods to return the output of SchemaNode objects displayed with the appropriate specified format and structure
-
over
(select=None, as_names=None, as_types=None)¶ - Notice: over() must be present in method chaining when you fetch data by constructing and executing an SQL query; in that case the default projection is self._project = ' * '
- Parameters
select – projection over the selected metadata columns
as_names – list of user-specified column names to use for the resulting frame; these are used: i) to rename columns (SQL AS operator) ii) to extend the result set with calculated columns from an expression
as_types – list of data types or comma separated string of data types, e.g. pandas data types when we read data from flat files using pandas.read_csv and want to disable type inference on those columns
- Returns
pass self.fetch to the next chainable operation
-
property
schema_node
¶
-
slice
(limit=None, offset=0)¶ - Parameters
limit – number of rows to return from the result set
offset – number of rows to skip from the result set
- Returns
SQL statement
-
property
sql_query
¶
-
to_batch
(delimiter=None, nulls=None, skip=0, trace=None, arrow_encoding=True)¶ - Parameters
delimiter – 1-character string specifying the boundary between fields of the record
nulls – list of strings that denote nulls e.g. [‘N’]
skip – number of rows to skip at the start of the flat file
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
arrow_encoding – apply PyArrow columnar dictionary encoding
- Returns
PyArrow RecordBatch with optionally dictionary encoded columns
-
to_dataframe
(data=None, index=None, delimiter=None, nulls=None, trace=None)¶ - Parameters
data – ndarray (structured or homogeneous), Iterable, dict
index – column names of the result set to use in pandas dataframe index
delimiter – 1-character string specifying the boundary between fields of the record
nulls – list of strings that denote nulls e.g. [‘N’]
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
- Returns
pandas dataframe
-
to_feather
(path, **feather_kwargs)¶ - Parameters
path – full path of the feather file
feather_kwargs –
- Returns
file_location
-
to_parquet
(path, **parquet_kwargs)¶ - Parameters
path – full path string of the parquet file
parquet_kwargs – row_group_size, version, use_dictionary, compression (see…
https://pyarrow.readthedocs.io/en/latest/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table :return: file_location
-
to_table
(delimiter=None, nulls=None, skip=0, trace=None, arrow_encoding=True)¶ - Notice1: This is a transformation from a row layout to a column layout, i.e. chained to the get_rows() method.
Dictionary encoded columnar layout is a fundamental component of the HyperMorph associative engine.
Notice2: The output is a PyArrow Table data structure with a columnar layout, NOT a row layout.
Notice3: The method is also used when we fetch columns directly from columnar data storage, e.g. a ClickHouse columnar database or parquet files, i.e. chained to the get_columns() method
- Parameters
delimiter – 1-character string specifying the boundary between fields of the record
nulls – list of strings that denote nulls e.g. [‘N’]
skip – number of rows to skip at the start of the flat file
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
arrow_encoding – apply PyArrow columnar dictionary encoding
- Returns
PyArrow in-memory table with a columnar data structure with optionally dictionary encoded columns
-
to_tuples
(trace=None)¶ ToDo NumPy structured arrays representation…. :param trace: trace execution of query, i.e. print query, elapsed time, rows in set, etc. :return:
-
where
(condition=None)¶
hypermorph.draw_hypergraph module¶
hypermorph.exceptions module¶
-
exception
hypermorph.exceptions.
ASETError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to construct an AssociativeSet instance
-
exception
hypermorph.exceptions.
AssociationError
¶
-
exception
hypermorph.exceptions.
ClickHouseException
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to execute query in ClickHouse
-
exception
hypermorph.exceptions.
DBConnectionFailed
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to create a connection with the database
-
exception
hypermorph.exceptions.
GraphError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised in Schema methods
-
exception
hypermorph.exceptions.
GraphLinkError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised in SchemaLink methods
-
exception
hypermorph.exceptions.
GraphNodeError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised in SchemaNode methods or in any of the methods of SchemaNode subclasses
-
exception
hypermorph.exceptions.
HACOLError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to initialize HACOL
-
exception
hypermorph.exceptions.
HyperMorphError
¶ Bases:
Exception
Base class for all HyperMorph-related errors
-
exception
hypermorph.exceptions.
InvalidAddOperation
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when you call DataManagementFramework.add() with invalid parameters
-
exception
hypermorph.exceptions.
InvalidDelOperation
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when you call DataManagementFramework.del() with invalid parameters
-
exception
hypermorph.exceptions.
InvalidEngine
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when we pass a wrong type of HyperMorph engine
-
exception
hypermorph.exceptions.
InvalidGetOperation
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when you call DataManagementFramework.get() with invalid parameters
-
exception
hypermorph.exceptions.
InvalidPipeOperation
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to execute an operation in a pipeline
-
exception
hypermorph.exceptions.
InvalidSQLOperation
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to execute an SQL command
-
exception
hypermorph.exceptions.
InvalidSourceType
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when we pass a wrong source type of HyperMorph
-
exception
hypermorph.exceptions.
MISError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised in operations with DataDictionary
-
exception
hypermorph.exceptions.
PandasError
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when it fails to construct pandas dataframe
-
exception
hypermorph.exceptions.
UnknownDictionaryType
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when trying to add a term in the dictionary with an unknown type. Types can be either:
HyperEdges, i.e. instances of the TBoxTail class: DRS, DMS, DLS - (dim4, 0, 0); HLT, DS, DM - (dim4, dim3, 0)
HyperNodes, i.e. instances of the TBoxHead class: TSV, CSV, FLD - (dim4, dim3, dim2); ENT, ATTR - (dim4, dim3, dim2)
-
exception
hypermorph.exceptions.
UnknownPrimitiveDataType
¶ Bases:
hypermorph.exceptions.HyperMorphError
Primitive Data Types are: [‘bln’, ‘int’, ‘flt’, ‘date’, ‘time’, ‘dt’, ‘enm’, ‘uid’, ‘txt’, ‘wrd’]
-
exception
hypermorph.exceptions.
WrongDictionaryType
¶ Bases:
hypermorph.exceptions.HyperMorphError
Raised when we attempt to call a specific method on an object that has the wrong node type
hypermorph.hacol module¶
-
class
hypermorph.hacol.
HAtomCollection
(attribute, data)¶ Bases:
object
A HyperAtom Collection (HACOL) can be: 1. A set of hyperatoms (HACOL_SET) that represent the domain of values for a specific attribute
2. A multiset of hyperatoms (HACOL_BAG) that represents a column of data in a table Each hyperatom may appear multiple times in this collection because each hyperatom is linked to one or more hyperbonds (MANY-TO-MANY relationship)
3. A set of values of a specific data type (HACOL_VAL) where each value is associated with a hyperatom from the set of hyperatoms (HACOL_SET) to form a KV pair.
The set of KV pairs represents the domain of a specific attribute, where K is the key of the hyperatom with dimensions (dim3-model, dim2-attribute, dim1-distinct value) and V is the data type value
HyperAtoms can be displayed with K, V or K:V pair
All hyperatoms in (1), (2) and (3) have common dimensions (dim3, dim2) i.e. same model, same attribute
- HACOL brings together, but at the same time keeps separate, the following under the same object:
metadata stored in an Attribute of the DataModel
data (self._data) stored in a PyArrow DictionaryEncoded Array object
Notice: data points to a DictionaryEncoded Array object which is a column of a PyArrow Table
-
count
(dataframe=True)¶
-
property
data
¶
-
dictionary
(columns=None, index=None, order_by=None, ascending=None, limit=None, offset=0)¶ - Parameters
columns – list (or comma separated string) of column names for pandas dataframe
index – list (or comma separated string) of column names to include in pandas dataframe index
order_by – str or list of str Name or list of names to sort by
ascending – bool or list of bool, default True the sorting order
limit – number of records to return from states dictionary
offset – number of records to skip from states dictionary
- Returns
states dictionary of HACOL
-
property
filtered
¶
-
property
filtered_data
¶
-
property
hatoms_included
¶
-
is_filtered
()¶ - Returns
The filtered state of the HACOL
-
memory_usage
(mb=True, dataframe=True)¶
-
property
pipe
¶ Returns a HACOLPipe GenerativeBase object that refers to an instance of a HyperCollection; use this object to chain operations and to update the state of the HyperCollection instance.
-
print_states
(limit=10)¶ wrapper for dictionary() :param limit: :return:
-
property
q
¶ wrapper for the starting point of a query pipeline :return:
-
reset
()¶
-
update_frequency_include_color_state
(indices)¶ In associative filtering we update frequency, include and color state for ALL HACOLs
- Parameters
indices – unique indices of filtered values (pyarrow.lib.Int32Array) these are values that are included in a column of a filtered table
- Returns
-
update_select_state
(indices)¶ - Parameters
indices – unique indices of the selected values (pyarrow.lib.Int32Array)
- Returns
-
property
values_included
¶
hypermorph.hacol_pipe module¶
-
class
hypermorph.hacol_pipe.
HACOLPipe
(hacol, result=None)¶ Bases:
hypermorph.utils.GenerativeBase
-
And
()¶ ToDo: …. :return:
-
In
()¶ ToDo:….. 1st case comma separated string or list of string values e.g. ‘Fairfax Village, Anacostia Metro, Thomas Circle, 15th & Crystal Dr’
(‘Fairfax Village’, ‘Anacostia Metro’, ‘Thomas Circle’, ‘15th & Crystal Dr’)
2nd case list of numeric values e.g. (31706, 31801, 31241, 31003)
-
Not
()¶ ToDo: …. :return:
-
Or
()¶ ToDo: …. :return:
-
between
(low, high, low_open=False, high_open=False)¶ ToDo:… scalar operations with an interval :param low: lower limit point :param high: upper limit point :param low_open: :param high_open:
closed interval (default) ---> low_open=False, high_open=False
open interval ---> low_open=True, high_open=True
half-open interval ---> low_open=False, high_open=True
half-open interval ---> low_open=True, high_open=False
- Returns
BooleanArray Mask that is used in filter()
-
count
(dataframe=True)¶ - Parameters
dataframe – flag to display output with a Pandas dataframe
- Returns
number of values in filtered/unfiltered state number of hatoms in filtered/unfiltered state
-
filter
(mask=None)¶ It uses a boolean array mask (self.fetch) constructed in previous chained operation to filter HACOL data represented with a DictionaryArray
- Parameters
mask – this is used when we call filter() externally from ASETPipe.filter() method to update the filtering state of HACOL
- Returns
DictionaryArray, i.e. HACOL.data filtered the filtered DictionaryArray is pointed at self._hacol.filtered_data
-
like
(pattern)¶ Notice: like operator can also be used in where() as a string :param str pattern: match substring in column string values :return: PyArrow Boolean Array mask (self.fetch) that is used in filter()
it also returns boolean mask to calls from ASETPipe.where(), ASETPipe.And() methods
-
out
(lazy=False)¶ We distinguish between two cases, eager vs lazy evaluation. This is particularly useful when we deal with very large HyperAtom collections that do not fit in memory
- Parameters
lazy –
- Returns
use out() method at the end of the chained generative methods to return the output displayed with the appropriate specified format and structure
-
slice
(limit=None, offset=0)¶ slice is used either to limit the number of entries to return in the states dictionary or to limit the members of HyperAtom collection, i.e. hyperatoms (values)
- Parameters
limit – number of records to return from the result set
offset – number of records to skip from the result set
- Returns
A slice of records
-
start
()¶ This is used as the first method in a chain of other methods; here we set the filtered/unfiltered data. Pipeline methods slice(), to_array(), to_numpy(), to_series() start here. :return: DictionaryArray either in filtered or unfiltered state
-
to_array
(order=None, unique=False)¶ - Parameters
order – default None, ‘asc’, ‘desc’
unique – take distinct elements in array
- Returns
by default PyArrow Array or PyArrow DictionaryArray if dictionary=False
-
to_hyperlinks
(hb2=10001)¶ - Parameters
hb2 – dim2 value for hyperbonds, it is set at a high enough value >10000 to filter them later on in the graph of data
- Returns
HyperLinks (edges that connect a HyperBond with HyperAtoms) List of pairs in the form [ ((hb2, hb1), (ha2, ha1)), ((hb2, hb1), (ha2, ha1)), …] These are used to create a data graph
-
to_numpy
(order=None, limit=None, offset=0)¶ - Parameters
order – default None, ‘asc’, ‘desc’
limit – number of values to return from HACOL
offset – number of values to skip from HACOL
- Returns
-
to_series
(order=None, limit=None, offset=0)¶ - Parameters
order – default None, ‘asc’, ‘desc’
limit – number of values to return from HACOL
offset – number of values to skip from HACOL
- Returns
Pandas Series
-
to_string_array
()¶ - Returns
List of string values This is a string representation for the valid (non-null) values of the filtered HACOL It is used in the construction of a data graph to set the value property of the node
-
where
(condition='$v')¶ Example: phys.q.where(‘city like ATLANTA’) Notice: Entering where() method, self.fetch = self._hacol.filtered_data
Thus pc.match_substring(), pc.greater(), pc.equal() etc… are applied to either already filtered or unfiltered (self._hacol.filtered_data = self._hacol.data) DictionaryArray
- Parameters
condition –
- Returns
PyArrow Boolean Array mask (self.fetch) that is used in filter() it also returns boolean mask to calls from ASETPipe.where(), ASETPipe.And() methods
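A hedged sketch of a HACOL query pipeline, reusing the `phys` collection and the `city` condition from the example above; the exact chaining order is an assumption based on the method descriptions:

mask = phys.q.where('city like ATLANTA').out()     # PyArrow Boolean Array mask
phys.q.where('city like ATLANTA').filter().out()   # filter the HACOL data
s = phys.q.start().to_series(limit=10).out()       # filtered values as a pandas Series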
hypermorph.haset module¶
-
class
hypermorph.haset.
ASET
(entity, debug)¶ Bases:
object
An AssociativeSet, also called an AssociativeEntitySet, is ALWAYS bound to a SINGLE entity. An AssociativeSet is a Set of Association objects (see the Association class). An AssociativeSet can also be represented with a set of HyperBonds
There is a direct analogy with the Relational model:
Relation: A set of tuples ---> Associative Set: A set of Associations
Body: tuples of ordered values ---> Body: Associations
Heading: A tuple of ordered attribute names ---> Heading: A set of attributes
View: Derived relation ---> Associative View: A derived set of Associations
- ASET brings together, but at the same time keeps separate, the following under the same object:
metadata stored in an Entity of the DataModel
data (self._data) stored in a PyArrow DictionaryEncoded Table object from one or more DataSet(s)
-
property
attributes
¶
-
count
()¶ wrapper for ASETPipe.count() method :return:
-
property
data
¶
-
dictionary_encode
(delimiter=None, nulls=None, skip=0, trace=None)¶ It will load data from the DataSet; it currently supports tabular formats (rows or columns of a data table) and will apply PyArrow DictionaryArray encoding to the columns
- Parameters
delimiter – 1-character string specifying the boundary between fields of the record
nulls – list of strings that denote nulls e.g. [‘N’]
skip – number of rows to skip at the start of the flat file
trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.
- Returns
PyArrow RecordBatch constructed with DictionaryEncoded Array objects
-
property
entity
¶
-
property
filtered
¶
-
property
filtered_data
¶
-
property
hacols
¶
-
property
hbonds
¶
-
is_filtered
()¶ - Returns
The filtered state of ASET
-
property
mask
¶
-
memory_usage
(mb=True, dataframe=True)¶ - Parameters
mb – output units MegaBytes
dataframe – flag to display output with a Pandas dataframe
- Returns
-
property
num_rows
¶
-
property
pipe
¶ Returns an ASETPipe GenerativeBase object that refers to an instance of a HyperCollection; use this object to chain operations and to update the state of the HyperCollection instance.
-
print_rows
(select=cname_list, order_by='city, last, first', limit=20, index='npi, pacID')¶ - Parameters
select –
as_names –
index –
order_by –
ascending –
limit –
offset –
- Returns
-
property
q
¶ wrapper for the starting point of a query pipeline :return:
-
reset
(hacols_only=False)¶ - ASET reset includes:
Construction of a PyArrow Boolean Array mask with ALL True
reset of the filtered state to False
reset of HyperBonds
reset of HACOLs
- Parameters
hacols_only – Flag for partial reset of HACOLs only
- Returns
-
property
select
¶ wrapper for the starting point of a query pipeline in associative filtering mode :return:
-
update_hacols_filtered_state
()¶ Update the filtering state of HyperAtom collections. This is used when we want to operate on HyperAtom collections in a filtered state: <aset>.<hacol>.<operation>
For a single HACOL we can also use the form <aset>.<hacol>.q.filter(<aset.mask>).<operation>.out() :return:
hypermorph.haset_pipe module¶
-
class
hypermorph.haset_pipe.
ASETPipe
(aset, result=None)¶ Bases:
hypermorph.utils.GenerativeBase
-
And
(condition)¶ - Parameters
condition –
- Returns
BooleanArray Mask that is used in filter()
-
count
()¶ - Returns
number of hbonds (rows) in filtered/unfiltered state
-
filter
()¶ - Returns
-
out
(lazy=False)¶ We distinguish between two cases, eager vs lazy evaluation. This is particularly useful when we deal with very large dataframes that do not fit in memory
- Parameters
lazy –
- Returns
use the out() method at the end of the chained generative methods to return the output of SchemaNode objects displayed with the appropriate specified format and structure
-
over
(select=None, as_names=None, as_types=None)¶ Notice: over(), i.e. projection is chained after the filter() method
- Parameters
select – projection over the selected metadata columns
as_names – list of user-specified column names to use for the resulting dataframe; these are used: i) to rename columns (SQL AS operator) ii) to extend the result set with calculated columns from an expression
as_types – list of data types or comma separated string of data types
- Returns
RecordBatch
-
select
()¶ - Warning: DO NOT CONFUSE select() with over() operator
In HyperMorph select() is used as a flag to alter the state of HyperAtom collections. This is the associative filtering that takes place, where we:
Change the filtering state of HyperAtom collections
Update the selection, included states for each member of the HyperAtom collection
From an end-user perspective that results in selecting values from a HyperAtom collection
- Notice: In associative filtering mode we use only where() restriction
and we filter with values from a SINGLE HyperAtom collection
- Returns
-
slice
(limit=None, offset=0)¶ - Parameters
limit – number of records to return from the result set
offset – number of records to skip from the result set
- Returns
A slice of records
-
start
()¶ This is used as the first method in a chain of other methods; here we set the filtered/unfiltered data. Pipeline methods over(), slice(), to_record_batch(), to_records(), to_table(), to_dataframe() start here. :return: RecordBatch either in filtered or unfiltered state
-
to_dataframe
(index=None, order_by=None, ascending=None, limit=None, offset=0)¶ - Notice1: Use to_record_batch() transformation before chaining it to Pandas DataFrame,
it is a lot faster this way because it decodes PyArrow RecordBatch, i.e. RecordBatch columns are not dictionary encoded
- Notice2: sorting (order_by, ascending) and slicing (limit, offset) in a Pandas dataframe is slow
but sorting has not been implemented in PyArrow and that is why we pass these parameters here
- Parameters
order_by – str or list of str Name or list of names to sort by
ascending – bool or list of bool, default True the sorting order
limit – number of records to return from the result set
offset – number of records to skip from the result set
index – list (or comma separated string) of column names to include in pandas dataframe index
- Returns
Pandas dataframe
-
to_hyperlinks
()¶ - Returns
HyperLinks (edges that connect a HyperBond with HyperAtoms) List of pairs in the form [ ((hb2, hb1), (ha2, ha1)), ((hb2, hb1), (ha2, ha1)), …] These are used to create a data graph
- Notice: Set HACOLs to filtered state first,
using self._aset.update_hacols_filtered_state()
-
to_record_batch
()¶ - Returns
PyArrow RecordBatch but columns are not dictionary encoded
Notice: Always decode PyArrow RecordBatch before sending it to Pandas DataFrame, it is a lot faster
-
to_records
()¶ - Returns
NumPy Records
-
to_string_array
(unique=False)¶ - Parameters
unique –
- Returns
List of string values This is a string representation for the valid (non-null) values of the filtered HACOL It is used in the construction of a data graph to set the value property of the node
- Notice: Set HACOLs to filtered state first,
using self._aset.update_hacols_filtered_state()
-
to_table
()¶ - Returns
PyArrow Table
-
where
(condition)¶ Notice: The minimum condition you specify is the attribute name or the attribute dim2 dimension. Valid conditions: ‘$2’, ‘quantity’, ‘price>=4’, ‘size = 10’
- Parameters
condition –
- Returns
BooleanArray Mask that is used in filter()
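A hedged sketch of an associative query, assuming `aset` is an ASET instance whose attributes include the columns used in the examples of this page:

# where() builds a boolean mask, filter() applies it, over() projects
# (chained after filter(), as noted above) and out() returns the result
df = (aset.q.where('price>=4')
            .filter()
            .over(select='prtID, prtnam')
            .to_dataframe()
            .out())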
hypermorph.hassoc module¶
-
class
hypermorph.hassoc.
Association
(*pos_args, **kw_args)¶ Bases:
object
This is the analogue of a relational tuple, i.e. row of ordered values An Association is the basic construct of Associative Sets
It is called Association because it associates a HyperBond to a set of HyperAtoms. HyperBond is a symbolic 2D numerical representation of a row, and HyperAtom is a symbolic 2D numerical representation of a unique value in the table column. HyperAtoms can also have a textual (string) representation
Association can be represented in many ways: i) With the hb key A[7, 4]
ii) With keyword arguments Association(hb=(7, 4), prtcol=None, prtwgt=None, prtID=227, prtnam=’car battery’, prtunt=None)
iii) With positional arguments Association((7,4), None, None, 227, ‘car battery’, None)
heading: a set of attributes and a key e.g. (‘hb’, ‘prtcol’, ‘prtwgt’, ‘prtID’, ‘prtnam’, ‘prtunt’)
body: KV pairs e.g. Association(hb=(7, 4), prtcol=None, prtwgt=None, prtID=227, prtnam=’car battery’, prtunt=None)
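A sketch of constructing an Association with the keyword and positional forms shown above; whether the heading must be set beforehand depends on Association internals not shown here:

from hypermorph.hassoc import Association

# Keyword form (ii): the hb key plus attribute KV pairs
assoc = Association(hb=(7, 4), prtcol=None, prtwgt=None,
                    prtID=227, prtnam='car battery', prtunt=None)
# Positional form (iii) of the same association
assoc2 = Association((7, 4), None, None, 227, 'car battery', None)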
-
property
body
¶
-
static
change_heading
(*fields)¶
-
get
()¶
-
property
heading_fields
¶
hypermorph.mis module¶
-
class
hypermorph.mis.
MIS
(debug=0, rebuild=False, warning=True, load=False, **kwargs)¶ Bases:
object
MIS is a builder pattern class based on Schema class, ….
-
add
(what, **kwargs)¶ Add new nodes to HyperMorph Schema or an Associative Entity Set :param what: the type of node to add (datamodel, entity, entities, attribute, dataset) :param kwargs: pass keyword arguments to Schema.add() method :return: the object(s) that were added to HyperMorph Schema
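A hedged sketch of the MIS wrapper; `cname` is a node property name that appears elsewhere in these docs and the string values are illustrative:

from hypermorph.mis import MIS

mis = MIS(debug=0)                                          # builds or loads the Schema
dm = mis.add('datamodel', cname='Supplier Part Catalog')    # kwargs forwarded to Schema.add()
overview_df = mis.overview                                  # systems, datamodels, datasets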
-
static
add_aset
(from_table=None, with_fields=None, entity=None, entity_name=None, entity_alias=None, entity_description=None, datamodel=None, datamodel_name='NEW Data Model', datamodel_alias='NEW_DM', datamodel_descr=None, attributes=None, as_names=None, as_types=None, debug=0)¶ - There are three ways to create an ASET object:
From an Entity that already has a mapping defined (entity); fields are mapped onto the attributes of an existing Entity
From a Table of a dataset (from_table, with_fields) whose fields are mapped onto the attributes of a NEW Entity that is created in an existing DataModel
From a Table of a dataset (from_table, with_fields) whose fields are mapped onto the attributes of a NEW Entity that is created in a NEW DataModel
Cases (2) and (3) define a new mapping between a data set and a data model
- Parameters
from_table –
with_fields –
entity –
entity_name –
entity_alias –
entity_description –
datamodel –
datamodel_name –
datamodel_alias –
datamodel_descr –
attributes –
as_names –
as_types –
debug –
- Returns
-
property
all_nodes
¶
-
at
(*args)¶
-
property
datamodels
¶
-
property
datasets
¶
-
property
dms
¶
-
property
drs
¶
-
get
(nid, what='node', select=None, index=None, out='dataframe', junction=None, mapped=None, key_column='nid', value_columns='cname', filter_attribute=None, filter_value=None, reset=False)¶ This method implements the functional paradigm; it is basically a wrapper of chainable methods. For example:
get(461).get_entities().over(select='nid, dim3, dim2, cname, alias, descr').to_dataframe(index='dim3, dim2').out()
can be written as
get(461, what='entities', select='nid, dim3, dim2, cname, alias, descr', out='dataframe', index='dim3, dim2')
- Parameters
nid –
what –
select –
index –
out –
junction –
mapped –
key_column –
value_columns –
filter_attribute –
filter_value –
reset –
- Returns
-
get_all_nodes
()¶
-
get_datamodels
()¶
-
get_datasets
()¶
-
get_overview
()¶
-
get_systems
()¶
-
property
hls
¶
-
load
(**kwargs)¶
-
property
mem
¶
-
property
mms
¶
-
property
overview
¶
-
rebuild
(warning=True, **kwargs)¶
-
property
root
¶
-
save
()¶
-
static
size_of_dataframe
(df, deep=False)¶
-
static
size_of_object
(obj)¶
-
property
sls
¶
-
property
systems
¶
hypermorph.schema module¶
-
class
hypermorph.schema.
Schema
(rebuild=False, load=False, **graph_properties)¶ Bases:
object
Schema class creates a data catalog, i.e. a metadata repository. The data catalog resembles (TBox) a vocabulary of “terminological components”, i.e. abstract terms. Data catalog properties, e.g. dimensions, names, counters, etc., describe the concepts in a data dictionary. These terms are Entity types, Attribute types, Data Resource types, Link (edge) types, etc. TBox is about types and relationships between types, e.g. Entity-Attribute, Table-Column, Object-Fields, etc.
Schema of HyperMorph is represented with a directed graph that is based on the graph_tool python module. The Schema graph is composed of SchemaNodes and SchemaEdges. Each SchemaEdge links two SchemaNodes and we define a direction convention from a tail SchemaNode to a head SchemaNode.
System, DataModel, DataSet, GraphDataModel, Table, Field, classes are derived from SchemaNode class
Schema of HyperMorph is a hypergraph defined by two sets of objects (a.k.a. hyper-nodes and hyper-edges). If we have ‘hyper-edges’ HE={he1, he2, he3} and ‘hyper-nodes’ HN={hn1, hn2, hn3}, then we can make a map such as d = {he1: (hn1, hn2), he2: (hn2,), he3: (hn1, hn2, hn3)}. G(HE, HN, d) is the hypergraph
-
add
(what, with_components=False, datamodel=None, **kwargs)¶ Wrapper method for add methods
- Parameters
what – the type of node to add (datamodel, entity, entities, attribute, dataset)
with_components –
- existing components of the dataset to add; valid parameters are
['tables', 'fields'], 'tables', 'graph data models', 'schemata'
- 'tables': For datasets in a DBMS, add database tables.
For datasets from files with a tabular structure, add files of a specific type in a folder. Files with a tabular structure are flat files (CSV, TSV), Parquet files, Excel files, etc. Note: These are added as new Table nodes of HyperMorph Schema with type TBL
- 'fields': Either add columns of a database table or fields of a file with a tabular structure.
Note: These are added as new Field nodes of HyperMorph Schema with type FLD
- 'graph data models': A dataset of graph data models, i.e. files of type .graphml or .gt in a folder.
Each file in the set serializes, i.e. represents, a HyperMorph DataModel
- 'schemata': A dataset of HyperMorph schemata, i.e. files of type .graphml or .gt in a folder.
Each file in the set serializes, i.e. represents, a HyperMorph Schema
datamodel – A node of type DM to add NEW nodes of type Entity and Attribute
kwargs – Other keyword arguments to pass
- Returns
the object(s) that were added to HyperMorph Schema
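Hedged examples of the add() wrapper, assuming `schema` is a Schema instance and `dm` is an existing DataModel node; the keyword values are illustrative:

# Add a dataset together with its tables and fields
ds = schema.add('dataset', with_components=['tables', 'fields'], cname='demo dataset')
# Add a NEW entity under an existing datamodel
ent = schema.add('entity', datamodel=dm, cname='Part')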
-
add_datamodel
(**nprops)¶ - Parameters
nprops – schema node (vertex) properties
- Returns
DataModel object
-
add_dataset
(**nprops)¶ - Parameters
nprops – schema node (vertex) properties
- Returns
DataSet object
-
add_edge
(from_vertex, to_vertex, **eprops)¶ - Parameters
from_vertex – tail vertex
to_vertex – head vertex
eprops – Schema edge properties
- Returns
an edge of Schema Graph
-
add_edges
(elist)¶ Notice: it is not used in this module….
- Parameters
elist – edge list
- Returns
-
add_link
(from_node, to_node, **eprops)¶ - Parameters
from_node – tail node is a SchemaNode object or node ID
to_node – head node is a SchemaNode object or node ID
eprops – edge properties
If there isn’t a link from the tail node to the head node, it will try to create a new one; otherwise it will return an existing SchemaLink instance
- Returns
SchemaLink object, i.e. an edge of the schema graph
-
add_vertex
(**vprops)¶ - Parameters
vprops – Schema vertex properties
- Returns
a vertex of Schema Graph
-
property
alias
¶
-
property
all_nodes
¶ - Returns
shortcut for SchemaPipe operation to set the GraphView in unfiltered state and get all the nodes
-
at
(dim4, dim3, dim2)¶ Notice: Only data model, data resource objects have keys with dimensions (dim4, dim3, dim2)
- Parameters
dim4 – dim4 is taken from self.dms.dim4 or self.drs.dim4; it is fixed and never changes
dim3 – represents a datamodel or dataset object
dim2 – represents a component of datamodel or dataset object
- Returns
the dataset or the datamodel object with the specific key
-
property
cname
¶
-
property
counter
¶
-
property
ctype
¶
-
property
datamodels
¶ - Returns
shortcut for SchemaPipe operations to output datamodels metadata in a dataframe
-
property
datasets
¶ - Returns
shortcut for SchemaPipe operations to output datasets metadata in a dataframe
-
property
descr
¶
-
property
dim2
¶
-
property
dim3
¶
-
property
dim4
¶
-
property
dms
¶
-
property
drs
¶
-
property
ealias
¶
-
property
edge_properties
¶
-
property
elabel
¶
-
property
ename
¶
-
property
etype
¶
-
property
extra
¶
-
get
(nid)¶ - Parameters
nid – Node ID (vertex id)
- Returns
SchemaNode object
-
get_all_nodes
()¶ - Returns
result from get_all_nodes method that can be chained to other operations e.g. filter_view(),
-
get_datamodels
()¶ - Returns
result from get_datamodels method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_datasets
()¶ - Returns
result from get_datasets method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_ep
(ep_name)¶ - Parameters
ep_name – edge property name
- Returns
EdgePropertyMap object
-
get_ep_value
(ep_name, edge)¶ - Parameters
ep_name – edge property name
edge –
- Returns
the enumerated value of the edge property on the specific edge of the graph; the value is enumerated with a key in the eprop_dict
-
get_ep_values
(ep_name)¶
-
get_node_by_id
(nid)¶ - Parameters
nid – node ID (vertex id)
- Returns
SchemaNode object
-
get_node_by_key
(dim4, dim3, dim2)¶ Notice: Only data model, data resource objects have keys with dimensions (dim4, dim3, dim2)
- Parameters
dim4 – dim4 is taken from self.dms.dim4 or self.drs.dim4; it is fixed and never changes
dim3 – represents a datamodel or dataset object
dim2 – represents a component of datamodel or dataset object
- Returns
the dataset or the datamodel object with the specific key
-
get_overview
()¶ - Returns
result from get_datamodels method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_systems
()¶ - Returns
result from get_systems method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_vp
(vp_name)¶ - Parameters
vp_name – vertex property name
- Returns
VertexPropertyMap object
-
get_vp_value
(vp_name, vid)¶ - Parameters
vp_name – vertex property name
vid – either vertex object or vertex index (node id)
- Returns
the value of vertex property on the specific vertex of the graph
-
get_vp_values
(vp_name, filtered=False)¶
-
property
graph
¶
-
property
graph_properties
¶
-
property
graph_view
¶
-
property
hls
¶
-
property
is_filtered
¶
-
property
is_view_filtered
¶
-
property
list_properties
¶
-
property
net_alias
¶
-
property
net_descr
¶
-
property
net_edges
¶
-
property
net_format
¶
-
property
net_name
¶
-
property
net_path
¶
-
property
net_tool
¶
-
property
net_type
¶
-
property
ntype
¶
-
property
overview
¶ - Returns
shortcut for SchemaPipe operations to output an overview of systems, datamodels, datasets in a dataframe
-
property
root
¶
-
save_graph
()¶ Save HyperMorph Schema._graph using the self._net_name, self._net_path and self._net_format
-
set_filter
(filter_value, filter_attribute=None, operator='eq', reset=True, inverted=False)¶ This filters the Schema Graph instance :param filter_value: the value of the attribute to filter vertices of the graph, or a list of node ids (vertex ids)
- Parameters
filter_attribute – is a defined vertex property for filtering vertices of the graph (Schema nodes) to create a GraphView
operator – e.g. comparison operator for the values of node
reset – set the GraphView in the unfiltered state, i.e. parameter vfilt=None; set the vertex mask in the unfiltered state, i.e. fill the array with zeros. This step is necessary when we filter with node_ids
inverted –
- Returns
the filtered state
-
set_filter_view
(filter_value, filter_attribute=None, operator='eq', reset=True)¶ GraphView is a filtered view of the Graph, in that case the state of the Graph is not affected by the filtering operation, i.e. after filtering Graph has the same nodes and edges as before filtering
- Parameters
filter_value – the value of the attribute to filter vertices of the graph or a list of node ids (vertex ids)
filter_attribute – is a defined vertex property for filtering vertices of the graph (Schema nodes) to create a GraphView
operator – e.g. comparison operator for the values of node
reset – set the GraphView in the unfiltered state, i.e. parameter vfilt=None; set the vertex mask in the unfiltered state, i.e. fill the array with zeros. This step is necessary when we filter with node_ids
- Returns
-
property
sls
¶
-
property
systems
¶ - Returns
shortcut for SchemaPipe operations to output systems metadata in a dataframe
-
unset_filter
()¶ Reset the filtering of the Schema Graph instance :return: the filtered state
-
unset_filter_view
()¶
-
property
vertex_properties
¶
-
property
vertices
¶
-
property
vertices_view
¶
-
property
vid
¶
-
property
vids
¶
-
property
vids_view
¶
-
property
vmask
¶
-
hypermorph.schema.
str_to_class
(class_name)¶ - Parameters
class_name – e.g. Table, Entity, Attributes (see class_dict)
- Returns
a class that is used in get(), get_node_by_id() methods
hypermorph.schema_dms_attribute module¶
-
class
hypermorph.schema_dms_attribute.
Attribute
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
Notice: all get_* methods return node ids so that they can be converted easily to many forms: keys, dataframe, SchemaNode objects, etc.
-
property
datamodel
¶
-
property
entities
¶ Notice: This has a different output < out(‘node’) >, i.e. not metadata in dataframe, because we use this property in projection. For example in DataSet.get_attributes….. :return: shortcut for SchemaPipe operations to output Entity nodes
-
property
fields
¶
-
property
get_entities
¶ - Returns
result from get_entities method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
property
parent
¶
hypermorph.schema_dms_datamodel module¶
-
class
hypermorph.schema_dms_datamodel.
DataModel
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
- Notice: all get_* methods return SchemaPipe, DataPipe objects
so that they can be chained to other methods of those classes. That way we can convert and transform anything easily to many forms: keys, dataframe, SchemaNode objects, etc.
- ToDo: A method of DataModel to save it separately from Schema,
e.g. write it on disk with a serialized format (graphml) or in a database…. In the current version DataModel can be created with commands and saved in a .graphml, .gt file or it can be saved together with the Schema in a .graphml, .gt file
-
add_attribute
(entalias, **nprops)¶ - Parameters
entalias – Attribute is linked to Entities with the corresponding aliases
nprops – schema node (vertex) properties
- Returns
single Attribute object
-
add_entities
(metadata)¶ - Parameters
metadata – list of dictionaries, dictionary keys are property names of Entity node (cname, alias, …)
- Returns
Entity objects
-
add_entity
(**nprops)¶ - Parameters
nprops – schema node (vertex) properties
- Returns
single Entity object
-
property
attributes
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
property
components
¶ - Returns
shortcut for SchemaPipe operations to output components metadata of the datamodel in a dataframe
-
property
entities
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
get_attributes
(junction=None)¶ - Returns
result from get_attributes method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_components
()¶ - Returns
result from get_components method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
property
get_entities
¶ - Returns
result from get_entities method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
property
parent
¶
-
to_hypergraph
()¶
hypermorph.schema_dms_entity module¶
-
class
hypermorph.schema_dms_entity.
Entity
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
Notice: all get_* methods return SchemaPipe, DataPipe objects so that they can be chained to other methods of those classes. That way we can convert and transform easily anything to many forms: keys, dataframe, SchemaNode objects…
-
property
attributes
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
property
datamodel
¶
-
get_attributes
(junction=None)¶ - Parameters
junction – True: return junction Attributes; False: return non-junction Attributes; None: return all Attributes
- Returns
return result from get_attributes method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_fields
(junction=None)¶ - Parameters
junction – True: return fields mapped on junction Attributes; False: return fields mapped on non-junction Attributes; None: return all fields mapped on Attributes
- Returns
Fields (node ids) that are mapped onto Attributes
Notice: In the general case, fields are mapped from more than one DataSet, Table, objects
-
get_tables
()¶ From the fields mapped on non-junction Attributes find their parents, i.e. tables. ToDo: Cover the case of fields from multiple tables mapped on attributes of the same entity :return: Table objects
-
has_mapping
()¶ - Returns
True if there are Field(s) of a Table mapped onto Attribute(s) of an Entity, otherwise False
-
property
parent
¶
-
to_hypergraph
()¶
hypermorph.schema_drs_dataset module¶
-
class
hypermorph.schema_drs_dataset.
DataSet
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
DataSet is a set of data resources (tables, fields, graph datamodels) in the following data containers: SQLite database, MySQL database, CSV/TSV flat files, and graph data files
- Notice: get_* methods return SchemaPipe, DataPipe objects
so that they can be chained to other methods of those classes. That way we can convert and transform easily anything to many forms: keys, dataframe, SchemaNode objects…
-
add_fields
()¶ The structure here is hierarchical: a DataSet —has—> Tables, and each Table —has—> Fields
- Returns
new Field objects
-
add_graph_datamodel
(**nprops)¶ Add graph data model, this is a graph serialization of TRIADB data model
- Parameters
nprops – schema node (vertex) properties
- Returns
single GDM object
-
add_graph_datamodels
()¶ Add graph data models
- Returns
new GDM objects
-
add_graph_schema
(**nprops)¶ Add graph schema, this is a graph serialization of HyperMorph Schema
- Parameters
nprops – schema node (vertex) properties
- Returns
single GSH object
-
add_graph_schemata
()¶ Add graph schemata
- Returns
new GSH objects
-
add_table
(**nprops)¶ - Parameters
nprops – schema node (vertex) properties
- Returns
single Table object
-
add_tables
(metadata=None)¶ - Parameters
metadata – list of dictionaries, keys of dictionary are metadata property names of Table node
- Returns
new Table objects
-
property
components
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
property
connection
¶
-
property
connection_metadata
¶
-
container_metadata
(**kwargs)¶ - Returns
metadata for the data resource container e.g. metadata for a parquet file, or the tables of a database
-
property
fields
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
get_components
()¶ - Returns
result from get_components method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_connection
(db_client=None, port=None, trace=0)¶ - Parameters
db_client –
port – use port for either HTTP or native client connection to clickhouse
trace –
- Returns
-
get_fields
(mapped=None)¶ - Parameters
mapped – if True, return ONLY those fields that are mapped onto attributes; by default return all fields
- Returns
result from get_fields method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_graph_datamodels
()¶ - Returns
result from get_graph_datamodels method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_graph_schemata
()¶ - Returns
result from get_graph_schemata method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_tables
()¶ - Returns
result from get_tables method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
property
graph_datamodels
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
property
graph_schemata
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
property
parent
¶
-
property
tables
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
hypermorph.schema_drs_field module¶
-
class
hypermorph.schema_drs_field.
Field
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
- Notice: all get_* methods return SchemaPipe, DataPipe objects
so that they can be chained to other methods of those classes. That way we can convert and transform easily anything to many forms: keys, dataframe, SchemaNode objects…
-
property
attributes
¶
-
property
metadata
¶
-
property
parent
¶
hypermorph.schema_drs_graph_datamodel module¶
-
class
hypermorph.schema_drs_graph_datamodel.
GraphDataModel
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
-
load_into_schema
()¶ Load GraphDataModel data resource into TRIADB Schema in memory
Notice: Do not confuse adding a set of GraphDataModels, i.e. a set of data resources, with loading one of these graph data models into the TRIADB Schema in memory.
The latter is a different operation: it creates new TRIADB data models in the Schema, i.e. it loads metadata information about the DataModel, its Entities and Attributes into the TRIADB Schema
- Returns
DataModel object
-
property
parent
¶
hypermorph.schema_drs_graph_schema module¶
-
class
hypermorph.schema_drs_graph_schema.
GraphSchema
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
GraphSchema is a data resource, a child of DataSet like a Table; DO NOT confuse it with the HyperMorph Schema. An instance of a GraphSchema resource is a serialized representation in a file with <.graphml> or <.gt> format
-
property
parent
¶
hypermorph.schema_drs_table module¶
-
class
hypermorph.schema_drs_table.
Table
(schema, vid=None, **node_properties)¶ Bases:
hypermorph.schema_node.SchemaNode
- Notice: all get_* methods return SchemaPipe, DataPipe objects
so that they can be chained to other methods of those classes. That way we can convert and transform easily anything to many forms: keys, dataframe, SchemaNode objects…
-
add_field
(**nprops)¶ - Parameters
nprops – schema node (vertex) properties
- Returns
single Field object
-
add_fields
(metadata=None)¶ - Parameters
metadata – list of dictionaries, each dictionary contains metadata column properties for a field (column) in a table
- Returns
new Field objects
-
container_metadata
(**kwargs)¶ - Returns
metadata for the data resource container e.g. metadata for columns of MySQL table
-
property
fields
¶ - Returns
shortcut for SchemaPipe operations to output metadata in a dataframe
-
get_columns
()¶ wrapper for DataPipe.get_columns() method :return: result from the get_columns method that can be chained to other operations; use out() at the end of the chained methods to retrieve the final result
-
get_fields
(mapped=None)¶ wrapper for SchemaPipe.get_fields() method :param mapped: if True, return ONLY those fields that are mapped onto attributes; by default return all fields
- Returns
result from get_fields method that can be chained to other operations e.g. over(), out()
use out() at the end of the chained methods to retrieve the final result
-
get_rows
(npartitions=None, partition_size=None)¶ wrapper for DataPipe.get_rows() method :return: result from the get_rows method that can be chained to other operations; use out() at the end of the chained methods to retrieve the final result
-
property
parent
¶
-
property
sql
¶
-
to_hypergraph
()¶
hypermorph.schema_link module¶
-
class
hypermorph.schema_link.
SchemaLink
(schema, from_node, to_node, **eprops)¶ Bases:
object
Each instance of SchemaLink links a tail node with a head node, examples: (DataModel —> Entity), (Entity —> Attribute), (Field —> Attribute), (Table —> Field), (DataSet —> Table)
Each SchemaLink has two connectors (bidirectional edges): an outgoing edge from the tail and an incoming edge to the head.
In the case of a HyperEdge (HE) node there are <Many> Outgoing Edges that start < From One > HE In the case of a HyperNode (HN) node there are <Many> Incoming Edges that end < To One > HN
SchemaLink type represents a DIRECTED MANY TO MANY RELATIONSHIP
Important Notice: Do not confuse the DIRECTION OF RELATIONSHIP with the DIRECTION OF TRAVERSING THE BIDIRECTIONAL EDGES of the SchemaLink
Many-to-Many Relationship is defined as a (Many-to-One) and (One-to-Many)
MANY side (tail node) —ONE side (outgoing edge)— —ONE side (incoming edge)— MANY side (head node)
FROM Node (fromID) === an outgoing edge ===== SchemaLink ===== an incoming edge ===> TO Node (toID)
-
property
all
¶
-
property
edge
¶
-
get_edge_property
(property_name)¶ this is used to access values that are returned from @property members of SchemaLink :param property_name: function name of the @property decorator :return:
-
get_value
(prop_name)¶ - Parameters
prop_name – Edge property name (ep_names)
- Returns
the value of property for the specific link
-
property
schema
¶
hypermorph.schema_node module¶
-
class
hypermorph.schema_node.
SchemaNode
(schema, vid=None, **vprops)¶ Bases:
object
- The SchemaNode class:
- if vid is None
create a NEW node, i.e. a new vertex on the graph with properties
- if vid is not None
initialize a node that is represented with an existing vertex with vid
Notice: All properties and methods defined here are accessible from derived classes Attribute, Entity, DataModel, DataSet, Table, Field
-
property
all
¶
-
property
all_edges_ids
¶
-
property
all_links
¶
-
property
all_nids
¶
-
property
all_nodes
¶
-
property
all_vertices
¶
-
property
descriptive_metadata
¶
-
property
dpipe
¶ Returns a
Pipe
(GenerativeBase object) that refers to an instance of SchemaNode; use this object to chain operations defined in the DataPipe class
-
get_value
(prop_name)¶ - Parameters
prop_name – Vertex property name (vp_names) or @property function name (calculated_properties) or data type properties (field_meta)
- Returns
the value of property for the specific node
-
property
in_edges_ids
¶
-
property
in_links
¶
-
property
in_nids
¶
-
property
in_nodes
¶
-
property
in_vertices
¶
-
property
key
¶
-
property
out_edges_ids
¶
-
property
out_links
¶
-
property
out_nids
¶
-
property
out_nodes
¶
-
property
out_vertices
¶
-
property
schema
¶
-
property
spipe
¶ Returns a
Pipe
(GenerativeBase object) that refers to an instance of SchemaNode; use this object to chain operations defined in the SchemaPipe class
-
property
system_metadata
¶
-
property
vertex
¶
hypermorph.schema_pipe module¶
-
class
hypermorph.schema_pipe.
SchemaPipe
(schema_node, result=None)¶ Bases:
hypermorph.utils.GenerativeBase
Implements method chaining: A query operation, e.g. projection, counting, filtering can invoke multiple method calls. Each method corresponds to a query operator such as: get_components.over().to_dataframe().out()
out() method is always at the end of the chained generative methods to return the final result
Each one of these operators returns an intermediate result, self.fetch, allowing the calls to be chained together in a single statement.
SchemaPipe methods, such as get_*(), are wrapped inside methods of classes derived from Schema and SchemaNode, so that when they are called from those methods the result can be chained to other methods of SchemaPipe. In that way we implement easily and intuitively transformations and conversions to multiple output formats.
- Notice: we distinguish between two different execution types according to the evaluation of the result
Lazy evaluation, see for example to_***() methods
Eager evaluation
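To make the chaining concrete, here is a hedged sketch adapted from the take() example later in this section; mis is assumed to be a HyperMorph/Schema instance and 414 an existing node id.

    # Each call returns self.fetch, so the operators compose in one statement;
    # out() terminates the chain and evaluates the final result eagerly.
    df = (mis.get(414)                         # a SchemaNode, e.g. a Table
             .get_fields()                     # Field node ids of the Table
             .over('nid, dim3, dim2, cname')   # project metadata columns
             .to_dataframe('dim3, dim2')       # lazy conversion step
             .out())                           # evaluate and return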
-
filter
(value=None, attribute=None, operator='eq', reset=True)¶ - Notice1: to create a filtered Graph from a list/array of nodes
that is a result of previous operations in a pipeline, leave attribute=None and value=None; to create a Graph from a list/array of nodes that is a result from the execution of other Python commands, leave attribute=None and set value=[set of nodes]
- Parameters
attribute – is a defined vertex property (node attribute) for filtering vertices of the graph (Schema nodes),
value – the value of the attribute to filter vertices of the graph
operator – e.g. comparison operator for the values of node
reset – set the Graph in unfiltered state then filter, otherwise it’s a composite filtering
- Returns
pass self.fetch to the next chainable operation
-
filter_view
(value=None, attribute=None, operator='eq', reset=True)¶ - Notice1: to create a GraphView from a list/array of nodes
that is a result of previous SchemaPipe operations, leave attribute=None and value=None; to create a GraphView from a list/array of nodes that is a result from the execution of other Python commands, leave attribute=None and set value=[set of nodes]
- Parameters
attribute – is a defined vertex property (node attribute) for filtering vertices of the graph (Schema nodes) to create a GraphView,
value – the value of the attribute to filter vertices of the graph
operator – e.g. comparison operator for the values of node
reset – set the GraphView in unfiltered state, otherwise it’s a composite filtering
- Returns
pass self.fetch to the next chainable operation
-
get_all_nodes
()¶ sets Graph or GraphView to the unfiltered state :return: all the nodes of the Graph or all the nodes of the GraphView
-
get_attributes
(junction=None)¶ - Parameters
junction – if True fetch those that are junction nodes else fetch non-junction attributes
- Returns
Attribute node ids of an Entity or Attribute node ids of a DataModel
-
get_components
()¶ Get node IDs for the components of a specific DataModel (Entity, Attribute) or DataSet (Table, Field, ….) It creates a filtered GraphView of the Schema for nodes that have dim3=SchemaNode.dim3
- Returns
self.fetch points to Entity, Attribute, Table, Field, GraphDataModel, GraphSchema nodes; these node ids are passed to the next chainable operation
-
get_datamodels
()¶ Get DataModel node IDs of data model system (dms) :return: self.fetch points to the set of DataModel node ids, these are passed to the next chainable operation
-
get_datasets
()¶ Get DataSet node IDs of data resources system (drs) :return: self.fetch points to the set of DataSet node ids, these are passed to the next chainable operation
-
get_entities
()¶ Get Entity node IDs of a DataModel or Entity node IDs of an Attribute :return: self.fetch points to Entity nodes; these are passed to the next chainable operation
-
get_fields
(mapped=None)¶ Wrapped in the Table(SchemaNode) class Get Field node IDs of a Table or Field node IDs of a DataSet :param mapped: if True, return ONLY those fields that are mapped onto attributes; by default return all fields
- Returns
self.fetch points to the set of Field node ids, these are passed to the next chainable operation
-
get_graph_datamodels
()¶ Get graph datamodel node ids :return: self.fetch that points to these node IDs
-
get_graph_schemata
()¶ Get graph schemata node ids :return: self.fetch that points to these node IDs
-
get_overview
()¶ Get an overview of systems, datasets, datamodels, etc. by filtering Schema nodes that have dim2=0 :return: self.fetch points to the set of filtered node ids; these are passed to the next chainable operation
-
get_systems
()¶ Get System node IDs including the root system :return: self.fetch points to the set of System node ids, these are passed to the next chainable operation
-
get_tables
()¶ Get Table node IDs of a DataSet :return: self.fetch points to the set of Table node ids, these are passed to the next chainable operation
-
out
(**kwargs)¶ - Returns
use out() method at the end of the chained generative methods to return the
output of SchemaNode objects displayed with the appropriate specified format and structure
-
over
(select=None)¶ - Parameters
select – projection over the selected metadata columns
- Returns
modifies self._project
-
plot
(**kwargs)¶ Graphical output to visualize hypergraphs, it is also used in out() method (see IHyperGraphPlotter.plot method) Example: mis.get(535).to_hypergraph().plot() or mis.get(535).to_hypergraph().out()
- Parameters
kwargs –
- Returns
-
property
schema_node
¶
-
take
(select, key_column='cname')¶ Take specific nodes from the result of get_*() methods :param select: list of integers (node IDs) or
list of strings (cname(s), alias(es))
Notice: all selected nodes specified must exist otherwise it will raise an exception
- Parameters
key_column – e.g. cname, alias
- Returns
a subset of numpy array with node IDs
- Notice the difference:
over() is a projection over the selected metadata columns (e.g. nid, dim3, dim2, …); take() is a projection over the selected fields of a database table or flat file (e.g. npi, city, state, …)
Example: mis.get(414).get_fields().over(‘nid, dim3, dim2, cname’)
.take(select=’npi, pacID, profID, city, state’).to_dataframe(‘dim3, dim2’).out()
-
to_dataframe
(index=None)¶ - Parameters
index – metadata column names to use in pandas dataframe index
- Returns
-
to_dict
(key_column, value_columns)¶ - Parameters
key_column – e.g. cname, alias, nid
value_columns – e.g. [‘cname, alias’]
- Returns
-
to_dict_records
(lazy=False)¶
-
to_entity
(entity_name='NEW Entity', entity_alias='NEW_ENT', entity_description=None, datamodel=None, datamodel_name='NEW DataModel', datamodel_alias='NEW_DM', datamodel_descr=None, attributes=None, as_names=None, as_types=None)¶ Map a Table object of a DataSet onto an Entity of a DataModel, there are two scenarios:
- (a) Map Table to a new Entity and selected fields (or all fields) of the table onto new attributes.
The new entity can be linked to a new datamodel (datamodel=None) or to an existing datamodel
- (b) Map selected fields (or all fields) of a table onto existing attributes of a datamodel.
It's a bipartite matching of fields with attributes with a one-to-one correspondence between fields and attributes; the user must specify the datamodel parameter.
Notice1: The Field-Attribute relationship is a Many-To-One i.e. many fields of different Entity objects are mapped onto one (same) Attribute
Notice2: In both (a) and (b) cases fields are selected with a combination of get_fields() and take() SchemaPipe operations on the table object
Example for (a): get(414).get_fields().take(‘npi, pacID, profID, last, first, gender, graduated, city, state’).
to_entity(cname=’Physician’, alias=’Phys’).out()
Example for (b): see the hedged sketch after the parameter list below.
- Parameters
entity_name –
entity_alias –
entity_description –
datamodel – create a new datamodel by default or pass an existing DataModel object
datamodel_name –
datamodel_alias –
datamodel_descr –
attributes – list of integers (Attribute IDs) or list of strings (Attribute cnames, aliases) of an existing Entity or None (default) to create new Attributes
as_names – in the case of creating new attributes, list of strings one for each new attribute
as_types – in the case of creating new attributes, list of strings one for each new attribute Notice: data types can be inferred later on when we use arrow dictionary encoding…
- Returns
An Entity object
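A hedged sketch for scenario (b), mapping selected fields onto existing attributes of an existing datamodel; mis, the node id, the field names and existing_dm are all illustrative assumptions, not taken from the source.

    # existing_dm: a DataModel object obtained earlier, e.g. with mis.get(...)
    ent = (mis.get(414)
              .get_fields()
              .take('npi, city, state')         # one field per target attribute
              .to_entity(datamodel=existing_dm,
                         attributes=['npi', 'city', 'state'])  # existing Attributes
              .out())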
-
to_fields
()¶ converts a list of Attribute objects to a list of Field objects :return: list of fields that are mapped onto an Attribute
-
to_hypergraph
()¶
-
to_keys
(lazy=False)¶
-
to_nids
(lazy=False, array=True)¶
-
to_nodes
(lazy=False)¶
-
to_tuples
(lazy=False)¶
-
to_vertices
(lazy=False)¶
hypermorph.schema_sys module¶
hypermorph.test module¶
hypermorph.utils module¶
-
class
hypermorph.utils.
DSUtils
¶ Bases:
object
Data Structure Utils Class
-
static
numpy_sorted_index
(arr, adj=False, freq=False)¶ - Parameters
arr – numpy 1d array that represents a table column of data values of the same type in the case of numpy array with string values and missing data, null values must be represented with np.NaN
adj – if True return adjacency lists
freq – if True return frequencies
- Returns
secondary index, i.e. unique values of arr in ascending order without NaN (null);
for each unique value it can also calculate:
a) the list of primary key indices, i.e. pointers to all rows of the table that contain that value, also known as adjacency lists in Graph terminology
b) the count of rows that contain that value, also known as database cardinality (selectivity), or frequency in an associative engine
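An illustrative call; the exact return structure with adj/freq is implementation-defined, so the shapes in the comment are assumptions based on the docstring.

    import numpy as np
    from hypermorph.utils import DSUtils

    # 'b' occurs in rows 0 and 2, 'a' in row 1; the sorted secondary index
    # is ['a', 'b'], with adjacency lists [[1], [0, 2]] and frequencies [1, 2].
    col = np.array(['b', 'a', 'b'])
    index = DSUtils.numpy_sorted_index(col, adj=True, freq=True)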
-
static
numpy_to_pyarrow
(np_arr, dtype=None, dictionary=True)¶ - Parameters
np_arr – numpy 1d array that represents a table column of data values of the same type
dtype – data type
dictionary – whether to use dictionary encoded form or not
- Returns
pyarrow array representation of arr
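For example (a sketch; the exact return type depends on the dictionary flag):

    import numpy as np
    from hypermorph.utils import DSUtils

    colors = np.array(['red', 'green', 'red', 'blue'])
    # dictionary=True requests PyArrow dictionary encoding: distinct values are
    # stored once, plus integer codes per row (compact for low-cardinality columns).
    pa_colors = DSUtils.numpy_to_pyarrow(colors, dictionary=True)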
-
static
pyarrow_chunked_to_dict
(chunked_array)¶ - Parameters
chunked_array – PyArrow ChunkedArray
- Returns
PyArrow Array / DictionaryArray
-
static
pyarrow_dict_to_arr
(dict_array)¶ - Parameters
dict_array – PyArrow DictionaryArray
- Returns
PyArrow 1d Array
-
static
pyarrow_dtype_from_string
(dtype, dictionary=False, precision=9, scale=3)¶ - Parameters
dtype – string that specifies the PyArrow data type
dictionary – pyarrow dictionary data type, i.e. pa.dictionary(pa.int32(), pa.vtype())
precision – for decimal128bit width arrow data type (number of digits in the number - integer+fractional)
scale – for decimal128bit width arrow data type (number of digits for the fractional part)
- Returns
pyarrow data type from a string
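Illustrative calls; the exact set of accepted dtype strings is defined by the implementation, so 'int32' and 'decimal' here are assumptions.

    from hypermorph.utils import DSUtils

    int_type = DSUtils.pyarrow_dtype_from_string('int32')
    # precision and scale apply to the decimal128 case: 9 digits in total,
    # 3 of them in the fractional part.
    dec_type = DSUtils.pyarrow_dtype_from_string('decimal', precision=9, scale=3)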
-
static
pyarrow_get_dtype
(arr)¶ - Parameters
arr – PyArrow 1d Array either dictionary encoded or not
- Returns
value type of PyArrow array elements
-
static
pyarrow_record_batch_to_table
(batch)¶
-
static
pyarrow_sort
(array, ascending=True)¶ - Parameters
array – PyArrow Array
ascending –
- Returns
-
static
pyarrow_table_to_record_batch
(table)¶ - Parameters
table – PyArrow Table
- Returns
PyArrow RecordBatch
-
static
pyarrow_to_numpy
(pa_arr)¶ - Parameters
pa_arr – PyArrow 1d Array or DictionaryArray
- Returns
NumPy 1d array
-
static
pyarrow_vtype_to_numpy_vtype
(arr)¶ - Parameters
arr – PyArrow 1d Array
- Returns
NumPy value type that is equivalent of PyArrow value type
-
class
hypermorph.utils.
DotDict
¶ Bases:
dict
dot.notation access to dictionary attributes
Example:
person_dict = {'first_name': 'John', 'last_name': 'Smith', 'age': 32}
address_dict = {'country': 'UK', 'city': 'Sheffield'}
person = DotDict(person_dict)
person.address = DotDict(address_dict)
print(person.first_name, person.last_name, person.age, person.address.country, person.address.city)
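A minimal sketch of such a class; this is the classic recipe, not necessarily the shipped implementation.

    class DotDict(dict):
        # person.age reads person['age']; missing keys return None via dict.get
        __getattr__ = dict.get
        __setattr__ = dict.__setitem__   # person.x = 1  ->  person['x'] = 1
        __delattr__ = dict.__delitem__   # del person.x  ->  del person['x']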
-
class
hypermorph.utils.
FileUtils
¶ Bases:
object
-
static
change_cwd
(fpath)¶
-
static
feather_to_arrow_schema
(source)¶
-
static
feather_to_arrow_table
(file_location, select=None, limit=None, offset=None, **pyarrow_kwargs)¶ This is using pyarrow.feather.read_table() https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_table.html#pyarrow.feather.read_table
- Parameters
file_location – full path location of the file
select – use a subset of columns from feather file
limit – limit on the number of records to return
offset – exclude the first number of rows. Notice: do not confuse offset with the number of rows to skip at the start of the flat file; in pandas.read_csv, offset can also be used as skiprows
pyarrow_kwargs – other parameters that are passed to pyarrow.feather.read_table
- Returns
-
static
flatfile_delimiter
(file_type)¶ - Parameters
file_type – CSV, TSV; these have default delimiters ',' and '\t' (tab) respectively
- Returns
default delimiter or the specified delimiter in the argument
-
static
flatfile_drop_extention
(fname)¶
-
static
flatfile_header
(file_type, file_location, delimiter=None)¶ - Parameters
file_type – CSV, TSV; these have default delimiters ',' and '\t' (tab) respectively
delimiter – 1-character string specifying the boundary between fields of the record
file_location – full path location of the file with an extension (.tsv, .csv)
- Returns
field names in a list
-
static
flatfile_to_pandas_dataframe
(file_type, file_location, select=None, as_columns=None, as_types=None, index=None, partition_size=None, limit=None, offset=None, delimiter=None, nulls=None, **pandas_kwargs)¶ Read rows from flat file and convert them to pandas dataframe with pandas.read_csv https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
- Parameters
file_type – CSV, TSV; these have default delimiters ',' and '\t' (tab) respectively
file_location – full path location of the file
delimiter – 1-character string specifying the boundary between fields of the record
nulls – list of strings that denote nulls e.g. [‘N’]
partition_size – number of records to use for each partition or target size of each partition, in bytes
select – use a subset of columns from the flat file
as_columns – user specified column names for pandas dataframe (list of strings)
as_types – dictionary with column names as keys and data types as values this is used when we read data from flat files and we want to disable type inference on those columns
index – column names to be used in pandas dataframe index
limit – limit on the number of records to return
offset – exclude the first number of rows. Notice: do not confuse offset with the number of rows to skip at the start of the flat file; in pandas.read_csv, offset can also be used as skiprows
pandas_kwargs – other arguments of pandas read_csv method
- Returns
pandas dataframe
Example of read_csv():
read_csv(source, sep='|', index_col=False, nrows=10, skiprows=3, header=0,
usecols=['catsid', 'catpid', 'catcost', 'catfoo', 'catchk'],
dtype={'catsid': int, 'catpid': int, 'catcost': float, 'catfoo': float, 'catchk': bool},
parse_dates=['catdate'])
-
static
flatfile_to_pyarrow_table
(file_type, file_location, select=None, as_columns=None, as_types=None, partition_size=None, limit=None, offset=None, skip=0, delimiter=None, nulls=None)¶ Read columnar data from CSV files https://arrow.apache.org/docs/python/csv.html
- Parameters
file_type – CSV, TSV; these have default delimiters ',' and '\t' (tab) respectively
file_location – full path location of the file
delimiter – 1-character string specifying the boundary between fields of the record
nulls – list of strings that denote nulls e.g. [‘N’]
partition_size – number of records to use for each partition or target size of each partition, in bytes
select – list of column names to include in the pyarrow Table, default None (all columns)
as_columns – user specified column names for pandas dataframe (list of strings)
as_types – Map column names to column types (disabling type inference on those columns)
limit – limit on the number of rows to return
offset – exclude the first number of rows. Notice: do not confuse offset with skip; offset is applied after we read the table
skip – number of rows to skip at the start of the flat file
- Returns
pyarrow in-memory table
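An illustrative call; the file path and column names are assumptions.

    from hypermorph.utils import FileUtils

    tbl = FileUtils.flatfile_to_pyarrow_table(
        file_type='CSV',
        file_location='/data/physicians.csv',
        select=['npi', 'city', 'state'],   # read only these columns
        limit=1000)                        # cap on the number of rows returned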
-
static
flatfile_to_python_lists
(file_type, file_location, nrows=10, skip_rows=1, delimiter=None)¶ - Parameters
file_type – CSV, TSV; these have default delimiters ',' and '\t' (tab) respectively
delimiter – 1-character string specifying the boundary between fields of the record
file_location – full path location of the file with an extension (.tsv, .csv)
nrows – number of rows to read from the file
skip_rows – number of rows to skip, default 1 skip the header of the file
- Returns
rows of the file as python lists
-
static
get_cwd
()¶
-
static
get_filenames
(path, extension='json', window_title='Choose files', gui=False, select=None)¶
-
static
get_full_path
(path)¶
-
static
get_full_path_filename
(p, f)¶
-
static
get_full_path_parent
(path)¶
-
static
json_to_dict
(fname)¶
-
static
parquet_metadata
(source, **pyarrow_kwargs)¶
-
static
parquet_to_arrow_schema
(source, **pyarrow_kwargs)¶
-
static
parquet_to_arrow_table
(file_location, select=None, limit=None, offset=None, arrow_encoding=False, **pyarrow_kwargs)¶ This is using pyarrow.parquet.read_table() https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
- Parameters
file_location – full path location of the file
select – use a subset of columns from parquet file
limit – limit on the number of records to return
offset – exclude the first number of rows. Notice: do not confuse offset with the number of rows to skip at the start of the flat file; in pandas.read_csv, offset can also be used as skiprows
arrow_encoding – PyArrow dictionary encoding
pyarrow_kwargs – other parameters that are passed to pyarrow.parquet.read_table
- Returns
-
static
pyarrow_read_record_batch
(file_location, table=False)¶ - Parameters
file_location –
table –
- Returns
Either PyArrow RecordBatch, or PyArrow Table if table=True
-
static
pyarrow_table_to_feather
(table, file_location, **feather_kwargs)¶ Write a Table to Feather format :param table: pyarrow Table :param file_location: full path location of the feather file :param feather_kwargs: https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather :return:
-
static
pyarrow_table_to_parquet
(table, file_location, **pyarrow_kwargs)¶ Write a Table to Parquet format :param table: pyarrow Table :param file_location: full path location of the parquet file :param pyarrow_kwargs: row_group_size, version, use_dictionary, compression (see https://pyarrow.readthedocs.io/en/latest/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table) :return:
-
static
pyarrow_write_record_batch
(record_batch, file_location)¶ - Parameters
record_batch – PyArrow RecordBatch
file_location –
- Returns
-
static
write_json
(data, fname)¶
-
class
hypermorph.utils.
GenerativeBase
¶ Bases:
object
http://derrickgilland.com/posts/introduction-to-generative-classes-in-python/ A Python Generative Class is defined as a class that returns or clones, i.e. generates, itself when accessed by certain means. This type of class can be used to implement method chaining or to mutate an object's state without modifying the original class instance.
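A minimal generative class in this spirit; an assumed sketch, not the actual HyperMorph implementation.

    import copy

    class GenerativeBase:
        def _generate(self):
            # Clone self so each chained call mutates a fresh instance,
            # leaving the original object untouched.
            return copy.copy(self)

    class Pipe(GenerativeBase):
        def __init__(self):
            self.ops = []
        def over(self, select):
            new = self._generate()
            new.ops = self.ops + [('over', select)]   # rebind, don't mutate
            return new
        def out(self):
            return self.ops

    Pipe().over('nid, cname').out()   # [('over', 'nid, cname')]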
-
class
hypermorph.utils.
MemStats
¶ Bases:
object
Compare memory statistics with those of free -m. Units are in MiB (mebibytes), 1 MiB = 2^20 bytes
-
property
available
¶
-
property
buffers
¶
-
property
cached
¶
-
property
cpu
¶
-
property
difference
¶
-
property
free
¶
-
property
mem
¶
-
print_stats
()¶
-
property
total
¶
-
property
used
¶
-
class
hypermorph.utils.
PandasUtils
¶ Bases:
object
pandas dataframe utility methods
-
static
dataframe
(iterable, columns=None, ndx=None)¶ - Parameters
iterable – e.g. list like objects
columns – comma separated string or list of strings; labels to use for the columns of the resulting dataframe
ndx – comma separated string or list of strings; column names to use for the index of the resulting dataframe
- Returns
pandas dataframe with an optional index
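For instance (names are illustrative):

    from hypermorph.utils import PandasUtils

    rows = [(1, 'John'), (2, 'Jane')]
    # Per the docstring, columns and ndx accept a list of strings
    # (or a comma separated string).
    df = PandasUtils.dataframe(rows, columns=['id', 'name'], ndx=['id'])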
-
static
dataframe_cardinality
(df)¶
-
static
dataframe_concat_columns
(df1, df2)¶
-
static
dataframe_memory_usage
(df, deep=False)¶
-
static
dataframe_selectivity
(df)¶
-
static
dataframe_to_pyarrow_table
(df, columns=None, schema=None, index=False)¶ - Parameters
df – pandas dataframe
columns – List of columns to be converted. If None, use all columns
schema – the expected pyarrow schema of the pyarrow Table
index – Whether to store the index as an additional column in the resulting Table.
- Returns
pyarrow.Table
-
static
dataframes_to_html
(*df_stylers)¶
-
static
dict_to_dataframe
(d, labels)¶
-
hypermorph.utils.
bytes2mb
(b)¶
-
hypermorph.utils.
get_size
(obj)¶ sum size of object & members.
-
hypermorph.utils.
highlight_states
(s)¶
-
hypermorph.utils.
session_time
()¶
-
hypermorph.utils.
split_comma_string
(names)¶
-
hypermorph.utils.
sql_construct
(select, frm, where=None, group_by=None, having=None, order=None, limit=None, offset=None)¶
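A hedged example; whether the SQL keywords are supplied by the caller or added by the function is implementation-defined, so writing them out explicitly here is an assumption.

    from hypermorph.utils import sql_construct

    # Assemble a query from its clauses; clause text is illustrative.
    q = sql_construct(select='SELECT name, total_rows',
                      frm='FROM system.tables',
                      where="WHERE database = 'default'",
                      order='ORDER BY total_rows DESC',
                      limit=10)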
-
hypermorph.utils.
zip_with_scalar
(num, arr)¶ Use: to generate hyperbond (hb2, hb1), hyperatom (ha2, ha1) tuples :param num: scalar value :param arr: array of values :return: generator of tuples in the form (i, num) where i in arr
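Per the docstring, the generator pairs each element of arr with the scalar.

    from hypermorph.utils import zip_with_scalar

    pairs = list(zip_with_scalar(2, [10, 11, 12]))
    # pairs == [(10, 2), (11, 2), (12, 2)]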
Module contents¶
This file is part of HyperMorph operational API for information management and data transformations on Associative Semiotic Hypergraph Development Framework (C) 2015-2019 Athanassios I. Hatzis
HyperMorph is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License v.3.0 as published by the Free Software Foundation.
HyperMorph is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with HyperMorph. If not, see <https://www.gnu.org/licenses/>.