hypermorph package

Submodules

hypermorph.clients module

class hypermorph.clients.ConnectionPool(db_client, dialect=None, host=None, port=None, user=None, password=None, database=None, path=None, trace=0)

Bases: object

ConnectionPool manages connections to a DBMS and extends the API database clients: i) with useful introspective properties (api, database, sqlalchemy_engine, last_query, query_stats), ii) with a uniform SQL command interface (the sql method), and iii) with common methods to access database metadata (get_tables_metadata, get_columns_metadata)

HyperMorph currently supports the following three database API clients (self._api_name)
  1. Clickhouse-Driver

  2. MySQL-Connector

  3. SQLAlchemy with the following three dialects (self._sqlalchemy_dialect)
    1. pymysql

    2. clickhouse

    3. sqlite

Consequently, the various database APIs are categorized (self._api_category) as:
  1. MYSQL

  2. CLICKHOUSE

  3. SQLite
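
Example (a minimal sketch; the db_client string and all connection settings below are placeholders, not values confirmed by this documentation):

    from hypermorph.clients import ConnectionPool

    # Pool backed by the clickhouse-driver API client (hypothetical settings)
    pool = ConnectionPool(db_client='Clickhouse-Driver', host='localhost', port=9000,
                          user='demo', password='demo', database='TPCH', trace=1)

    # Uniform SQL command interface; by default the result set is a pandas dataframe
    df = pool.sql('SELECT 1')

    # Introspective properties
    print(pool.api_name, pool.api_category, pool.database)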

property api_category
property api_name
clickhouse_connections = 0
property connector
property database
get_columns_metadata(table=None, columns=None, fields=None, aggr=None, **kwargs)
Parameters
  • table – name of the table in database

  • columns – list of ClickHouse column names

  • fields – select specific metadata fields for the columns of a table in the database; dictionary metadata field names depend on the specific DBMS, e.g. MySQL, SQLite, ClickHouse, etc…

  • aggr – aggregate metadata results for the columns of a clickhouse table

  • kwargs – pass extra parameters to sql() method

Returns

metadata for the columns of table(s) in a database, e.g. name of column, default value, nullable, etc.

get_tables_metadata(fields=None, clickhouse_engine=None, name=None, **kwargs)
Parameters
  • clickhouse_engine – type of storage engine for clickhouse database

  • fields – select specific metadata fields for a table in the database; dictionary metadata field names depend on the specific DBMS, e.g. MySQL, SQLite, ClickHouse, etc…

  • name – table name regular expression e.g. name=’%200%’

  • kwargs – parameters passed to sql() method

Returns

metadata for the tables of a database e.g. name of table, number of rows, row length, collation, etc..
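
Example (a hedged sketch; the table name and the metadata field names are placeholders and depend on the specific DBMS):

    # Tables whose name matches a SQL LIKE pattern
    tables_meta = pool.get_tables_metadata(name='%200%')

    # Metadata for the columns of a hypothetical table, selecting DBMS-dependent fields
    columns_meta = pool.get_columns_metadata(table='supplier', fields='name, type')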

mysql_connections = 0
sql(query, **kwargs)

For kwargs and specific details of the implementation, see the connector class for the specific API, e.g. SQLAlchemy.sql() for the sqlalchemy database API.

Parameters
  • query – SQL query string

  • kwargs – pass other parameters to the sql() method of the connector class

Returns

result set represented with a pandas dataframe, tuples, …

property sqlalchemy_dialect
sqlite_connections = 0

hypermorph.connector_clickhouse_driver module

class hypermorph.connector_clickhouse_driver.ClickHouse(host, port, user, password, database, trace=0)

Bases: object

ClickHouse class is based on the clickhouse-driver Python API for the ClickHouse DBMS

property api_category
property connection
create_engine(table, engine, heading, partition_key=None, order_key=None, settings=None, execute=True)
Parameters
  • table – name of the ClickHouse table

  • engine – the type of clickhouse engine

  • settings – clickhouse engine settings

  • heading – list of field names paired with ClickHouse data types, e.g. [('fld1_name', 'dtype1'), ('fld2_name', 'dtype2'), …, ('fldN_name', 'dtypeN')] (see the example after this parameter list)

  • partition_key

  • order_key

  • execute

Returns

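Example (a minimal sketch, assuming a MergeTree engine; the table and field names are hypothetical):

    # ch is a ClickHouse connector instance (see the class signature above)
    heading = [('prtID', 'UInt32'), ('prtnam', 'String'), ('prtwgt', 'Float32')]
    ch.create_engine(table='parts', engine='MergeTree', heading=heading,
                     order_key='prtID', execute=True)
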
property cursor
disconnect()
get_columns(table=None, columns=None, fields=None, aggr=None, **kwargs)
Parameters
  • table – ClickHouse table name

  • columns – list of ClickHouse column names

  • fields – Metadata fields for columns

  • aggr – aggregate metadata results for the columns of a clickhouse table

Returns

metadata for clickhouse columns

get_mutations(table, limit=None, group_by=None, execute=True)
Parameters
  • table – clickhouse table

  • group_by

  • limit – SQL limit

  • execute

Returns

get_parts(table, hb2=None, active=True, execute=True)
Parameters
  • table – clickhouse table

  • hb2 – select parts with a specific hb2 dimension (hb2 is the dim2 of the Entity/ASET key) default hb2=’%’

  • active – select only active parts

  • execute – Execute the command only if execute=True

Returns

information about parts of MergeTree tables

get_query_log(execute=True)
property last_query_statistics
optimize_engine(table, execute=True)
property print_query_statistics
sql(sql, out='dataframe', as_columns=None, index=None, partition_size=None, arrow_encoding=True, params=None, qid=None, execute=True, trace=None)

This method calls the clickhouse-driver execute() method to execute an SQL query; the connection has already been established.

Parameters
  • sql – ClickHouse SQL query string that will be sent to the server

  • out – output format, i.e. the Python data structure that will represent the result set: dataframe, tuples, json_rows

  • as_columns – user specified column names for pandas dataframe, (list of strings, or comma separated string)

  • index – pandas dataframe columns

  • arrow_encoding – PyArrow columnar dictionary encoding

  • arrow_table – Output is PyArrow Table, otherwise it is a PyArrow RecordBatch

  • partition_size – ToDo number of records to use for each partition or target size of each partition, in bytes

  • params – clickhouse-client execute parameters

  • qid – query identifier. If no query id specified ClickHouse server will generate it

  • execute – execute SQL commands only if execute=True

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

Returns

result set formatted according to the out parameter
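
Example (a hedged sketch; ch is a ClickHouse connector instance and the table name is a placeholder):

    # Result set as a pandas dataframe, tracing the execution
    df = ch.sql('SELECT * FROM parts LIMIT 10', out='dataframe', trace=1)

    # Result set as tuples
    rows = ch.sql('SELECT count() FROM parts', out='tuples')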

hypermorph.connector_mysql module

class hypermorph.connector_mysql.MySQL(host, port, user, password, database, trace=0)

Bases: object

property api_category
close()
property connection
property cursor
property last_query
set_cursor(buffered=True, raw=None, dictionary=None, named_tuple=None)
sql(sql, out='dataframe', as_columns=None, index=None, partition_size=None, arrow_encoding=True, execute=True, buffered=True, trace=None)

This method calls the cursor.execute() method of mysql.connector to execute an SQL query; the connection has already been established.

Parameters
  • sql – MySQL query string that will be sent to the server

  • out – output format, i.e. the Python data structure that will represent the result set: dataframe, tuples, named_tuples, json_rows

  • partition_size – ToDo number of records to use for each partition or target size of each partition, in bytes

  • arrow_encoding – PyArrow columnar dictionary encoding

  • as_columns – user specified column names for pandas dataframe (list of strings, or comma separated string)

  • index – column names to be used in pandas dataframe index

  • execute – execute SQL commands only if execute=True

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

  • buffered – MySQLCursorBuffered cursor fetches the entire result set from the server and buffers the rows. For nonbuffered cursors, rows are not fetched from the server until a row-fetching method is called.

Returns

result set formatted according to the out parameter

For more details about MySQLCursor class execution see https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor.html
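
Example (a hedged sketch; my is a MySQL connector instance and the table name is a placeholder):

    df = my.sql('SELECT * FROM customers LIMIT 10', out='dataframe')

    # Named tuples from a buffered cursor
    recs = my.sql('SELECT * FROM customers LIMIT 10', out='named_tuples', buffered=True)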

hypermorph.connector_sqlalchemy module

class hypermorph.connector_sqlalchemy.SQLAlchemy(dialect=None, host=None, port=None, user=None, password=None, database=None, path=None, trace=0)

Bases: object

property api_category
property connection
property cursor
property engine
property last_query
property last_query_stats
sql(sql, out='dataframe', execute=True, trace=None, arrow_encoding=True, as_columns=None, index=None, partition_size=None, **kwargs)
Parameters
  • sql – SQL query string that will be sent to the server

  • out – output format e.g. dataframe, tuples, ….

  • execute – flag to enable execution of SQL statement

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

  • partition_size – number of records to use for each partition or target size of each partition, in bytes

  • arrow_encoding – PyArrow columnar dictionary encoding

  • as_columns – user specified column names for pandas dataframe (list of strings, or comma separated string)

  • index – column names to be used in pandas dataframe index

  • kwargs – parameters passed to pandas.read_sql() method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html

Returns

result formatted according to the out parameter
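
Example (a hedged sketch; sa is an SQLAlchemy connector instance, e.g. with the sqlite dialect, and the table/column names are placeholders):

    df = sa.sql('SELECT * FROM lineitem LIMIT 5', out='dataframe', index='l_orderkey')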

hypermorph.data_graph module

class hypermorph.data_graph.GData(rebuild=False, load=False, **graph_properties)

Bases: object

GData class represents ABox "assertion components", i.e. facts associated with a terminological vocabulary. Such a fact is an instance of an HB-HAs association. ABox statements are TBox-compliant statements about that vocabulary. Each instance of an HB-HAs association is compliant with the model (schema) of Entity-Attributes.

GData of HyperMorph is represented with a directed graph that is based on the graph_tool Python module. GData is composed of DataNodes and DataEdges. Each DataEdge links two DataNodes, and we define a direction convention from a tail DataNode to a head DataNode.

HyperAtom, HyperBond classes are derived from DataNode class

GData of HyperMorph is a hypergraph defined by two sets of objects (a.k.a. hyper-atoms HAs and hyper-bonds HBs). If we have 'hyper-bonds' HB={hb1, hb2, hb3} and 'hyper-atoms' HA={ha1, ha2, ha3}, then we can make a map such as d = {hb1: (ha1, ha2), hb2: (ha2,), hb3: (ha1, ha2, ha3)}; G(HB, HA, d) is the hypergraph
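
For illustration, the map d from the paragraph above written as a plain Python dict (string keys and values stand in for HyperBond/HyperAtom objects):

    # each hyper-bond maps to the tuple of hyper-atoms it connects
    d = {'hb1': ('ha1', 'ha2'),
         'hb2': ('ha2',),
         'hb3': ('ha1', 'ha2', 'ha3')}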

add_edge(from_vertex, to_vertex)

Used in GDataLink to create a new instance of an edge.

Parameters
  • from_vertex – tail vertex

  • to_vertex – head vertex

Returns

an edge of the GData Graph

The method updates the GData graph with vertices (nodes) and edges (links) that are related to the parameter hlinks. It also adds dim2, dim1 and ntype vertex properties.

Parameters
  • hlinks – a list of hyperlinks in the form [((hb2, hb1), (ha2, ha1)), …, ((hb2, hb1), (ha2, ha1))]; a hyperlink is defined as the edge that connects a HyperBond with a HyperAtom, i.e. HB(hb2, hb1) —> HA(ha2, ha1)

  • hb2 – dim2 value for hyperbonds; it is set at a high enough value to filter them later on in the graph of data

In a table of data (hb2), the row with pk=hb1 is associated (linked) to a column of data (ha2) with indices (ha1): hb2 (uint16 > 10000) represents a data table; hb1 (uint32) represents a data table row, or pk index; ha2 (uint16 < 10000) represents a column of the data table; ha1 (uint32) represents a unique value, the secondary index value of the specific column (ha2)

Therefore the set of hyperlinks (HBi —> HA1, HBi —> HA2, …, HBi —> HAn) transforms the tuple of a Relation into an association between the table row and the column values/indices

This association is graphically represented on a hypergraph with a hyperedge (HB) that connects many hypernodes (HAs)

Parameters
  • from_node – tail node is a GDataNode object or node ID

  • to_node – head node is a GDataNode object or node ID

If there isn't a link from the tail node to the head node, it will try to create a new one; otherwise it will return an existing GDataLink instance

Returns

GDataLink object, i.e. an edge of the GData graph

add_node(**nprops)
Parameters

nprops – GData node (vertex) properties

Returns

HyperBond object

add_values(string_values, hb2=10000)

Create and set a value vertex property.

Parameters
  • string_values – string_repr, the string representation of ha2 column UNIQUE data values with a NumPy array of dtype=str

  • hb2 – dim2 value for hyperbonds; it is set at a high enough value to filter them later on in the graph of data

Returns

add_vertex(**vprops)

Used in GDataNode to create a new instance of a node.

Parameters

vprops – GData vertex properties

Returns

a vertex of the GData Graph

add_vertices(n)
at(dim2, dim1)
Parameters
  • dim2 – ha2 dimension of hyperatom or hb2 dimension of hyperbond

  • dim1 – ha1 dimension of hyperatom or hb1 dimension of hyperbond

Returns

the node of the graph with the specific dimensions

property dim1
property dim2
get(nid)
Parameters

nid – Node ID (vertex id)

Returns

GDataNode object

get_node_by_id(nid)
Parameters

nid – node ID (vertex id)

Returns

GDataNode object from the derived class, i.e. HyperAtom, HyperBond object see class_dict

get_node_by_key(dim2, dim1)
Parameters
  • dim2

  • dim1

Returns

object with the specific key

get_vp(vp_name)
Parameters

vp_name – vertex property name

Returns

VertexPropertyMap object

get_vp_value(vp_name, vid)
Parameters
  • vp_name – vertex property name

  • vid – either vertex object or vertex index (node id)

Returns

the value of vertex property on the specific vertex of the graph

get_vp_values(vp_name, filtered=False)
property graph
property graph_properties
property graph_view
property is_filtered
property is_view_filtered
property list_properties
property net_alias
property net_descr
property net_edges
property net_format
property net_name
property net_path
property net_tool
property net_type
property ntype
save_graph()

Save HyperMorph GData._graph using the self._net_name, self._net_path and self._net_format

set_filter(vmask, inverted=False)

This filters the GData._graph instance. Only the vertices with a value different from False are kept in the filtered graph.

Parameters
  • vmask – boolean mask for the vertices of the graph

  • inverted – if it is set to TRUE only the vertices with value FALSE are kept.

Returns

the filtered state of the graph

set_filter_view(vmask)

GData._graph_view is a filtered view of the GData._graph; in that case the state of the graph is not affected by the filtering operation, i.e. after filtering GData._graph has the same vertices and edges as before filtering.

Parameters

vmask – boolean mask for the vertices of the graph

Returns

filtered state of the graph view

unset_filter()

Reset the filtering of the GData._graph instance.

Returns

the filtered state

unset_filter_view()
property value
property vertex_properties
property vertices
property vertices_view
property vid
property vids
property vids_view
property vmask
hypermorph.data_graph.int_to_class(class_id)
Parameters

class_id – (0 - ‘HyperAtom’) or (1 - ‘HyperBond’)

Returns

a class that is used in get(), get_node_by_id() methods

hypermorph.data_graph_hyperatom module

class hypermorph.data_graph_hyperatom.HyperAtom(gdata, vid=None, **node_properties)

Bases: hypermorph.data_graph_node.GDataNode

hypermorph.data_graph_hyperbond module

class hypermorph.data_graph_hyperbond.HyperBond(gdata, vid=None, **node_properties)

Bases: hypermorph.data_graph_node.GDataNode

hypermorph.data_graph_node module

class hypermorph.data_graph_node.GDataNode(gdata, vid=None, **vprops)

Bases: object

The GDataNode class:
  1. if vid is None

    create a NEW node, i.e. a new vertex on the graph with properties

  2. if vid is not None

    initialize a node that is represented with an existing vertex with vid

property all
property all_edges_ids
property all_nids
property all_nodes
property all_vertices
property gdata
get_value(prop_name)
Parameters

prop_name – Vertex property name (vp_names) or @property function name (calculated_properties) or data type properties (field_meta)

Returns

the value of property for the specific node

property in_edges_ids
property in_nids
property in_nodes
property in_vertices
property key
property out_edges_ids
property out_nids
property out_nodes
property out_vertices
property vertex

hypermorph.data_pipe module

class hypermorph.data_pipe.DataPipe(schema_node, result=None)

Bases: hypermorph.utils.GenerativeBase

Implements method chaining: a query operation, e.g. projection, counting, filtering, can invoke multiple method calls. Each method corresponds to a query operator such as get_components().over().to_dataframe().out()

The out() method is always at the end of the chained generative methods to return the final result

Each one of these operators returns an intermediate result (self.fetch), allowing the calls to be chained together in a single statement.

DataPipe methods such as get_rows() are wrapped inside methods of other classes, e.g. get_rows() in Table(SchemaNode), so that when they are called from these methods the result can be chained to other methods of DataPipe. In that way we implement transformations and conversions to multiple output formats easily and intuitively

Notice: we distinguish between two different execution types according to the evaluation of the result
  1. Lazy evaluation, see for example to_***() methods

  2. Eager evaluation

This module has a dual combined purpose:
  1. perform transformations from one data structure to another data structure

  2. load data into volatile memory (RAM, DRAM, SDRAM, SRAM, GDDR) or import data into non-volatile storage (NVRAM, SSD, HDD, Database) with a specific format e.g. parquet, JSON, ClickHouse MergeTree engine table, MYSQL table, etc…

Transformation, importing and loading operations are based on pyarrow/numpy library and ClickHouse columnar DBMS
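
Example (a minimal sketch of a chained pipeline; tbl is assumed to be a Table(SchemaNode) object and the column names are placeholders):

    df = (tbl.get_rows().
          over(select='npi, city, last, first').
          to_dataframe(index='npi').
          out())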

exclude(select=None)
Parameters

select – columns to exclude from the projection

Returns

get_columns()

Wrapped in the Table(SchemaNode) class.

Returns

pass self.fetch to the next chainable operation

get_rows(npartitions=None, partition_size=None)

Wrapped in the Table(SchemaNode) class. Fetch either records of an SQL table or rows of a flat file. Notice: specify either the npartitions or the partition_size parameter, or neither of them.

Parameters
  • npartitions – split the values of the index column linearly; slice() will have the effect of modifying the split accordingly

  • partition_size – number of records to use for each partition or target size of each partition, in bytes

Notice: setting npartitions or partition_size will perform a lazy evaluation and return a generator object

Returns

pass self.fetch to the next chainable operation

order_by(columns)
Parameters

columns – comma separated string column names to sort by

Returns

out(lazy=False)

We distinguish between two cases, eager vs lazy evaluation. This is particularly useful when we deal with very large dataframes that do not fit in memory

Parameters

lazy – flag for lazy evaluation

Returns

use the out() method at the end of the chained generative methods to return the output of SchemaNode objects displayed with the appropriate specified format and structure

over(select=None, as_names=None, as_types=None)
Notice: over() must be present in method chaining when you fetch data by constructing and executing an SQL query; in that case the default projection is self._project = ' * '

Parameters
  • select – projection over the selected metadata columns

  • as_names – list of user-specified column names to use for the resulting frame; these are used: i) to rename columns (SQL AS operator), ii) to extend the result set with calculated columns from an expression

  • as_types – list of data types or comma separated string of data types (pandas data types), e.g. used when we read data from flat files with pandas.read_csv and we want to disable type inference on those columns

Returns

pass self.fetch to the next chainable operation

property schema_node
slice(limit=None, offset=0)
Parameters
  • limit – number of rows to return from the result set

  • offset – number of rows to skip from the result set

Returns

SQL statement

property sql_query
to_batch(delimiter=None, nulls=None, skip=0, trace=None, arrow_encoding=True)
Parameters
  • delimiter – 1-character string specifying the boundary between fields of the record

  • nulls – list of strings that denote nulls e.g. [‘N’]

  • skip – number of rows to skip at the start of the flat file

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

  • arrow_encoding – apply PyArrow columnar dictionary encoding

Returns

PyArrow RecordBatch with optionally dictionary encoded columns

to_dataframe(data=None, index=None, delimiter=None, nulls=None, trace=None)
Parameters
  • data – ndarray (structured or homogeneous), Iterable, dict

  • index – column names of the result set to use in pandas dataframe index

  • delimiter – 1-character string specifying the boundary between fields of the record

  • nulls – list of strings that denote nulls e.g. [‘N’]

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

Returns

pandas dataframe

to_feather(path, **feather_kwargs)
Parameters
  • path – full path of the feather file

  • feather_kwargs

Returns

file_location

to_parquet(path, **parquet_kwargs)
Parameters
  • path – full path string of the parquet file

  • parquet_kwargs – row_group_size, version, use_dictionary, compression (see https://pyarrow.readthedocs.io/en/latest/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table)

Returns

file_location

to_table(delimiter=None, nulls=None, skip=0, trace=None, arrow_encoding=True)
Notice 1: This is a transformation from a row layout to a column layout, i.e. chained to the get_rows() method. The dictionary-encoded columnar layout is a fundamental component of the HyperMorph associative engine.

Notice 2: The output is a PyArrow Table data structure with a columnar layout, NOT a row layout.

Notice 3: The method is also used when we fetch columns directly from a columnar data storage, e.g. a ClickHouse columnar database or Parquet files, i.e. chained to the get_columns() method.

Parameters
  • delimiter – 1-character string specifying the boundary between fields of the record

  • nulls – list of strings that denote nulls e.g. [‘N’]

  • skip – number of rows to skip at the start of the flat file

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

  • arrow_encoding – apply PyArrow columnar dictionary encoding

Returns

PyArrow in-memory table with a columnar data structure with optionally dictionary encoded columns

to_tuples(trace=None)

ToDo: NumPy structured arrays representation…

Parameters

trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

Returns

where(condition=None)

hypermorph.draw_hypergraph module

class hypermorph.draw_hypergraph.IHyperGraphPlotter(edges, vertex_labels, vertex_colors)

Bases: object

This module draws a hypergraph from edges using the igraph library

plot(**kwargs)
Parameters

kwargs – pass parameters to igraph plot function

Returns

hypermorph.exceptions module

exception hypermorph.exceptions.ASETError

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to construct an AssociativeSet instance

exception hypermorph.exceptions.AssociationError

Bases: hypermorph.exceptions.HyperMorphError

exception hypermorph.exceptions.ClickHouseException

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to execute query in ClickHouse

exception hypermorph.exceptions.DBConnectionFailed

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to create a connection with the database

exception hypermorph.exceptions.GraphError

Bases: hypermorph.exceptions.HyperMorphError

Raised in Schema methods

exception hypermorph.exceptions.GraphLinkError

Bases: hypermorph.exceptions.HyperMorphError

Raised in SchemaLink methods

exception hypermorph.exceptions.GraphNodeError

Bases: hypermorph.exceptions.HyperMorphError

Raised in SchemaNode methods or in any of the methods of SchemaNode subclasses

exception hypermorph.exceptions.HACOLError

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to initialize HACOL

exception hypermorph.exceptions.HyperMorphError

Bases: Exception

Base class for all HyperMorph-related errors

exception hypermorph.exceptions.InvalidAddOperation

Bases: hypermorph.exceptions.HyperMorphError

Raised when you call DataManagementFramework.add() with invalid parameters

exception hypermorph.exceptions.InvalidDelOperation

Bases: hypermorph.exceptions.HyperMorphError

Raised when you call DataManagementFramework.del() with invalid parameters

exception hypermorph.exceptions.InvalidEngine

Bases: hypermorph.exceptions.HyperMorphError

Raised when we pass a wrong type of HyperMorph engine

exception hypermorph.exceptions.InvalidGetOperation

Bases: hypermorph.exceptions.HyperMorphError

Raised when you call DataManagementFramework.get() with invalid parameters

exception hypermorph.exceptions.InvalidPipeOperation

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to execute an operation in a pipeline

exception hypermorph.exceptions.InvalidSQLOperation

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to execute an SQL command

exception hypermorph.exceptions.InvalidSourceType

Bases: hypermorph.exceptions.HyperMorphError

Raised when we pass a wrong source type of HyperMorph

exception hypermorph.exceptions.MISError

Bases: hypermorph.exceptions.HyperMorphError

Raised in operations with DataDictionary

exception hypermorph.exceptions.PandasError

Bases: hypermorph.exceptions.HyperMorphError

Raised when it fails to construct pandas dataframe

exception hypermorph.exceptions.UnknownDictionaryType

Bases: hypermorph.exceptions.HyperMorphError

Raised when trying to add a term to the dictionary with an unknown type. Types can be either:

HyperEdges, i.e. instances of the TBoxTail class: DRS, DMS, DLS – (dim4, 0, 0); HLT, DS, DM – (dim4, dim3, 0)

HyperNodes, i.e. instances of the TBoxHead class: TSV, CSV, FLD – (dim4, dim3, dim2); ENT, ATTR – (dim4, dim3, dim2)

exception hypermorph.exceptions.UnknownPrimitiveDataType

Bases: hypermorph.exceptions.HyperMorphError

Primitive Data Types are: [‘bln’, ‘int’, ‘flt’, ‘date’, ‘time’, ‘dt’, ‘enm’, ‘uid’, ‘txt’, ‘wrd’]

exception hypermorph.exceptions.WrongDictionaryType

Bases: hypermorph.exceptions.HyperMorphError

Raised when we attempt to call a specific method on an object that has the wrong node type

hypermorph.hacol module

class hypermorph.hacol.HAtomCollection(attribute, data)

Bases: object

A HyperAtom Collection (HACOL) can be:
  1. A set of hyperatoms (HACOL_SET) that represents the domain of values for a specific attribute

  2. A multiset of hyperatoms (HACOL_BAG) that represents a column of data in a table. Each hyperatom may appear multiple times in this collection because each hyperatom is linked to one or more hyperbonds (MANY-TO-MANY relationship)

  3. A set of values of a specific data type (HACOL_VAL) where each value is associated with a hyperatom from the set of hyperatoms (HACOL_SET) to form a KV pair.

The set of KV pairs represents the domain of a specific attribute where: K is the key of a hyperatom with dimensions (dim3-model, dim2-attribute, dim1-distinct value) and V is the data type value

HyperAtoms can be displayed with K, V or K:V pair

All hyperatoms in (1), (2) and (3) have common dimensions (dim3, dim2) i.e. same model, same attribute

HACOL brings together, but at the same time keeps separate, the following under the same object:

metadata stored in an Attribute of the DataModel; data (self._data) stored in a PyArrow DictionaryEncoded Array object. Notice: data points to a DictionaryEncoded Array object which is a column of a PyArrow Table.

count(dataframe=True)
property data
dictionary(columns=None, index=None, order_by=None, ascending=None, limit=None, offset=0)
Parameters
  • columns – list (or comma separated string) of column names for pandas dataframe

  • index – list (or comma separated string) of column names to include in pandas dataframe index

  • order_by – str or list of str Name or list of names to sort by

  • ascending – bool or list of bool, default True the sorting order

  • limit – number of records to return from states dictionary

  • offset – number of records to skip from states dictionary

Returns

states dictionary of HACOL
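
Example (a hedged call; the column names of the states dictionary are placeholders):

    # hacol is a HAtomCollection instance
    states = hacol.dictionary(columns='value, freq', order_by='freq', ascending=False, limit=10)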

property filtered
property filtered_data
property hatoms_included
is_filtered()
Returns

The filtered state of the HACOL

memory_usage(mb=True, dataframe=True)
property pipe

Returns a HACOLPipe GenerativeBase object that refers to an instance of a HyperCollection; use this object to chain operations and to update the state of the HyperCollection instance.

print_states(limit=10)

Wrapper for dictionary().

Parameters

limit – number of entries to return from the states dictionary

Returns

property q

Wrapper for the starting point of a query pipeline.

reset()
update_frequency_include_color_state(indices)

In associative filtering we update frequency, include and color state for ALL HACOLs

Parameters

indices – unique indices of filtered values (pyarrow.lib.Int32Array) these are values that are included in a column of a filtered table

Returns

update_select_state(indices)
Parameters

indices – unique indices of the selected values (pyarrow.lib.Int32Array)

Returns

property values_included

hypermorph.hacol_pipe module

class hypermorph.hacol_pipe.HACOLPipe(hacol, result=None)

Bases: hypermorph.utils.GenerativeBase

And()

ToDo: …

In()

ToDo: …

1st case: comma separated string or list of string values, e.g. 'Fairfax Village, Anacostia Metro, Thomas Circle, 15th & Crystal Dr' or ('Fairfax Village', 'Anacostia Metro', 'Thomas Circle', '15th & Crystal Dr')

2nd case: list of numeric values, e.g. (31706, 31801, 31241, 31003)

Not()

ToDo: …

Or()

ToDo: …

between(low, high, low_open=False, high_open=False)

ToDo: … scalar operations with an interval.

Parameters
  • low – lower limit point

  • high – upper limit point

  • low_open

  • high_open

closed interval (default) —> low_open=False, high_open=False; open interval —> low_open=True, high_open=True; half-open interval —> low_open=False, high_open=True; half-open interval —> low_open=True, high_open=False

Returns

BooleanArray Mask that is used in filter()

count(dataframe=True)
Parameters

dataframe – flag to display output with a Pandas dataframe

Returns

number of values in filtered/unfiltered state number of hatoms in filtered/unfiltered state

filter(mask=None)

It uses a boolean array mask (self.fetch), constructed in a previous chained operation, to filter HACOL data represented with a DictionaryArray

Parameters

mask – this is used when we call filter() externally from ASETPipe.filter() method to update the filtering state of HACOL

Returns

DictionaryArray, i.e. HACOL.data filtered; the filtered DictionaryArray is pointed at by self._hacol.filtered_data

like(pattern)

Notice: the like operator can also be used in where() as a string.

Parameters

pattern (str) – match substring in column string values

Returns

PyArrow Boolean Array mask (self.fetch) that is used in filter(); it also returns the boolean mask to calls from the ASETPipe.where() and ASETPipe.And() methods

out(lazy=False)

We distinguish between two cases, eager vs lazy evaluation. This is particularly useful when we deal with very large HyperAtom collections that do not fit in memory

Parameters

lazy – flag for lazy evaluation

Returns

use the out() method at the end of the chained generative methods to return the output displayed with the appropriate specified format and structure

slice(limit=None, offset=0)

slice is used either to limit the number of entries to return in the states dictionary or to limit the members of HyperAtom collection, i.e. hyperatoms (values)

Parameters
  • limit – number of records to return from the result set

  • offset – number of records to skip from the result set

Returns

A slice of records

start()

This is used as the first method in a chain of other methods, where we set the filtered/unfiltered data; the pipeline methods slice(), to_array(), to_numpy(), to_series() start here.

Returns

DictionaryArray either in filtered or unfiltered state

to_array(order=None, unique=False)
Parameters
  • order – default None, ‘asc’, ‘desc’

  • unique – take distinct elements in array

Returns

by default PyArrow Array or PyArrow DictionaryArray if dictionary=False

Parameters

hb2 – dim2 value for hyperbonds, it is set at a high enough value >10000 to filter them later on in the graph of data

Returns

HyperLinks (edges that connect a HyperBond with HyperAtoms) List of pairs in the form [ ((hb2, hb1), (ha2, ha1)), ((hb2, hb1), (ha2, ha1)), …] These are used to create a data graph

to_numpy(order=None, limit=None, offset=0)
Parameters
  • order – default None, ‘asc’, ‘desc’

  • limit – number of values to return from HACOL

  • offset – number of values to skip from HACOL

Returns

to_series(order=None, limit=None, offset=0)
Parameters
  • order – default None, ‘asc’, ‘desc’

  • limit – number of values to return from HACOL

  • offset – number of values to skip from HACOL

Returns

Pandas Series

to_string_array()
Returns

List of string values This is a string representation for the valid (non-null) values of the filtered HACOL It is used in the construction of a data graph to set the value property of the node

where(condition='$v')

Example: phys.q.where('city like ATLANTA'). Notice: entering the where() method, self.fetch = self._hacol.filtered_data

Thus pc.match_substring(), pc.greater(), pc.equal(), etc. are applied to either an already filtered or an unfiltered (self._hacol.filtered_data = self._hacol.data) DictionaryArray

Parameters

condition

Returns

PyArrow Boolean Array mask (self.fetch) that is used in filter(); it also returns the boolean mask to calls from the ASETPipe.where() and ASETPipe.And() methods
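
Example (a sketch based on the phys.q.where() example above; filter() applies the mask and out() returns the result):

    phys.q.where('city like ATLANTA').filter().out()

    # equivalent restriction with like()
    phys.q.like('ATLANTA').filter().out()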

hypermorph.haset module

class hypermorph.haset.ASET(entity, debug)

Bases: object

An AssociativeSet, also called an AssociativeEntitySet, is ALWAYS bound to a SINGLE entity. An AssociativeSet is a Set of Association objects (see the Association class). An AssociativeSet can also be represented with a set of HyperBonds

There is a direct analogy with the Relational model:

Relation : a set of tuples —-> Associative Set : a set of Associations

Body : tuples of ordered values —-> Body : Associations

Heading : a tuple of ordered attribute names —-> Heading : a set of attributes

View : derived relation —-> Associative View : a derived set of Associations

ASET brings together, but at the same time keeps separate, the following under the same object:

metadata stored in an Entity of the DataModel; data (self._data) stored in a PyArrow DictionaryEncoded Table object from one or more DataSet(s)

property attributes
count()

Wrapper for the ASETPipe.count() method.

property data
dictionary_encode(delimiter=None, nulls=None, skip=0, trace=None)

It will load data from the DataSet; it currently supports tabular format (rows or columns of a data table) and will apply PyArrow DictionaryArray encoding to the columns

Parameters
  • delimiter – 1-character string specifying the boundary between fields of the record

  • nulls – list of strings that denote nulls e.g. [‘N’]

  • skip – number of rows to skip at the start of the flat file

  • trace – trace execution of query, i.e. print query, elapsed time, rows in set, etc.

Returns

PyArrow RecordBatch constructed with DictionaryEncoded Array objects

property entity
property filtered
property filtered_data
property hacols
property hbonds
is_filtered()
Returns

The filtered state of ASET

property mask
memory_usage(mb=True, dataframe=True)
Parameters
  • mb – output units MegaBytes

  • dataframe – flag to display output with a Pandas dataframe

Returns

property num_rows
property pipe

Returns an ASETPipe GenerativeBase object that refers to an instance of a HyperCollection; use this object to chain operations and to update the state of the HyperCollection instance.

print_rows(select=cname_list, order_by='city, last, first', limit=20, index='npi, pacID')
Parameters
  • select

  • as_names

  • index

  • order_by

  • ascending

  • limit

  • offset

Returns

property q

Wrapper for the starting point of a query pipeline.

reset(hacols_only=False)
ASET reset includes:

construction of a PyArrow Boolean Array mask with ALL True; reset of the filtered state to False; reset of HyperBonds; reset of HACOLs

Parameters

hacols_only – Flag for partial reset of HACOLs only

Returns

property select

Wrapper for the starting point of a query pipeline in associative filtering mode.

update_hacols_filtered_state()

Update the filtering state of HyperAtom collections. This is used when we want to operate on HyperAtom collections in the filtered state: <aset>.<hacol>.<operation>

For a single HACOL we can also use the form <aset>.<hacol>.q.filter(<aset.mask>).<operation>.out()

hypermorph.haset_pipe module

class hypermorph.haset_pipe.ASETPipe(aset, result=None)

Bases: hypermorph.utils.GenerativeBase

And(condition)
Parameters

condition

Returns

BooleanArray Mask that is used in filter()

count()
Returns

number of hbonds (rows) in filtered/unfiltered state

filter()
Returns

out(lazy=False)

We distinguish between two cases, eager vs lazy evaluation. This is particularly useful when we deal with very large dataframes that do not fit in memory

Parameters

lazy – flag for lazy evaluation

Returns

use the out() method at the end of the chained generative methods to return the output displayed with the appropriate specified format and structure

over(select=None, as_names=None, as_types=None)

Notice: over(), i.e. projection is chained after the filter() method

Parameters
  • select – projection over the selected metadata columns

  • as_names – list of column names to use for resulting dataframe List of user-specified column names, these are used: i) to rename columns (SQL as operator) ii) to extend the result set with calculated columns from an expression

  • as_types – list of data types or comma separated string of data types

Returns

RecordBatch

select()
Warning: DO NOT CONFUSE select() with over() operator

In HyperMorph select() is used as a flag to alter the state of HyperAtom collections This is the associative filtering that takes place where we

  1. Change the filtering state of HyperAtom collections

  2. Update the selection, included states for each member of the HyperAtom collection

From an end-user perspective that results in selecting values from a HyperAtom collection

Notice: In associative filtering mode we use only where() restriction

and we filter with values from a SINGLE HyperAtom collection

Returns

slice(limit=None, offset=0)
Parameters
  • limit – number of records to return from the result set

  • offset – number of records to skip from the result set

Returns

A slice of records

start()

This is used as the first method in a chain of other methods where we set the filtered/unfiltered data pipeline methods over(), slice(), to_record_batch(), to_records(), to_table(), to_dataframe() start here :return: RecordBatch either in filtered or unfiltered state

to_dataframe(index=None, order_by=None, ascending=None, limit=None, offset=0)
Notice 1: Use the to_record_batch() transformation before chaining it to a Pandas DataFrame; it is a lot faster this way because it decodes the PyArrow RecordBatch, i.e. the RecordBatch columns are not dictionary encoded.

Notice 2: Sorting (order_by, ascending) and slicing (limit, offset) in a Pandas dataframe are slow, but sorting has not been implemented in PyArrow and that is why we pass these parameters here.

Parameters
  • order_by – str or list of str Name or list of names to sort by

  • ascending – bool or list of bool, default True the sorting order

  • limit – number of records to return from the result set

  • offset – number of records to skip from the result set

  • index – list (or comma separated string) of column names to include in pandas dataframe index

Returns

Pandas dataframe

Returns

HyperLinks (edges that connect a HyperBond with HyperAtoms) List of pairs in the form [ ((hb2, hb1), (ha2, ha1)), ((hb2, hb1), (ha2, ha1)), …] These are used to create a data graph

Notice: set HACOLs to the filtered state first, using self._aset.update_hacols_filtered_state()

to_record_batch()
Returns

PyArrow RecordBatch but columns are not dictionary encoded

Notice: Always decode the PyArrow RecordBatch before sending it to a Pandas DataFrame; it is a lot faster

to_records()
Returns

NumPy Records

to_string_array(unique=False)
Parameters

unique

Returns

List of string values This is a string representation for the valid (non-null) values of the filtered HACOL It is used in the construction of a data graph to set the value property of the node

Notice: set HACOLs to the filtered state first, using self._aset.update_hacols_filtered_state()

to_table()
Returns

PyArrow Table

where(condition)

Notice: The minimum condition you specify is the attribute name or the attribute dim2 dimension. Valid conditions: '$2', 'quantity', 'price>=4', 'size = 10'

Parameters

condition

Returns

BooleanArray Mask that is used in filter()
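
Example (a hedged sketch; aset is an ASET instance and the attribute names are placeholders):

    df = (aset.q.
          where('price>=4').
          filter().
          over(select='quantity, price').
          to_dataframe().
          out())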

hypermorph.hassoc module

class hypermorph.hassoc.Association(*pos_args, **kw_args)

Bases: object

This is the analogue of a relational tuple, i.e. a row of ordered values. An Association is the basic construct of Associative Sets

It is called Association because it associates a HyperBond to a set of HyperAtoms. HyperBond is a symbolic 2D numerical representation of a row, and HyperAtom is a symbolic 2D numerical representation of a unique value in the table column. HyperAtoms can also have a textual (string) representation

Association can be represented in many ways:

i) With the hb key: A[7, 4]

ii) With keyword arguments: Association(hb=(7, 4), prtcol=None, prtwgt=None, prtID=227, prtnam='car battery', prtunt=None)

iii) With positional arguments: Association((7, 4), None, None, 227, 'car battery', None)

heading: a set of attributes and a key e.g. (‘hb’, ‘prtcol’, ‘prtwgt’, ‘prtID’, ‘prtnam’, ‘prtunt’)

body: KV pairs e.g. Association(hb=(7, 4), prtcol=None, prtwgt=None, prtID=227, prtnam=’car battery’, prtunt=None)

property body
static change_heading(*fields)
get()
property heading_fields

hypermorph.mis module

class hypermorph.mis.MIS(debug=0, rebuild=False, warning=True, load=False, **kwargs)

Bases: object

MIS is a builder pattern class based on Schema class, ….

add(what, **kwargs)

Add new nodes to HyperMorph Schema or an Associative Entity Set.

Parameters
  • what – the type of node to add (datamodel, entity, entities, attribute, dataset)

  • kwargs – pass keyword arguments to the Schema.add() method

Returns

the object(s) that were added to HyperMorph Schema

static add_aset(from_table=None, with_fields=None, entity=None, entity_name=None, entity_alias=None, entity_description=None, datamodel=None, datamodel_name='NEW Data Model', datamodel_alias='NEW_DM', datamodel_descr=None, attributes=None, as_names=None, as_types=None, debug=0)
There are three ways to create an ASET object:
  1. From an Entity that has already a mapping defined (entity) fields are mapped onto the attributes of an existing Entity

  2. From a Table of a dataset (from_table, with_fields) that are mapped onto the attributes of a NEW Entity that is created in an existing DataModel,

  3. From a Table of a dataset (from_table, with_fields) that are mapped onto the attributes of a NEW Entity that is created in a NEW DataModel

Case (2) and (3) define a new mapping between a data set and a data model (see the sketch after the parameter list)

Parameters
  • from_table

  • with_fields

  • entity

  • entity_name

  • entity_alias

  • entity_description

  • datamodel

  • datamodel_name

  • datamodel_alias

  • datamodel_descr

  • attributes

  • as_names

  • as_types

  • debug

Returns

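Example for case (2) (a hedged sketch; mis, tbl, dm and all names/values are placeholders):

    aset = mis.add_aset(from_table=tbl, with_fields='npi, city, last, first',
                        datamodel=dm, entity_name='Physician', entity_alias='PHYS')
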
property all_nodes
at(*args)
property datamodels
property datasets
property dms
property drs
get(nid, what='node', select=None, index=None, out='dataframe', junction=None, mapped=None, key_column='nid', value_columns='cname', filter_attribute=None, filter_value=None, reset=False)

This method implements the functional paradigm; it is basically a wrapper of chainable methods. For example:

    get(461).get_entities().over(select='nid, dim3, dim2, cname, alias, descr').to_dataframe(index='dim3, dim2').out()

can be written as:

    get(461, what='entities', select='nid, dim3, dim2, cname, alias, descr', out='dataframe', index='dim3, dim2')

Parameters
  • nid

  • what

  • select

  • index

  • out

  • junction

  • mapped

  • key_column

  • value_columns

  • filter_attribute

  • filter_value

  • reset

Returns

get_all_nodes()
get_datamodels()
get_datasets()
get_overview()
get_systems()
property hls
load(**kwargs)
property mem
property mms
property overview
rebuild(warning=True, **kwargs)
property root
save()
static size_of_dataframe(df, deep=False)
static size_of_object(obj)
property sls
property systems

hypermorph.schema module

class hypermorph.schema.Schema(rebuild=False, load=False, **graph_properties)

Bases: object

Schema class creates a data catalog, i.e. a metadata repository. The data catalog resembles a TBox, a vocabulary of "terminological components", i.e. abstract terms. Data catalog properties, e.g. dimensions, names, counters, etc., describe the concepts in a data dictionary. These terms are Entity types, Attribute types, Data Resource types, Link (edge) types, etc. TBox is about types and relationships between types, e.g. Entity-Attribute, Table-Column, Object-Fields, etc.

Schema of HyperMorph is represented with a directed graph that is based on the graph_tool Python module. The Schema graph is composed of SchemaNodes and SchemaEdges. Each SchemaEdge links two SchemaNodes, and we define a direction convention from a tail SchemaNode to a head SchemaNode.

System, DataModel, DataSet, GraphDataModel, Table, Field, classes are derived from SchemaNode class

Schema of HyperMorph is a hypergraph defined by two sets of objects (a.k.a. hyper-nodes HNs and hyper-edges HEs). If we have 'hyper-edges' HE={he1, he2, he3} and 'hyper-nodes' HN={hn1, hn2, hn3}, then we can make a map such as d = {he1: (hn1, hn2), he2: (hn2,), he3: (hn1, hn2, hn3)}; G(HE, HN, d) is the hypergraph

add(what, with_components=False, datamodel=None, **kwargs)

Wrapper method for add methods

Parameters
  • what – the type of node to add (datamodel, entity, entities, attribute, dataset)

  • with_components –

    existing components of the dataset to add; valid parameters are ['tables', 'fields'], 'tables', 'graph data models', 'schemata'

    'tables': For datasets in a DBMS add database tables; for datasets from files with a tabular structure add files of a specific type in a folder. Files with tabular structure are flat files (CSV, TSV), Parquet files, Excel files, etc… Note: these are added as new Table nodes of HyperMorph Schema with type TBL

    'fields': Either add columns of a database table or fields of a file with tabular structure. Note: these are added as new Field nodes of HyperMorph Schema with type FLD

    'graph data models': A dataset of graph data models, i.e. files of type .graphml or .gt in a folder. Each file in the set serializes, i.e. represents, a HyperMorph DataModel

    'schemata': A dataset of HyperMorph schemata, i.e. files of type .graphml or .gt in a folder. Each file in the set serializes, i.e. represents, a HyperMorph Schema

  • datamodel – A node of type DM to add NEW nodes of type Entity and Attribute

  • kwargs – Other keyword arguments to pass

Returns

the object(s) that were added to HyperMorph Schema
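
Example (a hedged sketch; s is a Schema instance and the keyword arguments shown are hypothetical node properties passed through kwargs):

    dm = s.add('datamodel', cname='Supplier Parts', alias='SPC')
    ds = s.add('dataset', with_components=['tables', 'fields'],
               cname='TPCH flat files', path='/data/tpch')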

add_datamodel(**nprops)
Parameters

nprops – schema node (vertex) properties

Returns

DataModel object

add_dataset(**nprops)
Parameters

nprops – schema node (vertex) properties

Returns

DataSet object

add_edge(from_vertex, to_vertex, **eprops)
Parameters
  • from_vertex – tail vertex

  • to_vertex – head vertex

  • eprops – Schema edge properties

Returns

an edge of Schema Graph

add_edges(elist)

Notice: it is not used in this module….

Parameters

elist – edge list

Returns

Parameters
  • from_node – tail node is a SchemaNode object or node ID

  • to_node – head node is a SchemaNode object or node ID

  • eprops – edge properties

If there isn't a link from the tail node to the head node, it will try to create a new one; otherwise it will return an existing SchemaLink instance

Returns

SchemaLink object, i.e. an edge of the schema graph

add_vertex(**vprops)
Parameters

vprops – Schema vertex properties

Returns

a vertex of Schema Graph

property alias
property all_nodes
Returns

shortcut for SchemaPipe operation to set the GraphView in unfiltered state and get all the nodes

at(dim4, dim3, dim2)

Notice: Only data model, data resource objects have keys with dimensions (dim4, dim3, dim2)

Parameters
  • dim4 – taken from self.dms.dim4 or self.drs.dim4; it is fixed and never changes

  • dim3 – represents a datamodel or dataset object

  • dim2 – represents a component of datamodel or dataset object

Returns

the dataset or the datamodel object with the specific key

property cname
property counter
property ctype
property datamodels
Returns

shortcut for SchemaPipe operations to output datamodels metadata in a dataframe

property datasets
Returns

shortcut for SchemaPipe operations to output datasets metadata in a dataframe

property descr
property dim2
property dim3
property dim4
property dms
property drs
property ealias
property edge_properties
property elabel
property ename
property etype
property extra
get(nid)
Parameters

nid – Node ID (vertex id)

Returns

SchemaNode object

get_all_nodes()
Returns

result from get_all_nodes method that can be chained to other operations e.g. filter_view(),

get_datamodels()
Returns

result from get_datamodels method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_datasets()
Returns

result from get_datasets method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_ep(ep_name)
Parameters

ep_name – edge property name

Returns

EdgePropertyMap object

get_ep_value(ep_name, edge)
Parameters
  • ep_name – edge property name

  • edge

Returns

the enumerated value of the edge property on the specific edge of the graph; the value is enumerated with a key in the eprop_dict

get_ep_values(ep_name)
get_node_by_id(nid)
Parameters

nid – node ID (vertex id)

Returns

SchemaNode object

get_node_by_key(dim4, dim3, dim2)

Notice: Only data model, data resource objects have keys with dimensions (dim4, dim3, dim2)

Parameters
  • dim4 – taken from self.dms.dim4 or self.drs.dim4; it is fixed and never changes

  • dim3 – represents a datamodel or dataset object

  • dim2 – represents a component of datamodel or dataset object

Returns

the dataset or the datamodel object with the specific key

get_overview()
Returns

result from get_datamodels method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_systems()
Returns

result from get_systems method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_vp(vp_name)
Parameters

vp_name – vertex property name

Returns

VertexPropertyMap object

get_vp_value(vp_name, vid)
Parameters
  • vp_name – vertex property name

  • vid – either vertex object or vertex index (node id)

Returns

the value of vertex property on the specific vertex of the graph

get_vp_values(vp_name, filtered=False)
property graph
property graph_properties
property graph_view
property hls
property is_filtered
property is_view_filtered
property list_properties
property net_alias
property net_descr
property net_edges
property net_format
property net_name
property net_path
property net_tool
property net_type
property ntype
property overview
Returns

shortcut for SchemaPipe operations to output an overview of systems, datamodels, datasets in a dataframe

property root
save_graph()

Save HyperMorph Schema._graph using the self._net_name, self._net_path and self._net_format

set_filter(filter_value, filter_attribute=None, operator='eq', reset=True, inverted=False)

This filters the Schema Graph instance.

Parameters
  • filter_value – the value of the attribute to filter vertices of the graph, or a list of node ids (vertex ids)

  • filter_attribute – a defined vertex property for filtering vertices of the graph (Schema nodes) to create a GraphView

  • operator – e.g. comparison operator for the values of node

  • reset – set the GraphView in unfiltered state, i.e. parameter vfilt=None; set the vertex mask in unfiltered state, i.e. fill the array with zeros; this step is necessary when we filter with node_ids

  • inverted

Returns

the filtered state

set_filter_view(filter_value, filter_attribute=None, operator='eq', reset=True)

GraphView is a filtered view of the Graph; in that case the state of the Graph is not affected by the filtering operation, i.e. after filtering the Graph has the same nodes and edges as before filtering

Parameters
  • filter_value – the value of the attribute to filter vertices of the graph or a list of node ids (vertex ids)

  • filter_attribute – is a defined vertex property for filtering vertices of the graph (Schema nodes) to create a GraphView

  • operator – e.g. comparison operator for the values of node

  • reset – set the GraphView in unfiltered state, i.e. parameter vfilt=None; set the vertex mask in unfiltered state, i.e. fill the array with zeros; this step is necessary when we filter with node_ids

Returns

property sls
property systems
Returns

shortcut for SchemaPipe operations to output systems metadata in a dataframe

unset_filter()

Reset the filtering of the Schema Graph.

Returns

the filtered state

unset_filter_view()
property vertex_properties
property vertices
property vertices_view
property vid
property vids
property vids_view
property vmask
hypermorph.schema.str_to_class(class_name)
Parameters

class_name – e.g. Table, Entity, Attributes (see class_dict)

Returns

a class that is used in get(), get_node_by_id() methods

hypermorph.schema_dms_attribute module

class hypermorph.schema_dms_attribute.Attribute(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

Notice: all get_* methods return node ids so that they can be converted easily to many forms: keys, dataframe, SchemaNode objects, etc…

property datamodel
property entities

Notice: this has a different output < out('node') >, i.e. not metadata in a dataframe, because we use this property in projection, for example in DataSet.get_attributes…

Returns

shortcut for SchemaPipe operations to output Entity nodes

property fields
property get_entities
Returns

result from get_entities method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

property parent

hypermorph.schema_dms_datamodel module

class hypermorph.schema_dms_datamodel.DataModel(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

Notice: all get_* methods return SchemaPipe, DataPipe objects so that they can be chained to other methods of those classes. That way we can convert and transform anything easily to many forms: keys, dataframe, SchemaNode objects…

ToDo: a method of DataModel to save it separately from Schema, e.g. write it on disk in a serialized format (graphml) or in a database… In the current version a DataModel can be created with commands and saved in a .graphml or .gt file, or it can be saved together with the Schema in a .graphml or .gt file.

add_attribute(entalias, **nprops)
Parameters
  • entalias – Attribute is linked to Entities with the corresponding aliases

  • nprops – schema node (vertex) properties

Returns

single Attribute object

add_entities(metadata)
Parameters

metadata – list of dictionaries, dictionary keys are property names of Entity node (cname, alias, …)

Returns

Entity objects

add_entity(**nprops)
Parameters

nprops – schema node (vertex) properties

Returns

single Entity object

property attributes
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

property components
Returns

shortcut for SchemaPipe operations to output components metadata of the datamodel in a dataframe

property entities
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

get_attributes(junction=None)
Returns

result from get_attributes method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_components()
Returns

result from get_components method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

property get_entities
Returns

result from get_entities method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

property parent
to_hypergraph()

hypermorph.schema_dms_entity module

class hypermorph.schema_dms_entity.Entity(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

Notice: all get_* methods return SchemaPipe, DataPipe objects so that they can be chained to other methods of those classes. That way we can convert and transform anything easily to many forms: keys, dataframe, SchemaNode objects…

property attributes
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

property datamodel
get_attributes(junction=None)
Parameters

junction – True: return junction Attributes; False: return non-junction Attributes; None: return all Attributes

Returns

return result from get_attributes method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_fields(junction=None)
Parameters

junction – True: return fields mapped on junction Attributes; False: return fields mapped on non-junction Attributes; None: return all fields mapped on Attributes

Returns

Fields (node ids) that are mapped onto Attributes

Notice: In the general case, fields are mapped from more than one DataSet or Table object

get_tables()

From the fields mapped on non-junction Attributes find their parents, i.e. Tables. ToDo: Cover the case of fields from multiple tables mapped on attributes of the same entity.

Returns

Table objects

has_mapping()
Returns

True if there are Field(s) of a Table mapped onto Attribute(s) of an Entity, otherwise False

property parent
to_hypergraph()
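
A hedged sketch of inspecting an Entity's mapping with the methods above (`ent` is assumed to be an Entity node obtained from a Schema):

   if ent.has_mapping():
       # Node ids of fields mapped on non-junction Attributes
       field_ids = ent.get_fields(junction=False)
       # Parent Table objects of the mapped fields
       tables = ent.get_tables()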

hypermorph.schema_drs_dataset module

class hypermorph.schema_drs_dataset.DataSet(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

DataSet is a set of data resources (tables, fields, graph datamodels) in the following data containers: SQLite database, MySQL database, CSV/TSV flat files and graph data files

Notice: get_* methods return SchemaPipe, DataPipe objects so that they can be chained to other methods of those classes. That way we can easily convert and transform anything to many forms: keys, dataframe, SchemaNode objects, etc.

add_fields()

The structure here is hierarchical: a DataSet —has—> Tables and each Table —has—> Fields

Returns

new Field objects

add_graph_datamodel(**nprops)

Add a graph data model; this is a graph serialization of a TRIADB data model

Parameters

nprops – schema node (vertex) properties

Returns

single GDM object

add_graph_datamodels()

Add graph data models

Returns

new GDM objects

add_graph_schema(**nprops)

Add a graph schema; this is a graph serialization of a HyperMorph Schema

Parameters

nprops – schema node (vertex) properties

Returns

single GSH object

add_graph_schemata()

Add graph schemata

Returns

new GSH objects

add_table(**nprops)
Parameters

nprops – schema node (vertex) properties

Returns

single Table object

add_tables(metadata=None)
Parameters

metadata – list of dictionaries, keys of dictionary are metadata property names of Table node

Returns

new Table objects

property components
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

property connection
property connection_metadata
container_metadata(**kwargs)
Returns

metadata for the data resource container e.g. metadata for a parquet file, or the tables of a database

property fields
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

get_components()
Returns

result from get_components method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_connection(db_client=None, port=None, trace=0)
Parameters
  • db_client

  • port – use port for either HTTP or native client connection to clickhouse

  • trace

Returns

get_fields(mapped=None)
Parameters

mapped – if True return ONLY those fields that are mapped onto attributes; default (None) returns all fields

Returns

result from get_fields method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_graph_datamodels()
Returns

result from get_graph_datamodels method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_graph_schemata()
Returns

result from get_graph_schemata method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_tables()
Returns

result from get_tables method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

property graph_datamodels
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

property graph_schemata
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

property parent
property tables
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe
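
A hedged sketch of navigating a DataSet, in the style of the chained examples elsewhere on this page (the `mis` handle and `dataset_nid` are assumptions):

   ds = mis.get(dataset_nid)                                  # a DataSet node
   ds.get_tables().over('nid, cname').to_dataframe().out()    # tables metadata
   ds.get_fields(mapped=True).to_nids().out()                 # ids of mapped fields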

hypermorph.schema_drs_field module

class hypermorph.schema_drs_field.Field(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

Notice: all get_* methods return SchemaPipe, DataPipe objects

so that they can be chained to other methods of those classes. That way we can convert, transform easily anything to many forms keys, dataframe, SchemaNode objects…

property attributes
property metadata
property parent

hypermorph.schema_drs_graph_datamodel module

class hypermorph.schema_drs_graph_datamodel.GraphDataModel(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

load_into_schema()

Load a GraphDataModel data resource into the TRIADB Schema in memory.

Notice: Do not confuse adding a set of GraphDataModels, i.e. a set of data resources, with loading any of these graph data models into the TRIADB Schema in memory.

The latter is a different operation: it creates new TRIADB data models in the Schema, i.e. it loads metadata about the DataModel, its Entities and Attributes into the TRIADB Schema

Returns

DataModel object

property parent

hypermorph.schema_drs_graph_schema module

class hypermorph.schema_drs_graph_schema.GraphSchema(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

GraphSchema is a data resource, a child of DataSet like a Table; DO NOT confuse it with the HyperMorph Schema. An instance of a GraphSchema resource is a serialized representation in a file with <.graphml> or <.gt> format

property parent

hypermorph.schema_drs_table module

class hypermorph.schema_drs_table.Table(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

Notice: all get_* methods return SchemaPipe, DataPipe objects so that they can be chained to other methods of those classes. That way we can easily convert and transform anything to many forms: keys, dataframe, SchemaNode objects, etc.

add_field(**nprops)
Parameters

nprops – schema node (vertex) properties

Returns

single Field object

add_fields(metadata=None)
Parameters

metadata – list of dictionaries, each dictionary contains metadata column properties for a field (column) in a table

Returns

new Field objects

container_metadata(**kwargs)
Returns

metadata for the data resource container e.g. metadata for columns of MySQL table

property fields
Returns

shortcut for SchemaPipe operations to output metadata in a dataframe

get_columns()

Wrapper for the DataPipe.get_columns() method.

Returns

result from get_columns method that can be chained to other operations

use out() at the end of the chained methods to retrieve the final result

get_fields(mapped=None)

Wrapper for the SchemaPipe.get_fields() method.

Parameters

mapped – if True return ONLY those fields that are mapped onto attributes; default (None) returns all fields

Returns

result from get_fields method that can be chained to other operations e.g. over(), out()

use out() at the end of the chained methods to retrieve the final result

get_rows(npartitions=None, partition_size=None)

Wrapper for the DataPipe.get_rows() method.

Returns

result from get_rows method that can be chained to other operations

use out() at the end of the chained methods to retrieve the final result

property parent
property sql
to_hypergraph()

hypermorph.schema_node module

class hypermorph.schema_node.SchemaNode(schema, vid=None, **vprops)

Bases: object

The SchemaNode class:
  1. if vid is None

    create a NEW node, i.e. a new vertex on the graph with properties

  2. if vid is not None

    initialize a node that is represented with an existing vertex with vid

Notice: All properties and methods defined here are accessible from derived classes Attribute, Entity, DataModel, DataSet, Table, Field

property all
property all_edges_ids
property all_nids
property all_nodes
property all_vertices
property descriptive_metadata
property dpipe

Returns a Pipe (GenerativeBase object) that refers to an instance of SchemaNode; use this object to chain operations defined in the DataPipe class

get_value(prop_name)
Parameters

prop_name – Vertex property name (vp_names) or @property function name (calculated_properties) or data type properties (field_meta)

Returns

the value of property for the specific node

property in_edges_ids
property in_nids
property in_nodes
property in_vertices
property key
property out_edges_ids
property out_nids
property out_nodes
property out_vertices
property schema
property spipe

Returns a Pipe (GenerativeBase object) that refers to an instance of SchemaNode; use this object to chain operations defined in the SchemaPipe class

property system_metadata
property vertex

hypermorph.schema_pipe module

class hypermorph.schema_pipe.SchemaPipe(schema_node, result=None)

Bases: hypermorph.utils.GenerativeBase

Implements method chaining: a query operation, e.g. projection, counting, filtering, can invoke multiple method calls. Each method corresponds to a query operator, e.g. get_components().over().to_dataframe().out()

out() method is always at the end of the chained generative methods to return the final result

Each one of these operators returns an intermediate result in self.fetch, allowing the calls to be chained together in a single statement.

SchemaPipe methods such as get_*() are wrapped inside methods of classes derived from Schema and SchemaNode, so that when they are called from those methods the result can be chained to other methods of SchemaPipe. In that way we implement transformations and conversions to multiple output formats easily and intuitively; see the hedged sketch after the list below.

Notice: we distinguish between two different execution types according to the evaluation of the result:
  1. Lazy evaluation, see for example to_***() methods

  2. Eager evaluation
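
A hedged sketch of a complete chain, modeled on the take() example further down this page (the `mis` handle and node id 414 are carried over from those examples as assumptions):

   # Fields of node 414, projected over four metadata columns,
   # materialized as a pandas dataframe indexed by dim3, dim2
   mis.get(414).get_fields().over('nid, dim3, dim2, cname').to_dataframe('dim3, dim2').out()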

filter(value=None, attribute=None, operator='eq', reset=True)
Notice: to create a filtered Graph from a list/array of nodes that is the result of previous operations in a pipeline, leave attribute=None and value=None; to create a Graph from a list/array of nodes that is the result of other Python commands, leave attribute=None and set value=[set of nodes]

Parameters
  • attribute – a defined vertex property (node attribute) used to filter the vertices of the graph (Schema nodes)

  • value – the value of the attribute to filter vertices of the graph

  • operator – comparison operator applied to the values of the node property, e.g. ‘eq’

  • reset – if True, set the Graph to the unfiltered state before filtering; otherwise it’s a composite filtering

Returns

pass self.fetch to the next chainable operation

filter_view(value=None, attribute=None, operator='eq', reset=True)
Notice: to create a GraphView from a list/array of nodes that is the result of previous SchemaPipe operations, leave attribute=None and value=None; to create a GraphView from a list/array of nodes that is the result of other Python commands, leave attribute=None and set value=[set of nodes]

Parameters
  • attribute – a defined vertex property (node attribute) used to filter the vertices of the graph (Schema nodes) to create a GraphView

  • value – the value of the attribute to filter vertices of the graph

  • operator – comparison operator applied to the values of the node property, e.g. ‘eq’

  • reset – if True, set the GraphView to the unfiltered state before filtering; otherwise it’s a composite filtering

Returns

pass self.fetch to the next chainable operation
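
Both filtering methods follow the same pattern; a hedged sketch (`node` is assumed to be a SchemaNode instance, spipe is the SchemaNode property documented above, and the dim3/dim2 property values are assumptions):

   # GraphView of Schema vertices whose dim3 property equals 2, output as node ids
   node.spipe.filter_view(attribute='dim3', value=2, operator='eq').to_nids().out()

   # Composite filtering: keep the previous filter and add one on dim2
   node.spipe.filter_view(attribute='dim2', value=0, reset=False).to_nids().out()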

get_all_nodes()

Sets the Graph or GraphView to the unfiltered state.

Returns

all the nodes of the Graph or all the nodes of the GraphView

get_attributes(junction=None)
Parameters

junction – if True fetch junction Attribute nodes, otherwise fetch non-junction Attributes

Returns

Attribute node ids of an Entity or Attribute node ids of a DataModel

get_components()

Get node IDs for the components of a specific DataModel (Entity, Attribute) or DataSet (Table, Field, …). It creates a filtered GraphView of the Schema for nodes that have dim3=SchemaNode.dim3

Returns

self.fetch points to Entity, Attribute, Table, Field, GraphDataModel, GraphSchema nodes; these node ids are passed to the next chainable operation

get_datamodels()

Get DataModel node IDs of the data model system (dms).

Returns

self.fetch points to the set of DataModel node ids, these are passed to the next chainable operation

get_datasets()

Get DataSet node IDs of the data resources system (drs).

Returns

self.fetch points to the set of DataSet node ids, these are passed to the next chainable operation

get_entities()

Get Entity node IDs of a DataModel or Entity node IDs of an Attribute.

Returns

self.fetch points to Entity nodes; these nodes are passed to the next chainable operation

get_fields(mapped=None)

Wrapped in the Table(SchemaNode) class. Get Field node IDs of a Table or Field node IDs of a DataSet.

Parameters

mapped – if True return ONLY those fields that are mapped onto attributes; default (None) returns all fields

Returns

self.fetch points to the set of Field node ids, these are passed to the next chainable operation

get_graph_datamodels()

Get graph datamodel node ids.

Returns

self.fetch that points to these node IDs

get_graph_schemata()

Get graph schemata node ids.

Returns

self.fetch that points to these node IDs

get_overview()

Get an overview of systems, datasets, datamodels, etc. by filtering Schema nodes that have dim2=0.

Returns

self.fetch points to the set of filtered node ids, these are passed to the next chainable operation

get_systems()

Get System node IDs including the root system.

Returns

self.fetch points to the set of System node ids, these are passed to the next chainable operation

get_tables()

Get Table node IDs of a DataSet.

Returns

self.fetch points to the set of Table node ids, these are passed to the next chainable operation

out(**kwargs)
Returns

the output of SchemaNode objects displayed with the specified format and structure

use the out() method at the end of the chained generative methods to return the final result

over(select=None)
Parameters

select – projection over the selected metadata columns

Returns

modifies self._project

plot(**kwargs)

Graphical output to visualize hypergraphs; it is also used in the out() method (see IHyperGraphPlotter.plot method). Example: mis.get(535).to_hypergraph().plot() or mis.get(535).to_hypergraph().out()

Parameters

kwargs

Returns

property schema_node
take(select, key_column='cname')

Take specific nodes from the result of get_*() methods :param select: list of integers (node IDs) or

list of strings (cname(s), alias(es))

Notice: all selected nodes specified must exist otherwise it will raise an exception

Parameters

key_column – e.g. cname, alias

Returns

a subset of numpy array with node IDs

Notice the difference:

over() is a projection over the selected metadata columns (e.g. nid, dim3, dim2,…) take() is a projection over the selected fields of a database table, flatfile (e.g. npi, city, state,…)

Example: mis.get(414).get_fields().over(‘nid, dim3, dim2, cname’)

.take(select=’npi, pacID, profID, city, state’).to_dataframe(‘dim3, dim2’).out()

to_dataframe(index=None)
Parameters

index – metadata column names to use in pandas dataframe index

Returns

to_dict(key_column, value_columns)
Parameters
  • key_column – e.g. cname, alias, nid

  • value_columns – e.g. [‘cname, alias’]

Returns

to_dict_records(lazy=False)
to_entity(entity_name='NEW Entity', entity_alias='NEW_ENT', entity_description=None, datamodel=None, datamodel_name='NEW DataModel', datamodel_alias='NEW_DM', datamodel_descr=None, attributes=None, as_names=None, as_types=None)

Map a Table object of a DataSet onto an Entity of a DataModel; there are two scenarios:

  1. Map Table to a new Entity and selected fields (or all fields) of the table onto new attributes

    The new entity can be linked to a new datamodel (datamodel=None) or to an existing datamodel

  2. Map selected fields (or all fields) of a table onto existing attributes of a datamodel

It’s a bipartite matching of fields with attributes, with a one-to-one correspondence between fields and attributes. The user must specify the datamodel parameter.

Notice1: The Field-Attribute relationship is a Many-To-One i.e. many fields of different Entity objects are mapped onto one (same) Attribute

Notice2: In both (a) and (b) cases fields are selected with a combination of get_fields() and take() SchemaPipe operations on the table object

Example for (a): get(414).get_fields().take(‘npi, pacID, profID, last, first, gender, graduated, city, state’).to_entity(entity_name=’Physician’, entity_alias=’Phys’).out()

Example for (b): see the hedged sketch after the parameter list below.

Parameters
  • entity_name

  • entity_alias

  • entity_description

  • datamodel – create a new datamodel by default or pass an existing DataModel object

  • datamodel_name

  • datamodel_alias

  • datamodel_descr

  • attributes – list of integers (Attribute IDs) or list of strings (Attribute cnames, aliases) of an existing Entity or None (default) to create new Attributes

  • as_names – in the case of creating new attributes, list of strings one for each new attribute

  • as_types – in the case of creating new attributes, list of strings one for each new attribute Notice: data types can be inferred later on when we use arrow dictionary encoding…

Returns

An Entity object
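
A hedged sketch for scenario (b), mapping selected fields onto existing attributes of an existing datamodel (`dm`; the field names and attribute names are assumptions):

   mis.get(414).get_fields().take('npi, city, state').to_entity(datamodel=dm, attributes=['npi', 'city', 'state']).out()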

to_fields()

Converts a list of Attribute objects to a list of Field objects.

Returns

list of fields that are mapped onto an Attribute

to_hypergraph()
to_keys(lazy=False)
to_nids(lazy=False, array=True)
to_nodes(lazy=False)
to_tuples(lazy=False)
to_vertices(lazy=False)

hypermorph.schema_sys module

class hypermorph.schema_sys.System(schema, vid=None, **node_properties)

Bases: hypermorph.schema_node.SchemaNode

property datamodels
property datasets
property parent
property systems

hypermorph.test module

hypermorph.utils module

class hypermorph.utils.DSUtils

Bases: object

Data Structure Utils Class

static numpy_sorted_index(arr, adj=False, freq=False)
Parameters
  • arr – numpy 1d array that represents a table column of data values of the same type; in the case of a numpy array with string values and missing data, null values must be represented with np.NaN

  • adj – if True return adjacency lists

  • freq – if True return frequencies

Returns

  1. secondary index, i.e. unique values of arr in ascending order without NaN (null)

  2. optionally, for each unique value:

    1. if adj=True, the list of primary key indices, i.e. pointers to all rows of the table that contain that value, also known as adjacency lists in Graph terminology

    2. if freq=True, the count of rows that contain that value, also known as cardinality (selectivity) in databases and as frequency in an associative engine
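
A hedged sketch; the order of the three returned objects is an assumption based on the description above:

   import numpy as np
   from hypermorph.utils import DSUtils

   arr = np.array([3, 1, 3, 2, 1, 3])
   unique_vals, adj_lists, freqs = DSUtils.numpy_sorted_index(arr, adj=True, freq=True)
   # unique_vals -> [1, 2, 3]
   # adj_lists   -> [[1, 4], [3], [0, 2, 5]]  (row pointers per unique value)
   # freqs       -> [2, 1, 3]                 (row counts per unique value)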

static numpy_to_pyarrow(np_arr, dtype=None, dictionary=True)
Parameters
  • np_arr – numpy 1d array that represents a table column of data values of the same type

  • dtype – data type

  • dictionary – whether to use dictionary encoded form or not

Returns

pyarrow array representation of arr

static pyarrow_chunked_to_dict(chunked_array)
Parameters

chunked_array – PyArrow ChunkedArray

Returns

PyArrow Array / DictionaryArray

static pyarrow_dict_to_arr(dict_array)
Parameters

dict_array – PyArrow DictionaryArray

Returns

PyArrow 1d Array

static pyarrow_dtype_from_string(dtype, dictionary=False, precision=9, scale=3)
Parameters
  • dtype – string that specifies the PyArrow data type

  • dictionary – pyarrow dictionary data type, i.e. pa.dictionary(pa.int32(), pa.vtype())

  • precision – for the decimal128 arrow data type (total number of digits in the number, integer + fractional)

  • scale – for the decimal128 arrow data type (number of digits of the fractional part)

Returns

pyarrow data type from a string

static pyarrow_get_dtype(arr)
Parameters

arr – PyArrow 1d Array either dictionary encoded or not

Returns

value type of PyArrow array elements

static pyarrow_record_batch_to_table(batch)
static pyarrow_sort(array, ascending=True)
Parameters
  • array – PyArrow Array

  • ascending

Returns

static pyarrow_table_to_record_batch(table)
Parameters

table – PyArrow Table

Returns

PyArrow RecordBatch

static pyarrow_to_numpy(pa_arr)
Parameters

pa_arr – PyArrow 1d Array or DictionaryArray

Returns

NumPy 1d array

static pyarrow_vtype_to_numpy_vtype(arr)
Parameters

arr – PyArrow 1d Array

Returns

NumPy value type that is equivalent of PyArrow value type

class hypermorph.utils.DotDict

Bases: dict

dot.notation access to dictionary attributes

Example:

   person_dict = {'first_name': 'John', 'last_name': 'Smith', 'age': 32}
   address_dict = {'country': 'UK', 'city': 'Sheffield'}

   person = DotDict(person_dict)
   person.address = DotDict(address_dict)

   print(person.first_name, person.last_name, person.age, person.address.country, person.address.city)

class hypermorph.utils.FileUtils

Bases: object

static change_cwd(fpath)
static feather_to_arrow_schema(source)
static feather_to_arrow_table(file_location, select=None, limit=None, offset=None, **pyarrow_kwargs)

This uses pyarrow.feather.read_table(), see https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_table.html#pyarrow.feather.read_table

Parameters
  • file_location – full path location of the file

  • select – use a subset of columns from feather file

  • limit – limit on the number of records to return

  • offset – exclude the first number of rows. Notice: do not confuse offset with the number of rows to skip at the start of the flat file; in pandas.read_csv, offset can also be used as skiprows

  • pyarrow_kwargs – other parameters that are passed to pyarrow.feather.read_table

Returns

static flatfile_delimiter(file_type)
Parameters

file_type – CSV or TSV; these have default delimiters ‘,’ and ‘\t’ (tab) respectively

Returns

default delimiter or the specified delimiter in the argument

static flatfile_drop_extention(fname)
static flatfile_header(file_type, file_location, delimiter=None)
Parameters
  • file_type – CSV or TSV; these have default delimiters ‘,’ and ‘\t’ (tab) respectively

  • delimiter – 1-character string specifying the boundary between fields of the record

  • file_location – full path location of the file with an extension (.tsv, .csv)

Returns

field names in a list

static flatfile_to_pandas_dataframe(file_type, file_location, select=None, as_columns=None, as_types=None, index=None, partition_size=None, limit=None, offset=None, delimiter=None, nulls=None, **pandas_kwargs)

Read rows from flat file and convert them to pandas dataframe with pandas.read_csv https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Parameters
  • file_type – CSV or TSV; these have default delimiters ‘,’ and ‘\t’ (tab) respectively

  • file_location – full path location of the file

  • delimiter – 1-character string specifying the boundary between fields of the record

  • nulls – list of strings that denote nulls e.g. [‘N’]

  • partition_size – number of records to use for each partition or target size of each partition, in bytes

  • select – use a subset of columns from the flat file

  • as_columns – user specified column names for pandas dataframe (list of strings)

  • as_types – dictionary with column names as keys and data types as values; used when we read data from flat files and want to disable type inference on those columns

  • index – column names to be used in pandas dataframe index

  • limit – limit on the number of records to return

  • offset – exclude the first number of rows. Notice: do not confuse offset with the number of rows to skip at the start of the flat file; in pandas.read_csv, offset can also be used as skiprows

  • pandas_kwargs – other arguments of pandas read_csv method

Returns

pandas dataframe

Example of read_csv(): read_csv(source, sep=’|’, index_col=False, nrows=10, skiprows=3, header=0, usecols=[‘catsid’, ‘catpid’, ‘catcost’, ‘catfoo’, ‘catchk’], dtype={‘catsid’: int, ‘catpid’: int, ‘catcost’: float, ‘catfoo’: float, ‘catchk’: bool}, parse_dates=[‘catdate’])
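
A hedged sketch using the documented parameters of this method (the file path and column names are assumptions):

   from hypermorph.utils import FileUtils

   df = FileUtils.flatfile_to_pandas_dataframe(
       file_type='CSV',
       file_location='/data/physicians.csv',   # hypothetical path
       select=['npi', 'city', 'state'],        # subset of columns
       as_types={'npi': int},                  # disable type inference for npi
       nulls=['N'],                            # strings that denote nulls
       limit=100)                              # first 100 records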

static flatfile_to_pyarrow_table(file_type, file_location, select=None, as_columns=None, as_types=None, partition_size=None, limit=None, offset=None, skip=0, delimiter=None, nulls=None)

Read columnar data from CSV files https://arrow.apache.org/docs/python/csv.html

Parameters
  • file_type – CSV or TSV; these have default delimiters ‘,’ and ‘\t’ (tab) respectively

  • file_location – full path location of the file

  • delimiter – 1-character string specifying the boundary between fields of the record

  • nulls – list of strings that denote nulls e.g. [‘N’]

  • partition_size – number of records to use for each partition or target size of each partition, in bytes

  • select – list of column names to include in the pyarrow Table, default None (all columns)

  • as_columns – user-specified column names for the pyarrow Table (list of strings)

  • as_types – Map column names to column types (disabling type inference on those columns)

  • limit – limit on the number of rows to return

  • offset – exclude the first number of rows. Notice: do not confuse offset with skip; offset is applied after the table has been read

  • skip – number of rows to skip at the start of the flat file

Returns

pyarrow in-memory table

static flatfile_to_python_lists(file_type, file_location, nrows=10, skip_rows=1, delimiter=None)
Parameters
  • file_type – CSV or TSV; these have default delimiters ‘,’ and ‘\t’ (tab) respectively

  • delimiter – 1-character string specifying the boundary between fields of the record

  • file_location – full path location of the file with an extension (.tsv, .csv)

  • nrows – number of rows to read from the file

  • skip_rows – number of rows to skip; the default of 1 skips the header of the file

Returns

rows of the file as python lists

static get_cwd()
static get_filenames(path, extension='json', window_title='Choose files', gui=False, select=None)
static get_full_path(path)
static get_full_path_filename(p, f)
static get_full_path_parent(path)
static json_to_dict(fname)
static parquet_metadata(source, **pyarrow_kwargs)
static parquet_to_arrow_schema(source, **pyarrow_kwargs)
static parquet_to_arrow_table(file_location, select=None, limit=None, offset=None, arrow_encoding=False, **pyarrow_kwargs)

This uses pyarrow.parquet.read_table(), see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

Parameters
  • file_location – full path location of the file

  • select – use a subset of columns from parquet file

  • limit – limit on the number of records to return

  • offset – exclude the first number of rows. Notice: do not confuse offset with the number of rows to skip at the start of a flat file; in pandas.read_csv, offset can also be used as skiprows

  • arrow_encoding – PyArrow dictionary encoding

  • pyarrow_kwargs – other parameters that are passed to pyarrow.parquet.read_table

Returns

static pyarrow_read_record_batch(file_location, table=False)
Parameters
  • file_location

  • table

Returns

Either PyArrow RecordBatch, or PyArrow Table if table=True

static pyarrow_table_to_feather(table, file_location, **feather_kwargs)

Write a Table to Feather format.

Parameters
  • table – pyarrow Table

  • file_location – full path location of the feather file

  • feather_kwargs – see https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather

Returns

static pyarrow_table_to_parquet(table, file_location, **pyarrow_kwargs)

Write a Table to Parquet format.

Parameters
  • table – pyarrow Table

  • file_location – full path location of the parquet file

  • pyarrow_kwargs – row_group_size, version, use_dictionary, compression (see https://pyarrow.readthedocs.io/en/latest/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table)

Returns
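
A hedged sketch of a Parquet round trip with these two helpers (the file path is an assumption):

   import pyarrow as pa
   from hypermorph.utils import FileUtils

   table = pa.table({'nid': [1, 2, 3], 'cname': ['a', 'b', 'c']})
   FileUtils.pyarrow_table_to_parquet(table, '/tmp/nodes.parquet')
   # Read back a column projection of the first two rows
   t2 = FileUtils.parquet_to_arrow_table('/tmp/nodes.parquet', select=['cname'], limit=2)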

static pyarrow_write_record_batch(record_batch, file_location)
Parameters
  • record_batch – PyArrow RecordBatch

  • file_location

Returns

static write_json(data, fname)
class hypermorph.utils.GenerativeBase

Bases: object

http://derrickgilland.com/posts/introduction-to-generative-classes-in-python/. A Python Generative Class is defined as a class that returns or clones, i.e. generates, itself when accessed by certain means. This type of class can be used to implement method chaining or to mutate an object’s state without modifying the original class instance.

class hypermorph.utils.MemStats

Bases: object

Compare memory statistics with free -m. Units are in MiB (mebibytes), 1 MiB = 2^20 bytes

property available
property buffers
property cached
property cpu
property difference
property free
property mem
print_stats()
property total
property used
class hypermorph.utils.PandasUtils

Bases: object

pandas dataframe utility methods

static dataframe(iterable, columns=None, ndx=None)
Parameters
  • iterable – e.g. list like objects

  • columns – comma-separated string or list of strings; labels to use for the columns of the resulting dataframe

  • ndx – comma-separated string or list of strings; column names to use for the index of the resulting dataframe

Returns

pandas dataframe with an optional index
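
A hedged sketch based on the documented parameters:

   from hypermorph.utils import PandasUtils

   df = PandasUtils.dataframe([(1, 'a'), (2, 'b')], columns='nid, cname', ndx='nid')
   # -> dataframe with columns nid, cname, indexed by nid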

static dataframe_cardinality(df)
static dataframe_concat_columns(df1, df2)
static dataframe_memory_usage(df, deep=False)
static dataframe_selectivity(df)
static dataframe_to_pyarrow_table(df, columns=None, schema=None, index=False)
Parameters
  • df – pandas dataframe

  • columns – List of column to be converted. If None, use all columns

  • schema – the expected pyarrow schema of the pyarrow Table

  • index – Whether to store the index as an additional column in the resulting Table.

Returns

pyarrow.Table

static dataframes_to_html(*df_stylers)
static dict_to_dataframe(d, labels)
hypermorph.utils.bytes2mb(b)
hypermorph.utils.get_size(obj)

Sum the size of an object and its members.

hypermorph.utils.highlight_states(s)
hypermorph.utils.session_time()
hypermorph.utils.split_comma_string(names)
hypermorph.utils.sql_construct(select, frm, where=None, group_by=None, having=None, order=None, limit=None, offset=None)
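
A hedged sketch of sql_construct; whether each argument carries its own SQL keyword is an assumption based on the parameter names:

   from hypermorph.utils import sql_construct

   query = sql_construct(select='SELECT nid, cname',
                         frm='FROM nodes',
                         where='WHERE dim3 = 2',
                         order='ORDER BY nid',
                         limit=10)
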
hypermorph.utils.zip_with_scalar(num, arr)

Use: to generate hyperbond (hb2, hb1), hyperatom (ha2, ha1) tuples.

Parameters
  • num – scalar value

  • arr – array of values

Returns

generator of tuples in the form (i, num) where i in arr
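
For example (hedged, following the documented return form):

   from hypermorph.utils import zip_with_scalar

   list(zip_with_scalar(5, [1, 2, 3]))   # -> [(1, 5), (2, 5), (3, 5)]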

Module contents

This file is part of HyperMorph operational API for information management and data transformations on Associative Semiotic Hypergraph Development Framework (C) 2015-2019 Athanassios I. Hatzis

HyperMorph is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License v.3.0 as published by the Free Software Foundation.

HyperMorph is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with HyperMorph. If not, see <https://www.gnu.org/licenses/>.