API Reference#

Flatten module#

class spoonbill.flatten.TableFlattenConfig(split, pretty_headers=False, headers=<factory>, repeat=<factory>, unnest=<factory>, only=<factory>, name='')[source]#

Table-specific flattening configuration

Parameters:
  • split (bool) – Split child arrays to separate tables

  • pretty_headers (bool) – Use human-friendly headers extracted from the schema

  • headers (Mapping[str, str]) – User-edited headers that override the automatically extracted ones

  • unnest (List[str]) – List of columns to output from child to parent table

  • repeat (List[str]) – List of columns to clone in child tables

  • only (List[str]) – List of columns to output

  • name (str) – Overwrite table name

split: bool#
pretty_headers: bool = False#
headers: Mapping[str, str]#
repeat: List[str]#
unnest: List[str]#
only: List[str]#
name: str = ''#
class spoonbill.flatten.FlattenOptions(selection, exclude=<factory>, count=False)[source]#

Flattening configuration

Parameters:
  • selection (Mapping[str, TableFlattenConfig]) – Mapping of selected tables to extract from the data

  • count (bool) – Include the number of rows of each child table in its parent table

  • exclude (List[str]) – List of tables to exclude from export

selection: Mapping[str, TableFlattenConfig]#
exclude: List[str]#
count: bool = False#
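Taken together, FlattenOptions and TableFlattenConfig encode a nested structure. A minimal sketch of that shape as plain Python dicts (the table name "tenders" and the column paths are illustrative assumptions, not taken from this reference):

```python
# Plain-dict mirror of FlattenOptions(selection={...}, exclude=[], count=True).
# The table name "tenders" and the column paths are hypothetical.
options = {
    "selection": {
        "tenders": {                       # one TableFlattenConfig per table
            "split": True,                 # split child arrays to separate tables
            "pretty_headers": True,        # use human-friendly schema titles
            "headers": {"/tender/id": "Tender ID"},  # user-edited overrides
            "repeat": ["/tender/id"],      # columns cloned into child tables
            "unnest": [],                  # child columns to pull into the parent
            "only": [],                    # restrict the output columns (empty here)
            "name": "",                    # keep the autogenerated table name
        }
    },
    "exclude": [],                         # tables to leave out of the export
    "count": True,                         # add child-row counts to parents
}
```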
class spoonbill.flatten.Flattener(options, tables, language='en')[source]#

Data flattener

In order to export data correctly, the Flattener requires previously analyzed table data. During the process, the flattener may add columns that are not based on schema analysis, such as itemsCount. Depending on the table type, the flattener will always add a few autogenerated columns to every generated row. For the root table:
  • rowID

  • id

  • ocid

For child tables, this list will be extended with a parentID column.

Parameters:
  • options (FlattenOptions) – Flattening options

  • tables (Mapping[str, Table]) – Analyzed tables data

  • language – Language to use for the human-readable headings

init_table_selection(tables)[source]#
init_child_tables(tables, table, options)[source]#
init_map(map, paths, table, only=None, target=None)[source]#
init_table_lookup(tables, table, target=None)[source]#
init_count(table, options)[source]#
init_unnest(table, options)[source]#
init_repeat(table, options)[source]#
init_options(tables)[source]#
init_only(table, only, split)[source]#
get_table(pointer)[source]#
flatten(releases)[source]#

Flatten releases

Parameters:

releases – releases as iterable object

Returns:

Iterator over mapping between table name and list of rows for each release
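The shape of the yielded mapping can be illustrated with a stdlib-only toy sketch. This is not spoonbill's implementation: the table names "releases" and "tenders_items" and the child array's location are assumptions, while rowID, id, ocid, parentID, and itemsCount are the autogenerated columns described above.

```python
def toy_flatten(releases):
    """Yield one {table_name: rows} mapping per release, mimicking the
    shape of Flattener.flatten() output described above."""
    for row_id, release in enumerate(releases):
        items = release.get("tender", {}).get("items", [])
        root_row = {
            "rowID": row_id,
            "id": release.get("id"),
            "ocid": release.get("ocid"),
            "itemsCount": len(items),  # autogenerated count column
        }
        child_rows = [
            {"rowID": f"{row_id}.{i}", "parentID": row_id}  # link back to parent
            for i, _ in enumerate(items)
        ]
        yield {"releases": [root_row], "tenders_items": child_rows}

rows = list(toy_flatten([
    {"id": "1", "ocid": "ocds-1", "tender": {"items": [{"description": "x"}]}}
]))
```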

CLI module#

cli.py - Command-line interface routines

class spoonbill.cli.CommaSeparated[source]#

Click option type that converts a comma-separated string into a list

name: str = 'comma'#

the descriptive name of this type

convert(value, param, ctx)[source]#

Convert the value to the correct type. This is not called if the value is None (the missing value).

This must accept string values from the command line, as well as values that are already the correct type. It may also convert other compatible types.

The param and ctx arguments may be None in certain situations, such as when converting prompt input.

If the value cannot be converted, call fail() with a descriptive message.

Parameters:
  • value – The value to convert.

  • param – The parameter that is using this type to convert its value. May be None.

  • ctx – The current context that arrived at this value. May be None.
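Stripped of Click, the conversion amounts to splitting on commas while passing through values that are already lists. A toy sketch (the whitespace stripping is an assumption, not documented behavior):

```python
def convert_comma_separated(value):
    """Toy version of CommaSeparated.convert: split a comma-separated
    string into a list; pass through values that are already lists."""
    if value is None:
        return None  # Click never calls convert() for a missing value
    if isinstance(value, list):
        return value
    # Stripping surrounding whitespace is an assumption for readability
    return [part.strip() for part in str(value).split(",")]
```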

spoonbill.cli.read_option_file(option, option_file)[source]#
spoonbill.cli.get_selected_tables(base, selection)[source]#

Spec module#

class spoonbill.spec.Column(id, path, title, type, hits=0, header=<factory>)[source]#

A container for column information.

Parameters:
  • id (str) – The JSON path without indexes

  • path (str) – The JSON path with indexes

  • title (str) – The human-friendly title

  • type (str) – The expected type

  • hits (int) – The number of times the column contains data during analysis

  • header (list) –

id: str#
path: str#
title: str#
type: str#
hits: int = 0#
header: list#
class spoonbill.spec.Table(name, path, total_rows=0, parent=<factory>, is_root=False, is_combined=False, splitted=False, rolled_up=False, columns=<factory>, combined_columns=<factory>, additional_columns=<factory>, arrays=<factory>, titles=<factory>, child_tables=<factory>, types=<factory>, array_columns=<factory>, array_positions=<factory>, preview_rows=<factory>, preview_rows_combined=<factory>)[source]#

A container for table information.

Parameters:
  • name (str) – Table name

  • path (List[str]) – List of paths to gather data to this table

  • total_rows (int) – Total available rows in this table

  • parent (object) – Parent table; None if this table is the root table

  • is_root (bool) – This table is the root table

  • is_combined (bool) – This table contains data collected from different paths

  • splitted (bool) – This table should be split

  • rolled_up (bool) – This table should be separated from its parent

  • columns (Mapping[str, Column]) – Columns extracted from schema for split version of this table

  • combined_columns (Mapping[str, Column]) – Columns extracted from schema for unsplit version of this table

  • additional_columns (Mapping[str, Column]) – Columns identified in dataset but not in schema

  • arrays (Mapping[str, int]) – Table array columns and maximum items (not the total count) in each array

  • titles (Mapping[str, str]) – All human-friendly column titles, extracted from the schema

  • child_tables (List[str]) – List of possible child tables

  • types (Mapping[str, List[str]]) – All paths matched to this table with corresponding object type on each path

  • preview_rows (Sequence[dict]) – Generated preview for split version of this table

  • preview_rows_combined (Sequence[dict]) – Generated preview for unsplit version of this table

  • array_columns (Mapping[str, Column]) –

  • array_positions (Mapping[str, str]) –

name: str#
path: List[str]#
total_rows: int = 0#
parent: object#
is_root: bool = False#
is_combined: bool = False#
splitted: bool = False#
rolled_up: bool = False#
columns: Mapping[str, Column]#
combined_columns: Mapping[str, Column]#
additional_columns: Mapping[str, Column]#
arrays: Mapping[str, int]#
titles: Mapping[str, str]#
child_tables: List[str]#
types: Mapping[str, List[str]]#
array_columns: Mapping[str, Column]#
array_positions: Mapping[str, str]#
preview_rows: Sequence[dict]#
preview_rows_combined: Sequence[dict]#
missing_rows(split=True)[source]#

Return the columns that are available in the schema, but not present in the analyzed data.

available_rows(split=True)[source]#

Return the columns that are available in the analyzed data.

filter_columns(filter)[source]#
add_array_column(col, path, abs_path, max)[source]#
add_column(path, item_type, title, *, propagated=False, additional=False, abs_path=None, header=[])[source]#

Add a new column to the table.

Parameters:
  • path – The column’s path

  • item_type – The column’s expected type

  • title – Column title

  • combined_only – Make this column available only in combined version of table

  • propagated – Add column to parent table

  • additional – Mark this column as missing in schema

  • abs_path – The column’s full JSON path

is_array(path)[source]#

Check whether the given path is in any table’s arrays.

inc_column(abs_path, path)[source]#

Increment the number of non-empty cells in the column.

Parameters:
  • abs_path – The column’s full JSON path

  • path – The column’s JSON path without array indexes

add_array(header)[source]#
set_array(header, item)[source]#

Try to set the maximum length of an array.

Parameters:
  • header – The path to the array

  • item – Array from data

Returns:

Whether the array is bigger than previously found and the length was updated
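The semantics can be sketched with a plain dict standing in for Table.arrays (a toy model, not the actual implementation):

```python
def toy_set_array(arrays, header, item):
    """Toy model of Table.set_array: remember the largest length seen
    for the array at `header`; return True only when the maximum grew."""
    length = len(item)
    if length > arrays.get(header, 0):
        arrays[header] = length
        return True
    return False

arrays = {}
toy_set_array(arrays, "/tender/items", [1, 2])  # records a maximum of 2
toy_set_array(arrays, "/tender/items", [1])     # shorter: maximum unchanged
```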

inc()[source]#

Increment the number of rows in the table.

set_preview_path(abs_path, path, value, max_items)[source]#
split(pointer)[source]#
spoonbill.spec.add_child_table(table, pointer, parent_key, key)[source]#

Create and append a new child table to the given table.

Parameters:
  • table – The parent table of the newly created table

  • pointer – Path to which table should match

  • parent_key – Field name of the new table's parent object, used to generate the table name

  • key – Field name of the new table's object, used to generate the table name

Returns:

Child table

Stats module#

class spoonbill.stats.DataPreprocessor(schema, root_tables, combined_tables=None, tables=None, table_threshold=5, total_items=0, language='en', multiple_values=False, pkg_type=None, with_preview=True)[source]#

Data analyzer

Processes the given schema and, based on this, extracts information from the iterable dataset.

Parameters:
  • schema (Mapping) – The dataset’s schema

  • root_tables (Mapping[str, List]) – The paths which should become root tables

  • combined_tables (Mapping[str, List]) – The paths which should become tables that combine data from different locations

  • tables (Mapping[str, Table]) – Use these tables objects instead of parsing the schema

  • table_threshold – The maximum array length, before it is recommended to split out a child table

  • total_items – The total number of objects processed

  • language – Language to use for the human-readable headings

name_check(parent_key, key)[source]#
guess_type(item)[source]#
init_tables(tables, is_combined=False)[source]#

Initialize the root tables with default fields.

is_base_table()[source]#
load_schema()[source]#
prepare_tables()[source]#
parse_schema()[source]#

Extract information from the schema.

add_column(pointer, typeset)[source]#
add_additional_table(pointer, abs_pointer, parent_key, key, item)[source]#
get_table(path)[source]#

Get the table that best matches the given path.

Parameters:

path – A path

Returns:

A table

add_preview_row(rows, item_id, parent_key)[source]#

Append a mostly-empty row to the previews.

This is important to do, because other code uses an index of -1 to access and update the current row.

Parameters:
  • rows – The Rows object

  • item_id – Object id

inc_table_rows(item, rows, parent_key, record)[source]#
is_new_row(pointer)[source]#
join_path(*args)[source]#
get_paths_for_combined_table(parent_key, key)[source]#
is_type_matched(pointer, item, item_type)[source]#
add_joinable_column(abs_pointer, pointer)[source]#
handle_array_expanded(pointer, item, abs_path, key)[source]#
is_array_col(abs_path)[source]#
clean_up_missing_arrays()[source]#
process_items(releases, with_preview=True)[source]#

Analyze releases.

Iterates over every release to calculate metrics and optionally generates previews for combined and split versions of each table.

Parameters:
  • releases – The releases to analyze

  • with_preview – Whether to generate previews for each table
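A stdlib-only sketch of the kind of metrics this pass gathers: total rows and the largest child array seen, with arrays past the threshold flagged as split candidates. This is a toy model, not the real analyzer, and the "tender/items" location is a hypothetical example.

```python
def toy_analyze(releases, threshold=5):
    """Toy model of the analysis pass: count rows and track the largest
    child array, flagging it for splitting past the threshold."""
    total_rows = 0
    max_items = 0
    for release in releases:
        total_rows += 1
        items = release.get("tender", {}).get("items", [])
        max_items = max(max_items, len(items))
    return {
        "total_rows": total_rows,
        "max_items": max_items,
        # mirrors table_threshold: long arrays suggest a child table
        "should_split": max_items > threshold,
    }

stats = toy_analyze([
    {"tender": {"items": [1, 2, 3]}},
    {"tender": {"items": []}},
])
```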

dump(path)[source]#

Dump the data processor’s state to a file.

Parameters:

path – Full path to file

classmethod restore(path)[source]#

Restore a data preprocessor’s state from a file.

Parameters:

path – Full path to file
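dump() and restore() together describe a save-and-reload cycle for the analyzer's state. A minimal stand-in using pickle (the actual on-disk format used by spoonbill is not specified in this reference):

```python
import os
import pickle
import tempfile

class ToyState:
    """Stand-in for the preprocessor's dump()/restore() pair."""
    def __init__(self, total_items=0):
        self.total_items = total_items

    def dump(self, path):
        # Persist the full state to a file
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def restore(cls, path):
        # Rebuild an instance from a previously dumped file
        obj = cls()
        with open(path, "rb") as f:
            obj.__dict__.update(pickle.load(f))
        return obj

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "state.pkl")
    ToyState(total_items=42).dump(path)
    restored = ToyState.restore(path)
```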

extend_table_types(pointer, item)[source]#

Check whether the path belongs to a table and extend that table's types.

Parameters:
  • pointer – Path to an item

  • item – Item being analyzed

Writer modules#

class spoonbill.writers.base_writer.BaseWriter(workdir, tables, options, schema)[source]#

Base writer class

__init__(workdir, tables, options, schema)[source]#
Parameters:
  • workdir – Working directory

  • tables – The table objects

  • options – Flattening options

get_headers(table, options)[source]#

Return a table’s headers, respecting the human and override options.

Parameters:
  • table – A table object

  • options – Flattening options

Returns:

Mapping between the machine-readable headers and the output headers
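The precedence this implies (user overrides first, then schema titles when pretty_headers is enabled, then the machine-readable name) can be sketched as follows; this is a toy model, not the writer's actual code, and the column paths are hypothetical:

```python
def toy_get_headers(columns, titles, pretty=False, overrides=None):
    """Toy model of header resolution: overrides win, then schema
    titles (when pretty is set), then the machine-readable name."""
    overrides = overrides or {}
    headers = {}
    for col in columns:
        if col in overrides:
            headers[col] = overrides[col]
        elif pretty and col in titles:
            headers[col] = titles[col]
        else:
            headers[col] = col
    return headers

headers = toy_get_headers(
    ["/tender/id", "/tender/value"],          # hypothetical column paths
    titles={"/tender/id": "Tender ID"},       # schema-derived title
    pretty=True,
    overrides={"/tender/value": "Value"},     # user-supplied override
)
```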

init_sheet(name, table)[source]#

Initialize a sheet, setting its headers and unique name.

In this context, the sheet might be either a CSV file or a sheet in an Excel workbook.

Parameters:
  • name – Table name

  • table – Table object

CSV#

class spoonbill.writers.csv.CSVWriter(workdir, tables, options, schema)[source]#

Writer class with output to CSV files.

For each table, a corresponding CSV file will be created.

name = 'csv'#
writerow(table, row)[source]#

Write a row to the output file.

XLSX#

class spoonbill.writers.xlsx.XlsxWriter(workdir, tables, options, schema, filename='result.xlsx')[source]#

Writer class with output to XLSX files.

For each table, a corresponding sheet in an Excel workbook will be created.

name = 'xlsx'#
writerow(table, row)[source]#

Write a row to the output file.