API Reference#

Flatten module#

class spoonbill.flatten.TableFlattenConfig(split, pretty_headers=False, headers=<factory>, repeat=<factory>, unnest=<factory>, only=<factory>, name='')[source]#

Table-specific flattening configuration

Parameters:
  • split (bool) – Split child arrays to separate tables

  • pretty_headers (bool) – Use human-friendly headers extracted from the schema

  • headers (Mapping[str, str]) – User-edited headers that override the automatically extracted ones

  • unnest (List[str]) – List of columns to output from child to parent table

  • repeat (List[str]) – List of columns to clone in child tables

  • only (List[str]) – List of columns to output

  • name (str) – Overwrite table name

split: bool#
pretty_headers: bool = False#
headers: Mapping[str, str]#
repeat: List[str]#
unnest: List[str]#
only: List[str]#
name: str = ''#
class spoonbill.flatten.FlattenOptions(selection, exclude=<factory>, count=False)[source]#

Flattening configuration

Parameters:
  • selection (Mapping[str, TableFlattenConfig]) – Mapping of selected tables to extract from the data

  • count (bool) – Include the number of rows of each child table in its parent table

  • exclude (List[str]) – List of tables to exclude from export

selection: Mapping[str, TableFlattenConfig]#
exclude: List[str]#
count: bool = False#
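Taken together, FlattenOptions and TableFlattenConfig encode a nested structure. A minimal sketch of that shape as plain Python dicts (the table name "tenders" and the column paths are illustrative assumptions, not taken from this reference):

```python
# Plain-dict mirror of FlattenOptions(selection={...}, exclude=[], count=True).
# The table name "tenders" and the column paths are hypothetical.
options = {
    "selection": {
        "tenders": {                       # one TableFlattenConfig per table
            "split": True,                 # split child arrays to separate tables
            "pretty_headers": True,        # use human-friendly schema titles
            "headers": {"/tender/id": "Tender ID"},  # user-edited overrides
            "repeat": ["/tender/id"],      # columns cloned into child tables
            "unnest": [],                  # child columns to pull into the parent
            "only": [],                    # restrict the output columns (empty here)
            "name": "",                    # keep the autogenerated table name
        }
    },
    "exclude": [],                         # tables to leave out of the export
    "count": True,                         # add child-row counts to parents
}
```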
class spoonbill.flatten.Flattener(options, tables, language='en')[source]#

Data flattener

In order to export data correctly, the Flattener requires previously analyzed table data. During the process, the flattener may add columns that are not based on schema analysis, such as itemsCount. Depending on the table type, the flattener will always add a few autogenerated columns to every generated row. For the root table:
  • rowID

  • id

  • ocid

For child tables, this list will be extended with a parentID column.

Parameters:
  • options (FlattenOptions) – Flattening options

  • tables (Mapping[str, Table]) – Analyzed tables data

  • language – Language to use for the human-readable headings

init_table_selection(tables)[source]#
init_child_tables(tables, table, options)[source]#
init_map(map, paths, table, only=None, target=None)[source]#
init_table_lookup(tables, table, target=None)[source]#
init_count(table, options)[source]#
init_unnest(table, options)[source]#
init_repeat(table, options)[source]#
init_options(tables)[source]#
init_only(table, only, split)[source]#
get_table(pointer)[source]#
flatten(releases)[source]#

Flatten releases

Parameters:

releases – releases as iterable object

Returns:

Iterator over mapping between table name and list of rows for each release
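The shape of the yielded mapping can be illustrated with a stdlib-only toy sketch. This is not spoonbill's implementation: the table names "releases" and "tenders_items" and the child array's location are assumptions, while rowID, id, ocid, parentID, and itemsCount are the autogenerated columns described above.

```python
def toy_flatten(releases):
    """Yield one {table_name: rows} mapping per release, mimicking the
    shape of Flattener.flatten() output described above."""
    for row_id, release in enumerate(releases):
        items = release.get("tender", {}).get("items", [])
        root_row = {
            "rowID": row_id,
            "id": release.get("id"),
            "ocid": release.get("ocid"),
            "itemsCount": len(items),  # autogenerated count column
        }
        child_rows = [
            {"rowID": f"{row_id}.{i}", "parentID": row_id}  # link back to parent
            for i, _ in enumerate(items)
        ]
        yield {"releases": [root_row], "tenders_items": child_rows}

rows = list(toy_flatten([
    {"id": "1", "ocid": "ocds-1", "tender": {"items": [{"description": "x"}]}}
]))
```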

CLI module#

cli.py - Command-line interface routines

class spoonbill.cli.CommaSeparated[source]#

Click option type that converts a comma-separated string into a list

name: str = 'comma'#

the descriptive name of this type

convert(value, param, ctx)[source]#

Convert the value to the correct type. This is not called if the value is None (the missing value).

This must accept string values from the command line, as well as values that are already the correct type. It may also convert other compatible types.

The param and ctx arguments may be None in certain situations, such as when converting prompt input.

If the value cannot be converted, call fail() with a descriptive message.

Parameters:
  • value – The value to convert.

  • param – The parameter that is using this type to convert its value. May be None.

  • ctx – The current context that arrived at this value. May be None.
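Stripped of Click, the conversion amounts to splitting on commas while passing through values that are already lists. A toy sketch (the whitespace stripping is an assumption, not documented behavior):

```python
def convert_comma_separated(value):
    """Toy version of CommaSeparated.convert: split a comma-separated
    string into a list; pass through values that are already lists."""
    if value is None:
        return None  # Click never calls convert() for a missing value
    if isinstance(value, list):
        return value
    # Stripping surrounding whitespace is an assumption for readability
    return [part.strip() for part in str(value).split(",")]
```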

spoonbill.cli.read_option_file(option, option_file)[source]#
spoonbill.cli.get_selected_tables(base, selection)[source]#

Spec module#

class spoonbill.spec.Column(id, path, title, type, hits=0, header=<factory>)[source]#

A container for column information.

Parameters:
  • id (str) – The JSON path without indexes

  • path (str) – The JSON path with indexes

  • title (str) – The human-friendly title

  • type (str) – The expected type

  • hits (int) – The number of times the column contains data during analysis

  • header (list) –

id: str#
path: str#
title: str#
type: str#
hits: int = 0#
header: list#
class spoonbill.spec.Table(name, path, total_rows=0, parent=<factory>, is_root=False, is_combined=False, splitted=False, rolled_up=False, columns=<factory>, combined_columns=<factory>, additional_columns=<factory>, arrays=<factory>, titles=<factory>, child_tables=<factory>, types=<factory>, array_columns=<factory>, array_positions=<factory>, preview_rows=<factory>, preview_rows_combined=<factory>)[source]#

A container for table information.

Parameters:
  • name (str) – Table name

  • path (List[str]) – List of paths to gather data to this table

  • total_rows (int) – Total available rows in this table

  • parent (object) – Parent table; None if this table is the root table

  • is_root (bool) – This table is the root table

  • is_combined (bool) – This table contains data collected from different paths

  • splitted (bool) – This table should be split

  • rolled_up (bool) – This table should be separated from its parent

  • columns (Mapping[str, Column]) – Columns extracted from schema for split version of this table

  • combined_columns (Mapping[str, Column]) – Columns extracted from schema for unsplit version of this table

  • additional_columns (Mapping[str, Column]) – Columns identified in dataset but not in schema

  • arrays (Mapping[str, int]) – Table array columns and maximum items (not the total count) in each array

  • titles (Mapping[str, str]) – All human-friendly column titles, extracted from the schema

  • child_tables (List[str]) – List of possible child tables

  • types (Mapping[str, List[str]]) – All paths matched to this table with corresponding object type on each path

  • preview_rows (Sequence[dict]) – Generated preview for split version of this table

  • preview_rows_combined (Sequence[dict]) – Generated preview for unsplit version of this table

  • array_columns (Mapping[str, Column]) –

  • array_positions (Mapping[str, str]) –

name: str#
path: List[str]#
total_rows: int = 0#
parent: object#
is_root: bool = False#
is_combined: bool = False#
splitted: bool = False#
rolled_up: bool = False#
columns: Mapping[str, Column]#
combined_columns: Mapping[str, Column]#
additional_columns: Mapping[str, Column]#
arrays: Mapping[str, int]#
titles: Mapping[str, str]#
child_tables: List[str]#
types: Mapping[str, List[str]]#
array_columns: Mapping[str, Column]#
array_positions: Mapping[str, str]#
preview_rows: Sequence[dict]#
preview_rows_combined: Sequence[dict]#
missing_rows(split=True)[source]#

Return the columns that are available in the schema, but not present in the analyzed data.

available_rows(split=True)[source]#

Return the columns that are available in the analyzed data.

filter_columns(filter)[source]#
add_array_column(col, path, abs_path, max)[source]#
add_column(path, item_type, title, *, propagated=False, additional=False, abs_path=None, header=[])[source]#

Add a new column to the table.

Parameters:
  • path – The column’s path

  • item_type – The column’s expected type

  • title – Column title

  • combined_only – Make this column available only in combined version of table

  • propagated – Add column to parent table

  • additional – Mark this column as missing in schema

  • abs_path – The column’s full JSON path

is_array(path)[source]#

Check whether the given path is in any table’s arrays.

inc_column(abs_path, path)[source]#

Increment the number of non-empty cells in the column.

Parameters:
  • abs_path – The column’s full JSON path

  • path – The column’s JSON path without array indexes

add_array(header)[source]#
set_array(header, item)[source]#

Try to set the maximum length of an array.

Parameters:
  • header – The path to the array

  • item – Array from data

Returns:

Whether the array is bigger than previously found and the length was updated
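The semantics can be sketched with a plain dict standing in for Table.arrays (a toy model, not the actual implementation):

```python
def toy_set_array(arrays, header, item):
    """Toy model of Table.set_array: remember the largest length seen
    for the array at `header`; return True only when the maximum grew."""
    length = len(item)
    if length > arrays.get(header, 0):
        arrays[header] = length
        return True
    return False

arrays = {}
toy_set_array(arrays, "/tender/items", [1, 2])  # records a maximum of 2
toy_set_array(arrays, "/tender/items", [1])     # shorter: maximum unchanged
```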

inc()[source]#

Increment the number of rows in the table.

set_preview_path(abs_path, path, value, max_items)[source]#
split(pointer)[source]#
spoonbill.spec.add_child_table(table, pointer, parent_key, key)[source]#

Create and append a new child table to the given table.

Parameters:
  • table – The parent table of the newly created table

  • pointer – Path to which table should match

  • parent_key – Field name of the new table's parent object, used to generate the table name

  • key – Field name of the new table's object, used to generate the table name

Returns:

Child table

Stats module#

class spoonbill.stats.DataPreprocessor(schema, root_tables, combined_tables=None, tables=None, table_threshold=5, total_items=0, language='en', multiple_values=False, pkg_type=None, with_preview=True)[source]#

Data analyzer

Processes the given schema and, based on this, extracts information from the iterable dataset.

Parameters:
  • schema (Mapping) – The dataset’s schema

  • root_tables (Mapping[str, List]) – The paths which should become root tables

  • combined_tables (Mapping[str, List]) – The paths which should become tables that combine data from different locations

  • tables (Mapping[str, Table]) – Use these tables objects instead of parsing the schema

  • table_threshold – The maximum array length, before it is recommended to split out a child table

  • total_items – The total number of objects processed

  • language – Language to use for the human-readable headings

name_check(parent_key, key)[source]#
guess_type(item)[source]#
init_tables(tables, is_combined=False)[source]#

Initialize the root tables with default fields.

is_base_table()[source]#
load_schema()[source]#
prepare_tables()[source]#
parse_schema()[source]#

Extract information from the schema.

add_column(pointer, typeset)[source]#
add_additional_table(pointer, abs_pointer, parent_key, key, item)[source]#
get_table(path)[source]#

Get the table that best matches the given path.

Parameters:

path – A path

Returns:

A table

add_preview_row(rows, item_id, parent_key)[source]#

Append a mostly-empty row to the previews.

This is important to do, because other code uses an index of -1 to access and update the current row.

Parameters:
  • rows – The Rows object

  • item_id – Object id

inc_table_rows(item, rows, parent_key, record)[source]#
is_new_row(pointer)[source]#
join_path(*args)[source]#
get_paths_for_combined_table(parent_key, key)[source]#
is_type_matched(pointer, item, item_type)[source]#
add_joinable_column(abs_pointer, pointer)[source]#
handle_array_expanded(pointer, item, abs_path, key)[source]#
is_array_col(abs_path)[source]#
clean_up_missing_arrays()[source]#
process_items(releases, with_preview=True)[source]#

Analyze releases.

Iterates over every release to calculate metrics and optionally generates previews for combined and split versions of each table.

Parameters:
  • releases – The releases to analyze

  • with_preview – Whether to generate previews for each table
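A stdlib-only sketch of the kind of metrics this pass gathers: total rows and the largest child array seen, with arrays past the threshold flagged as split candidates. This is a toy model, not the real analyzer, and the "tender/items" location is a hypothetical example.

```python
def toy_analyze(releases, threshold=5):
    """Toy model of the analysis pass: count rows and track the largest
    child array, flagging it for splitting past the threshold."""
    total_rows = 0
    max_items = 0
    for release in releases:
        total_rows += 1
        items = release.get("tender", {}).get("items", [])
        max_items = max(max_items, len(items))
    return {
        "total_rows": total_rows,
        "max_items": max_items,
        # mirrors table_threshold: long arrays suggest a child table
        "should_split": max_items > threshold,
    }

stats = toy_analyze([
    {"tender": {"items": [1, 2, 3]}},
    {"tender": {"items": []}},
])
```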

dump(path)[source]#

Dump the data processor’s state to a file.

Parameters:

path – Full path to file

classmethod restore(path)[source]#

Restore a data preprocessor’s state from a file.

Parameters:

path – Full path to file
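dump() and restore() together describe a save-and-reload cycle for the analyzer's state. A minimal stand-in using pickle (the actual on-disk format used by spoonbill is not specified in this reference):

```python
import os
import pickle
import tempfile

class ToyState:
    """Stand-in for the preprocessor's dump()/restore() pair."""
    def __init__(self, total_items=0):
        self.total_items = total_items

    def dump(self, path):
        # Persist the full state to a file
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def restore(cls, path):
        # Rebuild an instance from a previously dumped file
        obj = cls()
        with open(path, "rb") as f:
            obj.__dict__.update(pickle.load(f))
        return obj

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "state.pkl")
    ToyState(total_items=42).dump(path)
    restored = ToyState.restore(path)
```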

extend_table_types(pointer, item)[source]#

Check whether the path belongs to a table and extend that table's types.

Parameters:
  • pointer – Path to an item

  • item – Item being analyzed

Writer modules#

class spoonbill.writers.base_writer.BaseWriter(workdir, tables, options, schema)[source]#

Base writer class

__init__(workdir, tables, options, schema)[source]#
Parameters:
  • workdir – Working directory

  • tables – The table objects

  • options – Flattening options

get_headers(table, options)[source]#

Return a table’s headers, respecting the human and override options.

Parameters:
  • table – A table object

  • options – Flattening options

Returns:

Mapping between the machine-readable headers and the output headers
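The precedence this implies (user overrides first, then schema titles when pretty_headers is enabled, then the machine-readable name) can be sketched as follows; this is a toy model, not the writer's actual code, and the column paths are hypothetical:

```python
def toy_get_headers(columns, titles, pretty=False, overrides=None):
    """Toy model of header resolution: overrides win, then schema
    titles (when pretty is set), then the machine-readable name."""
    overrides = overrides or {}
    headers = {}
    for col in columns:
        if col in overrides:
            headers[col] = overrides[col]
        elif pretty and col in titles:
            headers[col] = titles[col]
        else:
            headers[col] = col
    return headers

headers = toy_get_headers(
    ["/tender/id", "/tender/value"],          # hypothetical column paths
    titles={"/tender/id": "Tender ID"},       # schema-derived title
    pretty=True,
    overrides={"/tender/value": "Value"},     # user-supplied override
)
```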

init_sheet(name, table)[source]#

Initialize a sheet, setting its headers and unique name.

In this context, the sheet might be either a CSV file or a sheet in an Excel workbook.

Parameters:
  • name – Table name

  • table – Table object

CSV#

class spoonbill.writers.csv.CSVWriter(workdir, tables, options, schema)[source]#

Writer class with output to CSV files.

For each table, a corresponding CSV file will be created.

name = 'csv'#
writerow(table, row)[source]#

Write a row to the output file.

XLSX#

class spoonbill.writers.xlsx.XlsxWriter(workdir, tables, options, schema, filename='result.xlsx')[source]#

Writer class with output to XLSX files.

For each table, a corresponding sheet in an Excel workbook will be created.

name = 'xlsx'#
writerow(table, row)[source]#

Write a row to the output file.