Library Usage#

File Analyzer#

Analyze file#

To create an analyzer object, use:

from spoonbill import FileAnalyzer
from spoonbill.common import ROOT_TABLES, COMBINED_TABLES

analyzer = FileAnalyzer(
    '.',
    schema=path_to_schema,
    root_tables=ROOT_TABLES,
    combined_tables=COMBINED_TABLES,
    language='en',
    table_threshold=5,
)

To analyze a file and track progress, use:

for bytes_read, count in analyzer.analyze_file(path_to_file):
    print(f'analyzed {count} ({bytes_read})')
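The loop above receives a pair of values on each step. The following is a minimal sketch of turning those pairs into a percentage. It assumes that `bytes_read` is the cumulative number of bytes consumed so far (the name suggests this, but it is an assumption here). `report_progress` is a hypothetical helper, not part of spoonbill, and a plain list stands in for `analyzer.analyze_file(path_to_file)` so the sketch runs on its own.

```python
def report_progress(steps, total_bytes):
    """Yield (percent, count) for each (bytes_read, count) pair in `steps`.

    `steps` is any iterable shaped like the pairs produced by
    analyzer.analyze_file(); this helper is hypothetical, not spoonbill API.
    """
    for bytes_read, count in steps:
        if total_bytes <= 0:
            # Empty file: nothing left to read, report it as complete.
            percent = 100.0
        else:
            # Clamp to 100% in case the reader overshoots the reported size.
            percent = min(100.0, 100.0 * bytes_read / total_bytes)
        yield percent, count


# Stand-in for analyzer.analyze_file(path_to_file).
# In real use, total_bytes would come from os.path.getsize(path_to_file).
fake_steps = [(4, 1), (8, 2), (10, 3)]
progress = list(report_progress(fake_steps, total_bytes=10))

for percent, count in progress:
    print(f"analyzed {count} items ({percent:.0f}%)")
```

Keeping the percentage calculation separate from the analyzer makes it easy to feed the same values into a progress bar or a log line.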

Storing state#

To dump the analyzer state to a file after analysis, use:

analyzer.dump_to_file('analyzed.state')

Note

This file can be reused to initialize a new analyzer instance, which lets you skip the analysis step when flattening the same file multiple times.

To restore an analyzer from a saved state file, use:

analyzer = FileAnalyzer('.', state_file='analyzed.state')
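One way to combine dumping and restoring is to reuse the state file when it exists and analyze only when it does not. The sketch below shows that pattern under stated assumptions: `load_or_analyze` and its two callables are hypothetical wiring, and `FakeAnalyzer` is a stand-in so the example runs without spoonbill installed. The only method it relies on from the text above is `dump_to_file`.

```python
import os
import tempfile


def load_or_analyze(state_path, restore, analyze):
    """Return (analyzer, reused), reusing `state_path` when it already exists.

    restore(state_path) should build an analyzer from a state file, e.g.
        lambda p: FileAnalyzer('.', state_file=p)
    analyze() should build a fresh analyzer and run the analysis on it.
    This helper and both callables are hypothetical, not spoonbill API.
    """
    if os.path.exists(state_path):
        return restore(state_path), True
    analyzer = analyze()
    # Persist the state so the next call can skip the analysis step.
    analyzer.dump_to_file(state_path)
    return analyzer, False


class FakeAnalyzer:
    """Stand-in analyzer that only knows how to write a state file."""

    def dump_to_file(self, path):
        with open(path, "w") as fd:
            fd.write("state")


with tempfile.TemporaryDirectory() as workdir:
    state = os.path.join(workdir, "analyzed.state")
    # First call: no state file yet, so the analysis branch runs.
    _, reused_first = load_or_analyze(state, lambda p: FakeAnalyzer(), FakeAnalyzer)
    # Second call: the state file exists, so it is restored instead.
    _, reused_second = load_or_analyze(state, lambda p: FakeAnalyzer(), FakeAnalyzer)

print(reused_first, reused_second)
```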

Flattener#

Flattening options#

To create flattening options that extract only one table (for example, tenders) and split it where possible, use:

from spoonbill.flatten import FlattenOptions

options = FlattenOptions(**{"selection": {"tenders": {"split": True}}})

To select multiple tables (for example, tenders and parties), use:

from spoonbill.flatten import FlattenOptions

options = FlattenOptions(**{
    "selection": {
        "tenders": {"split": True},
        "parties": {"split": True}
    }
})
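When the list of tables is only known at run time, the same dictionary can be built programmatically. This is a minimal sketch: `build_selection` is a hypothetical helper, not part of spoonbill, and it only produces the plain dictionary shape used in the examples above.

```python
def build_selection(table_names, split=True):
    """Build keyword arguments shaped like the FlattenOptions examples above.

    Hypothetical helper: every table gets its own {"split": ...} dict.
    """
    return {"selection": {name: {"split": split} for name in table_names}}


kwargs = build_selection(["tenders", "parties"])
print(kwargs)

# With spoonbill installed, the result would be passed on as:
#     options = FlattenOptions(**kwargs)
```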

Flatten file#

To flatten a file, use:

from spoonbill import FileFlattener

flattener = FileFlattener(
    '.',
    options,
    analyzer,
    csv=True,  # Generate CSV files
    xlsx=True,  # Generate XLSX files
    language='en',
)

for count in flattener.flatten_file(path_to_file):
    print(f'Flattened {count} items')

Note

The flattening routine requires the data to be analyzed beforehand.