How does Canopy Extract work?

All the data we need is in table format

Our data is invariably in table format. Typically we need to extract the following three tables from each PDF document:

  • Holdings
  • Transactions
  • Current Account Credits and Debits

Canopy Extract is designed to extract any table (not just the three tables above) from any PDF document. If you need to extract charts or images from a PDF document, Canopy Extract is not for you.

Extract needs the PDF document and an Excel Configuration file

To work, Canopy Extract needs two files:

  • PDF document to be extracted (e-PDF is preferred, but paper scans will also work)
  • Excel Configuration File (which describes the table to be extracted)
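As a rough illustration of the two-file requirement above, the inputs can be thought of as a pair that is checked before any extraction starts. This is only a sketch: `ExtractJob`, its field names, and the accepted file extensions are assumptions made for this example and are not part of Canopy Extract itself.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExtractJob:
    """Hypothetical container for the two inputs an extraction needs."""

    pdf_path: Path      # the PDF document to be extracted
    config_path: Path   # the Excel Configuration File describing the table

    def problems(self) -> list[str]:
        """Return human-readable problems found in the inputs (by extension only)."""
        issues = []
        if self.pdf_path.suffix.lower() != ".pdf":
            issues.append(f"{self.pdf_path.name} is not a PDF document")
        # Assumption: the configuration is an .xlsx or .xls workbook.
        if self.config_path.suffix.lower() not in {".xlsx", ".xls"}:
            issues.append(f"{self.config_path.name} is not an Excel file")
        return issues


job = ExtractJob(Path("statement.pdf"), Path("holdings_config.xlsx"))
print(job.problems())
```

The check looks only at file names, so it cannot tell an e-PDF from a paper scan or confirm that the configuration matches the table.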


What does a typical PDF document look like?

Multilayer headers and nesting are the main difficulties when extracting data from a PDF table.


Typical table in a Bank Statement
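To make the multilayer header problem concrete, the sketch below flattens two stacked header rows into one label per column. It is an illustration only, not how Canopy Extract handles headers internally: the function `flatten_headers`, the sample column names, and the rule that a blank upper cell belongs to the nearest header on its left are all assumptions.

```python
def flatten_headers(header_rows):
    """Merge stacked header rows into a single label per column.

    Assumption: a blank cell in an upper row is covered by the nearest
    non-blank cell to its left (i.e. a merged parent header).
    """
    n_cols = max(len(row) for row in header_rows)
    filled = []

    # Forward-fill every row except the bottom one.
    for row in header_rows[:-1]:
        padded = list(row) + [""] * (n_cols - len(row))
        last_seen = ""
        filled_row = []
        for cell in padded:
            cell = cell.strip()
            if cell:
                last_seen = cell
            filled_row.append(last_seen)
        filled.append(filled_row)

    # The bottom row holds the leaf headers and is kept as-is.
    bottom = [c.strip() for c in header_rows[-1]]
    filled.append(bottom + [""] * (n_cols - len(bottom)))

    # Join the non-empty parts of each column from top to bottom.
    return [" / ".join(part for part in column if part) for column in zip(*filled)]


rows = [
    ["Security", "Market Value", "", "Cost", ""],
    ["", "Local", "USD", "Local", "USD"],
]
print(flatten_headers(rows))
```

The forward-fill rule is deliberately naive: a column that has no parent header but sits to the right of one would be labelled incorrectly, which is one reason a per-table description of the layout is needed.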

What does an Excel Configuration file look like?

The Excel Configuration file for the table above is given below. Further details are on the page "Parts of a Config File".


Excel Configuration file to extract the Holdings table in the image above