Extraction of tables from PDF documents
A lot of data exists in PDF only
At Canopy, our first preference is to get data through an API (or data feed) from the custodian.
Unfortunately, a large number of custodians (especially in Europe and Asia) are not yet able to provide investment data through APIs and only issue their regular monthly statements, on paper or as PDFs.
Therefore, to get data from banks that do not offer APIs, Canopy has developed the ability to take these monthly bank statements directly as a data source. We prefer electronically generated PDFs, or ePDFs (i.e. the ones downloaded from the bank's website), but can also handle Print PDFs (i.e. scans of paper statements).
Interestingly, about 86% of investment data in Europe and Asia is available in PDF format only (the figure is around 15% for North America).
![191123 Comparison of Canopy PDF to Excel with Other Commerical Applications.png 1280](https://files.readme.io/53086e7-191123_Comparison_of_Canopy_PDF_to_Excel_with_Other_Commerical_Applications.png)
Large chunks of data are available only in PDF format
Bank statements have very complex tables
Multilayer column headers and nesting are the key issues
![Picture4.png 1350](https://files.readme.io/fb43c4b-Picture4.png)
A typical private bank statement
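To make the multi-layer header problem concrete, here is a minimal Python sketch of one common way to collapse stacked header rows into a single label per column. It is an illustration only and not a description of Canopy's extraction method: the function name `flatten_headers`, the separator, and the sample column names are all hypothetical, and it assumes that a blank cell in an upper header row continues the group header to its left.

```python
from typing import List


def flatten_headers(header_rows: List[List[str]], sep: str = " / ") -> List[str]:
    """Collapse stacked header rows into one label per column.

    Assumption: a blank cell in any row except the last continues the
    nearest non-blank cell to its left (a spanning group header).
    """
    n_cols = max(len(row) for row in header_rows)
    filled: List[List[str]] = []

    # Forward-fill the group rows so each column knows its parent header.
    for row in header_rows[:-1]:
        padded = row + [""] * (n_cols - len(row))
        current = ""
        out = []
        for cell in padded:
            if cell.strip():
                current = cell.strip()
            out.append(current)
        filled.append(out)

    # The last row holds the leaf labels and is not forward-filled.
    last = header_rows[-1] + [""] * (n_cols - len(header_rows[-1]))
    filled.append([cell.strip() for cell in last])

    # Join the non-empty parts of each column from top to bottom.
    labels = []
    for col in range(n_cols):
        parts = [row[col] for row in filled if row[col]]
        labels.append(sep.join(parts))
    return labels


# Hypothetical two-layer header of the kind seen in valuation tables.
example_rows = [
    ["", "Market value", "", "Unrealised P&L", ""],
    ["Security", "Local", "USD", "Local", "USD"],
]
flat = flatten_headers(example_rows)
```

With the sample rows, `flat` contains `Security`, `Market value / Local`, `Market value / USD`, `Unrealised P&L / Local` and `Unrealised P&L / USD`. The forward-fill rule is a simplification: it cannot tell a genuinely ungrouped column from a continued group, which is part of what makes real statements hard.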
Benchmarking of Canopy PDF Extraction to Adobe Acrobat
![Slide14.png 2500](https://files.readme.io/50c05a5-Slide14.png)
Canopy only extracts the relevant tables from the PDF document
![Slide15.png 2500](https://files.readme.io/0d7d87c-Slide15.png)
Cells do not get merged in Canopy's extraction of data from PDF
![Slide16.png 2500](https://files.readme.io/a022fdf-Slide16.png)
Tables breaking across pages is not an issue
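As a rough illustration of what handling a page break involves, the sketch below joins per-page table fragments into one table and drops header rows that are repeated at the top of each continuation page. This is a hypothetical example under simple assumptions, not Canopy's implementation: `stitch_fragments` and the sample data are made up, and it assumes each fragment is already a list of rows with the same column order.

```python
from typing import List

Row = List[str]


def _normalise(row: Row) -> Row:
    """Compare rows case-insensitively and ignoring surrounding spaces."""
    return [cell.strip().lower() for cell in row]


def stitch_fragments(fragments: List[List[Row]]) -> List[Row]:
    """Join per-page table fragments into a single table.

    The first row of the first non-empty fragment is taken as the header.
    A later fragment's first row is dropped when it repeats that header.
    """
    non_empty = [frag for frag in fragments if frag]
    if not non_empty:
        return []

    header = non_empty[0][0]
    table = [header]
    table.extend(non_empty[0][1:])

    for frag in non_empty[1:]:
        # Skip the first row only if it is a repeat of the header.
        start = 1 if _normalise(frag[0]) == _normalise(header) else 0
        table.extend(frag[start:])
    return table


# Hypothetical fragments of one holdings table split over two pages.
page_1 = [["Security", "Quantity"], ["Bond A", "100"]]
page_2 = [["SECURITY", "Quantity "], ["Bond B", "250"], ["Fund C", "40"]]
stitched = stitch_fragments([page_1, page_2])
```

Here `stitched` has one header row followed by the three data rows. Real statements add further complications that this sketch ignores, such as subtotal rows, "carried forward" lines, and rows whose text wraps across the page break.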
![Slide17.png 2500](https://files.readme.io/1db320b-Slide17.png)
Multiple tables on the same page is also not an issue
![Slide18.png 2500](https://files.readme.io/7e8ab92-Slide18.png)
Alignment does not go haywire
![Slide19.png 2500](https://files.readme.io/c83985d-Slide19.png)
Multi-layer headers are not an issue