
Help creating an automated workflow - extract data from .pdfs

RakeshS
RakeshS Posts: 27 Forumite
Fourth Anniversary 10 Posts
edited 18 December 2023 at 6:31AM in Techie Stuff
Hi all,
This is a very specific question, but I hope someone may be able to help with finding a solution.

I receive PDFs at work, and I need to extract some of the information contained in them, then create a new PDF with only that data included.
In an ideal world, my workflow would go something like this:
1. I email the original PDF to an email address/inbox
2. Data is automatically extracted from PDF.
3. A new PDF is compiled with only the extracted information, discarding the rest
4. The new PDF is emailed back to me.

Does anyone have any suggestions as to how this could be carried out?

Thanks in advance for any help you can provide.

R.

Comments

  • wongataa
    wongataa Posts: 2,676 Forumite
    Part of the Furniture 1,000 Posts Name Dropper
    It would be heavily dependent on what info you want to extract from the PDF. What info are you trying to extract?
  • Neil_Jones
    Neil_Jones Posts: 9,451 Forumite
    Part of the Furniture 1,000 Posts Name Dropper
    RakeshS said:
    Hi all,
    This is a very specific question, but I hope someone may be able to help with finding a solution.

    I receive PDFs at work, and I need to extract some of the information contained in them, then create a new PDF with only that data included.
    In an ideal world, my workflow would go something like this:
    1. I email the original PDF to an email address/inbox
    2. Data is automatically extracted from PDF.
    3. A new PDF is compiled with only the extracted information, discarding the rest
    4. The new PDF is emailed back to me.

    Does anyone have any suggestions as to how this could be carried out?

    Thanks in advance for any help you can provide.

    R.

    Wouldn't it be far easier to just ask whoever's emailing you just to cut the crap and only send the information you want in the first place?
  • Hi,

    Wouldn't it be far easier to just ask whoever's emailing you just to cut the crap and only send the information you want in the first place?
    is that a technical term?

  • Andy_L
    Andy_L Posts: 12,964 Forumite
    Part of the Furniture 10,000 Posts Name Dropper
    RakeshS said:
    Hi all,
    This is a very specific question, but I hope someone may be able to help with finding a solution.

    I receive PDFs at work, and I need to extract some of the information contained in them, then create a new PDF with only that data included.
    In an ideal world, my workflow would go something like this:
    1. I email the original PDF to an email address/inbox
    2. Data is automatically extracted from PDF.
    3. A new PDF is compiled with only the extracted information, discarding the rest
    4. The new PDF is emailed back to me.

    Does anyone have any suggestions as to how this could be carried out?

    Thanks in advance for any help you can provide.

    R.

    Wouldn't it be far easier to just ask whoever's emailing you just to cut the crap and only send the information you want in the first place?
    and in a more usable format than PDF?
  • DullGreyGuy
    DullGreyGuy Posts: 16,220 Forumite
    10,000 Posts Second Anniversary Name Dropper
    RakeshS said:
    This is a very specific question, but I hope someone may be able to help with finding a solution.

    I receive PDFs at work, and I need to extract some of the information contained in them, then create a new PDF with only that data included.
    In an ideal world, my workflow would go something like this:
    1. I email the original PDF to an email address/inbox
    2. Data is automatically extracted from PDF.
    3. A new PDF is compiled with only the extracted information, discarding the rest
    4. The new PDF is emailed back to me.

    Does anyone have any suggestions as to how this could be carried out?
    Firstly, if this is work, are they going to be happy with your mailing the PDFs to some unknown third party?

    There is plenty of automation software that can do this for you, depending on how consistent the forms are and if they are completed with type or hand but they aren't cheap!
  • RakeshS
    RakeshS Posts: 27 Forumite
    Fourth Anniversary 10 Posts
    edited 18 December 2023 at 1:21PM
    RakeshS said:
    This is a very specific question, but I hope someone may be able to help with finding a solution.

    I receive PDFs at work, and I need to extract some of the information contained in them, then create a new PDF with only that data included.
    In an ideal world, my workflow would go something like this:
    1. I email the original PDF to an email address/inbox
    2. Data is automatically extracted from PDF.
    3. A new PDF is compiled with only the extracted information, discarding the rest
    4. The new PDF is emailed back to me.

    Does anyone have any suggestions as to how this could be carried out?
    Firstly, if this is work, are they going to be happy with your mailing the PDFs to some unknown third party?

    There is plenty of automation software that can do this for you, depending on how consistent the forms are and if they are completed with type or hand but they aren't cheap!
    The forms are pretty consistent, at least in terms of the information they contain. We receive these forms from 5-6 companies (our clients) and although the structure/layout of the forms varies from company to company, all the forms from a given company will be structured identically.

    The cost will be the determining factor. If it saves us a lot of manual data input, it may be worthwhile. If I have an idea of solutions that will work, I can then weigh up the cost against the labour saved.

    Do you know of any services/tools that could do this?
  • wongataa said:
    It would be heavily dependent on what info you want to extract from the PDF. What info are you trying to extract?
    contact information, job descriptions, sender info.
    We're a property maintenance/handyman company, so it's all instructions that have to be scraped from agents' instruction forms and transposed into our system.

    RakeshS said:
    Hi all,
    This is a very specific question, but I hope someone may be able to help with finding a solution.

    I receive PDFs at work, and I need to extract some of the information contained in them, then create a new PDF with only that data included.
    In an ideal world, my workflow would go something like this:
    1. I email the original PDF to an email address/inbox
    2. Data is automatically extracted from PDF.
    3. A new PDF is compiled with only the extracted information, discarding the rest
    4. The new PDF is emailed back to me.

    Does anyone have any suggestions as to how this could be carried out?

    Thanks in advance for any help you can provide.

    R.

    Wouldn't it be far easier to just ask whoever's emailing you just to cut the crap and only send the information you want in the first place?
    They're all very large companies, already working with small contractors such as ourselves. There's very little chance that they'd adjust their way of working to satisfy us, unfortunately.


  • I've got a lot of systems doing that type of thing at work via UiPath (RPA automation software). It processes 30 different PDF layouts from 30 e-mails, with ~10,000 pages in total, in 20 minutes: it extracts the relevant data and loads it into a spreadsheet, splits all the files into separate one-page documents, sorts the output files based on certain text on each page (company names, document descriptions), renames the files accordingly and places them in specific folders.

    It is for commercial purposes though: the software license will cost a few thousand per year and it cost €30k to build the solution (developers, security checks, etc.), but it saves ~500 working days per year compared to the manual process and is much faster and more reliable, so it paid for itself in the first few months.
  • RakeshS
    RakeshS Posts: 27 Forumite
    Fourth Anniversary 10 Posts
    edited 19 December 2023 at 7:00PM
    I've got a lot of systems doing that type of thing at work via UiPath (RPA automation software) - it manages to process 30 different PDF layouts from 30 e-mails with ~10,000 pages total in 20mins and extracts the relevant data and loads it into a spreadsheet, plus splits all the files into separate one-page documents and sorts the output files based on certain text on each page (company names, document descriptions), re-naming the files accordingly and placing in specific folders.

    It is for commercial purposes though: the software license will cost a few thousand per year and it cost €30k to build the solution (developers, security checks, etc), but it saves ~500 working days per year compared to the manual solution and provides a much faster, more reliable solution, so it paid for itself in the first few months.
    Wow!!! That sounds incredible!
    I think that's a bit beyond what I'd be looking to do, both in terms of budget and capability.
    We'd be scanning (at most) 7 different document templates/layouts and only doing it 150-ish times per month (1 page per document). I'm looking at extracting 8-9 fields of information from each PDF.
    This is something that is currently being done manually, but I really want the whole thing streamlined and improved, if possible.
    Any idea if something similar could be achieved for a small business, on a lower budget?

    I've been looking at solutions using Power Automate and similar but don't really have a clue where to start.
  • Have you tried something like Bard, Claude or ChatGPT?  Try "Can you extract certain pieces of information from a pdf?" and see what happens from there.
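
For the volumes described in the thread (a handful of fixed layouts, one page each, 8-9 fields), a small script may be enough as a starting point. The following is only a rough sketch, not a tested solution. The client names, marker strings, field labels and regex patterns are all made up for illustration and would need replacing with whatever the real forms contain. It also assumes the PDFs have a text layer; scanned or handwritten forms would need OCR first. Reading the PDF uses the third-party pypdf package, and the rest is standard-library Python.

```python
import re

# Hypothetical per-client templates: each maps a field name to a regex
# with one capture group. Real patterns depend on the actual form text.
TEMPLATES = {
    "agent_a": {
        "contact_name": r"Contact:\s*(.+)",
        "phone": r"Tel:\s*([\d ]+)",
        "job_description": r"Job:\s*(.+)",
    },
    "agent_b": {
        "contact_name": r"Tenant name\s*-\s*(.+)",
        "phone": r"Phone\s*-\s*([\d ]+)",
        "job_description": r"Works required\s*-\s*(.+)",
    },
}

# Hypothetical marker text used to recognise which client sent the form.
MARKERS = {"agent_a": "Agent A Lettings", "agent_b": "Agent B Property"}


def read_pdf_text(path):
    """Return the text of the first page of a PDF (needs the pypdf package)."""
    from pypdf import PdfReader  # imported lazily so the rest runs without it

    return PdfReader(path).pages[0].extract_text()


def identify_template(text):
    """Pick the template whose marker string appears in the page text."""
    for name, marker in MARKERS.items():
        if marker in text:
            return name
    return None


def extract_fields(text):
    """Apply the matching template's patterns; missing fields come back as None."""
    name = identify_template(text)
    if name is None:
        raise ValueError("unrecognised form layout")
    fields = {}
    for field, pattern in TEMPLATES[name].items():
        match = re.search(pattern, text)
        fields[field] = match.group(1).strip() if match else None
    return fields


# Stand-in for the text that read_pdf_text() would return for one form.
sample = "Agent A Lettings\nContact: Jane Smith\nTel: 020 7946 0000\nJob: Fix leaking tap\n"
result = extract_fields(sample)
print(result)
```

Because every form from a given company is laid out identically, one set of patterns per company should cover it, and a field that comes back as None is a cue to check that form by hand. Building the new PDF and emailing it back (steps 3 and 4 of the original workflow) would be separate steps on top of this.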