Divya Nautiyal
28 Aug 2023
In the constantly changing field of financial management, quick access to and analysis of transaction data from bank statements is essential. This project presents a dynamic solution that combines Python, LangChain, and GPT models to automate the extraction of transaction information from bank statement PDFs.
This work uses image processing, optical character recognition (OCR), and the GPT-4 model to simplify the usually tedious chore of reviewing bank statements. Beyond saving significant time, automation improves the accuracy of transaction data extraction, supporting more efficient financial record keeping. In this article, we will walk through the code step by step, focusing on how each stage contributes to this automated bank statement analysis tool.
At its core, this project aims to develop a sophisticated system for automating the extraction, analysis, and organization of transaction details from bank statement PDFs. The technical objectives of the project are as follows:
The system creates intelligent, context-aware search queries using a GPT model (LangChain's ChatOpenAI) to extract transaction details from the OCR output. Rather than relying solely on keyword searches, GPT improves data extraction accuracy by helping identify patterns and structures within the financial data. This ensures that transactions are correctly classified and interpreted even when bank statements are formatted inconsistently.
To improve the accuracy of Optical Character Recognition (OCR), raw images of bank statements are preprocessed before extraction. Methods such as thresholding and Gaussian blur are applied to reduce noise and improve the quality of the textual content within the images.
Once the images are optimized, OCR is used to convert printed or scanned text into machine-readable format. To achieve accuracy, the system is configured with:
The goal is to extract clear and precise transaction details without distortions.
Extracted transaction details are arranged in a structured format for consistency. Instead of raw, unorganized text, the data is formatted into JSON objects, where each transaction has clearly defined fields.
This structured format makes it easier to filter, analyze, and export transaction records.
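For instance, a single extracted statement line might be arranged as follows. This is a hedged sketch: the field names below are illustrative, not the project's exact schema.

```python
import json

def structure_transaction(date, description, amount, balance):
    """Arrange one raw statement line into a JSON object with
    clearly defined fields (field names here are illustrative)."""
    record = {
        "date": date,
        "description": description,
        "amount": round(float(amount), 2),
        "balance": round(float(balance), 2),
        "type": "credit" if float(amount) > 0 else "debit",
    }
    return json.dumps(record)

txn = structure_transaction("2023-08-01", "ATM WITHDRAWAL", "-200.00", "5300.50")
print(txn)
```

Once each transaction is a JSON object like this, filtering by date range or transaction type becomes a simple field lookup.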
To maintain accuracy, the system includes mechanisms to remove irrelevant or incorrect entries:
By applying these checks, only clean and meaningful transaction data is retained for analysis.
Bank statements come in different layouts and structures, making extraction challenging. This system is designed to adapt to:
Using a dynamic rule-based approach, the system adjusts its extraction logic based on detected patterns, ensuring reliable results across different banks.
After extracting and structuring the transaction data, the system enables easy export in two widely used formats:
This conversion ensures that users can directly use the extracted data without additional formatting efforts.
To ensure fast processing while handling large statements, several optimization techniques are applied:
These improvements reduce processing time, minimize memory usage, and improve scalability, making the system efficient for both small and large bank statements.
We will be using several libraries: Fitz (PyMuPDF), PIL, numpy, cv2, pytesseract, pandas, and custom modules from the langchain package. These libraries help us extract images from the PDF, preprocess them, extract text with OCR, apply text analysis, and organize the outputs. To make the script easier to grasp, let us divide it into steps:
In this step, necessary libraries are imported, including those for image processing (PIL, numpy, cv2), text extraction (pytesseract), and data handling (pandas), along with custom modules. The paths for the PDF, image folder, and text folder are defined. You will also have to create a .env file and paste your OpenAI API key there in the format -
OPENAI_API_KEY=`YOUR_API_KEY`
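A minimal sketch of how such a file is read at startup. Real projects typically use python-dotenv's load_dotenv; the parser and the placeholder key below are simplified stand-ins.

```python
import os
import tempfile

def load_env(path):
    """Minimal stand-in for python-dotenv's load_dotenv():
    read KEY=VALUE lines from a file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip()

# Demo with a throwaway file and a placeholder key (not a real secret).
tmp = tempfile.NamedTemporaryFile("w", suffix=".env", delete=False)
tmp.write("OPENAI_API_KEY=sk-demo-placeholder\n")
tmp.close()
load_env(tmp.name)
print(os.environ["OPENAI_API_KEY"])
```

Keeping the key in .env (and out of version control) means the LangChain/OpenAI client can pick it up from the environment without hard-coding secrets in the script.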
Next, the PDF is opened and its pages are read one by one. A pixmap representation of each page is obtained using the get_pixmap function, and this pixmap is then converted into a NumPy array.
The resulting array holds the page's image content. Several image preprocessing methods improve the text quality within the images: Gaussian blur to reduce noise, adaptive thresholding to convert the image into a binary format, and contrast enhancement with PIL's ImageEnhance module.
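The preprocessing idea can be sketched with NumPy alone. In the actual pipeline cv2.GaussianBlur and cv2.adaptiveThreshold do this work; the 3x3 mean blur and global mean threshold below are simplified stand-ins for those calls.

```python
import numpy as np

def preprocess(page: np.ndarray) -> np.ndarray:
    """NumPy-only sketch of the preprocessing stage: a 3x3 mean
    blur (noise reduction) followed by binarization."""
    # Pad with edge values so the blur output keeps the input shape.
    padded = np.pad(page.astype(float), 1, mode="edge")
    blurred = np.zeros(page.shape, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            blurred += padded[1 + dy : 1 + dy + page.shape[0],
                              1 + dx : 1 + dx + page.shape[1]]
    blurred /= 9.0
    # Binarize: dark (text) pixels -> 0, light background -> 255.
    return np.where(blurred > blurred.mean(), 255, 0).astype(np.uint8)

# A tiny 5x5 "page": light background with one dark pixel in the middle.
page = np.full((5, 5), 250, dtype=np.uint8)
page[2, 2] = 10
binary = preprocess(page)
```

The binary output contains only black and white pixels, which is the form OCR engines handle most reliably.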
In this step, each preprocessed image is loaded, and OCR is applied using pytesseract. The extracted data dictionary is converted into a pandas DataFrame for easier manipulation and analysis. Rows with confidence (conf) values of -1, or with empty or single-space text, are filtered out, resulting in a filtered DataFrame df1.
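A sketch of this filtering step, using a hand-made dictionary in place of the real pytesseract.image_to_data output (the real dictionary has the same keys, among others):

```python
import pandas as pd

# Simulated subset of pytesseract.image_to_data(..., output_type=Output.DICT).
ocr_data = {
    "block_num": [1, 1, 1, 2],
    "top":       [10, 10, 40, 80],
    "conf":      [95, -1, 88, 91],
    "text":      ["ATM", "", "WITHDRAWAL", " "],
}

df = pd.DataFrame(ocr_data)
# Drop rows pytesseract marks as non-text (conf == -1) and rows
# whose text is empty or a single space.
df1 = df[(df["conf"] != -1) & (~df["text"].isin(["", " "]))]
print(df1)
```

Only the two meaningful words survive; layout-only rows are gone before any further analysis.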
The filtered DataFrame df1 is grouped by the block number (block_num) and sorted by the vertical position (top) within each block. The loop iterates through each grouped block and reconstructs the text within each block. It takes care of line breaks and formatting. The organized text is then saved into separate text files for each page.
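The reconstruction logic might be sketched like this. The word coordinates below are invented for the example; real values come from pytesseract's block_num, top, and left columns.

```python
import pandas as pd

# Words as emitted by OCR, tagged with block and position.
words = pd.DataFrame({
    "block_num": [1, 1, 1, 2, 2],
    "top":       [10, 10, 40, 80, 80],
    "left":      [60, 5, 5, 5, 90],
    "text":      ["ACCOUNT", "SAVINGS", "STATEMENT", "01/08", "500.00"],
})

blocks = []
for _, block in words.groupby("block_num"):
    # Sort by vertical position, then left-to-right within a line;
    # each distinct "top" value starts a new line in the block.
    block = block.sort_values(["top", "left"])
    lines = [" ".join(line["text"]) for _, line in block.groupby("top")]
    blocks.append("\n".join(lines))
page_text = "\n\n".join(blocks)
print(page_text)
```

The joined text preserves reading order, which matters later when GPT interprets multi-line transaction descriptions.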
In this step, we use LangChain to perform extractive QA over the bank statement text: https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa
Here is the basic flow for doing QA over documents with LangChain:
Data Loading: Begin by importing unstructured data from various sources using LangChain's integration hub. Different loaders facilitate this process, transforming data into LangChain Documents. We have used TextLoader here.
Segmentation: Employ text splitters to break Documents into smaller sections of a specified size. This segmentation aids effective data handling. We use CharacterTextSplitter for this.
Storage: Utilize storage solutions, typically vectorstores, to embed and house these segmented sections, enhancing their utility and context. For this purpose we have used ChromaDB here. Embeddings measure the relatedness of text strings and are represented as vectors (lists) of floating-point numbers.
Retrieval: Access segmented data from storage, typically by comparing embeddings of the input question against those of the stored sections. LangChain provides an abstraction over this logic with QA chains; here the RetrievalQA chain, a chain for question-answering against an index, pulls in the context for us. You can provide a custom prompt template to the chain to specify the structure/format in which you want the data extracted.
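LangChain hides retrieval behind the RetrievalQA chain, but the underlying idea (compare the question's embedding with each chunk's embedding and keep the closest) can be sketched without any external service. The vectors below are hand-made stand-ins for real embeddings, chosen only to make the example deterministic.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for embeddings of statement chunks.
chunks = {
    "01/08 ATM WITHDRAWAL -200.00": [0.9, 0.1, 0.0],
    "Customer service: 1-800-0000": [0.0, 0.2, 0.9],
    "02/08 SALARY CREDIT +3000.00": [0.8, 0.3, 0.1],
}
question_vec = [1.0, 0.2, 0.0]  # stand-in embedding of "list all transactions"

# Retrieve the top-2 most similar chunks, as a vectorstore retriever would.
ranked = sorted(chunks, key=lambda c: cosine(chunks[c], question_vec), reverse=True)
context = ranked[:2]
print(context)
```

In the real pipeline, Chroma performs this similarity search over OpenAI embeddings, and RetrievalQA feeds the retrieved context plus the prompt template to the GPT model.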
By breaking down the code into these steps, we've outlined the progression of tasks involved in extracting transaction details from a bank statement PDF and organizing them into a structured format for further analysis.
Although this system is powerful, there are some obstacles to consider when using it in the real world:
OCR Accuracy Issues
Challenge: Some bank statements have blurry fonts, complex tables, watermarks, or handwritten notes that make text extraction difficult.
Why it Matters: If the OCR (Optical Character Recognition) misreads numbers or text, it could result in incorrect financial records.
Possible Fix:
Challenge: Every bank has its own statement design—some use tables, others use lists, and some mix multiple formats.
Why it Matters: A one-size-fits-all extraction method might not work for all statements.
Possible Fix:
Challenge: Using GPT-4 and vector search tools like LangChain can be costly and slow when processing large volumes of data.
Why it Matters: High costs may limit scalability for businesses with thousands of statements to process daily.
Possible Fix:
When working with bank statements, security and privacy should be top priorities. Since financial data is highly sensitive, any system that processes such information must be designed to prevent unauthorized access, leaks, or data misuse. Here are the key security measures and considerations for this project:
Encryption ensures that financial data remains secure, both while being processed and when stored. Implementing encryption at different stages of the pipeline can help prevent unauthorized access.
Even if transaction data is extracted, some details (like account numbers) should remain private.
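One simple masking approach (a sketch, not the project's exact rule) replaces all but the last four digits of any long digit run, a common convention for account numbers:

```python
import re

def mask_account_numbers(text: str) -> str:
    """Replace all but the last four digits of account-number-like
    runs (8 or more digits) with 'X'."""
    def mask(match):
        digits = match.group(0)
        return "X" * (len(digits) - 4) + digits[-4:]
    return re.sub(r"\d{8,}", mask, text)

line = "Transfer from account 123456789012 on 01/08"
print(mask_account_numbers(line))
```

Short digit runs such as dates and amounts are left untouched, so the masked text remains useful for transaction analysis.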
Depending on where the system is used, it may need to comply with financial data protection regulations:
By implementing these security measures, this project can safely handle sensitive financial data while preventing unauthorized access and ensuring compliance with legal regulations.
Since bank statements can be long and contain hundreds of transactions, the system should be optimized to handle large files efficiently. Below are key performance factors and improvements:
One of the main performance concerns is the time it takes to extract and process transactions from a bank statement. Processing time depends on:
Large PDF files (more than a hundred pages) take more time to process.
High-resolution images with complicated formatting slow down text extraction.
Extracting data from hundreds or thousands of transactions calls for fast processing.
Use multi-threading or multiprocessing to extract data from several pages concurrently rather than one by one.
If the bank statement has many pages, process them in batches rather than one at a time.
Find and extract only the portions of the document containing transaction data rather than running OCR over the whole thing.
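The batched, concurrent approach could be sketched as follows, with a stub standing in for the real per-page OCR routine (which would render the page and call pytesseract on the preprocessed image):

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_no: int) -> str:
    """Stand-in for the real per-page OCR routine."""
    return f"text of page {page_no}"

page_numbers = list(range(8))
batch_size = 4

results = []
with ThreadPoolExecutor(max_workers=batch_size) as pool:
    # Process pages in batches of `batch_size` instead of one by one.
    for start in range(0, len(page_numbers), batch_size):
        batch = page_numbers[start:start + batch_size]
        results.extend(pool.map(ocr_page, batch))
print(len(results))
```

pool.map preserves page order, so the extracted text can still be concatenated in reading order. For CPU-bound OCR work, swapping ThreadPoolExecutor for ProcessPoolExecutor is the usual next step.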
Although OCR (Optical Character Recognition) is a vital component of the project, it occasionally misreads characters (e.g., "0" as "O" or "1" as "I").
Elements influencing OCR accuracy:
Low-resolution or blurry PDFs can cause erroneous text extraction.
Misaligned columns on some bank statements lead to OCR mistakes.
OCR models may not readily recognize certain fonts or handwritten text.
Use adaptive thresholding to raise contrast and improve text legibility.
Reduce image noise with Gaussian blur.
Enlarge and sharpen images before running OCR to improve recognition accuracy.
Train custom OCR models with machine learning specifically for bank statements.
Use financial-term dictionaries to identify and fix OCR mistakes in transaction descriptions.
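The dictionary-based correction can be sketched with a small term list and a character confusion map. Both are illustrative; a production system would use a much larger dictionary and confidence-aware rules.

```python
# Known-good financial terms plus a confusion map let us repair
# common OCR slips such as "0" -> "O" and "1" -> "I".
FINANCIAL_TERMS = {"BALANCE", "DEPOSIT", "WITHDRAWAL", "TRANSFER"}
CONFUSIONS = str.maketrans({"O": "0", "I": "1", "l": "1", "S": "5"})

def fix_token(token: str) -> str:
    upper = token.upper()
    if upper in FINANCIAL_TERMS:
        return upper  # a known term: keep its letters as-is
    fixed = token.translate(CONFUSIONS)
    # Accept the substitution only if it yields a plausible number.
    if fixed.replace(".", "").replace(",", "").isdigit():
        return fixed
    return token

print([fix_token(t) for t in ["BALANCE", "1O0.50", "WITHDRAWAL", "2I"]])
```

Checking tokens against the term dictionary first prevents the confusion map from corrupting legitimate words that happen to contain "O" or "I".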
Running GPT-powered transaction classification and extracting text from PDFs can both be resource-intensive. Particularly for big files, high memory and CPU use can cause processing to lag.
Convert high-resolution images to lower DPI (dots per inch) to cut memory use before processing.
Extract important sections first and query only the relevant data instead of feeding complete bank statements to the GPT model.
For improved performance, use a lightweight OCR configuration, such as Tesseract's LSTM engine, instead of the default.
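The DPI reduction mentioned above can be approximated by simple downsampling. This is a sketch: in the real pipeline one would re-render the page at a lower DPI via get_pixmap rather than subsample an existing array.

```python
import numpy as np

def downsample(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Reduce resolution by keeping every `factor`-th pixel, a crude
    stand-in for re-rendering the page at a lower DPI."""
    return image[::factor, ::factor]

hi_res = np.zeros((600, 400), dtype=np.uint8)
lo_res = downsample(hi_res, 2)
print(hi_res.nbytes, "->", lo_res.nbytes)
```

Halving resolution in both dimensions cuts memory use by roughly 4x, though too aggressive a reduction will start to hurt OCR accuracy.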
To measure improvements, compare the system’s performance before and after optimizations.
Example Benchmarks:
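A simple timing harness for such comparisons might look like this; the stage being timed is a dummy stand-in, and in practice you would point it at the real OCR or extraction step before and after each optimization:

```python
import time

def benchmark(fn, *args, repeats=3):
    """Time a pipeline stage several times and return the median,
    which is less noisy than a single measurement."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

def dummy_stage(pages):  # stand-in for e.g. the per-page OCR pass
    return [p * 2 for p in pages]

median_s = benchmark(dummy_stage, list(range(1000)))
print(f"median: {median_s:.6f} s")
```

Recording medians for each stage (rendering, OCR, GPT extraction) before and after a change makes it easy to attribute a speedup to the right optimization.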
All things considered, the incorporation of the innovative technologies shown here has the power to transform how we work with data extracted from bank statement images. By combining image preprocessing, advanced language models, and efficient data retrieval, we unlock efficiency gains, enhanced accuracy, and insights previously buried in unstructured data.
These techniques, which enable better data handling and intelligent conversations with machines, can be applied across industries beyond banking. As technology evolves, so does the scope for improved data extraction, analysis, and application.