Divya Nautiyal
28 Aug 2023
In the constantly changing field of financial management, quick access to and analysis of transaction data from bank statements is essential. This project presents a dynamic solution that combines Python, LangChain, and GPT models to automate the extraction of transaction information from bank statement PDFs.
This work uses image processing, optical character recognition (OCR), and the GPT-4 model to simplify the usually tedious chore of reviewing bank statements. Beyond saving significant time, automation improves the accuracy of transaction data extraction, supporting more efficient financial record keeping. In this article, we will walk through the code step by step, focusing on how each stage contributes to this automated bank statement analysis tool.
At its core, this project aims to develop a sophisticated system for automating the extraction, analysis, and organization of transaction details from bank statement PDFs. The technical objectives of the project are as follows:
The system creates intelligent, context-aware search queries using a GPT model (LangChain's ChatOpenAI) to extract transaction details from the OCR output. Rather than relying solely on keyword searches, GPT improves data extraction accuracy by helping identify patterns and structures within the financial data. This ensures that transactions are correctly classified and interpreted even when bank statements are formatted inconsistently.
To improve the accuracy of Optical Character Recognition (OCR), raw images of bank statements are preprocessed before extraction. Methods such as thresholding and Gaussian blur are applied to reduce noise and improve the quality of the textual content within the images.
Once the images are optimized, OCR is used to convert printed or scanned text into machine-readable format. To achieve accuracy, the system is configured with:
The goal is to extract clear and precise transaction details without distortions.
Extracted transaction details are arranged in a structured format for consistency. Instead of raw, unorganized text, the data is formatted into JSON objects, where each transaction has clearly defined fields.
This structured format makes it easier to filter, analyze, and export transaction records.
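For instance, a single extracted statement line might be arranged as follows. This is a hedged sketch: the field names below are illustrative, not the project's exact schema.

```python
import json

def structure_transaction(date, description, amount, balance):
    """Arrange one raw statement line into a JSON object with
    clearly defined fields (field names here are illustrative)."""
    record = {
        "date": date,
        "description": description,
        "amount": round(float(amount), 2),
        "balance": round(float(balance), 2),
        "type": "credit" if float(amount) > 0 else "debit",
    }
    return json.dumps(record)

txn = structure_transaction("2023-08-01", "ATM WITHDRAWAL", "-200.00", "5300.50")
print(txn)
```

Once each transaction is a JSON object like this, filtering by date range or transaction type becomes a simple field lookup.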
To maintain accuracy, the system includes mechanisms to remove irrelevant or incorrect entries:
By applying these checks, only clean and meaningful transaction data is retained for analysis.
Bank statements come in different layouts and structures, making extraction challenging. This system is designed to adapt to:
Using a dynamic rule-based approach, the system adjusts its extraction logic based on detected patterns, ensuring reliable results across different banks.
After extracting and structuring the transaction data, the system enables easy export in two widely used formats:
This conversion ensures that users can directly use the extracted data without additional formatting efforts.
To ensure fast processing while handling large statements, several optimization techniques are applied:
These improvements reduce processing time, minimize memory usage, and improve scalability, making the system efficient for both small and large bank statements.
We will be using several libraries: Fitz (PyMuPDF), PIL, numpy, cv2, pytesseract, pandas, and custom modules from the langchain package. These libraries help us extract images from the PDF, preprocess them, extract text with OCR, apply text analysis, and organize the outputs. To make the script easier to grasp, let us divide it into steps:
In this step, necessary libraries are imported, including those for image processing (PIL, numpy, cv2), text extraction (pytesseract), and data handling (pandas), along with custom modules. The paths for the PDF, image folder, and text folder are defined. You will also have to create a .env file and paste your OpenAI API key there in the format -
OPENAI_API_KEY=`YOUR_API_KEY`
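A minimal sketch of how such a file is read at startup. Real projects typically use python-dotenv's load_dotenv; the parser and the placeholder key below are simplified stand-ins.

```python
import os
import tempfile

def load_env(path):
    """Minimal stand-in for python-dotenv's load_dotenv():
    read KEY=VALUE lines from a file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip()

# Demo with a throwaway file and a placeholder key (not a real secret).
tmp = tempfile.NamedTemporaryFile("w", suffix=".env", delete=False)
tmp.write("OPENAI_API_KEY=sk-demo-placeholder\n")
tmp.close()
load_env(tmp.name)
print(os.environ["OPENAI_API_KEY"])
```

Keeping the key in .env (and out of version control) means the LangChain/OpenAI client can pick it up from the environment without hard-coding secrets in the script.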
Next, the PDF is opened and its pages are read one by one. A pixmap representation of each page is obtained using the get_pixmap function, and this pixmap is then converted into a NumPy array.
The resulting array holds the page's image content. Several image preprocessing methods improve the text quality within the images: Gaussian blur to reduce noise, adaptive thresholding to convert the image into a binary format, and contrast enhancement with PIL's ImageEnhance module.
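The preprocessing idea can be sketched with NumPy alone. In the actual pipeline cv2.GaussianBlur and cv2.adaptiveThreshold do this work; the 3x3 mean blur and global mean threshold below are simplified stand-ins for those calls.

```python
import numpy as np

def preprocess(page: np.ndarray) -> np.ndarray:
    """NumPy-only sketch of the preprocessing stage: a 3x3 mean
    blur (noise reduction) followed by binarization."""
    # Pad with edge values so the blur output keeps the input shape.
    padded = np.pad(page.astype(float), 1, mode="edge")
    blurred = np.zeros(page.shape, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            blurred += padded[1 + dy : 1 + dy + page.shape[0],
                              1 + dx : 1 + dx + page.shape[1]]
    blurred /= 9.0
    # Binarize: dark (text) pixels -> 0, light background -> 255.
    return np.where(blurred > blurred.mean(), 255, 0).astype(np.uint8)

# A tiny 5x5 "page": light background with one dark pixel in the middle.
page = np.full((5, 5), 250, dtype=np.uint8)
page[2, 2] = 10
binary = preprocess(page)
```

The binary output contains only black and white pixels, which is the form OCR engines handle most reliably.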
In this step, each preprocessed image is loaded, and OCR is applied using pytesseract. The extracted data dictionary is converted into a pandas DataFrame for easier manipulation and analysis. Rows with confidence (conf) values of -1, or with empty or single-space text, are filtered out, resulting in a filtered DataFrame df1.
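A sketch of this filtering step, using a hand-made dictionary in place of the real pytesseract.image_to_data output (the real dictionary has the same keys, among others):

```python
import pandas as pd

# Simulated subset of pytesseract.image_to_data(..., output_type=Output.DICT).
ocr_data = {
    "block_num": [1, 1, 1, 2],
    "top":       [10, 10, 40, 80],
    "conf":      [95, -1, 88, 91],
    "text":      ["ATM", "", "WITHDRAWAL", " "],
}

df = pd.DataFrame(ocr_data)
# Drop rows pytesseract marks as non-text (conf == -1) and rows
# whose text is empty or a single space.
df1 = df[(df["conf"] != -1) & (~df["text"].isin(["", " "]))]
print(df1)
```

Only the two meaningful words survive; layout-only rows are gone before any further analysis.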
The filtered DataFrame df1 is grouped by the block number (block_num) and sorted by the vertical position (top) within each block. The loop iterates through each grouped block and reconstructs the text within each block. It takes care of line breaks and formatting. The organized text is then saved into separate text files for each page.
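The reconstruction logic might be sketched like this. The word coordinates below are invented for the example; real values come from pytesseract's block_num, top, and left columns.

```python
import pandas as pd

# Words as emitted by OCR, tagged with block and position.
words = pd.DataFrame({
    "block_num": [1, 1, 1, 2, 2],
    "top":       [10, 10, 40, 80, 80],
    "left":      [60, 5, 5, 5, 90],
    "text":      ["ACCOUNT", "SAVINGS", "STATEMENT", "01/08", "500.00"],
})

blocks = []
for _, block in words.groupby("block_num"):
    # Sort by vertical position, then left-to-right within a line;
    # each distinct "top" value starts a new line in the block.
    block = block.sort_values(["top", "left"])
    lines = [" ".join(line["text"]) for _, line in block.groupby("top")]
    blocks.append("\n".join(lines))
page_text = "\n\n".join(blocks)
print(page_text)
```

The joined text preserves reading order, which matters later when GPT interprets multi-line transaction descriptions.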
In this step, we use LangChain to perform extractive QA over the bank statement text: https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa
Here is the basic flow for doing QA over documents with LangChain:
Data Loading: Begin by importing unstructured data from various sources using LangChain's integration hub. Different loaders facilitate this process, transforming data into LangChain Documents. We have used TextLoader here.
Segmentation: Employ text splitters to break Documents into smaller sections of a specified size. This segmentation aids effective data handling. We use CharacterTextSplitter for this.
Storage: Utilize storage solutions, typically vectorstores, to embed and house these segmented sections, enhancing their utility and context. For this purpose we have used ChromaDB here. Embeddings measure the relatedness of text strings and are represented as vectors (lists) of floating-point numbers.
Retrieval: Access segmented data from storage, typically by comparing embeddings of the input question against those of the stored sections. LangChain provides an abstraction over this logic with QA chains; here the RetrievalQA chain, a chain for question-answering against an index, pulls in the context for us. You can provide a custom prompt template to the chain to specify the structure/format in which you want the data extracted.
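LangChain hides retrieval behind the RetrievalQA chain, but the underlying idea (compare the question's embedding with each chunk's embedding and keep the closest) can be sketched without any external service. The vectors below are hand-made stand-ins for real embeddings, chosen only to make the example deterministic.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for embeddings of statement chunks.
chunks = {
    "01/08 ATM WITHDRAWAL -200.00": [0.9, 0.1, 0.0],
    "Customer service: 1-800-0000": [0.0, 0.2, 0.9],
    "02/08 SALARY CREDIT +3000.00": [0.8, 0.3, 0.1],
}
question_vec = [1.0, 0.2, 0.0]  # stand-in embedding of "list all transactions"

# Retrieve the top-2 most similar chunks, as a vectorstore retriever would.
ranked = sorted(chunks, key=lambda c: cosine(chunks[c], question_vec), reverse=True)
context = ranked[:2]
print(context)
```

In the real pipeline, Chroma performs this similarity search over OpenAI embeddings, and RetrievalQA feeds the retrieved context plus the prompt template to the GPT model.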
By breaking down the code into these steps, we've outlined the progression of tasks involved in extracting transaction details from a bank statement PDF and organizing them into a structured format for further analysis.
Although this system is powerful, there are some obstacles to consider when using it in the real world:
OCR Accuracy Issues
Challenge: Some bank statements have blurry fonts, complex tables, watermarks, or handwritten notes that make text extraction difficult.
Why it Matters: If the OCR (Optical Character Recognition) misreads numbers or text, it could result in incorrect financial records.
Possible Fix:
Challenge: Every bank has its own statement design—some use tables, others use lists, and some mix multiple formats.
Why it Matters: A one-size-fits-all extraction method might not work for all statements.
Possible Fix:
Challenge: Using GPT-4 and vector search tools like LangChain can be costly and slow when processing large volumes of data.
Why it Matters: High costs may limit scalability for businesses with thousands of statements to process daily.
Possible Fix:
When working with bank statements, security and privacy should be top priorities. Since financial data is highly sensitive, any system that processes such information must be designed to prevent unauthorized access, leaks, or data misuse. Here are the key security measures and considerations for this project:
Encryption ensures that financial data remains secure, both while being processed and when stored. Implementing encryption at different stages of the pipeline can help prevent unauthorized access.
Even if transaction data is extracted, some details (like account numbers) should remain private.
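One simple masking approach (a sketch, not the project's exact rule) replaces all but the last four digits of any long digit run, a common convention for account numbers:

```python
import re

def mask_account_numbers(text: str) -> str:
    """Replace all but the last four digits of account-number-like
    runs (8 or more digits) with 'X'."""
    def mask(match):
        digits = match.group(0)
        return "X" * (len(digits) - 4) + digits[-4:]
    return re.sub(r"\d{8,}", mask, text)

line = "Transfer from account 123456789012 on 01/08"
print(mask_account_numbers(line))
```

Short digit runs such as dates and amounts are left untouched, so the masked text remains useful for transaction analysis.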
Depending on where the system is used, it may need to comply with financial data protection regulations:
By implementing these security measures, this project can safely handle sensitive financial data while preventing unauthorized access and ensuring compliance with legal regulations.
Since bank statements can be long and contain hundreds of transactions, the system should be optimized to handle large files efficiently. Below are key performance factors and improvements:
One of the main performance concerns is the time it takes to extract and process transactions from a bank statement. Processing time depends on:
Large PDF files (more than a hundred pages) take more time to process.
High-resolution images with complicated formatting slow down text extraction.
Extracting data from hundreds or thousands of transactions calls for fast processing.
Use multi-threading or multiprocessing to extract data from several pages concurrently rather than one by one.
If the bank statement has many pages, process them in batches rather than one at a time.
Find and extract only the portions of the document containing transaction data rather than running OCR over the whole thing.
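The batched, concurrent approach could be sketched as follows, with a stub standing in for the real per-page OCR routine (which would render the page and call pytesseract on the preprocessed image):

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_no: int) -> str:
    """Stand-in for the real per-page OCR routine."""
    return f"text of page {page_no}"

page_numbers = list(range(8))
batch_size = 4

results = []
with ThreadPoolExecutor(max_workers=batch_size) as pool:
    # Process pages in batches of `batch_size` instead of one by one.
    for start in range(0, len(page_numbers), batch_size):
        batch = page_numbers[start:start + batch_size]
        results.extend(pool.map(ocr_page, batch))
print(len(results))
```

pool.map preserves page order, so the extracted text can still be concatenated in reading order. For CPU-bound OCR work, swapping ThreadPoolExecutor for ProcessPoolExecutor is the usual next step.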
Although OCR (Optical Character Recognition) is a vital component of the project, it occasionally misreads characters (e.g., "0" as "O" or "1" as "I").
Elements influencing OCR accuracy:
Low-resolution or blurry PDFs can cause erroneous text extraction.
Misaligned columns on some bank statements lead to OCR mistakes.
OCR models may not readily recognize certain fonts or handwritten text.
Use adaptive thresholding to raise contrast and improve text legibility.
Reduce image noise with Gaussian blur.
Enlarge and sharpen images before running OCR to improve recognition accuracy.
Train custom OCR models with machine learning specifically for bank statements.
Use financial-term dictionaries to identify and fix OCR mistakes in transaction descriptions.
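The dictionary-based correction can be sketched with a small term list and a character confusion map. Both are illustrative; a production system would use a much larger dictionary and confidence-aware rules.

```python
# Known-good financial terms plus a confusion map let us repair
# common OCR slips such as "0" -> "O" and "1" -> "I".
FINANCIAL_TERMS = {"BALANCE", "DEPOSIT", "WITHDRAWAL", "TRANSFER"}
CONFUSIONS = str.maketrans({"O": "0", "I": "1", "l": "1", "S": "5"})

def fix_token(token: str) -> str:
    upper = token.upper()
    if upper in FINANCIAL_TERMS:
        return upper  # a known term: keep its letters as-is
    fixed = token.translate(CONFUSIONS)
    # Accept the substitution only if it yields a plausible number.
    if fixed.replace(".", "").replace(",", "").isdigit():
        return fixed
    return token

print([fix_token(t) for t in ["BALANCE", "1O0.50", "WITHDRAWAL", "2I"]])
```

Checking tokens against the term dictionary first prevents the confusion map from corrupting legitimate words that happen to contain "O" or "I".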
Running GPT-powered transaction classification and extracting text from PDFs can both be resource-intensive. Particularly for big files, high memory and CPU use can cause processing to lag.
Convert high-resolution images to lower DPI (dots per inch) to cut memory use before processing.
Extract important sections first and query only the relevant data instead of feeding complete bank statements to the GPT model.
For improved performance, use a lightweight OCR configuration, such as Tesseract's LSTM engine, instead of the default.
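The DPI reduction mentioned above can be approximated by simple downsampling. This is a sketch: in the real pipeline one would re-render the page at a lower DPI via get_pixmap rather than subsample an existing array.

```python
import numpy as np

def downsample(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Reduce resolution by keeping every `factor`-th pixel, a crude
    stand-in for re-rendering the page at a lower DPI."""
    return image[::factor, ::factor]

hi_res = np.zeros((600, 400), dtype=np.uint8)
lo_res = downsample(hi_res, 2)
print(hi_res.nbytes, "->", lo_res.nbytes)
```

Halving resolution in both dimensions cuts memory use by roughly 4x, though too aggressive a reduction will start to hurt OCR accuracy.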
To measure improvements, compare the system’s performance before and after optimizations.
Example Benchmarks:
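A simple timing harness for such comparisons might look like this; the stage being timed is a dummy stand-in, and in practice you would point it at the real OCR or extraction step before and after each optimization:

```python
import time

def benchmark(fn, *args, repeats=3):
    """Time a pipeline stage several times and return the median,
    which is less noisy than a single measurement."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

def dummy_stage(pages):  # stand-in for e.g. the per-page OCR pass
    return [p * 2 for p in pages]

median_s = benchmark(dummy_stage, list(range(1000)))
print(f"median: {median_s:.6f} s")
```

Recording medians for each stage (rendering, OCR, GPT extraction) before and after a change makes it easy to attribute a speedup to the right optimization.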
All things considered, the incorporation of the innovative technologies shown here has the power to transform how we work with data extracted from bank statement images. By combining image preprocessing, advanced language models, and efficient data retrieval, we unlock efficiency gains, enhanced accuracy, and insights previously buried in unstructured data.
These techniques, which enable better data handling and intelligent conversations with machines, can be applied across industries beyond banking. As technology evolves, so does the scope for improved data extraction, analysis, and application.