How to Utilise Large Language Models for PDF Sorting
As technical professionals, researchers, and enthusiasts, we often accumulate vast collections of PDF documents over time. Conference papers, research articles, technical documentation, tutorials, and guides pile up in our download folders or cloud storage, creating an increasingly unmanageable digital library.
I found myself in this exact situation, with hundreds of technical PDFs spanning topics from malware analysis and cryptography to low-level programming and operating system internals. Finding a specific document became an exercise in frustration, wasting valuable time that could be better spent learning or solving problems.
This is what led me to develop an automated PDF organisation tool, combining several AI approaches to intelligently categorise documents. In this article, I’ll walk through the technical decisions behind the system and explain how it leverages both traditional NLP techniques and modern language models to create an effective, cost-efficient document classifier.
The Multi-Modal Solution
Rather than relying on a single approach, I designed a hybrid system that uses three complementary techniques:
- Keyword extraction via KeyBERT for identifying domain-specific terminology
- Pattern matching for recognizing filename and keyword patterns common in technical documents
- Language model analysis via GPT-3.5-Turbo for understanding document context and meaning
This multi-modal approach creates a robust system that gracefully handles different document types and quality levels. Let’s dive into each component and why I chose the specific technologies.
Text Extraction: The Foundation
Before any analysis can happen, we need to extract text from PDFs. After testing several libraries, I settled on PyMuPDF (also known as fitz) for its excellent performance, reliability with various PDF formats, and straightforward API.
```python
import pymupdf  # PyMuPDF

PAGES_TO_SCAN = 5  # configurable; example value


def extract_text(file_path):
    doc = pymupdf.open(file_path)
    text = ""
    num_pages_to_scan = min(PAGES_TO_SCAN, len(doc))
    for page_num in range(num_pages_to_scan):
        page = doc.load_page(page_num)
        text += page.get_text("text")
    doc.close()
    return text
```
The system only processes the first several pages of each document (configurable via PAGES_TO_SCAN), which provides enough content for accurate classification while significantly improving performance when dealing with large documents.
Semantic Keyword Extraction with KeyBERT
Once we have the text, the first analysis step uses KeyBERT, a powerful keyword extraction library that leverages BERT-based embeddings to find semantically meaningful terms in documents.
Why KeyBERT? Unlike traditional frequency-based approaches like TF-IDF, KeyBERT understands semantic relationships between words, making it vastly more effective at identifying domain-specific terminology. It can recognize that terms like “buffer overflow” and “stack smashing” are related to cybersecurity even if they don’t appear frequently in the document.
```python
keywords = kw_model.extract_keywords(text, top_n=TOP_N_KEYWORDS)
```
These extracted keywords form the foundation of our classification system, providing a semantic fingerprint of each document.
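For completeness, here is a minimal sketch of the setup and output handling around that call, assuming KeyBERT's default sentence-transformer model; the keyword_terms helper below is illustrative, showing that the scores can be dropped when only the terms are needed for rule matching:

```python
from keybert import KeyBERT

kw_model = KeyBERT()  # loads KeyBERT's default sentence-transformer model
keywords = kw_model.extract_keywords(text, top_n=10)

# extract_keywords returns (keyword, score) pairs, e.g. (illustrative output):
# [("buffer overflow", 0.62), ("stack smashing", 0.58), ...]
keyword_terms = [kw for kw, _score in keywords]  # keep just the terms for matching
```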
Rule-Based Classification
The second component is a rule-based pattern matcher that compares extracted keywords and filename patterns against predefined categories. This approach works exceptionally well for technical domains where documents often contain distinctive terminology or follow naming conventions.
The system organizes rules in a hierarchical dictionary, allowing for nuanced classification logic:
```python
CATEGORY_RULES = {
    "Malware Analysis & Reverse Engineering": {
        "keywords": {"malware", "disassembly", "ida", "ghidra", ...},
        "filenames": [r"malware", r"analysis", r"apt\d*", ...],
    },
    # Additional categories...
}
```
This component is particularly powerful for documents with clear domain signatures, like conference papers or technical specifications.
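The matching logic itself isn't shown above; a minimal sketch of how extracted keywords and filename patterns might be checked against CATEGORY_RULES could look like the following (classify_by_rules is a hypothetical helper name, not necessarily the tool's actual function):

```python
import re


def classify_by_rules(filename, keyword_terms):
    """Return the first category whose keywords or filename patterns match, else None."""
    lowered_keywords = {kw.lower() for kw in keyword_terms}
    lowered_name = filename.lower()
    for category, rules in CATEGORY_RULES.items():
        # Keyword match: any overlap between extracted terms and the category's keyword set
        if lowered_keywords & rules["keywords"]:
            return category
        # Filename match: any of the category's regex patterns found in the filename
        if any(re.search(pattern, lowered_name) for pattern in rules["filenames"]):
            return category
    return None
```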
Enhancing Classification with GPT-3.5-Turbo
While keyword extraction and pattern matching work well for many documents, they struggle with more complex or interdisciplinary content. This is where the integration of OpenAI’s GPT-3.5-Turbo transforms the system.
The language model component works by:
- Sending a portion of the extracted text to the API
- Asking the model to categorise it among our predefined categories
- Receiving both a classification and a confidence score
- Only accepting classifications above a configurable confidence threshold
```python
def chatgpt_categorize(filename, text, keywords):
    categories = list(CATEGORY_RULES.keys())
    categories_str = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])
    prompt = f"""Categorize this PDF document into one of the following categories:
{categories_str}

Details:
- Filename: {filename}
- Extracted keywords: {', '.join(keywords)}

Text excerpt from PDF:
{text[:MAX_TEXT_LENGTH]}

Respond with JSON only in this format:
{{"category": "<category name>", "confidence": <0.0-1.0>}}
"""
    # API call and response processing...
```
The GPT-3.5-Turbo model understands nuanced relationships between concepts, allowing it to classify documents based on context rather than just keyword matching. This is particularly valuable for documents with limited extractable text.
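The elided API call and response handling might look roughly like the sketch below, assuming the current openai Python client (v1.x); llm_classify is an illustrative helper name, and the temperature and threshold handling are examples rather than the tool's exact settings:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_classify(prompt, confidence_threshold=0.7):
    """Send the prompt to GPT-3.5-Turbo and return (category, confidence), or (None, 0.0)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        result = json.loads(response.choices[0].message.content)
        category = result.get("category")
        confidence = float(result.get("confidence", 0.0))
    except (json.JSONDecodeError, TypeError, ValueError):
        return None, 0.0  # malformed response: fall back to rule-based classification
    if category in CATEGORY_RULES and confidence >= confidence_threshold:
        return category, confidence
    return None, confidence
```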
Cost Management Strategies
One of the key concerns when integrating language models is cost management. The system implements several strategies to minimize API expenses:
- Text truncation - Only sending a portion of the document (configurable via MAX_TEXT_LENGTH)
- Confidence thresholds - Only accepting language model classifications when confidence exceeds a minimum value
- Cost estimation - Analyzing the document corpus size and providing cost estimates before processing
- Fallback mechanisms - Using keyword-based classification when the language model is unavailable or lacks confidence
These measures ensure the system remains economically viable even for large document collections.
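As an illustration of the cost-estimation idea, a rough pre-flight estimate can be derived from the corpus size using the common ~4 characters per token heuristic; the function name and per-token prices below are placeholders to be replaced with current GPT-3.5-Turbo rates:

```python
def estimate_cost(file_paths, max_text_length=4000,
                  price_per_1k_input_tokens=0.0005, price_per_1k_output_tokens=0.0015):
    """Rough upper-bound cost estimate before processing a corpus (illustrative prices)."""
    estimated_input_tokens_per_doc = max_text_length / 4   # ~4 characters per token heuristic
    estimated_output_tokens_per_doc = 30                   # small JSON response
    per_doc_cost = (
        estimated_input_tokens_per_doc / 1000 * price_per_1k_input_tokens
        + estimated_output_tokens_per_doc / 1000 * price_per_1k_output_tokens
    )
    return len(file_paths) * per_doc_cost
```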
The Classification Decision Flow
The beauty of this hybrid approach is its adaptive classification strategy:
- First, extract text and keywords from the document
- If language model integration is enabled and we have text:
  - Request a classification from GPT-3.5-Turbo
  - If confidence exceeds the threshold, use this classification
- If language model classification fails or lacks confidence:
  - Check for keyword matches in our category rules
  - Check for filename pattern matches
- If all classification attempts fail:
  - Place the document in the “Uncategorised” directory
This creates a resilient system that leverages the strengths of each approach while compensating for individual weaknesses.
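Put together, the flow might be orchestrated roughly as follows, reusing the pieces sketched earlier; categorize_pdf and CONFIDENCE_THRESHOLD are illustrative names, and it is assumed here that chatgpt_categorize wraps the prompt construction and API call above and returns a (category, confidence) pair:

```python
import os


def categorize_pdf(file_path, use_llm=True):
    """Return a category for one PDF, falling back through each strategy in turn."""
    filename = os.path.basename(file_path)
    text = extract_text(file_path)
    keywords = [kw for kw, _score in kw_model.extract_keywords(text, top_n=TOP_N_KEYWORDS)]

    # 1. Language model first, when enabled and there is text to analyse
    if use_llm and text.strip():
        category, confidence = chatgpt_categorize(filename, text, keywords)
        if category and confidence >= CONFIDENCE_THRESHOLD:
            return category

    # 2. Fall back to keyword and filename pattern rules
    category = classify_by_rules(filename, keywords)
    if category:
        return category

    # 3. Nothing matched
    return "Uncategorised"
```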
Performance Results
In testing with my own collection of technical documents (approximately 500 PDFs), the hybrid approach achieved impressive results:
- 75% correct categorization with the keyword-only approach
- 85% correct categorization with the hybrid approach including GPT-3.5-Turbo
- Average processing time of about 3 seconds per document (excluding API latency)
- Average API cost of approximately $0.002 per document using GPT-3.5-Turbo
The system particularly excels at correctly categorizing documents that traditional keyword approaches would miss:
- Documents with implicit rather than explicit topic references
- Scanned documents with imperfect text extraction
- Interdisciplinary materials that span multiple domains
- Documents with minimal keyword overlap but conceptual relevance
Future Directions
While the current system is highly effective, several enhancements could further improve its capabilities:
- Integrating OCR capabilities for image-heavy PDFs
- Implementing document clustering to identify new categories automatically
- Fine-tuning a domain-specific embedding model for improved keyword extraction
- Adding a feedback loop to improve classification based on user corrections
Conclusion
This hybrid approach to PDF classification demonstrates how combining traditional NLP techniques with modern language models can solve practical information management problems. By leveraging the semantic understanding of KeyBERT for keyword extraction and the contextual comprehension of GPT-3.5-Turbo, we’ve created a system that intelligently organizes technical document collections while maintaining reasonable computational and financial costs.
The solution is particularly valuable for anyone who maintains extensive document libraries and needs efficient ways to organize and retrieve information.
If you’re interested in trying this approach yourself, the complete source code is available on GitHub. The system is configurable, allowing you to define your own categories and classification rules to match your particular document domain.