How to Utilise Large Language Models for PDF Sorting
As technical professionals, researchers, and enthusiasts, we often accumulate vast collections of PDF documents over time. Conference papers, research articles, technical documentation, tutorials, and guides pile up in our download folders or cloud storage, creating an increasingly unmanageable digital library.
I found myself in this exact situation, with hundreds of technical PDFs spanning topics from malware analysis and cryptography to low-level programming and operating system internals. Finding a specific document became an exercise in frustration, wasting valuable time that could be better spent learning or solving problems.
This is what led me to develop an automated PDF organisation tool, combining several AI approaches to intelligently categorise documents. In this article, I’ll walk through the technical decisions behind the system and explain how it leverages both traditional NLP techniques and modern language models to create an effective, cost-efficient document classifier.
The Multi-Modal Solution
Rather than relying on a single approach, I designed a hybrid system that uses three complementary techniques:
- Keyword extraction via KeyBERT for identifying domain-specific terminology
- Pattern matching for recognizing filename and keyword patterns common in technical documents
- Language model analysis via GPT-3.5-Turbo for understanding document context and meaning
This multi-modal approach creates a robust system that gracefully handles different document types and quality levels. Let’s dive into each component and why I chose the specific technologies.
Text Extraction: The Foundation
Before any analysis can happen, we need to extract text from PDFs. After testing several libraries, I settled on PyMuPDF (also known as fitz) for its excellent performance, reliability with various PDF formats, and straightforward API.
```python
import pymupdf  # PyMuPDF

PAGES_TO_SCAN = 5  # configurable; example value


def extract_text(file_path):
    doc = pymupdf.open(file_path)
    text = ""
    num_pages_to_scan = min(PAGES_TO_SCAN, len(doc))
    for page_num in range(num_pages_to_scan):
        page = doc.load_page(page_num)
        text += page.get_text("text")
    doc.close()
    return text
```
The system only processes the first several pages of each document (configurable via PAGES_TO_SCAN), which provides enough content for accurate classification while significantly improving performance when dealing with large documents.
Semantic Keyword Extraction with KeyBERT
Once we have the text, the first analysis step uses KeyBERT, a powerful keyword extraction library that leverages BERT-based embeddings to find semantically meaningful terms in documents.
Why KeyBERT? Unlike traditional frequency-based approaches like TF-IDF, KeyBERT understands semantic relationships between words, making it vastly more effective at identifying domain-specific terminology. It can recognize that terms like “buffer overflow” and “stack smashing” are related to cybersecurity even if they don’t appear frequently in the document.
```python
keywords = kw_model.extract_keywords(text, top_n=TOP_N_KEYWORDS)
```
These extracted keywords form the foundation of our classification system, providing a semantic fingerprint of each document.
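For completeness, here is a minimal sketch of the setup and output handling around that call, assuming KeyBERT's default sentence-transformer model; the keyword_terms helper below is illustrative, showing that the scores can be dropped when only the terms are needed for rule matching:

```python
from keybert import KeyBERT

kw_model = KeyBERT()  # loads KeyBERT's default sentence-transformer model
keywords = kw_model.extract_keywords(text, top_n=10)

# extract_keywords returns (keyword, score) pairs, e.g. (illustrative output):
# [("buffer overflow", 0.62), ("stack smashing", 0.58), ...]
keyword_terms = [kw for kw, _score in keywords]  # keep just the terms for matching
```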
Rule-Based Classification
The second component is a rule-based pattern matcher that compares extracted keywords and filename patterns against predefined categories. This approach works exceptionally well for technical domains where documents often contain distinctive terminology or follow naming conventions.
The system organizes rules in a hierarchical dictionary, allowing for nuanced classification logic:
```python
CATEGORY_RULES = {
    "Malware Analysis & Reverse Engineering": {
        "keywords": {"malware", "disassembly", "ida", "ghidra", ...},
        "filenames": [r"malware", r"analysis", r"apt\d*", ...],
    },
    # Additional categories...
}
```
This component is particularly powerful for documents with clear domain signatures, like conference papers or technical specifications.
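The matching logic itself isn't shown above; a minimal sketch of how extracted keywords and filename patterns might be checked against CATEGORY_RULES could look like the following (classify_by_rules is a hypothetical helper name, not necessarily the tool's actual function):

```python
import re


def classify_by_rules(filename, keyword_terms):
    """Return the first category whose keywords or filename patterns match, else None."""
    lowered_keywords = {kw.lower() for kw in keyword_terms}
    lowered_name = filename.lower()
    for category, rules in CATEGORY_RULES.items():
        # Keyword match: any overlap between extracted terms and the category's keyword set
        if lowered_keywords & rules["keywords"]:
            return category
        # Filename match: any of the category's regex patterns found in the filename
        if any(re.search(pattern, lowered_name) for pattern in rules["filenames"]):
            return category
    return None
```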
Enhancing Classification with GPT-3.5-Turbo
While keyword extraction and pattern matching work well for many documents, they struggle with more complex or interdisciplinary content. This is where the integration of OpenAI’s GPT-3.5-Turbo transforms the system.
The language model component works by:
- Sending a portion of the extracted text to the API
- Asking the model to categorise it among our predefined categories
- Receiving both a classification and a confidence score
- Only accepting classifications above a configurable confidence threshold
```python
def chatgpt_categorize(filename, text, keywords):
    categories = list(CATEGORY_RULES.keys())
    categories_str = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])
    prompt = f"""Categorize this PDF document into one of the following categories:
{categories_str}

Details:
- Filename: {filename}
- Extracted keywords: {', '.join(keywords)}

Text excerpt from PDF:
{text[:MAX_TEXT_LENGTH]}

Respond with JSON only in this format:
{{"category": "<category name>", "confidence": <0.0-1.0>}}
"""
    # API call and response processing...
```
The GPT-3.5-Turbo model understands nuanced relationships between concepts, allowing it to classify documents based on context rather than just keyword matching. This is particularly valuable for documents with limited extractable text.
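The elided API call and response handling might look roughly like the sketch below, assuming the current openai Python client (v1.x); llm_classify is an illustrative helper name, and the temperature and threshold handling are examples rather than the tool's exact settings:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_classify(prompt, confidence_threshold=0.7):
    """Send the prompt to GPT-3.5-Turbo and return (category, confidence), or (None, 0.0)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        result = json.loads(response.choices[0].message.content)
        category = result.get("category")
        confidence = float(result.get("confidence", 0.0))
    except (json.JSONDecodeError, TypeError, ValueError):
        return None, 0.0  # malformed response: fall back to rule-based classification
    if category in CATEGORY_RULES and confidence >= confidence_threshold:
        return category, confidence
    return None, confidence
```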
Cost Management Strategies
One of the key concerns when integrating language models is cost management. The system implements several strategies to minimize API expenses:
- Text truncation - Only sending a portion of the document (configurable via MAX_TEXT_LENGTH)
- Confidence thresholds - Only accepting language model classifications when confidence exceeds a minimum value
- Cost estimation - Analyzing the document corpus size and providing cost estimates before processing
- Fallback mechanisms - Using keyword-based classification when the language model is unavailable or lacks confidence
These measures ensure the system remains economically viable even for large document collections.
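As an illustration of the cost-estimation idea, a rough pre-flight estimate can be derived from the corpus size using the common ~4 characters per token heuristic; the function name and per-token prices below are placeholders to be replaced with current GPT-3.5-Turbo rates:

```python
def estimate_cost(file_paths, max_text_length=4000,
                  price_per_1k_input_tokens=0.0005, price_per_1k_output_tokens=0.0015):
    """Rough upper-bound cost estimate before processing a corpus (illustrative prices)."""
    estimated_input_tokens_per_doc = max_text_length / 4   # ~4 characters per token heuristic
    estimated_output_tokens_per_doc = 30                   # small JSON response
    per_doc_cost = (
        estimated_input_tokens_per_doc / 1000 * price_per_1k_input_tokens
        + estimated_output_tokens_per_doc / 1000 * price_per_1k_output_tokens
    )
    return len(file_paths) * per_doc_cost
```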
The Classification Decision Flow
The beauty of this hybrid approach is its adaptive classification strategy:
- First, extract text and keywords from the document
- If language model integration is enabled and we have text:
  - Request a classification from GPT-3.5-Turbo
  - If confidence exceeds the threshold, use this classification
- If language model classification fails or lacks confidence:
  - Check for keyword matches in our category rules
  - Check for filename pattern matches
- If all classification attempts fail:
  - Place the document in the “Uncategorised” directory
This creates a resilient system that leverages the strengths of each approach while compensating for individual weaknesses.
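Put together, the flow might be orchestrated roughly as follows, reusing the pieces sketched earlier; categorize_pdf and CONFIDENCE_THRESHOLD are illustrative names, and it is assumed here that chatgpt_categorize wraps the prompt construction and API call above and returns a (category, confidence) pair:

```python
import os


def categorize_pdf(file_path, use_llm=True):
    """Return a category for one PDF, falling back through each strategy in turn."""
    filename = os.path.basename(file_path)
    text = extract_text(file_path)
    keywords = [kw for kw, _score in kw_model.extract_keywords(text, top_n=TOP_N_KEYWORDS)]

    # 1. Language model first, when enabled and there is text to analyse
    if use_llm and text.strip():
        category, confidence = chatgpt_categorize(filename, text, keywords)
        if category and confidence >= CONFIDENCE_THRESHOLD:
            return category

    # 2. Fall back to keyword and filename pattern rules
    category = classify_by_rules(filename, keywords)
    if category:
        return category

    # 3. Nothing matched
    return "Uncategorised"
```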
Performance Results
In testing with my own collection of technical documents (approximately 500 PDFs), the hybrid approach achieved impressive results:
- 75% correct categorization with the keyword-only approach
- 85% correct categorization with the hybrid approach including GPT-3.5-Turbo
- Average processing time of about 3 seconds per document (excluding API latency)
- Average API cost of approximately $0.002 per document using GPT-3.5-Turbo
The system particularly excels at correctly categorizing documents that traditional keyword approaches would miss:
- Documents with implicit rather than explicit topic references
- Scanned documents with imperfect text extraction
- Interdisciplinary materials that span multiple domains
- Documents with minimal keyword overlap but conceptual relevance
Future Directions
While the current system is highly effective, several enhancements could further improve its capabilities:
- Integrating OCR capabilities for image-heavy PDFs
- Implementing document clustering to identify new categories automatically
- Fine-tuning a domain-specific embedding model for improved keyword extraction
- Adding a feedback loop to improve classification based on user corrections
Conclusion
This hybrid approach to PDF classification demonstrates how combining traditional NLP techniques with modern language models can solve practical information management problems. By leveraging the semantic understanding of KeyBERT for keyword extraction and the contextual comprehension of GPT-3.5-Turbo, we’ve created a system that intelligently organizes technical document collections while maintaining reasonable computational and financial costs.
The solution is particularly valuable for anyone who maintains extensive document libraries and needs efficient ways to organize and retrieve information.
If you’re interested in trying this approach yourself, the complete source code is available on GitHub. The system is configurable, allowing you to define your own categories and classification rules to match your particular document domain.