DescribePDF

DescribePDF is an open-source tool designed to convert PDF files into detailed page-by-page descriptions in Markdown format using Vision-Language Models (VLMs). Unlike traditional PDF extraction tools that focus on replicating the text layout, DescribePDF generates rich, contextual descriptions of each page’s content, making it perfect for visually complex documents like catalogs, scanned documents, and presentations.

Note: DescribePDF is designed to help you make PDF content accessible and searchable in contexts where traditional text extraction fails. It’s particularly useful for creating descriptive content that can be indexed in RAG systems or for making visual documents accessible to people with visual impairments.

Demo · Report Bug · Request Feature · Wiki

Table of Contents

Features

Motivation

How DescribePDF Works

Comparison with Similar Tools

Quick Start

Installation

Usage

Customization

Future Development

License

Contributing

Features

DescribePDF converts each page of a PDF into a detailed, context-rich description in Markdown using Vision-Language Models. It works with cloud models through OpenRouter or with local models through Ollama, can write descriptions in the language you choose, and can optionally enrich its prompts with Markitdown text extraction and a document-level summary. You can process only selected pages, and you can use the tool from the command line, from two Gradio web interfaces, through the Gradio API, or as an MCP server.

Motivation

The idea for DescribePDF was born from a practical challenge. While building a RAG-powered chatbot that needed to answer questions based on website content, I encountered catalogs in PDF format that couldn’t be properly indexed using traditional text extraction methods. These catalogs contained product images, specifications, and layouts that were visually rich but difficult to extract meaningfully.

Standard OCR produced imprecise, unstructured text, while modern PDF-to-markdown converters failed to capture the essence of these visual documents. When a catalog page consisted primarily of product images with scattered text elements, these tools would either miss important visual context or produce disorganized content.

What I needed was a detailed, page-by-page description that would allow an LLM to “see” what was on each page, enabling responses like: “You can find that product and similar ones on page 12 of the catalog,” along with a link. Existing tools like MinerU, MarkItDown, Nougat, and Vision Parse offered impressive conversion capabilities but weren’t designed for this specific use case.

DescribePDF fills this gap by generating rich, contextual descriptions that capture both the visual and textual elements of each page, making the content accessible for RAG systems and for people with visual impairments.

How DescribePDF Works

DescribePDF employs a methodical approach to convert visual PDF content into detailed descriptions:

  1. PDF Preparation: The process begins by analyzing the PDF structure and rendering individual pages as high-quality images.

  2. Enhanced Text Extraction (Optional): When enabled, DescribePDF uses the Markitdown library to extract text content that provides additional context for the description.

  3. Document Summarization (Optional): The tool can generate an overall summary of the document to provide context for page descriptions.

  4. Vision-Language Processing: Each page image is sent to a Vision-Language Model (VLM) with a carefully crafted prompt, which may include the extracted text and document summary.

  5. Multilingual Description Generation: The VLM generates detailed descriptions of each page in the specified language, including visual elements, text content, and structural information.

  6. Markdown Compilation: The individual page descriptions are compiled into a single, structured Markdown document that preserves the page-by-page organization of the original PDF.

This approach ensures that even visually complex documents like catalogs, presentations, and scanned materials can be effectively described and indexed.
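For readers who want to see the shape of this loop in code, here is a minimal, illustrative sketch. It assumes PyMuPDF for page rendering and OpenRouter's OpenAI-compatible chat completions endpoint for the VLM call; the model name and prompt come from the example configuration shown later, and the real package wraps this logic (plus Markitdown extraction, summaries, and page selection) behind its CLI and web interfaces.

# Sketch only: render each page to an image and ask a VLM to describe it.
# Assumes PyMuPDF (pip install pymupdf) and an OpenRouter API key; this is
# not the package's actual implementation.
import base64
import os

import fitz  # PyMuPDF
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]
MODEL = "qwen/qwen2.5-vl-72b-instruct"  # default VLM from the example .env

doc = fitz.open("document.pdf")
descriptions = []
for page in doc:
    # Step 1: render the page as a PNG image.
    png_bytes = page.get_pixmap(dpi=150).tobytes("png")
    image_b64 = base64.b64encode(png_bytes).decode()

    # Steps 4-5: send the image to the VLM with a descriptive prompt.
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Describe page {page.number + 1} of this PDF in detail, in English."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    resp = requests.post(API_URL, json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"})
    resp.raise_for_status()
    descriptions.append(resp.json()["choices"][0]["message"]["content"])

# Step 6: compile the per-page descriptions into one Markdown document.
with open("document.md", "w", encoding="utf-8") as f:
    for i, text in enumerate(descriptions, start=1):
        f.write(f"## Page {i}\n\n{text}\n\n")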

Comparison with Similar Tools

DescribePDF differentiates itself from other PDF processing tools by focusing on creating rich descriptions rather than trying to replicate the exact document structure:

| Feature | DescribePDF | MarkItDown | Vision Parse | MinerU |
|---|---|---|---|---|
| Primary Purpose | Generate detailed page descriptions | Convert PDF to Markdown with structure preserved | Parse PDF to formatted Markdown | Convert PDF to machine-readable formats |
| Output Focus | Context-rich descriptions | Document layout and structure | Precise content replication | Structure and formula preservation |
| Use Case | Visual documents, catalogs, RAG indexing | Text-heavy documents, general conversion | Scientific literature, LaTeX equations | Scientific papers, complex layouts |
| VLM Integration | Primary feature | Not core feature | Primary feature | Supplementary feature |
| Local Model Support | ✅ (via Ollama) | | ✅ (via Ollama) | |
| Cloud API Support | ✅ (via OpenRouter) | ✅ (optional) | ✅ (OpenAI, Google, etc.) | |
| Multilingual Support | ✅ | | | |
| Document Summary | ✅ | | | |
| Web Interface | ✅ | | | |

Quick Start

Online Demo

Try DescribePDF without installation:

HuggingFace Space

Quick CLI Example

# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Install the package locally
pip install -e .

# Process a PDF with default settings (OpenRouter)
describepdf document.pdf

# Process a PDF with Ollama local models
describepdf document.pdf --local

Installation

Prerequisites

You need a recent Python installation with pip (and git to clone the repository), plus either an OpenRouter API key for cloud models or a running Ollama installation for local models.

Option 1: Install from source

# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

Option 2: Install with venv

# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Create and activate a virtual environment
python -m venv describepdf-env
source describepdf-env/bin/activate  # On Windows: describepdf-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

Option 3: Install with uv

# Install uv if you don't have it
pip install uv

# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Create and activate a virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Install the package
uv pip install -e .

Configuration

Create a .env file in your working directory with the following settings:

# API Keys
OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"

# Ollama Configuration
OLLAMA_ENDPOINT="http://localhost:11434"

# OpenRouter models
DEFAULT_OR_VLM_MODEL="qwen/qwen2.5-vl-72b-instruct"
DEFAULT_OR_SUMMARY_MODEL="google/gemini-2.5-flash-preview"

# Ollama models
DEFAULT_OLLAMA_VLM_MODEL="llama3.2-vision"
DEFAULT_OLLAMA_SUMMARY_MODEL="mistral-small3.1"

# Common Configuration
DEFAULT_LANGUAGE="English"
DEFAULT_USE_MARKITDOWN="true"
DEFAULT_USE_SUMMARY="false"
DEFAULT_PAGE_SELECTION=""
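If you want to reuse the same .env values in your own scripts, python-dotenv can load them. This is a small illustrative sketch, not part of DescribePDF itself; the variable names follow the example above.

# Sketch: load the same .env in your own code with python-dotenv
# (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

api_key = os.getenv("OPENROUTER_API_KEY")
vlm_model = os.getenv("DEFAULT_OR_VLM_MODEL", "qwen/qwen2.5-vl-72b-instruct")
language = os.getenv("DEFAULT_LANGUAGE", "English")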

Usage

Command Line Interface

DescribePDF offers a flexible command-line interface:

# Basic usage (with OpenRouter)
describepdf document.pdf

# Use Ollama as provider
describepdf document.pdf --local --endpoint http://localhost:11434

# Specify an output file
describepdf document.pdf -o result.md

# Change the output language
describepdf document.pdf -l Spanish

# Process only specific pages
describepdf document.pdf --pages "1,3,5-10,15"

# Use Markitdown and summary generation
describepdf document.pdf --use-markitdown --use-summary

# View all available options
describepdf --help

Command Line Options

usage: describepdf [-h] [-o OUTPUT] [-k API_KEY] [--local] [--endpoint ENDPOINT]
                   [-m VLM_MODEL] [-l LANGUAGE] [--pages PAGES] [--use-markitdown]
                   [--use-summary] [--summary-model SUMMARY_MODEL] [-v]
                   pdf_file

DescribePDF - Convert a PDF to detailed Markdown descriptions

positional arguments:
  pdf_file              Path to the PDF file to process

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Path to the output Markdown file
  -k API_KEY, --api-key API_KEY
                        OpenRouter API Key (overrides the one in .env file)
  --local               Use local Ollama instead of OpenRouter
  --endpoint ENDPOINT   Ollama endpoint URL (default: http://localhost:11434)
  -m VLM_MODEL, --vlm-model VLM_MODEL
                        VLM model to use
  -l LANGUAGE, --language LANGUAGE
                        Output language
  --pages PAGES         Pages to process (e.g. '1,3,5-10,15')
  --use-markitdown      Use Markitdown for enhanced text extraction
  --use-summary         Generate and use a PDF summary
  --summary-model SUMMARY_MODEL
                        Model to generate the summary
  -v, --verbose         Verbose mode (show debug messages)
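The --pages value is a comma-separated list of page numbers and ranges. The following sketch shows one way such a string can be expanded; it is illustrative only and not the package's actual parser.

# Illustrative parser for a page-selection string such as "1,3,5-10,15".
def parse_pages(selection: str) -> list[int]:
    pages: set[int] = set()
    for part in selection.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        elif part:
            pages.add(int(part))
    return sorted(pages)

print(parse_pages("1,3,5-10,15"))  # [1, 3, 5, 6, 7, 8, 9, 10, 15]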

Web Interface

DescribePDF provides two web interfaces powered by Gradio:

OpenRouter Interface

# Start the OpenRouter Gradio interface
describepdf-web

# Alternatively
python -m describepdf.ui

Ollama Interface

# Start the Ollama Gradio interface
describepdf-web-ollama

# Alternatively
python -m describepdf.ui_ollama

API

You can use DescribePDF in your production applications by leveraging the Gradio API interface. This allows you to run the web interface as a service and make API calls to it from Python, JavaScript, or directly using Bash/cURL.

To use the API, you first need to start the Gradio interface as a server:

# For OpenRouter interface
describepdf-web

# For Ollama interface
describepdf-web-ollama

The Gradio API automatically provides endpoints that can be accessed through various methods. Complete API documentation is available by clicking the Use via API link in the web interface once the server is running.
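For example, from Python you can call a running server with the gradio_client package. The endpoint name and argument list below are placeholders; use the signature shown on your instance's Use via API page.

# Sketch: calling a running DescribePDF Gradio server from Python.
# Requires: pip install gradio_client
# The api_name and arguments are hypothetical placeholders; consult the
# "Use via API" page of your running server for the actual endpoint signature.
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860")  # URL printed when the server starts
result = client.predict(
    handle_file("document.pdf"),   # the PDF to describe
    api_name="/predict",           # placeholder endpoint name
)
print(result)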

MCP Server (Model Context Protocol)

Starting with Gradio 5.28.0, you can run the application as an MCP server by setting an environment variable:

# Enable MCP server
export GRADIO_MCP_SERVER=true

# For OpenRouter interface
describepdf-web

# For Ollama interface
describepdf-web-ollama

The Model Context Protocol (MCP) standardizes how AI models interact with external tools. With MCP enabled, the Gradio app exposes DescribePDF's conversion functionality as a tool that MCP-compatible clients and agents can discover and call.

This is particularly useful for integrating DescribePDF into AI workflows or when you want an LLM to generate PDF descriptions through API calls without human intervention.

Customization

Prompt Templates

DescribePDF uses customizable prompt templates located in the prompts/ directory.

You can modify these templates to customize the descriptions generated by the models.

Model Selection

DescribePDF leverages the capabilities of both OpenRouter and Ollama, giving you access to a wide range of models:

OpenRouter Support

DescribePDF supports all vision-capable models available on the OpenRouter platform, as well as all text-based LLMs for summary generation. The dropdown menus in the interface show recommended models, but you can use any model by typing its name.

Ollama Support

Similarly, DescribePDF works with all vision models and LLMs available in your Ollama installation. This gives you flexibility to use any model you have pulled locally.

As of the current release, these models have been tested and provide excellent results:

OpenRouter: qwen/qwen2.5-vl-72b-instruct for page descriptions and google/gemini-2.5-flash-preview for summaries (the defaults in the example configuration).

Note: Currently, the large models available through OpenRouter generally produce significantly better results than local Ollama models, especially for complex documents with detailed visuals. If quality is your priority and you have an API key, OpenRouter is recommended. However, keep in mind that documents processed through OpenRouter will be shared with the respective model providers according to their privacy policies.

Ollama: llama3.2-vision for page descriptions and mistral-small3.1 for summaries (the defaults in the example configuration).

The performance and availability of models may change over time as new models are released.

Future Development

The DescribePDF project is under active development. See the repository's issue tracker to follow planned improvements or to request new features.

License

DescribePDF is released under the MIT License. You are free to use, modify, and distribute the code for both commercial and non-commercial purposes.

Contributing

Contributions to DescribePDF are welcome! Whether you’re improving the code, enhancing the documentation, or suggesting new features, your input is valuable. Please check out the CONTRIBUTING.md file for guidelines on how to get started and make your contributions count.