DescribePDF is an open-source tool designed to convert PDF files into detailed page-by-page descriptions in Markdown format using Vision-Language Models (VLMs). Unlike traditional PDF extraction tools that focus on replicating the text layout, DescribePDF generates rich, contextual descriptions of each page’s content, making it perfect for visually complex documents like catalogs, scanned documents, and presentations.
Note: DescribePDF is designed to help you make PDF content accessible and searchable in contexts where traditional text extraction fails. It’s particularly useful for creating descriptive content that can be indexed in RAG systems or for making visual documents accessible to people with visual impairments.
The idea for DescribePDF was born from a practical challenge. While building a RAG-powered chatbot that needed to answer questions based on website content, I encountered catalogs in PDF format that couldn’t be properly indexed using traditional text extraction methods. These catalogs contained product images, specifications, and layouts that were visually rich but difficult to extract meaningfully.
Standard OCR produced imprecise, unstructured text, while modern PDF-to-markdown converters failed to capture the essence of these visual documents. When a catalog page consisted primarily of product images with scattered text elements, these tools would either miss important visual context or produce disorganized content.
What I needed was a detailed, page-by-page description that would allow an LLM to “see” what was on each page, enabling responses like: “You can find that product and similar ones on page 12 of the catalog,” along with a link. Existing tools like MinerU, MarkItDown, Nougat, and Vision Parse offered impressive conversion capabilities but weren’t designed for this specific use case.
DescribePDF fills this gap by generating rich, contextual descriptions that capture both the visual and textual elements of each page, making the content accessible for RAG systems and for people with visual impairments.
DescribePDF employs a methodical approach to convert visual PDF content into detailed descriptions:
PDF Preparation: The process begins by analyzing the PDF structure and rendering individual pages as high-quality images.
Enhanced Text Extraction (Optional): When enabled, DescribePDF uses the Markitdown library to extract text content that provides additional context for the description.
Document Summarization (Optional): The tool can generate an overall summary of the document to provide context for page descriptions.
Vision-Language Processing: Each page image is sent to a Vision-Language Model (VLM) with a carefully crafted prompt, which may include the extracted text and document summary.
Multilingual Description Generation: The VLM generates detailed descriptions of each page in the specified language, including visual elements, text content, and structural information.
Markdown Compilation: The individual page descriptions are compiled into a single, structured Markdown document that preserves the page-by-page organization of the original PDF.
This approach ensures that even visually complex documents like catalogs, presentations, and scanned materials can be effectively described and indexed.
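The final compilation step can be sketched as a simple join over per-page descriptions. The function name and heading format below are illustrative only, not DescribePDF's actual internals:

```python
def compile_markdown(pdf_name, page_descriptions):
    """Combine per-page VLM descriptions into one Markdown document,
    preserving the page-by-page organization of the original PDF.

    page_descriptions: iterable of (page_number, description) pairs.
    """
    parts = [f"# Description of {pdf_name}\n"]
    for page_number, description in page_descriptions:
        # One section per page keeps the output easy to chunk for RAG.
        parts.append(f"## Page {page_number}\n\n{description}\n")
    return "\n".join(parts)
```

Because each page becomes its own `## Page N` section, a downstream RAG system can chunk the output by heading and point users back to specific pages.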
DescribePDF differentiates itself from other PDF processing tools by focusing on creating rich descriptions rather than trying to replicate the exact document structure:
| Feature | DescribePDF | MarkItDown | Vision Parse | MinerU |
|---|---|---|---|---|
| Primary Purpose | Generate detailed page descriptions | Convert PDF to Markdown with structure preserved | Parse PDF to formatted Markdown | Convert PDF to machine-readable formats |
| Output Focus | Context-rich descriptions | Document layout and structure | Precise content replication | Structure and formula preservation |
| Use Case | Visual documents, catalogs, RAG indexing | Text-heavy documents, general conversion | Scientific literature, LaTeX equations | Scientific papers, complex layouts |
| VLM Integration | Primary feature | Not core feature | Primary feature | Supplementary feature |
| Local Model Support | ✅ (via Ollama) | ❌ | ✅ (via Ollama) | ✅ |
| Cloud API Support | ✅ (via OpenRouter) | ✅ (optional) | ✅ (OpenAI, Google, etc.) | ❌ |
| Multilingual Support | ✅ | ✅ | ✅ | ✅ |
| Document Summary | ✅ | ❌ | ❌ | ❌ |
| Web Interface | ✅ | ❌ | ❌ | ✅ |
To get started quickly, install DescribePDF from source and run it on a PDF:
```bash
# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Install the package locally
pip install -e .

# Process a PDF with default settings (OpenRouter)
describepdf document.pdf

# Process a PDF with Ollama local models
describepdf document.pdf --local
```
To install with pip:

```bash
# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```
To install inside a virtual environment:

```bash
# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Create and activate a virtual environment
python -m venv describepdf-env
source describepdf-env/bin/activate  # On Windows: describepdf-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```
To install with uv:

```bash
# Install uv if you don't have it
pip install uv

# Clone the repository
git clone https://github.com/DavidLMS/DescribePDF.git
cd DescribePDF

# Create and activate a virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Install the package
uv pip install -e .
```
Create a `.env` file in your working directory with the following settings:
```env
# API Keys
OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"

# Ollama Configuration
OLLAMA_ENDPOINT="http://localhost:11434"

# OpenRouter models
DEFAULT_OR_VLM_MODEL="qwen/qwen2.5-vl-72b-instruct"
DEFAULT_OR_SUMMARY_MODEL="google/gemini-2.5-flash-preview"

# Ollama models
DEFAULT_OLLAMA_VLM_MODEL="llama3.2-vision"
DEFAULT_OLLAMA_SUMMARY_MODEL="mistral-small3.1"

# Common Configuration
DEFAULT_LANGUAGE="English"
DEFAULT_USE_MARKITDOWN="true"
DEFAULT_USE_SUMMARY="false"
DEFAULT_PAGE_SELECTION=""
```
DescribePDF offers a flexible command-line interface:
```bash
# Basic usage (with OpenRouter)
describepdf document.pdf

# Use Ollama as provider
describepdf document.pdf --local --endpoint http://localhost:11434

# Specify an output file
describepdf document.pdf -o result.md

# Change the output language
describepdf document.pdf -l Spanish

# Process only specific pages
describepdf document.pdf --pages "1,3,5-10,15"

# Use Markitdown and summary generation
describepdf document.pdf --use-markitdown --use-summary

# View all available options
describepdf --help
```
```
usage: describepdf [-h] [-o OUTPUT] [-k API_KEY] [--local] [--endpoint ENDPOINT]
                   [-m VLM_MODEL] [-l LANGUAGE] [--pages PAGES] [--use-markitdown]
                   [--use-summary] [--summary-model SUMMARY_MODEL] [-v]
                   pdf_file

DescribePDF - Convert a PDF to detailed Markdown descriptions

positional arguments:
  pdf_file              Path to the PDF file to process

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Path to the output Markdown file
  -k API_KEY, --api-key API_KEY
                        OpenRouter API Key (overrides the one in .env file)
  --local               Use local Ollama instead of OpenRouter
  --endpoint ENDPOINT   Ollama endpoint URL (default: http://localhost:11434)
  -m VLM_MODEL, --vlm-model VLM_MODEL
                        VLM model to use
  -l LANGUAGE, --language LANGUAGE
                        Output language
  --pages PAGES         Pages to process (e.g. '1,3,5-10,15')
  --use-markitdown      Use Markitdown for enhanced text extraction
  --use-summary         Generate and use a PDF summary
  --summary-model SUMMARY_MODEL
                        Model to generate the summary
  -v, --verbose         Verbose mode (show debug messages)
```
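The page-selection syntax accepted by `--pages` (e.g. `'1,3,5-10,15'`) can be parsed with a short helper. This is a sketch of the expected semantics, not the tool's actual implementation:

```python
def parse_page_selection(selection, total_pages):
    """Parse a selection string like '1,3,5-10,15' into a sorted list
    of unique 1-based page numbers, clamped to the document length."""
    if not selection:
        # An empty selection means "all pages".
        return list(range(1, total_pages + 1))
    pages = set()
    for part in selection.split(","):
        part = part.strip()
        if "-" in part:
            # A range like "5-10" expands to every page in between.
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    # Drop anything outside the document's page range.
    return sorted(p for p in pages if 1 <= p <= total_pages)
```

For example, `parse_page_selection("1,3,5-10,15", 20)` yields `[1, 3, 5, 6, 7, 8, 9, 10, 15]`.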
DescribePDF provides two web interfaces powered by Gradio:
```bash
# Start the OpenRouter Gradio interface
describepdf-web
# Alternatively
python -m describepdf.ui

# Start the Ollama Gradio interface
describepdf-web-ollama
# Alternatively
python -m describepdf.ui_ollama
```
You can use DescribePDF in your production applications by leveraging the Gradio API interface. This allows you to run the web interface as a service and make API calls to it from Python, JavaScript, or directly using Bash/cURL.
To use the API, you first need to start the Gradio interface as a server:
```bash
# For OpenRouter interface
describepdf-web

# For Ollama interface
describepdf-web-ollama
```
The Gradio API automatically provides endpoints that can be accessed through various methods. Complete API documentation is available by clicking the **Use via API** link in the web interface once the server is running.
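From Python, a call against the running server might look like the following sketch using the `gradio_client` package. The endpoint name (`/predict`) and argument order are assumptions; the **Use via API** page of the running interface documents the real signature:

```python
def describe_pdf_via_api(pdf_path, server="http://127.0.0.1:7860"):
    """Send a PDF to a running DescribePDF Gradio server and return
    the generated Markdown description."""
    # Imported lazily so this sketch can be loaded without gradio_client.
    from gradio_client import Client, handle_file

    client = Client(server)
    # "/predict" and the single positional argument are assumptions --
    # check "Use via API" in the UI for the actual endpoint and inputs.
    return client.predict(handle_file(pdf_path), api_name="/predict")
```

The same endpoints can be reached from JavaScript with `@gradio/client` or from the shell with cURL, as described in the generated API documentation.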
Starting with Gradio 5.28.0, you can run the application as an MCP server by setting an environment variable:
```bash
# Enable MCP server
export GRADIO_MCP_SERVER=true

# For OpenRouter interface
describepdf-web

# For Ollama interface
describepdf-web-ollama
```
The Model Context Protocol (MCP) standardizes how AI models interact with external tools. With MCP enabled, DescribePDF's conversion functions are exposed as tools that MCP-compatible clients can call. This is particularly useful for integrating DescribePDF into AI workflows, or when you want an LLM to generate PDF descriptions through API calls without human intervention.
DescribePDF uses customizable prompt templates located in the `prompts/` directory:

- `vlm_prompt_base.md` - Base prompt for VLM description
- `vlm_prompt_with_markdown.md` - Prompt including Markitdown context
- `vlm_prompt_with_summary.md` - Prompt including document summary
- `vlm_prompt_full.md` - Combined prompt with both Markitdown and summary
- `summary_prompt.md` - Prompt for generating document summaries

You can modify these templates to customize the descriptions generated by the models.
DescribePDF leverages the capabilities of both OpenRouter and Ollama, giving you access to a wide range of models:
DescribePDF supports all vision-capable models available on the OpenRouter platform, as well as all text-based LLMs for summary generation. The dropdown menus in the interface show recommended models, but you can use any model by typing its name.
Similarly, DescribePDF works with all vision models and LLMs available in your Ollama installation. This gives you flexibility to use any model you have pulled locally.
As of the current release, these models have been tested and provide excellent results:
OpenRouter:

- `qwen/qwen2.5-vl-72b-instruct`
- `google/gemini-2.5-flash-preview`
Note: Currently, the large models available through OpenRouter generally produce significantly better results than local Ollama models, especially for complex documents with detailed visuals. If quality is your priority and you have an API key, OpenRouter is recommended. However, keep in mind that documents processed through OpenRouter will be shared with the respective model providers according to their privacy policies.
Ollama:

- `llama3.2-vision`
- `qwen2.5`
The performance and availability of models may change over time as new models are released.
The DescribePDF project is under active development, and further features are planned.
DescribePDF is released under the MIT License. You are free to use, modify, and distribute the code for both commercial and non-commercial purposes.
Contributions to DescribePDF are welcome! Whether you’re improving the code, enhancing the documentation, or suggesting new features, your input is valuable. Please check out the CONTRIBUTING.md file for guidelines on how to get started and make your contributions count.