inline::pypdf
Description
PyPDF-based file processor for extracting text content from documents.
Configuration
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `default_chunk_size_tokens` | `int` | No | 800 | Default chunk size in tokens when `chunking_strategy` type is 'auto' |
| `default_chunk_overlap_tokens` | `int` | No | 400 | Default chunk overlap in tokens when `chunking_strategy` type is 'auto' |
| `extract_metadata` | `bool` | No | True | Whether to extract PDF metadata (title, author, etc.) |
| `clean_text` | `bool` | No | True | Whether to clean extracted text (remove extra whitespace, normalize line breaks) |
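
To make the chunking and cleaning fields more concrete, here is a minimal sketch of what overlapping token chunks and whitespace cleanup could look like. It is not the provider's actual implementation: the helper names `clean_text` and `chunk_tokens` are hypothetical, tokens are modeled as a plain list, and the exact cleaning rules are an assumption based only on the field descriptions above.

```python
import re


def clean_text(text: str) -> str:
    """Illustrative cleanup: normalize line breaks and collapse extra whitespace.

    Assumption: the real provider's rules may differ from these.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line breaks
    text = re.sub(r"[ \t]+", " ", text)                    # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)                   # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # keep at most one blank line
    return text.strip()


def chunk_tokens(tokens, chunk_size=800, overlap=400):
    """Split a token list into windows of chunk_size that share `overlap` tokens.

    Defaults mirror default_chunk_size_tokens and default_chunk_overlap_tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

With the defaults, each window advances by 400 tokens, so every chunk after the first repeats the second half of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.
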
Sample Configuration
{}
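
Because every field is optional, the empty sample configuration resolves to the defaults listed in the table. The sketch below illustrates that idea with a plain dataclass; `PyPDFConfigSketch` and `resolve_config` are hypothetical names used for illustration only and are not the provider's real configuration class.

```python
from dataclasses import asdict, dataclass


@dataclass
class PyPDFConfigSketch:
    """Hypothetical mirror of the documented fields and their defaults."""
    default_chunk_size_tokens: int = 800
    default_chunk_overlap_tokens: int = 400
    extract_metadata: bool = True
    clean_text: bool = True


def resolve_config(raw: dict) -> PyPDFConfigSketch:
    """Fill any field missing from `raw` with its documented default."""
    return PyPDFConfigSketch(**raw)


# An empty config ({}) yields all defaults; overrides replace single fields.
defaults = resolve_config({})
custom = resolve_config({"default_chunk_size_tokens": 512, "clean_text": False})
print(asdict(defaults))
print(asdict(custom))
```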