pygfx is a generic Python library for extracting content from PDF files, converting between document formats, and performing OCR. It's distributed as part of the swftools package.
So far it can read PDF files, images, and Flash files, and produce various outputs from them. The internal document representation is as orthogonal as possible, that is, independent of any particular input or output format, so it's easy to add both import and export filters. It also contains a set of filters for modifying page content (for example, converting parts of a document to bitmaps).
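
To make the importer / filter / exporter split more concrete, here is a minimal pure-Python sketch of that kind of pipeline. It does not use pygfx and does not mirror its API: every name in it (Element, Page, rasterize_vectors, import_toy_source, export_summary, convert) is hypothetical and exists only to illustrate how a format-neutral page representation lets importers, filters, and exporters be combined freely.

```python
# Hypothetical sketch of an importer -> filter -> exporter pipeline.
# None of these names come from pygfx; they only illustrate the design idea.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Element:
    kind: str     # "text", "vector" or "bitmap"
    payload: str  # stand-in for the real drawing data


@dataclass
class Page:
    width: int
    height: int
    elements: List[Element] = field(default_factory=list)


# A filter takes a page and returns a (possibly modified) page.
PageFilter = Callable[[Page], Page]


def rasterize_vectors(page: Page) -> Page:
    """Toy filter: replace every vector element with a bitmap element."""
    converted = [
        Element("bitmap", e.payload) if e.kind == "vector" else e
        for e in page.elements
    ]
    # Return a new page so the input page is left untouched.
    return Page(page.width, page.height, converted)


def import_toy_source(lines: List[str]) -> List[Page]:
    """Toy importer: each line 'kind:payload,kind:payload' becomes one page."""
    pages = []
    for line in lines:
        elements = [Element(*item.split(":", 1)) for item in line.split(",")]
        pages.append(Page(612, 792, elements))  # arbitrary page size
    return pages


def export_summary(pages: List[Page]) -> List[str]:
    """Toy exporter: one string per page listing its element kinds."""
    return [" ".join(e.kind for e in p.elements) for p in pages]


def convert(source, importer, filters: List[PageFilter], exporter):
    """Run source through one importer, any number of filters, one exporter."""
    pages = importer(source)
    for page_filter in filters:
        pages = [page_filter(p) for p in pages]
    return exporter(pages)


result = convert(
    ["text:Hello,vector:logo", "bitmap:photo"],
    import_toy_source,
    [rasterize_vectors],
    export_summary,
)
print(result)
```

Because the importer and exporter only meet at the Page representation, supporting a new input or output format in this sketch means writing one new function, and filters such as the bitmap conversion work unchanged with any combination of the two.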