Splitting a PDF with Table of Contents
After I submitted my 50-page camera-ready to POPL 19, I received an email from the conference publishers indicating that my appendix (21 pages) was too long. They requested I split the appendix into a separate document.
There was only one problem: my appendix and paper had references to each other’s sections, which meant they had to be produced in the same run of LaTeX (lest those pesky “??” placeholders start showing up). However, using tools like pdftk to split the resulting document would destroy the nice table of contents generated by pdflatex. I did a lot of Googling, but found no tool that splits a PDF while preserving these bookmarks out of the box.
To solve this problem, I’ve hacked up a simple Python script that dumps a textual representation of the source PDF’s bookmarks, splits the PDF, and then updates the two resulting PDFs with the bookmarks extracted from the source PDF.
You can find the script here. To use it, you’ll need Python 2 and pdftk (>= 1.45) installed somewhere on your system.
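If you just want the gist of the approach, here is a rough sketch of the three steps (dump, split, update). It is not the script itself: the function names are made up for illustration, it targets Python 3.7+ rather than Python 2, and it writes to two fresh output files instead of overwriting the source. It relies on pdftk's dump_data, cat, and update_info operations, and assumes that update_info accepts a file containing only bookmark records.

```python
import os
import subprocess
import tempfile


def parse_bookmarks(dump_text):
    """Collect the Bookmark* records from `pdftk ... dump_data` output."""
    bookmarks = []
    for line in dump_text.splitlines():
        if line.strip() == "BookmarkBegin":
            bookmarks.append({})
        elif line.startswith("Bookmark") and bookmarks:
            key, _, value = line.partition(":")
            bookmarks[-1][key] = value.strip()
    return bookmarks


def render(bookmarks, offset):
    """Write records back out, shifting page numbers down by `offset`."""
    out = []
    for b in bookmarks:
        page = int(b["BookmarkPageNumber"]) - offset
        out += ["BookmarkBegin",
                "BookmarkTitle: " + b["BookmarkTitle"],
                "BookmarkLevel: " + b["BookmarkLevel"],
                "BookmarkPageNumber: %d" % page]
    return "\n".join(out) + "\n"


def split_bookmarks(dump_text, n):
    """Bookmarks for pages 1..n stay as-is; the rest are renumbered from 1."""
    marks = parse_bookmarks(dump_text)
    first = [b for b in marks if int(b["BookmarkPageNumber"]) <= n]
    second = [b for b in marks if int(b["BookmarkPageNumber"]) > n]
    return render(first, 0), render(second, n)


def split_with_toc(source, n, first_out, second_out):
    """Hypothetical driver: dump bookmarks, split pages, reattach bookmarks."""
    dump = subprocess.run(["pdftk", source, "dump_data"], check=True,
                          capture_output=True, text=True).stdout
    first_info, second_info = split_bookmarks(dump, n)
    jobs = [("1-%d" % n, first_info, first_out),
            ("%d-end" % (n + 1), second_info, second_out)]
    for page_range, info, out in jobs:
        with tempfile.TemporaryDirectory() as tmp:
            pages = os.path.join(tmp, "pages.pdf")
            info_path = os.path.join(tmp, "info.txt")
            with open(info_path, "w") as f:
                f.write(info)
            # `cat` drops the bookmarks, so they are reattached afterwards.
            subprocess.run(["pdftk", source, "cat", page_range,
                            "output", pages], check=True)
            subprocess.run(["pdftk", pages, "update_info", info_path,
                            "output", out], check=True)
```

One thing this sketch does not handle: if the second half starts with a nested bookmark whose parent section stayed in the first half, the levels are copied over unchanged, and I have not checked how pdftk treats that.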
The script is invoked as:
python ./split_toc source.pdf n second.pdf
source.pdf is split into two PDFs: one containing pages 1 to n of source.pdf, and a second containing pages n + 1 onward. The first PDF (pages 1 to n) overwrites source.pdf, and the second PDF is written to second.pdf. As part of the splitting process, the table of contents of source.pdf is split between the new source.pdf and second.pdf, with the referenced page numbers updated as appropriate. This script doesn’t support arbitrary page ranges, but you can pretty easily accomplish what you want by composing multiple calls to split_toc.py.
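To make the composition idea concrete, here is a small sketch that works out which calls would carve pages a through b out of a document. The helper plan_range and the file names work.pdf, middle.pdf, and tail.pdf are hypothetical, and the script path simply mirrors the invocation shown above. Since the first argument gets overwritten, the plan assumes work.pdf is a copy of your original. It also skips any call that would split at the last page, because I have not checked how the script behaves there.

```python
def plan_range(first_page, last_page, total_pages, script="./split_toc",
               work="work.pdf", middle="middle.pdf", tail="tail.pdf"):
    """Return (calls, result): the argv lists to run in order, and the
    name of the file that ends up holding pages first_page..last_page."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError("expected 1 <= first_page <= last_page <= total_pages")
    calls = []
    if last_page < total_pages:
        # work.pdf keeps pages 1..last_page; everything after goes to tail.pdf
        calls.append(["python", script, work, str(last_page), tail])
    if first_page > 1:
        # work.pdf keeps pages 1..first_page-1; the wanted range goes to middle.pdf
        calls.append(["python", script, work, str(first_page - 1), middle])
        result = middle
    else:
        result = work
    return calls, result


if __name__ == "__main__":
    # Plan the extraction of pages 30 to 40 from a 50-page document.
    calls, result = plan_range(30, 40, 50)
    for call in calls:
        print(" ".join(call))
    print("wanted pages end up in", result)
```

Cutting the tail off first keeps the arithmetic simple: the second split point is still expressed in the original page numbering, because pages before it have not moved.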
I hope this saves you as much hair pulling as it did me!