Splitting a PDF with Table of Contents

After I submitted my 50-page camera-ready to POPL 19, I received an email from the conference publishers saying my appendix (21 pages) was too long. They asked me to split the appendix into a separate document.

There was only one problem: my appendix and paper had references to each other’s sections, which meant they had to be produced in the same run of LaTeX (lest those pesky “??” placeholders start showing up). However, using a tool like pdftk to split the resulting document would destroy the nice table of contents (the PDF bookmarks) generated by pdflatex. I did a lot of Googling, but I couldn’t find a tool that splits a PDF while preserving these bookmarks out of the box.

To solve this problem, I’ve hacked up a simple Python script that dumps a textual representation of the source PDF’s bookmarks, splits the PDF, and then updates the two resulting PDFs with the bookmarks extracted from the source PDF.
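
To make those three steps concrete, here is a rough sketch of the pdftk invocations they could map to. This is not the script itself: the function name plan_split and all intermediate file names (bookmarks.txt, first.raw.pdf, first.txt, and so on) are made up for illustration, and the function only builds the command lines rather than running them. It assumes pdftk's dump_data, cat, and update_info operations.

```python
def plan_split(source, n, second):
    """Return the pdftk command lines for one split, in order.

    Nothing is executed here. All intermediate file names are
    hypothetical placeholders chosen for this sketch.
    """
    return [
        # 1. Dump the document metadata, including the Bookmark* records, as text.
        ["pdftk", source, "dump_data", "output", "bookmarks.txt"],
        # 2. Cut the document at page n (these outputs have no bookmarks).
        ["pdftk", source, "cat", "1-{}".format(n), "output", "first.raw.pdf"],
        ["pdftk", source, "cat", "{}-end".format(n + 1), "output", "second.raw.pdf"],
        # 3. Re-attach bookmark records; first.txt and second.txt are assumed
        #    to hold the rewritten records for each half.
        ["pdftk", "first.raw.pdf", "update_info", "first.txt", "output", "first.pdf"],
        ["pdftk", "second.raw.pdf", "update_info", "second.txt", "output", second],
    ]
```

Each entry could be passed to subprocess.check_call. To match the behaviour of the real script, first.pdf would then have to be moved over the original source file.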

You can find the script here. To use it, you’ll need Python 2 and pdftk (>= 1.45) installed somewhere on your system.

The script is invoked as:

python ./split_toc.py source.pdf n second.pdf

source.pdf is split into two PDFs, one containing pages 1 - n from source.pdf and the second containing pages n + 1 onward. The first PDF (pages 1 - n) overwrites source.pdf, and the second PDF is written to second.pdf. As a part of the splitting process, the table of contents of source.pdf is split between the new source.pdf and second.pdf, with the referenced page numbers updated as appropriate. This script doesn’t support arbitrary page ranges, but you can pretty easily accomplish what you want by composing multiple calls to split_toc.py.
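
The bookmark bookkeeping described above might look roughly like the following. Again, this is a sketch and not the code from the script: parse_bookmarks and split_bookmarks are hypothetical names, and the input is assumed to be in the BookmarkBegin / BookmarkTitle / BookmarkLevel / BookmarkPageNumber layout that pdftk's dump_data emits. Bookmarks pointing at pages 1 through n stay with the first half. The rest go to the second half with their page numbers shifted down by n.

```python
def parse_bookmarks(dump_text):
    """Collect bookmark records from pdftk-style dump_data text.

    Assumes each record starts with a 'BookmarkBegin' line followed by
    'BookmarkTitle', 'BookmarkLevel' and 'BookmarkPageNumber' lines.
    """
    records, current = [], None
    for line in dump_text.splitlines():
        if line.strip() == "BookmarkBegin":
            current = {}
            records.append(current)
        elif current is not None and line.startswith("Bookmark"):
            # Split on the first colon only, so titles may contain colons.
            key, _, value = line.partition(":")
            current[key] = value.strip()
        else:
            # Any other line ends the current bookmark record.
            current = None
    return records


def split_bookmarks(records, n):
    """Partition records at page n and renumber the second half.

    Old page n + 1 becomes page 1 of the second document. Bookmark
    levels are copied unchanged, which is a simplification: a nested
    entry whose parent stays in the first half is not re-levelled.
    """
    first, second = [], []
    for rec in records:
        page = int(rec["BookmarkPageNumber"])
        if page <= n:
            first.append(dict(rec))
        else:
            second.append(dict(rec, BookmarkPageNumber=str(page - n)))
    return first, second
```

For a 50 page document split at n = 29, for example, a bookmark that pointed at page 30 ends up pointing at page 1 of second.pdf.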

I hope this saves you as much hair pulling as it did me!