Modifying PDFs

From Torben's Wiki

Required Software

2019: New powerful tool: Coherent PDF

Linux

sudo apt-get install pdftk

Windows

PDFtk

Join several files

pdftk in1.pdf in2.pdf cat output out.pdf
# or using handles
pdftk A=in1.pdf B=in2.pdf cat A B output out.pdf
# or using wildcards
pdftk *.pdf cat output out.pdf
# 
# Remove 'page 13' from in1.pdf to create out1.pdf
pdftk in1.pdf cat 1-12 14-end output out1.pdf
# or
pdftk A=in1.pdf cat A1-12 A14-end output out1.pdf 
# 
# join parts of several files
pdftk A=file1.pdf B=file2.pdf C=file3.pdf cat A B2-end C1-23 output out.pdf
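
The "remove one page" pattern above can be generalized. The following is a minimal sketch, not part of pdftk: `ranges_without_page` is a hypothetical helper that prints the two page ranges around the page to drop. It assumes the page is not the last page of the document (otherwise the second range would point past the end).

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftk): print the page ranges that keep
# every page except page $1, in the syntax of pdftk's "cat" operation.
ranges_without_page() {
  page=$1
  if [ "$page" -le 1 ]; then
    # dropping the first page: keep everything from page 2 on
    echo "2-end"
  else
    echo "1-$((page - 1)) $((page + 1))-end"
  fi
}

# Usage (assumes pdftk is installed and in1.pdf has more than 13 pages):
#   pdftk in1.pdf cat $(ranges_without_page 13) output out1.pdf
```

The command substitution is left unquoted on purpose, so that the two ranges are passed to pdftk as separate arguments.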


Extract and convert page to image

gs -dSAFER -r600 -sDEVICE=pngalpha -dFirstPage=1 -dLastPage=1 -o tmp/title-en.png tmp/hpmor.pdf
convert -density 150 tmp/title-en.png -resize 1186x1186\> -quality 75 tmp/title-en.jpg
# Ghostscript is used here instead of ImageMagick, because ImageMagick fails with this error:
# convert -density 150 tmp/hpmor.pdf[0] -quality 75 tmp/title-en.jpg
# attempt to perform an operation not allowed by the security policy

Extract page range

via PDFtk (Windows)

@echo off
set pdftk=C:\Users\torben\Progs\PortableApps\PDFTKBuilderPortable\App\pdftkbuilder\pdftk.exe
set start=12
set end=40
for %%F in (input\*.pdf) do (
  echo "%%~nF"
  %pdftk% "%%F" cat %start%-%end% output "output\%%~nF-cat-pdftk.pdf"
)

via GhostScript (Windows) + image compression

@echo off
set gs=C:\Users\torben\Progs\PortableApps\CommonFiles\Ghostscript\bin\gswin32c.exe
REM -dPDFSETTINGS=/printer reduces the image resolution to 300 dpi; use /ebook for 150 dpi or /screen for 72 dpi
set start=12
set end=40
for %%F in (input\*.pdf) do (
  echo "%%~nF"
  %gs% ^
  -o "output\%%~nF-cat-gs.pdf" ^
  -sDEVICE=pdfwrite ^
  -dFirstPage=%start% -dLastPage=%end% ^
  -dCompatibilityLevel=1.7 ^
  -dPDFSETTINGS=/printer ^
  -dAutoRotatePages=/None ^
  -dNOPAUSE -dBATCH ^
  -f "%%F"
)
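
For Linux, the same loop might look like the sketch below. It assumes Ghostscript is installed as `gs`; the `input/` and `output/` directory names and the page range 12-40 are taken from the Windows script, while the function names `out_name` and `extract_all` are made up for this example.

```shell
#!/bin/sh
# Sketch of a Linux counterpart to the Windows batch loop.
# Assumes Ghostscript is installed as "gs".
start=12
end=40

# Map e.g. input/book.pdf to output/book-cat-gs.pdf
out_name() {
  base=$(basename "$1")
  echo "output/${base%.*}-cat-gs.pdf"
}

# Extract pages $start..$end from every PDF in input/
extract_all() {
  mkdir -p output
  for f in input/*.pdf; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    echo "$f"
    gs -o "$(out_name "$f")" \
       -sDEVICE=pdfwrite \
       -dFirstPage="$start" -dLastPage="$end" \
       -dCompatibilityLevel=1.7 \
       -dPDFSETTINGS=/printer \
       -dAutoRotatePages=/None \
       -dNOPAUSE -dBATCH \
       -f "$f"
  done
}
```

Nothing runs until `extract_all` is called, so the file can be sourced first to check the generated output names.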

Crop white borders

(useful for reading a book on a small netbook/laptop screen)

Linux

quick and dirty:

pdfcrop file.pdf file-crop.pdf

Better, since it preserves links and produces a smaller file: use the script pdfcrop.sh or the modified version pdfcrop2.sh below.

# requires pdftk
pdfcrop2.sh file.pdf file-crop.pdf

Windows

use pdfcrop.bat

Render two or more pages onto one page

# requires pdftk
sudo apt-get install pdfjam # brings pdfnup

pdfnup AS_Chapter1.pdf


PDF shrinking and protection removal using Ghostscript

Shrinks the file size and removes protection. Note: the output option -o must come first!

gswin64c.exe               ^
  -o "file-shrink.pdf"     ^
  -sDEVICE=pdfwrite        ^
  -dCompatibilityLevel=1.7 ^
  -dPDFSETTINGS=/printer   ^
  -dAutoRotatePages=/None  ^
  -dNOPAUSE -dBATCH        ^
  -f "file.pdf"

( Windows: gswin64c.exe and ^ ; Linux: gs and \ )

Further parameters

Converting Bitmaps to 300dpi

-dPDFSETTINGS=/printer

Converting Bitmaps to 150dpi

-dPDFSETTINGS=/ebook

Converting Bitmaps to 72dpi

-dPDFSETTINGS=/screen
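
The three presets above can be chosen from a target resolution. This is only a sketch: `pdfsettings_for_dpi` is a hypothetical helper, and the thresholds are an assumption that picks the smallest preset which still reaches the requested resolution.

```shell
#!/bin/sh
# Hypothetical helper: pick the -dPDFSETTINGS preset for a target bitmap
# resolution given in dpi.
pdfsettings_for_dpi() {
  dpi=$1
  if [ "$dpi" -ge 300 ]; then
    echo "/printer"
  elif [ "$dpi" -ge 150 ]; then
    echo "/ebook"
  else
    echo "/screen"
  fi
}

# Usage (assumes Ghostscript is installed as "gs"):
#   gs -o file-shrink.pdf -sDEVICE=pdfwrite \
#      -dPDFSETTINGS="$(pdfsettings_for_dpi 150)" -f file.pdf
```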

Page range

-dFirstPage=100 -dLastPage=123

Vector text on Raster image

Do not use Inkscape for colorful raster images such as photos: when exporting to PDF, it stores all raster images internally as PNG, not as JPG. Better use:

  1. Gimp to optimize file
    • crop borders
    • scale image
    • set dpi via Image -> Scale Image -> X+Y Resolution (300px/inch is fine for printing)
    • set image->mode->grayscale if suitable
    • export in wanted quality ( e.g. 85% for jpg or image->mode->indexed for png)
  2. LibreOffice
    • import image, preferably as a link, since then you can still modify it afterwards using Gimp
    • place text or drawings
    • File-> export as pdf
    • you might choose "Lossless compression" if already done by Gimp
    • no need for "reduce image resolution" if already done by Gimp

Color to Greyscale


gs -sOutputFile=grayscale.pdf -sDEVICE=pdfwrite \
  -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
  -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH c-color.pdf < /dev/null

Appendix Scripts

pdfcrop.bat

requires pdftk, Ghostscript and sed (included in UnixUtils for Windows)

@echo off
for %%F in (input\*.pdf) do (
  echo "%%~nF"
  echo uncompressing
  REM using pdftk to uncompress
  pdftk "%%F" output uncompressed.pdf uncompress

  echo adding Boxes
  REM sed from unix utils package
  REM calc via mm2pt.xlsx
  REM box values are in pt: left bottom right top
  sed -e "s/\(\(Crop\|Media\)Box\).*/\1 [55.0 85.0 413.0 590.0]/g" uncompressed.pdf > uncompressed2.pdf
  del uncompressed.pdf

  REM using ghostscript to trim
  echo compressing
  e:\win\progs\gs9.21\bin\gswin64c.exe ^
  -o "output\%%~nF-crop.pdf" ^
  -sDEVICE=pdfwrite ^
  -dCompatibilityLevel=1.7 ^
  -dAutoRotatePages=/None ^
  -dNOPAUSE -dBATCH ^
  -f "uncompressed2.pdf"
  del uncompressed2.pdf
)
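
The box values in the sed expression are given in PostScript points (1 pt = 1/72 inch, 1 inch = 25.4 mm). The spreadsheet mm2pt.xlsx is not shown here; the sketch below assumes it performs this plain unit conversion. `mm2pt` and `box_from_mm` are hypothetical helper names.

```shell
#!/bin/sh
# Convert millimetres to PostScript points (1 pt = 1/72 inch, 1 inch = 25.4 mm).
# LC_ALL=C forces a decimal point regardless of the system locale.
mm2pt() {
  LC_ALL=C awk -v mm="$1" 'BEGIN { printf "%.1f\n", mm * 72 / 25.4 }'
}

# Build a box string from left, bottom, right, top given in mm.
box_from_mm() {
  echo "[$(mm2pt "$1") $(mm2pt "$2") $(mm2pt "$3") $(mm2pt "$4")]"
}

# Usage: box for a full A4 page (210 mm x 297 mm)
#   box_from_mm 0 0 210 297
```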

pdfcrop2.sh

#!/bin/bash
# from http://tex.stackexchange.com/questions/42236/pdfcrop-generates-larger-file

function usage () {
  echo "Usage: `basename $0` [Options] <input.pdf> [<output.pdf>]"
  echo
  echo " * Removes white margins from each page in the file. (Default operation)"
  echo " * Trims page edges by given amounts. (Alternative operation)"
  echo
  echo "If only <input.pdf> is given, it is overwritten with the cropped output."
  echo
  echo "Options:"
  echo
  echo " -m \"<left> [<top> [<right> <bottom>]]\""
  echo "    adds extra margins in default operation mode. Unit is bp. A single number"
  echo "    is used for all margins, two numbers \"<left> <top>\" are applied to the"
  echo "    right and bottom margins alike."
  echo
  echo " -t \"<left> [<top> [<right> <bottom>]]\""
  echo "    trims outer page edges by the given amounts. Unit is bp. A single number"
  echo "    is used for all trims, two numbers \"<left> <top>\" are applied to the"
  echo "    right and bottom trims alike."
  echo
  echo " -hires"
  echo "    %%HiResBoundingBox is used in default operation mode."
  echo
  echo " -help"
  echo "    prints this message."
}

c=0
mar=(0 0 0 0); tri=(0 0 0 0)
bbtype=BoundingBox

while getopts m:t:h: opt
do
  case $opt
  in
    m)
    eval mar=($OPTARG)
    [[ -z "${mar[1]}" ]] && mar[1]=${mar[0]}
    [[ -z "${mar[2]}" || -z "${mar[3]}" ]] && mar[2]=${mar[0]} && mar[3]=${mar[1]}
    c=0
    ;;
    t)
    eval tri=($OPTARG)
    [[ -z "${tri[1]}" ]] && tri[1]=${tri[0]}
    [[ -z "${tri[2]}" || -z "${tri[3]}" ]] && tri[2]=${tri[0]} && tri[3]=${tri[1]}
    c=1
    ;;
    h)
    if [[ "$OPTARG" == "ires" ]]
    then
      bbtype=HiResBoundingBox
    else
      usage 1>&2; exit 0
    fi
    ;;
    \?)
    usage 1>&2; exit 1
    ;;
  esac
done
shift $((OPTIND-1))

[[ -z "$1" ]] && echo "`basename $0`: missing filename" 1>&2 && usage 1>&2 && exit 1
input=$1;output=$1;shift;
[[ -n "$1" ]] && output=$1 && shift;

# by TM
if [ "$input" == "$output" ] ; then
  output="$(basename "$output")" # remove dirs
  output="${output%\.*}" # remove ext .pdf & .PDF
  output="$output-crop.pdf"
fi

echo "$input -> $output"
 
 
(
    [[ "$c" -eq 0 ]] && gs -dNOPAUSE -q -dBATCH -sDEVICE=bbox "$input" 2>&1 | grep "%%$bbtype"
    pdftk "$input" output - uncompress
) | perl -w -n -s -e '
  BEGIN {@m=split /\s+/, $mar; @t=split /\s+/, $tri;}
  if (/BoundingBox:\s+([\d\.\s]+\d)/) { push @bbox, $1; next;}
  elsif (/\/MediaBox\s+\[([\d\.\s]+\d)\]/) { @mb=split /\s+/, $1; next; }
  elsif (/pdftk_PageNum\s+(\d+)/) {
    $p=$1-1;
    if($c){
      $mb[0]+=$t[0];$mb[1]+=$t[1];$mb[2]-=$t[2];$mb[3]-=$t[3];
      print "/MediaBox [", join(" ", @mb), "]\n";
    } else {
      @bb=split /\s+/, $bbox[$p];
      $bb[0]+=$mb[0];$bb[1]+=$mb[1];$bb[2]+=$mb[0];$bb[3]+=$mb[1];
      $bb[0]-=$m[0];$bb[1]-=$m[1];$bb[2]+=$m[2];$bb[3]+=$m[3];
      print "/MediaBox [", join(" ", @bb), "]\n";
    }
  }
  print;
' -- -mar="${mar[*]}" -tri="${tri[*]}" -c=$c | pdftk - output "$output" compress
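
The way the script fills in missing values for -m and -t can be shown in isolation. The following is a sketch in plain sh rather than the script's own bash array code; `expand_box_args` is a hypothetical name, and the input is assumed to contain only whitespace-separated numbers.

```shell
#!/bin/sh
# Hypothetical re-statement of the -m/-t argument handling in plain sh.
# Expands "<left> [<top> [<right> <bottom>]]" to four values:
#   one number   -> used for all four sides
#   two numbers  -> right = left, bottom = top
#   four numbers -> used as given
expand_box_args() {
  set -- $1            # split the quoted argument on whitespace
  left=$1
  top=${2:-$left}
  if [ -n "$3" ] && [ -n "$4" ]; then
    right=$3; bottom=$4
  else
    # a missing third or fourth value falls back to left/top
    right=$left; bottom=$top
  fi
  echo "$left $top $right $bottom"
}

# Usage:
#   expand_box_args "5 10"
```

As in the script, a third value without a fourth is ignored and the two-number rule applies.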