One of our website’s all-time most popular blog posts is this one written by Emma Goldsmith on how to translate scanned PDFs, which shows how popular yet tricky they can be to work with.
In addition to Emma’s useful blog post, we’ve recently created a new How to video page designed for project managers. Here you’ll find a selection of short videos which explain how to work with complex files such as PDFs, JSON, PO, InDesign, XML and more.
If you’re like me, you prefer translating straightforward Word documents. When a PDF lands in your inbox you know it’ll take longer and sometimes the end result will still need fixing. Here are some tips, from one translator to another, for processing PDF files in SDL Trados Studio and making the job a bit easier.
What’s a PDF and what’s the difference between scanned and editable files?
PDF stands for Portable Document Format, which means that the file will display exactly the same content and layout wherever it’s opened, regardless of the device and program used. That’s great for the document author, but not such good news for translators.
PDF documents can be editable or scanned. Editable PDFs have text layers and can be processed in Studio 2011 onwards. Scanned PDFs are simply whole-page images without electronic text characters. They can be processed in Studio 2015 onwards because Studio incorporates an engine that carries out optical character recognition (OCR) to extract the text.
It’s easy to tell the difference between the two types of PDF. Open a file in a PDF reader. You’ll only be able to select, copy and paste a word or a paragraph if it’s an editable PDF.
Language limitations and other non-starters
The OCR engine used for the PDF file type in SDL Trados Studio is powered by Solid Documents Technology. Since the OCR technology is dictionary based, it’s available in certain languages only: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish and Turkish.
Your source text needs to be in one of the above languages. It also needs to be a good quality image to achieve a good conversion. Skewed, blurry, faint, smudged or handwritten text are all non-starters:
If you’re faced with a PDF that looks like one of the above (both real-life examples) then I recommend dictating the source with speech recognition software in Word and then translating the Word file in Studio.
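The language restriction above is easy to check before you even open Studio. The snippet below is only a minimal sketch in Python: the language names are copied from the list in this post, while `OCR_LANGUAGES` and `ocr_supported` are hypothetical names made up for illustration and are not part of Studio.

```python
# Languages listed in this post as supported by the OCR engine.
OCR_LANGUAGES = frozenset({
    "Danish", "Dutch", "English", "Finnish", "French", "German", "Italian",
    "Norwegian", "Polish", "Portuguese", "Russian", "Spanish", "Swedish",
    "Turkish",
})


def ocr_supported(source_language: str) -> bool:
    """Return True if the source language is on the list above.

    Hypothetical helper: it only compares names, it does not query Studio.
    """
    return source_language.strip().title() in OCR_LANGUAGES


if __name__ == "__main__":
    for language in ("english", "Japanese"):
        print(language, ocr_supported(language))
```

If the check fails, the scanned PDF falls into the non-starter category along with poor-quality images.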
Previewing the output before translating
Let’s assume you’ve got a fairly good-quality scanned PDF, like the one below:
The next step is to test it in Studio. In Studio 2019, simply drop the PDF into the Welcome view.
In the next window, click Advanced.
On the left, select File Types>PDF>Converter and then click Browse to preview the file.
This gives you a quick preview of what the file will look like in the Editor window and at the same time it’ll save the file in docx format in the folder where the PDF is.
Now you can decide whether to go ahead and translate it as is, or whether you want to work on the formatting and layout in the source Word file and then translate the improved Word file instead of the original PDF in Studio.
Bear in mind that the file type preview uses standard segmentation rules, not the TM segmentation settings in a project. Also, the file type preview isn’t available when you add a file to a project. You can only reach it when you open a file from the Welcome view, or through the file type settings in project settings or general options.
If you’re still working with Studio 2015, you won’t have the file preview feature at all. One workaround is to open the PDF in the Editor and then press Ctrl+Shift+P to see and save the source file in Word.
OCR conversion and Word options in the PDF file type
The beauty of the Studio 2019 preview is that you can experiment with the PDF file type settings (see screenshot above) to see how the file will be processed with those settings. I usually set the Layout to Flowing. This is the most basic output you can get, but still gives correctly formatted bullet points, bold, etc.
I remove images, but you may need to keep them and convert them if possible. Headers and footers are best processed as such, although sometimes it’s easier to remove them here and add them by hand in the target Word file.
Detect tables is essential.
The last set of options defines how Studio will recognise text.
- Every character is for combined PDFs (containing both editable and scanned text).
- Problem characters only is for scanned PDFs (although you can also use Every character).
- None is for editable PDFs.
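If it helps to keep these three choices straight, they boil down to a simple lookup. This is just an illustrative sketch in Python: the option labels are the ones quoted above, while `PdfKind` and `recommended_option` are hypothetical names and have nothing to do with Studio’s own settings files.

```python
from enum import Enum


class PdfKind(Enum):
    """The three kinds of PDF described in this post."""
    EDITABLE = "editable"   # has a text layer
    SCANNED = "scanned"     # whole-page images only
    COMBINED = "combined"   # editable and scanned text in one file


# Option labels as quoted in the list above.
RECOGNITION_OPTION = {
    PdfKind.COMBINED: "Every character",
    PdfKind.SCANNED: "Problem characters only",
    PdfKind.EDITABLE: "None",
}


def recommended_option(kind: PdfKind) -> str:
    """Return the text recognition option suggested for this kind of PDF."""
    return RECOGNITION_OPTION[kind]


if __name__ == "__main__":
    print(recommended_option(PdfKind.SCANNED))
```

Remember that Every character also works for scanned PDFs, so the lookup gives a starting point rather than the only valid setting.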
Now go back to the list on the left. Common (below Converter) gives you all the options for settings in Word documents, including options for adding comments in the target document.
Practicalities during the translation
When you’ve finished setting up your project and are at the translation stage, watch out for typical OCR errors in the source text. “1” and “I”, and “0” and “o”, look very similar in some fonts (e.g., 2O December 20I6). Certain letter combinations can also be misinterpreted, especially in the case of proper names that aren’t in the OCR dictionary (e.g., “Dr Tumer” instead of “Dr Turner”).
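Digit and letter mix-ups of this kind can be flagged automatically if you have the converted source text to hand. The following is a rough sketch in Python, not a Studio feature: `find_suspect_tokens` is a hypothetical helper, and the rule it applies (a letter O, o, I or l directly next to a digit) is my own assumption about what is worth a second look.

```python
import re

# A token is suspect if one of the lookalike letters (O, o, I, l)
# sits directly before or after a digit, as in "2O" or "20I6".
SUSPECT_TOKEN = re.compile(r"\b(?:\w*\d[OoIl]\w*|\w*[OoIl]\d\w*)\b")


def find_suspect_tokens(text: str) -> list[str]:
    """Return tokens that mix digits with lookalike letters."""
    return SUSPECT_TOKEN.findall(text)


if __name__ == "__main__":
    print(find_suspect_tokens("Signed on 2O December 20I6 by Dr Tumer"))
```

The pattern will also flag legitimate codes that mix letters and digits, so treat the results as candidates for review. It can’t catch misread names such as “Dr Tumer”, which still need a human eye.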
Don’t forget that you can edit source segments to correct errors. Not only does this improve the source text, but it will also give you better leverage from your TMs now and in the future. In the active segment, press Alt+F2 and modify the source.
Sometimes, the PDF conversion will add spurious hard returns (paragraph marks) that will split a sentence into two segments. In Studio 2019 it’s easy to merge these segments. Simply press Alt+Shift+Down arrow to select both segments, right-click in the number column and then click Merge Segments. If this option is greyed out, go to project settings and check that the Source Editing and Merge Segments options are set as shown in the screenshot below:
Finally, when you save your target document with Shift+F12, don’t worry that you can’t save it as a PDF. The target file will be in Word docx format.
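Going back to the spurious hard returns mentioned earlier in this section: if you have the converted paragraphs as plain text, you can get a rough idea of where sentences have been split before you start translating. The sketch below is only a heuristic in Python. `find_split_candidates` is a hypothetical helper, and the rule (no closing punctuation, followed by a paragraph that starts in lower case) is an assumption that suits English-like text.

```python
def find_split_candidates(segments: list[str]) -> list[int]:
    """Return each index i where segments[i] and segments[i + 1]
    look like two halves of one sentence."""
    terminal = (".", "!", "?", ":", ";")
    candidates = []
    for i in range(len(segments) - 1):
        current = segments[i].rstrip()
        following = segments[i + 1].lstrip()
        if not current or not following:
            continue  # skip empty paragraphs
        # No closing punctuation, and the next paragraph starts in lower case.
        if not current.endswith(terminal) and following[0].islower():
            candidates.append(i)
    return candidates


if __name__ == "__main__":
    paragraphs = [
        "The patient was admitted to the",
        "hospital on 20 December 2016.",
        "Discharge Summary",
        "He recovered well.",
    ]
    print(find_split_candidates(paragraphs))
```

Headings such as “Discharge Summary” are left alone because the next paragraph starts with a capital letter. The merge itself is still done by hand in the Editor.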
Translating editable PDFs is plain sailing. As mentioned earlier, you can import editable PDFs in most versions of Studio, and you’ll find that Studio often does a better job of converting them to Word than if you open them in MS Word itself (this feature is available in Word 2013 onwards). Studio will insert headers and footers more reliably, preserve bold formatting better, and it doesn’t add an extra space before a paragraph mark at the beginning of a line.
PDFs and pricing
Despite all the advances with PDF file types, translating PDFs is still more time-consuming than working with native file formats. Getting an accurate source word count is also harder. I recommend charging by the hour if possible, and if not, by the final target word count at a higher rate.
A final word of advice if you’re faced with a particularly tricky PDF: ask your client for the original file. Studio handles a huge number of different file formats, so even if you don’t have the native program on your computer, you’ll still be able to process it in Studio.