
Clean Copy from PDF

Croddwyn

I love modules and adventure paths and Realm Works. Even with the hardship of pasting from PDFs, Realm Works cuts my prep time and makes it so much easier once in-game. I can get into my process if desired, but that's not what I'm talking about today.

What I hate about copying from PDFs is that it has a wild array of formatting issues. I want the bold and italics. I want the paragraph breaks. Just about everything else needs to go, though. Oh, and the paragraph breaks need to double and Acrobat always includes an extra paragraph at the start of anything pasted. That should go too.

And that's before you get to the oddity that is the "fl". I can get into the technical reasons for why Acrobat typically splits any word in half between an "f" and an "l" but it's only interesting because it about drove me nuts figuring out what to do about it.

So all of these were driving me nuts when I realized that, hey, I'm a programmer, I should just do something about it (because it turns out that nobody else has, that I could find). So I did. It's a simple little app where you paste from a PDF, hit "clean up" and then copy again. Extra steps, but it slays all those annoyances and even takes a stab at the fl issue based on the Windows spell checker and stuff.

It's a simple WPF app and once finished, I thought "hey, others might like this, particularly in the Realm Works universe". I added XAML editing just for kicks and override words (because it turns out that "half" and "ling" are both valid words and thus didn't trip the spell checker). And then I published it using ClickOnce.
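
For anyone curious how an fl repair along these lines could work, here is a minimal Python sketch. It is not the app's code (the app is WPF and its source isn't shown in this thread). It assumes a small hypothetical word list standing in for the Windows spell checker, plus an override set for cases like "halfling" where both halves pass spell check on their own. The names KNOWN_WORDS, OVERRIDES, and rejoin_fl are made up for illustration.

```python
import re

# Hypothetical stand-in for a real spell checker's dictionary.
KNOWN_WORDS = {"half", "ling", "halfling", "flame", "reflex", "loaf", "left"}
# Override words: always join these, even though both halves are valid words.
OVERRIDES = {"halfling"}

# A fragment ending in "f", a single space, then a fragment starting with "l".
SPLIT_FL = re.compile(r"\b([A-Za-z]*f) (l[A-Za-z]*)\b")


def rejoin_fl(text, known=KNOWN_WORDS, overrides=OVERRIDES):
    """Join 'f l' splits when the joined word looks more plausible than the halves."""

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        key = joined.lower()
        if key in overrides:
            return joined
        halves_ok = left.lower() in known and right.lower() in known
        # Join only if the whole is a word and at least one half is not.
        if key in known and not halves_ok:
            return joined
        return match.group(0)  # leave the text untouched

    return SPLIT_FL.sub(fix, text)


if __name__ == "__main__":
    print(rejoin_fl("The half ling dodged the f lame with a ref lex save."))
```

Without the override set, "half ling" would be left alone because both halves are in the word list, which is the situation described above.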

And yes, it's free to use. And yes, Windows is going to bark at you about being unsigned so you'll have to go into "More Info" and tell it to run anyway. Sorry about that. Feel free to use it and let me know what you think!

And if you're with Wolf Lair and want to take a look at the code for use in Realm Works, just drop me a line. I'd love to cut out the extra steps and kill this app dead... :)
 
Always nice to see new tools. :)

In case you're wondering why others may not have similar complaints: I normally want all of the formatting gone, as I almost always end up formatting things differently. Thus everything gets brought in through Paste Special, Unformatted Text. (Which should be Ctrl-Shift-V, hint hint RW folks. ;)

The end result of copy/pasting is highly dependent on how the program that created the PDFs did the text layout and how the result was converted to PDF. For the text I've been bringing in from the Paranoia PDFs, I'd never want the paragraph breaks to double because I apply After Paragraph Spacing to my text if I want space between paragraphs. I also don't see extra paragraphs at the start of pasted text, though I'm pretty particular about what text gets selected. (Proper Triple and Quad-click-drag select would help here, but alas, not supported in my version of Acrobat.)

Also, some sections of the books are laid out with each line as a paragraph, while others have proper paragraphing. The After Paragraph Spacing helps here, making it obvious if something near the end of the line is the end of a line or end of a paragraph.

I'm not sure what's going on with ligatures in your path of PDF/Acrobat/what ends up on the Clipboard. When clearing the formatting, I'd probably break ligatures back into their component letters. Then maybe do a pass smushing together adjoining similar formatting runs, then do the cleanup, then smush formatting runs again. That leaves it up to Realm Works' word processing engine to use ligatures or not once the text is brought in there, which is probably closest to what you'd want. Unless I'm not understanding your fl problems, which is very likely. :)
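
A rough Python sketch of the first two steps suggested above: break ligatures into their component letters, then smush adjoining runs with the same formatting. The run model (text, bold, italic) is a hypothetical simplification, not what Acrobat or Realm Works puts on the Clipboard. The ligature step relies on Unicode compatibility decomposition via the standard unicodedata module.

```python
import unicodedata
from typing import List, Tuple

Run = Tuple[str, bool, bool]  # (text, bold, italic): a hypothetical run model

# Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its component letters."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in text
    )


def merge_runs(runs: List[Run]) -> List[Run]:
    """Drop empty runs and join neighbours that share bold/italic settings."""
    merged: List[Run] = []
    for text, bold, italic in runs:
        if not text:
            continue
        if merged and merged[-1][1:] == (bold, italic):
            merged[-1] = (merged[-1][0] + text, bold, italic)
        else:
            merged.append((text, bold, italic))
    return merged


def clean(runs: List[Run]) -> List[Run]:
    return merge_runs([(expand_ligatures(t), b, i) for t, b, i in runs])


if __name__ == "__main__":
    sample = [("The re", False, False), ("\ufb02", False, False),
              ("ex save is ", False, False), ("Hard", True, False),
              ("", False, False), ("!", True, False)]
    print(clean(sample))
```

Any further cleanup would slot in between two merge_runs calls, matching the "smush, clean up, smush again" order described above.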

Good luck!
 
Yeah, I tried doing the plain text copying, but I hate putting the italics and bolds back in and I don't want to do without them. Tedious.

And yeah, a lot depends on what you're using to view PDFs. I use Acrobat, though reluctantly. I actually prefer Foxit if all I'm doing is reading, but I've found that Acrobat is better at translating the "copy text" request into recognizable words. It does a good job putting the paragraphs where actual paragraphs are (as opposed to the end of every line, etc.).

You see, what I've learned is that a PDF document doesn't actually have text as such. It's all glyphs with spacing. So a PDF has no "spaces"--it has the space between glyphs, but no concept of "a space". So it has no words. And no paragraphs. It's all just spacing. So any reader that feeds you copied text has to intuit where words break and where paragraphs break when it receives a request to "copy" the selected glyphs. Some documents make that easy (by indenting paragraphs, for example). But I imagine that some documents make that very hard.

Fortunately for me, Paizo creates pretty clean documents. At least, clean enough that Acrobat does a very good job getting most words and most paragraphs right. Where it fails is, I've found, on the "fl" combination. Which is actually the opposite of a ligature. The problem is that in order to get the kerning right, there's more space than normal between the glyph for "f" and the glyph for "l" (because the top of the "l" shouldn't intersect the top of the "f" so the bases have to have longer-than-normal space between them). If they were a ligature, they'd already be joined and Acrobat would know to break the letters apart without spaces.
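
To make the "glyphs with spacing" idea concrete, here is a toy Python sketch of how a reader might guess word breaks from gaps alone. It is an assumption-heavy simplification and not Acrobat's actual heuristic: each glyph is just a character with a hypothetical x position and width, and a space is emitted whenever the gap to the previous glyph exceeds a threshold.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, x position, width): made-up units


def glyphs_to_text(glyphs: List[Glyph], space_threshold: float = 1.5) -> str:
    """Rebuild text from positioned glyphs, guessing spaces from the gaps."""
    out = []
    prev_end = None
    for ch, x, width in glyphs:
        if prev_end is not None and x - prev_end > space_threshold:
            out.append(" ")  # gap is wide enough to look like a word break
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# "a flag", with the f-to-l gap (2.0) set wider than the other letter gaps (0.2).
GLYPHS = [("a", 0.0, 5.0), ("f", 10.0, 3.0), ("l", 15.0, 2.0),
          ("a", 17.2, 5.0), ("g", 22.4, 5.0)]

if __name__ == "__main__":
    print(glyphs_to_text(GLYPHS))                       # tight threshold
    print(glyphs_to_text(GLYPHS, space_threshold=2.5))  # looser threshold
```

With the tighter threshold the wide f-to-l gap is mistaken for a space and the word comes out split; loosening the threshold fixes this word but would risk merging real words elsewhere, which is why a dictionary check after the fact is attractive.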
 
And yeah, a lot depends on what you're using to view PDFs. I use Acrobat, though reluctantly. I actually prefer Foxit if all I'm doing is reading, but I've found that Acrobat is better at translating the "copy text" request into recognizable words.
I don't use anything but Acrobat on my main machine. I do use an alternative viewer on my tablet, but only for viewing. I'd use Reader if I needed to do anything else but read a PDF on there.

You see, what I've learned is that a PDF document doesn't actually have text as such.
Welcome to simplified Postscript. :)

The problem is that in order to get the kerning right, there's more space than normal between the glyph for "f" and the glyph for "l" (because the top of the "l" shouldn't intersect the top of the "f" so the bases have to have longer-than-normal space between them).
A weird way for Paizo to build their document, but that's how it goes. I wonder what they don't like about the ligature.
 
Thank you for the program.

I will try it out later. Always good to have various tools in your arsenal :-)
 
I just tested it with some 15-line sections. Very nice for some uses. If doing a paste of a single column on a double-column page, I still find it faster to delete all formatting and paste as a single paragraph using Ctrl-Alt-V, since all the lines are short otherwise. It takes just as much time to fix this as the bold/italics. For full pages, it might be more useful.

I'd personally like an option in RW itself to paste unformatted text with paragraph breaks.
 
Do you mean you wish it had a keyboard shortcut? Ctrl-Shift-V perhaps? ;)

I only ask because Paste Special/Unformatted Text keeps the paragraph breaks as Realm Works sees them. RW can't change what Acrobat, Adobe Reader, or your chosen PDF viewer puts on the Clipboard, and (as mentioned above) the base PDF specification knows nothing about paragraphs.
 
It is a chore, but I've had to redo all the formatting for text that I have been copying and pasting between PDFs and Realm Works.

I tried a few options, but the PDFs I'm working with (Srun stuff) are all over the place in terms of formatting (i.e., it varies tremendously from one book to the next).
 
I just tested it with some 15-line sections. Very nice for some uses. If doing a paste of a single column on a double-column page, I still find it faster to delete all formatting and paste as a single paragraph using Ctrl-Alt-V, since all the lines are short otherwise. It takes just as much time to fix this as the bold/italics. For full pages, it might be more useful.

That sounds like something that originates in the copy source (a combination of publisher choices and reader program interpretation). Or, at least, I have no trouble selecting a single column in a multi-column document...
 
To clarify what I meant: if there is a double column in the PDF and I use Ctrl-Alt-V, I can either paste it all as one paragraph with no formatting, or paste it with the formatting (such as the paragraph breaks), but then it keeps the hard return (HRt) at the end of every line. Since it is a double column, the lines are short and I have to go through the snippet and remove each line break so that the text extends across the page properly. Otherwise, each line stays the same length as in the double column.
 

Yeah, I had this too.

I normally cut in the PDF, paste into Notepad++, fix up formatting, and then cut and paste into Realm Works. It's far easier to fix up the text in Notepad++ than in a snippet frame.

Bulk find and replace on special characters like tabs and CRs makes it quick and easy.
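
The same bulk find-and-replace idea can be scripted. The Python sketch below is only an illustration under one stated assumption: real paragraphs are separated by a blank line, and every other line break is a hard return left over from the column layout. It does not handle words hyphenated across lines. The function name unwrap is made up.

```python
import re


def unwrap(text: str) -> str:
    """Join short column lines into full paragraphs, keeping paragraph breaks."""
    # Normalize Windows (CR LF) and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Tabs become ordinary spaces.
    text = text.replace("\t", " ")
    # Assumption: a blank line marks a real paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, every remaining line break becomes a single space.
    joined = [re.sub(r"\s*\n\s*", " ", p) for p in paragraphs]
    # Collapse any runs of spaces left behind.
    joined = [re.sub(r" {2,}", " ", p) for p in joined]
    return "\n\n".join(joined)


if __name__ == "__main__":
    pasted = "The goblins\r\nattack at\r\ndawn.\r\n\r\nRoll\tfor\r\ninitiative."
    print(unwrap(pasted))
```

If the source PDF puts a break at the end of every line and nothing extra between paragraphs, this assumption fails and the paragraphs have to be marked by hand first.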
 
If you want to keep the formatting, you can search-and-replace paragraph marks in Word as well.

Yeah, I started out using Word and some carefully crafted replaces. I even toyed with scripting some of it. Found making an app easier, but that's my natural response to a problem...
 
Yeah, I had this too.

I normally cut in the PDF, paste into Notepad++, fix up formatting, and then cut and paste into Realm Works. It's far easier to fix up the text in Notepad++ than in a snippet frame.

Bulk find and replace on special characters like tabs and CRs makes it quick and easy.

That's a great solution for long sections and I will remember it for those. I usually find that most of what I am doing is two or three paragraphs long, and it is easier to clean that kind of thing up internally. On the other hand, I could just do a lot of pages at once and then cut and paste from Notepad++. It's worth a try in any case. Thanks for the idea.
 