Migrating PDFs to RW: Copy Issues

daplunk

Has anyone had experience with copying pre-made PDFs into Realm Works where the copy and paste goes all funky and adds a space after every character?

Example: c h a r a c t e r s a r e c h i ld r e n

Obviously, text like this is not going to be efficient to enter into the tool.

Is there a way to remove the excess spaces while ensuring a space remains after words?
 
The problem seems to be with the PDF, then. You can only copy what is there. Or does pasting into Notepad look any better?
 
It's the same in notepad. I tried Notepad++ and displayed all characters. It's got space in everything :(
 
I have a similar issue when pasting from Shadowrun pdfs. It is very odd, because it affects only words with certain sequences of letters. Afaik it is everything with "fi" and "fl". The sequence will be replaced with two spaces upon pasting it into any other program.

I guess that is some sort of copy protection. I don't really have a clue about programming, but probably in your case the pdf actually contains all those spaces, but somewhere in the code it tells the pdf reader to not display single spaces and display only one if there are two...
Just my guess :-D

Haven't found a workaround for that. If I am right, then a "conditioned paste option" would work theoretically. You would have to figure out what the pdf "really" looks like and what the pdf reader is told to ignore and then apply that to a paste mechanism.

I guess there are more urgent things for LWD to care about, but if a talented programmer out there has too much time :-D
 
I've often wondered why there are no PDF readers that can hide the layout and show only the text formatting. But there are likely too many layout issues with that.

If you use the option to save as txt in Adobe Reader, you will see. PDFs look like crap.
 
I've run into the "fi" and "fl" issue with Srun PDFs as well.
It affects all of them.

But then I take it as an opportunity to read the material I'm copypastaing into my realm, editing as I go.

 
It's not copy protection or a fault with PDFs, it's an unfortunate side-effect of ligatures in text.

My understanding is that when you put text into a PDF and save it, the text is converted into images (which is why a PDF can be viewed equally well regardless of the reading program: PDF stands for "portable document format" for a reason). When you extract text from one, the reader takes a "best guess" as to what those images are meant to be. Sometimes it fails to correctly separate common letter pairs, and sometimes it inserts spaces between characters that weren't in the original text and don't appear to exist in the PDF.

The only way for it to be corrected is for PDF to be altered as a document format to retain the details of the original text, something which hasn't been done in decades for some reason.
 
Chemlak nailed it. Some fonts are better than others for ligatures and translation from PDFs but there's a lot of voodoo and even a touch of black magic involved. Welcome to the world of typography.

For cases like the spacing issue, I've sometimes resorted to replacing double spaces with "xx", replacing single spaces with no space, then replacing "xx" with a single space. That works sometimes but it's a pain. Most of the time I retype and move on.
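
The three-step swap described above can be sketched in a few lines of Python. This is only an illustration, and it assumes the pasted text still has two or more spaces between words and exactly one space between letters; if the word gaps were flattened to single spaces, it cannot recover them. The function name `collapse_letter_spacing` is made up for this example, and the placeholder is a NUL character rather than "xx" so that it cannot collide with real text.

```python
def collapse_letter_spacing(text: str, placeholder: str = "\x00") -> str:
    """Remove the single spaces between letters, keeping one space per word gap."""
    # Step 1: protect real word gaps (two spaces) with a placeholder.
    text = text.replace("  ", placeholder)
    # Step 2: drop every remaining single space (the unwanted letter gaps).
    text = text.replace(" ", "")
    # Step 3: turn each placeholder back into a single space.
    return text.replace(placeholder, " ")


if __name__ == "__main__":
    pasted = "c h a r a c t e r s  a r e  c h i l d r e n"
    print(collapse_letter_spacing(pasted))  # characters are children
```

As the post says, this only works sometimes: it depends entirely on the extractor having left a wider gap between words than between letters.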
 
If you do a lot of copying from messy PDFs, it may be worth your time to become familiar with Notepad++, UltraEdit, or a similar text editor that can search for and remove/replace unwanted spaces, characters, and line breaks. A little bit of RegEx can go a long way.
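
As a rough idea of what "a little bit of RegEx" can look like, here is a small Python sketch using the standard `re` module. The function name `tidy_pasted_text` and the particular three rules are assumptions chosen for illustration, not a recipe from this thread. Note that the first rule also joins genuinely hyphenated words that happen to break across lines, so the result still needs a read-through.

```python
import re


def tidy_pasted_text(text: str) -> str:
    """Apply a few common cleanups to text pasted from a PDF."""
    # Rejoin words hyphenated across a line break: "charac-\nters" -> "characters".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace a lone line break inside a paragraph with a space,
    # but leave blank lines (paragraph breaks) alone.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


if __name__ == "__main__":
    messy = "The charac-\nters are\nchildren.\n\nNext   paragraph."
    print(tidy_pasted_text(messy))
```

The order matters: the hyphen rule has to run before line breaks are turned into spaces, otherwise the split word ends up as "charac- ters".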
 
@Chemlak is close, but he's not entirely correct. Ligatures are a definite complicating factor. But PDFs are not "images" (although they can contain them). The text portion of PDFs is actually just a bunch of characters on a page. The best analogy I can think of is the following...

1. Heat up a bowl of alphabet soup
2. Pull out the letters to form a sentence
3. Arrange them appropriately on the table in front of you
4. You now have the contents of a PDF

Yes, that's correct. A PDF is nothing more than individual letters positioned on a page. That's it. Nothing more.

So when you try to pull the text out of a PDF, all you have are the raw characters. Since each character is explicitly positioned on the page, the only way to get the "text" back out is to interpret the relative positions of those characters and calculate a "best guess" whether there's a space between them or not. Unfortunately, different fonts have different sizing and spacing characteristics. This means that the calculations for each font must take into consideration the wide-ranging font characteristics, which means understanding the details of each font and the individual symbols within it, which are all mathematically defined. That gets incredibly complicated, so nobody does it.

The net result is that simple logic is used to determine whether one character is "next to" another and whether a space should be inserted between them. That simple logic works alright for basic fonts, except when things like ligatures complicate things, and it fails horribly for more "interesting" fonts, which RPG publishers love to use to "dress up" their products.
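
To make the "simple logic" concrete, here is a toy Python sketch of a gap-based guess. It is not how any particular PDF reader works; the `Glyph` class, the `join_glyphs` function, and the `gap_ratio` threshold of 0.3 are all invented for this illustration. Each glyph only knows its left edge and its width, and a space is guessed whenever the gap to the previous glyph exceeds a fraction of that glyph's width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page (hypothetical units)
    width: float  # horizontal width of the glyph


def join_glyphs(glyphs: list[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text, guessing a space wherever the gap looks wide."""
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            # Simple heuristic: a gap wider than a fraction of the previous
            # glyph's width is treated as a word space.
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


if __name__ == "__main__":
    # Tightly set text: only the real word gap is wide enough.
    tight = [Glyph("a", 0, 5), Glyph("n", 5, 5), Glyph("o", 13, 5), Glyph("x", 18, 5)]
    print(join_glyphs(tight))   # an ox
    # Loosely spaced display font: the same rule now splits a single word.
    spaced = [Glyph("o", 0, 5), Glyph("x", 8, 5)]
    print(join_glyphs(spaced))  # o x
```

The second case is the failure described above: one fixed threshold cannot tell a decorative font's generous letter spacing from a real word gap.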

That's why you end up with text sometimes extracting like @daplunk cited above.

Hope this helps to explain what you're seeing! :)
 
I bow to Rob's superior understanding of PDFs (which should come as no great surprise to anyone, really).
 
Yes, that's correct. A PDF is nothing more than individual letters positioned on a page. That's it. Nothing more.

I copy words
To paste them into RealmWorks, but the spacing's off.
It takes so long
To edit what I've pasted, man this task is rough.

Text on a page.
A PDF's just text on a page.
Copy Paste
Proofing takes forever, and it hurts my eyes
All this text
I think I'll hit the forums, get some good advice

Ooooh, ooooh, ooooooooooooh
*violins*
 
Here's my favorite one for this issue: even after printing it and running it through a decent OCR program, it has issues.

If non~ ofth~ qu~tions ~aMWf'rw with
li~., and at IU5l On~ qu~.tionwa. an.w,,~wilh a
tm~ altS\<",r, th~ spirits II1iI)' implant a "P"ll in your
mind asArd<-sdlr$$oIkrCOllc.oct~d.. As Ions a.you
~ast at J.e.iil 011" spdI of 5£b In",1 or '-",r wilhin
l-l hours bdore casting Ard~.5af~conlaet,)'ou
immwiat~1y pu-~ a .prll in~of th~aprndw
spell. This n"" spdI is of th" """""' !nod a. th~
aprndw .prll. and is~by tM spirit in
q~n. no! by you: 'fOU nttd. DDt """" know 'M
spell, and in rareG1l56 ~ spdl IIW)'DDt nom be on
~clawiliR. Thccbo5rn spdI is 5Iond in your
mind as though p~in thr: nonn.tl. fashion. 1h~
spdl is 5riU prq»M ........, ifyou are a "P"n~
spdlG1lSt~r, m~anins thallhr .........."'~ spell 5101 can
only be apr~On 1Mrna..,., sprll (though 01M!"
spell MoIsan 1lIUl'Ifft~). If th~ implantw spell
""'I.uiU'S mat~rial compoIl"nts, you still mUSI provid~
thtm in ordu to cut ~ spdl. Thr implantw "P"U
u-mains prq»~ until th" =xt tim,,)'Ou ....t and
rKO\",r opells, and if it ham't!>ttn cast by the ~nd of
that time, it iswast~.
 
Yikes! More than likely, the 3PP is using a freeware font that doesn't properly support all the things that a professional font would normally include, which then results in the corresponding text just being gibberish.

I've seen this with a few fonts used sparsely in a few PDF products, but nothing remotely as messed up as the example above. Ugh.
 
Along with the possible font issue, I wonder what they're using to generate those PDFs and what application they came from.

If they left it in, you can see that info in Adobe Reader in the Properties window (File/Properties... or Ctrl-D) on the Description tab.
 
Let's see:

Helvetica
Helvetica Bold
Helvetica Oblique
Hidden HorzOCR (embedded)
Magical Medieval (Embedded subset)
Olsen TF Regular
Times Bold
Times Italic
Times Roman
 
OK then. It looks like some or all of the document was created by Acrobat's built-in OCR capabilities and they didn't bother to check and fix the resulting text. Depending on the options they chose it may actually look like a mix of text and bitmaps, or it might appear as just the scanned images and the text completely hidden (only used for searching and copy/paste).

FWIW: I have not seen this in legitimately published book PDFs but I can believe that some publishers might not care to fix their work, especially for old books for which they no longer have either the source document (and all of its parts) or the application in which it was created and a system that can run that application (or that predate "modern" desktop publishing). :(

(Sorry if I'm flailing a bit; I don't OCR things but I do a lot of other desktop publishing work.)
 