Senior Member
Join Date: Jan 2016
Location: Adelaide, Australia
Posts: 2,294
|
Has anyone had experience copying text from pre-made PDFs into Realm Works where the copy and paste goes all funky and adds a space after every character?
Example: c h a r a c t e r s a r e c h i ld r e n
The issue is that this is obviously not going to be efficient to enter into the tool. Is there a way to remove the excess spaces while ensuring a space remains between words? |
#1 |
Senior Member
Join Date: Dec 2013
Posts: 798
|
The problem seems to be with the PDF, then. You can only copy what is there. Or does the text look better when you paste it into Notepad?
Join the (unofficial) Realm-Works IRC Chat: #realm-works on the Rizon Network (https://wiki.rizon.net/index.php?title=Servers) -> Browser Client: https://kiwiirc.com/client/irc.rizon.net |
#2 |
Senior Member
Join Date: Jan 2016
Location: Adelaide, Australia
Posts: 2,294
|
It's the same in Notepad. I tried Notepad++ with all characters displayed, and there are spaces between everything.
|
#3 |
Member
Join Date: Mar 2016
Location: Cologne, Germany
Posts: 46
|
I have a similar issue when pasting from Shadowrun PDFs. It is very odd, because it affects only words with certain letter sequences. AFAIK it is everything with "fi" and "fl". The sequence is replaced with two spaces when pasted into any other program.
I guess that is some sort of copy protection. I don't really have a clue about programming, but in your case the PDF probably does contain all those spaces, and somewhere in the code it tells the PDF reader not to display single spaces and to display only one where there are two... Just my guess :-D I haven't found a workaround for that. If I am right, then a "conditional paste" option would work in theory. You would have to figure out what the PDF "really" looks like and what the PDF reader is told to ignore, and then apply that to a paste mechanism. I guess there are more urgent things for LWD to care about, but if a talented programmer out there has too much time :-D |
#4 |
Senior Member
Join Date: Dec 2013
Posts: 798
|
I have often wondered why there are no PDF readers that can hide the layout and show only the text formatting. But there are probably too many layout issues with that.
If you use the "save as txt" option in Adobe Reader, you will see: PDFs look like crap. |
#5 |
Senior Member
Join Date: Apr 2015
Posts: 343
|
I've run into the "fi" and "fl" issue with Shadowrun PDFs as well.
It affects all of them. But then I take it as an opportunity to read the material I'm copypastaing into my realm, editing as I go.
|
|
#6 |
Senior Member
Join Date: Aug 2012
Posts: 432
|
It's not copy protection or a fault with PDFs; it's an unfortunate side effect of ligatures in text.
My understanding is that when you put text into a PDF and save it, the text is converted into images (which is why PDFs can be viewed equally well regardless of the reading program: PDF stands for "portable document format" for a reason). When you extract text from them, the reader takes a "best guess" as to what those images are meant to be. It sometimes fails to correctly separate common letter pairs, and sometimes inserts spaces between characters that weren't in the original text and don't appear to exist in the PDF. The only way for it to be corrected is for PDF to be altered as a document format to retain the details of the original text, something which hasn't been done in decades for some reason.
Chief Calendar Champion Chemlak
Join the unofficial Realm Works IRC channel! Join #realm-works |
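For the "fi" and "fl" cases, here is a minimal Python sketch of one possible cleanup. It assumes the pasted text still contains the Unicode ligature characters (U+FB00 to U+FB04). If the ligature has already been turned into spaces during the paste, as described earlier in the thread, the letters are gone and this will not bring them back. The function name `expand_ligatures` is made up for illustration.

```python
# Map the common Latin ligature code points back to plain letters.
# Assumption: the ligature characters survived the copy and paste.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text: str) -> str:
    """Replace single-glyph ligatures with their individual letters."""
    return text.translate(LIGATURES)


if __name__ == "__main__":
    print(expand_ligatures("o\ufb03ce \ufb01le over\ufb02ow"))
```

A fixed table is used here rather than a general Unicode normalization so that no other characters in the text are changed.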
#7 |
Senior Member
Join Date: Jan 2012
Posts: 1,147
|
Chemlak nailed it. Some fonts are better than others for ligatures and translation from PDFs but there's a lot of voodoo and even a touch of black magic involved. Welcome to the world of typography.
For cases like the spacing issue, I've sometimes resorted to replacing double spaces with "xx", replacing single spaces with no space, then replacing "xx" with a single space. That works sometimes but it's a pain. Most of the time I retype and move on. |
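The three-step replace described above can be sketched in Python roughly as follows. This is only an illustration, and it relies on the same assumption as the manual trick: the real word gaps come through as two or more spaces, while the unwanted gaps are single spaces. The function name `collapse_spaced_text` is hypothetical, and a NUL character stands in for "xx" so the placeholder cannot clash with real text.

```python
import re


def collapse_spaced_text(text: str, placeholder: str = "\x00") -> str:
    """Drop single spaces between letters; keep word gaps as one space."""
    # Step 1: mark every run of two or more spaces (assumed word gaps).
    marked = re.sub(r" {2,}", placeholder, text)
    # Step 2: remove the remaining single spaces between letters.
    squeezed = marked.replace(" ", "")
    # Step 3: turn the markers back into ordinary single spaces.
    return squeezed.replace(placeholder, " ")


if __name__ == "__main__":
    sample = "c h a r a c t e r s  a r e  c h i l d r e n"
    print(collapse_spaced_text(sample))
```

If the word gaps in the paste are also single spaces, there is nothing left to tell words apart and this approach cannot work.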
#8 |
Senior Member
Join Date: Dec 2014
Location: Twin Cities Area, MN, USA
Posts: 1,325
|
If you do a lot of copying from messy PDFs, it may be worth your time to become familiar with Notepad++, UltraEdit, or a similar text editor that can search for and remove or replace unwanted spaces, characters, and line breaks. A little bit of RegEx can go a long way.
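As a rough idea of what "a little bit of RegEx" might look like, here is a small Python sketch that cleans up three common kinds of paste debris: hard line breaks inside paragraphs, runs of spaces, and stray spaces before punctuation. The function name `tidy_pdf_paste` and the exact patterns are illustrative choices, not something taken from a particular tool.

```python
import re


def tidy_pdf_paste(text: str) -> str:
    """Apply a few simple regex cleanups to text pasted from a PDF."""
    # A lone newline becomes a space; blank lines (paragraph breaks) are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Runs of spaces or tabs collapse to a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Spaces directly before punctuation are removed.
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    messy = "The  quick brown\nfox jumps , then\nrests .\n\nNew  paragraph."
    print(tidy_pdf_paste(messy))
```

Similar patterns can usually be adapted to an editor's regex search and replace, though the syntax details vary between editors.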
RW Project: Dungeons & Dragons 5th edition homebrew world Other Tools: CampaignCartographer, Cityographer, Dungeonographer, Evernote |
#9 |
Senior Member
Lone Wolf Staff
Join Date: May 2005
Posts: 8,232
|
@Chemlak is close, but he's not entirely correct. Ligatures are a definite complicating factor. But PDFs are not "images" (although they can contain them). The text portion of PDFs is actually just a bunch of characters on a page. The best analogy I can think of is the following...
1. Heat up a bowl of alphabet soup
2. Pull out the letters to form a sentence
3. Arrange them appropriately on the table in front of you
4. You now have the contents of a PDF
Yes, that's correct. A PDF is nothing more than individual letters positioned on a page. That's it. Nothing more. So when you try to pull the text out of a PDF, all you have are the raw characters.
Since each character is explicitly positioned on the page, the only way to get the "text" back out is to interpret the relative positions of those characters and calculate a "best guess" whether there's a space between them or not. Unfortunately, different fonts have different sizing and spacing characteristics. This means that the calculations for each font must take into consideration the wide-ranging font characteristics, which means understanding the details of each font and the individual symbols within it, which are all mathematically defined. That gets incredibly complicated, so nobody does it.
The net result is that simple logic is used to determine whether one character is "next to" another and whether a space should be inserted between them. That simple logic works alright for basic fonts, except when things like ligatures complicate things, and it fails horribly for more "interesting" fonts, which RPG publishers love to use to "dress up" their products. That's why you end up with text sometimes extracting like @daplunk cited above.
Hope this helps to explain what you're seeing! |
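To make the "simple logic" idea concrete, here is a toy Python sketch of position-based space guessing. It is not how any particular PDF reader works. The `Glyph` class, the `extract_line` function, and the `gap_ratio` threshold are all invented for illustration, and real extractors deal with far more detail (font metrics, kerning, rotation, multiple lines).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the character drawn on the page
    x: float      # left edge of the glyph
    width: float  # horizontal space the glyph occupies


def extract_line(glyphs: list[Glyph], gap_ratio: float = 0.3) -> str:
    """Guess the text of one line from glyph positions alone."""
    out: list[str] = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            # Distance between the end of the previous glyph and this one.
            gap = g.x - (prev.x + prev.width)
            # Naive rule: a gap wider than a fraction of the previous
            # glyph's width is treated as a space.
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


if __name__ == "__main__":
    tight = [Glyph("h", 0, 5), Glyph("i", 5, 3), Glyph("y", 12, 5), Glyph("o", 17, 5)]
    loose = [Glyph("h", 0, 5), Glyph("i", 7, 3), Glyph("y", 16, 5), Glyph("o", 23, 5)]
    print(extract_line(tight))  # tightly set font
    print(extract_line(loose))  # font with wide letter spacing
```

With the tightly set glyphs the rule finds one word gap, while the loosely spaced glyphs trip the same threshold between every letter, which mirrors the spaced-out text reported at the top of the thread.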
#10 |