Re: Index of Glowfics

Posted: Thu Jan 28, 2016 2:17 pm
by Throne3d
Marri wrote:I have been waiting for you to get it into a state you're happy with, but then I'm totally down to convert it into a Rails-happy version I can use on my site.
Oh, right. Well, I've been pretty happy with it for a few weeks now, I think; it's mainly just been finding new indexes to add, and for each one I just add a small-ish bit of code at a single point so it can scrape that index when you run it. If you set up a Rails version, I should probably be able to add new indexes to that if I come across any more.

I don't think I've changed much of the actual structure of the epubs in a while, and that wouldn't affect you anyway. It downloads the pages fine, it gets the content from them fine, it can get more content if you need it to get more data (shouldn't be too difficult), and I'm only adding new indexes when I find them, but if I find more you can always use those later to import stuff.

Basically, I think it's already in a state where I'm pretty much happy with it!

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 2:40 pm
by MaggieoftheOwls
Oh, that reminds me, I do have an index now. https://maggie-of-the-owls.dreamwidth.org/454.html

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 4:58 pm
by Throne3d
MaggieoftheOwls wrote:Oh, that reminds me, I do have an index now. https://maggie-of-the-owls.dreamwidth.org/454.html
Cool! I'll add a scraper for that.

Um, I know it's a bit to ask, but is it possible that people could keep their indexes in a similar format? Perrrhaps also keep the HTML consistent (like, if you're gonna do a linebreak (<br />), could you keep it outside any <em> or <b> or <u> or whatever tags, unless you seriously need a multi-line underlined thing?). It'd be nice if sections were put into either ordered lists like:

Code:

<ol><li>Entry1</li><li>Entry2</li></ol>
Or unordered lists like:

Code:

<ul><li>My thing</li><li>A thing I did with A</li><li>Another thing</li></ul>
With the section name juuust above it (either as a regular string, since I can find that, or in a nice tag, like bolded or something, and if you want "extras" (e.g. "Blah (with ABC)"), make the extras also tagged, maybe italicised). Like this:

Code:

Hi! This is my index.<br /><strong>Section 1</strong> <em>(with Name1)</em><ol><li><a href="URL1">Part 1</a> of section 1.</li><li><a href="URL2">Part 2</a> of section 1.</li></ol><br /><br /><strong>Other stories</strong><ul><li><a href="URL3">A story</a></li><li><a href="URL4">Another story</a></li></ul><br /><br />I hope you like them!
Like, the Effulgence index is good for the most part, since I can just go "look for all links" (the thread URLs), then I can go "look for all text in that numbered point" (the thread names), then I can go "look for the bit of text in the numbered point outside that" (the section names), and then just move on. Other things are not so great (and I'm not trying to name and shame here, seriously! I enjoy your content, and I get that not everyone gets HTML and so on, and that it's effort to maintain it and everything, so if you really want I can just do it for you, send you it, and hopefully you can just maintain the same format later), but... when you've got:

Code:

<strong><u>Thing</u></strong><em><br>Section</em> (extra stuff)<br>1. <a href="about:blank">Link1</a><br>3. <a href="about:blank">Thing 2</a><u><br></u>
It has a random underlined linebreak, and the linebreaks are sometimes inside the section titles, and there are a couple of weird characters, and the numbers in the lists are written in, rather than being automatically done by ordered lists. I have quite a bit of code dedicated to working around the different formats people use.
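
To show why the list-based layout is easy to scrape, here is a minimal sketch using only Python's standard-library html.parser. It is not the actual ebook scraper; the names IndexParser, parse_index and SAMPLE are made up for illustration, and the sample is modelled on the recommended format above. It assumes section names are in <strong>, extras are in an <em> that comes before the section's first link, and every <a> is a thread link.

```python
from html.parser import HTMLParser

# Hypothetical sample index, modelled on the recommended format.
SAMPLE = (
    'Hi! This is my index.<br /><strong>Section 1</strong> '
    '<em>(with Name1)</em><ol>'
    '<li><a href="URL1">Part 1</a> of section 1.</li>'
    '<li><a href="URL2">Part 2</a> of section 1.</li></ol><br /><br />'
    '<strong>Other stories</strong><ul>'
    '<li><a href="URL3">A story</a></li>'
    '<li><a href="URL4">Another story</a></li></ul>'
    '<br /><br />I hope you like them!'
)


class IndexParser(HTMLParser):
    """Collect section names, extras and thread links from an index."""

    def __init__(self):
        super().__init__()
        self.sections = []   # one dict per section, in document order
        self._tag = None     # "strong", "em" or "a" while inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            # Assumption: every <strong> starts a new section.
            self.sections.append({"name": "", "extras": "", "entries": []})
            self._tag = tag
        elif tag == "em" and self.sections and not self.sections[-1]["entries"]:
            # Assumption: extras come before the section's first link.
            self._tag = tag
        elif tag == "a" and self.sections:
            url = dict(attrs).get("href")
            self.sections[-1]["entries"].append({"title": "", "url": url})
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first section (the greeting)
        current = self.sections[-1]
        if self._tag == "strong":
            current["name"] += data
        elif self._tag == "em":
            current["extras"] += data
        elif self._tag == "a":
            current["entries"][-1]["title"] += data


def parse_index(html):
    parser = IndexParser()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Calling parse_index(SAMPLE) gives one dict per section with its name, extras and a list of title/URL entries. The hand-numbered format with stray <u> and <br> tags would need extra special cases on top of this, which is where the workaround code comes from.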

I mean, I can get around it, and I have, and I suppose it's effort if you guys actually don't care whether I do or don't generate ebooks using your indexes, so maybe you don't want to, but if you couuuld, because you're making a new index or something, that'd be great. If it's a lot of work, just tell me, and I'll do it for you, so you can then just copy-and-paste and try to keep it in the same format later? Sorry. :\

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 7:13 pm
by Alicorn
I don't mean to be doing any horrid formatting in mine, but I'm very attached to putting double spaces between sentences so DW might be interpreting badly...

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 7:45 pm
by DanielH
I don’t think it’s the double spaces.

Part of the problem seems to be the order of the tags. Try to make sure to put <u>, <em>, etc. on the line with the stuff that should be formatted. However, I’m pretty sure part of the problem is that the Dreamwidth auto-formatter has a lot of problems, and those can’t be fixed unless you want to manage linebreaks and stuff manually.

For example, in Incandescence it nests each item of each section’s list inside the next, and each section inside the next. Ordinarily you should have

Code:

<ol>
    <li>[link 1]</li>
    <li>[link 2]</li>
    <li>[link 3]</li>
    <li>[link 4]</li>
    <li>[link 5]</li>
</ol>
Instead, it gives

Code:

<ol>
    <li>[link 1]
    <li>[link 2]
    <li>[link 3]
    <li>[link 4]
    <li>[link 5]
    </li></li></li></li></li>
</ol>
I think you are using the auto-formatter instead of hand-writing the HTML, and there is just no reason Dreamwidth should try that no matter what you’re doing. I was trying to ask you to fix this in the EPUB thread, but before I could clearly communicate the problem and how to fix it, Throne3d came along with something that parsed it anyway.

I think the conclusion to draw is that the auto-formatter is bad. It’s still a nice feature and I would not blame anybody for using it, but it makes things harder on the people who want to actually read and parse the HTML.

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 9:53 pm
by Ezra
Sometimes people write old-school HTML, where the close tags for things like "<p>" and "<li>" are inferred from context, never written out. It's been less fashionable since xhtml came on the scene, but I'm pretty sure it's still legal in HTML5.

I have to parse that kind of HTML for making the Elcenia print editions, certainly.
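
As a rough illustration of inferring the close tags from context, here is a minimal sketch with Python's standard-library html.parser. It is not the code used for the print editions; ListItemParser and list_items are hypothetical names. It assumes flat lists (no lists nested inside list items) and treats each new <li> as closing the previous one.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the text of each <li>, whether or not </li> is written."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._open = False  # True while inside a list item

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly closes the previous one.
            self.items.append("")
            self._open = True

    def handle_endtag(self, tag):
        # An explicit </li>, or the end of the list, closes the item.
        if tag in ("li", "ol", "ul"):
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.items[-1] += data


def list_items(html):
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]
```

Because the parser only reacts to tag events and never builds a tree, list_items returns the same items for "<ol><li>One<li>Two</ol>", for the fully closed version, and for a version where all the </li> tags are piled up at the end of the list.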

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 10:13 pm
by DanielH
Ah, that makes sense. And then I guess Dreamwidth tries and fails to sensibly add close tags. Because the close tags are there, and they mess up the BeautifulSoup library when it tries to parse the HTML. I bet it could handle it if the close tags were not there at all.

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 10:43 pm
by Alicorn
I handwrite my HTML, but I don't close tags that the thing should be able to figure out its own self.

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 10:44 pm
by Adelene
DanielH wrote:Ah, that makes sense. And then I guess Dreamwidth tries and fails to sensibly add close tags. Because the close tags are there, and they mess up the BeautifulSoup library when it tries to parse the HTML. I bet it could handle it if the close tags were not there at all.
What happens if you just strip out all the </p> and </li> tags from everything, properly formatted or not, before you do any other parsing?
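
The stripping step itself could look something like the sketch below, using only Python's re module. The function names are made up for illustration, and the whitespace clean-up is an extra assumption added so the two forms can be compared. It only shows the pre-processing; whether the parser then builds the right tree depends on the parser.

```python
import re

# Matches </p> and </li>, in any letter case, with optional trailing space.
CLOSE_TAGS = re.compile(r"</(?:p|li)\s*>", re.IGNORECASE)
# Matches a tag together with any whitespace directly around it.
AROUND_TAGS = re.compile(r"\s*(<[^>]+>)\s*")


def strip_optional_close_tags(html):
    """Remove every </p> and </li>, properly placed or not."""
    return CLOSE_TAGS.sub("", html)


def normalise(html):
    """Strip the optional close tags, then drop whitespace around tags."""
    stripped = strip_optional_close_tags(html)
    return AROUND_TAGS.sub(r"\1", stripped).strip()


# A well-formed list and one with the </li> tags piled up at the end.
PROPER = "<ol>\n    <li>[link 1]</li>\n    <li>[link 2]</li>\n</ol>"
NESTED = "<ol>\n    <li>[link 1]\n    <li>[link 2]\n    </li></li>\n</ol>"
```

After this, normalise(PROPER) and normalise(NESTED) come out as the same string, so the piled-up close tags can no longer make one form parse differently from the other.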

Re: Index of Glowfics

Posted: Thu Jan 28, 2016 11:41 pm
by DanielH
I don’t know; before I tried to get the parser to work Throne3d came along with a working one. I expect BeautifulSoup would handle it correctly, but I haven’t really used the package much.

If you handwrite the HTML including the line breaks, then I think some of what Throne3d requested boils down to putting the <br />s outside of the <u>s and <em>s.