Re: Index of Glowfics
Posted: Fri Jan 29, 2016 10:05 am
Ah, right, sorry, I don't think I was clear enough. Most of the HTML that people have written by hand is probably fine.
A long explanation of how the script scrapes each index:
-------------------------------
In the Incandescence index, it's a bit weird that there are <br>s inside some tags and outside others (like here - why isn't that last <br> just before the <ol>?), but the format is readily readable: look for an ordered list (1., 2., 3.), then look for all the list elements inside that. In each list element, get all the text that comes before a new list (e.g. "Chamomile"). After that, go through each of the elements of the sub-list (e.g. "Picknicking"), get the URL, get the text, and add that to the list of chapters.
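The walk described above can be sketched roughly as follows. This is only an illustration built on Python's standard html.parser, not the actual script; the class name, the parse_index helper, the example.com URLs and every title other than "Chamomile" and "Picknicking" are made up for the example.

```python
from html.parser import HTMLParser


class NestedListIndexParser(HTMLParser):
    """Collects (section, chapter title, URL) from a two-level nested list."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # how many <ol>/<ul> elements we are inside
        self.section_parts = []   # text seen in a top-level <li> before its sub-list
        self.section = ""
        self.href = None          # URL of the link currently being read, if any
        self.link_parts = []
        self.chapters = []        # (section, title, url) tuples

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.depth += 1
            if self.depth == 2:
                # A sub-list starts: the text before it was the section name.
                self.section = " ".join("".join(self.section_parts).split())
        elif tag == "li" and self.depth == 1:
            self.section_parts = []
        elif tag == "a" and self.depth == 2:
            self.href = dict(attrs).get("href")
            self.link_parts = []

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self.depth -= 1
        elif tag == "a" and self.href is not None:
            title = " ".join("".join(self.link_parts).split())
            self.chapters.append((self.section, title, self.href))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.link_parts.append(data)
        elif self.depth == 1:
            self.section_parts.append(data)


def parse_index(html):
    parser = NestedListIndexParser()
    parser.feed(html)
    parser.close()
    return parser.chapters


# Hypothetical sample markup; the URLs are placeholders.
SAMPLE = """
<ol>
<li>Chamomile<br>
<ol>
<li><a href="http://example.com/1">Picknicking</a></li>
<li><a href="http://example.com/2">Second thread</a></li>
</ol>
</li>
<li>Another section
<ol>
<li><a href="http://example.com/3">Third thread</a></li>
</ol>
</li>
</ol>
"""

for section, title, url in parse_index(SAMPLE):
    print(section, "|", title, "|", url)
```

Because this sketch only counts list depth and never looks at "</li>", it does not care whether list items are explicitly closed.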
The Effulgence format is largely similar. The difference is that the HTML on that page is written with an ending tag for each <li> element, whereas the Incandescence page leaves them out. Dreamwidth messes this up a little, so on the Incandescence index I just look for any list item that has another list inside it, take that to be the "Chamomile" thing, then look for any list item inside that. The sub-lists seem to have list elements that show up properly, but the nesting of the main list seems to be a bit funny. I'll try Adelene's idea of stripping "</li>" tags to see if that helps, since it sounds like it probably will. On the Effulgence index, the list items all seem to be explicitly closed, so Dreamwidth doesn't do anything stupid and I can go through it as I described above. If people are writing new indexes, it'd probably be nice to explicitly close the list items just in case, since it definitely seems like Dreamwidth is being stupid with that.
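As a rough illustration of the "strip the ending tags" idea: removing every "</li>" makes explicitly closed markup and unclosed markup identical before any parsing happens. The function name and sample strings below are hypothetical, and whether this actually cures what Dreamwidth does to the nesting is untested here.

```python
import re


def strip_li_end_tags(html):
    """Remove every </li> so closed and unclosed list items look the same."""
    return re.sub(r"</li\s*>", "", html, flags=re.IGNORECASE)


# Hypothetical fragments: the same list written with and without ending tags.
closed = "<ol><li>Chamomile<ol><li>A</li><li>B</li></ol></li></ol>"
unclosed = "<ol><li>Chamomile<ol><li>A<li>B</ol></ol>"

print(strip_li_end_tags(closed) == unclosed)
```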
For the sandbox, I'm currently going through everything in a really horrible way. I remove all "<em>" tags, since they make it harder to parse. I go through each top-level element in the entry and check whether it contains the text "SHORT FORM SANDBOXES" or "MULTI-THREAD PLOTS" or whatever, and if so update the script so it knows which section it's in. Then I find every link: at each one I start a new chapter, set the URL, add the text of any other elements (like "(with kappa)"), and wait until I find another link, at which point I process that chapter. You may have noticed the error with only saving a chapter when I find a new link: it fails to add the last chapter in a list. I fixed this by manually getting it to check whether it needs to save a chapter when it changes major group (e.g. SHORT FORM SANDBOXES). If, instead, each "Milliways Meetings" link (and accompanying text) were inside a <li> tag in a <ul>, that'd make it a lot easier to work out what text goes with which link, and to ensure I go through them all properly. It would probably be some effort to change, though, and if it were to be done I'd probably be the one to do it, since it's mainly just for the EPUB thing. The current layout probably looks better too, so people are probably more attached to it.
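The link-by-link scan, including the "don't lose the last chapter" fix, might look something like the sketch below. It is an assumption-laden illustration using Python's standard html.parser rather than the real script: the class and function names, the example.com URLs and the "Second thread" and "Third thread" titles are invented, and the extra save at the end of the input goes beyond the section-change check described above.

```python
from html.parser import HTMLParser

SECTION_HEADINGS = ("SHORT FORM SANDBOXES", "MULTI-THREAD PLOTS")


class FlatIndexParser(HTMLParser):
    """Groups each link with the loose text that follows it."""

    def __init__(self):
        super().__init__()
        self.section = None
        self.current = None    # chapter being built, or None
        self.in_link = False
        self.chapters = []

    def flush(self):
        """Save the chapter in progress, if there is one."""
        if self.current is not None:
            for key in ("title", "notes"):
                self.current[key] = " ".join(self.current[key].split())
            self.chapters.append(self.current)
            self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.flush()       # a new link ends the previous chapter
            self.current = {"section": self.section,
                            "url": dict(attrs).get("href"),
                            "title": "", "notes": ""}
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        heading = next((h for h in SECTION_HEADINGS if h in data), None)
        if heading is not None:
            self.flush()       # changing major group: save the last chapter
            self.section = heading
        elif self.current is not None:
            self.current["title" if self.in_link else "notes"] += data

    def close(self):
        super().close()
        self.flush()           # assumption: also save at the end of the input


def parse_flat_index(html):
    parser = FlatIndexParser()
    parser.feed(html)
    parser.close()
    return parser.chapters


# Hypothetical sample markup; the URLs are placeholders.
SAMPLE = """
<b>SHORT FORM SANDBOXES</b><br>
<a href="http://example.com/a">Milliways Meetings</a> <em>(with kappa)</em><br>
<a href="http://example.com/b">Second thread</a><br>
<b>MULTI-THREAD PLOTS</b><br>
<a href="http://example.com/c">Third thread</a> (with kappa)
"""

for chapter in parse_flat_index(SAMPLE):
    print(chapter)
```

Since the sketch only reads text and links, styling tags such as "<em>" are skipped without having to be removed first.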
Pixiethreads gets processed in the same way as Effulgence. "Glowfic" has no neat, textual index, so I go through the monthly archives of posts and add them through that. Marri's index gets processed in a similar way to Effulgence's, but the headings are more like the example I gave, rather than the headings being parts of a list themselves (as in Effulgence and Incandescence). Radon-absinthe is done in a similar way to Effulgence, but the sections (parts of the main list) don't actually have their own names, so I've done away with the part that processes that, and I've just got an incrementing counter. Peterverse's index has been specifically coded in (it's rather different from the others, and it has a few of the weird "<br>" tags hidden inside tags at points, and outside at others). Maggie's index has been specifically coded in, but it's rather easy to process, so it's not too bad (but it'd be kinda nicer if the indexes were more consistent).
-------------------------------
I'm not saying every index has to be identical, just that it'd be nice if they were a bit more similar structurally, or conformed to two or three templates. Then I could just make the script use template X to process Effulgence, Incandescence, Pixiethreads and so on (numbered sections) and template Y to process Sandbox and Maggie's stuff (not numbered sections), or something like that.
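In script terms, the template idea amounts to a small lookup table, roughly like this. The constant and function names are hypothetical, and only the indexes named in the paragraph above are assigned a template.

```python
NUMBERED = "numbered sections"        # "template X"
UNNUMBERED = "not numbered sections"  # "template Y"

# Which template each index would be processed with.
TEMPLATE_FOR_INDEX = {
    "Effulgence": NUMBERED,
    "Incandescence": NUMBERED,
    "Pixiethreads": NUMBERED,
    "Sandbox": UNNUMBERED,
    "Maggie": UNNUMBERED,
}


def template_for(index_name):
    """Return the template for an index, or fail loudly if none is assigned."""
    try:
        return TEMPLATE_FOR_INDEX[index_name]
    except KeyError:
        raise ValueError(f"no template assigned to index {index_name!r}") from None


print(template_for("Incandescence"))
print(template_for("Sandbox"))
```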
Double spaces should be fine. If they're double spaces as in double line-breaking, that's also fine. It'd be nice if line-breaks were put together, outside of tags that style the text (such as <em> or <strong>), but inside (at the very end of) all "<li>" tags (you apparently shouldn't have "<br>" tags directly inside "<ul>" or "<ol>" tags). As far as I can tell, paragraphs are totally ignored everywhere, since people don't put these links in paragraphs. I'll try ignoring all "</li>" tags and see if that fixes how it processes the stuff that Dreamwidth messes up, but if anyone's making a new index, it'd probably be nice if you explicitly closed list elements. If you're consistently bolding and underlining and italicising or whatever, it'd be nice if you could keep the order consistent (if you use "<em><b>blah</b></em>" somewhere, can you please not use "<b><em>blah2</em></b>" later on?). Not randomly styling things that don't show up as text (like linebreaks) would be nice (e.g. don't do "<u><br></u>"; remove the "<u>" and "</u>" if you ever see this, please). Keeping <br /> tags outside of this styling wherever possible, so they're as high-level as possible, would also be nice (e.g. "<u>section1</u><br><b>blah</b>", not "<u>section1<br></u><b>blah</b>").
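The two line-break clean-ups asked for above (dropping styling that wraps nothing but line breaks, and moving trailing line breaks out of styling tags) can be sketched with a couple of regular expressions. This is a hypothetical helper, not part of the script: the names are invented, it only knows the five styling tags listed, and it assumes the opening tags carry no attributes.

```python
import re

STYLE_TAGS = "em|strong|b|i|u"

# A styling element containing only line breaks, e.g. <u><br></u>
ONLY_BREAKS = re.compile(
    rf"<({STYLE_TAGS})>((?:\s*<br\s*/?>)+\s*)</\1>", re.IGNORECASE)
# Line breaks sitting just before a closing styling tag, e.g. <u>section1<br></u>
TRAILING_BREAKS = re.compile(
    rf"((?:<br\s*/?>\s*)+)</({STYLE_TAGS})>", re.IGNORECASE)


def _until_stable(pattern, repl, html):
    """Apply a substitution repeatedly until it no longer changes anything."""
    while True:
        new = pattern.sub(repl, html)
        if new == html:
            return html
        html = new


def hoist_line_breaks(html):
    # First unwrap styling that contains nothing but line breaks...
    html = _until_stable(ONLY_BREAKS, r"\2", html)
    # ...then move trailing line breaks to just after the closing tag.
    return _until_stable(TRAILING_BREAKS, r"</\2>\1", html)


print(hoist_line_breaks("<u>section1<br></u><b>blah</b>"))
print(hoist_line_breaks("<u><br></u>"))
```

Each substitution is repeated until nothing changes so that nested styling, such as a line break inside "<em><b>...</b></em>", is handled one layer at a time.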
Again, I'd be willing to make the HTML of people's indexes more consistent if they'd be happy for me to do so (I'd look at the page, copy the HTML, change it, then send it to you so you can hopefully update the post with the new layout / small changes).