songfight.org Homepage Issues

Posted: Thu Jan 24, 2013 1:05 pm
by Lunkhead
Maybe we can collect all these random little things in one thread rather than starting a new thread for each?

First off, while I am happy to keep seeing Merisan's only victory touted on the homepage, it made me realize that the "five years ago" and "ten years ago" content has gotten stale.

Re: songfight.org Homepage Issues

Posted: Fri Jan 25, 2013 5:50 pm
by Manhattan Glutton
The site started using mod_pagespeed - which apparently embeds images as base64 into the HTML. I had to fix any scraping I was doing.

Not that I'm complaining. I fixed it.
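For anyone hitting the same thing, here is a minimal sketch of how a scraper might cope when an `<img>` `src` turns into an inlined `data:` URI instead of a normal URL. The function name `image_bytes_or_url` and the sample values are made up for illustration; this is not the fix that was actually applied.

```python
import base64
from urllib.parse import unquote_to_bytes


def image_bytes_or_url(src):
    """Classify an <img> src value.

    Returns ("data", raw_bytes) when the image is inlined as a data: URI,
    otherwise ("url", src) so the caller can fetch it as before.
    """
    if src.startswith("data:"):
        # A data URI looks like: data:<mime>[;base64],<payload>
        header, _, payload = src.partition(",")
        if header.endswith(";base64"):
            return "data", base64.b64decode(payload)
        # Without ";base64" the payload is percent-encoded
        return "data", unquote_to_bytes(payload)
    return "url", src


# Hypothetical inputs: one inlined image, one ordinary link
inlined = "data:image/gif;base64," + base64.b64encode(b"GIF89a").decode("ascii")
kind, value = image_bytes_or_url(inlined)
print(kind, value)
print(image_bytes_or_url("http://example.com/logo.gif"))
```

Since the rewriting can change from one page load to the next, a scraper probably has to accept both forms rather than assume either one.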

Re: songfight.org Homepage Issues

Posted: Fri Jan 25, 2013 7:05 pm
by Lunkhead
Whoa, mod_pagespeed and embedded images? That's kind of some futuristic stuff for Song Fight! How did that get there...?

Re: songfight.org Homepage Issues

Posted: Fri Jan 25, 2013 7:35 pm
by fluffy
Automatically.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 11:33 am
by jb
I turned on the auto-google-analytics for the domain, and was required to turn on pagespeed. Sorry if I messed anything up in your scripts.

J

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 2:31 pm
by fluffy
mod_pagespeed is pretty cool, although it seems like a greedy optimization that ignores that a lot of the small images are going to be cached (but then again, the amount of data embedded via base64 is probably less than the total amount of data and latency incurred by added HTTP transactions), and it also is a bit aggressive at "optimizing" JavaScript in ways which break Project Wonderful's ad-checking bot (not that that affects Song Fight, but it does affect me). It definitely will break anything that scrapes HTML, though (and the way it breaks will always be a moving target), but HTML-scraping is also a terrible thing to do.

Since the Song Fight jukebox uses the underlying archive data for its updates, it's probably better for those of us who are scraping the site to switch to using Lunkhead's jukebox API for data updates instead. It'd definitely be safer and more reliable, in any case.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 7:23 pm
by Manhattan Glutton
The scrape that broke was the news page (eh, since news.rss is broken/not-robust or whatever). I just had to change the user agent and the XPath to be less specific. Not that it mattered for anybody. I swear I'll get to releasing this app this year.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 8:41 pm
by Lunkhead
I could poll the news.rss and import/archive the news items and make them available as json/xml/etc. if anybody wanted to use that.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 9:22 pm
by fluffy
rss.php's output should be as robust as any other XML document, but you need to parse it as XML, not using regular expressions or whatever. I have no idea what mod_pagespeed could have done to it anyway, though; as far as I can tell, mod_pagespeed isn't affecting it at all, based on a diff of its raw output vs. curl'ing it.

Or is there some other RSS feed that you're referring to? There isn't one named news.rss so far as I'm aware.
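To illustrate the "parse it as XML" point, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The sample feed below is invented for the example and is not the actual output of rss.php; the only assumption is a plain RSS 2.0 layout with `<item>` elements under `<channel>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, NOT real rss.php output
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>New title: Salt &amp; Pepper</title>
      <link>http://example.com/fight/1</link>
    </item>
    <item>
      <title><![CDATA[News: <b>server</b> maintenance]]></title>
      <link>http://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return a list of (title, link) tuples for every <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the decoded text: entities and CDATA are resolved
        items.append((item.findtext("title"), item.findtext("link")))
    return items


for title, link in parse_items(SAMPLE_FEED):
    print(title, "->", link)
```

A real XML parser resolves entities like `&amp;` and CDATA sections on its own, which is exactly the kind of thing a regular expression tends to get wrong.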

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 10:01 pm
by Lunkhead
Oh, right, the correct name is "rss.php" it looks like.

Just offering the news items in a different format isn't really adding much value, but I was thinking that archiving them and making them searchable might be kinda neat.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 10:35 pm
by fluffy
The value it adds is that you can subscribe to it via a feed reader so that you know when a new title is posted, when the new fight is posted, and when a new news item has been posted. You know, like any other RSS feed.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 10:39 pm
by Manhattan Glutton
I reported an issue with rss.php about a year ago, and out of that additional issues cropped up with regard to its existence and maintenance, so I figured it was a lost cause. If that's changed, I certainly would like to know, but I've already gone to the trouble to make my own news scrape: http://api.sfbase.net/news.php

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 10:47 pm
by fluffy
What was the problem? Has it been fixed? It's almost certainly going to be more reliable than scraping HTML.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 11:13 pm
by Lunkhead
fluffy wrote: The value it adds is that you can subscribe to it via a feed reader so that you know when a new title is posted, when the new fight is posted, and when a new news item has been posted. You know, like any other RSS feed.
I meant the value added by me importing the data into the Jukebox via polling the RSS.

Re: songfight.org Homepage Issues

Posted: Sun Jan 27, 2013 11:58 pm
by fluffy
Oh, that. Yeah, I don't know, might be neat, but since all of the item types are undifferentiated there'll be a lot of information that isn't really useful. I suppose I could add tags to indicate whether it's news, title, songs posted, or results, though. It'd be pretty easy since all the underlying data comes from different sources anyway.

[edit] added
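A minimal sketch of how a consumer could use such tags to keep only the item types it cares about. It assumes the tags are exposed as standard RSS 2.0 `<category>` elements with values like `news` or `title`; the thread does not say how they were actually added, so both the element and the values here are guesses, and the sample feed is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged feed; the <category> values are assumptions
TAGGED_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Server move this weekend</title><category>news</category></item>
    <item><title>Next title: Example Song</title><category>title</category></item>
    <item><title>Results are in</title><category>results</category></item>
  </channel>
</rss>"""


def items_with_category(xml_text, wanted):
    """Return the titles of all items carrying at least one wanted category."""
    wanted = set(wanted)
    root = ET.fromstring(xml_text)
    titles = []
    for item in root.iter("item"):
        categories = {c.text for c in item.findall("category")}
        if categories & wanted:
            titles.append(item.findtext("title"))
    return titles


print(items_with_category(TAGGED_FEED, ["title", "results"]))
```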

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 12:16 pm
by jb
I've always been frustrated by the fact that we can't lump all the songs for a fight into one RSS post with multiple attachments. So in my RSS feed I get 20 posts, one for each entry. Somewhat cumbersome for the readers that I use.

Also, Lunkhead, it would definitely be useful to have the titles as an RSS feed. I'd like to be able to create an IFTTT that posts the new title to the G+ community or possibly to the SF FB page. You have the @sfarchivist Tweeting the new titles, but IFTTT can't do Twitter any more. :P

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 12:18 pm
by jb
I am also frustrated that I can't easily upgrade these forums, and that I am nervous that my carefully hacked-in review functions will need to be carefully re-hacked if I upgrade the thing anyway.

This is why I hate programming.

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 1:47 pm
by Lunkhead
jb wrote: Also, Lunkhead, it would definitely be useful to have the titles as an RSS feed. I'd like to be able to create an IFTTT that posts the new title to the G+ community or possibly to the SF FB page. You have the @sfarchivist Tweeting the new titles, but IFTTT can't do Twitter any more. :P
"IFTTT can't do Twitter any more" :roll: That is super lame. I think I already have a JSON endpoint for that data now so adding RSS formatting should be pretty trivial. Sadly the Jukebox and all my sites/email/IM are down at the moment...

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 1:48 pm
by Lunkhead
jb wrote: I am also frustrated that I can't easily upgrade these forums, and that I am nervous that my carefully hacked-in review functions will need to be carefully re-hacked if I upgrade the thing anyway.

This is why I hate programming.
That's not really programming's fault, that's phpBB's fault. :P

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 3:21 pm
by fluffy
jb wrote: I've always been frustrated by the fact that we can't lump all the songs for a fight into one RSS post with multiple attachments. So in my RSS feed I get 20 posts, one for each entry. Somewhat cumbersome for the readers that I use.
That's mostly a failing in how podcasts were "designed," i.e. they're just RSS with a single media attachment. There's nothing about RSS itself that prevents you from using multiple enclosures - it's just that none of the podcast subscription engines that people use (notably iTunes) support it.

http://songfight.org/rss.php doesn't bother trying to do enclosures for entries; it just states as a news item when the songs are available. (Or more accurately, it produces a news item for what the current title is, so it's always regenerating the same item for as long as the title doesn't change.) If anyone wants to get the individual pieces of content they can always use the podcast feed, which is unaffected.

I could actually change rss.php to provide the songs as enclosures on the "now playing" item but I don't know what value that would add, aside from making it slightly easier for people to try to listen to songs from their enclosure-savvy RSS reader (which will probably get screwed up by the anti-hotlink thing if it's web-based).
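For illustration, a rough sketch of what a single item carrying several enclosures could look like, built with Python's standard-library `xml.etree.ElementTree`. The function name, URLs, and sizes are placeholders, and this is not how rss.php works; as noted above, many podcast clients would likely pick up only one of the enclosures.

```python
import xml.etree.ElementTree as ET


def build_item(title, songs):
    """Build one RSS <item> with one <enclosure> per (url, size_in_bytes) pair."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    for url, size in songs:
        # RSS 2.0 enclosures take url, length (bytes) and MIME type attributes
        ET.SubElement(
            item,
            "enclosure",
            {"url": url, "length": str(size), "type": "audio/mpeg"},
        )
    return item


# Placeholder entries for a hypothetical fight
item = build_item(
    "Now playing: Example Title",
    [
        ("http://example.com/band_a.mp3", 4200000),
        ("http://example.com/band_b.mp3", 3900000),
    ],
)
print(ET.tostring(item, encoding="unicode"))
```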

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 5:28 pm
by jb
fluffy wrote: http://songfight.org/rss.php doesn't bother trying to do enclosures for entries; it just states as a news item when the songs are available. (Or more accurately, it produces a news item for what the current title is, so it's always regenerating the same item for as long as the title doesn't change.) If anyone wants to get the individual pieces of content they can always use the podcast feed, which is unaffected.
IIRC it does that because when we were figuring out how to do it we discovered that you couldn't do multiple enclosures well, so we went with the simplest solution.

For a while I was playing around with trying to write a script that would take all the entries and put them together in one MP3, with computer-generated voice between each song announcing the band name. Never really got traction on it, though it seemed like it should be possible based on some of the libraries that are available...

Re: songfight.org Homepage Issues

Posted: Mon Jan 28, 2013 5:41 pm
by fluffy
Right, I meant it was iTunes et al. which don't support multiple enclosures well, not RSS itself.

Turning it into a single long-form MP3 with voiceovers in between wouldn't be hard. sox can join together multiple WAV files easily, and there's probably a more clever way of joining the MP3 frames directly without having to re-encode (apparently you can just cat *.mp3 together and it'll Just Work in most players, but I'd worry about intra-file ID3 tags screwing things up for some).
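A rough sketch of the "cat without the tags" idea: strip a leading ID3v2 tag and a trailing ID3v1 tag from each file's bytes before joining them, so no tag ends up in the middle of the stream. The function names are made up, and this is a naive illustration rather than a tested tool; it does not re-encode, and it ignores other things that could trip up players, such as VBR info headers or mismatched sample rates between files.

```python
def strip_id3(data: bytes) -> bytes:
    """Remove a leading ID3v2 tag and a trailing ID3v1 tag, if present."""
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size
    if len(data) >= 10 and data[:3] == b"ID3":
        flags = data[5]
        # Syncsafe integer: only the low 7 bits of each byte are used
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        # Flag bit 0x10 means a 10-byte footer follows the tag body
        total = 10 + size + (10 if flags & 0x10 else 0)
        data = data[total:]
    # ID3v1 tag: the last 128 bytes of the file, starting with "TAG"
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data


def join_mp3(chunks):
    """Concatenate MP3 byte strings after stripping their ID3 tags."""
    return b"".join(strip_id3(chunk) for chunk in chunks)


def join_mp3_files(paths, out_path):
    """Read each file in paths, strip tags, and write the joined result."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as f:
                out.write(strip_id3(f.read()))
```

Usage would be something like `join_mp3_files(["a.mp3", "b.mp3"], "fight.mp3")`, with the voiceover clips interleaved in the list.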