Late Night Tech Bleg

Any of you sophisticated tech users know a way I could look up a no-longer-operative version of a Washington Post news article?

There’s a paragraph I noticed in the 3:15pm version of their “Malaysian prime minister says Flight MH370 ‘ended in the southern Indian Ocean’” story that I was curious about, but it’s not there now.

Not really surprised that it did go missing, but I kinda wish I’d saved the earlier version…

    J.Ty says:

    How old is it? tries to snag what it can, but they’re irregular. Google can archive some stuff but usually not hourly…

    Higgs Boson's Mate says:

    Found THIS with some quick Googling. Is it the one you were looking for?

    Anne Laurie says:

    @J.Ty: Yeah, says it can’t do WaPo pages because “robot.txt”.

    Google just sends me to the current ‘corrected’ article; is there an advanced search-term-of-art I should be trying?
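
For anyone wondering how a robots.txt file can keep an archiver away from a page, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ia_archiver` and `SomeOtherBot` user-agent names, and the example.com URL are all made up for illustration; they are not the Post's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (an assumption, not any real site's file):
# one named crawler is shut out entirely, everyone else is allowed.
EXAMPLE_RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES.splitlines())

url = "https://example.com/world/story.html"  # placeholder URL
archiver_allowed = parser.can_fetch("ia_archiver", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("archiver may fetch:", archiver_allowed)
print("other bot may fetch:", other_allowed)
```

A well-behaved crawler runs a check like this before fetching a page, which is why a polite archive has nothing saved for a site that disallows it.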

    J.Ty says:

    Here you go, this is apparently some sort of version tracking news-article thing:

    ETA: This actually looks like an incredibly useful tool for tracking story revisions in general… and Anne, please do check that link! I don’t know what you’re looking for but it’s got four different versions of the article.

    I'mNotSureWhoIWantToBeYet says:

    Try a quoted Google search for the title. Here’s a version that it says is 14 hours old that someone saved. Close enough?



    🍀 Martin says:

    @J.Ty: Yep. That’s a fantastic little service. Too bad they don’t cover more outlets. Good call.

    JGabriel says:

    Anne Laurie says:

    @Higgs Boson’s Mate: No, I have the article here, but the section about the families getting the news has been edited since it first appeared. Which I kind of expected, but I was in a hurry when I first read it & didn’t have a chance to cut and paste the suspicious paragraph.

    From memory, the gist was that ‘some men in plain clothes’ who ‘appeared to be undercover agents’ were in the room during the announcement, quietly discussing amongst themselves how the families’ outrage and anger ‘could be turned against targets like the Malaysian consulate’ and ‘away from Chinese authorities’. One of those WTF?!? moments…

    J.Ty says:

    @🍀 Martin: *reads all the README*

    That can’t have been easy to make even at the current state…

    especially if they’re using beautifulsoup for their html parser =D

    Anne Laurie says:

    @J.Ty: THANK YOU!

    This is the paragraph I wondered about:

    Some men in the room who appeared to be undercover police in plain clothes were overheard discussing how to subdue particularly angry relatives and how to turn their anger away from the Chinese government and toward other targets such as the Malaysian Embassy.

    Most of the ‘human interest’ stuff seems to have been excised, but I was curious about that bit.

    I am definitely going to be using in the future!

    🍀 Martin says:

    @J.Ty: One of the benefits of targeting well-budgeted operations is that you can usually either throw an @media print at them or code in their print-friendly redirect and get them to toss most of the nasty shit before running it through Beautiful Soup.

    And I’ve found that at least some of the various publications are quite predictable about their layout, so it’s not too hard to run it through a simple preprocessing template for each site. WaPo means just grabbing [div id=content] off the print view, and that’s likely to hold up pretty well across redesigns.

    I’ve never scraped WaPo, though I have some similar publications. I have a somewhat strange job.
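
A rough sketch of the "grab the content div off the print view" idea described above. To keep it runnable without installing anything, it uses the standard library's `html.parser` rather than Beautiful Soup; the sample markup, the `ContentDivExtractor` class, and the `extract_content` helper are all hypothetical, and a real site's print view would need its own inspection.

```python
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Collect the text inside <div id="..."> and ignore everything else."""

    def __init__(self, target_id="content"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # div nesting depth inside the target; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html, target_id="content"):
    parser = ContentDivExtractor(target_id)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup's indentation.
    return " ".join(" ".join(parser.chunks).split())


# Made-up stand-in for a print-friendly page (not real WaPo markup).
SAMPLE_PRINT_VIEW = """
<html><body>
<div id="nav">Home | Politics | World</div>
<div id="content">
  <h1>Example headline</h1>
  <div class="byline">By A. Reporter</div>
  <p>First paragraph of the story.</p>
</div>
<div id="footer">Terms of service</div>
</body></html>
"""

article_text = extract_content(SAMPLE_PRINT_VIEW)
print(article_text)
```

With Beautiful Soup installed, the same lookup is a short `find` on the div's id; the stdlib version just makes the nesting bookkeeping visible.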

    J.Ty says:

    @Anne Laurie: As we say in the field, you just got librarian’d :)

    Amir Khalid says:

    The head of the Royal Australian Air Force has been quoted as likening the search in the Indian Ocean not to looking for a needle in a haystack, but to looking for the haystack.

    Malaysian newspapers are printing their front pages in black today, logos and all. I have not seen a period of mourning declared yet, but I expect it will be.

    J.Ty says:

    @🍀 Martin: I’m a hobbyist when it comes to news scraping, and I’ve noticed similar patterns, but nothing I’d be willing to wrap a regex around or anything.

    Thanks for the @media print trick… that’s really good.

  15. 15
    @Anne Laurie: Google just sends me to the current ‘corrected’ article; is there an advanced search-term-of-art I should be trying?

    Google got fussed at a while ago for caching copies of articles behind paywalls. So some sites “no-cache”. Google updates fast enough that the update would have been wiped out anyway – Newsdiff says the entirety of the 3:55pm version was struck out and replaced at 9:59 PM, the reporter credit was changed, and almost the entire discussion of the reaction and the Chinese authorities’ response was dumped. I suppose the Chinese government was unhappy.

    Thanks for the site link, J.Ty. Just the thing I’d been hankering for for a while.

    [‘Google – useful for finding something quickly, no longer good for research. Sad.’]
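
The kind of revision comparison discussed in this thread (which paragraphs were struck, which were added) can be approximated with Python's standard `difflib`. This is only a sketch under assumptions: the paragraph strings are placeholders rather than text from the article, `changed_paragraphs` is a hypothetical helper, and nothing here is meant to describe how the tracking site itself works.

```python
import difflib


def changed_paragraphs(old, new):
    """Return (removed, added) paragraphs between two saved versions."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return removed, added


# Placeholder paragraphs standing in for two snapshots of one story.
earlier_version = [
    "Paragraph A: unchanged opening.",
    "Paragraph B: present only in the earlier version.",
    "Paragraph C: unchanged ending.",
]
later_version = [
    "Paragraph A: unchanged opening.",
    "Paragraph C: unchanged ending.",
    "Paragraph D: added in the later version.",
]

removed, added = changed_paragraphs(earlier_version, later_version)
print("removed:", removed)
print("added:", added)
```

The catch, as the thread shows, is that you need a saved copy of the earlier version to compare against, so the snapshots have to be taken before the edit happens.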

    RSR says:

    Pretty sure that was verbatim from the official statement. Try hunting that down.

    currants says:

    @J.Ty: librarian+computer tech collaborating = very cool to watch. One of the more random reasons I read B-J. Thanks!

    I'mNotSureWhoIWantToBeYet says:

    @J.Ty: Thanks very much for posting that. It’s a very nice site.


    Keith says:

    @J.Ty: Very useful site – bookmarking. Wonderful to see stenographers dance in time with the music.

    Schlemizel says:

    Very interesting deletion. While on jury duty I was stuck having to see NBC’s Today Show one morning. They were interviewing the woman from Australia whose husband was on the plane. She started to talk about how awful it was for the Chinese families, locked in a hotel & given no info. The signal went dead & they cut back to the studio where the ‘talent’ looked like someone had just farted. After a bit of a stumbling recount of what was unknown by the designated blonde – with no explanation for the jump – the designated white guy tossed out some lame “It’s hard to give out information when so little is known”. Since that has not stopped them from giving out all kinds of BS I thought the whole thing was odd & that they didn’t want people to know what was happening in China.

