
news: “the UK wants to restrict wikipedia”

my paranoid USian ass expecting the worst: “…maybe i should download and set up an entire backup of english wikipedia”


ukpol politics pol US USpol

@tenna actually go for it imo, it’s actually not as big as you’d think if you only include the latest revisions


nobody’s talking me out of this idea.

welp, looks like i’ve got a multi-day project ahead of me :P

@tenna I believe in u


(downloading the entire English Wikipedia to tape, but inserting fake information to mess with historians of the distant future)

i was gonna go the mediawiki route, but it seems extraordinarily finicky and full of hacks and uncertainty

so trying xowa first. it took me a while to figure out how to do it without a GUI since it disables fetching the wiki from the http server, but i managed to find the page on how to do it via the command line, as well as use a backup i downloaded from a mirror.

it begins


tech wikipedia

yeah, soooo xowa is also extraordinarily finicky and full of uncertainty, it’s just that it for certain also has bugs parsing more modern backups (like, anything past 2021 apparently??) among other really irritating things.

guess it’s mediawiki time


wow! just the base install is completely falling flat on its face, it keeps reporting the database as read-only when that is clearly not the case. i have no idea what’s going on here. what the heck.


liveposting my tech woes

alright, so that was because i was using docker. apparently there’s no official docker image for mediawiki, and it shows.

doing the Official(TM) install and then trying to follow the guides i found for importing all of the pages, I… don’t see any of the pages I’m supposed to see :V


@tenna I’m so glad someone else is experiencing the joy of trying to run MediaWiki. It’s incredibly powerful for users but for sysadmins it’s 100% Lua error: unknown module 'strict'

(In other words, it was built ad hoc to fit Wikipedia’s needs and hosting and tooling and processes and it fucking shows lol)


welp, this is an absolute unmitigated mess and it’s kinda obvious that nobody does this, like, ever. the closest i’ve got at this point is from a good chunk of digging and research, installing old software, manually compiling other bits, and still ending up with a wiki full of broken pages.

if you’re reading this and you want your own backup, just go with kiwix. kiwix is simple. maybe you’ll have a stale archive, but the other options i’ve found are basically broken and/or unmaintained


liveposting my tech woes wikipedia backup

i think if i really want to use their recent backups, i’m just going to have to begrudgingly use importDump.php like everything says not to do (and parallelize it and use options to speed things up so that it gets finished within my lifetime)

doing that with the simple wiki is the first thing i’ve done that’s actually worked, beyond just using kiwix and calling it a day


after playing around with some things today, i think i finally have a strategy for installing the english wikipedia backup into a mediawiki server.

i have to use importDump, there’s no way i’ve seen around that which actually works. to make it more manageable, i split the backup into discrete 100k-page chunks so i can parallelize the dump import process without --jump-to, and i’m using --no-updates, because i can just fix that up in the end and not worry about it.
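
(a rough sketch of what the chunking step could look like, not the actual script used here. it assumes the dump is uncompressed XML with `<page>` and `</page>` each on their own line, and it copies everything before the first page into every chunk so each one can be imported on its own. `split_dump` and its parameters are made-up names for illustration.)

```python
from typing import Iterable, Iterator


def split_dump(lines: Iterable[str], pages_per_chunk: int = 100_000) -> Iterator[str]:
    """Yield standalone XML documents holding up to pages_per_chunk pages each.

    Assumption: <page>, </page> and </mediawiki> each sit alone on a line.
    """
    header: list[str] = []  # <mediawiki ...> plus <siteinfo>, repeated in every chunk
    body: list[str] = []    # lines of the pages collected for the current chunk
    pages = 0
    seen_page = False

    for line in lines:
        stripped = line.strip()
        if not seen_page and stripped != "<page>":
            header.append(line)
            continue
        seen_page = True
        if stripped == "</mediawiki>":
            break
        body.append(line)
        if stripped == "</page>":
            pages += 1
            if pages == pages_per_chunk:
                yield "".join(header + body + ["</mediawiki>\n"])
                body, pages = [], 0

    if pages:  # leftover pages that did not fill a whole chunk
        yield "".join(header + body + ["</mediawiki>\n"])
```

each yielded string can be written to its own file (say `chunk-0001.xml`, a hypothetical name) and handed to a separate import process. note that one whole chunk is held in memory at a time.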

if this does work out for me, i’m gonna have to post about it in a blog or something. the popular guide going around does not have a working method for this and will just result in an inconsistent, broken database.


wikipedia

it’s going!


wikipedia tech techstuff

still going! had to tweak a few things with how i was running it but now i’ve got

  • error handling if a chunk fails to import!
  • a total progress readout!
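
(for the curious, a minimal sketch of a runner with those two features, not the real script. it assumes each chunk is imported with `php maintenance/importDump.php --no-updates <file>` run from the wiki's install directory; the function names, the worker count, and the relative script path are all placeholders.)

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence


def import_chunk(path: str) -> bool:
    """Import one chunk file; True on success. The command line is an assumption."""
    result = subprocess.run(
        ["php", "maintenance/importDump.php", "--no-updates", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def import_all(
    chunks: Sequence[str],
    workers: int = 4,
    run: Callable[[str], bool] = import_chunk,
) -> list[str]:
    """Run imports in parallel, print overall progress, return the failed chunks."""
    failed: list[str] = []
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run, chunk): chunk for chunk in chunks}
        for future in as_completed(futures):
            chunk = futures[future]
            done += 1
            try:
                ok = future.result()
            except Exception:
                ok = False  # a crash in the worker counts as a failed chunk
            if not ok:
                failed.append(chunk)
            print(f"[{done}/{len(futures)}] {chunk}: {'ok' if ok else 'FAILED'}")
    return failed
```

the returned list is what you'd retry after installing whatever extension the failed chunks were missing. threads are enough here since the actual work happens in the php subprocesses.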

i also wrote out a blog post/guide on this, but. um. i think i should probably wait for this to complete before publishing it, since i noticed multiple chunks fail without manually installing more extensions and other tweaks, so it might be good to keep track of what i have to install and modify and add it to the post


#wikipedia #tech #techStuff