Feature #3303: migrate openmoko.org into "archive mode" and remove dedicated server for it
create static archive of openmoko.org web pages
b) get rid of the existing server, by the following strategy:
* web: convert the dynamically-generated mediawiki, trac, svnweb, gitweb, etc. pages into static renderings that can be served from a static web server. This could be done by something like a recursive wget through an HTTP cache. This would remove the need to run trac, mediawiki and apache mod_svn, mysql, ... - and drastically reduce the CPU and memory requirements. In the end, it would be a bunch of static HTML pages served by nginx or lighttpd somewhere on a virtual server or shared server.
* svn: discontinue svn service and simply have
  * caches of the rendered html pages (for old hyperlinks to work), [...]
* git: discontinue git service and simply have
  * caches of the rendered gitweb html pages (for old hyperlinks to work), and [...]

Besides reducing our hosting requirements to basically zero, it also has the advantage that we don't have to worry about keeping trac, mediawiki, etc. installations secure and updated. Also, when moving to major new versions, there's always the risk of some issues with migrating the old data, some wiki rendering errors, etc. - conserving the generated output saves us from all of that.

If we go for 'b', this would include us releasing SQL dumps of the trac, mediawiki, svn, etc. databases (probably clearing any passwords / password hashes), so that the raw information can be restored by anyone who has an interest in it.

Please take care of devising a process by which we can
- generate the static cache/archive for anything web-visible on *.openmoko.org
- serve the static cache/archive continuously in the future
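As a rough illustration of the "recursive wget" idea from the description, the sketch below only assembles a wget command line; it does not run it. The function name, the one-second wait and the output directory are assumptions made for illustration, not something decided in this ticket. The wget options themselves are standard ones.

```python
def build_wget_mirror_command(start_urls, domains, out_dir):
    """Return a wget argument list for a recursive static mirror.

    Hypothetical helper: the option selection is a sketch, not the
    process agreed on in this ticket.
    """
    cmd = [
        "wget",
        "--mirror",             # recursive download with timestamping
        "--page-requisites",    # also fetch the CSS, images and scripts each page needs
        "--convert-links",      # rewrite links so the local copy is browsable
        "--adjust-extension",   # store HTML documents with an .html suffix
        "--span-hosts",         # allow following links to other hosts...
        "--domains=" + ",".join(domains),  # ...but only within these domains
        "--wait=1",             # assumed politeness delay for the old server
        "--directory-prefix=" + out_dir,
    ]
    return cmd + list(start_urls)
```

Calling it with `["http://wiki.openmoko.org/"]`, `["openmoko.org"]` and some scratch directory gives a list that could be handed to `subprocess.run`. Whether this is fast enough, or whether an HTTP cache in front is needed, would have to be tested against the real server.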
#1 Updated by roh over 2 years ago
after some research i did a dump of the mediawiki including the discussion pages but without any special pages or history by using wget.
the plugins for mediawiki seem to be very complicated and broken.
it seems to work in itself but has some (ignorable) rendering bugs. the links to the discussion pages are somehow broken, i guess because of the ':' in the url when it is used as a relative link (a link starting with e.g. 'Talk:' can be read as a url scheme)
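If the ':' guess is right, one possible fix is to post-process the dumped pages and prefix './' to relative links whose first path segment contains a colon, so that browsers no longer read the part before the colon as a scheme. The sketch below assumes double-quoted href attributes and a small hand-picked list of real schemes; both are assumptions, and the function name is made up.

```python
import re

# Schemes that should be left alone (assumed list, extend as needed).
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "irc", "javascript", "data"}

_HREF = re.compile(r'href="([^"]*)"')


def fix_colon_links(html):
    """Prefix './' to relative hrefs whose first path segment contains ':'."""

    def repl(match):
        url = match.group(1)
        # Only the part before the first '/', '?' or '#' can be mistaken for a scheme.
        first_segment = re.split(r"[/?#]", url, maxsplit=1)[0]
        prefix, colon, _ = first_segment.partition(":")
        if colon and prefix.lower() not in KNOWN_SCHEMES:
            url = "./" + url
        return 'href="%s"' % url

    return _HREF.sub(repl, html)
```

A link such as `href="Talk:Main_Page.html"` becomes `href="./Talk:Main_Page.html"`, while absolute `http://` links and links without a colon are left unchanged.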
i use this in /etc/hosts to test for now (http only)
todo for wiki:
- cut out 'login etc header' (the whole <div class="portlet" id="p-personal"> ... </div>)
- move to new server
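The first todo item could be handled by a small post-processing step over the dumped pages. The following is a minimal sketch, assuming the div is written with a double-quoted id="p-personal" attribute as quoted above and that the div tags on the page are balanced; the function name is hypothetical. It counts nested div tags so that the whole element is removed, not just the part up to the first closing tag.

```python
import re

_OPEN = re.compile(r'<div\b[^>]*\bid="p-personal"[^>]*>', re.IGNORECASE)
_DIV_TAG = re.compile(r'<(/?)div\b[^>]*>', re.IGNORECASE)


def strip_personal_portlet(html):
    """Remove the <div id="p-personal"> element, including nested divs."""
    start = _OPEN.search(html)
    if start is None:
        return html  # nothing to cut on this page
    depth = 1
    # Walk all div tags after the opening one and track the nesting depth.
    for tag in _DIV_TAG.finditer(html, start.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return html[:start.start()] + html[tag.end():]
    return html  # unbalanced markup: leave the page untouched
```

Running this over every .html file in the dump would drop the login header while leaving the rest of the page as wget saved it.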
#2 Updated by laforge over 2 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 10
I've been playing around a bit with httrack. Unfortunately even with tuning various options it appears very slow.
The following invocation seems to be rendering pretty good results so far:
  httrack openmoko.org wiki.openmoko.org docs.openmoko.org -O /foo/bar/websites/openmoko -s0 -w -bN -*p3 -%c50 -%p -%e0 -%k -c50 -%! -A9999999 -@i4 -v -iC1 -iC1 -v
Let's wait until it completes and test with some static web server.
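For a quick local check of the mirrored tree before setting up nginx or lighttpd, Python's built-in http.server would probably be enough. This is only a test sketch under that assumption; the function name is made up and it is not meant as the long-term serving setup.

```python
import functools
import http.server
import threading


def serve_directory(root, port=0):
    """Serve the files under `root` on 127.0.0.1 in a background thread.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=root
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Pointing it at the httrack output directory and opening `http://127.0.0.1:<port>/` in a browser should show whether internal links and stylesheets survived the mirroring. Call `server.shutdown()` when done.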
As a side-note: I just removed the google-analytics.com link that was still present in the mediawiki skin/theme. That should have been removed 10 years ago :/
It luckily appeared to be the only third-party resource that I could find, and now at least we won't have it in the static archive.
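To double-check the finished archive for leftover third-party resources, something like the sketch below could be run over each page. It assumes double-quoted src/href attributes on script, img, iframe and link tags, and treats everything outside openmoko.org as third-party; ordinary hyperlinks in a tags are deliberately ignored. The function name is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Tags that make the browser load something when the page is opened.
_RESOURCE = re.compile(
    r'<(?:script|img|iframe|link)\b[^>]*?\b(?:src|href)="([^"]+)"',
    re.IGNORECASE,
)


def third_party_hosts(html, own_domain="openmoko.org"):
    """Return the set of external hosts that a page loads resources from."""
    hosts = set()
    for url in _RESOURCE.findall(html):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, nothing to report
        if host and host != own_domain and not host.endswith("." + own_domain):
            hosts.add(host)
    return hosts
```

An empty result for every page in the dump would confirm that the archive no longer pulls anything from outside.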
#3 Updated by laforge over 2 years ago
http://netpreserve.org/web-archiving/tools-and-software/ seems to be a good collection of tools. https://webarchive.jira.com/wiki/spaces/Heritrix/overview is what the internet archive appears to be using as their own crawler.