HTTrack is a well-known website copier (offline browser).
Large sites (including the Ubuntu-fr forum and documentation) must not be mirrored automatically, or the site may block your IP address. Mirroring should follow some basic etiquette and be used only when you genuinely need offline access to the content. Mirroring a site demands far more server resources than simply displaying a web page. Ask the webmaster for permission before you start! Keep intellectual property issues in mind as well.
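To keep the load on the target server low, several of the limit options listed in the man page excerpt on this page can be combined in a single invocation. The following is only a minimal sketch, not a recommended configuration: the URL, the output directory and the numeric limits are placeholder assumptions, and `RUN` is a hypothetical switch added here so that the command is merely printed by default.

```shell
# Hypothetical target and output directory; replace with your own.
url="http://www.monsite.com"
dest="$HOME/mirrors/monsite"

# Limits, as described in the httrack man page:
#   -r3      mirror depth of 3 links
#   -c2      at most 2 simultaneous connections
#   -%c1     at most 1 new connection per second
#   -A25000  maximum transfer rate of 25000 bytes per second
opts="-r3 -c2 -%c1 -A25000"

# Print the command for review first.
cmd="httrack $url -O $dest $opts"
echo "$cmd"

# Set RUN=1 in the environment to actually start the mirror.
if [ "${RUN:-0}" = "1" ]; then
    # $opts is left unquoted on purpose so that it splits into separate options.
    httrack "$url" -O "$dest" $opts
fi
```

Lower values are gentler on the server but make the mirror slower; the right trade-off depends on the site and on what its webmaster agreed to.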
There are two versions of httrack:
httrack --mirror http://website.com
httrack(1) General Commands Manual httrack(1)
NAME
httrack - offline browser : copy websites to a local directory
SYNOPSIS
httrack [ url ]... [ -filter ]... [ +filter ]... [ -O, --path ] [ -w, --mirror ] [ -W, --mirror-wizard ] [ -g, --get-files ] [ -i, --continue ] [ -Y,
--mirrorlinks ] [ -P, --proxy ] [ -%f, --httpproxy-ftp[=N] ] [ -%b, --bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ] [ -mN, --max-files[=N] ] [
-MN, --max-size[=N] ] [ -EN, --max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN, --connection-per-second[=N] ] [ -GN, --max-pause[=N] ] [ -cN, --sockets[=N]
] [ -TN, --timeout[=N] ] [ -RN, --retries[=N] ] [ -JN, --min-rate[=N] ] [ -HN, --host-control[=N] ] [ -%P, --extended-parsing[=N] ] [ -n, --near ] [ -t,
--test ] [ -%L, --list ] [ -%S, --urllist ] [ -NN, --structure[=N] ] [ -%D, --cached-delayed-type-check ] [ -%M, --mime-html ] [ -LN, --long-names[=N] ]
[ -KN, --keep-links[=N] ] [ -x, --replace-external ] [ -%x, --disable-passwords ] [ -%q, --include-query-string ] [ -o, --generate-errors ] [ -X,
--purge-old[=N] ] [ -%p, --preserve ] [ -%T, --utf8-conversion ] [ -bN, --cookies[=N] ] [ -u, --check-type[=N] ] [ -j, --parse-java[=N] ] [ -sN, --robots[=N]
] [ -%h, --http-10 ] [ -%k, --keep-alive ] [ -%B, --tolerant ] [ -%s, --updatehack ] [ -%u, --urlhack ] [ -%A, --assume ] [ -@iN, --protocol[=N]
] [ -%w, --disable-module ] [ -F, --user-agent ] [ -%R, --referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l, --language ] [ -%a, --accept ] [ -%X,
--headers ] [ -C, --cache[=N] ] [ -k, --store-all-in-cache ] [ -%n, --do-not-recatch ] [ -%v, --display ] [ -Q, --do-not-log ] [ -q, --quiet ] [ -z,
--extra-log ] [ -Z, --debug-log ] [ -v, --verbose ] [ -f, --file-log ] [ -f2, --single-log ] [ -I, --index ] [ -%i, --build-top-index ] [ -%I,
--search-index ] [ -pN, --priority[=N] ] [ -S, --stay-on-same-dir ] [ -D, --can-go-down ] [ -U, --can-go-up ] [ -B, --can-go-up-and-down ] [ -a,
--stay-on-same-address ] [ -d, --stay-on-same-domain ] [ -l, --stay-on-same-tld ] [ -e, --go-everywhere ] [ -%H, --debug-headers ] [ -%!, --disable-security-limits
] [ -V, --userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N] ] [
DESCRIPTION
httrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images,
and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website
in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site,
and resume interrupted downloads.
EXAMPLES
httrack www.someweb.com/bob/
mirror site www.someweb.com/bob/ and only this site
httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*
mirror the two sites together (with shared links) and accept any .jpg files on .com sites
httrack www.someweb.com/bob/bobby.html +* -r6
means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web
httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.someweb.com/bob/bobby.html using a proxy
httrack --update
updates a mirror in the current folder
httrack
will bring you to the interactive mode
httrack --continue
continues a mirror in the current folder
OPTIONS
General options:
-O path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles]) (--path <param>)
Action options:
-w *mirror web sites (--mirror)
Proxy options:
-P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
Limits options:
-rN set the mirror depth to N (* r9999) (--depth[=N])
Flow control:
-cN number of multiple connections (*c8) (--sockets[=N])
Links options:
-%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
Build options:
-NN structure type (0 *original structure, 1+: see below) (--structure[=N])
extensions, * %N2 always use)
proxy link) (--keep-links[=N])
Spider options:
-bN accept cookies in cookies.txt (0=do not accept,* 1=accept) (--cookies[=N])
(--parse-java[=N])
Browser ID:
-F user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>)
Log, index, cache
-C create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
Expert options:
-pN priority mode: (* p3) (--priority[=N])
Guru options: (do NOT use if possible)
-#X *use optimized engine (limited memory boundary checks) (--fast-engine)
Dangerous options: (do NOT use unless you exactly know what you are doing)
-%! bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections) (--disable-security-limits)
NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS
Command-line specific options:
-V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
Details: Option N
-N0 Site-structure (default)
Details: User-defined option N
%n Name of file without file type (ex: image)
%N Name of file, including file type (ex: image.gif)
%t File type (ex: gif)
%p Path [without ending /] (ex: /someimages)
%h Host name (ex: www.someweb.com)
%M URL MD5 (128 bits, 32 ascii bytes)
%Q query string MD5 (128 bits, 32 ascii bytes)
%k full query string
%r protocol name (ex: http)
%q small query string MD5 (16 bits, 4 ascii bytes)
%s? Short name version (ex: %sN)
%[param] param variable in query string
%[param:before:after:empty:notfound] advanced variable extraction
Details: User-defined option N and advanced variable extraction
%[param:before:after:empty:notfound]
before : string to prepend if the parameter was found
notfound : string replacement if the parameter could not be found
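The placeholders listed above each name one part of the URL being saved. As an illustration only (this is plain shell parameter expansion, not httrack's own code), the sketch below splits the sample URL used in the man page examples into the parts that `%r`, `%h`, `%p`, `%N`, `%n` and `%t` stand for. The variable names are made up for this example.

```shell
# Sample URL built from the examples given in the placeholder list.
url="http://www.someweb.com/someimages/image.gif"

proto=${url%%://*}       # %r -> http
rest=${url#*://}         # www.someweb.com/someimages/image.gif
host=${rest%%/*}         # %h -> www.someweb.com
fullpath=/${rest#*/}     # /someimages/image.gif
dir=${fullpath%/*}       # %p -> /someimages (without ending /)
file=${fullpath##*/}     # %N -> image.gif
name=${file%.*}          # %n -> image
ext=${file##*.}          # %t -> gif

# Path that a template such as "%h%p/%n.%t" would describe for this URL.
echo "$host$dir/$name.$ext"
```

Assuming the user-defined form of option N takes such a template as a quoted string, it would be passed to httrack along the lines of `-N "%h%p/%n.%t"`; check the full man page before relying on the exact syntax.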
Create a mirror:
httrack --mirror http://www.monsite.com
Update the current project:
httrack --update
Clean up the cache and log files:
httrack --clean
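The commands above can be chained into a small wrapper that creates the mirror on the first run and updates it afterwards. This is a sketch under stated assumptions: it assumes httrack keeps its cache in an `hts-cache` subdirectory of the project directory, the URL and directory are placeholders, and `pick_action` and `RUN` are hypothetical names introduced for this example.

```shell
# pick_action DIR: print "--update" if DIR already holds an httrack cache,
# otherwise "--mirror". Assumes the cache lives in DIR/hts-cache.
pick_action() {
    if [ -d "$1/hts-cache" ]; then
        echo "--update"
    else
        echo "--mirror"
    fi
}

url="http://www.monsite.com"      # hypothetical site
dest="$HOME/mirrors/monsite"      # hypothetical project directory
action=$(pick_action "$dest")
echo "would run: httrack $action (in $dest)"

# Set RUN=1 in the environment to actually run httrack.
if [ "${RUN:-0}" = "1" ]; then
    mkdir -p "$dest" && cd "$dest" || exit 1
    if [ "$action" = "--update" ]; then
        httrack --update            # updates the mirror in the current folder
    else
        httrack --mirror "$url"     # first run: create the mirror
    fi
fi
```

Because `httrack --update` works on the mirror in the current folder, the sketch changes into the project directory before calling it.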