To perform the first crawl, you just need to edit the configuration file, which resides
in the configuration directory under the name 'htcheck.conf' (you may use another
file as the configuration file, but then you must run htcheck with the '-c' option).
Just change the 'start_url' attribute to whatever you want, for example:
start_url: http://www.foo.com
Remember that every URL must start with the service name, that is to say 'http://'.
Then set the 'limit_urls_to' attribute to $(start_url), in order to
scan only the 'http://www.foo.com' website.
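The two settings above can be sanity-checked before running a crawl. The following is only a minimal sketch in Python, not part of htcheck: the helper names parse_config and check_start_url are hypothetical, the parser handles nothing beyond plain 'name: value' lines, and it assumes that $(start_url) is simply replaced by the value of 'start_url'.

```python
def parse_config(text):
    """Parse 'name: value' lines into a dict (simplified; not htcheck's own parser)."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first ':' only, so the ':' inside 'http://' is kept.
        name, _, value = line.partition(":")
        attrs[name.strip()] = value.strip()
    return attrs


def check_start_url(attrs):
    """Return True if start_url begins with the 'http://' service name."""
    return attrs.get("start_url", "").startswith("http://")


sample = """\
start_url: http://www.foo.com
limit_urls_to: $(start_url)
"""

attrs = parse_config(sample)
# Assumption: the reference is replaced by the value of start_url.
limit = attrs["limit_urls_to"].replace("$(start_url)", attrs["start_url"])
print(check_start_url(attrs), limit)
```

With the sample text, the check passes and the crawl limit resolves to the same site as the start URL.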
You may change many other attributes (database name included), but for a first test to see whether it works, that's enough.
You can finally enter the bin directory inside the 'htcheck' installation directory (by
default /opt/htcheck) and run:
htcheck -vs
The available options are listed below (run htcheck --help to display them):
usage: htcheck [-isvhr] [-c configfile] [-D dbname] [--help] [--version]
Options:
-v              Verbose mode (more 'v's increment verbosity)
-s              Statistics (broken links, etc...) available
-i              Initialize the database: drop a previous db
-c configfile   Configuration file
-D dbname       Name of the database
--help          Display this
-h              Same as --help
--version       Display version
-r              Same as --version
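If you launch htcheck from a script, the options above can be assembled programmatically. The sketch below is a hypothetical Python helper (build_htcheck_args is not part of htcheck); it uses only the flags shown in the usage text and assumes that single-letter flags may be combined, as in the 'htcheck -vs' example. It only builds the argument list and does not run anything.

```python
def build_htcheck_args(verbosity=0, stats=False, init=False,
                       configfile=None, dbname=None):
    """Build an htcheck argument list from the options in the usage text."""
    args = ["htcheck"]
    # More 'v's increment verbosity, e.g. verbosity=2 gives 'vv'.
    flags = "v" * verbosity
    if stats:
        flags += "s"
    if init:
        flags += "i"
    if flags:
        args.append("-" + flags)
    if configfile is not None:
        args += ["-c", configfile]
    if dbname is not None:
        args += ["-D", dbname]
    return args


print(build_htcheck_args(verbosity=1, stats=True))
```

The resulting list can be passed to subprocess.run() from the bin directory to launch the crawl.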
Remember that htcheck always checks whether the database already exists in the MySQL server. If it
does not exist, it is created from scratch. If it does exist and htcheck
is launched with the '-i' option, the database is initialized again (this means that a new crawl is performed); otherwise
the program just uses the previous database, which is useful for getting reports such as
broken links and anchors or content-type summaries (in this case you must set the '-s' option).
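The behaviour described above can be summarized as a small decision function. This is only an illustrative Python sketch of the rules as stated in this section, not htcheck's actual code; the function name plan_run and the action strings are hypothetical.

```python
def plan_run(db_exists, init=False, stats=False):
    """Return the actions this section describes for a given situation."""
    actions = []
    if not db_exists:
        # A missing database is created from scratch.
        actions.append("create database from scratch")
    elif init:
        # '-i' on an existing database: initialize it again.
        actions.append("initialize database again (new crawl)")
    else:
        # No '-i': keep working with what is already stored.
        actions.append("reuse previous database")
    if stats:
        # '-s' asks for the statistics (broken links, etc.).
        actions.append("report statistics")
    return actions


print(plan_run(db_exists=True, stats=True))
```

For example, an existing database combined with '-s' and no '-i' yields the reporting case: the previous database is reused and statistics are produced.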