
5. Getting started

To perform the first crawl, you only need to edit the configuration file, which resides in the configuration directory under the name 'htcheck.conf' (you may use a different configuration file, but you must then run htcheck with the '-c' option).

Just change the 'start_url' attribute to whatever you want, for example:

start_url:  http://www.foo.com

Remember that every URL must start with the service name (the URL scheme), that is to say 'http://'.

Then set the 'limit_urls_to' attribute to $(start_url), in order to scan only the 'http://www.foo.com' website.
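
Put together, the two settings described so far would look roughly like the fragment below in 'htcheck.conf'. This is only a sketch built from the example values used in this section; the variable reference is written exactly as given in the text above.

```
start_url:      http://www.foo.com
limit_urls_to:  $(start_url)
```

With these two lines the crawl starts at 'http://www.foo.com' and is restricted to URLs on that site.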

You may change many other attributes (the database name included), but for a first test run that is enough.

You can finally enter the bin directory inside the 'htcheck' installation directory (by default /opt/htcheck) and run:

htcheck -vs

The available options are listed below; running htcheck --help prints them:

usage: htcheck  [-isvhr] [-c configfile] [-D dbname] [--help] [--version]

Options:
        -v      Verbose mode (more 'v's increment verbosity)

        -s      Statistics (broken links, etc...) available

        -i      Initialize the database: drop a previous db

        -c configfile
                Configuration file

        -D dbname
                Name of the database

        --help  Display this
        -h      Same as --help

        --version       Display version
        -r      Same as --version

Remember that htcheck always checks whether the database already exists in the MySQL server. If it does not exist, it is created from scratch. If it exists and htcheck is launched with the '-i' option, the database is initialized again (which means that a new crawl is performed); otherwise the program just uses the previous database, which is useful for getting reports such as broken links and anchors or content-type summaries (in this case you must set the '-s' option).
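
The two cases above (a fresh crawl versus a report from an existing database) can be sketched as a small shell wrapper. The function name run_htcheck, its 'crawl' and 'report' modes and the DRY_RUN variable are hypothetical and not part of htcheck; only the '-i', '-v', '-s' and '-c' options come from the usage text above, and the behaviour of each mode is assumed from the description in this section.

```shell
# Hypothetical wrapper (not part of htcheck) around the two use cases.
# Set DRY_RUN=1 to print the command line instead of running it.
run_htcheck() {
    mode=$1
    conf=$2
    case "$mode" in
        # Initialize the database again, which forces a new crawl.
        crawl)  set -- htcheck -i -v -c "$conf" ;;
        # Reuse the existing database and ask for the statistics.
        report) set -- htcheck -s -c "$conf" ;;
        *)      echo "usage: run_htcheck crawl|report configfile" >&2
                return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, 'DRY_RUN=1; run_htcheck report ./htcheck.conf' would only print the command line, which lets you check the options before touching the database.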

