
6. The configuration file

6.1 General syntax

ht://Check uses a flexible configuration file. This configuration file is a plain ASCII text file. Each line in the file is either a comment or contains an attribute. Comment lines are blank lines or lines that start with a '#'.

6.2 Attributes

Attributes consist of a variable name and an associated value:

<name>:<whitespace><value><newline> 

The name may contain any alphanumeric character or the underscore (_).

The value can include any character except newline. It also cannot start with spaces or tabs since those are considered part of the whitespace after the colon. It is important to keep in mind that any trailing spaces or tabs will be included.

It is possible to split the value across several lines of the configuration file by ending each line with a backslash (\). The effect on the value is that a space is added where the line split occurs.

If ht://Check needs a particular attribute and it is not in the configuration file, it will use the default value which is defined in htcommon/defaults.cc of the source directory.
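
For instance, a small fragment that combines the rules above (comments, a simple attribute, and a value continued with a backslash) might look like the following; the attribute values are purely illustrative:

# Any line starting with '#' is a comment
max_hop_count:  4

# A long value can be continued with a trailing backslash; the split
# itself adds the space that separates the two URLs (example hosts only)
start_url:      http://www.example.org/\
                http://www.example.org/docs/index.html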

6.3 Inclusion and variable expansion

A configuration file can include another file by using the special attribute name include. The value is taken as the file name of another configuration file to be read in at this point. If the given file name is not fully qualified, it is taken relative to the directory in which the current configuration file is found.

Variable expansion is permitted in the file name. Multiple include statements, and nested includes are also permitted. Example:

include: common.conf 
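
Because variable expansion is allowed in the file name, the included file can also be selected through another attribute; the file layout below is purely hypothetical:

db_name:        test
include:        conf/${db_name}.conf

Here conf/test.conf would be read in, relative to the directory of the current configuration file.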

6.4 Configuration attributes

Here you can find a brief explanation of ht://Check configuration attributes.

They've been grouped in these sections:

   Setting the "spider"
   Setting the database info
   Setting HTTP connections
   Setting what to store
   Setting what to report

Setting the "spider"

start_url

This is the list of URLs that will be used to start a dig when no database exists yet. Note that multiple URLs can be given here.

Type: string

Default: http://htcheck.sourceforge.net/

Example:

start_url:      http://www.somewhere.org/alldata/index.html

limit_urls_to

This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Any number of strings can be specified, separated by spaces. If multiple patterns are given, at least one of the patterns has to match the URL. Matching is a case-insensitive string match on the URL to be used. The match will be performed after the relative references have been converted to a valid URL. This means that the URL will always start with http://. Granted, this is not the perfect way of doing this, but it is simple enough and it covers most cases.

Type: string

Default: ${start_url}

Example:

limit_urls_to:  .sdsu.edu kpbs
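
Since the default value is itself the expanded ${start_url}, the same variable syntax can presumably be used to keep the start URLs while adding extra patterns; the patterns below are only illustrative:

limit_urls_to:  ${start_url} .sdsu.edu kpbs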

limit_normalized

This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Unlike the limit_urls_to directive, this is done after the URL is normalized.

Type: string

Default:

Example:

limit_normalized: http://www.mydomain.com

exclude_urls

If a URL contains any of the space-separated patterns, it will be rejected. This is used to exclude common problem areas, such as an infinite virtual web tree that starts with cgi-bin.

Type: string

Default:

Example:

exclude_urls: students.html cgi-bin

bad_extensions

This is a list of extensions on URLs which are considered non-parsable. This list is used mainly to supplement the MIME types that the HTTP server provides with documents. Some HTTP servers do not have a correct list of MIME types and so may advertise certain documents as text even though they are in some binary format.

Type: string

Default:

Example:

bad_extensions: .foo .bar .bad

bad_querystr

This is a list of CGI query strings to be excluded from indexing. This can be used in conjunction with CGI-generated portions of a website to control which pages are indexed.

Type: string

Default:

Example:

bad_querystr: forum=private section=topsecret&passwd=required

max_hop_count

Instead of limiting the indexing process by URL pattern, it can also be limited by the number of hops or clicks a document is removed from the starting URL. The starting page will have hop count 0.

Type: number

Default: 999999

Example:

max_hop_count: 4

check_external

If set to 'true', htcheck checks whether external URLs exist or not. An external URL is a URL that does not match the limit configuration attributes. External URLs are not parsed.

Type: boolean

Default: true

Example:

check_external: false

disable_cookies

If set to 'true', htcheck will disable HTTP cookie management.

Type: boolean

Default: false

Example:

disable_cookies: true

Setting the database info

db_name

This is the name of the MySQL database that htcheck will use to store the results of the crawl.

Type: string

Default: htcheck (or defined by the --with-db-name configure option)

Example:

db_name: test

mysql_conf_file_prefix

Prefix of the MySQL option file to be searched for. The default is 'my', so the file searched is usually ~/.my.cnf (suggested); if it is not found, the /etc/.my.cnf file is searched. For its syntax, look at the 'Option File' section of the MySQL documentation.

Type: string

Default: my

Example:

mysql_conf_file_prefix: htcheck

mysql_conf_group

Group to be searched inside MySQL's .my.cnf option file for the settings used to connect to the server. In other words, it is the section marked with [<group>] inside the MySQL option file (the default is [client]).

Type: string

Default: client

Example:

mysql_conf_group: htcheck
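
Putting the two attributes above together, a matching option file (presumably ~/.htcheck.cnf when mysql_conf_file_prefix is set to 'htcheck', following the ~/.<prefix>.cnf pattern described above) might contain a group such as the following; the user, password and host values are only placeholders:

[htcheck]
user     = htcheck_user
password = secret
host     = localhost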

optimize_db

Optimize the database tables at the end of the crawl. Disable it if the database server doesn't support it.

Type: boolean

Default: false

Example:

optimize_db: true

sql_big_table_option

Enable or disable this option, which is useful when performing huge queries; when it is not set, the MySQL server may sometimes return a 'table is full' error.

Type: boolean

Default: true

Example:

sql_big_table_option: false

url_index_length

This number specifies the length of the index on the Url field in the Schedule and Url tables of the database. You can set different values depending on the average length of the URLs that htcheck is likely to find on your sites; if you do not want to set any limit, use a value of '-1'. This attribute may affect crawl performance, since the length of the index can either slow down or speed up the spidering process.

Type: number

Default: 64

Example:

url_index_length: -1

Setting HTTP connections

user_agent

This allows customization of the User-Agent: header field sent when the digger requests a file from a server.

Type: string

Default: ht://Check

Example:

user_agent: htcheck-crawler

persistent_connections

If set to true, htcheck can take advantage of persistent connections, as defined by HTTP/1.1 (RFC 2616), whenever servers make them possible. This reduces the number of connection open/close operations when retrieving documents over HTTP.

Type: boolean

Default: true

Example:

persistent_connections: false

head_before_get

This option works only when persistent connections are in use (see the persistent_connections attribute). If set to true, an HTTP/1.1 HEAD request is made first in order to retrieve header information about a document; if the returned status code and content-type indicate that the document is parsable, a following GET request is made.

Type: boolean

Default: true

Example:

head_before_get: false

timeout

Specifies the time the digger will wait to complete a network read. This is just a safeguard against unforeseen things like the all too common transformation from a network to a notwork.

The timeout is specified in seconds.

Type: number

Default: 30

Example:

timeout: 42

authorization

This tells htcheck to send the supplied username:password with each HTTP request. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.

Type: string

Default:

Example:

authorization: myusername:mypassword

max_retries

This option sets the maximum number of retries when retrieving a document fails (mainly for connection-related reasons).

Type: number

Default: 3

Example:

max_retries: 6

tcp_max_retries

This option sets the maximum number of attempts made when a connection times out. After all these retries, the connection attempt is considered timed out.

Type: number

Default: 1

Example:

tcp_max_retries: 6

tcp_wait_time

This attribute sets the time to wait, after a connection fails and the timeout is raised, before trying again.

Type: number

Default: 5

Example:

tcp_wait_time: 10

http_proxy

When this attribute is set, all HTTP document retrievals will be done using the HTTP-PROXY protocol. The URL specified in this attribute points to the host and port where the proxy server resides.

The use of a proxy server can greatly improve the performance of the indexing process.

Type: string

Default:

Example:

http_proxy: http://proxy.bigbucks.com:3128

http_proxy_exclude

When this is set, URLs matching this will not use the proxy. This is useful when you have a mixture of sites close to the digging server and sites far away.

Type: string

Default:

Example:

http_proxy_exclude: http://intranet.foo.com/
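
The two proxy attributes are typically used together: every retrieval goes through the proxy except for URLs matching the exclusion; the host names below are only placeholders:

http_proxy:         http://proxy.bigbucks.com:3128
http_proxy_exclude: http://intranet.foo.com/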

accept_language

This attribute allows you to restrict the set of natural languages that are preferred as a response to an HTTP request performed by the digger. This is done by listing one or more language tags (as defined by RFC 1766) in the preferred order, separated by spaces. When the server performs content negotiation based on the 'Accept-Language' header given by the HTTP user agent, different content can be returned depending on the value of this attribute. If left empty, no language is sent and the server default is returned.

Type: string

Default:

Example:

accept_language:        en-us en it

Setting what to store

max_doc_size

This is the upper limit to the amount of data retrieved for documents. This is mainly used to prevent unreasonable memory consumption since each document will be read into memory by htcheck.

Type: number

Default: 100000

Example:

max_doc_size: 5000000

store_only_links

If set to false, htcheck will store in the database every tag it finds in every document it crawls. If set to true, htcheck stores only those HTML tags and attributes that produce a link or set an anchor (identified by the pair tag: A, attribute: name).

Type: boolean

Default: false

Example:

store_only_links: false

Setting what to report

summary_anchor_not_found

Enable or disable the display of the summary of HTML anchors that have not been found.

Type: boolean

Default: true

Example:

summary_anchor_not_found: false

