ht://Check uses a flexible configuration file. This configuration file is a plain ASCII text file. Each line in the file is either a comment or contains an attribute. Comment lines are blank lines or lines that start with a '#'.
Attributes consist of a variable name and an associated value:
<name>:<whitespace><value><newline>
The name may contain any alphanumeric character or underscore (_).
The value can include any character except newline. It cannot start with spaces or tabs, since those are considered part of the whitespace after the colon. Keep in mind that any trailing spaces or tabs are included in the value.
A value can be split across several lines of the configuration file by ending each line except the last with a backslash (\). A space is added to the value where each line split occurs.
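The syntax rules above can be sketched as a small parser. This is a minimal illustration in Python, not ht://Check's actual parser: the function name parse_config is hypothetical, and collapsing the whitespace around a line split into a single space is an assumption, since the text only states that a space is added.

```python
import re

# <name>:<whitespace><value>; names are ASCII alphanumerics or underscores.
ATTRIBUTE = re.compile(r"^([A-Za-z0-9_]+):[ \t]*(.*)$")


def parse_config(text):
    """Return a dict mapping attribute names to raw string values."""
    attributes = {}
    pending = None  # logical line still being joined across backslashes
    for raw in text.split("\n"):
        if pending is None:
            # Comment lines are blank lines or lines starting with '#'.
            if not raw.strip() or raw.startswith("#"):
                continue
            line = raw
        else:
            # Assumption: whitespace around the split collapses to one space.
            line = pending.rstrip(" \t") + " " + raw.lstrip(" \t")
            pending = None
        if line.endswith("\\"):
            pending = line[:-1]  # drop the backslash, keep reading
            continue
        match = ATTRIBUTE.match(line)
        if match:
            # Leading whitespace is consumed by the pattern; trailing is kept.
            attributes[match.group(1)] = match.group(2)
    return attributes
```

Values are returned as raw strings, so trailing spaces or tabs survive exactly as written in the file; converting values to numbers or booleans is left to the caller.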
If ht://Check needs a particular attribute and it is not in the configuration file, it will use the default value which is defined in htcommon/defaults.cc of the source directory.
A configuration file can include another file by using the special name include. The value is taken as the file name of another configuration file to be read in at this point. If the given file name is not fully qualified, it is taken relative to the directory in which the current configuration file is found.
Variable expansion is permitted in the file name. Multiple include statements, and nested includes are also permitted. Example:
include: common.conf
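How an include value might be turned into a path can be sketched as follows. This is an illustration under stated assumptions, not ht://Check's own code: the function names and the conf_dir attribute are hypothetical, the ${name} expansion syntax is taken from the ${start_url} default shown in this document, and treating an unknown variable as an empty string is an assumption.

```python
import posixpath
import re


def expand_variables(value, attributes):
    """Replace each ${name} with the value of that attribute."""
    # Assumption: a name that is not defined expands to an empty string.
    return re.sub(
        r"\$\{([A-Za-z0-9_]+)\}",
        lambda match: attributes.get(match.group(1), ""),
        value,
    )


def resolve_include(current_config_path, include_value, attributes):
    """Return the path of the file named by an include attribute."""
    file_name = expand_variables(include_value, attributes)
    if posixpath.isabs(file_name):
        return file_name  # fully qualified names are used as given
    # Relative names are taken from the including file's directory.
    return posixpath.join(posixpath.dirname(current_config_path), file_name)
```

With these definitions, an include of common.conf inside /etc/htcheck/htcheck.conf resolves to /etc/htcheck/common.conf.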
Here you can find a brief explanation of the ht://Check configuration attributes.
start_url
This is the list of URLs that will be used to start a dig when there is no existing database. Note that multiple URLs can be given here.
Type: string
Default: http://htcheck.sourceforge.net/
Example:
start_url: http://www.somewhere.org/alldata/index.html
limit_urls_to
This specifies a set of patterns that all URLs have to match in order to be included in the search. Any number of strings can be specified, separated by spaces. If multiple patterns are given, at least one of them has to match the URL.
Matching is a case-insensitive string match on the URL to be used. The match is performed after relative references have been converted to a valid URL, which means that the URL will always start with http://.
Granted, this is not the perfect way of doing this, but it is simple enough and it covers most cases.
Type: string
Default: ${start_url}
Example:
limit_urls_to: .sdsu.edu kpbs
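The matching rule for limit_urls_to can be sketched in a few lines of Python. This is only an illustration of the description above, not ht://Check's implementation: the function name url_is_in_scope is hypothetical, and reading "string match" as a substring test is an assumption.

```python
def url_is_in_scope(url, limit_urls_to):
    """Return True if the URL matches at least one of the patterns."""
    patterns = limit_urls_to.split()  # patterns are separated by spaces
    lowered_url = url.lower()  # the match is case-insensitive
    # Assumption: a pattern matches when it occurs anywhere in the URL.
    return any(pattern.lower() in lowered_url for pattern in patterns)
```

With the example value above, http://www.SDSU.edu/index.html and http://www.kpbs.org/ are both in scope, while http://www.example.com/ is not.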
limit_normalized
This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Unlike the limit_urls_to directive, this is done after the URL is normalized.
Type: string
Default:
Example:
limit_normalized: http://www.mydomain.com
exclude_urls
If a URL contains any of the space-separated patterns, it will be rejected. This is useful for excluding common things such as an infinite virtual web tree whose URLs start with cgi-bin.
Type: string
Default:
Example:
exclude_urls: students.html cgi-bin
bad_extensions
This is a list of extensions on URLs which are considered non-parsable. This list is used mainly to supplement the MIME types that the HTTP server provides with documents. Some HTTP servers do not have a correct list of MIME types and so may advertise certain documents as text when they are actually in some binary format.
Type: string
Default:
Example:
bad_extensions: .foo .bar .bad
bad_querystr
This is a list of CGI query strings to be excluded from indexing. This can be used in conjunction with CGI-generated portions of a website to control which pages are indexed.
Type: string
Default:
Example:
bad_querystr: forum=private section=topsecret&passwd=required
max_hop_count
Instead of limiting the indexing process by URL pattern, it can also be limited by the number of hops or clicks a document is removed from the starting URL. The starting page will have hop count 0.
Type: number
Default: 999999
Example:
max_hop_count: 4
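The effect of max_hop_count can be illustrated with a breadth-first walk over a link graph. This is a sketch under stated assumptions rather than ht://Check's crawler: the function name hop_counts and the links dictionary (a stand-in for the links found in each fetched document) are hypothetical, and it is assumed that a document's hop count is its shortest distance from the starting URL and that links found in documents already at the limit are not followed.

```python
from collections import deque


def hop_counts(links, start_url, max_hop_count):
    """Return the hop count of every URL reachable within the limit."""
    hops = {start_url: 0}  # the starting page has hop count 0
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hop_count:
            continue  # following links from here would exceed the limit
        for target in links.get(url, []):
            if target not in hops:  # first visit gives the shortest distance
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops
```

For a chain of pages a, b, c, d where each links to the next, a limit of 2 reaches a, b and c but never d.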
check_external
If set to 'true', htcheck checks whether external URLs exist. An external URL is a URL that does not match the limit configuration attributes. External URLs are not parsed.
Type: boolean
Default: true
Example:
check_external: false
disable_cookies
If set to 'true', htcheck disables HTTP cookie management.
Type: boolean
Default: false
Example:
disable_cookies: true
db_name
This is the name of the database that htcheck uses to store the data it collects.
Type: string
Default: htcheck (or the value defined by the --with-db-name configure option)
Example:
db_name: test
mysql_conf_file_prefix
Prefix of the MySQL configuration file to be searched. The default is 'my', so the file searched is usually ~/.my.cnf (suggested). If it is not found, the /etc/.my.cnf file is searched. For its syntax, see the 'Option File' contents of the MySQL documentation.
Type: string
Default: my
Example:
mysql_conf_file_prefix: htcheck
mysql_conf_group
Group to be searched inside the .my.cnf file of MySQL for getting the settings for the connection to the server. In other words, it's the section marked with [<group>] inside the MySQL option file (default is [client]).
Type: string
Default: client
Example:
mysql_conf_group: htcheck
optimize_db
Optimize the database tables at the end of the crawl. Disable it if the database server doesn't support it.
Type: boolean
Default: false
Example:
optimize_db: true
sql_big_table_option
Enables or disables an option that is useful when performing huge queries. When it is not set, the MySQL database server may sometimes return a 'table is full' error.
Type: boolean
Default: true
Example:
sql_big_table_option: false
url_index_length
This number specifies the length of the index on the Url field in the Schedule and Url tables of the database. You can set different values depending on the average length of the URLs that htcheck finds in your sites. If you do not want to set any limit, use the value '-1'. This attribute may affect the performance of crawls, since the length of an index can either slow down or speed up the spidering process.
Type: number
Default: 64
Example:
url_index_length: -1
user_agent
This allows customization of the User-Agent header field sent when the digger requests a file from a server.
Type: string
Default: ht://Check
Example:
user_agent: htcheck-crawler
persistent_connections
If set to true, htcheck takes advantage of persistent connections, as defined by HTTP/1.1 (RFC 2616), whenever the server makes this possible. This reduces the number of connection open and close operations needed when retrieving documents over HTTP.
Type: boolean
Default: true
Example:
persistent_connections: false
head_before_get
This option works only when persistent connections are in use (see the persistent_connections attribute). If set to true, an HTTP/1.1 HEAD request is made first in order to retrieve header information about a document. If the returned status code and content type indicate that the document is parsable, a GET request follows.
Type: boolean
Default: true
Example:
head_before_get: false
timeout
Specifies the time the digger will wait to complete a network read. This is just a safeguard against unforeseen things like the all too common transformation from a network to a notwork.
The timeout is specified in seconds.
Type: number
Default: 30
Example:
timeout: 42
authorization
This tells htcheck to send the supplied username:password with each HTTP request. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.
Type: string
Default:
Example:
authorization: myusername:mypassword
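For reference, the "Basic" scheme encodes the username:password string in Base64 and sends it in the Authorization request header. The following Python sketch shows that encoding; the function name basic_authorization_header is hypothetical and the code illustrates the scheme itself, not ht://Check's source.

```python
import base64


def basic_authorization_header(credentials):
    """Build the Authorization header value for 'username:password'."""
    if ":" not in credentials:
        raise ValueError("expected a colon between username and password")
    # Base64-encode the credentials exactly as given.
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token
```

Base64 is an encoding and not encryption, so credentials sent this way can be read by anyone able to observe the request.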
max_retries
This option sets the maximum number of retries when retrieving a document fails (mainly because of connection problems).
Type: number
Default: 3
Example:
max_retries: 6
tcp_max_retries
This option sets the maximum number of attempts made when a connection raises a timeout. After all these attempts, the connection is considered timed out.
Type: number
Default: 1
Example:
tcp_max_retries: 6
tcp_wait_time
This attribute sets the wait time after a connection fails and the timeout is raised.
Type: number
Default: 5
Example:
tcp_wait_time: 10
http_proxy
When this attribute is set, all HTTP document retrievals will be done using the HTTP-PROXY protocol. The URL specified in this attribute points to the host and port where the proxy server resides.
The use of a proxy server can greatly improve the performance of the indexing process.
Type: string
Default:
Example:
http_proxy: http://proxy.bigbucks.com:3128
http_proxy_exclude
When this is set, URLs matching this will not use the proxy. This is useful when you have a mixture of sites near to the digging server and far away.
Type: string
Default:
Example:
http_proxy_exclude: http://intranet.foo.com/
accept_language
This attribute restricts the set of natural languages that are preferred in the response to an HTTP request performed by the digger. To use it, list one or more language tags (as defined by RFC 1766) in order of preference, separated by spaces. When the server performs content negotiation based on the 'accept-language' value given by the HTTP user agent, it can return different content depending on the value of this attribute. If left empty, no language is sent and the server default is returned.
Type: string
Default:
Example:
accept_language: en-us en it
max_doc_size
This is the upper limit to the amount of data retrieved for documents. This is mainly used to prevent unreasonable memory consumption since each document will be read into memory by htcheck.
Type: number
Default: 100000
Example:
max_doc_size: 5000000
store_only_links
If set to false, htcheck stores in the database every tag it finds in every document it crawls. If set to true, htcheck stores only those HTML attributes and statements that produce a link or set an anchor (identified by the pair tag: A, attribute: name).
Type: boolean
Default: false
Example:
store_only_links: false
summary_anchor_not_found
Enables or disables the display of the summary of the HTML anchors that have not been found.
Type: boolean
Default: true
Example:
summary_anchor_not_found: false