
3. How it works

ht://Check is essentially a web spider, also known as a robot or crawler. Just as a search engine (like ht://Dig) indexes words from the Internet, ht://Check stores HTML statements such as tags and attributes, links, URL information, and more.

At the moment, ht://Check supports only HTTP/1.1 (and HTTP/1.0); future plans include support for FTP, NNTP and HTTPS, and for checking local files as well.

Everything is stored in a MySQL database, created from scratch by the application itself. You don't need to create it beforehand: just run 'htcheck' and every table it needs will be built automatically, as shown below.
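For instance, after the very first run you can list the automatically created tables from the MySQL command-line client (with the connection settings taken from your MySQL option file). Note that the database name 'htcheck' used here is only an assumption: use whatever name you configured.

    $ htcheck
    $ mysql htcheck -e "SHOW TABLES"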

For information regarding the connection to the MySQL database, please consult the 'MySQL connection settings using the option file' section.

3.1 The information retrieval module

ht://Check is made up of two logical "modules": one concerning information retrieval, the other the analysis of the performed crawl.

The first step, which is also the most important, is performed entirely by the 'htcheck' program. Depending on the values set in the configuration file, htcheck starts by retrieving the URL defined in the 'start_url' configuration attribute; the crawling process can then be limited in several ways, most of which concern either the URL domain ('limit_urls_to', 'limit_normalized', 'exclude_urls') or the distance from the starting URL ('max_hop_count'), and so on. A sketch of such a configuration follows.
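This is only an illustration: I am assuming here the ht://Dig-style 'attribute: value' syntax, and the values themselves are placeholders.

    start_url:      http://www.example.com/
    limit_urls_to:  ${start_url}
    exclude_urls:   /cgi-bin/ .cgi
    max_hop_count:  10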

When htcheck retrieves the first document, it checks the response sent back by the server: if the document exists (an HTTP 200 status code is returned) and its Content-Type is text/html, htcheck starts parsing the document, retrieving and storing at least all of the HTML tags and attributes that create a link (it can store all of them if you set 'store_only_links' to false).

htcheck can also manage HTTP redirections (signalled by the "Location" header sent by the remote HTTP server) and cookies (as defined by http://www.netscape.com/newsref/std/cookie_spec.html).
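For example, a redirection that also sets a cookie might look like the following raw HTTP/1.1 response (an illustrative exchange, not actual htcheck output): htcheck schedules the URL given in the Location header and returns the cookie on later requests matching its path.

    HTTP/1.1 301 Moved Permanently
    Location: http://www.example.com/new/
    Set-Cookie: session=abc123; path=/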

In a few words, that is the main mechanism of the information retrieval module, but - believe me - it is not as easy as it seems! As far as you are concerned, though, I think that's enough for now.

3.2 The tables of a ht://Check database

First of all, you don't need to create a database for ht://Check: indeed, htcheck will do it for you!

The database that ht://Check creates is made up of the following tables:

The main task of the Schedule table is to manage the crawling system: by querying this table, htcheck knows which URLs need to be retrieved, or just checked for existence.

The Url table contains information about the URLs that have been retrieved (successfully or not): here you can find the HTTP status code returned and its reason phrase, the document size, the last access and modification times, and more.

The Server table contains information about the HTTP servers that have been encountered during the crawling process.

The HtmlStatement table contains information about the HTML statements found in each URL; each statement contains one and only one HTML tag, but may also contain one or more HTML attributes, which are stored in the HtmlAttribute table.

The Link table lets us find and locate every link created by an HTML statement (or by an HTTP redirection), so for each link we have both a referencing and a referenced URL, and we know precisely which HTML attribute created it.

The Cookies table, handled since version 1.1, stores all the cookies that have been retrieved during the crawl, together with their related information.

The htCheck table contains general information about the crawl, such as start and finish time, number of connections, etcetera.
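Because everything ends up in ordinary SQL tables, reports can be written as joins between them. The sketch below, for example, would list broken links by joining Link and Url; be warned that the column names used ('url_referencing', 'url_referenced', 'url', 'status_code') are assumptions made purely for illustration, so check the actual schema first (e.g. with SHOW COLUMNS FROM Link).

    -- Hypothetical column names: verify them against your htcheck database first
    SELECT l.url_referencing, u.url, u.status_code
    FROM Link AS l
    JOIN Url AS u ON u.url = l.url_referenced
    WHERE u.status_code = 404;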

3.3 Getting the information stored

Our starting point is that we now have a database full of information, because htcheck has already finished crawling the web.

The very first way to get reports from a crawl is to run htcheck with the '-s' option, which makes it produce summaries (see the Getting Started section).
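In its simplest form (only the '-s' option is shown here; htcheck takes its other settings from the configuration file):

    $ htcheck -s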

The other way offered by ht://Check is the PHP interface, which is really simple and easy to use (for installation and settings, see the Installing PHP scripts section).

As the database is an ordinary MySQL database, you can use whatever you want to retrieve the information stored in it (Perl, C/C++ programs, JSP, and so on). You can also access it from Windows systems: just download MyODBC. You have lots of choices, as you can see! A sample query is shown below.
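For instance, any machine that can reach the MySQL server can produce a quick report from the mysql command-line client; as before, the database name 'htcheck' and the 'status_code' column are illustrative assumptions. This one would count the crawled URLs grouped by HTTP status code:

    $ mysql htcheck -e "SELECT status_code, COUNT(*) FROM Url GROUP BY status_code"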

