Website inventory with CSE?

Post by mdevette » Tue Oct 09, 2012 3:53 pm

Can we perform a full website content inventory with CSE HTML Validator (latest version)? If so, how?
It's to run on a live public website.

I was hoping that CSE HTML Validator could also do something like what this online tool does: http://www.pagetrawler.com.

Thank you.
mdevette
Rank 0 - Newcomer
Posts: 2
Joined: Tue Oct 09, 2012 3:49 pm

Re: Website inventory with CSE?

Post by Albert Wiersch » Wed Oct 10, 2012 9:14 am

Hello,

Thank you. I've tried that site and looked at the CSV file. It looks like it is just a summary of page titles and headers on the pages, with some additional information.

It may be possible to do something similar in CSE HTML Validator v12 using the customization ability and the Batch Wizard to crawl a site.

If you'd like me to investigate this, and are willing to work with me and try the v12 BETA as this is developed and tested, then please let me know.
Albert Wiersch
Site Admin
Posts: 2361
Joined: Sat Dec 11, 2004 10:23 am
Location: Near Dallas, TX

Re: Website inventory with CSE?

Post by mdevette » Wed Oct 10, 2012 2:02 pm

Yes, sure, I'm willing to help on this; just let me know.

Marco

Re: Website inventory with CSE?

Post by Albert Wiersch » Fri Oct 12, 2012 6:09 am

mdevette wrote: Yes, sure, I'm willing to help on this; just let me know.

Marco


Great, thanks. I'll get back to you. I have a few things I need to complete first.

Re: Website inventory with CSE?

Post by Albert Wiersch » Fri Oct 12, 2012 6:16 pm

Hi Marco,

I hope to be able to look into this next week. Can you provide specific details on what you'd like the website inventory to include? I assume you want a CSV (comma-separated values) file. Exactly which fields and what information do you want in it? Please be as specific as possible.

Thanks.

Re: Website inventory with CSE?

Post by Lou » Fri Oct 12, 2012 8:55 pm

Albert Wiersch wrote: Exactly what fields and what information did you want in there? Please be as specific as possible.


Albert, after a quick run of pageTrawler, I see two things:
  • It would be nice to be able to exclude directories.
    When I ran it, it wasted time running through the phpBB subdirectory, which gives me no information.
  • It would be nice to know where/how a file was found.
    pageTrawler found two 404 references. There is no clue where these links are.
    I have not seen any indication of a 404 before. Of course just because I don't find them doesn't mean they are not there.
Lou
Rank IV - Intermediate
Posts: 180
Joined: Fri Jul 29, 2005 5:55 pm
Location: MD

Re: Website inventory with CSE?

Post by Albert Wiersch » Tue Oct 16, 2012 12:45 pm

Hi Lou and all,

If you'd like to try making a "website inventory/content map" using CSE HTML Validator, then please download and install the latest v12 PUBLIC BETA (BETA 5), which I just released here:
http://www.htmlvalidator.com/freebeta/

You'll also need this 'user functions' file:
http://www.htmlvalidator.com/user-funct ... entmap.cfg

The 'user functions' file adds this additional functionality. It creates a simple CSV (comma-separated values) file with the document location, document title, document keywords, h1 text, and h2 text for every document that is validated. You can configure it differently if you want, but you will have to edit the code in the user functions file.

One important thing you will have to do is to change the hardcoded filename to what you want (where you want the CSV data stored). Currently it is set to "T:\content_map.csv". To do this, simply edit the user functions config file with a text editor and make the change where this code is:
 $cmap.filename='T:\content_map.csv';


In the Validator Engine Options, on the Validator Engine->Config File page, set the 'user functions' file to the user functions config file. Then check the 'Enable potentially destructive functions' option, because this feature needs writeFile() to write the data. If that option is not checked, no data will be written to the CSV file.

Finally, be sure to reload the configuration or restart CSE HTML Validator for the new 'user functions' file to take effect.

If everything is working well, then every time you validate an HTML/XHTML document (using the editor or the Batch Wizard), it should append a line to the CSV file. It will just keep appending data, so you'll need to delete or move or rename the file when you want to start a new file.
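
For readers who want to see the idea outside CSE HTML Validator, here is a minimal Python sketch of the same kind of content map. It is not the contents of the user functions file and does not use CSE HTML Validator's scripting language; the names `ContentMapParser`, `content_map_row`, and `append_row` are hypothetical, and joining multiple headings with " | " is an assumption made for illustration.

```python
import csv
from html.parser import HTMLParser


class ContentMapParser(HTMLParser):
    """Collects the title, meta keywords, and h1/h2 text of one HTML document."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "h1": [], "h2": []}
        self.keywords = ""
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""
        elif tag in self.fields:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so each field stays on one line.
            text = " ".join("".join(self._buffer).split())
            self.fields[tag].append(text)
            self._current = None


def content_map_row(location, html):
    """Build one row: location, title, keywords, h1 text, h2 text."""
    parser = ContentMapParser()
    parser.feed(html)
    parser.close()
    f = parser.fields
    return [location, " | ".join(f["title"]), parser.keywords,
            " | ".join(f["h1"]), " | ".join(f["h2"])]


def append_row(csv_file, row):
    """Append one row to an already open CSV file object."""
    csv.writer(csv_file).writerow(row)
```

To mimic the append-only behavior described above, the file would be opened in append mode, for example `open(path, "a", newline="")`, and `append_row` called once per validated document.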

This is an example of the improved customization ability in CSE HTML Validator v12. It lets you add your own fields and any other information you want. However, some information, such as link checking status, may not be available. If there is anything that you'd really like to see added, then please let me know.

Re: Website inventory with CSE?

Post by Albert Wiersch » Tue Oct 16, 2012 12:50 pm

Lou wrote: Albert, after a quick run of pageTrawler, I see two things:
  • It would be nice to be able to exclude directories.
    When I ran it, it wasted time running through the phpBB subdirectory, which gives me no information.


Hi Lou, if you use the Batch Wizard, then you can exclude folders from being checked using the 'Don't process these targets' option in the Target List Options Tab of the Batch Wizard.
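
As a generic illustration of what excluding a folder from a crawl means, here is a small Python sketch that drops every URL whose path lies inside an excluded directory. It does not describe how the Batch Wizard implements 'Don't process these targets'; `is_excluded` and `filter_targets` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def is_excluded(url, excluded_dirs):
    """Return True if the URL's path lies inside any excluded directory."""
    path = urlsplit(url).path
    for directory in excluded_dirs:
        prefix = "/" + directory.strip("/") + "/"
        # Match the directory itself as well as anything beneath it.
        if path.startswith(prefix) or path == prefix.rstrip("/"):
            return True
    return False


def filter_targets(urls, excluded_dirs):
    """Keep only the URLs that are outside every excluded directory."""
    return [u for u in urls if not is_excluded(u, excluded_dirs)]
```

Matching on a full path segment rather than a bare string prefix keeps a folder such as `phpBBextra` from being skipped when only `phpBB` is excluded.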

Lou wrote:
  • It would be nice to know where/how a file was found.
    pageTrawler found two 404 references. There is no clue where these links are.
    I have not seen any indication of a 404 before. Of course just because I don't find them doesn't mean they are not there.


The current CSV config file doesn't support link information. Of course, you could do a link check using the Batch Wizard, and the link reports should tell you which documents contain bad links. Due to the current architecture of CSE HTML Validator, adding link information to the CSV file is not simple, because validation is a separate process from link checking.
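
To make the "where was this link found" idea concrete, here is a hedged Python sketch that records, for every link target, the pages that refer to it. It is independent of CSE HTML Validator and its link reports; `LinkCollector` and `build_referrer_map` are hypothetical names, and the sketch only looks at `<a href>` links in page sources that have already been fetched.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in one document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_referrer_map(pages):
    """pages maps page URL -> HTML source.

    Returns a dict mapping each absolute link target to the sorted list
    of pages that link to it.
    """
    referrers = defaultdict(set)
    for page_url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and ignore #fragments.
            target, _fragment = urldefrag(urljoin(page_url, href))
            referrers[target].add(page_url)
    return {target: sorted(sources) for target, sources in referrers.items()}
```

Given such a map, any URL that a link check reports as a 404 can be looked up to list the documents that contain the bad link.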

Re: Website inventory with CSE?

Post by Lou » Tue Oct 16, 2012 8:40 pm

Albert Wiersch wrote: Hi Lou and all,

If everything is working well,

Albert, how can we figure out why everything is not working well?
  • I downloaded the new files.
  • I edited the new config file to set the output file location.
  • I made the changes to the CSE HTML Validator configuration.

I don't get an output file when validating a single file or when using the Batch Wizard.

Your instructions seem straightforward, but...

Re: Website inventory with CSE?

Post by Albert Wiersch » Tue Oct 16, 2012 10:01 pm

Lou wrote: I don't get an output file when validating a single file or when using the Batch Wizard.

Your instructions seem straightforward, but...


Hi Lou,

Please:

1. Make sure CSE HTML Validator has write access (permission) to the file and folder that you've set.
2. Make sure you checked the 'Enable potentially destructive functions' option.

Can you confirm that both of these conditions are met?
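
The first condition can be checked independently of CSE HTML Validator. The following Python sketch, with the hypothetical helper name `can_append`, tests whether a file at a given path can be opened for appending. It only reflects the permissions of the account running the script, which is assumed here to be the same account that runs the validator.

```python
import os


def can_append(path):
    """Return True if a file at `path` can be opened for appending."""
    folder = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(folder):
        return False  # the target folder does not exist
    existed = os.path.exists(path)
    try:
        with open(path, "a"):
            pass
    except OSError:
        return False  # no write permission, read-only drive, and so on
    if not existed:
        os.remove(path)  # do not leave an empty file behind
    return True
```

For the path used earlier in this thread, the call would be `can_append(r"T:\content_map.csv")`.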

Re: Website inventory with CSE?

Post by Lou » Tue Oct 16, 2012 11:06 pm

So when you change a config file, you need to reload the program, right?

Duh. All seems to work fine now.

Re: Website inventory with CSE?

Post by Albert Wiersch » Tue Oct 16, 2012 11:36 pm

Lou wrote: So when you change a config file, you need to reload the program, right?


Yep! Sorry, I should have explicitly said that. I've updated the directions.

Lou wrote: All seems to work fine now.


Great!

Re: Website inventory with CSE?

Post by Lou » Wed Oct 17, 2012 6:13 am

Albert Wiersch wrote: Yep! Sorry, I should have explicitly said that. I've updated the directions.

Some days I'm a bit slower than others. Now I need to work on fixing the (lack of) content this add-on reveals.

