Validating only recently published pages

For technical support for all editions of CSE HTML Validator. Includes bug reports.

Post by jscar » Tue May 22, 2012 8:57 am

Hi,
With the Batch Wizard, is there any way of validating only pages with a modified date after a certain (configurable) date? Maybe using <meta name="Date.Modified" content="2012-05-18T15:23:02Z" scheme="W3CDTF" />. This would save a lot of time spidering our site.

Thanks

Julian
jscar
Rank II - Novice
Posts: 47
Joined: Mon Jul 16, 2007 3:12 am

Re: Validating only recently published pages

Post by Albert Wiersch » Tue May 22, 2012 1:58 pm

Hi Julian,

Are you validating local file targets or URL targets? If local file targets, then perhaps you can use a folder target and then use the 'Limit to Age' properties (in the folder's target properties). If URL targets, then CSE HTML Validator may still need to parse/check the document in order to extract the links so it can properly 'spider' them (even if it were possible to not check certain links), which is something to keep in mind.

Does your site consist of a lot of pages, so that it takes a long time to validate them all?
Albert Wiersch
Site Admin
Posts: 2361
Joined: Sat Dec 11, 2004 10:23 am
Location: Near Dallas, TX

Re: Validating only recently published pages

Post by jscar » Wed May 23, 2012 2:09 am

Hi Albert,
We're validating URLs and, on reflection, you're right about it having to spider everything to check for the modified date :/
The problem we're having is that CSE doesn't complete its batch job every time. We have a fairly big site (it's a local government one) and some nights CSE is having to spider, validate and link check 2000 pages. We have split the site up into several .lst files and run each of these on a separate night of the week.

It's hard to know if something is interfering with CSE and causing it to bomb out, but nothing seems to be running at the times that CSE is. I'm sorry, but I don't have anything to hand that would give you an indication of what's happening with CSE. It may be the same issue as in a previous post of mine, but since I'm on v11 now, not v9, that seems unlikely.

Anyway, what I'm trying to do is to reduce the amount of validation that CSE does. If I can get it to only validate recently updated pages that might be a help.

Cheers

Julian

Re: Validating only recently published pages

Post by Albert Wiersch » Wed May 23, 2012 8:27 am

jscar wrote:We're validating URLs and, on reflection, you're right about it having to spider everything to check for the modified date :/
The problem we're having is that CSE doesn't complete its batch job every time. We have a fairly big site (it's a local government one) and some nights CSE is having to spider, validate and link check 2000 pages. We have split the site up into several .lst files and run each of these on a separate night of the week.


Hello Julian,

Are you using the enterprise edition? It may work better for you as it includes additional Batch Wizard capabilities to handle larger jobs:
http://www.htmlvalidator.com/htmlval/co ... chart.html

jscar wrote:Anyway, what I'm trying to do is to reduce the amount of validation that CSE does. If I can get it to only validate recently updated pages that might be a help.


It may be possible to write a tag name program to check a meta tag for a date and, if it is within x days or so, then only display error messages. That would require less memory (because fewer messages would be displayed) and still allow the links to be extracted. If you are interested in this, then please provide the meta tags that would need to be examined and I'll see if it's possible to do this.

Re: Validating only recently published pages

Post by jscar » Wed May 23, 2012 8:52 am

Hi,

We're using Professional, not Enterprise. I'm afraid I don't really have sufficient budget for the upgrade :( Even if we could handle larger jobs, wouldn't memory still be a limiting factor? Or are you saying that the Enterprise version can do more with the same amount of RAM?

If you could write something to check the <meta name="Date.Modified" content="yyyy-mm-ddThh:mm:ssZ" scheme="W3CDTF" /> element and see if the date value is within the last month that would be fantastic. Even better if we could configure the date from which it checks. Is the functionality in CSE to write these programs ourselves?
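To make the requested check concrete, here is a minimal sketch in Python rather than in CSE's own tag name programming language. It reads the Date.Modified meta element quoted above and reports whether the date falls within a configurable number of days. The names ModifiedDateParser, modified_date, is_recent and max_age_days are hypothetical, and the sketch assumes the content value always uses the yyyy-mm-ddThh:mm:ssZ form shown.

```python
from datetime import datetime, timedelta, timezone
from html.parser import HTMLParser


class ModifiedDateParser(HTMLParser):
    """Records the content of the first <meta name="Date.Modified"> element."""

    def __init__(self):
        super().__init__()
        self.modified = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta" or self.modified is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "Date.Modified":
            self.modified = attributes.get("content")


def modified_date(html):
    """Return the page's modified date as an aware datetime, or None."""
    parser = ModifiedDateParser()
    parser.feed(html)
    if not parser.modified:
        return None
    # Assumption: the value is always UTC in the W3CDTF form with a trailing Z.
    parsed = datetime.strptime(parser.modified, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_recent(html, now, max_age_days=30):
    """True when the page carries a modified date no older than max_age_days."""
    modified = modified_date(html)
    return modified is not None and now - modified <= timedelta(days=max_age_days)
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the cut-off date configurable and the check repeatable.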

Many thanks

Julian

Re: Validating only recently published pages

Post by Albert Wiersch » Wed May 23, 2012 10:07 am

jscar wrote:Or are you saying that the Enterprise version can do more with the same amount of RAM?


Yes, the Enterprise edition can process more targets with the same amount of memory by using a temporary folder and storing data there instead of in memory.

jscar wrote:If you could write something to check the <meta name="Date.Modified" content="yyyy-mm-ddThh:mm:ssZ" scheme="W3CDTF" /> element and see if the date value is within the last month that would be fantastic. Even better if we could configure the date from which it checks. Is the functionality in CSE to write these programs ourselves?


Yes! You can write custom "tag name programs" and "user functions" yourself. Here is more information:
http://www.htmlvalidator.com/htmlval/v1 ... ctions.htm

I will investigate this and see if I can provide something for you. I'll try to get back to you today or tomorrow by posting another reply.

Re: Validating only recently published pages

Post by jscar » Wed May 23, 2012 10:23 am

OK, thanks for that. Really appreciate your help on this.

What language are those functions written in?

Julian

Re: Validating only recently published pages

Post by Albert Wiersch » Wed May 23, 2012 1:04 pm

jscar wrote:OK, thanks for that. Really appreciate your help on this.


You're welcome!

jscar wrote:What language are those functions written in?


Just a simple/basic one that I created for CSE HTML Validator. I just call it the "tag name programming language". :D

Please try this.

1. Create a file named "jscar.cfg" (or pick your own filename).

2. Copy this to it:
function onStartTag_meta() {
 if hasAttWithStringValue('name','Date.Modified') {
  if hasAtt('content') {
   $contentvalue=getAttValue(getAttIndex('content'));
   $docyear=getMidString($contentvalue,0,4);
   $docmonth=getMidString($contentvalue,5,2);
   $docday=getMidString($contentvalue,8,2);
   #year=getValueInt(100);
   #month=getValueInt(101);
   #day=getValueInt(102);
   #docdate=(#docyear*365)+(#docmonth*31)+#docday;
   #todaydate=(#year*365)+(#month*31)+#day;
   if (#todaydate-#docdate)>30 {
    MessageEx(1033,MSG_MESSAGE,'Document date is '+$docyear+'-'+$docmonth+'-'+$docday+', which is old enough to be checked.');
   }
   else {
    MessageEx(1033,MSG_MESSAGE,'Document date is '+$docyear+'-'+$docmonth+'-'+$docday+', which is new enough not to be checked.');
    setFlag(2,2,0); // don't show future comments
    setFlag(2,8,0); // don't show future warnings
    setFlag(2,0x400,0); // don't show future messages
    setFlag(2,0x1000000,0); // don't show future errors
   }
  }
 }
}


3. Go to Options->Validator Engine Options->Options and then Validator Engine->Config File. Set the user function file to jscar.cfg.

4. Reload the configuration and it should be active.

What it should do is just disable future comments, warnings, and messages if the document is less than about 30 days old.
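The "about 30 days" wording follows from the arithmetic in the snippet, which turns each date into a rough day number (year*365 + month*31 + day) before subtracting. The Python sketch below, with hypothetical function names, compares that approximation against an exact calendar difference. It is only an illustration of the arithmetic and says nothing about how CSE HTML Validator evaluates the program.

```python
from datetime import date


def approx_day_number(d):
    """Rough day number, mirroring (year*365)+(month*31)+day from the snippet."""
    return d.year * 365 + d.month * 31 + d.day


def approx_age_days(doc, today):
    """Age of the document using the rough day numbers."""
    return approx_day_number(today) - approx_day_number(doc)


def exact_age_days(doc, today):
    """Age of the document using real calendar arithmetic."""
    return (today - doc).days


# Within a year the two usually agree closely.
print(approx_age_days(date(2012, 5, 18), date(2012, 6, 6)))   # 19
print(exact_age_days(date(2012, 5, 18), date(2012, 6, 6)))    # 19

# Across a year boundary the rough figure can go negative, because
# twelve 31-day "months" add up to more than 365.
print(approx_age_days(date(2011, 12, 31), date(2012, 1, 1)))  # -6
print(exact_age_days(date(2011, 12, 31), date(2012, 1, 1)))   # 1
```

For a 30-day threshold the drift is small for most of the year, but pages modified in late December would be worth spot-checking in early January.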

If you're interested, this can be improved and I'd be happy to work on this if you'd like to test a BETA version of v12. If so, then let me know how you'd like to see this improved. One of my thoughts is to add easier-to-use functions to get the current year, month, and day. There could also be a new function to clear the messages that have been generated up to the point where the future ones are disabled, because currently it doesn't affect the messages that have already been generated (if any).

Re: Validating only recently published pages

Post by jscar » Thu May 24, 2012 2:58 am

Hi Albert,
Again, much appreciated. I'm out of the office but will try this on Monday. I would be interested in seeing this developed and would be prepared to do some beta testing. I'll have a play with what you've done and think of ways to improve it from my perspective. I might even try to learn TNPL ;-)

Thanks again

Julian

Re: Validating only recently published pages

Post by Albert Wiersch » Thu May 24, 2012 8:09 am

jscar wrote:Hi Albert,
Again, much appreciated. I'm out of the office but will try this on Monday. I would be interested in seeing this developed and would be prepared to do some beta testing. I'll have a play with what you've done and think of ways to improve it from my perspective. I might even try to learn TNPL ;-)


Great. I have already added some functionality in v12 based on what you need. When you'd like a BETA version to try, please send me an email to support at htmlvalidator dot com or a private message through the forum.

TNPL is what I abbreviate the language to... I think that's what I'll call it more often now... TNPL. :D

I enjoy helping with real-world solutions to problems as it is a great way to decide what to add to CSE HTML Validator.

Re: Validating only recently published pages

Post by jscar » Tue May 29, 2012 3:09 am

Hi Albert,
Pages are generating an error:

The onStartTag_meta() function generated an error while executing: 98081402: misplaced or missing comma where token 'x400' found for setFlag() (source: ...show future warnings setFlag(2,0-=>x400<=-,0); // don't show future messages ...)

Re: Validating only recently published pages

Post by Albert Wiersch » Tue May 29, 2012 8:52 am

jscar wrote:Hi Albert,
Pages are generating an error:

The onStartTag_meta() function generated an error while executing: 98081402: misplaced or missing comma where token 'x400' found for setFlag() (source: ...show future warnings setFlag(2,0-=>x400<=-,0); // don't show future messages ...)


Please try updating to the latest version and see if that fixes the problem. Support for hexadecimal numbers was a relatively recent addition.
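If updating is not possible straight away, one workaround that might be worth trying is to write the two hexadecimal flag values in decimal, since the other setFlag() calls in the snippet already use plain decimal numbers. This is an assumption and has not been confirmed in this thread. The short Python sketch below only shows the conversion; the dictionary keys are hypothetical labels taken from the snippet's comments.

```python
# Hexadecimal flag values as they appear in the setFlag() calls.
hex_flags = {
    "future messages": "0x400",
    "future errors": "0x1000000",
}

# Convert each value to its decimal equivalent.
decimal_flags = {label: int(value, 16) for label, value in hex_flags.items()}

for label, value in decimal_flags.items():
    print(f"{label}: {hex_flags[label]} = {value}")
```

This gives 1024 for 0x400 and 16777216 for 0x1000000.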

Re: Validating only recently published pages

Post by jscar » Tue May 29, 2012 9:18 am

Yep, that worked. Thanks. I will update soon with an idea of whether this solves (lessens?) the problems with the Batch Wizard not completing.

Thanks
Julian

Re: Validating only recently published pages

Post by jscar » Wed Jun 06, 2012 6:05 am

Hi Albert

the following is an Important Message from last night's run:

The request for "http://www.eden.gov.uk/democracy/decision-making/" was redirected to "http://www.eden.gov.uk/your-council/how-decisions-are-made/". The parent target is "http://www.eden.gov.uk/democracy/".
*** An exception occurred (2011033101) in ProcessTargetValidate(). jobconfig: ptr:48838784. Exception: *** An exception occurred (2011041111). Exception: *** An exception occurred (2011041129). Exception: *** An exception occurred (2011041128). Exception: *** An exception occurred (2011062801). Exception: *** An exception occurred (2011041179). Exception: *** An exception occurred (2011041180). Exception: *** An exception occurred (2011041181). Exception: *** An exception occurred (2011041301) in InterpreterW::getInteger(). Token is "if". Exception: *** An exception occurred (2011041248). Exception: *** An exception occurred (2011041180). Exception: *** An exception occurred (2011041181). Exception: *** An exception occurred (2011041301) in InterpreterW::getInteger(). Token is "{". Exception: *** An exception occurred (2011041180). Exception: *** An exception occurred (2011041181). Exception: *** An exception occurred (2011041301) in InterpreterW::getInteger(). Token is "if". Exception: *** An exception occurred (2011041248). Exception: *** An exception occurred (2011041180). Exception: *** An exception occurred (2011041181). Exception: *** An exception occurred (2011041301) in InterpreterW::getInteger(). Token is "{". Exception: *** An exception occurred (2011041180). Exception: *** An exception occurred (2011041181). Exception: *** An exception occurred (2011041301) in InterpreterW::getInteger(). Token is "$". Exception: *** An exception occurred (2011041236). Exception: *** An exception occurred (2011041200). Exception: *** An exception occurred (2011041308). Exception: *** An exception occurred (2011041181). Exception: *** An exception occurred (2011041301) in InterpreterW::getInteger(). Token is ")". Exception: *** An exception occurred (2011041201) in InterpreterW::processDollarInt(): EAccessViolation: Access violation at address 03196AAA in module 'csevalidatorV110.dll'. Read of address 00000023. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. 
***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***. ***


We've not seen anything like this before. Is this a result of the TNPL, or is the work we're doing just exposing something that was hidden before? (It wasn't seen in other reports that have been generated since jscar.cfg was set up.)

Thanks

Julian

Re: Validating only recently published pages

Post by Albert Wiersch » Wed Jun 06, 2012 7:46 am

jscar wrote:We've not seen anything like this before. Is this a result of the TNPL, or is the work we're doing just exposing something that was hidden before? (It wasn't seen in other reports that have been generated since jscar.cfg was set up.)


Hello,

I'm sorry for the trouble. I'll look at the code later today but I suspect this may be a lack of memory/resource issue. How big was the job you were processing? Also, have you found any page that you can reproduce this with?