Link Checking

Post by ksoutherland »

I am creating custom config files for crawling sites and currently have something like this.

Code:

<?xml version="1.0" encoding="UTF-8"?>
<csebatchwizardtargetlist version="8">
 <options htmlreportfilename="C:\Reports\HTMLValidator\site.com\index.html" optionflags="131070" excludestrings="*.js;*.js*;*g=*;*m=*;*t=*;f=*;search_text=*;type=*;csid=*;start=*;*bootstrap*;*jquery.*;*jquery-ui*;*animate.css" />
 <target flags="1069547668" target="https://site.com/" fllimitto="https://site.com/">
  <fiec flags="0" agent="ISC" url="https://site.com/" />
 </target>
</csebatchwizardtargetlist>
Currently, the crawl will not leave the main domain to follow other links, because of the "fllimitto" setting. Is there a way to add a second URL to "fllimitto" without removing all limits? I still want to limit the crawl, but to the main URL plus a subdomain.
Thank you.

Post by Albert Wiersch »

Hello,

Sorry, you can't add another URL to "fllimitto", but you can add another "target" like this:

Code:

 <target flags="1069547668" target="https://subdomain.site.com/" fllimitto="https://subdomain.site.com/">
  <fiec flags="0" agent="ISC" url="https://subdomain.site.com/" />
 </target>
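To make the placement concrete, here is a sketch of how the whole target list might look with both targets. It assumes the second "target" element simply sits next to the first one inside "csebatchwizardtargetlist"; subdomain.site.com is a placeholder, and the "options" line is copied unchanged from the original post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<csebatchwizardtargetlist version="8">
 <options htmlreportfilename="C:\Reports\HTMLValidator\site.com\index.html" optionflags="131070" excludestrings="*.js;*.js*;*g=*;*m=*;*t=*;f=*;search_text=*;type=*;csid=*;start=*;*bootstrap*;*jquery.*;*jquery-ui*;*animate.css" />
 <!-- First target: the main site, with the crawl limited to it via fllimitto -->
 <target flags="1069547668" target="https://site.com/" fllimitto="https://site.com/">
  <fiec flags="0" agent="ISC" url="https://site.com/" />
 </target>
 <!-- Second target (placeholder host): the subdomain, with its own limit -->
 <target flags="1069547668" target="https://subdomain.site.com/" fllimitto="https://subdomain.site.com/">
  <fiec flags="0" agent="ISC" url="https://subdomain.site.com/" />
 </target>
</csebatchwizardtargetlist>
```

Each target presumably starts crawling from its own URL, so this layout would mainly help when the subdomain's pages can be reached from https://subdomain.site.com/ itself.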
I hope this helps. Please let me know if you have any more questions.
Albert Wiersch, CSS HTML Validator Developer • Download CSS HTML Validator FREE Trial

Post by ksoutherland »

Unfortunately, that does not fit my use case, so it would not work. Do you know of a solution for a subdomain that is used as a CDN? The content on the subdomain does not link out to all of its own content; instead, that content is referenced from the primary URL. That is why I was hoping to let the crawl reach the subdomain's URLs without making the subdomain a target.

Post by Albert Wiersch »

Hello,

You could try removing all limits by setting fllimitto to an empty string. Then, in the Batch Wizard (after opening the target list), in the Target List Options tab, you can set the 'Process ONLY these targets' option to something like this:

Code:

https://site.com*;https://subdomain.site.com*
There's also a similar option to limit what links are checked by the link checker.

Does this do what you want?

Post by ksoutherland »

With some tinkering, I got this to work. I could not use the 'Process ONLY these targets' option in the Target List Options tab; instead, I edited the .lst file directly and added the processonlystrings attribute with the sites listed, as shown below.

Code:

 <options htmlreportfilename="C:\Reports\HTMLValidator\site.com\index.html" optionflags="131070" excludestrings="*.js;*.js*;*g=*;*m=*;*t=*;f=*;search_text=*;type=*;csid=*;start=*;*bootstrap*;*jquery.*;*jquery-ui*;*animate.css" processonlystrings="*://site.com*;*://subdomain.site.com*" />
 <target flags="1069547668" target="https://site.com/" fllimitto="">
  <fiec flags="0" agent="ISC" url="https://site.com/" />
 </target>
This may just be user error, but every time I emptied fllimitto (so that it read fllimitto="") and then saved 'Process ONLY these targets' from the Batch Wizard UI, the save overwrote my fllimitto setting with the target site's URL. That is why I had to make all of the changes manually in the .lst file.
I also saw one odd behavior: in the patterns for site and subdomain.site, I could not use the https://site.com prefix and had to switch to *://site.com instead.

Post by Albert Wiersch »

Hello,

I'm glad you got it to work.

I'm sorry about the issue you encountered. I researched this and found a design issue/bug that was causing the Batch Wizard (under certain circumstances) to reset the fllimitto text to the default if it was empty. I think this may be the issue you were encountering. This will be fixed in the next major release (2022/v22).
ksoutherland wrote (Wed Sep 15, 2021 3:16 pm): "I did have some odd behavior, which was that, prefixed to site and subdomain.site, I could not use https://site.com; instead, having to switch to *://site.com."
I tried to reproduce this issue but could not. The only difference between https://site.com and *://site.com is that *://site.com covers both http and https. If you'd like me to look into this further, I'd be happy to, provided you can share more detail.
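As an illustration of that difference, here are the two forms of the pattern list as they might appear in the processonlystrings attribute used earlier in this thread. This is only a sketch: the other "options" attributes are left out for brevity, and only one of the two elements would appear in a real file.

```xml
<!-- Variant A: scheme-specific patterns; these cover https URLs only -->
<options processonlystrings="https://site.com*;https://subdomain.site.com*" />

<!-- Variant B: wildcard scheme; these cover both http and https URLs -->
<options processonlystrings="*://site.com*;*://subdomain.site.com*" />
```

Under this reading, a link written with http:// would be covered by Variant B but not by Variant A.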