Making CSE catch high ascii characters from MS Word

For technical support for all editions of CSS HTML Validator. Includes bug reports.
cdwz
Rank II - Novice
Posts: 33
Joined: Tue Sep 02, 2008 10:42 am
Location: Washington DC

Making CSE catch high ascii characters from MS Word

Post by cdwz » Fri Oct 12, 2012 8:44 am

We get a lot of MS Word docs to convert to web pages, and in the newest version we've moved to (Office 2010), Word uses a lot of high ascii characters that don't appear until you view the pages in a browser. We'll get hyphens becoming Euro symbols, some blanks becoming the "A" with the accent above it, etc.

Is there any way to make CSE check for these?

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Thu Nov 15, 2012 9:10 am

I still sort of need a solution for this.

Albert Wiersch
Site Admin
Posts: 3412
Joined: Sat Dec 11, 2004 9:23 am
Location: Near Dallas, TX

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Thu Nov 15, 2012 9:33 am

Hello,

I'm sorry for the delay. I must have missed your original message.

It sounds like an encoding issue. Are you saving Word documents as HTML in Word 2010? I would think Office 2010 would save them using UTF-8 so this wouldn't be an issue, but I am not that familiar with how Word works in this regard.

Can you send a sample document that I can use to reproduce the problem to support at htmlvalidator dot com? Also, if there is a public URL that I can access that also shows the problem, then that would be helpful too.
Albert Wiersch

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Thu Nov 15, 2012 11:25 am

Next time we have an affected file, I'll set it aside.

We're basically taking documents that were created in MS Word 2010 and pasting them into the design view of Dreamweaver, then switching to code view to clean up the code. The funny part is that one of my colleagues has an older version of CSE (version 8) and his picks up these characters just fine. My version 10 does not.

I've tried changing some of my settings, but if I make it any more sensitive, I get slammed with warnings about table tags not having a summary attribute. Since we use the caption tag, I think the summary is redundant. I wish I could turn that setting off!

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Thu Nov 15, 2012 12:10 pm

cdwz wrote: Next time we have an affected file, I'll set it aside.

OK.

cdwz wrote: The funny part is that one of my colleagues has an older version of CSE (version 8) and his picks up these characters just fine. My version 10 does not.

It's possible that the legacy 'high ASCII check' is still running in version 8 but is being turned off in version 10 because version 10 detects a Unicode document. I'd have to see the actual document to determine exactly what's happening.

cdwz wrote: I've tried changing some of my settings, but if I make it any more sensitive, I get slammed with warnings about table tags not having a summary attribute. Since we use the caption tag, I think the summary is redundant. I wish I could turn that setting off!

Have you tried disabling the message? Please see:
http://www.htmlvalidator.com/htmlval/v1 ... ssages.htm

You should be able to right click on it when it's displayed in CSE's editor and disable the message.

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Wed Feb 20, 2013 2:03 pm

Just an update to this.

I finally got a file full of these high ascii characters, so I uploaded it to my web server to use as a sample. There's only one problem - the characters don't show up. So I did a little experimenting.

The server at my office is a ColdFusion machine. When I have a file with the high ascii characters (they display as umlauts, euro symbols, etc), it's a .cfm file. When I uploaded the file to my web server, I named it .html so that my CMS wouldn't see it. When I viewed it, there were no visible high ascii characters.

When I renamed the file on the server at work from .cfm to .html, the characters vanished. But when I renamed the file on my own server to .cfm, there were still no visible high ascii characters.

So this seems to point to something unique to the server at my office. I'm not sure how to handle that.

Could the encoding be set wrong somewhere?

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Wed Feb 20, 2013 11:21 pm

It sounds like an encoding issue to me.

If you use File->Open from the Web and look at the HTTP headers shown in the "Open Progress" window, you might be able to see which encoding the server reports to the client in the "Content-Type" header. If you can't decipher it, then please copy and paste the HTTP headers here.

For example, there might be a line like this:

Code:

Header> Content-Type: text/html; charset=utf-8

Which tells the client that the encoding is utf-8.

To make sure everything is right, make sure the document really is encoded in utf-8, and any encoding specified in the document itself (like in a "meta" tag) is also utf-8. For encodings other than utf-8, the same applies... just make sure everything matches.

It's also possible that the encoding may not be specified in the HTTP headers, which may or may not be an issue... but if it is specified, then I believe it should take precedence over any encoding specified in the document.
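
As a rough illustration of the "make sure everything matches" check, here is a minimal Python sketch. It is not part of CSE HTML Validator. The function names (`header_charset`, `meta_charset`, `check_encoding`) are made up for this example, and the regular expression only handles simply written "meta" tags.

```python
import re
from email.message import Message

# Loose pattern for <meta charset="..."> and <meta ... content="...; charset=...">.
# Illustrative only: it does not cover every way a meta tag can be written.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(body):
    """Return the charset declared in a meta tag near the top of the bytes, or None."""
    match = META_CHARSET_RE.search(body[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as(body, charset):
    """True if the raw bytes are valid in the given encoding."""
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


def check_encoding(content_type, body):
    """Report the header charset, the meta charset, and whether the bytes decode."""
    from_header = header_charset(content_type)
    from_meta = meta_charset(body)
    charset = from_header or from_meta  # the header is checked first
    return {
        "header": from_header,
        "meta": from_meta,
        "decodes": charset is not None and decodes_as(body, charset),
    }
```

If "header" and "meta" differ, or "decodes" is false, then the declared encoding and the actual bytes disagree somewhere, which is the situation described above.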

I hope this helps.

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Tue Feb 26, 2013 7:33 am

Sorry for the delay. It was a busy weekend!

Here's what the header says:

Code:

Header> HTTP/1.1 200 OK
Header> Content-Type: text/html; charset=UTF-8
Header> Content-Language: en-US
Header> Server: Microsoft-IIS/7.5
Header> X-Powered-By: ASP.NET
Header> Date: Tue, 26 Feb 2013 13:31:05 GMT
Header> Connection: close

Here's an example (from CSE's code view) of a problem file:

Code:

<h1>Table of Contents</h1> 
<p><a href="Toc332984661">I.     Summary. 4</a></p> 
<p><a href="Toc332984662">II.    Background. 4</a></p> 
<p><a href="Toc332984663">III.   Presentation and Discussion Highlights. 5</a></p> 
<p><a href="Toc332984664">IV.  Conclusion. 18</a></p> 
<p><a href="Toc332984665">Appendix A: May 2012 Webinar Summary. 19</a></p> 
<p><a href="Toc332984666">Appendix B: Webinar and Workshop Agendas. 20</a></p> 
<p><a href="Toc332984667">Appendix C: Workshop Attendees. 23</a></p> 
<p><a href="Toc332984668">Appendix D: Brainstorm Anywhere Responses. 27</a></p>

And this:

Several members of DRCOG’s Board of Directors also attended.

These things don't show up in Dreamweaver's code view, and if I press F6 in CSE to validate the current document, they're not caught.
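
Outside of CSE, one way to make such characters visible is to list every non-ASCII character with its position and Unicode name. The following is a minimal Python sketch, not a CSE feature. The function name `find_non_ascii` is made up, and the sample text assumes that the gap after "I." in the table of contents is a no-break space, which this thread does not confirm.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col_no, ch, unicodedata.name(ch, "UNNAMED")))
    return hits


# Hypothetical sample: a no-break space (assumed) and Word's curly apostrophe.
sample = (
    "<p>I.\u00a0Summary. 4</p>\n"
    "Several members of DRCOG\u2019s Board of Directors also attended."
)
for hit in find_non_ascii(sample):
    print(hit)
```

The text must already be decoded correctly (for example, read from disk as UTF-8) for the reported names to be meaningful.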

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Tue Feb 26, 2013 1:15 pm

Thanks. The header indicates UTF-8. Is there an encoding specified in the document itself, like in a "meta" tag? If so, then it should be UTF-8. If not, then please try adding "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">" to the document (in the "head" section of course).

What happens if you open the document using File->Open with Encoding and open the document explicitly with "Unicode (UTF-8)" specified as the encoding?

If the above doesn't help you find the problem, then is it possible to put the document on a public server so that I can try to reproduce the issue on my end?

Also, if it's possible to see what encoding Dreamweaver is using to open the document, then that might help as well.

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Wed Feb 27, 2013 7:54 am

There is a meta tag on the page specifying UTF-8.

Specifying the encoding in Dreamweaver doesn't change anything in the existing file, and neither does specifying the encoding on a new file before pasting the contents from MS Word. There are high ascii values that Word is using to make "fancy quotes" and fancy other things, so they exist in the original document.

I've uploaded a file to my personal website. Disclaimer: This is an untouched copy that hasn't had any cleanup, so there will be dozens of HTML errors.

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Wed Feb 27, 2013 9:13 am

Hello,

Please remember that the "high ASCII character check" does not apply to Unicode documents, and UTF-8 is a Unicode encoding. As long as you're dealing with documents that declare themselves as UTF-8, CSE HTML Validator should not be doing this check ('legacy' versions might, but they shouldn't). It's OK to have these characters as long as the encoding is correct and maintained.

I also suggest upgrading CSE HTML Validator, especially from version 8. I can't recall exactly, but I'm pretty sure that Unicode support has been improved in more recent versions of CSE HTML Validator.

I could not find any encoding issues with the file/URL you provided. The characters seem to display correctly and everything seems to be UTF-8. Is this not the case for you? If not, then can you describe in detail how to reproduce the issue you have?

If you are copying and pasting from Word to Dreamweaver, then I wonder if that could be the source of the encoding problems? Also, if you are loading and editing documents in an editor that doesn't properly support UTF-8, then that could be causing corruption as well. The UTF-8 encoding must be recognized and maintained.

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Wed Feb 27, 2013 11:39 am

I'm using CSE version 10.02.

There don't appear to be any problems with the document I linked to, but if I copy that very document to our web server and give it a .cfm extension, all the spaces in the headings (for instance, between the "I" and "Summary") become Â. Some of the quotes become Euro symbols. They aren't visible in Dreamweaver, but they're visible on the rendered page when you browse it.

I really think this is a Word 2010 issue rather than a Dreamweaver issue, because we've been pasting docs from Word into DW since 2001, and not until the office upgraded to Office 2010 did we have this problem.
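
The symptoms described here (a space turning into Â, quotes and dashes picking up Euro symbols) match what happens when UTF-8 bytes are read as Windows-1252 at some stage. The Python sketch below only reproduces that symptom. It does not establish which component on the office server does the misreading, and the function name `misread_as_cp1252` is made up for the example.

```python
def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as Windows-1252 (the wrong encoding)."""
    return text.encode("utf-8").decode("cp1252")


# Characters that Word commonly inserts in place of plain ASCII ones.
samples = {
    "no-break space": "\u00a0",
    "right single quotation mark": "\u2019",
    "en dash": "\u2013",
}

for label, ch in samples.items():
    print(label, "->", repr(misread_as_cp1252(ch)))
```

The no-break space comes out as Â followed by a no-break space, and both the curly quote and the en dash come out as three characters with a Euro sign in the middle.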

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Wed Feb 27, 2013 1:26 pm

Are you transferring the document to the web server using a binary method (and not "ASCII") that doesn't change the file in any way?

Can you open it using CSE HTML Validator (after changing the extension to ".cfm") and make sure the headers still show it as UTF-8? I also wonder whether it's the CFML processor that might be affecting it.

It just seems like something along the line is corrupting the document. The key might be that it happens when you change the filename to a "cfm" extension. You might also try changing it to a .txt extension and see what you get back.

Re: Making CSE catch high ascii characters from MS Word

Post by cdwz » Wed Feb 27, 2013 7:32 pm

We use the ascii transfer method on cfm files because using binary causes them to be double-spaced. Not that it matters - I noticed the funky characters while using binary too.

I'm starting to think there's some sort of encoding going on in the ColdFusion server, because the characters don't show up if I change the file to an .html extension. The characters are still there, of course, but they don't appear in a browser. They just appear as spaces, but if you search on that "space" with a Find or Replace, it finds only those and ignores the regular spaces.
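
One possible workaround, which nobody in this thread has tested, is to replace every non-ASCII character with a numeric character reference before saving, so that the file contains only ASCII bytes and displays the same however the server labels or re-reads it. A minimal Python sketch (the function name `to_ascii_with_entities` is made up):

```python
def to_ascii_with_entities(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_with_entities("I.\u00a0Summary"))
print(to_ascii_with_entities("DRCOG\u2019s Board of Directors"))
```

This treats the symptom rather than the cause: the stage that misreads the encoding would still be there.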

Re: Making CSE catch high ascii characters from MS Word

Post by Albert Wiersch » Wed Feb 27, 2013 8:03 pm

OK. Please keep me informed of what you find. I don't think I will be able to assist further unless you can provide further steps to try on my end - so that I can reproduce the problem.

I have never used or run a ColdFusion server.