charset

paross
Rank 0 - Newcomer
Posts: 8
Joined: Mon Sep 11, 2006 4:08 pm

Post by paross »

I'm looking to learn more about character sets because Tidy is giving me fits. It changes the special characters in my code, and seems to disagree with HTML CSE Validator as to whether the code is valid.

Can you suggest a link or two to help straighten me out? I'm a web designer of moderate ability, so I'm really just looking for some basic rules. I don't have the extra brain cells to digest the whole scheme.

Phil

Post by paross »

I'm evaluating HTML editors and find that Tidy in various editors behaves differently - radically differently. And I've fussed with the settings, but the truth is that I don't really know what I'm doing.

In particular, I prefer using special characters like:

Code: Select all

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<head>

  <title></title>
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>

<body>
&copy;
&emdash;
&endash;
&reg;
&trade;
&amp;

</body>

</html>
But Tidy in HTML Validator doesn't recognize &emdash; or &endash;.

In UltraEdit Tidy changes the charset:

Code: Select all

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta name="generator" content="HTML Tidy for Windows (vers 12 April 2005), see www.w3.org" />
    <title></title>
    <meta http-equiv="content-type" content="text/html; charset=us-ascii" />
  </head>
  <body>
    &copy; &emdash; &endash; &reg; &trade; &amp;
  </body>
</html>
Messing up the em- and en-dashes.

And in HTMLPad, Tidy renders the file:

Code: Select all

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title></title>
  <meta http-equiv="content-type" content="text/html; charset=us-ascii" />
</head>

<body>
  © &emdash; &endash; ® ™ &amp;
</body>
</html>
It also changes the charset, but renders the special characters differently.

I know that Tidy can be configured in myriad ways, but surely there is some best standard for average designers.

Phil
Albert Wiersch
Site Admin
Posts: 3785
Joined: Sat Dec 11, 2004 9:23 am
Location: Near Dallas, TX

Post by Albert Wiersch »

Hi Phil,

I would not recommend using UTF-8 unless you actually need it.

A good source for information on character sets is http://www.wikipedia.org/. You can look up UTF-8 and get some good info on it.

Even if you don't use UTF-8, you can still use those character entities you mentioned. They should render fine in browsers -
&copy;
&reg;
&trade;
&amp;

However, &emdash; and &endash; should be &mdash; and &ndash;.
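
For anyone who wants to double-check an entity name before blaming the validator, here is a minimal sketch in Python. It relies on the standard library's `html.entities.name2codepoint` table, which lists the HTML 4 entity names (the same set XHTML 1.0 uses). The helper name `check_entity` is made up for this example.

```python
from html.entities import name2codepoint


def check_entity(name: str) -> str:
    """Return a short verdict for a named entity such as 'mdash'."""
    if name in name2codepoint:
        # Known name: report the Unicode code point it stands for.
        return f"&{name}; is defined (U+{name2codepoint[name]:04X})"
    return f"&{name}; is not a defined HTML entity"


# The names from this thread, including the two misspelled ones.
for name in ["copy", "reg", "trade", "amp", "mdash", "ndash", "emdash", "endash"]:
    print(check_entity(name))
```

Run as-is, it should report `emdash` and `endash` as undefined and the other six as defined.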
Last edited by Albert Wiersch on Tue Sep 12, 2006 3:05 pm, edited 1 time in total.
Albert Wiersch, CSS HTML Validator Developer
MikeGale
Rank VI - Professional
Posts: 726
Joined: Mon Dec 13, 2004 1:50 pm
Location: Tannhauser Gate

Post by MikeGale »

Hi Phil,

UTF-8 is a great idea and will, I believe, ultimately become very common.

At present the world is full of editors and text-changing tools that either do not understand it or get it wrong (as you've found).
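
To make "get it wrong" concrete, here is a rough Python sketch of what can happen when a tool reads UTF-8 bytes as if they were Windows-1252. The sample text and the choice of cp1252 as the wrong encoding are assumptions for illustration; the tools discussed in this thread may fail in other ways.

```python
# Sample text: copyright sign, a year, an em dash, a name.
text = "\u00a9 2006 \u2014 Phil"

# What a UTF-8 aware editor writes to disk.
utf8_bytes = text.encode("utf-8")

# What a tool sees if it assumes the file is Windows-1252 instead.
misread = utf8_bytes.decode("cp1252")
print(ascii(misread))  # each special character has turned into two or three junk characters

# Undoing the wrong decode recovers the original, as long as the tool
# did not "fix" or re-save the text in between.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == text)
```

The damage is reversible only while the bytes are untouched. Once a tool saves the misread text, the junk characters become the real content of the file.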

I find it perfectly possible to use UTF-8 where you have control and the content is not subject to a variety of editors.

In areas where different tools are used, and maybe several people do what they want, you are likely in trouble. (I've built software tools that take such material from a limited number of sources and make it right. For casual use I would not suggest going down that road, yet.)

You've got to go in with your eyes open; it's not for everybody.

I have run a lot of tests over a variety of tools. One conclusion was that numeric entities (as opposed to the alphabetic entities that you mention) are more robust and less prone to being messed about. (It's unfortunate, because the alphabetically named entities are easier to read.)
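
If you want to try the numeric route, a small script can do the conversion for you. The following is only a sketch, assuming Python and its standard `html.entities` table. The function names `to_numeric` and `ascii_safe` are invented for this example, and it leaves the XML built-ins (`&amp;`, `&lt;` and friends) and any unknown names untouched.

```python
import re
from html.entities import name2codepoint

# Matches named references such as &copy; (numeric ones like &#169; do not match).
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Entities predefined by XML itself; these are safe to leave as names.
KEEP = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric(markup: str) -> str:
    """Replace known named entities with decimal numeric references."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in KEEP or name not in name2codepoint:
            return match.group(0)  # leave built-ins and unknown names alone
        return f"&#{name2codepoint[name]};"
    return ENTITY.sub(repl, markup)


def ascii_safe(text: str) -> str:
    """Turn raw non-ASCII characters into numeric references as well."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric("&copy; &mdash; &ndash; &reg; &trade; &amp;"))
print(ascii_safe("\u00a9 \u00ae \u2122"))
```

The result is plain ASCII, so it should survive whatever charset a tool decides to declare. Misspelled names such as `&emdash;` pass through unchanged, so they still need fixing by hand.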

As for Tidy, I haven't tested the latest version, though I tested what I view as the first four or five generations before I gave up. I found that it was actually impossible to achieve some effects that I wanted, because one setting did more than one thing. (I now use other tools, but for a big chunk of really junky markup, which is extremely common, I sometimes use Tidy; its output then needs fixing up to remove the un-tidy that it emits!) If you use the tool, get to know the settings and make sure that every tool you use has them set correctly.

That's enough for now. Summary: I find you can use UTF-8, but you do need to rigorously prevent some tools from touching your code. (Most people don't have the determination and temperament to bother.)