Mixed UTF-8 and 8-bit charset foo
To deinterlace UTF-8 from an 8-bit encoding mixed into the same data (say it's Latin-1, but it works the same with any 8-bit encoding), including data that may have been re-encoded to UTF-8 several times, there is a very simple tool to write.
Let's assume that you have somewhere a function void charset_to_utf8(FILE *, int c) that takes a FILE * and writes to it the UTF-8 encoding of the character c of your charset. Then write something that roughly looks like this:
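#include <stdio.h>

/* Length in bytes of the UTF-8 sequence starting at s, or 0 if the bytes
 * at s do not form one (bad lead byte, bad continuation, or truncated). */
static int utf8_wclen(const unsigned char *s, int maxlen)
{
    /* Sequence length, indexed by the top 5 bits of the lead byte. */
    static char const utf8_len[32] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0,
    };
    int len = utf8_len[(*s++) / 8];

    if (len > maxlen)
        return 0;
    /* Each case checks one continuation byte, then falls through. */
    switch (len) {
    case 4: if ((*s++ & 0xc0) != 0x80) return 0; /* fall through */
    case 3: if ((*s++ & 0xc0) != 0x80) return 0; /* fall through */
    case 2: if ((*s++ & 0xc0) != 0x80) return 0; /* fall through */
    case 1: return len;
    default: return 0;
    }
}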
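/* Provided elsewhere: writes the UTF-8 encoding of the $charset byte c to f. */
void charset_to_utf8(FILE *f, int c);

/* Writes data to f as clean UTF-8. Returns 1 if a multi-byte UTF-8
 * sequence was found in the input, 0 otherwise. */
int charset_utf8_deinterlace(FILE *f, void *data, int dlen)
{
    const unsigned char *s = data;
    int pos = 0, res = 0;

    while (pos < dlen) {
        int wclen = utf8_wclen(s + pos, dlen - pos);

        if (wclen) {
            /* already UTF-8: copy it from the current position */
            fwrite(s + pos, wclen, 1, f);
            pos += wclen;
            /* plain ASCII is no evidence of UTF-8, only count multi-byte */
            if (wclen > 1)
                res = 1;
        } else {
            /* assume it's $charset */
            charset_to_utf8(f, s[pos++]);
        }
    }
    return res;
}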
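For reference, here is a minimal sketch of what charset_to_utf8 could look like when $charset is Latin-1 (ISO-8859-1). It relies on the fact that Latin-1 bytes map one-to-one onto the code points U+0000 to U+00FF; the helper name latin1_to_utf8 is made up for the example, and any other 8-bit charset (Windows-1252 included) would need a lookup table instead.

```c
#include <stdio.h>

/* Hypothetical helper: encodes one Latin-1 byte as UTF-8 into out[].
 * Returns the number of bytes written (1 or 2). */
static int latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {                 /* ASCII is unchanged */
        out[0] = c;
        return 1;
    }
    /* U+0080..U+00FF always take two bytes: 110000xx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}

/* Latin-1 flavour of the function assumed by the post. */
void charset_to_utf8(FILE *f, int c)
{
    unsigned char buf[2];
    int len = latin1_to_utf8((unsigned char)c, buf);

    fwrite(buf, 1, (size_t)len, f);
}
```

With this, the Latin-1 byte 0xE9 ('é') comes out as the two bytes 0xC3 0xA9.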
You can easily build on this a tool that mmaps the file passed as an argument, prints a clean UTF-8 file to stdout, and exits with a specific code when it met something that looked like valid UTF-8[1]. Let's call that tool recode_to_utf8. Then, if you fear that your data has been re-encoded several times, you need to do this:
#! /bin/sh
# $charset and $still_utf8 (the exit code recode_to_utf8 uses when it met
# UTF-8) are assumed to be set beforehand.

wrap_recode() {
    recode_to_utf8 "$1" > "$2"
    case $? in
        0) true;;
        "$still_utf8") false;;
        *) echo "WOOPS IO Error" 1>&2; exit 1;;
    esac
}

cp your_source dirty
while ! wrap_recode dirty utf8_clean; do
    iconv -f utf8 -t "$charset" utf8_clean -o dirty
done
rm -f dirty
# result in utf8_clean
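The recode_to_utf8 tool itself is not shown in this post; the mmap part could be sketched roughly as follows. Everything here is an assumption made for the example: the name recode_path, the three exit codes (the script's $still_utf8 would have to be 2 to match), and the choice to pass the deinterlacing routine as a callback so that the sketch compiles on its own.

```c
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical exit codes; 1 lands in the script's "IO Error" branch. */
enum { RECODE_CLEAN = 0, RECODE_IO_ERROR = 1, RECODE_STILL_UTF8 = 2 };

/* Same signature as charset_utf8_deinterlace(). */
typedef int (*deinterlace_fn)(FILE *f, void *data, int dlen);

int recode_path(const char *path, FILE *out, deinterlace_fn deinterlace)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    int res;

    if (fd < 0)
        return RECODE_IO_ERROR;
    if (fstat(fd, &st) < 0 || st.st_size > INT_MAX) {
        close(fd);
        return RECODE_IO_ERROR;
    }
    if (st.st_size == 0) {          /* mmap() rejects zero-length mappings */
        close(fd);
        return RECODE_CLEAN;
    }

    void *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return RECODE_IO_ERROR;

    res = deinterlace(out, data, (int)st.st_size);
    munmap(data, (size_t)st.st_size);

    /* Catch the write errors that the listing ignores. */
    if (fflush(out) != 0 || ferror(out))
        return RECODE_IO_ERROR;
    return res ? RECODE_STILL_UTF8 : RECODE_CLEAN;
}
```

A main() would then boil down to checking argc and returning recode_path(argv[1], stdout, charset_utf8_deinterlace).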
This is all very sketchy, but I've never found a tool that does the job properly, and it's quite simple to derive tools from the method above. Note that it assumes that a valid sequence of characters in your original charset is highly unlikely to also form a valid multi-byte UTF-8 sequence, which is usually true for text (at least in Latin-1).
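To get a feel for that assumption in the Latin-1 case, here is a small sketch of the two-byte test that utf8_wclen performs (the function name is made up for the example). A pair of Latin-1 characters is mistaken for UTF-8 only when a byte in 0xC0..0xDF (mostly uppercase accented letters, plus '×' and 'ß') is directly followed by a byte in 0x80..0xBF (control characters and symbols such as '¡', '»' or '¿'), which rarely happens in running text.

```c
/* Returns 1 when the Latin-1 byte pair (a, b) would also pass the
 * 2-byte UTF-8 check of the listing above, 0 otherwise. */
static int latin1_pair_looks_like_utf8(unsigned char a, unsigned char b)
{
    int lead_ok  = (a & 0xE0) == 0xC0;   /* 0xC0..0xDF */
    int trail_ok = (b & 0xC0) == 0x80;   /* 0x80..0xBF */

    return lead_ok && trail_ok;
}
```

An ordinary pair like "ét" (0xE9 0x74) or "Ét" (0xC9 0x74) is rejected, but "É»" (0xC9 0xBB), as in a word in capitals right before a closing guillemet, is accepted and would be copied through as if it were UTF-8.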
Notes
[1] You want that exit code to be specific so that you can still catch the IO errors, which I ignored here for the sake of brevity, the post being quite long already.
