MadBlog
Monday 25 February 2008

Dear John…

Regarding your issue: yes, it's true that for a short series of patches you are often asked to rebase it onto a clean state if it doesn't merge cleanly. If you're doing complicated work, though (dozens of patches, say), it usually means two things:

  • you're a regular contributor;
  • you spend quite some time working on it.

When this happens, upstream is usually okay with merging from a public repository that you set up. It is a bit more work for upstream to deal with a new remote repository, though, so this is only done in those cases.

Also note that upstream could reproduce the same history without your repository, by creating a branch at the point you branched off from, running git-am there (and then merging that branch), instead of running git am -3 on top of the current devel branch (which is similar to a rebase, and hence creates new SHA-1s).

IOW, it's not really a git deficiency: git has the features, it just probably doesn't make this easy enough, and people usually don't care enough for very short series. (git-format-patch could maybe annotate _more_ about where a series comes from, and git-am could grok that to re-create a topic branch from it.)

Tuesday 5 February 2008

mixed utf-8 and 8bit charset foo

There is a very simple tool to write for "deinterlacing" a file that mixes utf-8 with an 8-bit encoding (say it's latin1, but it works the same with any 8-bit encoding), even when the data may have been re-encoded to utf-8 multiple times.

Let's assume that you have somewhere a function void charset_to_utf8(FILE *, int c) that takes a FILE * and writes to it the utf-8 encoding of the character c from your charset. Then write something that roughly looks like this:

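   #include <stdio.h>

   void charset_to_utf8(FILE *, int c);

   /* Return the byte length of the valid utf-8 sequence starting at s,
    * or 0 if there is none within maxlen bytes. */
   static int utf8_wclen(const unsigned char *s, int maxlen)
   {
       /* sequence length, indexed by the top 5 bits of the lead byte */
       static char const utf8_len[32] = {
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0,
       };

       int trail = utf8_len[(*s++) / 8];
       if (trail > maxlen)
           return 0;
       /* each case checks one continuation byte, then falls through */
       switch (trail) {
         case 4: if ((*s++ & 0xc0) != 0x80) return 0;
         case 3: if ((*s++ & 0xc0) != 0x80) return 0;
         case 2: if ((*s++ & 0xc0) != 0x80) return 0;
         case 1: return trail;
         default: return 0;
       }
   }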
   
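   /* Write data to f as clean utf-8. Return 1 if the input already
    * contained a multi-byte utf-8 sequence, 0 otherwise. */
   int charset_utf8_deinterlace(FILE *f, void *data, int dlen)
   {
       const unsigned char *s = data;
       int pos = 0, res = 0;

       while (pos < dlen) {
           int wclen = utf8_wclen(s + pos, dlen - pos);
           if (wclen) {
               /* already valid utf-8 (or plain ASCII): copy it as is */
               fwrite(s + pos, wclen, 1, f);
               pos += wclen;
               /* ASCII is the same in both encodings, so only a
                * multi-byte sequence counts as "saw utf-8" */
               if (wclen > 1)
                   res = 1;
           } else {
               /* assume it's $charset */
               charset_to_utf8(f, s[pos++]);
           }
       }
       return res;
   }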
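
The charset_to_utf8 helper is only assumed to exist in the text above. For latin1 it is short enough to spell out, because latin1 bytes map directly to the Unicode code points U+0000 to U+00FF. What follows is a minimal sketch under that assumption; any other 8-bit charset would need a lookup table from bytes to code points instead.

```c
#include <stdio.h>

/* Sketch of charset_to_utf8() for latin1 only (an assumption, the
 * post leaves the charset open). c is one byte of the 8-bit input. */
void charset_to_utf8(FILE *f, int c)
{
    c &= 0xff;                      /* keep only the 8-bit value */
    if (c < 0x80) {
        putc(c, f);                 /* ASCII is encoded as itself */
    } else {
        putc(0xc0 | (c >> 6), f);   /* lead byte: 110000xx */
        putc(0x80 | (c & 0x3f), f); /* continuation byte: 10xxxxxx */
    }
}
```

With this helper, the latin1 byte e9 (é) comes out as the two bytes c3 a9. Feeding those two bytes through it again, as if they were latin1, gives c3 83 c2 a9, which is exactly the kind of repeated re-encoding the loop further down is meant to undo.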

You can easily build on this a tool that mmaps a file passed as an argument, prints a utf-8 clean file to stdout, and exits with a specific code when it met something that already looked like valid utf-8[1]. Let's call that tool recode_to_utf8. Then, if you fear that your data has been re-encoded multiple times, you need to do this:

 #! /bin/sh
 # $charset: the original 8-bit charset (e.g. latin1)
 # $still_utf8: exit code of recode_to_utf8 when it saw valid utf-8
 wrap_recode() {
   recode_to_utf8 "$1" > "$2"
   case $? in
     0) true;;
     "$still_utf8") false;;
     *) echo "WOOPS IO Error" 1>&2; exit 1;;
   esac
 }
 cp your_source dirty
 while ! wrap_recode dirty utf8_clean; do
   iconv -f utf8 -t "$charset" utf8_clean -o dirty
 done
 rm -f dirty
 # result in utf8_clean
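
For completeness, here is one way the core of recode_to_utf8 might look. It is only a sketch: the name recode_file and the three exit code values are made up for illustration (the script only needs them to be distinct), and the deinterlacing function is passed in as a pointer purely so that the fragment compiles on its own.

```c
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical exit codes; only their being distinct matters. */
enum { RECODE_CLEAN = 0, RECODE_STILL_UTF8 = 1, RECODE_IO_ERROR = 2 };

/* Same signature as charset_utf8_deinterlace() in the post. */
typedef int (*deinterlace_fn)(FILE *f, void *data, int dlen);

/* mmap the file at path, write its utf-8 clean form to out, and
 * return the exit code the calling tool should use. */
int recode_file(const char *path, FILE *out, deinterlace_fn deinterlace)
{
    struct stat st;
    void *data;
    int res;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return RECODE_IO_ERROR;
    if (fstat(fd, &st) < 0 || st.st_size > INT_MAX) {
        close(fd);
        return RECODE_IO_ERROR;
    }
    if (st.st_size == 0) {          /* mmap refuses empty mappings */
        close(fd);
        return RECODE_CLEAN;
    }
    data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the fd */
    if (data == MAP_FAILED)
        return RECODE_IO_ERROR;

    res = deinterlace(out, data, (int)st.st_size);
    munmap(data, (size_t)st.st_size);

    /* report write errors separately from the "saw utf-8" result */
    if (fflush(out) != 0 || ferror(out))
        return RECODE_IO_ERROR;
    return res ? RECODE_STILL_UTF8 : RECODE_CLEAN;
}
```

A main() would then boil down to return recode_file(argv[1], stdout, charset_utf8_deinterlace); and, with the values assumed here, the script would set still_utf8=1.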

This is all very sketchy, but I've never found a tool that does the job properly, and it's quite simple to derive tools from the methods above. Note that it assumes that a valid sequence in your original charset is highly unlikely to also form a valid utf-8 sequence, which for text is usually true (at least in latin1, where for instance a literal "Ã©", bytes c3 a9, would be mistaken for a utf-8 "é").

Notes

[1] you want that exit code to be distinct so that you can still catch IO errors, which I ignored here for the sake of brevity, the post being quite long already