MadBlog
Monday 25 February 2008

Dear John…

Regarding your issue: yes, it's true that for a short series of patches you are often asked to rebase it onto a clean state if it doesn't merge cleanly. But if you're doing complicated work, meaning dozens of patches for example, two things are usually true:

  • you're a regular contributor;
  • you spend quite some time working on it.

When this happens, upstreams are usually okay with merging from a public repository that you would set up. It is, however, a little more work for upstream to deal with a new remote repository, so this is only done in such cases.

Also note that upstream could really fake the same work by branching off from the point you branched off from, and running git am from that point, instead of running git am -3 on top of the current devel branch (which is similar to a rebase, hence creates new sha1s).

IOW, it's not really a git deficiency, as git has the features (even if git-format-patch could maybe annotate _more_ where a series comes from, and git-am could grok that to re-create a topic branch from it). It's just that git probably doesn't make it easy enough, and that people usually don't care enough for very short series.
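The git am route described above can be sketched as follows. This is only a rough illustration: apply_series, its arguments and the branch names are hypothetical, and it assumes the patches were produced by git format-patch and that upstream knows the commit you branched from.

```shell
#!/bin/sh
# Sketch: re-create a contributor's topic branch from mailed patches,
# then merge it, instead of applying the series on top of devel.
# apply_series and all the names passed to it are hypothetical.
apply_series() {
    base=$1       # commit the contributor branched off from
    topic=$2      # name of the topic branch to re-create
    patches=$3    # directory holding the git format-patch output
    target=$4     # integration branch, e.g. devel

    git checkout -q -b "$topic" "$base" &&
    git am -q "$patches"/*.patch &&
    git checkout -q "$target" &&
    # --no-ff keeps the topic visible as a merge in the history
    git merge -q --no-ff -m "Merge branch '$topic'" "$topic"
}
```

Something like apply_series "$base" topic ./patches devel would then leave devel with one merge commit whose second parent is the re-created topic branch.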

Tuesday 19 February 2008

git branches

Eddy, in git, branches are lightweight not because you can switch between them in the same working copy, but because they are cheap to create. Not cheap in the SVN sense at all: a branch in git is a name and 40 hexadecimal characters, aka the sha1 of the commit object the branch is at.

Branching is just sticking a new name on a node of your commit DAG[1]. Branches are stickers, only that. Once you know that, and it's central in git, then you'll easily understand that committing is just adding a new object to the DAG, and moving the sticker to that new position. Merging is just adding a new "void[2]" object that has two parents, and moving your sticker onto it. And so on …

Of course, to prevent you from shooting yourself in the foot, git's high level commands ensure that the moves you force your "sticker" to do are legit ones, aka ones that go from a position that is an ancestor of the new one. Otherwise you're not creating a continuous history but making up parallel worlds :)

IOW git branch new-branch basically does:

   $ git rev-parse HEAD > .git/refs/heads/new-branch

git rev-parse HEAD prints the sha1 of your current HEAD of course :) I bet you cannot be more lightweight than that, and no, compared to that, svn branches are monsters: they require a central server :D
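To see the sticker for yourself, here is a small sketch. show_branch is a hypothetical helper to be run inside a git repository, and it assumes the default setup: sha1 object names and a ref that has not been packed yet.

```shell
#!/bin/sh
# Sketch: a freshly created branch is a name plus the sha1 of a commit.
# show_branch is a hypothetical helper; run it inside a git repository.
show_branch() {
    name=$1
    git branch "$name" &&
    git rev-parse HEAD &&                  # sha1 of the current commit
    git rev-parse "refs/heads/$name" &&    # the new branch: the same sha1
    # The sticker itself: 40 hex characters and a newline. After
    # `git pack-refs` it would live in .git/packed-refs instead.
    cat "$(git rev-parse --git-dir)/refs/heads/$name"
}
```

Calling show_branch new-branch should print the same sha1 three times.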

Notes

[1] Directed Acyclic Graph

[2] unless there is a conflict of course

Tuesday 5 February 2008

mixed utf-8 and 8bit charset foo

To deinterlace utf-8 from an 8bit encoding (say it's latin1, but it works the same with any 8bit encoding), in data that may have been re-encoded to utf-8 multiple times, there is a very simple tool to write.

Let's assume that you have somewhere a function void charset_to_utf8(FILE *, int c) that takes a FILE * and writes to it the utf-8 encoding of the character c. Then write something that roughly looks like this:

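   #include <stdio.h>
   
   /* provided elsewhere: writes the utf-8 encoding of the $charset character c */
   void charset_to_utf8(FILE *, int c);
   
   /* length of the valid utf-8 sequence starting at s, 0 if there is none */
   static int utf8_wclen(const unsigned char *s, int maxlen)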
   {
       static char const utf8_len[32] = {
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0,
       };
   
       int trail = utf8_len[(*s++) / 8];
       if (trail > maxlen)
           return 0;
       switch (trail) {
         case 4: if ((*s++ & 0xc0) != 0x80) return 0;
         case 3: if ((*s++ & 0xc0) != 0x80) return 0;
         case 2: if ((*s++ & 0xc0) != 0x80) return 0;
         case 1: return trail;
         default: return 0;
       }
   }
   
   int charset_utf8_deinterlace(FILE *f, void *data, int dlen)
   {
       const unsigned char *s = data;
       int pos = 0, res = 0;
   
       while (pos < dlen) {
           int wclen = utf8_wclen(s + pos, dlen - pos);
           if (wclen) {
               fwrite(s + pos, wclen, 1, f);
               pos += wclen;
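               /* plain ASCII doesn't count: only multi-byte sequences
                * tell us the data was (re)encoded as utf-8 */
               if (wclen > 1)
                   res = 1;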
           } else {
               /* assume its $charset */
               charset_to_utf8(f, s[pos++]);
           }
       }
       return res;
   }

On top of that you can easily build a tool that mmaps a file passed as an argument, prints a utf-8 clean file to stdout, and exits with a specific code when it met something that looked like valid utf-8[1]. Let's call that tool recode_to_utf8. Then, if you fear your data has been re-encoded multiple times, you need to do this:

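 #! /bin/sh
 # $charset is your 8bit charset, $still_utf8 the exit code recode_to_utf8
 # uses when it met valid utf-8: both have to be set for your setup.
 wrap_recode() {
   recode_to_utf8 "$1" > "$2"
   case $? in
     0) true;;
     "$still_utf8") false;;
     *) echo "WOOPS IO Error" 1>&2; exit 1;;
   esac
 }
 cp your_source dirty
 while ! wrap_recode dirty utf8_clean; do
   # undo one level of utf-8 encoding and try again
   iconv -f UTF-8 -t "$charset" utf8_clean > dirty
 done
 rm -f dirty
 # result in utf8_clean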

This is all very sketchy, but I've never found a tool that does the job properly, and it's quite simple to derive tools from the methods above. Note that it assumes that it's highly unlikely that a valid sequence of characters in your original charset also forms a valid utf-8 sequence, which for text is usually true (at least in latin1).

Notes

[1] you want the code to be specific to catch IO errors that I ignored for the sake of shortness, the post being quite long already

Monday 4 February 2008

New git and CLI

Yes Junichi, this new release explicitly deprecates the dashed form of git commands, as they will probably go away from the $PATH for many good reasons:

  • the shell completions are way better now, hence the need for git-* being in /usr/bin has been gone for quite some time;
  • the number of git-<tab> completions scares the hell out of new users. Let's do like bzr (that has more commands than git, and many undocumented ones unlike git, whatever they say: they just hide them) and not publicize our guts for nothing (though everything will always remain documented);
  • there are some issues with dashed commands being in the $PATH that generate misfeatures: for example, if you have an ssh command in $HOME/bin it won't be picked up, because git puts the directory where the dashed commands live at the front of the $PATH (to avoid version mismatches), hence you pick /usr/bin/ssh again (hi dato!);
  • we may not have a 1:1 relation between git commands and dashed versions for the rest of git's life;
  • aliases (that you can configure in $HOME/.gitconfig) already don't work with the dashed form for obvious reasons, so it's better not to use the dash, for consistency's sake.

So we gave our users a pretty good timeframe to adapt their scripts, hence the deprecation notice.

Also note that best practices about scripting are now documented in the gitcli(5) man page I wrote.
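
To illustrate the point about aliases made above, here is a small sketch. The alias name lg and the helper alias_demo are hypothetical, and the alias is set in the repository's own config rather than in $HOME/.gitconfig, so it has to be run inside a git repository.

```shell
#!/bin/sh
# Sketch: aliases are expanded by the `git` wrapper itself, so only
# the dashless form knows about them.
# alias_demo and the alias name `lg` are hypothetical.
alias_demo() {
    git config alias.lg 'log --oneline' &&
    git lg -1 &&                      # works: git expands the alias
    if git-lg -1 2>/dev/null; then    # no such program in the $PATH
        echo "git-lg: found"
    else
        echo "git-lg: not found"
    fi
}
```

A script that sticks to the dashless form behaves the same whether a command is a builtin, an external program or an alias.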