Wednesday, 27 February 2019

awk - git commit trigger to block byte order mark



I'm on Windows and sometimes edit files with Notepad, which likes to put a BOM (EF BB BF) at the start of the file. It's easy to overlook this in the diff and commit a Python file with such a BOM to Git, which I've found will not work on Mac.



I want to create a commit trigger that removes the BOM before committing, or at least rejects the commit.



The best I've come up with is the script below which I put in 'pre-commit'. It removes any BOM but only after the commit, so then I have to make a second commit.



#!/bin/sh

git diff --cached --diff-filter=ACMR --name-only -z *.py | xargs -0 -n 1 sh -c '
for FILE; do
sed -b -i -e "1s/^\xEF\xBB\xBF//" "$FILE"
done
' sh


I tried to use sed's grouped commands and 'q' like this, so that the exit code would be 1 if the BOM matched, but it doesn't work.



#!/bin/sh

git diff --cached --diff-filter=ACMR --name-only -z *.py | xargs -0 -n 1 sh -c '
for FILE; do
sed -b -i -e "1 /^\xEF\xBB\xBF/ {s/^\xEF\xBB\xBF//;q1};q0" "$FILE"
done
' sh


Can someone help fix it?


Answer



You're on the right track.




A good general rule for a pre-commit hook is not to modify the index contents (i.e., "don't change the commit or work dir, nor even try") but rather just to fail the commit, so your second block of code is probably closer—but you're still modifying the files. You can do this if you want, and you can even git add them as well if you really want. It's just generally not a great idea: it tends to be too surprising, and it does unexpected things with carefully staged versions that deliberately differ from the work-directory versions (as produced by git add -p for instance).



You also have two options here: you could inspect only new and modified files (which is what your --diff-filter is for); or you could inspect every file in the index. If you'd like to allow any existing (but unmodified) file to retain an existing Unicode-BOM you definitely want the new-and-modified-only method, so let's stick with that. I'll retain the *.py as well, but we want to protect it from the shell so that it uses git's idea of files whose name ends with .py, not the shell's. In particular, that means that if some .py files exist in the index—and will therefore be committed, if the commit proceeds—but are not in the work directory, they will get checked.
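To see why the quotes around *.py matter, here is a small sketch that needs no repository at all. The scratch directory and the file name present.py are made up for illustration; the only point is what the shell hands to a command in each case.

```shell
#!/bin/sh
# Sketch: how the shell treats an unquoted vs. a quoted *.py argument.
# The scratch directory and the file name are hypothetical.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
: > present.py            # the only .py file in the current directory

# Unquoted: the shell expands the glob before the command runs, so the
# command only ever sees names that exist in this directory right now.
set -- *.py
unquoted=$*

# Quoted: the pattern reaches the command untouched, which lets git
# match it against paths in the index instead.
set -- '*.py'
quoted=$*

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"

cd / && rm -rf "$scratch"
```

With the unquoted form, a .py file that is staged but missing from the work directory would never be named on the command line; with the quoted form, git decides which paths match.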



We can simplify the diff filter a bit, by adding --no-renames to the diff command so that R status cannot occur. We also know that C should not occur since we did not supply any -C or --find-copies-harder options. Thus, we start with:



git diff --cached --no-renames --diff-filter=AM --name-only -- '*.py'



I've taken out the -z: -z is good if we can use xargs -0, but I'm planning to read the file names one at a time instead, since most of these commands really only work on one file at a time. (It's possible to do that with xargs too, but if none of your file names contains a newline, we'll be OK without it.) The -- separates diff options from paths (this seems like it should not be required, but see comments below; and it's generally a good idea anyway).
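For comparison, the xargs -0 route might look roughly like the sketch below. It is not the approach used in the rest of this answer. The function name check_list and the sample files are hypothetical, the sketch reads work-directory files rather than the index so that it runs without a repository, and it assumes a head that accepts -c (GNU, BSD and busybox do).

```shell
#!/bin/sh
# Sketch: consume a NUL-separated path list (the format "git diff -z"
# emits) with xargs -0, so that even newlines in names are safe.
# check_list and the sample file names are hypothetical.

check_list() {
    # Reads NUL-separated paths on stdin; fails if any named file
    # begins with the bytes EF BB BF.
    xargs -0 sh -c '
        status=0
        for path; do
            # A real hook would read the staged blob: git show ":$path"
            first3=$(head -c 3 "$path" | od -An -tx1 | tr -d " \n")
            if [ "$first3" = "efbbbf" ]; then
                echo "Error: file $path starts with Unicode BOM." >&2
                status=1
            fi
        done
        exit $status
    ' sh
}

dir=$(mktemp -d)
printf '\357\273\277print("hi")\n' > "$dir/with bom.py"
printf 'print("hi")\n' > "$dir/clean.py"

rc=0
printf '%s\0' "$dir/with bom.py" "$dir/clean.py" | check_list || rc=$?
echo "exit status: $rc"     # nonzero, because one file carries a BOM

rm -rf "$dir"
```

xargs reports a nonzero status when any invocation of the inner sh fails, so the failure still reaches the hook's caller.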



This produces a list of files to be inspected, so now let's inspect (but not edit) them. If you're on Windows, you may need to modify the below to use whatever limited tools you have; since I'm always on a Linux or Unix box I use head -1 to get the first line, and grep to check for the BOM:



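#! /bin/sh
# Reject the commit if any added or modified .py file in the index starts with a BOM.
git diff --cached --no-renames --diff-filter=AM --name-only -- '*.py' |
(status=0; while IFS= read -r path; do   # -r keeps backslashes in names intact
    # Inspect the staged blob, not the work-directory file.
    if git show ":$path" | head -1 | grep $'^\xEF\xBB\xBF' >/dev/null; then
        echo "Error: file '$path' starts with Unicode BOM." >&2
        status=1
    fi
done
exit $status)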


Here are the various tricks:




  • We set IFS to nothing during the read, so that leading or trailing white space in a path name is not stripped. (For methods that work with -z, and hence handle newlines too, see Etan Reisner's comments below.)

  • We use git show ":$path" to extract the version of the file that's in the index. This may (as with git add -p, for instance) differ from the version of the file in the work-directory.


  • We use head -1 to discard all but the first line.

  • We use grep to check for the BOM, which we make with a shell string expansion ($'...'; this works in bash, but a stricter /bin/sh may not support it), with grep's output directed to /dev/null so that it doesn't show up (grep -q also works but only if that particular grep supports -q).

  • We go on to check all listed files, even if some have a BOM.

  • To work around the shell's subshell-action with pipes (cmd | while ... runs the while in a sub-shell), we set the status in an explicit (parenthesized) sub-shell and exit that sub-shell with that status. That propagates the sub-shell's status—success if no BOMs, failure if some—up to the main shell, where it can become the result of the git hook.
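The last trick is easy to try without git. The sketch below feeds two made-up lines into a loop, first the naive way and then with the parenthesized sub-shell; the variable names naive and grouped are only for illustration.

```shell
#!/bin/sh
# Sketch: why the loop and the final exit share one parenthesized sub-shell.

# Naive form: in shells that run every pipeline stage in a sub-shell
# (dash, and bash by default), the assignment inside the loop is lost.
# Some shells (ksh, zsh) run the last stage in the current shell instead,
# so this value is not portable.
status=0
printf 'a\nb\n' | while IFS= read -r line; do status=1; done
naive=$status

# Grouped form: status is set and read inside the same sub-shell, and the
# sub-shell's exit code carries the verdict back to the main shell.
grouped=0
printf 'a\nb\n' | (status=0; while IFS= read -r line; do
    [ "$line" = "b" ] && status=1
done
exit $status) || grouped=$?

echo "naive status:   $naive"
echo "grouped status: $grouped"
```

Only the grouped form gives the same answer in every shell, which is why the hook uses it.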



Note: the above is not tested as a complete hook (though I believe it's correct).

