In this article I describe a (bash & Perl) script I’ve recently created to
perform a git blame on the lines matching a git grep, along with the concepts,
pitfalls, and solutions I’ve found along the way.
The tool below does roughly what git grep --blame ought to do if it existed;
in other words, it does git grep with blame.
As an aside, if you’re trying to find whose beautiful code it was you’re
looking for, you might prefer to create an alias for git praise, i.e.:
$ git config --global alias.praise blame # Who wrote this most beautiful piece of.. code?
… and then start git praise’ing away. It’s a nice feeling.
What I want to do:
$ git grepblame '2019'
f0f0f0f0f0 year.txt 4 (Marco Fontani 2019-01-01 00:00:01 UTC) 2019
f0f0f0f0f0 copyright.txt 4 (Marco Fontani 2019-01-01 00:00:01 UTC) Copyright (c) 2019 mf@marcofontani.it
$ git grepblame 'no critic'
f0f0f0f0f0 foo.pl 4 (Marco Fontani 2019-01-01 00:00:01 UTC) ## no critic (ProhibitBacktickOperators)
… which is a combination of git grep; and then git blame on those lines.
At a first approximation, I would want the script to:
- run git grep for the pattern (and with the options) given, with line numbers
- analyse the output (file name, line number, text at that line), and run a git blame on that file, at that line
- print (the file and line number and) all those git blame lines
Simple, right? And so thought I.
I’ve launched a podman container to create a new Git repository so I could
play with things. You can do something similar if you’d like to follow along.
I’ve just installed git in it, and configured it to use my own system’s Git
configuration for user name and email; created a Git repository in /tmp/repo,
and added a text file.
$ podman run -it --rm debian:sid-slim bash -c "(apt update;apt install -y git;git config --global user.email '$(git config user.email)';git config --global user.name '$(git config user.name)')>/dev/null 2>&1; bash"
root@8f27bd46a4fd:/# mkdir -p /tmp/repo
root@8f27bd46a4fd:/# cd !$
cd /tmp/repo
root@8f27bd46a4fd:/tmp/repo# git init .
Initialized empty Git repository in /tmp/repo/.git/
root@8f27bd46a4fd:/tmp/repo# echo foo | tee foo.txt
foo
root@8f27bd46a4fd:/tmp/repo# git add !$
git add foo.txt
root@8f27bd46a4fd:/tmp/repo# git commit -am "add !$"
git commit -am "add foo.txt"
[master (root-commit) 8de25cc] add foo.txt
1 file changed, 1 insertion(+)
create mode 100644 foo.txt
root@8f27bd46a4fd:/tmp/repo#
The first step is running git grep and analysing its output, without making
mistakes: I need to get the file name, the line number, and optionally the
content of that file at that line.
That’s easy enough: git grep --line-number PATTERN gets us output which
contains the file name; a colon (:); the line number; another colon (:);
and lastly the content:
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
foo.txt:1:foo
So one could simply split that output “on :”, and the first item would be the
file name; the second the line number; and then the content. Easy enough with
awk:
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <foo.txt> Line <1> Content <foo>
Well, maybe not like that if you care about capturing the content, as the
content could well contain a :…
root@8f27bd46a4fd:/tmp/repo# echo foo:bar | tee foobar.txt
foo:bar
root@8f27bd46a4fd:/tmp/repo# git add !$
git add foobar.txt
root@8f27bd46a4fd:/tmp/repo# git commit !$ -m 'add file with colon in contents'
git commit foobar.txt -m 'add file with colon in contents'
[master 7b3a9f2] add file with colon in contents
1 file changed, 1 insertion(+)
create mode 100644 foobar.txt
… at which point the previous awk
command would output the wrong thing:
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
foo.txt:1:foo
foobar.txt:1:foo:bar
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <foo.txt> Line <1> Content <foo>
File <foobar.txt> Line <1> Content <foo>
… but the content doesn’t matter: all that git blame requires in order to
report on that file at that line is the file name and the line number.
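As an aside, should the content be needed after all, one way to keep it whole is to let the shell do the splitting: with read, the last variable receives the rest of the line, colons included. Below is a minimal sketch; the function name split_grep_line is made up for illustration, and the input is assumed to be git grep --line-number output for file names containing neither colons nor newlines.

```shell
# Hypothetical helper: split "file:line:content" records read from stdin.
# Assumption: file names contain neither ":" nor newlines.
split_grep_line() {
    local file lineno content
    # Only the first two colons split; "content" gets the remainder verbatim.
    while IFS=: read -r file lineno content; do
        printf 'File <%s> Line <%s> Content <%s>\n' "$file" "$lineno" "$content"
    done
}
```

It would be used as git grep --line-number f | split_grep_line. It only protects the content, though: it does nothing for awkward file names.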
But UNIX file names can contain all sorts of things: they can contain anything except:
- a forward slash, /, as it’s used to separate directories in UNIX
- a null byte, let’s call it/show it as ^@ (even though it’s two characters), as that’s what C uses to terminate strings
This means that in a Git repo there could well be a file whose name contains a
colon, and that throws a wrench into the above awk
call, too:
root@8f27bd46a4fd:/tmp/repo# echo foo | tee "foo:bar.txt"
foo
root@8f27bd46a4fd:/tmp/repo# git add !$
git add "foo:bar.txt"
root@8f27bd46a4fd:/tmp/repo# git commit !$ -m 'add file with colon in name'
git commit "foo:bar.txt" -m 'add file with colon in name'
[master 4bb03ee] add file with colon in name
1 file changed, 1 insertion(+)
create mode 100644 foo:bar.txt
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
foo.txt:1:foo
foo:bar.txt:1:foo
foobar.txt:1:foo:bar
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <foo.txt> Line <1> Content <foo>
File <foo> Line <bar.txt> Content <1>
File <foobar.txt> Line <1> Content <foo>
As one can see above, the “split on :” can’t work if the file name contains a
colon!
Not only that, but given that they can contain any character but slash and NULL, file names can also contain newlines, spaces, quotes and double quotes, which might be not-so-trivial things to look out for:
root@8f27bd46a4fd:/tmp/repo# echo foo | tee "$(printf "bar\nbaz.txt")"
foo
root@8f27bd46a4fd:/tmp/repo# git add !$
git add "$(printf "bar\nbaz.txt")"
root@8f27bd46a4fd:/tmp/repo# git commit !$ -m 'add file with newline in name'
git commit "$(printf "bar\nbaz.txt")" -m 'add file with newline in name'
[master d1b7f21] add file with newline in name
1 file changed, 1 insertion(+)
create mode 100644 "bar\nbaz.txt"
root@8f27bd46a4fd:/tmp/repo# ls
'bar'$'\n''baz.txt' foo.txt foo:bar.txt foobar.txt
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
bar
baz.txt:1:foo
foo.txt:1:foo
foo:bar.txt:1:foo
foobar.txt:1:foo:bar
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <bar> Line <> Content <>
File <baz.txt> Line <1> Content <foo>
File <foo.txt> Line <1> Content <foo>
File <foo> Line <bar.txt> Content <1>
File <foobar.txt> Line <1> Content <foo>
In fact, a file whose name contains newlines even throws a little wrench in
git grep’s output, too, as the records in the output aren’t cleanly
newline-separated any more!
What we need is a method to clearly distinguish, in git grep’s output:
- the file name, regardless of whether it contains spaces, dashes, quotes, double quotes or indeed one or more newlines
- the line number
- (optionally) the content, which (so long as it’s text-like) at least does end in a newline (but could contain \r if it’s been saved on Windows)
… noting that the content could well contain any character whatsoever: spaces, dashes, quotes, even null characters; but only one newline, the one which ends it.
Many standard UNIX tools allow their output to be separated by the NULL
character, ^@, as that’s the only character which can safely separate a list
of file names with paths: the only other character which can’t be used in a
file name is the /, and that is needed as the directory separator.
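To see why the separator matters, here is a small sketch comparing a newline-based count of file names with a NULL-based one. It assumes a find which supports -print0 (GNU and BSD find do); both function names are made up for illustration.

```shell
# Hypothetical helpers: count the regular files below a directory.
# Assumption: "find" supports -print0 (GNU and BSD find do).

# Naive: one line per file name; a newline inside a name is counted twice.
count_files_by_newline() {
    find "$1" -type f | wc -l | tr -d '[:space:]'
    echo
}

# Safer: each file name is followed by exactly one NUL byte, so count those.
count_files_by_nul() {
    find "$1" -type f -print0 | tr -cd '\0' | wc -c | tr -d '[:space:]'
    echo
}
```

In a directory holding foo.txt and a file whose name contains one newline, the first function reports three files and the second reports two.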
Luckily git grep has such an option:
-z, --null
Output \0 instead of the character that normally follows a file name.
The \0 (or ^@) separator is, “luckily”, output not only after the file name
but also after the line number, if the --line-number option is used:
root@8f27bd46a4fd:/tmp/repo# git grep --line-number --null f
bar
baz.txt^@1^@foo
foo.txt^@1^@foo
foo:bar.txt^@1^@foo
foobar.txt^@1^@foo:bar
… but if we have the above output we can’t use awk anymore, as the records
aren’t just split by NULL bytes, but also by newlines…
Note: one could well work with NULL-separated records in mawk and gawk
using awk 'BEGIN{RS="\0"}; { ... }', but that doesn’t work for all types of
awk.
We need something that’s just a little bit “smarter” to know that fields are separated by a couple NULL bytes, but then it’s any character until the newline; then a new record starts again.
I, as it often happens, chose Perl for this.
The first record looks something like this (using a verbose and commented regular expression):
my $first_record = qr{
\A # The start of the whole input
(?<filename>[^\0]+) # The filename, which is "anything but a NULL byte"
\0 # ... up to the null byte used as separator
(?<lineno>[1-9][0-9]*) # The line number. There is no line 0.
\0 # The null byte used as separator
(?<contents>[^\n]*) # The contents (possibly empty): any non-newline up to...
\n # ... the newline which ends the record
}xms;
… and assuming we then restart matching where the last match left off, all the records which follow will have the same structure.
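For comparison, the same record layout (file name, NULL, line number, NULL, content, newline) can also be walked in plain bash, since its read builtin accepts -d '' to read up to a NULL byte. This is only a sketch under a few assumptions: the function name is hypothetical, read -d is a bash extension rather than standard sh, and as bash variables can’t hold NULL bytes, content containing them would not survive intact.

```shell
# Hypothetical helper (bash only): parse "file\0line\0content\n" records
# read from stdin, as produced by "git grep --null --line-number".
parse_null_records() {
    local file lineno content
    # Two NUL-terminated fields, then the content up to the next newline.
    while IFS= read -r -d '' file &&
          IFS= read -r -d '' lineno &&
          IFS= read -r content; do
        printf 'File <%s> Line <%s> Content <%s>\n' "$file" "$lineno" "$content"
    done
}
```

It would be used as git grep --null --line-number f | parse_null_records. A file name containing a newline stays in one piece, as the first field only ends at the NULL byte.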
Having parsed that output, and confident that the filename and line number can
be reliably grabbed with the regular expression above, all that remains is to
call git blame
with the proper parameters - that is, the line number and the
file name.
The git blame command luckily accepts a -L MIN,MAX parameter to restrict
the output to the MIN-MAX lines given, so we can use -L N,N to restrict the
output to the one line we’re interested in.
By default, git blame does not output the name of the file it’s producing the
blame of, but we need it, as we need to know which file (and line) matched; one
can use the --show-name and --show-number options to show the file name and
line number, respectively.
All that remains is to figure out how to properly pass a file name to it, without breaking things when (not if) the file name contains characters which would otherwise have a different meaning: quotes, double quotes, newlines, etc.
As filenames can start with a dash (-), which is also the character used to
start an option, it’s a good idea to indicate the end of the options, and the
start of the file’s name, with a double dash (--). A double dash is a standard
method for saying “options are now finished; what follows are file names”.
Some tools allow one to kind of bypass this problem by ensuring that the file
names start with ./ (i.e. “the current directory”) but that won’t work well
in this case, as git-grep doesn’t output (and therefore “pass on”) the
leading ./.
A double dash is neater anyway.
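The effect of the double dash is easy to try with any option-parsing tool, not just Git. The sketch below uses cat on a file literally named -n, created in a throwaway directory; the function name is made up for illustration, and the comment about -n assumes a cat which has that option (GNU and BSD cat do).

```shell
# Hypothetical demo: read a file literally named "-n" from a scratch directory.
demo_double_dash() {
    local dir status
    dir="$(mktemp -d)" || return 1
    (
        cd "$dir" || exit 1
        printf 'hello\n' > ./-n   # create a file whose name is "-n"
        # Without "--", cat would take "-n" as its "number lines" option and
        # wait for standard input; with it, "-n" is just a file name.
        cat -- -n
    )
    status=$?
    rm -rf -- "$dir"
    return "$status"
}
```

Calling demo_double_dash prints the file’s single line, hello.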
As to the format in which to pass the file name… It’s usually a good idea to
quote a filename in single quotes, so that the shell won’t perform variable
expansion in it even if the content of the string contains a dollar (i.e.
after foo=bar, echo "'$foo'" echoes back 'bar', whereas echo '"$foo"'
outputs "$foo"). So a singly quoted string it must be.
But the file name may well contain a quote character, and the shell will complain (e.g. “file not found”) if we don’t quote the file name properly.
A lone quote character in a singly quoted string should be quoted in a bit of
an odd way: '\''. It means “end the singly quoted string; then output an
escaped quote character; then restart the singly quoted string”:
root@8f27bd46a4fd:/tmp/repo# echo 'It'\''s alive!'
It's alive!
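The same escaping can be done by a small shell function. The sketch below (the name quote_single is hypothetical) wraps its argument in single quotes and rewrites every embedded quote as '\''; it only illustrates the rule above.

```shell
# Hypothetical helper: print $1 as a single-quoted shell word.
quote_single() {
    local rest out head
    rest=$1
    out=
    while :; do
        case $rest in
            *\'*)
                head=${rest%%\'*}      # text before the first quote
                rest=${rest#*\'}       # text after the first quote
                out="$out$head'\\''"   # append the text, then the 4 chars '\''
                ;;
            *)
                out="$out$rest"        # no quotes left: append the remainder
                break
                ;;
        esac
    done
    printf "'%s'\n" "$out"
}
```

The printed word can be pasted back into a shell command, where it turns back into the original string.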
How about newlines? bash allows a newline to be similarly substituted (i.e.
via '$'\n''):
root@8f27bd46a4fd:/tmp/repo# cat 'bar'$'\n''baz.txt'
foo
… but a standard sh shell doesn’t; “luckily” a standard shell can just use
the actual newline character, too:
root@8f27bd46a4fd:/tmp/repo# cat "$(printf "bar\nbaz.txt")"
foo
… so we can just not care about escaping newlines after the double dash
inside the singly quoted string which contains the file name we want to run
git blame
on. Phew, that was a mouthful.
What remains when creating the script is to ensure that all the parameters
given to it are properly passed to the git grep call, so that a user can
choose whether or not to ignore case (-i) or whether to exclude binary files
(-I), and to ensure that any spaces or metacharacters used in the git grep
call (such as in a regexp) are properly passed.
Luckily, that’s simple: a plain "$@" properly passes all those things
to the command.
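To see what the double quotes around $@ buy us, the sketch below forwards its arguments to a helper twice, once quoted and once not, and reports how many arguments arrived. All the function names are made up for illustration.

```shell
# Hypothetical helpers showing why "$@" needs its double quotes.
count_args() {
    printf '%s\n' "$#"
}

forward_quoted() {
    count_args "$@"      # each original argument arrives intact
}

forward_unquoted() {
    count_args $@        # arguments are re-split on whitespace (and globbed)
}
```

With the arguments -i 'no critic', the quoted version reports two arguments, while the unquoted one reports three, as the pattern gets split at its space.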
After all that, here’s the git-grepblame script:
#!/bin/bash
# Copyright 2019 Marco Fontani <mf@marcofontani.it>
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 1. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
function git-grepblame {
    local script
    script="$(cat <<'EOF'
# Slurp the whole of the git grep output in one go.
my $input = do { local $/=undef; <> };
# A record is: file name, NUL, line number, NUL, (possibly empty) content, newline.
while ($input =~ m!\A(([^\0]+)\0([1-9][0-9]*)\0([^\n]*)\n)!xmsg) {
    my ($orig, $filename, $lineno, $line) = ($1, $2, $3, $4);
    # Drop the record just parsed, so that the next match starts at \A again.
    $input = substr $input, length $orig;
    # Escape each single quote "'" in a singly quoted string as: "'\''"
    $filename =~ s!'!'\\''!gxms;
    # Run "git blame" on the file/line, and show the output.
    print qx<
        git blame --show-name --show-number -L $lineno,$lineno -- '$filename'
    >;
}
EOF
)"
    git grep --null --line-number "$@" | perl -e "$script"
}
git-grepblame "$@"
… and here it is in action, finding one specific chunk of text in the Perl repository:
~/GIT/SOURCE/perl blead
$ git grepblame -e 'keywords, are'
3cd7355842a pod/perlfunc.pod 108 (Marco Fontani 2017-12-15 10:16:31 -0700 108) keywords, are described in L<perldiag> and L<warnings>.
Hope this helps!