Git Grep and Blame all-in-one: git-grepblame

In this article I describe a (bash & Perl) script I’ve recently written to run git blame on the lines matched by a git grep, along with the concepts, pitfalls, and solutions I’ve found along the way.

The tool below kinda does what git grep --blame ought to do if it existed; or in other words does git grep with blame.

As an aside, if you’re trying to find whose beautiful code it was you’re looking for, you might prefer to create an alias for git praise, i.e.:

$ git config --global alias.praise blame # Who wrote this most beautiful piece of.. code?

… and then start git praise’ing away. It’s a nice feeling.

What I want to do:

$ git grepblame '2019'
f0f0f0f0f0 year.txt 4 (Marco Fontani 2019-01-01 00:00:01 UTC) 2019
f0f0f0f0f0 copyright.txt 4 (Marco Fontani 2019-01-01 00:00:01 UTC) Copyright (c) 2019 mf@marcofontani.it
$ git grepblame 'no critic'
f0f0f0f0f0 foo.pl 4 (Marco Fontani 2019-01-01 00:00:01 UTC)    ## no critic (ProhibitBacktickOperators)

… which is a combination of git grep; and then git blame on those lines.

At a first approximation, I would want the script to:

git grep for the pattern (and with options) given, with line number
analyse the output (file name, line number, text at that line), and run a git blame on that file, at that line
print (the file and line number and) all those git blame lines

Simple, right? And so thought I.

I’ve launched a podman container to create a new Git repository so I could play with things. You can do something similar if you’d like to follow along.

I’ve just installed git in it, and configured it to use my own system’s Git configuration for user name and email; created a Git repository in /tmp/repo, and added a text file.

$ podman run -it --rm debian:sid-slim bash -c "(apt update;apt install -y git;git config --global user.email '$(git config user.email)';git config --global user.name '$(git config user.name)')>/dev/null 2>&1; bash"
root@8f27bd46a4fd:/# mkdir -p /tmp/repo
root@8f27bd46a4fd:/# cd !$
cd /tmp/repo
root@8f27bd46a4fd:/tmp/repo# git init .
Initialized empty Git repository in /tmp/repo/.git/
root@8f27bd46a4fd:/tmp/repo# echo foo | tee foo.txt
foo
root@8f27bd46a4fd:/tmp/repo# git add !$
git add foo.txt
root@8f27bd46a4fd:/tmp/repo# git commit -am "add !$"
git commit -am "add foo.txt"
[master (root-commit) 8de25cc] add foo.txt
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt
root@8f27bd46a4fd:/tmp/repo#

The first step is running git grep and analysing its output, without making mistakes: I need to get the file name, the line number, and optionally the content on that file & line.

That’s easy enough: git grep --line-number PATTERN gets us output which contains the file name; a colon (:); the line number; another colon (:), and lastly the content:

root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
foo.txt:1:foo

So one could simply split that output “on :”, and the first item would be the file name; the second the line number; and then the content. Easy enough with awk:

root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <foo.txt> Line <1> Content <foo>

Well, maybe not like that if you care about capturing the content, as the content could well contain a :…

root@8f27bd46a4fd:/tmp/repo# echo foo:bar | tee foobar.txt
foo:bar
root@8f27bd46a4fd:/tmp/repo# git add !$
git add foobar.txt
root@8f27bd46a4fd:/tmp/repo# git commit !$ -m 'add file with colon in contents'
git commit foobar.txt -m 'add file with colon in contents'
[master 7b3a9f2] add file with colon in contents
 1 file changed, 1 insertion(+)
 create mode 100644 foobar.txt

… at which point the previous awk command would output the wrong thing:

root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
foo.txt:1:foo
foobar.txt:1:foo:bar
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <foo.txt> Line <1> Content <foo>
File <foobar.txt> Line <1> Content <foo>

… but the content doesn’t matter: all that’s required to run git blame for that file at that line is the file name and the line number.

But UNIX file names can contain all sorts of things: they can contain anything except:

a forward slash, / - as it’s used to separate directories in UNIX
a null byte, let’s call it/show it as ^@ (even though it’s two characters) - as that’s what C uses to terminate strings

This means that in a Git repo there could well be a file whose name contains a colon, and that throws a wrench into the above awk call, too:

root@8f27bd46a4fd:/tmp/repo# echo foo | tee "foo:bar.txt"
foo
root@8f27bd46a4fd:/tmp/repo# git add !$
git add "foo:bar.txt"
root@8f27bd46a4fd:/tmp/repo# git commit !$ -m 'add file with colon in name'
git commit "foo:bar.txt" -m 'add file with colon in name'
[master 4bb03ee] add file with colon in name
 1 file changed, 1 insertion(+)
 create mode 100644 foo:bar.txt
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
foo.txt:1:foo
foo:bar.txt:1:foo
foobar.txt:1:foo:bar
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <foo.txt> Line <1> Content <foo>
File <foo> Line <bar.txt> Content <1>
File <foobar.txt> Line <1> Content <foo>

As one can see above, the “split on :” can’t work if the file name contains a colon!

Not only that: given that they can contain any character but slash and NULL, file names can also contain newlines, spaces, quotes and double quotes, all of which can be not-so-trivial things to look out for:

root@8f27bd46a4fd:/tmp/repo# echo foo | tee "$(printf "bar\nbaz.txt")"
foo
root@8f27bd46a4fd:/tmp/repo# git add !$
git add "$(printf "bar\nbaz.txt")"
root@8f27bd46a4fd:/tmp/repo# git commit !$ -m 'add file with newline in name'
git commit "$(printf "bar\nbaz.txt")" -m 'add file with newline in name'
[master d1b7f21] add file with newline in name
 1 file changed, 1 insertion(+)
 create mode 100644 "bar\nbaz.txt"
root@8f27bd46a4fd:/tmp/repo# ls
'bar'$'\n''baz.txt'   foo.txt   foo:bar.txt   foobar.txt
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f
bar
baz.txt:1:foo
foo.txt:1:foo
foo:bar.txt:1:foo
foobar.txt:1:foo:bar
root@8f27bd46a4fd:/tmp/repo# git grep --line-number f | awk -F : '{ print "File <" $1 "> Line <" $2 "> Content <" $3 ">" }'
File <bar> Line <> Content <>
File <baz.txt> Line <1> Content <foo>
File <foo.txt> Line <1> Content <foo>
File <foo> Line <bar.txt> Content <1>
File <foobar.txt> Line <1> Content <foo>

In fact, a file whose name contains newlines even throws a little wrench in git grep’s output, too - as the output isn’t newline-separated any more!

What we need is a method to clearly distinguish, in git grep’s output:

the file name, regardless of whether it contains spaces, dashes, quotes, double quotes or indeed one or more newlines
the line number
(optionally) the content, which (so long as it’s text-like) at least does end in a newline (but could contain \r if it’s been saved on Windows)

… noting that the content could well contain any character whatsoever. Spaces, dashes, quotes, even null characters.. but only one newline.

Many standard UNIX tools can separate their output with the NULL character, ^@: it’s the only character that can safely delimit a list of file names with paths, since the only other character which can’t appear in a file name, /, is needed as the directory separator.
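
As a small illustration of why NULL is the safe separator, here’s a sketch that counts files by counting NULL bytes rather than lines. It assumes a find which supports -print0 (GNU and BSD find both do); count_files is a name made up for this example.

```shell
# count_files DIR: count regular files under DIR, even when names contain newlines.
# Assumes find supports -print0 (GNU and BSD find do).
count_files() {
  # Each file name is followed by exactly one NUL byte: keep only those bytes, and count them.
  n=$(find "$1" -type f -print0 | tr -cd '\000' | wc -c)
  echo "$((n))"   # arithmetic expansion strips any padding wc may add
}

demo=$(mktemp -d)
: > "$demo/foo.txt"
: > "$demo/$(printf 'bar\nbaz.txt')"
count_files "$demo"             # counts NUL terminators: one per file
find "$demo" -type f | wc -l    # counts newlines, so the second file is counted twice
rm -rf "$demo"
```

The two counts differ as soon as one file name contains a newline, which is exactly the situation created above.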

Luckily git grep has such an option:

-z, --null
    Output \0 instead of the character that normally follows a file name.

The \0 (or ^@) separator is, “luckily”, also output after the line number (between it and the content) if the --line-number option is used:

root@8f27bd46a4fd:/tmp/repo# git grep --line-number --null f
bar
baz.txt^@1^@foo
foo.txt^@1^@foo
foo:bar.txt^@1^@foo
foobar.txt^@1^@foo:bar

… but if we have the above output we can’t use awk anymore, as the records aren’t just split by NULL bytes, but also by newlines…

Note: one could well work with NULL-separated records in mawk and gawk using awk 'BEGIN{RS="\0"}; { ... }', but that doesn’t work in every awk implementation.

We need something that’s just a little bit “smarter” to know that fields are separated by a couple NULL bytes, but then it’s any character until the newline; then a new record starts again.
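
For comparison, the same record shape can be read in plain bash. This is only a sketch, not what the script further down does: it relies on bash’s read -d '' (not available in a standard sh), and parse_grep_z is a made-up name. The input is simulated with printf, standing in for the git grep --null --line-number output shown above.

```shell
# parse_grep_z: read "FILE\0LINE\0CONTENT\n" records from stdin and print a
# summary of each. Needs bash: `read -d ''` is not POSIX.
parse_grep_z() {
  local file lineno content
  # The first two reads stop at a NUL byte; the third stops at the newline ending the record.
  while IFS= read -r -d '' file &&
        IFS= read -r -d '' lineno &&
        IFS= read -r content; do
    printf 'File <%s> Line <%s> Content <%s>\n' "$file" "$lineno" "$content"
  done
}

# Simulated input, standing in for: git grep --null --line-number f
printf '%s\0%s\0%s\n' \
  "$(printf 'bar\nbaz.txt')" 1 foo \
  'foo:bar.txt' 1 foo \
  foobar.txt 1 foo:bar |
  parse_grep_z
```

Colons and newlines in the file name no longer confuse the split, as only the NULL bytes and the final newline are treated as separators.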

I, as it often happens, chose Perl for this.

The first record looks something like this (using a verbose and commented regular expression):

my $first_record = qr{
    \A                      # The start of the whole input
    (?<filename>[^\0]+)     # The filename, which is "anything but a NULL byte"
    \0                      # ... up to the null byte used as separator
    (?<lineno>[1-9][0-9]*)  # The line number. There is no line 0.
    \0                      # The null byte used as separator
    (?<contents>[^\n]*)     # The contents: any non-newlines (possibly none) up to...
    \n                      # ... the newline which ends the record
}xms;

… and assuming we then restart matching where the last match left off, all the records which follow will have the same structure.
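
One way to express “restart matching where the last match left off” in Perl is the \G anchor together with the /g modifier. The sketch below is an alternative to what the script further down does (which trims the matched prefix off the input instead). It mirrors the regular expression above, except that it also accepts empty content; parse_records is a made-up name, and the input is simulated with printf.

```shell
# parse_records: slurp all of stdin (-0777), then match one record after another.
# \G anchors each attempt at the position where the previous match ended.
parse_records() {
  perl -0777 -ne '
    while (/\G([^\0]+)\0([1-9][0-9]*)\0([^\n]*)\n/g) {
      my ($filename, $lineno, $contents) = ($1, $2, $3);
      $filename =~ s/\n/\\n/g;    # show embedded newlines as \n, for display only
      print "File <$filename> Line <$lineno> Content <$contents>\n";
    }
  '
}

# Simulated input, standing in for: git grep --null --line-number f
printf '%s\0%s\0%s\n' \
  "$(printf 'bar\nbaz.txt')" 1 foo \
  'foo:bar.txt' 1 foo \
  foobar.txt 1 foo:bar |
  parse_records
```

Because of \G, a record that doesn’t fit the expected shape stops the loop rather than being skipped silently in the middle of the input.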

Having parsed that output, and confident that the filename and line number can be reliably grabbed with the regular expression above, all that remains is to call git blame with the proper parameters - that is, the line number and the file name.

The git blame command luckily accepts a -L MIN,MAX parameter to restrict the output to the MIN-MAX lines given, so we can use -L N,N to restrict the output to the one line we’re interested in.

By default, git blame does not output the name of the file it’s producing the blame for, but we need it, as we need to know which file (and line) matched; one can use the --show-name and --show-number options to show the file name and the line number, respectively (both as recorded in the original commit).
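
To see those options together, here’s a minimal sketch which blames a single line in a throwaway repository. blame_line is a made-up name, and the identity passed to git commit is a placeholder for the demo only.

```shell
# blame_line FILE N: blame only line N of FILE, showing file name and line number.
blame_line() {
  git blame --show-name --show-number -L "$2,$2" -- "$1"
}

# Throwaway repository to try it in; the identity below is a placeholder.
(
  repo=$(mktemp -d) && cd "$repo" || exit 1
  git init -q .
  printf 'one\ntwo\nthree\n' > lines.txt
  git add lines.txt
  git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'add lines.txt'
  blame_line lines.txt 2     # prints a single blame line, the one for "two"
  cd / && rm -rf "$repo"
)
```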

All that remains is to figure out how to properly pass a file name to it, without breaking things when (not if) the file name contains characters which would otherwise have a different meaning: quotes, double quotes, newlines, etc.

As filenames can start with a dash (-), which is also the character used to start an option, it’s a good idea to indicate the end of the options, and the start of the file’s name, with a double dash (--). A double dash is a standard method for saying “options are now finished; what follows are file names”.

Some tools allow one to kind of bypass this problem by ensuring that the file names start with ./ (i.e. “the current directory”) but that won’t work well in this case, as git-grep doesn’t output (and therefore “pass on”) the leading ./.

A double dash is neater anyway.
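
A quick sketch of the difference, using cat on a file literally named -n in a scratch directory (cat_file is a made-up name):

```shell
# cat_file NAME: print a file, even if its name looks like an option.
cat_file() {
  cat -- "$1"    # "--" ends option parsing; whatever follows is a file name
}

(
  dir=$(mktemp -d) && cd "$dir" || exit 1
  echo 'hello' > ./-n
  cat_file -n    # prints hello; without the "--", cat would treat -n as an option
  cd / && rm -rf "$dir"
)
```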

As to the format in which to pass the file name… It’s usually a good idea to quote a filename in single quotes, so that the shell won’t perform variable expansion in it even if the string contains a dollar sign (i.e. foo=bar; echo "'$foo'" echoes back 'bar', as the outer double quotes allow expansion; whereas foo=bar; echo '"$foo"' outputs "$foo", as the outer single quotes prevent it). So a singly quoted string it must be.

But the file name may well contain a quote character, and the shell will complain (i.e. “file not found”) if we don’t quote the file name properly.

A lone quote character in a singly quoted string should be quoted in a bit of an odd way: '\''. It means “end the singly quoted string; then output an escaped quote character; then restart the singly quoted string”:

root@8f27bd46a4fd:/tmp/repo# echo 'It'\''s alive!'
It's alive!
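
The same escaping can be sketched as a small shell function. shell_quote is a made-up name; it sticks to standard sh features, and uses plain (global) variables for that reason.

```shell
# shell_quote STRING: print STRING wrapped in single quotes, with every embedded
# single quote rewritten as '\'' (close the string, escaped quote, reopen it).
shell_quote() {
  rest=$1
  quoted=
  while :; do
    case $rest in
      *\'*)
        head=${rest%%\'*}             # text before the first single quote
        rest=${rest#*\'}              # text after it
        quoted=$quoted$head\'\\\'\'   # append that text, then the four characters '\''
        ;;
      *)
        quoted=$quoted$rest           # no quotes left: append the remainder
        break
        ;;
    esac
  done
  printf "'%s'\n" "$quoted"
}

shell_quote "It's alive!"    # prints 'It'\''s alive!'
```

The result can be pasted back into a shell command as a single argument, whatever the original string contained.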

How about newlines? bash allows a newline to be similarly substituted (i.e. via '$'\n''):

root@8f27bd46a4fd:/tmp/repo# cat 'bar'$'\n''baz.txt'
foo

… but a standard sh shell doesn’t; “luckily” a standard shell can just use the actual newline character, too:

root@8f27bd46a4fd:/tmp/repo# cat "$(printf "bar\nbaz.txt")"
foo

… so we can just not care about escaping newlines after the double dash inside the singly quoted string which contains the file name we want to run git blame on. Phew, that was a mouthful.
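
Here’s a minimal sketch of that: the command string handed to sh contains a real newline between the single quotes, and the file is still found. cat_via_sh is a made-up name, and the sketch assumes the name contains no single quote (that case needs the '\'' treatment described earlier).

```shell
# cat_via_sh NAME: build a command string with NAME in single quotes, and run it.
# Assumes NAME contains no single quote.
cat_via_sh() {
  sh -c "cat -- '$1'"
}

(
  dir=$(mktemp -d) && cd "$dir" || exit 1
  name=$(printf 'bar\nbaz.txt')    # a file name with an embedded newline
  echo foo > "$name"
  cat_via_sh "$name"               # prints foo
  cd / && rm -rf "$dir"
)
```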

What remains when creating the script is to ensure that all the parameters given to it are properly passed to the git grep call, so that a user can choose whether or not to ignore case (-i); whether to exclude binary files (-I), and to ensure that any spaces or metacharacters used in the git grep call (such as a regexp) are properly passed.

Luckily, that’s simple - and a simple "$@" properly passes all those things to the command.
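
A small sketch of what the double quotes around $@ buy us; the function names are made up for the example:

```shell
count_args()       { echo "$#"; }
forward_quoted()   { count_args "$@"; }   # each argument is passed on as-is
forward_unquoted() { count_args $@; }     # arguments get split again on whitespace

forward_quoted   -i -e 'no critic'   # prints 3
forward_unquoted -i -e 'no critic'   # prints 4: the pattern was split in two
```

With the unquoted form, git grep would receive “no” as the pattern and “critic” as something else entirely.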

After all that, here’s the git-grepblame script:

#!/bin/bash
# Copyright 2019 Marco Fontani <mf@marcofontani.it>
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 1. Redistributions of source code must retain the above copyright notice,
#    this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice,
#    this list of conditions and the following disclaimer in the documentation
#    and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its contributors
#    may be used to endorse or promote products derived from this software
#    without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

function git-grepblame {
  local script
  script="$(cat <<'EOF'
  my $input = do { local $/=undef; <> };
  while ($input =~ m!\A(([^\0]+)\0([1-9][0-9]*)\0([^\n]*)\n)!xmsg) {
    my ($orig, $filename, $lineno, $line) = ($1, $2, $3, $4);
    $input = substr $input, length $orig;
    # Escape each single quote "'" in a singly quoted string as: "'\''"
    $filename =~ s!'!'\\''!gxms;
    # Run "git blame" on the file/line, and show the output.
    print qx<
      git blame --show-name --show-number -L $lineno,$lineno -- '$filename'
    >;
  }
EOF
  )"
  git grep --null --line-number "$@" | perl -e "$script"
}

git-grepblame "$@"
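
For git grepblame to work as a subcommand, git needs to find an executable named git-grepblame somewhere on the PATH. A minimal sketch of installing it follows; the install_grepblame name, the source file name and the ~/bin default are all assumptions made for this example.

```shell
# install_grepblame SRC [DIR]: copy the script SRC into DIR as "git-grepblame",
# and make it executable. DIR defaults to ~/bin, which is assumed to be on PATH.
install_grepblame() {
  dest=${2:-$HOME/bin}
  mkdir -p "$dest" &&
    cp -- "$1" "$dest/git-grepblame" &&
    chmod +x "$dest/git-grepblame"
}

# Usage, assuming the script above was saved as ./git-grepblame.sh:
#   install_grepblame ./git-grepblame.sh
#   git grepblame -e 'keywords, are'
```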

… and here it is in action, finding one specific chunk of text in the Perl repository:

~/GIT/SOURCE/perl blead
$ git grepblame -e 'keywords, are'
3cd7355842a pod/perlfunc.pod 108 (Marco Fontani 2017-12-15 10:16:31 -0700 108) keywords, are described in L<perldiag> and L<warnings>.

Hope this helps!