DISCLAIMER. English is used here only for compatibility (ASCII only), so any corrections to my grammar (and to anything else) will be greatly appreciated.

Monday, December 3, 2012

git filter-branch '--subdirectory-filter' preserving '--no-ff' merges.

(update2, 3/6/2013)
Well, in fact, the name is a bit misleading, because '--subdirectory-filter'
does not preserve merge commits created with the '--no-ff' option where a
fast-forward was possible. I think this is because such commits are empty and
do not change any files in the subdirectory (so this is not a bug).

Anyway, i want to split a subdirectory into a separate project, preserving
'--no-ff' merge commits. And here is the path i will try to take:

1. Track each file from the subdirectory (for all refs) through all renames
   back to its real origin.
2. Merge all file tracks into a single "track file". This will be the list of
   all files i need to keep at each commit.
3. Run 'filter-branch' with the following '--tree-filter':
    - If the subdirectory does not exist at a given commit, remove all files
      not listed in the single track file.
    - Otherwise (if the subdirectory exists), remove all files outside the
      subdirectory that are not listed in the single track file, and then move
      all files from the subdirectory to the top repository dir. Any filename
      conflict during the move should abort the rewrite (except ones
      explicitly handled by the tree-filter).
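
As a rough illustration of the first branch of step 3 (the subdirectory does
not exist at the commit), here is a minimal sketch. It is not the actual
'examples/jp_tree_filter.sh'; the function names 'files_to_keep' and
'remove_unlisted' are hypothetical, and the track file is assumed to hold one
"<full commit hash> <path>" pair per line (paths must not contain newlines).

```shell
# Print the paths that the track file lists for one commit.
# Assumed track format, one pair per line: "<full commit hash> <path>".
files_to_keep() (
    track_file=$1
    commit=$2
    # Match the hash only at the start of a line and strip "<hash> ".
    sed -n -e "s/^$commit //p" "$track_file" | LC_ALL=C sort -u
)

# Remove every regular file below the current directory whose path
# (relative to the current directory) is not a line of the list file $1.
# The list file itself must live outside the current directory.
remove_unlisted() (
    keep_list=$1
    find . -path ./.git -prune -o -type f -print | sed -e 's|^\./||' |
    while IFS= read -r path; do
        # -F: fixed string, -x: whole line must match, -q: quiet.
        grep -Fqx -e "$path" "$keep_list" || rm -f -- "$path"
    done
)
```

Inside a '--tree-filter', 'filter-branch' exports the commit being rewritten
as $GIT_COMMIT, so the two functions could be combined as
'files_to_keep "$track" "$GIT_COMMIT" > "$list" && remove_unlisted "$list"',
where $track and $list are placeholder absolute paths outside the tree.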

Table of contents:
    1. Track renames of a file.
    2. Renames tracking side-effect.
    3. Running examples.
    4. Why not use 'git log --follow' to track renames of a file?
    5. Note about removing files from subdirectory.
    Appendix A. Script 'track_renames.sh'.
    Appendix B. Script 'examples/jp_prepare.sh'.
    Appendix C. Script 'examples/jp_gen_track.sh'.
    Appendix D. Script 'examples/jp_rewrite.sh'.
    Appendix E. Script 'examples/jp_tree_filter.sh'.
    Appendix F. Script 'examples/jp_finalize.sh'.
    Appendix G. Script 'examples/jp_check.sh'.
    Appendix H. Script 'examples/ex_index_filter.sh'.

1. Track renames of a file.

The greatest challenge for me was tracking a file through all renames back to
its real origin(s). This is done by the 'track_renames.sh' script. All
commands below are copied (almost) literally from this script.

Each track consists of several continuous history pieces. I start at some
commit c0, which contains file f. Then i find the end point of the given
_continuous_ history piece of file f (i traverse history backwards, i.e. going
from child commits to parents):

    git log --diff-filter=A --pretty='format:%H' -1 c0 -- f

Note the '-1' in the command above. It ensures that the history is continuous,
i.e. file f exists at every commit on the ancestry chain between the result of
the above command (commit c1) and commit c0. Then i obtain the entire ancestry
chain (all commits that are both descendants of c1 and ancestors of c0):

    git log --pretty='format:%H' --ancestry-path c1..c0 --

Note that i can't do this in one run, because 'git log -- f' will not show
commits that do not change file f, but i need them too (otherwise the
tree-filter script would remove file f at those commits!). Then i combine the
commits and the corresponding filename into pairs.
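
The two 'git log' calls above can be put together roughly as follows. This is
a minimal sketch, not the code of 'track_renames.sh': the function names
'pair_with_file' and 'track_piece' are hypothetical, and the output format
"<hash> <file>" is an assumption based on the track files shown later.

```shell
# Pair each commit hash read from stdin with one filename: "<hash> <file>".
# The filename is passed through the environment so awk does not interpret
# backslashes in it.
pair_with_file() {
    TRACK_PATH=$1 awk 'NF { print $1 " " ENVIRON["TRACK_PATH"] }'
}

# Emit one continuous history piece of file $2, walking back from commit $1.
track_piece() (
    c0=$1
    f=$2
    # c1: the most recent commit reachable from c0 that added f.
    c1=$(git log --diff-filter=A --pretty='format:%H' -1 "$c0" -- "$f")
    [ -n "$c1" ] || exit 1
    {
        # The range c1..c0 excludes c1 itself, so print it explicitly.
        echo "$c1"
        git log --pretty='format:%H' --ancestry-path "$c1..$c0" --
        # 'format:' omits the final newline; add one so the last hash
        # is not glued to whatever follows.
        echo
    } | pair_with_file "$f"
)
```

Called as 'track_piece c0 f', it prints c1 first and then every commit on the
ancestry chain, each paired with f.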

Then i need to find the origins of file f at the end point of its history
(where file f was added to the repository), commit c1. I use 'git blame' for
this:

    git blame -C -C --incremental 'c1^!' -- f

Note that i use two '-C' options at the commit that creates file f. This means
that 'blame' will look for line origins in _all_ files at commit c1.

This may result in:
    - The same file f from the same commit c1. This happens when some lines
      were added to file f at commit c1 (i.e. the origin of these lines is
      file f itself). I should filter out such results, because they would
      cause an infinite loop and because i have already written file f at
      commit c1 into the track during the previous "git log" step.
    - Some other file(s) g, h, .. from parent commit(s) c1^X. This happens
      when some lines were copied or moved from other files to file f.

Why can't lines attributed to other files originate from commit c1? The
origin commit of a line is always the one that last modified it in the file it
is attributed to. But if lines are attributed to another file g, the matching
lines in file g cannot have been added at commit c1: they must have existed
before (i.e. at commit c1^). If that were not the case, and the same lines
were added to both file f and file g at commit c1, file g would not be an
origin for them. In other words, 'blame' takes the version of file g from
commit c1^ when searching for line origins, and if it finds a match, the
revision specification 'c1^!' forces 'blame' not to search further down the
history for where these lines were last modified, but just to blame the
boundary commit c1^ for them.
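
Extracting the (commit, file) origins from the incremental blame output and
dropping the tracked pair itself can be sketched like this. The function name
'blame_origins' is hypothetical and this is not the code of
'track_renames.sh'. It relies only on two properties of the '--incremental'
format: every entry starts with a line "<40-hex hash> <n> <n> <n>" and ends
with a "filename <path>" line.

```shell
# Read 'git blame --incremental' output on stdin and print each distinct
# "<commit> <filename>" pair once, skipping the tracked pair ($1, $2) itself.
# $1 must be the full 40-character hash.
blame_origins() {
    SELF_PAIR="$1 $2" awk '
        # Entry header: <hash> <orig line> <final line> <line count>
        NF == 4 && length($1) == 40 && $1 ~ /^[0-9a-f]+$/ { sha = $1; next }
        # Last line of an entry; "filename " is 9 characters long.
        /^filename / {
            pair = sha " " substr($0, 10)
            if (pair != ENVIRON["SELF_PAIR"] && !(pair in seen)) {
                seen[pair] = 1
                print pair
            }
        }
    '
}
```

It would be fed as
'git blame -C -C --incremental "$c1^!" -- "$f" | blame_origins "$c1" "$f"'.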

So, i am only interested in other files that are origins of file f. To
summarize, the track of file f looks like a tree, where
    - The root is file f at some ref (commit c0).
    - The history of a node (file) is tracked by 'git log' (from the commit
      where the file corresponding to the node was 'blame'-d to be an origin,
      back to the commit where the file was most recently added to the
      repository).
    - A node is split by 'git blame' searching for the origins of the file's
      lines.

                                            git log
                                . (c1^, h) ----------> ..
             git log           . git blame
    (c0, f) ---------> (c1, f).
                            A  . git blame  git log
                                . (c1^, g) ----------> ..

This tree may (and most likely will) have identical branches, which will be
tracked independently. In other words, my algorithm is inefficient: if at some
point i encounter a pair (commit c, file f) that has already been tracked
down, the resulting track of this pair will be exactly the same as before, but
the current implementation does not reuse already generated tracks and does it
all over again.
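
One possible way to avoid the repeated work, shown only as a sketch (the
current 'track_renames.sh' does not do this, and the name 'first_visit' is
hypothetical), is to record every (commit, file) pair in a scratch file and
skip pairs that are already there:

```shell
# Succeed (exit 0) only the first time a (commit, file) pair is seen.
# $1 is a scratch file recording the pairs already tracked,
# $2 is the commit, $3 is the file.
first_visit() (
    seen_file=$1
    pair="$2 $3"
    touch "$seen_file"
    # -F: fixed string, -x: whole line must match, -q: quiet.
    if grep -Fqx -e "$pair" "$seen_file"; then
        exit 1
    fi
    printf '%s\n' "$pair" >> "$seen_file"
)
```

A traversal would then descend into a pair only when
'first_visit "$seen" "$c" "$f"' succeeds.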

Tracks are written to a file, and its filename (see below) is echoed to
stdout:

    ./track_$(date ..)-${c0}_$(echo "$f" | sed -e's:/:_:g').txt

The script 'examples/jp_gen_track.sh' generates tracks for all requested files
at all requested refs (it just calls 'track_renames.sh') and combines them,
removing duplicates, into a single track file.
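
The combining step can be as small as a 'sort -u' over the individual track
files. This is a sketch under the assumption that every track line has the
form "<hash> <file>"; 'merge_tracks' is a hypothetical name, not a function of
'examples/jp_gen_track.sh'.

```shell
# Merge the track files $2.. into the single track file $1, keeping each
# "<hash> <file>" line once. The output file must not be one of the inputs.
merge_tracks() (
    out=$1
    shift
    LC_ALL=C sort -u -- "$@" > "$out"
)
```

For example, 'merge_tracks ../single_track.txt ./track_*.txt', where the
output name is a placeholder.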


2. Renames tracking side-effect.

Though it all may seem simple and correct, a history rewrite (using
filter-branch) that preserves only the files listed in the single track file
(see 'examples/ex_index_filter.sh'), generated for every file at every ref,
may still leave the project in an inconsistent state at some commits: with
some files deleted, the project may no longer compile, or commit messages
(which mention deleted files) may become confusing. In other words, the
rewritten history most likely will _not_ be identical to the original.

The reason is that any file that no longer exists at the branch's tip can be
included in the single track file only as an origin (of some generation) of
some file existing at the tip, and only from the point where these origins are
searched for (by 'git blame'). In other words, only _after_ (remember, i
traverse history in reverse) the commit where the currently tracked file
(either one existing at the branch's tip or its N-th generation origin) was
(most recently) added to the repository.

Let's consider an example project with one branch and one file f at the tip of
that branch, where
    - File f is the tracked file. File 'f1' denotes a different state (at a
      different commit) of file f, not a new file.
    - File g is an origin of file f. Files 'g1', 'g2', 'g3' are also different
      states (at different commits) of file g, not new files.
    - Commit P is the branch head.
    - Commit Q is where file g was deleted.
    - Commit R is where file g was added.
    - Commit S is where file f was added.
    - The letter 'D' under a file name denotes a file deleted at this commit.
    - The letter 'A' under a file name denotes a file added at this commit.
    - The letter 'o' under a commit name denotes the commit.
    - A dash ('-') denotes the commit ancestry chain (more recent on the left).
    - A dot denotes file (or line(s)) history.

There are three possible outcomes of the history rewrite:

1. The rewritten history will not contain file g at all. Hence, commits
   modifying only file g become empty and will be deleted by the
   '--prune-empty' option:

(history shown in reverse, more recent commits on the left!)
    P      Q       R    S                     P'               S'
    o ---> o ----> o -> o                     o -------------> o
       (git log -- f)              (filter-branch)
    f1 ................ f               ==>   f1 ............. f
                        A
           g1  ... g
           D       A

2. The rewritten history will be identical to the original:

(history shown in reverse, more recent commits on the left!)
                       (Q)
    P                   S    S^    R          P'               S'   S'^    R'
    o ----------------> o -> o --> o          o -------------> o -> o  --> o
       (git log -- f)             (filter-branch)
    f1 ................ f.              ==>   f1 ............. f 
                        A . (git blame)                        A  
                           .                                       
                       g2 . g1 ... g                           g2 . g1 ... g
                       D           A                           D           A

3. The rewritten history will contain only part of file g's history. As in
   case 1, commits modifying only file g after commit S (descendants of commit
   S, i.e. on the ancestry chain S..Q) become empty and will be deleted by the
   '--prune-empty' option:

(history shown in reverse, more recent commits on the left!)
    P      Q            S    S^    R          P'               S'   S'^    R'
    o ---> o ---------> o -> o --> o          o -------------> o -> o  --> o
       (git log -- f)             (filter-branch)
    f1 ................ f.              ==>   f1 ............. f 
                        A . (git blame)                        A  
                           .                                       
           g3 ........ g2 . g1 ... g                           g2 . g1 ... g
           D                       A                           D           A

Here are several examples of the above history rewrites. In all examples i
rewrite history preserving all files and branches. I also use some commands
from the 'track_renames.sh' script.

Example 1.

Case 3 in linear history: the origin of a tracked file is deleted at all
commits where the tracked file still exists.

Original history:
{{{
    | * 419ca9a Delete experimental files.
    | | D       show_words/src/multiline.hs
    | | M       show_words/src/testSgfList.hs
    | | D       show_words/src/tf.hs
    | * 9bb2a68 Finally fix foldrMerge. Add newtype ZipList'. Add some test for SgfList.
    | | M       show_words/src/SgfList.hs
    | | A       show_words/src/testSgfList.hs
    | * fd11529 Replace eq with (Eq a) in transp.
    | | M       show_words/src/SgfList.hs
}}}

Rewritten history:
{{{
    | * 2fcaa27 Delete experimental files.
    | | M       show_words/src/testSgfList.hs
    | * 091d23e Finally fix foldrMerge. Add newtype ZipList'. Add some test for SgfList.
    | | M       show_words/src/SgfList.hs
    | | D       show_words/src/multiline.hs
    | | A       show_words/src/testSgfList.hs
    | * 2205617 Replace eq with (Eq a) in transp.
    | | M       show_words/src/SgfList.hs
}}}

As you can see, in the rewritten history file 'multiline.hs' is deleted
earlier (at commit "Finally fix..") than in the original history (at commit
"Delete.."), making the commit message "Delete experimental files." a bit
confusing.

Why did this happen? File 'multiline.hs' was brought into the single track
file by file 'testSgfList.hs', which was added to the repo at commit
"Finally fix..":

    $ git blame -C -C --incremental '9bb2a68^!' -- show_words/src/testSgfList.hs \
        | sed -ne '/^[0-9abcdef]\{40\} /{ s/ .*//; h; }; /^filename /{ H; g; s/\nfilename / /p; };' \
        | uniq
    9bb2a6842fbf995f24460812cb34ddbd2cdb864a show_words/src/testSgfList.hs
    fd115298dab19b959d0ace361f068d7197c5c796 show_words/src/multiline.hs

but at commit "Delete.." 'testSgfList.hs' still exists (or "already" exists,
but because i traverse history in reverse, i prefer "still"), hence i don't
even think about its origins. At commit "Finally fix..", where
'testSgfList.hs' was added, i search for its origins, but _still_ do not need
them, because 'testSgfList.hs' _still_ exists. Only at commit "Replace eq..",
the parent of "Finally fix..", do i need its origin. In other words, i notice
the origin ('multiline.hs') only when i reach the child's "birth" commit, and
i want the origin to exist only before (earlier than) the child was "born".

Example 2.

Case 3 in non-linear history: the origin of a tracked file is deleted on
another history branch (i mean another chain of commits, not a ref).

Original history:
{{{
    | * 4e2d0f9 Support multiline input.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/multiline.hs
    | | D       show_words/src/test_1.txt
    | | D       show_words/src/test_2.txt
    | | D       show_words/src/test_3.txt
    | | D       show_words/src/test_words.txt
    | | A       show_words/test/1.txt
    | | A       show_words/test/2.txt
    | | A       show_words/test/3.txt
    | | A       show_words/test/words_jp_ru.txt
    | *   42ab9af Merge branch 'show_words' into show_words_multiline
    | |\  
    | | * b57f90f Move all generic list functions into SgfList.
    | | | A     show_words/src/SgfList.hs
    | | | D     show_words/src/SgfListIndex.hs
    | | | M     show_words/src/SgfOrderedLine.hs
    | | | M     show_words/src/ShowWords.hs
    | * | ff1f738 Rewrite and rename a little.
    | | | M     show_words/src/multiline.hs
    | * | 1d2fe0c Two multiline implementations. Both works.
    | |/  
    | |   A     show_words/src/multiline.hs
    | * 7ff496a Add fixmes.
    |/  
    |   M       show_words/src/ShowWords.hs
    *   20ea746 Merge branch 'master' into show_words
}}}

Rewritten history:
{{{
    | * a29ccf8 Support multiline input.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/multiline.hs
    | | D       show_words/src/test_1.txt
    | | D       show_words/src/test_2.txt
    | | D       show_words/src/test_3.txt
    | | D       show_words/src/test_words.txt
    | | A       show_words/test/1.txt
    | | A       show_words/test/2.txt
    | | A       show_words/test/3.txt
    | | A       show_words/test/words_jp_ru.txt
    | *   271e25e Merge branch 'show_words' into show_words_multiline
    | |\  
    | | * 11bb38f Move all generic list functions into SgfList.
    | | | A     show_words/src/SgfList.hs
    | | | D     show_words/src/SgfListIndex.hs
    | | | M     show_words/src/SgfOrderedLine.hs
    | | | M     show_words/src/ShowWords.hs
    | * | f3601db Rewrite and rename a little.
    | | | M     show_words/src/multiline.hs
    | * | d30b3d7 Two multiline implementations. Both works.
    | |/  
    | |   D     show_words/src/SgfListIndex.hs
    | |   A     show_words/src/multiline.hs
    | * d00c0b5 Add fixmes.
    |/  
    |   M       show_words/src/ShowWords.hs
    *   52b84e9 Merge branch 'master' into show_words
}}}

As you can see, in the rewritten history file 'SgfListIndex.hs' is deleted at
commit "Two multiline..", and hence it does not exist at commit "Rewrite and
rename.." either. Without this file the project will not compile.

Why did this happen? File 'SgfList.hs' was tracked by 'git log' back to the
"Move all.." commit, where it first appears:

    $ git log --diff-filter=A --oneline -1 4e2d0f9  -- show_words/src/SgfList.hs
    b57f90f Move all generic list functions into SgfList.

then 'git blame' is asked where the lines of 'SgfList.hs' came from, and it
points to the 'SgfListIndex.hs' file from the "Add fixmes" commit, because
that is the parent of "Move all..":

    $ git blame -C -C --incremental 'b57f90f^!' -- show_words/src/SgfList.hs \
        | sed -ne '/^[0-9abcdef]\{40\} /{ s/ .*//; h; }; /^filename /{ H; g; s/\nfilename / /p; };' \
        | uniq
    b57f90f50b176532717d10b9e5ac92a521d90b21 show_words/src/SgfList.hs
    7ff496a4acfbcb20fedb4167134b120bdb468a20 show_words/src/SgfListIndex.hs
    7ff496a4acfbcb20fedb4167134b120bdb468a20 show_words/src/ShowWords.hs

Commits "Two multiline.." and "Rewrite and rename.." are not even reachable
from commit "Move all..", so they can't be the origin of any line in any case.

So, commits "Rewrite and rename.." and "Two multiline.." were added to the
single track file by the tracks of other files (not by that of 'SgfList.hs').
I can check the track files produced by 'examples/jp_gen_track.sh' (more
precisely, by 'track_renames.sh', called for each branch and file by
'examples/jp_gen_track.sh') to find out which tracks reference these two
commits and which files they reference at these commits:

    $ grep -R -e ff1f738 .  | head -n5
    ./track_20121116_221210-f15c70a5b6fc847403f62595cb21d95e8b7a44a2_show_words_src_SgfOrderedLine.hs.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/src/SgfOrderedLine.hs
    ./track_20121116_221215-a5ddf92c8c499833c755af9ea058fb5ad2914c48_show_words_README.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/README
    ./track_20121116_221216-a5ddf92c8c499833c755af9ea058fb5ad2914c48_show_words_tests_testSgfList.hs.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/src/multiline.hs
    ./track_20121116_221213-0d44c0a5b9ade4becc4be793d586aa74a4da039e_.gitignore.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c .gitignore
    ./track_20121116_221213-dd2e8267e848c779501411fd4584e42dcbad875e_show_words_tests_words_words_jp_ru.txt.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/src/test_words.txt

but 'SgfListIndex.hs' is not referenced by any of these track files:

    $ grep -R -e ff1f738 .  | grep SgfListIndex
    $ 
    $ grep -R -e 1d2fe0c . | grep SgfListIndex
    $

and therefore 'SgfListIndex.hs' was deleted at these two commits.

Example 3.

Case 1: a file does not appear in the rewritten history at all (it is
completely deleted from the new repository), because it is not an origin of
any tracked file. Hence, some commits become empty and are deleted as well (by
the '--prune-empty' option).

Original history:
{{{
    | * 132e21f Column equality function in config. Finally implement zipFoldM.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/SgfOrderedLine.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/ShowWordsConfig.hs
    | | M       show_words/src/ShowWordsOutput.hs
    | | D       show_words/src/zipFold.hs
    | * b658f1a Fix monadic version of generalized list eq.
    | | M       show_words/src/zipFold.hs
    | * 37e9130 Generalized list Eq for some eq function. Draft.
    | | M       show_words/src/SgfList.hs
    | | A       show_words/src/zipFold.hs
}}}

Rewritten history:
{{{
    | * ad31503 Column equality function in config. Finally implement zipFoldM.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/SgfOrderedLine.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/ShowWordsConfig.hs
    | | M       show_words/src/ShowWordsOutput.hs
    | * 52e81cc Generalized list Eq for some eq function. Draft.
    | | M       show_words/src/SgfList.hs
}}}

As you can see, file 'zipFold.hs' does not appear in the rewritten history,
because it is not an origin of any tracked file. Hence, the rewritten commit
"Fix monadic.." became empty and was deleted entirely. This can be checked by
searching through the track files:

    $ grep -R -e zipFold .
    $


3. Running examples.

So, once all implications of running 'track_renames.sh' are considered,
writing the other supplementary scripts is fairly easy, and i will not
describe them here (see the source and comments). Here is the correct order in
which the example scripts must be run (note, though, that all of them assume
my repository and paths, hence they will not "just work": you need to write
your own, using these as examples only):

    1. Prepare repository: clone, recreate (local) branches you want to have
       in new repository from remotes and delete remotes:

        $ sh ./jp_prepare.sh

    2. Generate single track file for all files from subdirectory at all refs:

        $ cd jp
        $ sh ../jp_gen_track.sh '' 'subdir' 'show_words'

    3. Rewrite history: 'git filter-branch' will use
       'examples/jp_tree_filter.sh' as '--tree-filter'; the tree filter script
       and the single track file are expected to be one level higher in the
       directory tree (i.e. in '..'):

        $ sh ../jp_rewrite.sh

    4. Clean up: reset working tree to HEAD and remove all untracked files,
       remove backup refs, expire all reflogs, remove all unreferenced
       objects:

        $ sh ../jp_finalize.sh

    5. Check that the rewrite went well (don't forget to check the generated
       diffs to be sure that nothing went wrong):

        $ sh ../jp_check.sh
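
For orientation, the clean-up described in step 4 could look roughly like the
sketch below. It is not the actual 'examples/jp_finalize.sh'; the function
name 'finalize_repo' is hypothetical, and 'refs/original/' is assumed to be
where 'filter-branch' left its backup refs (its default namespace).

```shell
# Clean up a repository after 'git filter-branch'. Run inside the rewritten
# repository. This is destructive: untracked files and old objects are lost.
finalize_repo() {
    # Reset the working tree to HEAD and remove untracked files and dirs.
    git reset --hard HEAD &&
    git clean -f -d &&
    # Delete the backup refs that filter-branch keeps under refs/original/.
    git for-each-ref --format='%(refname)' refs/original/ |
    while IFS= read -r ref; do
        git update-ref -d "$ref"
    done &&
    # Expire all reflogs, then drop every object that is now unreferenced.
    git reflog expire --expire=now --all &&
    git gc --prune=now
}
```

Because every step is chained with '&&', a failure stops the clean-up before
anything further is removed.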


4. Why not use 'git log --follow' to track renames?

Because '--follow' does not always work. And because it's not the right way to
do it. Here you can find more explanations:

    Directory renames without breaking git log

    From the last message by Junio C Hamano in that thread:
    {{{
            But another thing I should mention in this context is that you should not
            take --follow option (at least in the current form) too seriously.

            I see it's been a while --- the last time I did this was October 2006 if I
            am not mistaken.  It's time of the year I should point at one of the most
            important articles ever written on this mailing list:

                http://thread.gmane.org/gmane.comp.version-control.git/27/focus=217

            After understanding what Linus envisioned back then, why what he said are
            important are important and why what he dismissed as uninteresting are
            indeed uninteresting, things to think about now are:

             - "blame" (especially with -C -C), as you found out already, does answer
               the more important question "where did this come from"; and

             - the question "log --follow $this_file" is asking is exactly "where did
               this file come from".  Remember what adjective was used for the
               question in the article?

            I also should mention that --follow was done by Linus as a hack with
            known limitations.

            Potential improvements to follow possible renames fully would involve:

             * allow not just a single path but a set of pathspecs to be recorded
               during --follow traversal;

             * allow the above information be associated with individual commits, not
               as a single global state in the traversal machinery;

             * enhance the logic to update the pathspecs information kept above when
               you hit renames while traversing the history.  An important part of
               this job involves inferring a wholesale rename of a directory by
               looking at many files moved from one place to another, which we
               currently do not do anywhere in git.
    }}}

And here is the message from Linus mentioned above:

    Re: Merge with git-pasky II.

    {{{
    From: Linus Torvalds <torvalds <at> osdl.org>
    Subject: Re: Merge with git-pasky II.
    Newsgroups: gmane.comp.version-control.git
    Date: 2005-04-15 15:32:46 GMT

    On Fri, 15 Apr 2005, David Woodhouse wrote:
    > 
    > And you're right; it shouldn't have to be for renames only. There's no
    > need for us to limit it to one "source" and one "destination"; the SCM
    > can use it to track content as it sees fit.

    Listen to yourself, and think about the problem for a second.

    First off, let's just posit that "files" do not matter. The only thing
    that matters is how "content" moved in the tree. Ok? If I copy a function
    from one fiel to another, the perfect SCM will notice that, and show it as
    a diff that removes it from one file and adds it to another, and is
    _still_ able to track authorship past the move. Agreed?

    Now, you basically propose to put that information in the "commit" log, 
    and that's certainly valid. You can have the commit log say "lines 50-89 
    in file kernel/sched.c moved to lines 100-139 in kernel/timer.c", and then 
    renames fall out of that as one very small special case.

    You can even say "lines 50-89 in file kernel/sched.c copied to.." and 
    allow data to be tracked past not just movement, but also duplication.

    Do you agree that this is kind of what you'd want to aim for? That's a 
    winning SCM concept.

    How do you think the SCM _gets_ at this information? In particular, how 
    are you proposing that we determine this, especially since 90% of all 
    stuff comes in as patches etc? 

    You propose that we spend time when generating the tree on doing so. I'm 
    telling you that that is wrong, for several reasons:

     - you're ignoring different paths for the same data. For example, you 
       will make it impossible to merge two trees that have done exactly the 
       same thing, except one did it as a patch (create/delete) and one did it 
       using some other heuristic.

     - you're doing the work at the wrong point. Doing it _well_ is quite 
       expensive. So if you do it at commit time, you cannot _afford_ to do it 
       well, and you'll always fall back to doing an ass-backwards job that 
       doesn't really get you to the good state, and only gets you to a 
       not-very-interesting easy 1% of the solution (ie full file renames).

     - you're doing the work at the wrong point for _another_ reason. You're 
       freezing your (crappy) algorithm at tree creation time, and basically 
       making it pointless to ever create something better later, because even 
       if hardware and software improves, you've codified that "we have to
       have crappy information".

    Now, look at my proposal: 

     - the actual information tracking tracks _nothing_ but information. You 
       have an SCM that tracks what changed at the only level that really 
       matters, namely the whole project. None of the information actually 
       makes any sense at all at a smaller granularity, since by definition, a
       "project" depends on the other files, or it wouldn't be a project, it
       would be _two_ projects or more.

     - When you're interested in the history of the information, you actually 
       track it, and you try to be _intelligent_ about it. You can actually do 
       a HELL of a lot better than whet you propose if you go the extra mile. 
       For example, let's say that you have a visualization tool that you can 
       use for finding out where a line of code came from. You start out at 
       some arbitrary point in the tree, and you drill down. That's how it 
       works, right?

       So how do you drill down? You simply go backwards in history for that 
       project, tracking when that file+line changed (a "file+line" thing is 
       actually a "sensible" tracking unit at this point, because it makes
       sense within the query you're doing - it's _not_ a sensible thing to
       track at "commit" time, but when you ask yourself "where did this line
       come from", that _question_ makes it sensible. Also note that "where 
       did this _file_ come from is not a sensible question, since the file 
       may have been the combination (or split) of several files, so there is
       no _answer_ to that question"

       So the question then becomes: "how can you reasonably _efficiently_
       find the history of one particular line", and in fact it turns out that 
       by asking the question that way, it's pretty obvious: now that you
       don't have to track the whole repository, you can always try to 
       minimize the thing you're looking for.

       So what you do is walk back the history, and look at the tree objects 
       (both sides when you hit a merge), eand see if that file ever changes. 
       That's actually a very efficient operation in GIT - it matches
       _exactly_ how git tracks things anyway. So it's not expensive at all.

       When that file changes, you need to look if that _line_ changed (and 
       here is where it comes down to usability: from a practical standpoint
       you probably don't care about a single line, you really _probably_ want
       to see changes around it too). So you diff the old state and the new 
       state, and you see if you can still find where you were. If you still 
       can, and the line (and a few lines around it) is still the same, you 
       just continue to drill down. So that's not the interesting case.

       So what happens when you found "ok, that area changed"? Your 
       visualization tool now shows it to the user, AND BECAUSE IT SEES THE 
       WHOLE TREE DIFF, it also shows where it probably came from. At _that_ 
       point, it is actually very trivial to use a modest amount of CPU time, 
       and look for probable sources within that diff. You can do it on modern 
       hardware in basically no time, so your visualization tool can actually 
       notice that

            "oops, that line didn't even exist in the previous version, BUT I
             FOUND FIVE PLACES that matched almost perfectly in the same diff,
             and here they are"

       and voila, your tool now very efficiently showed the programmer that
       the source of the line in question was actually that we had merged 5 
       copies of the same code in different archtiectures into one common
       helper function.

       And if you didn't find some source that matched, or if the old file was
       actually very similar around that line, and that line hadn't been
       "totally new"? That's the easy case again - you show the programmer the
       diff at that point in time, and you let him decide whether that diff 
       was what he was looking for, or whether he wants to continue to "zoom
       down" into the history.

    The above tool is (a) fairly easy to write for git (if you can do 
    visualization tools and (b) _exactly_ what I think most programmers 
    actually want. Tell me I'm wrong. Honestly..

    And notice? My clearly _superior_ algorithm never needed any rename
    information at all. It would have been a total waste of time. It would
    also have hidden the _real_ pattern, which was that a piece of code was
    merged from several other matching pieces of code into one new helper
    function. But if it _had_ been a pure rename, my superior tool would have
    trivially found that _too_. So rename infomation really really doesn't
    matter.

    So I'm claiming that any SCM that tries to track renames is fundamentally
    broken unless it does so for internal reasons (ie to allow efficient
    deltas), exactly because renames do not matter. They don't help you, and 
    they aren't what you were interested in _anyway_.

    What matters is finding "where did this come from", and the git
    architecture does that very well indeed - much better than anything else
    out there. I outlined a simple algorithm that can be fairly trivially
    coded up by somebody who really cares. Sure, pattern matching isn't
    trivial, but you start out with just saying "let's find that exact line,
    and two lines on each side", and then you start improving on that.

    And that "where did this come from" decision should be done at _search_ 
    time, not commit time. Because at that time it's not only trivial to do, 
    but at that time you can _dynamically_ change your search criteria. For 
    example, you can make the "match" algorithm be dependent on what you are 
    looking at.

    If it's C source code, it might want to ignore vairable names when it
    searches for matching code. And if it's a OpenOffice document, you might
    have some open-office-specific tools to do so. See? Also, the person doing 
    the searches can say whether he is interested in that particular line (or 
    even that particial _identifier_ on a line), or whether he wants to see 
    the changes "around" that line.

    All of which are very valid things to do, and all of which my world-view
    supports very well indeed. And all of which your pitiful "files matter" 
    world-view totally doesn't get at all.

    In other words, I'm right. I'm always right, but sometimes I'm more right 
    than other times. And dammit, when I say "files don't matter", I'm really 
    really Right(tm).

    Please stop this "track files" crap. Git tracks _exactly_ what matters, 
    namely "collections of files". Nothing else is relevant, and even 
    _thinking_ that it is relevant only limits your world-view. Notice how the 
    notion of CVS "annotate" always inevitably ends up limiting how people use 
    it. I think it's a totally useless piece of crap, and I've described 
    something that I think is a million times more useful, and it all fell out 
    _exactly_ because I'm not limiting my thinking to the wrong model of the 
    world.

                            Linus
    }}}

5. Note about removing files from subdirectory.

If I want to completely remove a directory from the repository, including the
origins of files from this directory, I have two options:
    - Delete this directory at the tips of all kept branches. Then the files
      of this directory will not be included in the track file, and if none
      of these files or their origins is blamed as an origin (of any
      generation) of one of the tracked files, this directory and all its
      origins will be completely gone.
    - Or track all files from this directory and then remove the tracked
      files (instead of keeping them).

Usually the first approach is better. With an (index or tree) filter that
removes tracked files, every file that is missing from the track at some
commits for the reasons described above (see chapter "2.  Renames tracking
side-effect.") will be revived at exactly these commits!

If I consider again the "three possible endings of history rewrite" schemes
described in chapter 2, then:
    - In case 1, file g will be present in the rewritten history, because it
      is not in the track at all.
    - In case 3, file g will be added at commit S' and deleted at commit Q'
      (to which commit Q will be rewritten), because file g is not on the
      track along the ancestry chain S..Q.
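
The difference between keeping and removing tracked files can be tried on a toy
model. This is only a sketch: the commit names A, S, P, Q, the file names f, g,
x and the track contents are made up, and it is not the exact scheme from
chapter 2. The assumption is that g is missing from the track at S and P. The
sketch uses 'grep -F -x' to select names instead of the 'sort | uniq' pipelines
from the appendices.

```shell
#!/bin/sh
set -eu

# Toy track: "<commit> <filename>" per line, same layout as the real track
# file. Commit names A, S, P, Q and file names f, g, x are made up.
track='A f
A g
S f
P f
Q f
Q g'
# Assume the same three files exist at every commit.
all='f
g
x'

on_track()
{
    # 1 - commit. Print filenames listed in the track for this commit.
    printf '%s\n' "$track" | sed -ne "s/^$1 //p"
}

survivors()
{
    # 1 - mode ('keep' or 'remove'), 2 - commit.
    # Print files left at the commit after the filter has run.
    case "$1" in
        keep )   printf '%s\n' "$all" | grep -F -x    -e "$(on_track "$2")" || true ;;
        remove ) printf '%s\n' "$all" | grep -F -x -v -e "$(on_track "$2")" || true ;;
    esac
}

for c in A S P Q; do
    echo "$c: keep -> $(survivors keep "$c" | tr '\n' ' ')"
    echo "$c: remove -> $(survivors remove "$c" | tr '\n' ' ')"
done
```

In 'remove' mode file g is gone at A and Q, where it is on track, but it
survives at S and P, where the track misses it. In the rewritten history of
this toy model g therefore shows up at S' and is deleted again at Q'.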

Appendix A. Script 'track_renames.sh'.
{{{
#!/bin/sh

# Track specified file starting at specified commit down the history to its
# real origins.
# Arguments:
# 1 - start commit sha1.
# 2 - filename.
# Result:
#   in track file.
# Stdout:
#   track file name.

set -euf

readonly newline='
'
readonly ret_success=0
readonly ret_error=1
OIFS="$IFS"

readonly full_history='1'         # Option (set == non empty).

# FIXME: Change order of date and commit/filename in track filename. Probably,
# commit/filename should be first? If many logs are generated, they'll have
# slightly different times.
# FIXME: Or accept track file prefix through cmd to make different invocations
# to be grouped together.
readonly track_prefix="./track_$(date '+%Y%m%d_%H%M%S')-"
readonly track_suffix='.txt'
track_file=''

set_track_file()
{
    # Set track_file variable (must be declared!) to track file name. Old
    # track file with the same name will be deleted and new one created.
    if [ "x${track_file+x}" = 'x' ]; then
        echo "set_track_file(): track_file variable not set" 1>&2
        return $ret_error
    fi
    track_file="${track_prefix}${1}${track_suffix}"
    rm -f "$track_file" && touch "$track_file"
}

hist_step()
{
    # Track down history of file f starting at commit c0. Result will be
    # commit and filename pairs separated by space, either for each commit in
    # a continuous piece of file f history or for the last commit only (where
    # file f has been added).
    # 1 - start commit sha1.
    # 2 - filename.
    # Result:
    #   track file - (maybe) current history and its end point.
    # Pipe (stdout):
    #   start points for new history.
    if [ $# -lt 2 -o ! -f "$track_file" ]; then 
        echo "hist_step(): Too few arguments or track file does not exist." 1>&2
        exit $ret_error
    fi
    local OIFS="$IFS"

    local c0="$1"       # Start commit (sha1) for file f history.
    local f="$2"        # Filename, which history i will look for.
    local h1=''         # End point (commit and filename) of file f history.
    local c1=''         # End point commit (sha1).
    local hs=''         # Full file f history (commit and filename pairs).
    local h0s=''        # Start points (commit and filename pairs) for
                        # file f 's origins history.
    local sha1_rx='[0-9abcdef]\{40\}'

    # stdout is pipe, so do not write!
    IFS="$newline"
    # I work only with continuous history pieces, hence, file f must exist at
    # commit c0.
    if [ "$(git cat-file -t "${c0}:$f")" != 'blob' ]; then
        echo "hist_step(): File '$f' does not exist at commit '$c0'." 1>&2
        return $ret_error
    fi
    # Take latest file f addition into git as end point to ensure history
    # continuity.
    c1="$(git log --diff-filter=A --pretty='format:%H' -1 "$c0" -- "$f")"
    h1="$c1 $f"
    if [ -n "${full_history:-}" ]; then
        hs="$(git log --pretty='format:%H' --ancestry-path "${c1}..$c0" -- \
                | sed -e"a\\$f" \
                | sed -e'N; s/\n/ /;'
            )"
    fi
    hs="${hs:+${hs}$newline}$h1"
    # Write file f history directly into resulting file.
    echo "$hs" >>"$track_file"

    # Find start points (commit and new filename) for file f 's origins
    # history (probably, consequence of file f rename). I must ensure, that
    # file f 's history end point (both commit and filename match) will not be
    # included as one of start points. This is possible, if some lines to file
    # f were added at the commit, which creates file f.
    h0s="$(git blame -C -C --incremental "${c1}^!" -- "$f" \
                | sed -ne"
                        /^${sha1_rx} /{ s/ .*//; h; };
                        /^filename /{ H; g; s/\nfilename / /p; };
                    " \
                | ( grep -F -v -e "$h1" || true )
        )"
    # The origin commit of a line is always the one that last modified it in
    # the file it is attributed to. But if lines are attributed to another
    # file G, the matched lines in G cannot have been added at commit c1: they
    # must already exist before it (at commit c1^). Hence, boundary commit c1^
    # will be blamed for these lines. I just want to be sure that I'm right.
    if echo "$h0s" | grep -q "^$c1 "; then
        echo "hist_step(): Some lines attributed to other file at end point commit." 1>&2
        echo "hist_step(): This should never happen." 1>&2
        echo "hist_step(): And means some critical flaw in algorithm." 1>&2
        return $ret_error
    fi
    # Result may contain duplicates. Duplicates will go sequentially,
    # because 'git blame --incremental' outputs by commits, not by lines
    # (all lines, which one commit blamed for, then all lines, which other
    # commit blamed for, etc).
    echo "$h0s"
    IFS="$OIFS"
}

rename_hist()
{
    # Track down history of file f starting at commit c0. Unlike hist_step(),
    # this function tracks down to where file f or _all_ of its origins really
    # have first appeared in repository. It combines continuous history pieces
    # for all origins into single track. Hence, some commits, which may be
    # common in histories of several origins, may appear in the track several
    # times.  Resulted track file will contain date and file f name in its
    # filename.
    # 1 - start commit sha1.
    # 2 - filename.
    # Result:
    #   track file - created by hist_step().
    if [ $# -lt 2 ]; then
        echo "rename_hist(): Too few arguments." 1>&2
        return $ret_error
    fi
    local OIFS="$IFS"

    local c0="$1"       # Start commit (sha1) for file f history.
    local f="$2"        # Filename, which history i will now look for.
    local hs=''         # Start points (commit and filename pairs) for
                        # file f 's origins history.
    local h0s=''        # All not yet tracked down start points.

    IFS="$newline"
    set_track_file "${c0}_$(echo "$f" | sed -e's:/:_:g')"
    h0s="$c0 $f"
    while [ -n "$h0s" ]; do
        set -- $h0s
        c0="${1%% *}"
        f="${1#* }"
        shift
        h0s="$*"
        hs="$(hist_step "$c0" "$f")" # If call in 'set --' errexit won't work.
        set -- $hs $h0s
        # 'uniq' is enough, because duplicates, if any, go sequentially (see last
        # comment in hist_step()).
        h0s="$(echo "$*" | uniq)"
    done
    IFS="$OIFS"
}

rename_hist "$@"
echo "$track_file"

exit 0

}}}
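
The step in hist_step() that turns 'git blame --incremental' output into
"<commit> <filename>" start points can be tried without a repository. A minimal
sketch, assuming GNU sed: the input below is a hand-written sample in the
incremental format (not real git output; sha1 values and file names are made
up), and the helper names blame_pairs() and start_points() are hypothetical.

```shell
#!/bin/sh
set -eu

# Same sha1 pattern as in hist_step().
sha1_rx='[0-9abcdef]\{40\}'

blame_pairs()
{
    # stdin: 'git blame --incremental' output.
    # stdout: one "<commit> <filename>" line per blamed group of lines.
    sed -ne"
        /^${sha1_rx} /{ s/ .*//; h; };
        /^filename /{ H; g; s/\nfilename / /p; };
    "
}

start_points()
{
    # 1 - end point "<commit> <filename>" to exclude, as hist_step() does.
    blame_pairs | uniq | ( grep -F -v -e "$1" || true )
}

# Hand-written sample in the incremental format (NOT real git output; the
# sha1 values and file names are made up).
sample='1111111111111111111111111111111111111111 1 1 2
author A U Thor
summary Add old.c
boundary
filename old.c
1111111111111111111111111111111111111111 7 4 1
filename old.c
2222222222222222222222222222222222222222 3 3 1
author A U Thor
summary Rename old.c to new.c
previous 1111111111111111111111111111111111111111 old.c
filename new.c'

printf '%s\n' "$sample" \
    | start_points '2222222222222222222222222222222222222222 new.c'
```

The header line of each group puts the sha1 into the hold space, and the
'filename' line that closes the group is glued to it. The two groups blamed on
the first commit collapse into one line under 'uniq', and the end point itself
is filtered out, so only the pair for 'old.c' is left as a new start point.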

Appendix B. Script 'examples/jp_prepare.sh'.
{{{
#!/bin/sh

# Prepare 'jp' repository for git filter-branch rewriting: clone original
# repository, recreate local branches tracking corresponding remotes (ones,
# which i want to preserve) and remove remote.

set -euf

newline='
'
OIFS="$IFS"

keep_branches='rewriteSplitBy
show_words
show_words_build_for_HP2010.2
show_words_index_by_Writer
show_words_readme'
orig_repo_path='/home/sgf/Documents/jp'
repo_path='/home/sgf/tmp/t'
repo='jp'

rm -rvf "$repo_path/$repo"
cd "$repo_path"
git clone "$orig_repo_path" "$repo"
cd "$repo"
IFS="$newline"
for b in $keep_branches; do
    git branch -t "$b" origin/"$b" || true
done
IFS="$OIFS"
git remote rm origin
git gc --aggressive --prune=now

}}}

Appendix C. Script 'examples/jp_gen_track.sh'.
{{{
#!/bin/sh

# Generate rename history (using track_renames.sh) for 'jp' repository. Script
# must be run in the repository's top directory ('jp' in this case). Script
# 'track_renames.sh' expected to be found one level higher in directory tree
# (i.e.  at '../track_renames.sh').
# Arguments:
# 1 - ref to rewrite.  You may specify either one particular ref or empty ref
# for rewriting all refs under refs/heads.
# 2 - mode ('subdir' or 'files').
# >=3 - for 'subdir' mode, subdirectories which contain files to track; for
# 'files' mode, the files themselves.
# Result:
#   track file one level higher in directory tree (in ../).

set -euf

readonly newline='
'
readonly ret_success=0
readonly ret_error=1
OIFS="$IFS"

refs="refs/heads${1:+/$1}"
subdir=''
keep_files=''
c0=''
f=''
tf=''
# Combine tracks for all requested files into a single track file, (sort and)
# remove duplicates, and move it one directory up at the end.
track_file='jp_track.txt'
# Defaults depending on the number of arguments:
# 0 or 1 - subdir mode.
# 2 - subdir mode with 'show_words' as subdir.
# >2 - depending on mode and other args.
IFS="$newline"
rm -rf "$track_file" && touch "$track_file"
echo "Generate for ref(s):$newline$(git for-each-ref "$refs")"
if [ $# -lt 2 -o "${2:-}" = 'subdir' ]; then
    echo "Subdir mode."
    case $# in
        0 | 1 ) subdir='' ;;
        2 ) subdir='show_words' ;;
        * ) shift 2; subdir="$*" ;;
    esac
    echo "Generating for subdir(s)${subdir:+:${newline}}${subdir:- all.}"
    for c0 in $(git for-each-ref --format='%(objectname) %(refname)' "$refs"); do
        echo "For ref '${c0#* }' .."
        c0="${c0%% *}"
        for f in $(git ls-tree -r --name-only "$c0" $subdir); do
            echo "    .. file $f"
            tf="$(../track_renames.sh "$c0" "$f")"
            cat "$track_file" "$tf" | sort -u >"${track_file}.tmp"
            mv -T "${track_file}.tmp" "$track_file"
        done
    done
elif [ $# -gt 2 -a "${2:-}" = 'files' ]; then
    # At least one filename required.
    echo "Files mode."
    shift 2
    keep_files="$*"
    echo "Generate for files:${newline}$keep_files"
    for c0 in $(git for-each-ref --format='%(objectname)' "$refs"); do
        echo "$c0"
        all_files="$(git ls-tree -r --name-only "$c0")"
        for f in $(echo "${all_files}${keep_files:+${newline}${keep_files}}" \
                    | sort \
                    | uniq -d);
        do
            echo "$f"
            tf="$(../track_renames.sh "$c0" "$f")"
            cat "$track_file" "$tf" | sort -u >"${track_file}.tmp"
            mv -T "${track_file}.tmp" "$track_file"
        done
    done
else
    echo "Incorrect mode '$2' or too few arguments." 1>&2
    exit $ret_error
fi
IFS="$OIFS"
mv -T "$track_file" ../"$track_file"
echo "Resulting track file: '../$track_file'"

}}}
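
The two list operations used by 'jp_gen_track.sh' can be shown in isolation: the
union of tracks with 'sort -u', and the selection of requested files that really
exist at a commit with 'sort | uniq -d'. A small sketch with made-up commit and
file names; the function names merge_tracks() and existing_files() are
hypothetical and do not appear in the script.

```shell
#!/bin/sh
set -eu

newline='
'

merge_tracks()
{
    # 1 - combined track so far (may be empty), 2 - track of one more file.
    # Print the sorted union without duplicate "<commit> <filename>" lines.
    printf '%s\n' "${1:+$1${newline}}$2" | sort -u
}

existing_files()
{
    # 1 - all files at a commit, 2 - requested files (no duplicates inside
    # either list). A name present in both lists occurs twice after 'sort',
    # and 'uniq -d' prints exactly those names once.
    printf '%s\n' "${1}${2:+${newline}$2}" | sort | uniq -d
}

# Made-up commit names and file names.
merge_tracks 'c1 a.txt
c2 a.txt' 'c2 a.txt
c3 b.txt'
existing_files 'a.txt
b.txt
c.txt' 'b.txt
missing.txt'
```

The 'uniq -d' idiom is an intersection only as long as neither input list
contains a name twice on its own. That holds for 'git ls-tree' output, but
the requested file list has to be free of duplicates too.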

Appendix D. Script 'examples/jp_rewrite.sh'.
{{{
#!/bin/sh

# Just call filter-branch with --tree-filter for 'jp' repository. Script must
# be run in the repository's top directory. Tree filter script and track file
# expected to be found one level higher in directory tree (at
# '../jp_tree_filter.sh' and '../jp_track.txt'). Will use tmp directory under
# '/tmp' for filter-branch. Make sure that tmpfs is mounted there (otherwise
# the tree filter will run very slowly).

set -euf
tmp_dir='/tmp/jp_rewrite'
tree_filter="$(cat ../jp_tree_filter.sh)"
rm -rf "$tmp_dir"
# Note, that when using tmp directory for filter-branch, real working tree
# remains unchanged, and, hence, after history rewrite you need to remove
# _all_ files, except .git folder, from working tree and reset --hard to some
# ref.
git filter-branch   --prune-empty -d "$tmp_dir" \
                    --tag-name-filter cat \
                    --tree-filter "$tree_filter" -- --all
}}}

Appendix E. Script 'examples/jp_tree_filter.sh'.
{{{

# Tree filter script for 'jp' repository, which
#   - if directory '$subdir' does not exist at the current commit, removes all
#   files, except listed in the track file.
#   - otherwise (if '$subdir' exist), removes all files, except ones from
#   '$subdir' or ones listed in the track file. Then move all files from
#   '$subdir' to the top repository dir. Any name conflict is fatal. So, be
#   sure to resolve them manually (see below).

newline='
'


subdir='show_words'     # Will keep all files from this subdir.
track_file='/home/sgf/tmp/t/jp_track.txt'       # Will use this track file.
# Keep all files from track.
keep_files="$(sed -ne"s/^$GIT_COMMIT //p" "$track_file")"
# Keep all files from subdir.
keep_subdir="$(git ls-tree -r --name-only "$GIT_COMMIT" "$subdir")"
all_files="$(git ls-tree -r --name-only "$GIT_COMMIT")"

# Remove trailing slashes ('/' will be reduced to empty).
subdir="$(echo "$subdir" | sed -ne'1s:/*$::p')"
# Add files from '$subdir' to list of kept files.
keep_files="$(
    echo "${keep_files:-$keep_subdir}${keep_subdir:+${newline}$keep_subdir}" \
        | sort -u
    )"
# Remove only files not listed in $keep_files.
echo "${all_files:-$keep_files}${keep_files:+${newline}$keep_files}" \
    | sort \
    | uniq -u \
    | xargs -r -d'\n' rm -v

if [ -d "$subdir" ]; then
    # Hardlink files from $subdir to top project dir. If some filename already
    # exists, it'll not be hardlinked and then following `find` fails. I.e. i use
    # hardlinks as flag to indicate whether a file was successfully copied or not.
    cp -PRln "$subdir" -T .
    # Resolve known name conflicts manually.
    if [ -f ".gitignore" -a -f "$subdir/.gitignore" ]; then
        cp -Plv --remove-destination "$subdir/.gitignore" -T '.gitignore'
    fi
    find -P "$subdir" -depth -links '+1' -delete -o -print
fi

}}}
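
The hardlink trick from the tree filter can be tried in a scratch directory.
This is only a sketch under assumptions: GNU cp and find, a filesystem that
supports hard links, and made-up file names. The exit status of 'cp -n' for a
skipped file differs between coreutils versions, so it is ignored here.

```shell
#!/bin/sh
set -eu

# Scratch directory; left in place afterwards for inspection.
work="$(mktemp -d)"
cd "$work"
mkdir sub
echo 'from sub' >sub/a.txt
echo 'from sub' >sub/b.txt
echo 'from top' >b.txt         # Name conflict with sub/b.txt.

# Hardlink everything from sub/ into the current directory without
# overwriting files that already exist there.
cp -PRln sub -T . 2>/dev/null || true

# Files that got linked now have a link count of 2 and are deleted from
# sub/. Whatever is printed was not moved: a conflicting file, or a directory
# that could not be removed because something is still inside it.
leftover="$(find -P sub -depth -links '+1' -delete -o -print 2>/dev/null \
                || true)"
echo "Not moved:"
echo "$leftover"
```

Here 'a.txt' ends up in the top directory, while 'sub/b.txt' is reported as a
leftover and the pre-existing 'b.txt' keeps its content. In the real filter
such a leftover is what signals a name conflict that has to be resolved by
hand, the way '.gitignore' is handled.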

Appendix F. Script 'examples/jp_finalize.sh'.
{{{
#!/bin/sh

# Finalize git filter-branch rewrite of 'jp' repository: reset working tree
# (remove _all_ files, except git repository, and hard reset HEAD), remove
# backup refs, expire all reflogs and remove unreferenced objects to reduce
# .git folder size.

set -euf

newline='
'
repo_path='/home/sgf/tmp/t/jp'

cd "$repo_path"
# Remove _all_ files, except git repository itself. This is needed, because if
# tree-filter uses temp directory, working tree still contains old data.
# Moreover, it most likely still contains track files, generated by
# 'jp_gen_track.sh'.
find -mindepth 1 -maxdepth 1 -name '.git' -prune -o -exec rm -rf {} \;
git reset --hard
git for-each-ref --format='%(refname)' 'refs/original' \
    | xargs -r -d'\n' -n1 git update-ref -d
git reflog expire --expire=now --all
git gc --aggressive --prune=now

}}}

Appendix G. Script 'examples/jp_check.sh'.
{{{
#!/bin/sh

# Generate diffs between "what new history should look like" and real new
# history. If rewrite went well, these diffs (for each kept branch) should be
# either very small or empty.

set -euf

newline='
'
OIFS="$IFS"

orig_repo='/home/sgf/Documents/jp'
new_repo='/home/sgf/tmp/t/jp'
log_dir='/home/sgf/tmp/t/jp'

# Branches, which i have kept during rewrite.
keep_branches='rewriteSplitBy
show_words
show_words_build_for_HP2010.2
show_words_index_by_Writer
show_words_readme'
# Subdir, which files were moved to the top repository dir.
subdir='show_words'
# FIFO to use, when comparing diffs.
cmp_diff_fifo='cmp_diff.fifo'
new_branches=''
b=''
orig_log=''
new_log=''
diff_log=''
prev_diff_log=''
diffs_are_different=''  # Flag used by cmp_diffs(). Empty, when diffs are
                        # identical.

# Remove trailing slashes ('/' will be reduced to empty).
subdir="$(echo "$subdir" | sed -ne'1s:/*$::p')"

gen_log()
{
    # Generate log suitable for gen_diff() for specified branch.
    # Arguments:
    # 1 - branch name.
    git log --pretty='format:%s' --numstat "$1" --
}

gen_diff()
{
    # Remove leading 'subdir' (if present) from all filenames in 1st file,
    # containing `git log --numstat` output. Then generate diff.
    # Arguments:
    # 1 - 1st filename (original history generated by gen_log() expected).
    # 2 - 2nd filename (new history generated by gen_log() expected).
    sed "$1" -e "s@^\(\([[:digit:]]\+\t\)\{2\}\)$subdir/@\1@" \
        | ( diff -u - "$2" || true )
}

reduce_diff()
{
    # Remove some parts from diff to make it suitable for comparison by
    # cmp_diffs().
    # Arguments:
    # 1 - filename (unified diff expected).
    tail -n'+3' "$1" | sed -ne'/^@@ /!p'
}

cmp_diffs()
{
    # Compare two diffs. I will use global variables from caller's
    # environment.
    if [ -n "$prev_diff_log" -a -z "$diffs_are_different" ]; then
        if [ ! -p "$cmp_diff_fifo" ]; then
            echo "cmp_diffs(): FIFO does not exist" 1>&2
            exit 1
        fi
        reduce_diff "$prev_diff_log" > "$cmp_diff_fifo" &
        if ! reduce_diff "$diff_log" | diff -q "$cmp_diff_fifo" - ; then
            echo "cmp_diffs(): Files '$prev_diff_log' and '$diff_log' differ."
            diffs_are_different=1
        fi
    fi
}

# Check, that rewritten repository has exactly all kept branches (no more, no
# less).
if [ -z "$keep_branches" ]; then
    echo "No branches were kept." 1>&2
    exit 1
fi
cd "$new_repo"
new_branches="$(git for-each-ref --format='%(refname)' | sed -e's:.*/::')"
b="$(echo "$keep_branches${new_branches:+${newline}$new_branches}" \
        | sort \
        | uniq -u
    )"
if [ -n "$b" ]; then
    echo "Following branches are missing either from kept branches list or from rewritten repository:" 1>&2
    echo "$b" 1>&2
    exit 1
fi
echo "Branches matched."

# Generate history for each branch in original and new (rewritten) repository.
# Then compare transformed original history (with 'subdir' removed) with real
# new history. Since this transformation is exactly what I expect the
# tree-filter to have done, the diff should be either very small or empty.
IFS="$newline"
mkfifo "$cmp_diff_fifo"
for b in $keep_branches; do
    echo "Checking $b.."
    orig_log="$log_dir/orig_$b.log"
    new_log="$log_dir/new_$b.log"
    diff_log="$log_dir/$b.diff"
    cd "$orig_repo"
    gen_log "$b" > "$orig_log"
    cd "$new_repo"
    gen_log "$b" > "$new_log"
    gen_diff "$orig_log" "$new_log" > "$diff_log"
    cmp_diffs
    prev_diff_log="$diff_log"
done
rm -vf "$cmp_diff_fifo"
diffs_are_different="${diffs_are_different:+Some diffs are different.}"
diffs_are_different="${diffs_are_different:-All diffs are identical.}"
echo "Diffs generated. $diffs_are_different"

exit 0

}}}
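
The transformation that gen_diff() applies to the original log can be checked on
a few lines of made-up 'git log --numstat' output. A sketch, assuming GNU sed
(for '\+' and '\t'); the function name strip_subdir() and the file names are
hypothetical.

```shell
#!/bin/sh
set -eu

strip_subdir()
{
    # 1 - subdir name without trailing slash. Filters stdin to stdout.
    # Same expression as in gen_diff(): drop "<subdir>/" right after the two
    # numeric columns of a numstat line.
    sed -e "s@^\(\([[:digit:]]\+\t\)\{2\}\)$1/@\1@"
}

# Made-up fragment of 'git log --pretty=format:%s --numstat' output.
{
    echo 'Add reader'
    printf '%s\t%s\t%s\n' 3 1 'show_words/src/Reader.hs'
    printf '%s\t%s\t%s\n' 10 0 'README'
    printf '%s\t%s\t%s\n' - - 'show_words/logo.png'
} | strip_subdir show_words
```

Only lines that start with two numeric columns are touched, so subject lines
stay as they are. Lines for binary files, which numstat reports with '-'
instead of counts, do not match either and keep their 'show_words/' prefix, so
they would still appear in the resulting diff.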

Appendix H. Script 'examples/ex_index_filter.sh'.
{{{
# Example of index filter, which removes all files, except listed in the track
# file.

newline='
'
track_file='/home/sgf/tmp/t/jp_track.txt'
keep_files="$(sed -ne"s/^$GIT_COMMIT //p" "$track_file")"
all_files="$(git ls-files -c)"
# Remove only files not listed in $keep_files.
echo "${all_files:-$keep_files}${keep_files:+${newline}${keep_files}}" \
    | sort \
    | uniq -u \
    | xargs -r -d'\n' git rm --cached
}}}
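
The 'sort | uniq -u' pipeline that picks the files to remove can be tried on
plain lists. A minimal sketch with made-up file names; the function name
files_to_remove() is hypothetical.

```shell
#!/bin/sh
set -eu

newline='
'

files_to_remove()
{
    # 1 - all files at the commit, 2 - files to keep (both newline separated,
    # no duplicates inside either list).
    # Names present in both lists occur twice after 'sort' and are dropped by
    # 'uniq -u'; names present in only one list are printed.
    printf '%s\n' "${1:-$2}${2:+${newline}$2}" | sort | uniq -u
}

files_to_remove 'a.txt
b.txt
c.txt' 'b.txt'
```

Strictly speaking this computes a symmetric difference: a name that is only in
the keep list would be printed as well and then handed to 'git rm --cached'
(or 'rm'), which fails for a path that does not exist. The filter therefore
relies on the track listing, for each commit, only files that exist at that
commit.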

Update1. Add 'examples/ex_index_filter.sh'. Several small fixes.
Update2. Add note about removing files from subdirectory.
