From time to time it happens that a bug is accidentally introduced and we realize it several versions later. If the cause of the bug is not clear, one good strategy is to find the piece of code change that introduced the bug, and engineer a test and fix from that change. If we have the entire history of changes of the project, we can then extract this information from the commits.
In this post, we will show how we can apply a bisection of Pharo builds easily using the Pharo Launcher to find the cause of a bug. In particular, I wanted to show the case of a real bug, from which you’ll find the issue here https://github.com/pharo-project/pharo/issues/6012: the code completion menu was not being closed when clicking outside of it.
There is git bisect…
Git provides a pretty useful command called
git bisect that helps you at finding the culprit commit. Git bisect implements a binary search on commits: it proposes you commits that you have to test and mark as good or bad. Based on how you tag a commit it will look for another commit and eventually find you the exact commit that introduced the problem.
Git bisect can be pretty handy at finding bugs, but it can be pretty heavy when on each step you need to do a long build process just to test it. This is exactly our case when bisecting the pharo history: we need to build an image.
We are not going to go into much details with git bisect, but if you want to see some more docs on it, you can take a look at the official docs in here: https://git-scm.com/docs/git-bisect.
Image bisection with the Pharo launcher
The Pharo launcher has a super fancy feature that can be used for bisection: it allows downloading any previous build of Pharo that is stored in the Pharo file server. This saves us from building an image for each version we are digging in! It is important to know at this point that the Pharo file server stores all succeeding builds of Pharo, which are almost all of them, and that there is a build per PR. So this will save us some time at attacking the issue but it will be a bit less precise because a PR can contain many commits. However, once the PR is identified, in general the commits in it will be all related.
Once we know this we can do the bisect ourselves. For example, if we want to test the entire set of 748 builds, we will first test 748 / 2 = build #374. If it is broken, it means that the problem was introduced in between builds #1 and #374, and we need to continue testing with 374 / 2 = build #187. Otherwise the bug was introduced between build #374 and build #748 and we should test with 748 + 374 / 2 = build #561. We can continue like that until we find the build X where X is working and X+1 is broken.
The advantage of doing it as a binary search comes from the fact that we cut the space search by 2 every time. This makes the search a log2 operation. In practical terms: if we have 1000 commits, we will have to do log2 1000 = ~10 searches to find the culprit. Which is far better than linearly searching the 1000 commits one by one :).
Finding the problematic PR
The issue we were interested in did not exist on build #162 and it stopped working in build #163. The next step is to understand what was introduced in build #163. Once we have the breaking build, we need to obtain the PR that lead to the change. We can obtain the PR commit from the image file name given by the launcher, or we can get it from the about and help dialogs in Pharo.
Once we have the commit, the next step is to look for it in git in your favorite tool: the command line, iceberg, any external GUI based git tool, or even github. Since there are no integrations in Pharo that are not pull requests, each build commit will effectively be a PR merge commit. In our real bug example, the problematic commit was this one, which was the integration of a PR I did myself ( 🙂 ).
Now that we have the problem, we can engineer a fix for it.
Analyzing the bug
So again, this was working on build #162. It stopped working with my changes in #163. To understand the issue my strategy was the following: compare how the execution flows in both the working and non working builds.
My first step was to understand how the correct case was working in build #162. I added a breakpoint in the closing of the code completion menu, tried to auto-complete something and clicked outside. The stack trace looked as follows:
We can see that the closing of the code completion menu happens when it loses the keyboard focus. Looking at the variables in the stack, the focus before the click was:
And the click is requesting focus on
But going a bit up in the stack, the code that produces the change in the focus is
NECMenuMorph >> mouseDown: anEvent ... self flag: #pharoFixMe "ugly hack". engine editor morph owner owner <-------- takeKeyboardFocus; handleMouseDown: evt.
A more insidious question: why does the NECMenuMorph receives the click? If I clicked outside of it!!! That is because the menu morph requests “mouse focus” when it is shown
NECMenuMorph >> show self resize. self activeHand newMouseFocus: self. self changed.
Comparing to the parent commit
A similar analysis in build 163 shows:
- NECMenuMorph does not receive the mouseDown event
- The NECMenuMorph never becomes mouseFocus (I added a traceCr on mouse focus change, see below)
mouseFocus: aMorphOrNil aMorphOrNil traceCr. ...
3. The code of
show was changed (by myself) to not change the focus if no hand is available.
show self resize. self activeHand ifNotNil: [ :hand | hand newMouseFocus: self ]. self changed.
The problem was that the
NECMenuMorph was trying to access the hand before being installed in the world! And the current hand depends on the world where we are installed. Otherwise, we need global state, which I was trying to minimize :)…
The solution I implemented was to call
show once we are sure the morph is in the world.
openInWorld super openInWorld. self show
And avoid calling show before it:
narrowCompletion self selected: 0. firstVisible := 1. context narrowWith: context completionToken. (context entries size = 1 and: [ context entries first contents = context completionToken ]) ifTrue: [ self delete. ^ false ]. context hasEntries ifTrue: [ self selected: 1 ]. ^ true
This would mean that we can inline show in the
openInWorld method and then remove the conditional. We can also argue that
show is not morphic vocabulary…
openInWorld super openInWorld. self resize. self activeHand newMouseFocus: self. self changed.
In this post we have seen how we can chase a regression in Pharo by bisecting Pharo builds. Once the culprit build is identified, we can navigate from there to the pull request and thus the commits that cause the problem.
We have also shown how to compare two executions to understand the differences in behaviour, and finally how this was fixed in a real case.