Bisecting Pharo versions to find regressions

From time to time it happens that a bug is accidentally introduced and we realize it several versions later. If the cause of the bug is not clear, one good strategy is to find the piece of code change that introduced the bug, and engineer a test and fix from that change. If we have the entire history of changes of the project, we can then extract this information from the commits.

In this post, we will show how we can apply a bisection of Pharo builds easily using the Pharo Launcher to find the cause of a bug. In particular, I wanted to show the case of a real bug, from which you’ll find the issue here https://github.com/pharo-project/pharo/issues/6012: the code completion menu was not being closed when clicking outside of it.

There is git bisect…

Git provides a pretty useful command called git bisect that helps you at finding the culprit commit. Git bisect implements a binary search on commits: it proposes you commits that you have to test and mark as good or bad. Based on how you tag a commit it will look for another commit and eventually find you the exact commit that introduced the problem.

Git bisect can be pretty handy at finding bugs, but it can be pretty heavy when on each step you need to do a long build process just to test it. This is exactly our case when bisecting the pharo history: we need to build an image.

We are not going to go into much details with git bisect, but if you want to see some more docs on it, you can take a look at the official docs in here: https://git-scm.com/docs/git-bisect.

Image bisection with the Pharo launcher

The Pharo launcher has a super fancy feature that can be used for bisection: it allows downloading any previous build of Pharo that is stored in the Pharo file server. This saves us from building an image for each version we are digging in! It is important to know at this point that the Pharo file server stores all succeeding builds of Pharo, which are almost all of them, and that there is a build per PR. So this will save us some time at attacking the issue but it will be a bit less precise because a PR can contain many commits. However, once the PR is identified, in general the commits in it will be all related.

In the Pharo9.0 template category we have listed all its builds with number and associated commit

Once we know this we can do the bisect ourselves. For example, if we want to test the entire set of 748 builds, we will first test 748 / 2 = build #374. If it is broken, it means that the problem was introduced in between builds #1 and #374, and we need to continue testing with 374 / 2 = build #187. Otherwise the bug was introduced between build #374 and build #748 and we should test with 748 + 374 / 2 = build #561. We can continue like that until we find the build X where X is working and X+1 is broken.

The advantage of doing it as a binary search comes from the fact that we cut the space search by 2 every time. This makes the search a log2 operation. In practical terms: if we have 1000 commits, we will have to do log2 1000 = ~10 searches to find the culprit. Which is far better than linearly searching the 1000 commits one by one :).

Finding the problematic PR

The issue we were interested in did not exist on build #162 and it stopped working in build #163. The next step is to understand what was introduced in build #163. Once we have the breaking build, we need to obtain the PR that lead to the change. We can obtain the PR commit from the image file name given by the launcher, or we can get it from the about and help dialogs in Pharo.

The about and help dialogs have precise information of how the image was built.

Once we have the commit, the next step is to look for it in git in your favorite tool: the command line, iceberg, any external GUI based git tool, or even github. Since there are no integrations in Pharo that are not pull requests, each build commit will effectively be a PR merge commit. In our real bug example, the problematic commit was this one, which was the integration of a PR I did myself ( 🙂 ).

The problematic commit is always a PR integration in Pharo

Now that we have the problem, we can engineer a fix for it.

Analyzing the bug

So again, this was working on build #162. It stopped working with my changes in #163. To understand the issue my strategy was the following: compare how the execution flows in both the working and non working builds.

My first step was to understand how the correct case was working in build #162. I added a breakpoint in the closing of the code completion menu, tried to auto-complete something and clicked outside. The stack trace looked as follows:

SmallInteger(Object)>>halt
CompletionEngine>>closeMenu
...
RubEditingArea(Morph)>>announceKeyboardFocusChange:
RubEditingArea(Morph)>>keyboardFocusChange:
RubEditingArea(RubAbstractTextArea)>>keyboardFocusChange:
HandMorph>>newKeyboardFocus:
RubTextScrollPane(Morph)>>takeKeyboardFocus
NECMenuMorph>>mouseDown:
NECMenuMorph(Morph)>>handleMouseDown:
MouseButtonEvent>>sentTo:
NECMenuMorph(Morph)>>handleEvent:
NECMenuMorph(Morph)>>handleFocusEvent:
...

We can see that the closing of the code completion menu happens when it loses the keyboard focus. Looking at the variables in the stack, the focus before the click was:

a RubEditingArea(583942144)

And the click is requesting focus on

a RubTextScrollPane(76486144)

But going a bit up in the stack, the code that produces the change in the focus is

NECMenuMorph >> mouseDown: anEvent
        ...
        self flag: #pharoFixMe "ugly hack".
	engine editor morph owner owner   <--------
		takeKeyboardFocus;
		handleMouseDown: evt.

A more insidious question: why does the NECMenuMorph receives the click? If I clicked outside of it!!! That is because the menu morph requests “mouse focus” when it is shown

NECMenuMorph >> show
	self resize.
	self activeHand 
		newMouseFocus: self.
	self changed.

Comparing to the parent commit

A similar analysis in build 163 shows:

  1. NECMenuMorph does not receive the mouseDown event
  2. The NECMenuMorph never becomes mouseFocus (I added a traceCr on mouse focus change, see below)
mouseFocus: aMorphOrNil
  aMorphOrNil traceCr.
  ...

3. The code of show was changed (by myself) to not change the focus if no hand is available.

show
	self resize.
	self activeHand ifNotNil: [ :hand | hand newMouseFocus: self ].
	self changed.

The problem was that the NECMenuMorph was trying to access the hand before being installed in the world! And the current hand depends on the world where we are installed. Otherwise, we need global state, which I was trying to minimize :)…

A solution

The solution I implemented was to call show once we are sure the morph is in the world.

openInWorld

	super openInWorld.
	self show

And avoid calling show before it:

narrowCompletion

	self selected: 0.
	firstVisible := 1.
	context narrowWith: context completionToken.
	(context entries size = 1 and: [ context entries first contents = context completionToken ]) ifTrue: [
		self delete.
		^ false ].
	context hasEntries ifTrue: [ self selected: 1 ].
	^ true

This would mean that we can inline show in the openInWorld method and then remove the conditional. We can also argue that show is not morphic vocabulary…

openInWorld
	super openInWorld.
        self resize.
	self activeHand newMouseFocus: self.
	self changed.

Conclusion

In this post we have seen how we can chase a regression in Pharo by bisecting Pharo builds. Once the culprit build is identified, we can navigate from there to the pull request and thus the commits that cause the problem.

We have also shown how to compare two executions to understand the differences in behaviour, and finally how this was fixed in a real case.

Published by Guille Polito

Pharo dev. Researcher, engineer and father. > If it ain't tested, it does not exist.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: