Live Debugging

Tue, March 13, 2012, 07:33 PM under Random | SoftwareProcess

Based on my classification of diagnostics, you should know what live debugging is NOT about - at least according to me :-) and in this post I'll share how I think of live debugging.

These are the (outer) steps to live debugging

  1. Get the debugger in the picture.
  2. Control program execution.
  3. Inspect state.
  4. Iterate between 2 and 3 as necessary.
  5. Stop debugging (and potentially start new iteration going back to step 1).

Step 1 has two options: start with the debugger attached, or execute your binary separately and attach the debugger later. You might say there is a 3rd option, where the app notifies you that there is an issue, referred to as JIT debugging. However, that is just a variation of the attach because that is when you start the debugging session: when you attach. I'll be covering in future posts how this step works in Visual Studio.

Step 2 is about pausing (or breaking) your app so that it makes no progress and remains "frozen". A sub-variation is to pause only parts of its execution, or in other words to freeze individual threads. I'll be covering in future posts the various ways you can perform this step in Visual Studio.

Step 3, is about seeing what the state of your program is when you have paused it. Typically it involves comparing the state you are finding, with a mental picture of what you thought the state would be. Or simply checking invariants about the intended state of the app, with the actual state of the app. I'll be covering in future posts the various ways you can perform this step in Visual Studio.

Step 4 is necessary if you need to inspect more state - rinse and repeat. Self-explanatory, and will be covered as part of steps 2 & 3.

Step 5 is the most straightforward, with 3 options: Detach the debugger; terminate your binary though the normal way that it terminates (e.g. close the main window); and, terminate the debugging session through your debugger with a result that it terminates the execution of your program too. In a future post I'll cover the ways you can detach or terminate the debugger in Visual Studio.

I found an old picture I used to use to map the steps above on Visual Studio 2010. It is basically the Debug menu with colored rectangles around each menu mapping the menu to one of the first 3 steps (step 5 was merged with step 1 for that slide). Here it is in case it helps:

live debugging

Stay tuned for more...


The way I think about Diagnostic tools

Tue, March 13, 2012, 06:47 PM under Random | SoftwareProcess

Every software has issues, or as we like to call them "bugs". That is not a discussion point, just a mere fact. It follows that an important skill for developers is to be able to diagnose issues in their code. Of course we need to advance our tools and techniques so we can prevent bugs getting into the code (e.g. unit testing), but beyond designing great software, diagnosing bugs is an equally important skill.

To diagnose issues, the most important assets are good techniques, skill, experience, and maybe talent. What also helps is having good diagnostic tools and what helps further is knowing all the features that they offer and how to use them.

The following classification is how I like to think of diagnostics. Note that like with any attempt to bucketize anything, you run into overlapping areas and blurry lines. Nevertheless, I will continue sharing my generalizations ;-)

It is important to identify at the outset if you are dealing with a performance or a correctness issue.

  1. If you have a performance issue, use a profiler. I hear people saying "I am using the debugger to debug a performance issue", and that is fine, but do know that a dedicated profiler is the tool for that job. Just because you don't need them all the time and typically they cost more plus you are not as familiar with them as you are with the debugger, doesn't mean you shouldn't invest in one and instead try to exclusively use the wrong tool for the job. Visual Studio has a profiler and a concurrency visualizer (for profiling multi-threaded apps).
  2. If you have a correctness issue, then you have several options - that's next :-)

This is how I think of identifying a correctness issue

  1. Do you want a tool to find the issue for you at design time? The compiler is such a tool - it gives you an exact list of errors. Compilers now also offer warnings, which is their way of saying "this may be an error, but I am not smart enough to know for sure". There are also static analysis tools, which go a step further than the compiler in identifying issues in your code, sometimes with the aid of code annotations and other times just by pointing them at your raw source. An example is FxCop and much more in Visual Studio 11 Code Analysis.
  2. Do you want a tool to find the issue for you with code execution? Just like static tools, there are also dynamic analysis tools that instead of statically analyzing your code, they analyze what your code does dynamically at runtime. Whether you have to setup some unit tests to invoke your code at runtime, or have to manually run your app (and interact with it) under the tool, or have to use a script to execute your binary under the tool… that varies. The result is still a list of issues for you to address after the analysis is complete or a pause of the execution when the first issue is encountered. If a code path was not taken, no analysis for it will exist, obviously. An example is the GPU Race detection tool that I'll be talking about on the C++ AMP team blog. Another example is the MSR concurrency CHESS tool.
  3. Do you want you to find the issue at design time using a tool? Perform a code walkthrough on your own or with colleagues. There are code review tools that go beyond just diffing sources, and they help you with that aspect too. For example, there is a new one in Visual Studio 11 and searching with my favorite search engine yielded this article based on the Developer Preview.
  4. Do you want you to find the issue with code execution? Use a debugger - let’s break this down further next.

This is how I think of debugging:

  1. There is post mortem debugging. That means your code has executed and you did something in order to examine what happened during its execution. This can vary from manual printf and other tracing statements to trace events (e.g. ETW) to taking dumps. In all cases, you are left with some artifact that you examine after the fact (after code execution) to discern what took place hoping it will help you find the bug. Learn how to debug dump files in Visual Studio.
  2. There is live debugging. I will elaborate on this in a separate post, but this is where you inspect the state of your program during its execution, and try to find what the problem is. More from me in a separate post on live debugging.
  3. There is a hybrid of live plus post-mortem debugging. This is for example what tools like IntelliTrace offer.

If you are a tools vendor interested in the diagnostics space, it helps to understand where in the above classification your tool excels, where its primary strength is, so you can market it as such. Then it helps to see which of the other areas above your tool touches on, and how you can make it even better there. Finally, see what areas your tool doesn't help at all with, and evaluate whether it should or continue to stay clear. Even though the classification helps us think about this space, the reality is that the best tools are either extremely excellent in only one of this areas, or more often very good across a number of them. Another approach is to offer a toolset covering all areas, with appropriate integration and hand off points from one to the other.

Anyway, with that brain dump out of the way, in follow-up posts I will dive into live debugging, and specifically live debugging in Visual Studio - stay tuned if that interests you.


Filing bugs

Sat, November 13, 2010, 07:14 PM under SoftwareProcess

I've previously wrote a lengthy post on Bug Triage, and I just came across a related lengthy post by Eric that is definitely worth reading: how to file bugs.


Bug Triage

Sun, December 6, 2009, 07:56 PM under SoftwareProcess
In this blog post brain dump, I'll attempt to describe the process my team tries to follow when dealing with new bug reports (specifically, code defect reports). This is not official Microsoft policy, just the way we do things… if you do things differently and want to share, you can do so at the bottom in the comments (or on your blog).

Feature Triage Team
A subset of the feature crew, the triage team (which has representations from the PM, Dev and QA disciplines), looks at all unassigned bugs at regular intervals. This can be weekly or daily (or other frequency) dependent on which part of the product cycle we are in and what the untriaged bug load looks like. They discuss each bug considering the evidence and make a decision of whether the bug goes from Not Yet Assigned to Assigned (plus the name of the DEV to fix this) or whether it goes from Active to Resolved (which means it gets assigned back to the requestor for closure or further debate if they were not present at the triage meeting). Close to critical milestones, the feature triage team needs to further justify bugs they take to additional higher-level triage teams.

Bug Opened = Not Yet Assigned
Someone (typically an SDET from the QA team) creates the bug item (e.g. in TFS), ensuring they populate all the relevant fields including: Title, Description, Repro Steps (including the Actual Result at the end of the steps), attachments of code and/or screenshots, Build number that they observed the issue in, regression details if applicable, how it was found, if a test case exists or needs to be created etc. They also indicate their opinion on the Priority and Severity. The bug status is left as Not Yet Assigned.

"Issue" versus "Fix for issue"
The solution to some bugs is easy to determine, e.g. "bug: the column name is misspelled". Obviously the fix is to correct the spelling – still, the triage team should be explicit and enter the correct spelling in the bug's Description. Note that a bad bug name here would be "bug: fix the spelling of the column" (it describes the solution, rather than the problem).

Other solutions are trickier to establish, e.g. "bug: the column header is not accessible (can only be clicked on with the mouse, not reached via keyboard)". What is the correct solution here? The last thing to do is leave this undetermined and just assign it to a developer. The solution has to be entered in the description. Behind this type of a bug usually hides a spec defect or a new feature request.

The person opening the bug should focus on describing the issue, rather than the solution. The person indicates what the fix is in their opinion by stating the Expected Result (immediately after stating the Actual Result). If they have a complex suggested solution, that should be split out in a separate part, but the triage team has the final say before assigning it. If the solution is lengthy/complicated to describe, the bug can be assigned to the PM. Note: the strict interpretation suggests that any bug with no clear, obvious solution is always a hole in the spec and should always go to the PM. This also ensures the spec gets updated.

Not Yet Assigned -> Not Yet Assigned (on someone else's plate)
If the bug is observed in our feature, but the cause is actually another team, we change the Area Path (which is the way we identify teams in TFS) and leave it as Not Yet Assigned. The triage team may add more comments as appropriate including potentially changing the repro steps. In some cases, we may even resolve the bug in our area path and open a new bug in the area path of the other team.

Even though there is no action on a dev on the team, the bug still needs to be tracked. One way of doing this is to implement some notification system that informs the team when the tracked bug changed status; another way is to occasionally run a global query (against all area paths) for bugs that have been opened by a member of the team and follow up with the current owners for stale bugs.

Not Yet Assigned -> Resolved
This state transition can only be made by the Feature Triage Team.

0. Sometimes the bug description is not clear and in that case it gets Resolved as More Information Needed, so the original requestor can provide it.
After understanding what the bug item is about, the first decision is to determine whether it needs to go to a dev.

1. If it is a known bug, it gets resolved as "Duplicate" and linked to the existing bug.

2. If it is "By Design" it gets resolved as such, indicating that the triage team does not think this is a bug.

3. If the bug does not repro on latest bits, it is resolved as "No Repro"

4. The most painful: If it is decided that we cannot fix it for this release it gets resolved as "Postponed" or "Won't Fix". The former is typically due to resources and time constraints, while the latter is due to deciding that it is not important enough to consume our resources in any release (yes, not all bugs must be fixed!). For both cases, there are other factors that contribute to the decision such as: existence of a reasonable workaround, frequency we expect users to encounter the issue, dependencies on other team to offer a solution, whether it breaks a core scenario, whether it prohibits customer feedback on a major feature, is it a regression from a previous release, impact of the fix on other partner teams (e.g. User Education, User Experience, Localization/Globalization), whether this is the right fix, does the fix impact performance goals, and last but not least, severity of bug (e.g. loss of customer data, security threat, crash, hang). The bar for fixing a bug goes up as the release date approaches. The triage team becomes hardnosed about which bugs to take, while the developers are busy resolving assigned bugs thus everyone drives for Zero Bug Bounce (ZBB). ZBB is when you have 0 active bugs older than 48 hours.

Not Yet Assigned -> Assigned
If the bug is something we decide to fix in this release and the solution is known, then it is assigned to a DEV. This is either the developer that will do the work, or a Lead that can further assign it to one of his developer team based on a load balancing algorithm of their choosing.

Sometimes, the triage team needs the dev to do some investigation work before deciding whether to take the fix; similarly, the checkin for the fix may be gated on code review by the triage team. In these cases, these instructions are provided in the comments section of the bug and when the developer is done they notify the triage team for final decision.

Additionally, a Priority and Severity (from 0 to 4) has to be entered, e.g. a P0 means "drop anything you are doing and fix this now" whereas a P4 is something you get to after all P0,1,2,3 bugs are fixed.

From a testing perspective, if the bug was found through ad-hoc testing or an external team, the decision is made whether test cases should be added to avoid future regressions. This is communicated to the QA team.

Assigned -> Resolved
When the developer receives the bug (they should be checking daily for new bugs on their plate looking at bugs in order of priority and from older to newer) they can send it back to triage if the information is not clear. Otherwise, they investigate the bug, setting the Sub Status to "Investigating"; if they cannot make progress, they set the Sub Status to "Blocked" and discuss this with triage or whoever else can help them get unblocked. Once they are unblocked, they set the Sub Status to "Working on Solution"; once they are code complete they send a code review request, setting the Sub Status to "Fix Available". After the iterative code review process is over and everyone is happy with the fix, the developer checks it in and changes the state of the bug from Active (and Assigned to them) to Resolved (and Assigned to someone else).

The developer needs to ensure that when the status is changed to Resolved that it is assigned to a QA person. For example, maybe the PM opened the bug, but it should be a QA person that will verify the fix - the developer needs to manually change the assignee in that case. Typically the QA person will send an email to the original requestor notifying them that the fix is verified.

Resolved -> ??
In all cases above, note that the final state was Resolved. What happens after that? The final step should be Closed. The bug is closed once the QA person verifying the fix is happy with it. If the person is not happy, then they change the state from Resolved to Active, thus sending it back to the developer. If the developer and QA person cannot reach agreement, then triage can be brought into it. An easy way to do that is change the status back to Not Yet Assigned with appropriate comments so the triage team can re-review.

It is important to note that only QA can close a bug. That means that if the opener of the bug was a PM, when the bug gets resolved by the dev it may land on the PM's plate and after a quick review, the PM would re-assign to an SDET, which is the only role that can close bugs. One exception to this is if the person that filed the bug is external: in that case, we leave it Resolved and assigned to them and also send them a notification that they need to verify the fix. Another exception is if specialized developer knowledge is needed for verifying the bug fix (e.g. it was a refactoring suggestion bug typically not observable by the user) in which case it is fine to have a developer verify the fix, and ideally a different developer to the one that opened the bug.

Other links on bug triage
A quick search reveals that others have talked about this subject, e.g. here, here, here, here and here.

Your take?
If you have other best practices your team uses to deal with incoming bug reports, feel free to share in the comments below or on your blog.

Customer visits round 2

Tue, March 15, 2005, 04:36 PM under SoftwareProcess
I have returned from another round of customer visits (in the Midlands this time). Lots of positive feedback on IQView and great ideas contributing to our current plans for a new product (can't say much more on that, apart from that it is also on WinCE with CF). In terms of generic shareable takeaways, everything I said last time was re-enforced on this visit too. A couple more:

7. You can learn as much by listening to your client talk about competitive products as you can from when they talk about yours; encourage it!

8. If you are already the UK's market leader, maybe you should try to obtain feedback from other countries [OK, so this is more of a hint to my manager to send me to our customers abroad :-]

Now back to catching up with my spam... later...

The Importance of Customer Interaction

Thu, March 10, 2005, 12:24 PM under SoftwareProcess
Over the last two days we spent time visiting client sites around the country and obtaining feedback on our current products & future thoughts. As much as I was not ecstatic to be offline for two days, I can't stress how important these kinds of events are to a developer. If you have the opportunity to go meet the end users of your magnificent creations, then go do it now. It is invaluable!

I could bore you stiff with the specifics discussed, but I won't. Instead, here are some general thoughts which I think apply to most products.

1. How do you please both the savvy and the newbie? The one that explicitly asks for XML export and the one that says this is the first touch screen device they ever came across!

2. It is hard to draw a line of where the features of one product end and another's begin. What's the point of offering features that drive sales up on this but cut sales from that - dropping the overall profit!

3. Should we abandon PCs and do everything in embedded devices (WinCE, not XPe) or should we dumb them down and have them slaves to PC applications? Everybody can be persuaded one way or the other; the devil's advocate always wins!

4. When you replace an older product, you better make sure *all* features of the original are carried over. There is no feature that "nobody ever used that, it was just there"!

5. Do users understand that this response is bad for my job security: "I love it; it does everything I want; don't change it".

6. Are shortcuts (i.e. having more than one way to do a task) good or bad?

While I ponder on the above (along with 4 A4 pages of notes), please bear with me for any unanswered emails; I will slowly catch up with 5 email accounts and hundreds of blogs, not to mention the newsgroups... hopefully before the next round of customer visits next week.