Studying Programming through Making
Amy J. Ko, Ph.D.
University of Washington
I started as an undergrad in CS and Psychology (1998-2002), and stumbled into research with Margaret Burnett, who was studying spreadsheet testing.
I did my Ph.D. at Carnegie Mellon University, with HCI researcher Brad Myers (2002-2008). Later, I started publishing in Software Engineering with the help of other mentors at CMU, such as Jonathan Aldrich and Gail Murphy.
I then joined UW’s Information School, mostly publishing tools and studies in HCI, SE, and Computing Education, always about programming.
These days, I think of myself as a Computing Education researcher, who applies PL, SE, and HCI methods and ideas to the learning and teaching of computing.
About this talk
How can we learn things about programming through the things we make?
If you have a PL background, you’re probably used to proofs, and perhaps some benchmark studies.
But there are other ways of knowing.
Observing what people do
Asking people about their practices
Theorizing about behavior, cognition, social contexts
Most of academia, including HCI, uses these ways to make progress. PL should too.
“User studies” are about more than checking a box for publication; they’re a way of deciding what to make and understanding the significance of what you’ve made.
One study is not enough to understand something; if it takes dozens to believe a drug or vaccine is safe, why would one be enough in CS?
Study design is situated and highly complex. Take it slowly.
HCI is more than user studies—it’s a massive research area with hundreds of subareas and tens of thousands of researchers, inventing the future of interactive computing and understanding its impacts. Studying programming is a sliver of it.
About the rest of this talk
I’ll teach these ideas by telling the story of my dissertation (2002-2008) in five parts:
Part I: “What is debugging???”
Part II: “Aha”
Part III: “How do I know if it works?”
Part IV: “How would this really work…”
Part V: “Does this really work?”
We’ll then return to the learning objectives to reflect, and I’ll share some resources to learn more.
“What is debugging???”
My interests when I started grad school
Broadly, I was vaguely interested in “making programming easier.”
I didn’t really know what that meant. All I knew is that I wanted to build things that helped programmers productively express their ideas.
My advisor suggested I go watch some people program and see if I could find opportunities to make something.
(Strangely, he didn’t suggest I read, but I did anyway—hundreds of papers about the psychology of programming, leveraging a literature review his former Ph.D. student had published in his dissertation. That was equally important.)
My first semester
I decided to watch artists, sound engineers, and developers try to implement a behavior in Alice—a programming language and IDE for building interactive 3D virtual worlds.
They would write a line of code, test it, and find it didn’t work; they’d get frustrated, confused, and lost. No matter how careful they were, their programs were molded through iteration.
A student using Alice to build a Monsters, Inc. interactive game with his team, which included a sound engineering student and an industrial design student.
After about 30 hours of watching people of all kinds try to write Alice programs, a few things became very clear:
Debugging is not peripheral to programming, it dominates programming
Debugging, at least how my participants did it, was driven by guess work (“maybe this is the problem”)
People’s guesses were usually wrong; it took them ages to find out, and then think of another guess to check.
The only sound judgments they could make were whether the output was wrong, and sometimes even those were wrong.
“Wait, why did Sully move…”
“Maybe it was the do together I just wrote?”
“Let me try undoing that…”
[5 minutes later] “No, it still happens”
“Maybe it was another event? Let’s disable…”
[5 minutes later] “No, it still happens…”
Amy J. Ko (2003). A Contextual Inquiry of Expert Programmers in an Event-Based Programming Environment. ACM Conference on Human Factors in Computing Systems (CHI), 1036-1037.
Testing my hypothesis
I decided to watch more closely. I had several students come in and try to build something with Alice, and recorded them.
After analyzing 12 hours of video, the sequence was clear:
A developer creates a defect
Long after creating it, they notice a failure
Recency bias shapes their guesses, but the defect is rarely recent
Eventually, after laborious guesswork and correction of their mental model of what they’d built, they discover the defect.
A participant creates a defect without realizing it, notices a failure 30 minutes later, and spends the rest of the 90-minute session trying to localize it.
Amy J. Ko, Brad A. Myers (2003)
Development and Evaluation of a Model of Programming Errors.
IEEE Symposia on Human-Centric Computing Languages and Environments (VL/HCC), 7-14.
Note: I hadn’t thought about tools at all yet. In HCI, understanding problems requires stepping back from making, thinking critically about contexts, tasks.
What did all this mean for tools?
Most debugging tools assume that a developer knows what code might be faulty. But I’d observed that this assumption was often wrong.
Debugging tools that facilitate stepping (breakpoints, time travel) are only useful to the extent that a developer has a good hypothesis. They often don’t.
The only thing that developers did know was that some output is wrong. Any tool that assumes a developer knows more than that about a defect would be garbage in, garbage out, only amplifying a bad premise.
What if a debugging tool could start from unwanted output—the only reliable information a developer has—and automatically identify the things that caused it, presenting them to a developer to inspect?
But how would developers identify the output?
And how would the system identify the causes?
And how would the system present the causes to help a developer carefully analyze them?
My mind spun with ideas about how a developer and a debugging tool might dialog with each other:
“Pointing” to output to indicate it
Asking questions about it: “Why are you here?”, “Why didn’t you appear, output?”, “Why aren’t you 5 instead of 2?”
The system telling a developer a story, “First, this happened, then this happened, and then this; do any of those look wrong?”
Early prototypes of Whyline “answers”, with wild flailing about filters, temporal masks, and other overly complicated ideas
Looking for ways to make these possible, I remembered a 1984 paper I’d read called Program Slicing, by Mark Weiser. Mark had made observations similar to mine, but had come to a different “Aha!”
As conceived, slicing was a batch process, in which one would specify a variable and get back the set of lines of code that influenced that variable, statically or dynamically.
The handful of studies on it showed that slices were too big and too hard to comprehend. But what if that was just a bad interface for reasoning about slices?
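To make the idea concrete, here is a toy sketch of backward slicing in Weiser’s spirit—my own illustration, not any tool’s actual implementation. Statements are modeled as (line, defs, uses) triples, and the slice is found by walking backwards from the output statement, following data dependencies; control dependencies and real parsing are omitted.

```python
# Toy backward static slice over straight-line code.
# Each statement is (line number, variables defined, variables used).

def backward_slice(stmts, target_var, target_line):
    """Return the set of line numbers that may influence target_var."""
    relevant = {target_var}  # variables whose values we still need to explain
    in_slice = set()
    # Walk backwards from the target line.
    for line, defs, uses in reversed([s for s in stmts if s[0] <= target_line]):
        if defs & relevant:      # this statement defines a relevant variable
            in_slice.add(line)
            relevant -= defs     # those variables are now explained...
            relevant |= uses     # ...but their inputs become relevant
    return in_slice

# x = 1; y = 2; z = x + y; w = y * 2; print(z)
program = [
    (1, {"x"}, set()),
    (2, {"y"}, set()),
    (3, {"z"}, {"x", "y"}),
    (4, {"w"}, {"y"}),
    (5, set(), {"z"}),  # the output statement
]

print(sorted(backward_slice(program, "z", 5)))  # → [1, 2, 3]
```

Note that line 4 is correctly excluded: `w` never flows into the output `z`. A real slicer must also handle control flow, procedures, and aliasing, which is where slices grow large and hard to read.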
The following design emerged:
Identify lines of code that generate output.
Create a menu of those lines of code, organized by objects in the Alice world, presenting them as questions.
Allow developers to choose an output statement.
Incrementally compute a dynamic slice on the code using an execution trace.
Allow developer to interactively select which parts of the incremental slice seem faulty, using their knowledge of correct program behavior to search through the slice.
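The design above can be sketched as a question/answer loop over a recorded trace. This is a hypothetical illustration with invented event names and fields—the real Whyline worked over Alice’s runtime and a far richer trace—but it shows the shape: a menu of questions derived from output events, answered by a backward dynamic slice through recorded causal links.

```python
# Hypothetical sketch of a Whyline-style question/answer loop.
from dataclasses import dataclass, field

@dataclass
class Event:
    step: int                  # position in the trace
    code: str                  # source text of the executed statement
    is_output: bool = False    # did this event produce visible output?
    causes: list = field(default_factory=list)  # steps this event depended on

def questions(trace):
    """Build the question menu: one 'why did...' question per output event."""
    return {f"why did `{e.code}` happen?": e for e in trace if e.is_output}

def answer(trace, event):
    """Backward dynamic slice: follow cause links transitively."""
    sliced, work = set(), list(event.causes)
    while work:
        step = work.pop()
        if step not in sliced:
            sliced.add(step)
            work.extend(trace[step].causes)
    # Present the answer as a forward story of what happened.
    return [trace[s].code for s in sorted(sliced)]

trace = [
    Event(0, "speed = 5"),
    Event(1, "direction = -1"),
    Event(2, "dx = speed * direction", causes=[0, 1]),
    Event(3, "sully.move(dx)", is_output=True, causes=[2]),
]

menu = questions(trace)
print(answer(trace, menu["why did `sully.move(dx)` happen?"]))
# → ['speed = 5', 'direction = -1', 'dx = speed * direction']
```

The interactive part of the design—letting the developer mark which parts of the slice look wrong and expanding the slice incrementally—would sit on top of `answer`, rather than dumping the whole slice at once.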
My last high-fidelity prototype.
Lots of work to do:
I replaced the runtime to generate an execution trace sufficient for slicing
I changed the IDE to expose output statements that had executed in the trace
I created a visualization of the slice to convey causality
I showed it to my lab mates, my advisor, and their reaction was—”Why hasn’t debugging always worked this way?”
The Whyline for Alice, showing a question about why a particular output statement did not execute, and an answer computed as a backwards dynamic slice on the execution of that line and its values at a given time.
Notice that I hadn’t thought about evaluation at all yet. Making requires stepping back from evaluating.
“How do I know if it works?”
After building my prototype, I got plenty of informal feedback:
My advisor thought it was “cool”
My lab mates wished they had it for their preferred language and IDE
The Alice team wanted to merge my branch of Alice
All good signs. But would it actually help with debugging? How could I know?
Study idea #1: benchmark comparison
Many prior debugging tools had used this approach. I could:
Organize all of the real defects I’d seen developers create in Alice.
Compare the amount of “work” to identify the defect with the Whyline and other methods.
Show that the Whyline takes less “work”
I found this unsatisfying—how could I possibly emulate the work involved in debugging, when I’d demonstrated in prior studies that it was so completely dependent on a developer’s prior knowledge, the sequence of their actions, and other context?
Study idea #2: task-based evaluation
Select a defective program
Run a controlled experiment to compare the time it took developers to localize the defect with and without the Whyline
Show that the Whyline caused significantly faster debugging
This would be fine, but it would be highly dependent on which task I selected. Moreover, my prior work had shown that debugging time was highly variable, so there was a risk this variation would mask the effects of the tool.
Study idea #3: spec-based evaluation
Give developers a specification with six mostly orthogonal requirements
Let them introduce and debug defects organically, hoping they would introduce similar defects