I’ve picked up this book to continue my reading into our ability to reason and understand the reality around us. I’ve described this journey in my previous posts, e.g. https://metrics.blogg.gu.se/?p=438
This book goes beyond the trust and explores how we can be wrong and how, sometimes, we should actually be wrong. It describes different models of being wrong and also whether it is good to be right or not.
What I find interesting is the systematic overview of the sources of errors and how we deal with it.
I recommend to take a look at the book as a bedtime reading to learn about our view on being wrong. It describes interesting situations and analyzes them from different perspectives.
Working with continuous integration is an exciting new filed. You get your code into the main branch directly. Well, that’s what the theory says. What you really get is feedback directly, at least the feedback from the automated checks for technical debt, testing and similar.
What you do not get quickly is the review of your code by your colleagues. In larger organizations, things like code reviews do not get prioritized. Therefore they tend to slow down software development rather than speed up!
In this paper, we set of to understand how to fix that. We used Gerrit as the tool to extract lines of code to review, instead of reviewing all of the lines. Here is a short video about this: https://play.gu.se/media/t/0_h7hx95d2
The abstract of the paper is included:
Code reviews are one of the first quality assurance tasks in continuous software integration and delivery. The goal of our work is to reduce the need for manual reviews by automatically identify which code fragments should be further reviewed manually. We conducted an action research study with two companies where we extracted code reviews and build machine learning classifiers (AdaBoost and Convolutional Neural Network — CNN). Our results show that the accuracy of recognizing code fragments that require manual review, measured with Matthews Correlation Coefficient, was 0.70 in the combination of our own feature extraction and CNN. We conclude that this way of combining automation with manual code reviews can improve the speed of reviews while providing organizations with the possibility to support knowledge transfer among the designers.
In the process of reviewing code, we can identify refactoring pretty easy. We read the code, try to understand it and provide comments. In the understanding phase we also get ideas about possible alternative – why is this done this way?
Now, when writing the comments, we rarely have the time to refactor the code. In CI, this process of reviewing comes when we commit the code to the main branch and therefore we expect this to be delivered and used soon. So, it’s too late to refactor, we need to do it in the next iteration.
But the next iteration is the same, we need to deliver new functions, not “golden plate” the existing code, deliver it to the main branch, etc. When is the time for refactoring then? How do we document the possibilities and use them when we have a bit of time?
In this work, the authors look at the commit messages and identify refactoring possibilities for that, complementing the static and dynamic analysis of code. The method presented in the paper is based on the analysis of code from open source projects, the refactoring applied to the code and the analysis of the QMOOD quality attributes that were related to these commits.
The following quote from the paper explains a bit how the gist of the extraction of the refactored code works:
Identifying refactoring rationale has two parts. The first part is the detection of the files that are refactored by developers in a commit. The second part is the identification of changes in the QMOOD quality attributes then comparing these changes with the information in the commit message.
For the first part, we used the GitHub API to identify the changed files in each commit. In the second part, we compared the QMOOD quality attribute values before and after the commit to capture the actual quality changes for each file. Once the changed files and quality attributes were identified, we checked if the developers intended to actually improve these files and quality attributes. In fact, we preprocessed the commit messages and we used the names of code elements in the changed files and the changed quality metrics as keywords to match with words in the commit message. Once the refactoring rationale is automatically detected using this procedure, we continue with the next step to find better refactoring recommendations that can fully meet the developer’s intentions and expectations. In case that no quality changes were identified at all then a warning will be generated to developers that the manually applied refactorings are not addressing the quality issues described in his commit message.
As part of my weekend reading I took on a book about cultures. Not a typical reading of mine, but I though I would give it a try. This book is about differences in cultures from the perspective of rules, laws, principles. It is a common knowledge that some societies are very relaxed (e.g. Scandinavia) whereas some are very strict (e.g. Singapore). Commonly, we also know that the loose societies are more creative, whereas the more disciplined ones are very, well, disciplined.
This book also makes a point that it’s not as simple as that. This is not a linear relationship and there is some golden middle. Just being a loose society does not guarantee innovativeness and just being strict does not guarantee discipline. People need a “lagom” (here is a good Swedish word, meaning just right, in the middle, just enough, not too much and not too little) set of rules and looseness in the society.
When I read the book I thought about well-functioning organisations. In the companies and teams that I visit (or used to, before the pandemic), I often saw teams that were working together well with some degree of looseness, but not completely without the rules. They tend to perform well and function well when all team members understand the goals and rules of the game. I’ve also seen teams that are not able to function at all. They do not respect each other, have no respect for rules and provide no support for each other. They are “too loose” and therefore they are groups, but not really teams. At the same time I saw teams where the narcissistic boss controls everything and everything needs to go through the boss. They do not produce much, trust me.
However, there is also one more aspect – it’s where you come from. I, for that matter, cannot work in a self-organising team. I just don’t know how to find my place. One of my friends told me once – “Either you lead, or you follow or you get the hell out of the way”. Well, I’m more for that kind of the rule. Following is nice and I like it, but self-organising is so-so.
I’m actually part of a self-organised team in one of my assignments. I don’t know what to do there, I rely on a friend from the team to tell me when is my turn to speak and if I should say something or not. He also tells me when it’s a good time to do things and when it’s just a discussion. I’m not providing the name of the friend, but I’m super happy that I have him!
To sum up, I really recommend to read this book as it provides a bit wider perspective on rules in societies than the most common books from the organisational theories. It is about a normal person like me trying to find a place in life (well, by now I would actually think that this is easier, but maybe I’m just wiser to realise certain things).
Thanks for listening and tun in for the next blog post.
Assigning defects is a task that is not so much fun. Companies need to do that, but the persons who do it often change as the task is quite labor intensive and tiresome. There is, of course, a significant body of research about this and here is one example of it.
What is interesting in this article is that the authors use temporal data about the defect reports to assign teams. From the abstract: “In this article, we describe a new BA approach that relies on two key intuitions. Similar to traditional BA methods, our method constructs the expertise profile of project developers, based on the textual elements of the bugs they have fixed in the past; unlike traditional methods, however, our method considers only the programming keywords in these bug descriptions, relying on Stack Overflow as the vocabulary for these keywords. The second key intuition of our method is that recent expertise is more relevant than past expertise, which is why our method weighs the relevance of a developer’s expertise based on how recently they have fixed a bug with keywords similar to the bug at hand. “
The method uses text similarity measures to match defects and performs better than existing methods based on the meta-parameters. What it means in practice is that the only thing that is needed is the actual defect description, or actually a failure report in order to make the predictions.
Very interesting work to apply, it seems that the entry level is not that high for new companies.
Recently, I’ve read an article in Empirical Software Engineering about automated code refactoring. I must admit that I do refactoring quite seldom. It’s a tedious task and for the software that I write, quite unnecessary. My software is often a set of scripts to solve a specific task and then the key is to document it, not refactor. A good documentation helps me to understand what I did in that code and how it works. Yes, I know it sounds like a cliché, but that’s how it is for me. I’m switching tasks so often that I forget what the code was doing.
Nevertheless, I recognize the code that is nicely written, formatted and refactored. Therefore, I was on a lookout for a tool that could do something like that for me – suggest a refactoring that I could implement.
So, this is a paper that I found, which I would like to try out. It is a tool that was evaluated through interviews with designers and developers. Although they can recognize that the code was refactored, but they seemed to be happy about it. So, I’m off to try out the tool:)
Abstract: Refactoring is a maintenance activity that aims to improve design quality while preserving the behavior of a system. Several (semi)automated approaches have been proposed to support developers in this maintenance activity, based on the correction of anti-patterns, which are “poor” solutions to recurring design problems. However, little quantitative evidence exists about the impact of automatically refactored code on program comprehension, and in which context automated refactoring can be as effective as manual refactoring. Leveraging RePOR, an automated refactoring approach based on partial order reduction techniques, we performed an empirical study to investigate whether automated refactoring code structure affects the understandability of systems during comprehension tasks. (1) We surveyed 80 developers, asking them to identify from a set of 20 refactoring changes if they were generated by developers or by a tool, and to rate the refactoring changes according to their design quality; (2) we asked 30 developers to complete code comprehension tasks on 10 systems that were refactored by either a freelancer or an automated refactoring tool. To make comparison fair, for a subset of refactoring actions that introduce new code entities, only synthetic identifiers were presented to practitioners. We measured developers’ performance using the NASA task load index for their effort, the time that they spent performing the tasks, and their percentages of correct answers. Our findings, despite current technology limitations, show that it is reasonable to expect a refactoring tools to match developer code. Indeed, results show that for 3 out of the 5 anti-pattern types studied, developers could not recognize the origin of the refactoring (i.e., whether it was performed by a human or an automatic tool). We also observed that developers do not prefer human refactorings over automated refactorings, except when refactoring Blob classes; and that there is no statistically significant difference between the impact on code understandability of human refactorings and automated refactorings. We conclude that automated refactorings can be as effective as manual refactorings. However, for complex anti-patterns types like the Blob, the perceived quality achieved by developers is slightly higher.
I’ve worked with two great students – Peter and Joshua – who wanted to do something interesting. They developed a tool that could replicate a study from other researchers. However, they did it faster and with less data. We also managed to team up with Mirek from Poznan who improved the classification algorithm and asked his colleagues from new, industrial data.
And this is the outcome – a tool that can connect to a git repository and recognise whether your project is well engineered or not. It helps companies to understand whether their teams are working in a structured manner or ad-hoc.
The tool provides the possibility to assess whether a specific repository is in need for maintenance or not.
Context: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets.
Objective: The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way.
Method: This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.
Results: Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies.
Conclusions: It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
A while back I read an article in ZDNet about Linus Torvalds, the creator of Linux, and his daily work. He was (at the time of reading, which is ca. 2 years back) still working on the code. However, he was mostly working on the design of the system, reviewing patches and supporting younger designers. I’ve also read a number of articles which claimed the importance of code reviews as a way of teaching younger designers about the product and the code base.
In this paper, I’ve found that the support for younger designers is what the elite developers do a lot of. It seems that the communication, organisation and support are the activities that the elite developers find important. It’s aligned with what we do at the universities as well. The most elite professors work with students, show them how to program and how to structure their code. Seems like this is a very good way of continuing your career – help other be better.
I guess it’s time to change my wallpaper from “coding” to “teaching”….
Abstract: Open source developers, particularly the elite developers who own the administrative privileges for a project, maintain a diverse portfolio of contributing activities. They not only commit source code but also exert significant efforts on other communicative, organizational, and supportive activities. However, almost all prior research focuses on specific activities and fails to analyze elite developers’ activities in a comprehensive way. To bridge this gap, we conduct an empirical study with fine-grained event data from 20 large open source projects hosted on GITHUB. We investigate elite developers’ contributing activities and their impacts on project outcomes. Our analyses reveal three key findings: (1) elite developers participate in a variety of activities, of which technical contributions (e.g., coding) only account for a small proportion; (2) as the project grows, elite developers tend to put more effort into supportive and communicative activities and less effort into coding; and (3) elite developers’ efforts in nontechnical activities are negatively correlated with the project’s outcomes in terms of productivity and quality in general, except for a positive correlation with the bug fix rate (a quality indicator). These results provide an integrated view of elite developers’ activities and can inform an individual’s decision making about effort allocation, which could lead to improved project outcomes. The results also provide implications for supporting these elite developers.
Understanding how your product performs in the field is a very hot topic. To be honest, it’s always been. When we design and develop our product, we often want to know whether the outcome is going to be good or not. Well, this is not an easy task because we cannot really know to runtime properties of our products before very close to the final version. It’s difficult to measure end-to-end response time if we only have half of the database, or if we cannot simulate the full load of the product.
In this paper, the authors analysed over 700 web services and measured their code quality and interface quality as predictors for the quality of service.
From the abstract: source code and interface quality metrics/antipatterns are correlated with web service quality attributes (response time, availability, throughput, successability, reliability, compliance, best practices, latency, and documentation).
The internal code/interface quality measures were:
NPT: Number of port types, Interface
NOPT: Average number of operations in port types, Interface
NBS: Number of services, Interface
NIPT: Number of identical port types, Interface
NIOP: Number of identical operations, Interface
ALPS: Average length of port-types signature, Interface
AMTO: Average meaningful terms in operation names, Interface
AMTM: Average meaningful terms in message names, Interface
AMTMP: Average meaningful terms in message parts, Interface
AMTP: Average meaningful terms in port-type names, Interface
NOD: Number of operations declared, Code
NAOD: Number of accessor operations declared, Code
ANIPO: Average number of input parameters in operations, Code
ANOPO: Average number of output parameters in operations, Code
NOM: Number of messages, Code
NBE: Number of elements of the schemas, Code
NCT: Number of complex types, Code
NST: Number of primitive types, Code
NBB: Number of bindings, Code
NPM: Number of parts per message, Code
COH Cohesion: The degree of the functional relatedness of the operations of the service, Code
COU Coupling: A measure of the extent to which inter-dependencies exist between the service modules, Code
ALOS: Average length of operations signature, Code
ALMS: Average length of message signature, Code
This is quite a collection of measures and they are quite interesting, e.g. the average meaningful terms in port-type names. I must admit that it’s a new measure that I’ve not seen before.
The measures of the quality of service were:
Response Time: Time taken to send a request and receive a response, QoS
Availability: How often is the service available for consumption, QoS
Throughput: Total Number of invocations for a given period of time, QoS
Successability: Number of response / number of request messages, QoS
Reliability: Ratio of the number of error messages to total messages, QoS
Compliance: The extent to which a WSDL document follows WSDL specification, QoS
Best Practices: The extent to which a web service follows WS-I Basic Profile, QoS
Latency: Time taken for the server to process a given request, QoS
Documentation: Measure of documentation (i.e. description tags) in WSDL, QoS
Some of these are also quite interesting, e.g. Successability, which is the number of response messages per request messages.
The authors also measured some anti patterns of service design, which they list in the paper. I will not go through these anti-patterns, but I think that they are also correlated with some of the metrics.
I suggest this reading to everyone interested in web services and service design. Perhaps this could help us to get more high-performance services. I hope that I can see more of this type of research in the domain of service security – that’s an interesting area in itself.
This is great that we have increasingly more tools and that AI engineering matures. We do not want to have precision and recall steer our development. We want to have real testing and real systems. We also need to work with data quality in order to ensure that the systems are reliable.
The alternative is that we use MCC, precision, recall, F1-score to tell us how good a system is, which is not entirely true. These measures do not provide any view on how the system reflects the requirements put on it. These measures allow us to compare different classifiers, but not systems.
I hope that we can focus more discussion on AI quality and not classification quality/accuracy.