Skip to main content

Improve Evolvability via Custom Data Types

We strive to craft maintainable software systems, since evolution of large systems is their most important and longest lifecycle phase. Various software development paradigms, like functional and object-oriented programming, contain wealth of approaches how to attain the previous objective. One of them is the creation and usage of custom data types. This blog peeks into the power of strongly typed programming languages and shows via a simple case study how to devise a flexible application by adorning it with custom data types. For this purpose I will use Java, but the general ideas are language agnostic.

Problem Description

For teachers the booksite accompanying the book Computer Science: An Interdisciplinary Approach by Robert Sedgewick and Kevin Wayne is an excellent source of educational materials. I will provide here solutions to some exercises that are not published in this book nor on the booksite, at the time of writing this document. The main task is to write a client program to analyze the data about movies and performers and print an actor appearing in most movies.

Setup

Here are the steps to setup an environment for this case study (I assume that you have Java installed on your computer, otherwise jump here):

  1. Open a terminal window
  2. Create a new folder on your machine and navigate into it
  3. Save the following files there:

First Client Program

To decipher which performer appears in most movies we can write the following simple client program to dump the (frequency, name) pairs on output. Afterward we can use command line utilities to further narrow the search. The Graph data type treats all vertices as strings, so our client will emit both movies and performers as entities. To differentiate between them we will need to do some exploratory data analysis, too.

Create a new file PopularityListV1.java and copy-paste the following content:
/**
 * Client console program to print out vertices 
 * and their degree of a graph stored in a file structured 
 * in adjacency list format.
 */
public class PopularityListV1 {
    /**
     * Prints out all (degree, vertex) pairs each on 
     * separate line:
     * 
     * 100 "Movie 1"
     * 20 "Performer 1"
     * 150 "Movie 2"
     * ...
     * @param args the filename and delimiter
     */
    public static void main(String[] args) {
        // Read in the graph from a file.
        String filename = args[0];
        String delimiter = args[1];
        Graph g = new Graph(filename, delimiter);

        // Print out all (degree, vertex) pairs.
        for (String v : g.vertices())
            StdOut.printf("%d \"%s\"\n", g.degree(v), v);
    }
}
Execute the next command inside a terminal window (I am using macOS, Windows users may want to install Cygwin):
javac -classpath .:stdlib.jar PopularityListV1.java
We are now ready to devise a pipeline of commands to find the most prolific performer:
java -classpath .:stdlib.jar PopularityListV1 movies.txt "/" | sort -k1nr | egrep -v '\(\d{4}.*\)' | head -1
As you can see, an output from one stage is an input to another. This is a powerful Pipes and Filters architectural pattern in action. The options to the sort utility are:

  • -k1 sort based on the first field (in our case frequency)
  • -n treat field 1 as numeric data
  • -r sort items in decreasing order
egrep (grep using extended regular expression syntax) filters out lines satisfying the given regular expression (the effect of the -v option). Namely, movie names contain the year of production and possibly some other information inside parenthesis. Finally, we use head to print only the first line. The end result is:

57 "Welker, Frank"

Analysis of Current State

The ingenuity of figuring out the most frequent performer lies in pattern recognition and command line wizardry. Any change in the format of a movie name will most probably require adjustment to the regular expression, provided that we would still be able to find a marked difference in names. Furthermore, we have optimized our solution to this specific input file. All these issues are a consequence of working with pure string values. Can we do better?

Next Version of the Graph Data Type

It turns out that by embellishing our Graph data type with a template type parameter could take us a long way. It would represent a type of values stored in vertices. A client may even choose to produce on-the-fly separate types for the two groups in a bipartite graph (like our movies database). This is a fine example how adding custom data types may tremendously improve evolvability of our software system. Here are the steps how to alter Graph.java:

  1. Make a copy of Graph.java and name it GraphV2.java
  2. Change the class definition to public class GraphV2<T extends Comparable<T>>
  3. Amend the internal symbol table to read as private final ST<T, SET<T>> st;
  4. Alter code accordingly on all other places where values of vertices are expected to be String
The last step is an easy exercise and a good opportunity to familiarize yourself with inner details. The central change is in the constructor that reads a file. Below I show it in full, so you may need to copy it over the current one:
/**
 * Initializes a graph from the specified file, 
 * using the specified delimiter. If it represents
 * a bipartite graph and systematically enlists vertices 
 * by starting with an entity from the first group and let 
 * neighbors belong to the second one, then it is easy to 
 * decipher groups during construction.
 *
 * @param filename the name of the file
 * @param delimiter the delimiter
 * @param f1 convert a string into {@code T} for group 1.
 * @param f2 convert a string into {@code T} for group 2.
 */
public GraphV2(String filename, String delimiter, 
               Function<String, T> f1,
               Function<String, T> f2) {
    this();
    In in = new In(filename);
    while (in.hasNextLine()) {
        String line = in.readLine();
        String[] names = line.split(delimiter);
        for (int i = 1; i < names.length; i++)
            addEdge(f1.apply(names[0]), f2.apply(names[i]));
    }
}
There are two functions enlisted as parameters, which convert a string value into a custom type. A client may also choose to skip making a difference by providing the same function on both places. At any rate, using functions as first-class citizens enables a client to leverage custom types without getting bogged down into the intricacies and nuances of Java generics. For a super overview of this topic see this website.

Second Client Program

Create a new file PopularityListV2.java and copy-paste the following content:
/**
 * Client console program to print out vertices and their degree of
 * a graph stored in a file structured in adjacency list format.
 */
public class PopularityListV2 {
    private static class Vertex implements Comparable<Vertex> {
        private final String name;

        public Vertex(String name) {
            this.name = name;
        }

        @Override
        public int compareTo(Vertex o) {
            return name.compareTo(o.name);
        }

        @Override
        public String toString() {
            return name;
        }
    }

    private static class Movie extends Vertex {
        public Movie(String name) {
            super(name);
        }
    }

    private static class Performer extends Vertex {
        public Performer(String name) {
            super(name);
        }
    }

    /**
     * Prints out all (degree, vertex type, vertex) pairs 
     * each on separate line:
     * 
     * 100 Movie "Movie 1 Name"
     * 20 Performer "Performer 1 Name"
     * 150 Movie "Movie 2 Name"
     * ...
     * @param args the filename and delimiter
     */
    public static void main(String[] args) {
        // Read in the graph from a file.
        String filename = args[0];
        String delimiter = args[1];
        GraphV2<Vertex> g = new GraphV2<>(
            filename, delimiter, Movie::new, Performer::new);

        // Print out all (degree, vertex type, vertex) triples.
        for (var v : g.vertices())
            StdOut.printf("%d %s \"%s\"\n", 
                          g.degree(v), 
                          v.getClass().getSimpleName(), 
                          v);
    }
}
Notice the usage of function references Movie::new and Performer::new. Execute the next command inside a terminal window:
javac -classpath .:stdlib.jar PopularityListV2.java
We are now ready to devise a simpler pipeline of commands to find the most prolific performer:
java -classpath .:stdlib.jar PopularityListV2 movies.txt "/" | sort -k1nr | grep Performer | head -1
Observe that the regular expression is gone and we simply use grep in affirmative manner to pick lines with performers.

Conclusion

We have started with a client that used the version of Graph treating all values as strings. This resulted in a fragile application dependent on peculiar data formatting rules. Simply adding an ability to designate different types for values tremendously improved flexibility of our solution. There is no reliance anymore on naming approaches.

A client has also nicely applied the separation of concerns principle. It is straightforward to expand it in the following ways:

  • Add year of production to a Movie data type by parsing its name. All that logic would be now properly encapsulated inside this class.
  • Optimize memory and runtime performance by interning strings. The requited call would be performed in the Vertex base class.
  • Report data quality problems if year of production is not present for a movie.
Hopefully this article has showcased the power of abstraction and working with custom data types in making evolvable software systems. Modern programming languages are equipped with many sophisticated features pertaining to data types, so we should definitely use them.

Comments

Popular posts from this blog

The Power of a Command-Line Processor

A command-line processor, as its name implies, is a software appliance intended to be executed from a command-line in pipelined fashion. Most operating systems are equipped with bunch of utilities that can be ingeniously combined to create powerful mini programs for transforming data. We will focus our attention here on jq specialized to mangle JSON data similarly how sed crunches textual content. You can easily start using it by issuing brew install jq on macOS (or download it for other operating systems). Nonetheless, even without placing anything on your machine, there is also a nice playground for trying out things online. The following example illustrates what sort of actions could be crafted into a unified instruction, i.e., mini program that may be reused as a whole: > echo "A test string." | jq -R "ascii_upcase | gsub(\"STRING\"; \"CONTENT\")" "A TEST CONTENT." The input is piped into jq as an ordinary string (this is hi...

Brief Introduction to the JSON API v1 Specification

JSON API provides a systematic method of specifying Level 3 REST APIs in JSON. It has a registered hypermedia type of application/vnd.api+json issued by the Internet Assigned Numbers Authority (IANA). The next list briefly enrolls the most salient features and benefits of JSON API (see my book Creating Maintainable APIs for more information): It is a hypermedia-driven conventional interface : This is a fundamental difference between JSON API and other pure message formats. The JSON API, besides establishing a shared convention regarding message formats, also encompasses rules for how to interact with services (it specifies an assumed API). For example, JSON API contains instructions on how services should fetch, create, update, and delete resources, specify pagination, pass query parameters, and so on. This trait of JSON API is tightly related to the concept of openness in distributed systems. An open distributed system mandates certain rules regarding how services should be implem...

Invention as a Reincarnation of an Old Idea

In all creative disciplines fundamental ideas reoccur in different shapes. This cannot be truer in computer science and software engineering. Advances in technology even enables old ideas to appear as pure novelties. In this article we will focus on an intellectual toolbox associated with morphing existing ideas, concepts, and technique to fit new scenarios. This is tightly connected with generalization and pattern matching, which are both prominent ingredients in achieving success in professional practice and various competitive activities. The latter is examined in more detail in my article about competitive software engineering . To make the point clear let us use a concrete example from Petlja , a website maintained by the Serbian Mathematical Society, which also organizes official competitions in mathematics and programming. The problem we will use is called Mingo . Below is its description translated to English. Problem Description Now, after completing the competition, Miroslav ...