We strive to craft maintainable software systems, since evolution of large systems is their most important and longest lifecycle phase. Various software development paradigms, like functional and object-oriented programming, contain wealth of approaches how to attain the previous objective. One of them is the creation and usage of custom data types. This blog peeks into the power of strongly typed programming languages and shows via a simple case study how to devise a flexible application by adorning it with custom data types. For this purpose I will use Java, but the general ideas are language agnostic.
Problem Description
For teachers the booksite accompanying the book Computer Science: An Interdisciplinary Approach by Robert Sedgewick and Kevin Wayne is an excellent source of educational materials. I will provide here solutions to some exercises that are not published in this book nor on the booksite, at the time of writing this document. The main task is to write a client program to analyze the data about movies and performers and print an actor appearing in most movies.
Setup
Here are the steps to setup an environment for this case study (I assume that you have Java installed on your computer, otherwise jump here):
- Open a terminal window
- Create a new folder on your machine and navigate into it
- Save the following files there:
- The stdlib.jar file of the standard library developed for the previously mentioned book
- ST.java implements a symbol table
- SET.java implements a set data type
- Graph.java implements a graph data type
- movies.txt a sample movie-cast data file
First Client Program
To decipher which performer appears in most movies we can write the following simple client program to dump the (frequency, name) pairs on output. Afterward we can use command line utilities to further narrow the search. The Graph data type treats all vertices as strings, so our client will emit both movies and performers as entities. To differentiate between them we will need to do some exploratory data analysis, too.
/**
* Client console program to print out vertices
* and their degree of a graph stored in a file structured
* in adjacency list format.
*/
public class PopularityListV1 {
/**
* Prints out all (degree, vertex) pairs each on
* separate line:
*
* 100 "Movie 1"
* 20 "Performer 1"
* 150 "Movie 2"
* ...
* @param args the filename and delimiter
*/
public static void main(String[] args) {
// Read in the graph from a file.
String filename = args[0];
String delimiter = args[1];
Graph g = new Graph(filename, delimiter);
// Print out all (degree, vertex) pairs.
for (String v : g.vertices())
StdOut.printf("%d \"%s\"\n", g.degree(v), v);
}
}
javac -classpath .:stdlib.jar PopularityListV1.java
java -classpath .:stdlib.jar PopularityListV1 movies.txt "/" | sort -k1nr | egrep -v '\(\d{4}.*\)' | head -1
- -k1 sort based on the first field (in our case frequency)
- -n treat field 1 as numeric data
- -r sort items in decreasing order
Analysis of Current State
The ingenuity of figuring out the most frequent performer lies in pattern recognition and command line wizardry. Any change in the format of a movie name will most probably require adjustment to the regular expression, provided that we would still be able to find a marked difference in names. Furthermore, we have optimized our solution to this specific input file. All these issues are a consequence of working with pure string values. Can we do better?
Next Version of the Graph Data Type
It turns out that by embellishing our Graph data type with a template type parameter could take us a long way. It would represent a type of values stored in vertices. A client may even choose to produce on-the-fly separate types for the two groups in a bipartite graph (like our movies database). This is a fine example how adding custom data types may tremendously improve evolvability of our software system. Here are the steps how to alter Graph.java:
- Make a copy of Graph.java and name it GraphV2.java
- Change the class definition to public class GraphV2<T extends Comparable<T>>
- Amend the internal symbol table to read as private final ST<T, SET<T>> st;
- Alter code accordingly on all other places where values of vertices are expected to be String
/**
* Initializes a graph from the specified file,
* using the specified delimiter. If it represents
* a bipartite graph and systematically enlists vertices
* by starting with an entity from the first group and let
* neighbors belong to the second one, then it is easy to
* decipher groups during construction.
*
* @param filename the name of the file
* @param delimiter the delimiter
* @param f1 convert a string into {@code T} for group 1.
* @param f2 convert a string into {@code T} for group 2.
*/
public GraphV2(String filename, String delimiter,
Function<String, T> f1,
Function<String, T> f2) {
this();
In in = new In(filename);
while (in.hasNextLine()) {
String line = in.readLine();
String[] names = line.split(delimiter);
for (int i = 1; i < names.length; i++)
addEdge(f1.apply(names[0]), f2.apply(names[i]));
}
}
Second Client Program
/**
* Client console program to print out vertices and their degree of
* a graph stored in a file structured in adjacency list format.
*/
public class PopularityListV2 {
private static class Vertex implements Comparable<Vertex> {
private final String name;
public Vertex(String name) {
this.name = name;
}
@Override
public int compareTo(Vertex o) {
return name.compareTo(o.name);
}
@Override
public String toString() {
return name;
}
}
private static class Movie extends Vertex {
public Movie(String name) {
super(name);
}
}
private static class Performer extends Vertex {
public Performer(String name) {
super(name);
}
}
/**
* Prints out all (degree, vertex type, vertex) pairs
* each on separate line:
*
* 100 Movie "Movie 1 Name"
* 20 Performer "Performer 1 Name"
* 150 Movie "Movie 2 Name"
* ...
* @param args the filename and delimiter
*/
public static void main(String[] args) {
// Read in the graph from a file.
String filename = args[0];
String delimiter = args[1];
GraphV2<Vertex> g = new GraphV2<>(
filename, delimiter, Movie::new, Performer::new);
// Print out all (degree, vertex type, vertex) triples.
for (var v : g.vertices())
StdOut.printf("%d %s \"%s\"\n",
g.degree(v),
v.getClass().getSimpleName(),
v);
}
}
javac -classpath .:stdlib.jar PopularityListV2.java
java -classpath .:stdlib.jar PopularityListV2 movies.txt "/" | sort -k1nr | grep Performer | head -1
Conclusion
We have started with a client that used the version of Graph treating all values as strings. This resulted in a fragile application dependent on peculiar data formatting rules. Simply adding an ability to designate different types for values tremendously improved flexibility of our solution. There is no reliance anymore on naming approaches.
A client has also nicely applied the separation of concerns principle. It is straightforward to expand it in the following ways:
- Add year of production to a Movie data type by parsing its name. All that logic would be now properly encapsulated inside this class.
- Optimize memory and runtime performance by interning strings. The requited call would be performed in the Vertex base class.
- Report data quality problems if year of production is not present for a movie.
Comments
Post a Comment