1. Implementing behavior using state machines.

Systems sometimes have to exhibit some kind of behavior that is different observable output over time. This is different from a pure function, which will always produce the same output for the same input. The behavior differs depending on the 'history' the system experienced, typically in the form of 'events' or method calls that happened over time. Such method calls change the state the system is in. We call such behavior state-full and this way of operating is often implemented using a 'state machine'. In this week and in the AADE course we will be using two concepts:

  1. The state (machine) diagram

  2. The State pattern


lamp battery switch A state machine is a software concept used to implement the behavior of something whose 'reaction' or outcome depends on the events that happened over time. A very simple example is a lamp and a switch. When properly connected and none of the parts is defect, the bulb will light up when the switch is in the on state. As state machine: the switch passes current when it has been turned on, which can be expressed as a turn-on event, making the bulb light up. From the diagram it should be obvious that a current can only flow if the switch is in the on (or closed) state.

In software we model this as the path the code takes, which then depends on some condition or value.

switch The state diagram is a way to express the behavior in a graphical manner. Each state of the system is drawn as a box with rounded corners, like in the diagram on the left. In the GoF State pattern, which is an object oriented way of modeling state behavior, one uses different objects to realize the effect of changing the software path. The objects all implement the same interface, but have a different implementation. Then, whenever a state switching event occurs, the current state object is replaced by the state object for the new state.

railway switch As another analogy, you can think of a railway switch. The active state object will always have the same behavior. One makes the train go straight, the other makes it go right. Each state on itself always does the same, like a pure function, so with the same input (train) the same output is produced. The switching between states makes it possible to control the direction of the train traffic.

State machines or state charts can be implemented in several ways. In this week we will apply an OO approach using the well known State Pattern , combined with enums and some Java 8 features, in particular default methods.

The State Pattern has the following UML class diagram:

StatePattern
Figure 1. State Pattern

We deal with a context, an abstract state and multiple concrete state implementations that do the work and represent the state elements in the diagram. In the diagram above there would be a state instance for both ConcreteStateA and ConcreteStateB. In the implementation of the exercise of this week we will be using enum as the basic construct for our work.

Using a functional state approach by using enum has the following advantages:

  • There is no chance of creating new states. Enum ensures that the only instances there ever will be are created at class loading time. You always know all states there ever will be.

  • A state can be implemented as a pure function, making the number of tests needed minimal. The inputs to the state are the state of the context (or technically, the state stored in the context), the inputs, and the event or method call.

  • Such functions and thereby state instances can be shared between contexts, because functional states do not have any state themselves. The actual state is saved in the context. This is not because enums cannot have state, they can, but because a functional state in a state-machine shouldn’t. It IS the state and should not have any. It should always do the same. More importantly, they are immutable, meaning no one can break them by setting some wrong value.

  • A functional state is easily tested without a need for a complex context. The context can easily be mocked as can be seen in this week’s olifantysballs exercise.

The disadvantage of using functional states in this way is that they cannot keep information themselves, such as counting something. But quite often the counting involves only a few possible values, which can be states themselves, or can be realized in the context. As an example for few counting states you might consider a traffic light which counts green, yellow, red.

A common error made by novice programmers is to call a method inside a constructor that takes the constructor’s type as a parameter.

Do not do this inside the constructor
    StateMachine(){
      this.initialState = State.IDLE;
      this.initialState.enter( this ); (1)

    }
1 Problematic call of someField.enter(this) inside constructor. Something to avoid.

The problem is that the constructor may not yet be complete, in particular when fields are defined and initialized later in the constructor or even after the constructor definition.

To avoid this problem, you typically have some init() method that should be called after construction or use a factory method that does something similar. In the gumball machine we chose the later. You can find the details in the javadoc documentation of the exercise.

Use init method.
    StateMachine(){
      this.initialState = State.IDLE;

    }

    StateMachine init(){
      this.initialState.enter( this ); (1)
      return this;
    }
Caller has not much extra work.
    StateMachine sm = new StateMachine().init();
Do not aways use all tricks you’ve learned so far. They are not always needed. In particular, states do not have to be generic, and with enums that would even be impossible, because an enum value is a leaf class all by it self, so making it generic for later specialisation like in the Zoo example in part 4 is futile. If you appear to run into this situation, because your context is generic, do not simply use raw objects but see if you can refactor out an interface that is NOT generic, too be implemented by the context that wants to be generic. You will see that in the JsonMarshaller, the state machine part.

Exercise OlifantysBalls

Mockito is a very powerful tool and serves the developer best when you stick to the rule of not writing code that is not executed. This applies to stubbing and mocking as well. When you stub or mock something, that fact serves as a kind of documentation for the developer or maintainer of the code and by implication of the tests as well. So if you apply stubbing and get a warning, do not just simply turn of the warning, but also document the potential unneeded stubbing.

There are situations where you must compromise between the strictness of Mockito and the convenience of other aspects of a test approach. One such case is applicable to the GumballStateTest, where we use one test table as external test data. This test data is used in all tests. Changing the test data to just including the exactly right stubbing or mocking would make the test data set bigger, and thereby less easy to comprehend.

In the exercise below we applied the annotation
@MockitoSettings( strictness = Strictness.WARN )
on two methods, where a specific interaction does not take place with all test data.

TURN of Warnings only, once you have verified that the path you tread is safe given the conditions of the tests

Olifantys Balls

In this exercise you will test driven develop the behavior of a gumball machine. You will be implementing a state machine using the GoF State Pattern . The machine has following events:

  • insertCoin(),

  • ejectCoin(…​),

  • refill(…​),

  • and draw(…​)
    which can be understood as events, modeled in the interface State and should therefore have an implementation in each state.

and four states:

  • SOLD_OUT,

  • NO_COIN,

  • HAS_COIN

  • and WINNER.

State machine
Figure 2. state diagram of the gumball machine.

Study the state diagram, it is the (whole) specification of our gumball system.

You should find that the system starts in the state

  • SOLD_OUT, and will go to NO_COIN after a refill with some gumballs.

  • When you insert a coin, it goes to HAS_COIN, in which you can draw to get a ball.

  • If the machine becomes empty by a draw, it will go to SOLD_OUT,

  • If not and you are lucky because the winner guard is true, the state is WINNER in which you can draw another ball without extra payment.

  • If not winner or after the extra draw the system will go back to NO_COIN, waiting for the next coin insertion.

    • As last detail, when after WINNER the machine becomes empty because draw takes the last ball, the system will go to the SOLD_OUT out state.

Context
Figure 3. context in the state machine

Because both the Java artefacts Context and State in the initial design are interfaces, we’ve added the UML public visibility modifier to all methods, because that is how interfaces work in Java.

The circled plusses near StateEnum is the UML way of saying that SOLD_OUT etc are inner classes of StateEnum, which they effectively are.

The OlifantysBallMachine implements two interfaces with the exact purpose of

  • GumBallApi being the public API for consumption by the client and the

  • Context interface as one of the collaborators in the state pattern.

To keep the API simple, that interface has one public static method for the client to get an instance of the package private OlifantysBallMachine. This keeps our design nice and tidy with the least possible coupling and leakage of unneeded information to the client. See it as an example of a clean design.

The behavior is easily modeled in a state diagram, but can also be expressed in a state transition table, which has exactly the same information with the added benefit that it can easily be understood by a program, in our case the tests of the state machine. Using such state transition table is an older technique than the state machine diagrams. It predates UML, but is still valuable, in particular because it will easily list all possibile states and thus will help to find more test cases, ensuring that the set of tests is complete.

gumballmachine thumb In a real machine or automaton, such as a vending machine, the state machine will control one or more devices. As an example, think of a coffee brewer, which must grind coffee, maybe dispense cups, boil water and brew the coffee by pressing the water through the ground coffee, thereby filling the cup. To work properly, these 'devices', such as boiler, grinder, cup dispenser, and pump need to be switched on and off. This is best done on enter (turn on) and exit (turn off) of the state, and is the reason why the State interface provides these methods. They are not used in the gumball exercise because there is no device, only imaginary, but might be used in a future exercise or project. A properly programmed context must therefor call the enter(…​) and exit(…​) methods when changing state, and a context test should verify that.

Test data in a table In the table below, empty and winner are guard expressions, with the outcome specified for the moment of evaluation. Guard expressions have a state decide if it should react to the trigger (guard == true) or not. A guard comes from outside the state and is typically available on the context, as in this exercise. Where the guard is not relevant, it is set to false. Nicer would have been leave the guard’s table cell empty when it is not relevant but that does not play very well with junit csv tests and would complicate the tests. Another approach would be to have a line each for each of the possible guard values, to ensure that guard really does not have effect in specific cases.

For the column End State, empty means no state change.

#Start State trigger End State empty winner dispense refills expected text

HAS_COIN

draw

NO_COIN

FALSE

FALSE

1

0

OlifantysGumball

HAS_COIN

draw

SOLD_OUT

TRUE

TRUE

1

0

OlifantysGumball

HAS_COIN

draw

WINNER

FALSE

TRUE

1

0

OlifantysGumball

HAS_COIN

ejectCoin

NO_COIN

FALSE

FALSE

0

0

Quarter returned

HAS_COIN

insertCoin

FALSE

FALSE

0

0

You should draw to get your ball

HAS_COIN

refill

TRUE

FALSE

0

1

refilled

NO_COIN

draw

TRUE

FALSE

0

0

You must put in a coin before you can continue

NO_COIN

ejectCoin

FALSE

FALSE

0

0

You must put in a coin before you can continue

NO_COIN

insertCoin

HAS_COIN

FALSE

FALSE

0

0

You inserted a coin

NO_COIN

refill

TRUE

FALSE

0

1

refilled

SOLD_OUT

draw

FALSE

FALSE

0

0

Machine is empty

SOLD_OUT

insertCoin

FALSE

FALSE

0

0

Machine is empty

SOLD_OUT

insertCoin

FALSE

FALSE

0

0

Machine is empty

SOLD_OUT

refill

NO_COIN

FALSE

FALSE

0

1

refilled

WINNER

draw

NO_COIN

FALSE

FALSE

1

0

You got two gumballs for your coin

WINNER

draw

SOLD_OUT

TRUE

TRUE

1

0

OlifantysGumball

WINNER

insertCoin

FALSE

FALSE

0

0

You should draw once more to get an extra ball

WINNER

insertCoin

FALSE

FALSE

0

0

You should draw once more to get an extra ball

WINNER

refill

FALSE

FALSE

0

1

refilled

# STATE

trigger pairs

that

have

not been

used

yet

and should be ignored

WINNER

ejectCoin

FALSE

FALSE

0

0

You should draw once more to get an extra ball

SOLD_OUT

ejectCoin

FALSE

FALSE

0

0

Machine is empty

We will start testing the state enum StateEnum.

The table reiterates that the state HAS_COIN needs the most attention in testing.
Also note that this table is just the specification, just like the state diagram, and not the implementation.

To get our State tested without any real machine nearby, we must mock out Context. This can be done by hand, but would soon make us create a context implementation that sort of works, but

  • first of all, is not tested itself, thus violates the TDD workflow

  • does not behave the way we need it in the test for the State type

  • and we do not need to have this context to behave in any way other then to give predictable answers when called by the State, and we must ensure that the State has the exact interactions we want. Not too few, and also not too many.

This is where Mockito comes in. We only use a few of its features, namely

  • the mocking facility,

  • recording and playback of method calls and return values

  • verification of the correct calls on the mocked Context.

Our setup looks as follows:

preparing for tests: setup.
    @Mock
    final Context ctx; (1)

    StringOutput sout = new StringOutput(); (2)
    PrintStream out = sout.asPrintStream(); (3)

    /**
     * Map the trigger name from the csv file to a lambda expression of type
     * {@code BiConsumer<Context,State>}.
     */
    static Map<String, BiConsumer<Context, State>> triggerMap = Map.of( (4)
            "insertCoin", ( c, s ) -> s.insertCoin( c ),
            "ejectCoin", ( c, s ) -> s.ejectCoin( c ),
            "drawBall", ( c, s ) -> s.draw( c ),
            "refill", ( c, s ) -> s.refill( c, 5 )
    );

    void setupMocks() { (5)
        when( ctx.getOutput() ).thenReturn( out );
        // any colour will do.
        when( ctx.dispense() ).thenReturn( new OlifantysGumball( "RED" ) );
    }
1 The context is mocked in the constructor
2 StringOuput and PrintStream are combined into a PrintStream that takes the role of system output.
3 Creates a StringOutput and ensure that anything output by the context is returned in an easily interpreted String.
4 translates strings to actual lambda expressions that do the interaction in the test.
5 Is the setup the standard mocks are trained to return the appropriate things on specific method calls.

This basic setup is done for every test anew and makes sure our SUT has a fresh collaborator every time, ignorant of what happened in earlier tests.

First test

The tabular form of the state machine behavior above is taken as the input form the parameterized tests.

First we get the initial state from the csv test source, which is in the column name "Start State". Because we need a State enum value we need to look that up using the enum classes' values() method.

When a state method is invoked, it can or should have multiple interactions with the context. To test that, we use the mocked context. First we need to complete it’s teaching, that is what to answer on the boolean methods isEmpty() and isWinner(). We take the values from the two columns winner and empty

seq draw
Figure 4. sequence diagram of draw

Here is an example a first test: Make sure the machine does not return any ball(s) when it’s state is NO_COIN. From the state diagram of the gumball machine. we learn that the action draw() should only be effective in the state HAS_COIN. This gives the interaction, modeled in the sequence diagram below. The dispense method should only be called in specif states.

As can be seen in the sequence diagram, the dispense() call is only allowed in the state HAS_COIN, not in NO_COIN. Let the test verify that using a mockito verify. A Mockito verify is similar to a JUnit assertXXX, it will complain (in Java terms: throw an Assert Exception), that a dispense() is NOT happening.

first test
    @ParameterizedTest
    @CsvFileSource( resources = { "testtable.csv" } )
    public void verifyDispense(
            int nr, String initialStateName, String finalStateName,
            String triggerName, boolean empty, boolean winner,
            int dispenseCount, int addBallsCount, String expectedText ) {

        State initialState = StateEnum.valueOf( initialStateName ); (1)
        var triggerAction = triggerMap.get( triggerName );          (2)

        // prime collaborator
        when( ctx.isEmpty() ).thenReturn( empty );                  (3)
        when( ctx.isWinner() ).thenReturn( winner );                (4)

        triggerAction.accept( ctx, initialState );                  (5)

        verify( ctx, times( dispenseCount ) ).dispense();           (6)
    }
1 Select the appropriate instance of the State.
2 Lookup the interaction in the 'lambda' map, which translates strings to BiConsumer of Context and State.
3 Read the values empty and
4 winner from the table and use them to train the mocked context.
5 Do the interaction
6 verify the the number of dispenses matches the requirement, in the NO_COIN state 0.

CsvFileSourceStateTest has in total 4 tests which assert all stated requirements in the table.

assertMessage that verifies that the correct output is 'printed'
verifyDispense ensures that the exact number of balls is dispensed, 0 or 1.
verifyRefillAddsBalls that some balls are added to the machine when refill is called
verifyStateTransition does what the name says.

Do not forget the context.

The context’s implementation becomes surprisingly simple. It only needs to forward the event or 'trigger' to the current state, which in turn will tell the context what to do.

The context must provide a bit of functionality to make the actions directed by the states meaningful.

  • forward the trigger to the state, providing itself as the first parameter. For instance insertCoin() in the API should result in an `insertCoin(Context) call in the state.

  • Remember what state the context is 'in'. We do that with the changeState method.

  • invoke the exit and enter method when we swicth between states.

  • invoke the enter method on the initial state.

and a few more. These test bodies are given in the test class 'OlifantysMachineImpleTest' with their purpose, so you should be able to fill in the blanks.

How about coverage
Some of the tests in OlifantysMachineImpleTest are there for the purpose of maintaining a high coverage. Since you are already familiar with test coverage, it simply is a matter of turning on the proper maven profile.

The csv test data are in a separate resource folder. If you try to test just one file, make sure you do test project first, which makes maven copy over the csv file to a place where the unit test can find it.

2. State Machines and embedded systems

The techniques used in the olifantysballs exercise are also applicable to real hardware automatons, such as coffee brewers. It is where the enter and exit methods of each states come from, to be able to turn on and off devices that are active in a state and inactive in another. In C the enum can be replaced by structs containing function pointers, which point to the correct implementation of each function. Functions that do not have to be overridden can simply point to a default implementation, and changing state is replacing one such struct by the struct for the next state. The 'main' program would set up and initialize the structs.

using method pointers in C
typedef struct stateHandles{
    void (*ejectCoin)();
    ...
} State ;

State NO_COIN= {
  &defaultEjectCount;
  ...
};

State state=...

    state->ejectCion();

3. Other uses of State Machines

State machines or Automatons are not only of interest in actual vending machines or toys, but also in many other problems in the computer science domain. One such application is that of regular expressions. Another is what compilers do when Lexing and Parsing the source code.

Template Engine: Text processing with state machines.

Exercise Template Engine

Use case: Think serial mail or mail merge or something similar.

The code is also a kind of template, you need to fill in the empty spots, like in an assessment.

sigil symbols There will be talk about sigil, a magic sign. In the example we use two:
'{{' as starter and '}}' as terminator, but the solution is flexible enough to take any other pair you like. Sigil stands for the character or character-sequence given and is there to make parsing of template text easier. By choosing a proper set, they will not handicap what you can write in your template text. The defaults are quite reasonable. And if you do, you can always escape.

Use case and idea
This exercise is about a simple templating engine.

Its usecase is a typical serial letter generator. This comes in handy when you want to send 'personalised' emails to students. A seasoned informatics person knows you should not use eval(…​) for that, because that will potentially execute any code.

Say you have some text and you want to substitute certain words with something from a data source, such as student names and/or grades. Then you could embellish the text with special tokens called sigils (magic signs) that help to find the words to be substituted. An example may help:

Dear {{teacher}},

We would like to complain about the module {{mod}}.
We learn to little and the examples are way too trivial.
We would like to receive our moneys worth of teaching.
Kind regards,
The students.

In the example the magic signs are the {{ and }} sequences, anything in between is understood as key in the map, and when found, replaced by the mapped value.

If the map is filled like:
Map<Object, String> map = Map.of(
        "teacher", "Richard van den Ham",
        "mod", "Programming Concepts 2 (PRC2)"
);
Then the engine can be created and started as
Function<String, String> lookup = ( s ) -> map.getOrDefault( s, "" );
// write to file.
new Engine( "{{", "}}", lookup )
        .reading( "letter.txt" )
        .writing( "mail-out.txt" )
        .run();

This program will print to a the file name 'mail-out.txt'.

sample output
Dear Richard van den Ham,

We would like to complain about the module Programming Concepts 2 (PRC2).
We learn too little and the examples are way too trivial.
Kind regards,
The students.

Design

The design is about a simple state machine with five states. See figure state diagram.

In addition the "templating engine" has two helper classes, a SigilMatcher and the enum TemplateState. See the class diagram These classes are package private and so are the methods in them.

The methods you use to configure the engine, constructors, the readingXXX(…​), and writing(…​) are part of a so called fluent API. They, plus the accept(int) and run() methods are the only methods that are part of the public API.

Given the sigils as above, the Test data in the EngineTest test class are:

test data
static Stream<ET> getTestValues() {
    // we use a helper class ET to collect strings into a engine test data object.
    ET[] testData = new ET[]{
                et( "{{", "}}", '\\', "hello {{world}}...", "hello Schöne Heimat...",
                "world", "Schöne Heimat" ),
                et( "{{", "}}", '\\', "hello {{süßes}}", "hello Schöne Heimat", "süßes", "Schöne Heimat" ),
                et( "{{", "}}", '\\', "now somethings with {{two}} words to {{replace}}", // template
                "now somethings with more words to work with", // expected
                "two", "more", //  kv 1
                "replace", "work with" // kv 2
                ),
                et( "{{", "}}", '\\', "Niks geen substituties", "Niks geen substituties" ),
                et( "{{", "}}", '\\', "Leave me {alone}}", "Leave me {alone}}" ), // incomplete starter
                et( "{{", "}}", '\\', "Leave me { {alone}}", "Leave me { {alone}}" ), // broken starter
                et( "{{", "}}", '\\', "Leave me {{alone}, go away", "Leave me {{alone}, go away" ), // broken postfix
                et( "{{", "}}", '\\', "Leave me {{alone}}, go away I am Broken at the {{Tail}",
                "Leave me , go away I am Broken at the {{Tail}" ), // broken postfix at tail
                // text below for escaping version only.
                // Note the duplication of the \ char to make the strings valid Java strings.
                et( "{{", "}}", '\\', "Look ma, this is how you write a sigil"
                + " text in your letter: \\{{my dear boy}}"
                + " and I used \\\\ to write this example",
                "Look ma, this is how you write a sigil"
                + " text in your letter: {{my dear boy}}"
                + " and I used \\ to write this example",
                "not used", "not used either" // kv1
                ),
                // below different sigil lengths, 1 and 3.
                et( "{", "}", '\\', "hello single char template {world}...", "hello single char template Schöne Heimat...",
                "world", "Schöne Heimat" ),
                et( "{{{", "}}}", '@',
                "And it works with three and other escape as well."
                + "text in your letter: {{{my dear boy}}}.",
                "And it works with three and other escape as well."
                + "text in your letter: well, that depends.",
                "my dear boy", "well, that depends" ), // kv1
            };
    return Arrays.stream( testData );
}

Your Task.

The sigil matcher is complete, tested and implemented. It is also a state full class, however NOT using the state pattern, but simply counting the characters matched. So that implementation will provide little help in your real task. Sorry.

Implement this design. Develop test driven. (Should be easy. The test data are already waiting in the code).

  1. Implement the reading method in Engine.

  2. You only need to complete the testEngine method in the EngineTest test class.

  3. Implement the state machine in the TemplateState class, see the state diagram.

Look for the todo’s with CTRL+6 in NetBeans IDE.

Hints
A statemachine reacts to 'events' or method calls.

  • The functions in java.util.function have no member that processes chars. The reason for that is that in many case chars can be processed as ints. Also, when retrieving the character elements from a CharSequence (an interface implemented by the java.lang.String class) using either the method IntStream chars() or IntStream codePoints() will produce an intStream. For the processing part that is just fine. One needs only to remember to cast the values to `char`s when writing them to the output.

  • In this state machine the event of importance is the invocation of the method Engine.accept(int char), which is stipulated by the IntConsumer functional interface.

  • The Engine provides an accept(int) method that matches the IntConsumer functional interface. This means that is the engine is fed one character at the time. You can use that method as a method reference (this::accept) or a lambda expression. ((c) → accept(c)).

  • To use a file with text as input template you will have to turn that file into many accept calls.

    • Think of streams, and use the method Files.lines(), study what flatMap and flatMapToInt is all about and use it to get the result: from stream of strings to stream of int, one int per character of the string. In the String API you have two options. Either one will do in most cases.

  • Make a drawing (pencil, paper). Files.line produces a stream, the Engine puts itself at the end, consuming the stream of `int`s. You have to tie these parts together.

engineclassesx
Figure 5. class diagram

In the class diagram of engine substitute Engine for the T in the ObjIntConsumer<T>.
Remember that an ObjIntConsumer takes two inputs and produces no direct return value. State Pattern in action with a twist of enum thrown in.

templaterStateMachinex i
Figure 6. state diagram

There is a difference between the state machine diagram on this website and in the source folder in the project. The diagram on this website should take precedence, because this is how we test it now. The difference actually only affects the way wrongly formatted template documents are handled, so the choice is arbitrary. In the design in this website, the 'junk' is output, in the other it is discarded. Showing the junk may help the user to find the error in the template earlier. A more professional variant might produce some diagnostics, but we left that out of the exercise.

4. State machines and regular expressions

You can find a lot on the relation between state machines and regular expressions. In most cases, including in Java, the regex is translated or compiled into a statemachine (just) before matches are attempted.

A nice illustration is given by the REGEXPER website, that turns a regex into a railroad diagram. This make the understanding of a regular expression way easier.

A simple regex : Dutch postal code: "\d{4}\s[A-Z]{2}"

postcode regex
Figure 7. with diagram

The postcode railroad diagram reads as a digit followed by 3 more digits, a space, a letter and one more letter.

More complex example:
// note that you can break off the regex in java and intersperse it with comments.
String emailRegex="^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+"+ // quite some rubish before the at sign
     "@"+ // the at sign followed by a more restricted set of chars,
          //  with some lengths restrictions
     "[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"+ // optional non capt grp
     // any number of letter-number combos
     "(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$";
email regex
Figure 8. Produces this 'railroad' diagram. which is equivalent to a state machine diagram.

Note that there is not one TRUE regex for email addresses. If you want to know the nitty-gritty details of valid email addresses look at RFC 8398 or RFC 5322, or ask in CSAR class.

Once you understand that regexes are state machine specifications, and that is is easy to create state machines that never 'terminate', you understand that regular expressions can also be dangerous.

Accepting a regular expression from any source may cause a denial of service attack, because such expression may put your server into an endless loop.

Advice: only accept a restricted set of regexes, maybe parse them before use. In particular avoid back tracking constructs.

So beware.

5. Regular expressions

An often heard saying is: When you have a problem that can be solved with a regular expression, you actually have two problems, the problem and how to write the expression.

There is indeed some truth in the saying: Regular expressions can be hard to read (and hence maintain), and sometimes hard to write as well.

The problem lies in the fact that the format of the most popular form, the Perl dialect, is quite terse. However you can do quite a few things to improve at least their readability.

5.1. Basic regex syntax and rules

A simple string is also a regular expression. As far as there are no meta characters involved, a string simply matches itself.

  • The first meta character you need to know is the dot '.', which matches any character. So a line with a dot will match any line that is the same as the line with any substution for the dot character (including itself).

  • The second meta character is the backslash '\'. It takes the special meaning (as in meta) of the following character away. Back to our matching line: if you want exactly a dot at the place of the dot, prefix it with a backslash, so it will be understood as a dot. Like this: [red]"\.". But since we using Java, and the backslash already has a special role in strings, you must double the backslash in most cases, thus: "\\.".

  • The next are the quantifier: how many of something do you want.

    • ? 0 or one, aka at most. So "a?" means at most one a.

    • + at least one.

    • * any number of times, including zero.

    • numberic quantifier, using braces ('{' and '}''). For instance "a{1,4}" means a between 1 and 4 times, inclusive. You can also write "a{3}" meaning exaclty 3 as. Or a lower (a{2,} for at least 2 a) or upper boundary (a{,12}) at most 12.

  • character classes

    • user specified: [aeiou] matches the English vowels, [a-p] the first 16 characters of the alphabet, [a-pA-P] the same, but ignoring case.

  • predefined :

    • \w = word, which is the set of characters in [a-zA-Z_0-9], so a through z in both upper and lower case, the underscore and the digits 0-9.

    • \W negates the class, in this case it matches the non-word characters

    • \d the decimal digits

    • \D not a digit. etc.

char Meaning

.

match all

+

at least one quantifier

?

ate most one

{

start of quantifier spec

}

end of quantifier spec

[

start character class

]

end character class

The definition as far as java is concerned is given in the java.util.Pattern class. You do not have to know all details, the summary above suffices for the most cases. The java doc of pattern gives a complete overview.

5.2. Grouping

Sometimes you are not just interested in a match or not, but want to capture parts of the input for further processing. For that you use the parenthesis '(' and ')'. A group, when found, will get a number (or even a name, but that is not in this lesson). Group 0 is typically the whole expression, and number one the first group identified with the parenthesis.

So if you have a regular expression string line "a(e|i|o|u)", group number one will be the group containing the vowel following a, if any.

To get acquainted with regular expressions, it is very helpful to write some tests, to verify your pattern or assumptions on the pattern.

6. Matches, Matchers, match Groups, and group names.

The java.util.regex package contains the classes to work with regular expressions. The most important are the Pattern and Matcher classes.

A Pattern is a compiled representation of a regular expression. It is used to create a Matcher object that will match the regular expression against an input string.

In a pattern you can define groups of interest, for instance values that you want to extract from the input string. The Matcher object will then contain the groups that were found in the input string. You create a group by surrounding the part of the pattern with parenthesis.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Main {
    public static void main(String[] args) {
        String input = "The quick brown fox jumps over the lazy dog";
        String pattern = "The (\\w+) brown (\\w+) jumps over the (\\w+) dog"; (1)
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(input);
        if (m.find()) { (2)
            System.out.println("Found a match");
            System.out.println("Group 0: " + m.group(0)); (3)
            System.out.println("Group 1: " + m.group(1));
            System.out.println("Group 2: " + m.group(2));
            System.out.println("Group 3: " + m.group(3));
        } else {
            System.out.println("No match found");
        }
    }
}
1 There are three groups in the pattern. The first group matches the word after "The", the second group matches the word after "brown", and the third group matches the word after second "the".
2 There are several methods to match a pattern against an input string. The find method is the most common one. It returns true if the pattern is found in the input string, and false otherwise. There is also a matches() method, which returns true if the pattern matches the whole input string, and false otherwise. The third method is lookingAt(), which returns true if the pattern matches the beginning of the input string, and false otherwise.
3 Group 0 is the whole match. The other groups are the parts of the input string that match the groups in the pattern.

6.1. Named groups and comments to the regex

You can give the groups names, so the code becomes less brittle when the pattern changes. This is done by using the syntax (?<name>pattern).

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Main {
    public static void main(String[] args) {
        String input = "The quick brown fox jumps over the lazy dog";
        String pattern =
                      "The (?<first>\\w+)" // comment 1
                      +" brown (?<second>\\w+)" // comment 2
                      +" jumps over the (?<third>\\w+) dog"; // comment 3
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(input);
        if (m.find()) {
            System.out.println("Found a match");
            System.out.println("Group 0: " + m.group(0));
            System.out.println("Group first: " + m.group("first"));
            System.out.println("Group second: " + m.group("second"));
            System.out.println("Group third: " + m.group("third"));
        } else {
            System.out.println("No match found");
        }
    }
}

The comments are not part of the pattern, but are there to explain the pattern. They are ignored by the compiler. But they are very useful to explain the pattern to others, or to yourself when you come back to the code after a while.

Reading grades from text source

Grade Capture

Use capture groups.

To be able to insert student grades into progress, the teachers want to be able to copy and paste grades from a spreadsheet (which is the typical way grades are assembled) into a text area. In this exercise, a simple file will do as the source of the grades, because working with a real clipboard is a project in itself.

  • The spreadsheet is replaced by a text file with one or more spaces or tabs ("white space") as column separator.

  • The spreadsheet has the typical format that the student number is in the first column and the
    grade for that student is in the last column. The columns in between are arbitrary and may contain anything.

  • You may ignore the spreadsheet part, because the paste action is done in a text area, which can be collected as a text, that can be divided into lines and columns, the column separator being white space.

  • The student number consist of 7 consecutive digits, the grade can consists of one digit,
    two digits (a 10) or two digits separated by a dot or period (.) or a comma (,),
    since both continental Europeans and English speaking teachers want to enter grades this way.

  • The number of columns is not fixed and may change within one input, e.g. having separate column(s) for the optional Dutch 'tussenvoegsel'. A regex pattern to represent this is available in the GradeFilter class.

  • When no grade is found in the input line, and the getResult method is invoked anyway, the result should contain key=null and value =null.

    • So both the grade and studentId methods should in that case return null

A set of test data is given and has the exact input as the table below.

Table 1. Test data
# "quoted test input" should match found group value found snummer grade

3785895 Jan Jansen 8.7.1997 6.6 8.3 2.4 6,8

true

6,8

3785895

6.8D

3895785 Piet Jansen 8.7.1997

false

3895785

0.0D

3785985 Henk Jansen 8.12.1994 6.6 8.3 2.4 7.9

true

7.9

3785985

7.9D

3785915 Niki Jansen 8.12.1994 6.6 8.3 2.4 8

true

8

3785915

8D

1245717 Joepie Hombergh van den 18.03.1992 10.0 10.0 10.0 10

true

10

1245717

10D

Note that the first column of the csv file is the "quoted input", so should be seen as one input. The rest of the columns are the output values to be used in the assertions. Empty cells represent null values.

The result type of the getResult() method in the GradeCapture class may look a bit strange: AbstractMap.SimpleEntry<Integer,Double>. The type is what is stored in hash maps and such, and is a way to use a tupple without having to declare the class. It is a kludge, just a way to have a pair of objects. It stems from the fact that the student-number-grade pairs will typically land in a Map of sorts anyway.
Java 14 has a preview feature called record which will make this more way more elegant. There you would write record GradeResult(int snummer,double grade){} as the full 'class' definition, including constructor, getters, toString and hashCode and equals. It is likely to have that as a standard feature in Java 15 and onward, and will be quite useful in case you want a method to return more than just one value.

Your Tasks

  • Complete the TODOs in the test classes. (Press control+6 in NetBeans IDE)

  • In the GradeFilter class complete all TODOs.

  • In the client class there is a last todo and that is to use the grade filter in the getGradesAsMap() method, by reading a file as lines, applying the gradefilter on each line and collect the student numbers and grades in a map with student number as key and grade as value.

There are a few naming problems in the exercise, because of incomplete (renaming) refactoring. Where you see the class name GradeFilter read (or replace it with) GradeCapture.

7. Reading

  • Regular expressions: Horstmann, Vol II Section 2.7

Exercises in this part.