1. Implementing behavior using state machines.
Systems sometimes have to exhibit some kind of behavior that is different observable output over time. This is different from a pure function, which will always produce the same output for the same input. The behavior differs depending on the 'history' the system experienced, typically in the form of 'events' or method calls that happened over time. Such method calls change the state the system is in. We call such behavior state-full and this way of operating is often implemented using a 'state machine'. In this week and in the AADE course we will be using two concepts:
-
The state (machine) diagram
-
The State pattern
A state machine
is a software concept used to implement the behavior of something whose 'reaction'
or outcome depends on the events that happened over time. A very simple example is a lamp and a switch.
When properly connected and none of the parts is defect, the bulb will light up when the switch is in the on state.
As state machine: the switch passes current when it has been turned on, which can be expressed as a turn-on event, making the bulb light up.
From the diagram it should be obvious that a current can only flow if the switch is in the on (or closed) state.
In software we model this as the path the code takes, which then depends on some condition or value.
The state diagram is a way to express the behavior in a graphical manner. Each state of the system is drawn as a box with rounded corners, like in the diagram on the left.
In the GoF State pattern, which is an object oriented way of modeling state behavior, one uses different
objects to realize the effect of changing the software path. The objects all implement the same interface, but have a different implementation.
Then, whenever a state switching event occurs, the current state object is replaced by the state object for the new state.
As another analogy, you can think of a railway switch.
The active state object will always have the same behavior. One makes the train go straight, the other makes it go right.
Each state on itself always does the same, like a pure function, so with the same input (train) the same output is produced.
The switching between states makes it possible to control the direction of the train traffic.
State machines or state charts can be implemented in several ways. In this week we will apply an OO approach using the well known
State Pattern , combined with enums
and some Java 8 features, in particular default
methods.
The State Pattern has the following UML class diagram:
We deal with a context, an abstract state and multiple concrete state implementations that do the work
and represent the state elements in the diagram. In the diagram above there
would be a state instance for both ConcreteStateA and ConcreteStateB.
In the implementation of the exercise of this week we will be using enum
as the basic construct for our work.
Using a functional state approach by using enum
has the following advantages:
-
There is no chance of creating new states. Enum ensures that the only instances there ever will be are created at class loading time. You always know all states there ever will be.
-
A state can be implemented as a pure function, making the number of tests needed minimal. The inputs to the state are the state of the context (or technically, the state stored in the context), the inputs, and the event or method call.
-
Such functions and thereby state instances can be shared between contexts, because functional states do not have any state themselves. The actual state is saved in the context. This is not because enums cannot have state, they can, but because a functional state in a state-machine shouldn’t. It IS the state and should not have any. It should always do the same. More importantly, they are immutable, meaning no one can break them by setting some wrong value.
-
A functional state is easily tested without a need for a complex context. The context can easily be mocked as can be seen in this week’s olifantysballs exercise.
The disadvantage of using functional states in this way is that they cannot keep information themselves, such as counting something. But quite often the counting involves only a few possible values, which can be states themselves, or can be realized in the context. As an example for few counting states you might consider a traffic light which counts green, yellow, red.
A common error made by novice programmers is to call a method inside a constructor that takes the constructor’s type as a parameter. Do not do this inside the constructor
The problem is that the constructor may not yet be complete, in particular when fields are defined and initialized later in the constructor or even after the constructor definition. To avoid this problem, you typically have some init() method that should be called after construction or use a factory method that does something similar. In the gumball machine we chose the later. You can find the details in the javadoc documentation of the exercise. Use init method.
Caller has not much extra work.
|
Do not aways use all tricks you’ve learned so far. They are not always needed. In particular, states do not have to be generic, and with enums that would even be impossible, because an enum value is a leaf class all by it self, so making it generic for later specialisation like in the Zoo example in part 4 is futile. If you appear to run into this situation, because your context is generic, do not simply use raw objects but see if you can refactor out an interface that is NOT generic, too be implemented by the context that wants to be generic. You will see that in the JsonMarshaller, the state machine part. |
Exercise OlifantysBalls
Mockito is a very powerful tool and serves the developer best when you stick to the rule of not writing code that is not executed. This applies to stubbing and mocking as well. When you stub or mock something, that fact serves as a kind of documentation for the developer or maintainer of the code and by implication of the tests as well. So if you apply stubbing and get a warning, do not just simply turn of the warning, but also document the potential unneeded stubbing. There are situations where you must compromise between the strictness of Mockito and the convenience of other aspects of a test approach. One such case is applicable to the GumballStateTest, where we use one test table as external test data. This test data is used in all tests. Changing the test data to just including the exactly right stubbing or mocking would make the test data set bigger, and thereby less easy to comprehend. In the exercise below we applied the annotation TURN of Warnings only, once you have verified that the path you tread is safe given the conditions of the tests |
Olifantys Balls
In this exercise you will test driven develop the behavior of a gumball machine. You will be implementing a state machine using the GoF State Pattern . The machine has following events:
-
insertCoin()
, -
ejectCoin(…)
, -
refill(…)
, -
and
draw(…)
which can be understood as events, modeled in the interfaceState
and should therefore have an implementation in each state.
and four states:
-
SOLD_OUT
, -
NO_COIN
, -
HAS_COIN
-
and
WINNER
.
Study the state diagram, it is the (whole) specification of our gumball system.
You should find that the system starts in the state
-
SOLD_OUT, and will go to NO_COIN after a refill with some gumballs.
-
When you insert a coin, it goes to HAS_COIN, in which you can draw to get a ball.
-
If the machine becomes empty by a draw, it will go to SOLD_OUT,
-
If not and you are lucky because the winner guard is true, the state is WINNER in which you can draw another ball without extra payment.
-
If not winner or after the extra draw the system will go back to NO_COIN, waiting for the next coin insertion.
-
As last detail, when after WINNER the machine becomes empty because draw takes the last ball, the system will go to the SOLD_OUT out state.
-
Because both the Java artefacts Context
and State
in the initial design are
interfaces, we’ve added the UML public
visibility modifier to all methods,
because that is how interfaces work in Java.
The circled plusses near StateEnum is the UML way of saying that SOLD_OUT etc are inner classes of StateEnum, which they effectively are.
The OlifantysBallMachine
implements two interfaces with the exact purpose of
-
GumBallApi
being the public API for consumption by the client and the -
Context
interface as one of the collaborators in the state pattern.
To keep the API simple, that interface has one public static method for the client to get an instance of the package private OlifantysBallMachine. This keeps our design nice and tidy with the least possible coupling and leakage of unneeded information to the client. See it as an example of a clean design.
The behavior is easily modeled in a state diagram, but can also be expressed in a state transition table, which has exactly the same information with the added benefit that it can easily be understood by a program, in our case the tests of the state machine. Using such state transition table is an older technique than the state machine diagrams. It predates UML, but is still valuable, in particular because it will easily list all possibile states and thus will help to find more test cases, ensuring that the set of tests is complete.
|
Test data in a table
In the table below, empty
and winner
are guard expressions, with the outcome specified for the moment of evaluation. Guard expressions
have a state decide if it should react to the trigger (guard == true) or not. A guard comes from outside the state and is typically available on the context, as in this exercise.
Where the guard is not relevant, it is set to false. Nicer would have been leave the guard’s table cell empty when it is not relevant but that does not
play very well with junit csv tests and would complicate the tests. Another approach would be to have a line each for each of the possible guard values, to ensure that guard
really does not have effect in specific cases.
For the column End State, empty means no state change.
#Start State | trigger | End State | empty | winner | dispense | refills | expected text |
---|---|---|---|---|---|---|---|
HAS_COIN |
draw |
NO_COIN |
FALSE |
FALSE |
1 |
0 |
OlifantysGumball |
HAS_COIN |
draw |
SOLD_OUT |
TRUE |
TRUE |
1 |
0 |
OlifantysGumball |
HAS_COIN |
draw |
WINNER |
FALSE |
TRUE |
1 |
0 |
OlifantysGumball |
HAS_COIN |
ejectCoin |
NO_COIN |
FALSE |
FALSE |
0 |
0 |
Quarter returned |
HAS_COIN |
insertCoin |
FALSE |
FALSE |
0 |
0 |
You should draw to get your ball |
|
HAS_COIN |
refill |
TRUE |
FALSE |
0 |
1 |
refilled |
|
NO_COIN |
draw |
TRUE |
FALSE |
0 |
0 |
You must put in a coin before you can continue |
|
NO_COIN |
ejectCoin |
FALSE |
FALSE |
0 |
0 |
You must put in a coin before you can continue |
|
NO_COIN |
insertCoin |
HAS_COIN |
FALSE |
FALSE |
0 |
0 |
You inserted a coin |
NO_COIN |
refill |
TRUE |
FALSE |
0 |
1 |
refilled |
|
SOLD_OUT |
draw |
FALSE |
FALSE |
0 |
0 |
Machine is empty |
|
SOLD_OUT |
insertCoin |
FALSE |
FALSE |
0 |
0 |
Machine is empty |
|
SOLD_OUT |
insertCoin |
FALSE |
FALSE |
0 |
0 |
Machine is empty |
|
SOLD_OUT |
refill |
NO_COIN |
FALSE |
FALSE |
0 |
1 |
refilled |
WINNER |
draw |
NO_COIN |
FALSE |
FALSE |
1 |
0 |
You got two gumballs for your coin |
WINNER |
draw |
SOLD_OUT |
TRUE |
TRUE |
1 |
0 |
OlifantysGumball |
WINNER |
insertCoin |
FALSE |
FALSE |
0 |
0 |
You should draw once more to get an extra ball |
|
WINNER |
insertCoin |
FALSE |
FALSE |
0 |
0 |
You should draw once more to get an extra ball |
|
WINNER |
refill |
FALSE |
FALSE |
0 |
1 |
refilled |
|
# STATE |
trigger pairs |
that |
have |
not been |
used |
yet |
and should be ignored |
WINNER |
ejectCoin |
FALSE |
FALSE |
0 |
0 |
You should draw once more to get an extra ball |
|
SOLD_OUT |
ejectCoin |
FALSE |
FALSE |
0 |
0 |
Machine is empty |
We will start testing the state enum StateEnum
.
The table reiterates that the state HAS_COIN
needs the most attention in testing.
Also note that this table is just the specification, just like the state diagram,
and not the implementation.
To get our State
tested without any real machine nearby, we must mock out Context
.
This can be done by hand, but would soon make us create a context implementation that sort of works, but
-
first of all, is not tested itself, thus violates the TDD workflow
-
does not behave the way we need it in the test for the
State
type -
and we do not need to have this context to behave in any way other then to give predictable answers when called by the State, and we must ensure that the State has the exact interactions we want. Not too few, and also not too many.
This is where Mockito comes in. We only use a few of its features, namely
-
the mocking facility,
-
recording and playback of method calls and return values
-
verification of the correct calls on the mocked Context.
Our setup looks as follows:
@Mock
final Context ctx; (1)
StringOutput sout = new StringOutput(); (2)
PrintStream out = sout.asPrintStream(); (3)
/**
* Map the trigger name from the csv file to a lambda expression of type
* {@code BiConsumer<Context,State>}.
*/
static Map<String, BiConsumer<Context, State>> triggerMap = Map.of( (4)
"insertCoin", ( c, s ) -> s.insertCoin( c ),
"ejectCoin", ( c, s ) -> s.ejectCoin( c ),
"drawBall", ( c, s ) -> s.draw( c ),
"refill", ( c, s ) -> s.refill( c, 5 )
);
void setupMocks() { (5)
when( ctx.getOutput() ).thenReturn( out );
// any colour will do.
when( ctx.dispense() ).thenReturn( new OlifantysGumball( "RED" ) );
}
1 | The context is mocked in the constructor |
2 | StringOuput and PrintStream are combined into a PrintStream that takes the role of system output. |
3 | Creates a StringOutput and
ensure that anything output by the context is returned in an easily
interpreted String . |
4 | translates strings to actual lambda expressions that do the interaction in the test. |
5 | Is the setup the standard mocks are trained to return the appropriate things on specific method calls. |
This basic setup is done for every test anew and makes sure our SUT has a fresh collaborator every time, ignorant of what happened in earlier tests.
First test
The tabular form of the state machine behavior above is taken as the input form the parameterized tests.
First we get the initial state from the csv test source, which is in the column name "Start State".
Because we need a State enum value we need to look that up using the enum classes' values()
method.
When a state method is invoked, it can or should have multiple interactions with the
context. To test that, we use the mocked context. First we need to complete it’s teaching, that
is what to answer on the boolean methods isEmpty()
and isWinner()
. We take the values from the two columns winner and empty
Here is an example a first test: Make sure the machine does not return any ball(s)
when it’s state is NO_COIN
. From the state diagram of the gumball machine. we learn that the action
draw()
should only be effective in the state HAS_COIN
. This gives the
interaction, modeled in the sequence diagram below. The dispense method should only be called in specif states.
As can be seen in the sequence diagram, the dispense()
call is only allowed in
the state HAS_COIN
, not in NO_COIN
. Let the test verify
that using a mockito
verify. A Mockito verify
is similar to a JUnit assertXXX
, it will complain (in Java terms:
throw an Assert Exception), that a dispense() is NOT happening.
@ParameterizedTest
@CsvFileSource( resources = { "testtable.csv" } )
public void verifyDispense(
int nr, String initialStateName, String finalStateName,
String triggerName, boolean empty, boolean winner,
int dispenseCount, int addBallsCount, String expectedText ) {
State initialState = StateEnum.valueOf( initialStateName ); (1)
var triggerAction = triggerMap.get( triggerName ); (2)
// prime collaborator
when( ctx.isEmpty() ).thenReturn( empty ); (3)
when( ctx.isWinner() ).thenReturn( winner ); (4)
triggerAction.accept( ctx, initialState ); (5)
verify( ctx, times( dispenseCount ) ).dispense(); (6)
}
1 | Select the appropriate instance of the State. |
2 | Lookup the interaction in the 'lambda' map, which translates strings to BiConsumer of Context and State. |
3 | Read the values empty and |
4 | winner from the table and use them to train the mocked context. |
5 | Do the interaction |
6 | verify the the number of dispenses matches the requirement, in the NO_COIN state 0. |
CsvFileSourceStateTest has in total 4 tests which assert all stated requirements in the table.
assertMessage that verifies that the correct output is 'printed'
verifyDispense ensures that the exact number of balls is dispensed, 0 or 1.
verifyRefillAddsBalls that some balls are added to the machine when refill is called
verifyStateTransition does what the name says.
Do not forget the context.
The context’s implementation becomes surprisingly simple. It only needs to forward the event or 'trigger' to the current state, which in turn will tell the context what to do.
The context must provide a bit of functionality to make the actions directed by the states meaningful.
-
forward the trigger to the state, providing itself as the first parameter. For instance
insertCoin()
in the API should result in an `insertCoin(Context) call in the state. -
Remember what state the context is 'in'. We do that with the changeState method.
-
invoke the exit and enter method when we swicth between states.
-
invoke the enter method on the initial state.
and a few more. These test bodies are given in the test class 'OlifantysMachineImpleTest' with their purpose, so you should be able to fill in the blanks.
How about coverage
Some of the tests in OlifantysMachineImpleTest are there for the purpose of maintaining
a high coverage. Since you are already familiar with test coverage, it simply is a matter
of turning on the proper maven profile.
The csv test data are in a separate resource folder. If you try to test just one file, make sure you do test project first, which makes maven copy over the csv file to a place where the unit test can find it. |
2. State Machines and embedded systems
The techniques used in the olifantysballs exercise are also applicable to real hardware automatons, such as coffee brewers. It is where the enter and exit methods of each states come from, to be able to turn on and off devices that are active in a state and inactive in another. In C the enum can be replaced by structs containing function pointers, which point to the correct implementation of each function. Functions that do not have to be overridden can simply point to a default implementation, and changing state is replacing one such struct by the struct for the next state. The 'main' program would set up and initialize the structs.
typedef struct stateHandles{
void (*ejectCoin)();
...
} State ;
State NO_COIN= {
&defaultEjectCount;
...
};
State state=...
state->ejectCion();
Template Engine: Text processing with state machines.
Exercise Template Engine
Use case: Think serial mail or mail merge or something similar.
The code is also a kind of template, you need to fill in the empty spots, like in an assessment.
There will be talk about sigil, a magic sign. In the example we use two:
'{{' as starter and '}}' as terminator,
but the solution is flexible enough to take any other pair you like. Sigil stands for the character or character-sequence
given and is there to make parsing of template text easier. By choosing a proper set, they will not handicap what you can write in your template text.
The defaults are quite reasonable.
And if you do, you can always escape.
Use case and idea
This exercise is about a simple templating engine.
Its usecase is a typical serial letter generator. This comes in handy when
you want to send 'personalised' emails to students. A seasoned
informatics person knows you should not use eval(…)
for
that, because that will potentially execute any code.
Say you have some text and you want to substitute certain words with something from a data source, such as student names and/or grades. Then you could embellish the text with special tokens called sigils (magic signs) that help to find the words to be substituted. An example may help:
Dear {{teacher}},
We would like to complain about the module {{mod}}.
We learn to little and the examples are way too trivial.
We would like to receive our moneys worth of teaching.
Kind regards,
The students.
In the example the magic signs are the {{
and }}
sequences, anything in
between is understood as key in the map, and when found, replaced by
the mapped value.
Map<Object, String> map = Map.of(
"teacher", "Richard van den Ham",
"mod", "Programming Concepts 2 (PRC2)"
);
Function<String, String> lookup = ( s ) -> map.getOrDefault( s, "" );
// write to file.
new Engine( "{{", "}}", lookup )
.reading( "letter.txt" )
.writing( "mail-out.txt" )
.run();
This program will print to a the file name 'mail-out.txt'.
Dear Richard van den Ham,
We would like to complain about the module Programming Concepts 2 (PRC2).
We learn too little and the examples are way too trivial.
Kind regards,
The students.
Design
The design is about a simple state machine with five states. See figure state diagram.
In addition the "templating engine" has two helper classes, a SigilMatcher
and the enum TemplateState
. See the
class diagram These classes are package private and so are the methods in them.
The methods you use to configure the engine, constructors, the
readingXXX(…)
, and
writing(…)
are part of a so called fluent API. They, plus the
accept(int)
and run()
methods are the only methods that are
part of the public API.
Given the sigils as above, the Test data in the EngineTest test class are:
static Stream<ET> getTestValues() {
// we use a helper class ET to collect strings into a engine test data object.
ET[] testData = new ET[]{
et( "{{", "}}", '\\', "hello {{world}}...", "hello Schöne Heimat...",
"world", "Schöne Heimat" ),
et( "{{", "}}", '\\', "hello {{süßes}}", "hello Schöne Heimat", "süßes", "Schöne Heimat" ),
et( "{{", "}}", '\\', "now somethings with {{two}} words to {{replace}}", // template
"now somethings with more words to work with", // expected
"two", "more", // kv 1
"replace", "work with" // kv 2
),
et( "{{", "}}", '\\', "Niks geen substituties", "Niks geen substituties" ),
et( "{{", "}}", '\\', "Leave me {alone}}", "Leave me {alone}}" ), // incomplete starter
et( "{{", "}}", '\\', "Leave me { {alone}}", "Leave me { {alone}}" ), // broken starter
et( "{{", "}}", '\\', "Leave me {{alone}, go away", "Leave me {{alone}, go away" ), // broken postfix
et( "{{", "}}", '\\', "Leave me {{alone}}, go away I am Broken at the {{Tail}",
"Leave me , go away I am Broken at the {{Tail}" ), // broken postfix at tail
// text below for escaping version only.
// Note the duplication of the \ char to make the strings valid Java strings.
et( "{{", "}}", '\\', "Look ma, this is how you write a sigil"
+ " text in your letter: \\{{my dear boy}}"
+ " and I used \\\\ to write this example",
"Look ma, this is how you write a sigil"
+ " text in your letter: {{my dear boy}}"
+ " and I used \\ to write this example",
"not used", "not used either" // kv1
),
// below different sigil lengths, 1 and 3.
et( "{", "}", '\\', "hello single char template {world}...", "hello single char template Schöne Heimat...",
"world", "Schöne Heimat" ),
et( "{{{", "}}}", '@',
"And it works with three and other escape as well."
+ "text in your letter: {{{my dear boy}}}.",
"And it works with three and other escape as well."
+ "text in your letter: well, that depends.",
"my dear boy", "well, that depends" ), // kv1
};
return Arrays.stream( testData );
}
Your Task.
The sigil matcher is complete, tested and implemented. It is also a state full class, however NOT using the state pattern, but simply counting the characters matched. So that implementation will provide little help in your real task. Sorry.
Implement this design. Develop test driven. (Should be easy. The test data are already waiting in the code).
-
Implement the
reading
method inEngine
. -
You only need to complete the
testEngine
method in theEngineTest
test class. -
Implement the state machine in the
TemplateState
class, see the state diagram.
Look for the todo’s with CTRL+6 in NetBeans IDE.
Hints
A statemachine reacts to 'events' or method calls.
-
The functions in java.util.function have no member that processes chars. The reason for that is that in many case chars can be processed as ints. Also, when retrieving the character elements from a CharSequence (an interface implemented by the java.lang.String class) using either the method
IntStream chars()
orIntStream codePoints()
will produce anintStream
. For the processing part that is just fine. One needs only to remember to cast the values to `char`s when writing them to the output. -
In this state machine the event of importance is the invocation of the method
Engine.accept(int char)
, which is stipulated by theIntConsumer
functional interface. -
The Engine provides an accept(int) method that matches the
IntConsumer
functional interface. This means that is the engine is fed one character at the time. You can use that method as a method reference (this::accept
) or a lambda expression. ((c) → accept(c)
). -
To use a file with text as input template you will have to turn that file into many accept calls.
-
Think of streams, and use the method
Files.lines()
, study what flatMap and flatMapToInt is all about and use it to get the result: from stream of strings to stream of int, one int per character of the string. In the String API you have two options. Either one will do in most cases.
-
-
Make a drawing (pencil, paper).
Files.line
produces a stream, the Engine puts itself at the end, consuming the stream of `int`s. You have to tie these parts together.
In the class diagram of engine substitute Engine for the T in the ObjIntConsumer<T>
.
Remember that an ObjIntConsumer takes two inputs and produces no direct return value. State Pattern in action with a twist of enum thrown in.
There is a difference between the state machine diagram on this website and in the source folder in the project. The diagram on this website should take precedence, because this is how we test it now. The difference actually only affects the way wrongly formatted template documents are handled, so the choice is arbitrary. In the design in this website, the 'junk' is output, in the other it is discarded. Showing the junk may help the user to find the error in the template earlier. A more professional variant might produce some diagnostics, but we left that out of the exercise. |
4. State machines and regular expressions
You can find a lot on the relation between state machines and regular expressions. In most cases, including in Java, the regex is translated or compiled into a statemachine (just) before matches are attempted.
A nice illustration is given by the REGEXPER website, that turns a regex into a railroad diagram. This make the understanding of a regular expression way easier.
A simple regex : Dutch postal code: "\d{4}\s[A-Z]{2}"
The postcode railroad diagram reads as a digit followed by 3 more digits, a space, a letter and one more letter.
// note that you can break off the regex in java and intersperse it with comments.
String emailRegex="^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+"+ // quite some rubish before the at sign
"@"+ // the at sign followed by a more restricted set of chars,
// with some lengths restrictions
"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"+ // optional non capt grp
// any number of letter-number combos
"(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$";
Note that there is not one TRUE regex for email addresses. If you want to know the nitty-gritty details of valid email addresses look at RFC 8398 or RFC 5322, or ask in CSAR class.
Once you understand that regexes are state machine specifications, and that is is easy to create state machines that never 'terminate', you understand that regular expressions can also be dangerous. Accepting a regular expression from any source may cause a denial of service attack, because such expression may put your server into an endless loop. Advice: only accept a restricted set of regexes, maybe parse them before use. In particular avoid back tracking constructs. So beware. |
5. Regular expressions
An often heard saying is: When you have a problem that can be solved with a regular expression, you actually have two problems, the problem and how to write the expression.
There is indeed some truth in the saying: Regular expressions can be hard to read (and hence maintain), and sometimes hard to write as well.
The problem lies in the fact that the format of the most popular form, the Perl dialect, is quite terse. However you can do quite a few things to improve at least their readability.
5.1. Basic regex syntax and rules
A simple string is also a regular expression. As far as there are no meta characters involved, a string simply matches itself.
-
The first meta character you need to know is the dot
'.'
, which matches any character. So a line with a dot will match any line that is the same as the line with any substution for the dot character (including itself). -
The second meta character is the backslash
'\'
. It takes the special meaning (as in meta) of the following character away. Back to our matching line: if you want exactly a dot at the place of the dot, prefix it with a backslash, so it will be understood as a dot. Like this: [red]"\.". But since we using Java, and the backslash already has a special role in strings, you must double the backslash in most cases, thus:"\\."
. -
The next are the quantifier: how many of something do you want.
-
? 0 or one, aka at most. So "a?" means at most one a.
-
+ at least one.
-
* any number of times, including zero.
-
numberic quantifier, using braces ('{' and '}''). For instance "a{1,4}" means a between 1 and 4 times, inclusive. You can also write "a{3}" meaning exaclty 3 as. Or a lower (a{2,} for at least 2 a) or upper boundary (a{,12}) at most 12.
-
-
character classes
-
user specified: [aeiou] matches the English vowels, [a-p] the first 16 characters of the alphabet, [a-pA-P] the same, but ignoring case.
-
-
predefined :
-
\w = word, which is the set of characters in [a-zA-Z_0-9], so a through z in both upper and lower case, the underscore and the digits 0-9.
-
\W negates the class, in this case it matches the non-word characters
-
\d the decimal digits
-
\D not a digit. etc.
-
char | Meaning |
---|---|
. |
match all |
+ |
at least one quantifier |
? |
ate most one |
{ |
start of quantifier spec |
} |
end of quantifier spec |
[ |
start character class |
] |
end character class |
The definition as far as java is concerned is given in the java.util.Pattern class. You do not have to know all details, the summary above suffices for the most cases. The java doc of pattern gives a complete overview.
5.2. Grouping
Sometimes you are not just interested in a match or not, but want to capture parts of the input for further processing. For that you use the parenthesis '(' and ')'. A group, when found, will get a number (or even a name, but that is not in this lesson). Group 0 is typically the whole expression, and number one the first group identified with the parenthesis.
So if you have a regular expression string line "a(e|i|o|u)", group number one will be the group containing the vowel following a, if any.
To get acquainted with regular expressions, it is very helpful to write some tests, to verify your pattern or assumptions on the pattern.
6. Matches, Matchers, match Groups, and group names.
The java.util.regex package contains the classes to work with regular expressions. The most important are the Pattern and Matcher classes.
A Pattern is a compiled representation of a regular expression. It is used to create a Matcher object that will match the regular expression against an input string.
In a pattern you can define groups of interest, for instance values that you want to extract from the input string. The Matcher object will then contain the groups that were found in the input string. You create a group by surrounding the part of the pattern with parenthesis.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Main {
public static void main(String[] args) {
String input = "The quick brown fox jumps over the lazy dog";
String pattern = "The (\\w+) brown (\\w+) jumps over the (\\w+) dog"; (1)
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(input);
if (m.find()) { (2)
System.out.println("Found a match");
System.out.println("Group 0: " + m.group(0)); (3)
System.out.println("Group 1: " + m.group(1));
System.out.println("Group 2: " + m.group(2));
System.out.println("Group 3: " + m.group(3));
} else {
System.out.println("No match found");
}
}
}
1 | There are three groups in the pattern. The first group matches the word after "The", the second group matches the word after "brown", and the third group matches the word after second "the". |
2 | There are several methods to match a pattern against an input string. The find method is the most common one. It returns true if the pattern is found in the input string, and false otherwise. There is also a matches() method, which returns true if the pattern matches the whole input string, and false otherwise.
The third method is lookingAt() , which returns true if the pattern matches the beginning of the input string, and false otherwise. |
3 | Group 0 is the whole match. The other groups are the parts of the input string that match the groups in the pattern. |
6.1. Named groups and comments to the regex
You can give the groups names, so the code becomes less brittle when the pattern changes. This is done by using the syntax (?<name>pattern)
.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Main {
public static void main(String[] args) {
String input = "The quick brown fox jumps over the lazy dog";
String pattern =
"The (?<first>\\w+)" // comment 1
+" brown (?<second>\\w+)" // comment 2
+" jumps over the (?<third>\\w+) dog"; // comment 3
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(input);
if (m.find()) {
System.out.println("Found a match");
System.out.println("Group 0: " + m.group(0));
System.out.println("Group first: " + m.group("first"));
System.out.println("Group second: " + m.group("second"));
System.out.println("Group third: " + m.group("third"));
} else {
System.out.println("No match found");
}
}
}
The comments are not part of the pattern, but are there to explain the pattern. They are ignored by the compiler. But they are very useful to explain the pattern to others, or to yourself when you come back to the code after a while.
Reading grades from text source
Grade Capture
Use capture groups. |
To be able to insert student grades into progress, the teachers want to be able to copy and paste grades from a spreadsheet (which is the typical way grades are assembled) into a text area. In this exercise, a simple file will do as the source of the grades, because working with a real clipboard is a project in itself.
-
The spreadsheet is replaced by a text file with one or more spaces or tabs ("white space") as column separator.
-
The spreadsheet has the typical format that the student number is in the first column and the
grade for that student is in the last column. The columns in between are arbitrary and may contain anything. -
You may ignore the spreadsheet part, because the paste action is done in a text area, which can be collected as a text, that can be divided into lines and columns, the column separator being white space.
-
The student number consist of 7 consecutive digits, the grade can consists of one digit,
two digits (a 10) or two digits separated by a dot or period (.) or a comma (,),
since both continental Europeans and English speaking teachers want to enter grades this way. -
The number of columns is not fixed and may change within one input, e.g. having separate column(s) for the optional Dutch 'tussenvoegsel'. A regex pattern to represent this is available in the
GradeFilter
class. -
When no grade is found in the input line, and the getResult method is invoked anyway, the result should contain key=
null
and value =null
.-
So both the
grade
andstudentId
methods should in that case returnnull
-
A set of test data is given and has the exact input as the table below.
# "quoted test input" | should match | found group value | found snummer | grade |
---|---|---|---|---|
3785895 Jan Jansen 8.7.1997 6.6 8.3 2.4 6,8 |
true |
6,8 |
3785895 |
6.8D |
3895785 Piet Jansen 8.7.1997 |
false |
3895785 |
0.0D |
|
3785985 Henk Jansen 8.12.1994 6.6 8.3 2.4 7.9 |
true |
7.9 |
3785985 |
7.9D |
3785915 Niki Jansen 8.12.1994 6.6 8.3 2.4 8 |
true |
8 |
3785915 |
8D |
1245717 Joepie Hombergh van den 18.03.1992 10.0 10.0 10.0 10 |
true |
10 |
1245717 |
10D |
Note that the first column of the csv file is the "quoted input", so should be seen as one input.
The rest of the columns are the output values to be used in the assertions. Empty cells represent null
values.
The result type of the getResult()
method in the GradeCapture
class may look a bit strange: AbstractMap.SimpleEntry<Integer,Double>
.
The type is what is stored in hash maps and such, and is a way to use a tupple without having to declare the class. It is a kludge, just a way to have a pair
of objects. It stems from the fact that the student-number-grade pairs will typically land in a Map of sorts anyway.
Java 14 has a preview feature called record
which will make this more way more elegant. There you would write record GradeResult(int snummer,double grade){}
as the full 'class' definition,
including constructor, getters, toString and hashCode and equals. It is likely to have that as a standard feature in Java 15 and onward, and will be quite useful in case
you want a method to return more than just one value.
Your Tasks
-
Complete the TODOs in the test classes. (Press control+6 in NetBeans IDE)
-
In the GradeFilter class complete all TODOs.
-
In the client class there is a last todo and that is to use the grade filter in the
getGradesAsMap()
method, by reading a file as lines, applying the gradefilter on each line and collect the student numbers and grades in a map with student number as key and grade as value.
There are a few naming problems in the exercise, because of incomplete (renaming) refactoring. Where you see the class name |
7. Reading
-
Regular expressions: Horstmann, Vol II Section 2.7
-
regex tester at regex101