Stabilizing a large interactive Java application

Let us imagine you have a large interactive Java application with a stability problem. The longer you use it, the more the behavior seems to degrade. Certain features stop working and eventually it crashes. You look at the log and do not find anything useful. Or worse, the log is filled with cryptic messages and warnings, errors, and stack traces, none of which seem related to the problem you saw.

What you have is almost certainly an application with bad error handling. Declared exceptions are caught and ignored or merely logged. Generic exceptions are handled at the wrong level. Unchecked exceptions from programmer bugs are hidden until their side-effects become fatal.

Good error handling takes some time and design to do properly. When a larger application grows by accretion of smaller experiments, developers sometimes postpone these global decisions and stub out their intentions. Some of the code might have originally been written as a prototype, or in haste for an emergency deadline. There are many ways bad error handling accumulates, but it must eventually be fixed, or the instability becomes unsupportable.

In short, do not catch a checked exception that you cannot repair on the spot to the satisfaction of all clients. Declare all other checked exceptions: leave them uncaught, or rethrow if necessary. Do not catch unchecked exceptions that derive from RuntimeException: they result from an avoidable programmer mistake; avoid those bugs at the source. Do not declare or throw a generic Exception or RuntimeException, but use a specific derived type.

Note that these rules are not my invention, but rather standard Java practice: http://docs.oracle.com/javase/tutorial/essential/exceptions/catchOrDeclare.html (More here: [ Catching_Exceptions_in_Java.html ].)

§ Where to begin?

How do you begin to clean up an already unstable application? You might have millions of lines of code and thousands of classes to audit. Here is one way to start.

1. A domain expert picks a few critical parts of the application to exercise. For example, load some data, manipulate it, view it, and save it. Keep it simple, but try to use important subsystems, not just shallow features.
2. Run the application with a coverage tool like JaCoCo. The coverage should at least record which classes were loaded and used by the application.
3. While exercising this workflow, take screenshots or capture a video, and save them. At a minimum, take good notes of what was done, in order.
4. Save the log, stdout, and stderr to files.
5. Take all of these records and check them into your repository.

If you get too much information this first time through, then simplify the workflow. Do not worry about getting too little. I am sure you will want to do it again.

§ Refactoring

Now the developers have a place to start, and a way to prioritize the cleanup of error handling. Much of this refactoring could be performed mechanically by anyone with an IDE.

First of all, check the logs for any visible errors and identify the classes responsible. Next, examine all classes recorded by the coverage tool. I recommend starting with the most fundamental services, from the bottom up. The deeper a bug is hidden, the harder it is to diagnose externally.

I recommend several passes through the code. Each time you can handle a different variety of bad exception handling.

For the first pass, I would make sure that no unchecked exceptions derived from RuntimeException are being caught and hidden. These are avoidable programmer errors and must be allowed to crash the application so that they will be fixed at the source. (Hiding unchecked exceptions only postpones a crash and hides the original cause.)

At a minimum, for a first pass only, you could simply rethrow any RuntimeExceptions caught by a generic catch:

    catch (Exception e) { // TODO: Replace by a specific exception.
        if (e instanceof RuntimeException) { // Temporary measure.
            throw (RuntimeException) e; // Never hide these.
        }
        //... Previous exception handling for all other exceptions ...
    }

For your next pass, remove all generic catch (Exception e) blocks. Your compiler or IDE will let you know which checked exceptions need to be caught or declared. If the original catch properly handled some specific exceptions, then retain the handler for those specific exceptions. If the compiler still complains, then you must make a design decision. Add new exception handling for specific checked exceptions if you are sure how to fix them, and declare the rest to be rethrown to clients. When in doubt, throw them to clients. Remember you are trying to expose errors that were previously hidden.
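As a rough sketch of this second pass, consider a hypothetical plugin loader. The class PluginLoader, its methods, and the assumption that a missing plugin class is a legitimate, fully repairable condition are all invented for illustration. The "before" method hides every failure behind a generic catch. The "after" method keeps a handler only for the one exception it knows how to repair and declares the rest.

```java
import java.util.Optional;

/** Hypothetical example; not taken from any real application. */
final class PluginLoader {

    /** Before: the generic catch hides every failure, including programmer bugs. */
    static Object createHidingErrors(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            return null; // The client cannot tell "absent" from "broken".
        }
    }

    /**
     * After: only the exception this method can repair is caught.
     * Assumption for this sketch: an absent plugin class simply means
     * "no plugin installed", so an empty result is a complete repair.
     * All other reflective failures are declared for clients to handle.
     */
    static Optional<Object> create(String className) throws ReflectiveOperationException {
        Class<?> type;
        try {
            type = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return Optional.empty(); // Specific, understood, and handled here.
        }
        Object instance = type.getDeclaredConstructor().newInstance();
        return Optional.of(instance);
    }
}
```

Once the generic catch is gone, the compiler lists exactly which checked exceptions remain, and each one becomes a visible decision in the method signature instead of a silent null.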

Do not simply wrap a specific exception in a generic RuntimeException to avoid declaring it. That will only encourage clients to catch all exceptions.

It might be that your client dependencies do not allow you to change the signature of a method today, but might in a month or two. In the meantime, throw a specific runtime exception like the following, with the intention of removing it in a third pass.

    /** @deprecated Replace by declaring the checked exception */
    public class UndeclaredException extends RuntimeException {
        public UndeclaredException(Exception cause) {
            super(cause);
        }
    }

This temporary measure will convert a hidden design flaw into a runtime programmer bug. If you encounter this exception during a test, you will have a much better idea how you want that situation handled.

Here is an example of a convenience method UndeclaredException.handle(Exception e) that you can insert into all your generic exception handlers: [ ../code/UndeclaredException.java ] .
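The linked file is not reproduced here. The following is only a guess at what such a helper might look like, assuming it rethrows runtime exceptions unchanged and wraps checked exceptions in UndeclaredException. The body of handle is an assumption, not the linked code, and the class is left package-private only so the sketch compiles from any file name.

```java
/** @deprecated Replace by declaring the checked exception. */
class UndeclaredException extends RuntimeException {

    UndeclaredException(Exception cause) {
        super(cause);
    }

    /**
     * Assumed behavior for this sketch: never returns normally.
     * Runtime exceptions pass through untouched; checked exceptions
     * are wrapped so that they can no longer be silently ignored.
     */
    static void handle(Exception e) {
        if (e instanceof RuntimeException) {
            throw (RuntimeException) e; // Programmer bug: never hide these.
        }
        throw new UndeclaredException(e); // Temporary: declare this exception later.
    }
}
```

A generic handler would then shrink to `catch (Exception e) { UndeclaredException.handle(e); }` until the method signature can be changed.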

Before you are done, you must be sure that all catches actually fix a problem. Simply logging a warning or returning a default value does not count.

For example, if you catch an IOException from an attempt to read a file, you could simply return an empty result and give the impression that the file is empty. This might be a reasonable behavior for a specific use, but less useful for general use. If you pass the exception further up, then a user can be notified that the file could not be read. A general purpose library should not attempt to correct too many problems. When in doubt, pass the checked exception along.
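The file-reading case might be sketched as follows. NotesFile and NotesView are hypothetical names, and returning a message string merely stands in for whatever dialog or status indicator a real user interface would show.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical general-purpose library code. */
final class NotesFile {
    /** Declares the failure instead of pretending the file is empty. */
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}

/** Hypothetical client close to the user. */
final class NotesView {
    /** Returns the text to display; a real UI would show a dialog on failure. */
    static String describe(Path file) {
        try {
            return NotesFile.read(file);
        } catch (IOException e) {
            // The user learns that the file could not be read,
            // rather than seeing an apparently empty document.
            return "Could not read " + file + ": " + e;
        }
    }
}
```

The library stays ignorant of how errors are presented, and the client that knows about the user decides what to do.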

If you are not entirely sure you caught and handled an exception correctly, then record your decision, and check later with others. If you cannot handle it at all, you have no choice: the method must declare the exception for clients to handle.

Wrap a declared exception in a different exception only to clarify the significance.
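One possible illustration of such wrapping, with hypothetical names throughout: a low-level IOException is rethrown as a checked ProjectLoadException that says what the failure means to the client, while keeping the original as its cause.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical checked exception that names the failure in the client's terms. */
class ProjectLoadException extends Exception {
    ProjectLoadException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical service. */
final class ProjectStore {
    static byte[] open(Path file) throws ProjectLoadException {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            // Still checked and still declared: only the significance changes.
            // The original exception is kept as the cause for debugging.
            throw new ProjectLoadException("Project file " + file + " could not be opened", e);
        }
    }
}
```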

Scan the code and remove any remaining instances of throw new RuntimeException. Runtime exceptions are limited to those problems that are avoidable and indicate a programmer error. They are tantamount to an assertion failure. You cannot expect clients to catch and inspect each passing RuntimeException.

Specific runtime exceptions are perfectly legitimate if they indicate a programmer bug. Throw IllegalArgumentException if an argument has illegal values. If the situation results from calling methods in the wrong order, or insufficient initialization, then throw an IllegalStateException. If you think a certain state should be impossible and indicates a logic bug, then assert or explicitly throw an AssertionError. (Make sure you have assertions turned on during testing.)
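These three cases can be shown together in one small class. Thermostat and its methods are hypothetical, chosen only to give each exception a natural home.

```java
/** Hypothetical example class. */
final class Thermostat {
    enum Mode { HEAT, COOL }

    private Mode mode; // Remains null until configure() is called.

    void configure(Mode mode) {
        if (mode == null) {
            // The caller passed an illegal value.
            throw new IllegalArgumentException("mode must not be null");
        }
        this.mode = mode;
    }

    int targetCelsius() {
        if (mode == null) {
            // The caller used the methods in the wrong order.
            throw new IllegalStateException("configure() must be called first");
        }
        switch (mode) {
            case HEAT:
                return 21;
            case COOL:
                return 25;
            default:
                // Should be impossible; reaching it indicates a logic bug.
                throw new AssertionError("Unknown mode: " + mode);
        }
    }
}
```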

Some exceptions may have been caught only to avoid killing a daemon thread or event-dispatching thread. Those catches should save the exception so that it can be handled or rethrown by a main thread.
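One way to save a worker's exception for the main thread, sketched here with the standard java.util.concurrent classes: a Future stores whatever the task throws, and Future.get() rethrows it on the calling thread wrapped in an ExecutionException. The class BackgroundWork and its method are hypothetical; an event-dispatching thread would need its own variant of the same idea.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper. */
final class BackgroundWork {
    /**
     * Runs the task on a worker thread and waits for it. A failure in the
     * task does not vanish with the worker: it is stored in the Future and
     * resurfaces here as the cause of an ExecutionException.
     */
    static int runAndReport(Callable<Integer> task)
            throws ExecutionException, InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> result = worker.submit(task);
            return result.get();
        } finally {
            worker.shutdown();
        }
    }
}
```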

Yes, this refactoring process may be painful as client after client is forced to deal with newly declared exceptions. You are dealing with a part of the design that someone else only postponed. The dilemmas were always there. You are just the first person to decide what to do about them.

You should also expect your application to look more unstable in the short term as hidden bugs are uncovered. But you cannot fix a bug until you see it. Debugging is much easier if you see the original failure, and not only delayed side-effects. (Remember how hard memory bugs were to track down in C?)

You may find that some of the exceptions were being logged for debugging, or merely printed by Throwable#printStackTrace(). Take a look at the log. Is the level of detail consistent? What would make the log more useful? Try to clean it up, at least for this one workflow. Here are some guidelines I use for logging: [ Routine_Java_Logging.html ]

§ Error dialogs

Remember that an error is not handled until it is resolved to the satisfaction of the user. Typically, a user should be notified when a series of operations is invalid, when a service fails, data fails to load, an installation is poorly configured, etc. Simply logging such errors accomplishes nothing. You would not expect to look at a console for an error if your web browser could not load all of a particular webpage. You expect either an error dialog, or a visual indication of a problem.

There are several approaches to reporting an error from a specific exception. The most disciplined approach would pass checked exceptions for these conditions up the stack until they reach the controller for the user interface. The user interface could take all responsibility for consistent reporting.

The most casual approach would allow any library or module to pop up a custom error dialog. No consistency could be expected.

Alternatively, errors could be reported from any code to a central error reporting service. The difficulty is that code higher up the stack may need to know about these errors as well. Instead they receive only default or empty results.

If poor error handling is being corrected late in a project, then I recommend a central error reporting service, plus ongoing refactoring to declare and pass along appropriate checked exceptions.
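A central error reporting service could be as small as the following sketch. ErrorReports, addListener, and report are hypothetical names, and this listener-based design is only one assumption about how such a service might be built.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical central service; the user interface registers a listener at startup. */
final class ErrorReports {
    private static final List<Consumer<Throwable>> LISTENERS = new CopyOnWriteArrayList<>();

    private ErrorReports() {}

    /** Typically called once by the user interface, which owns consistent reporting. */
    static void addListener(Consumer<Throwable> listener) {
        LISTENERS.add(listener);
    }

    /** Called from any code that cannot yet declare the failure to its clients. */
    static void report(Throwable error) {
        for (Consumer<Throwable> listener : LISTENERS) {
            listener.accept(error);
        }
    }
}
```

Code that still swallows a checked exception would call `ErrorReports.report(e)` in its catch block as a stopgap. As each method signature is refactored to declare the exception, that call can be removed and the exception passed along instead.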

More suggestions here: [ Logging_for_error_dialogs.html ]

Bill Harlan, March-June 2009
