Building internal IT systems is as much about the data that passes through your application as it is about the logic that operates on it. I’ve been scanning web server log files for years, usually with a combination of grep, sed, and other command-line tools to dig into specific problems. These are mostly one-off curiosities that don’t need a GUI.

Since before I took over management of this team in July, our app has had some long-standing cases where ERROR-level items show up in our logs. Some of them I question whether they deserve to be logged as errors at all; others are genuinely disconcerting. The application works well overall, but in the spirit of No Broken Windows I’d like to clean up some of the sources of these errors.

Here’s a little snippet of the JBoss log file, showing some standard log output from Hibernate:

Jan 18 00:00:59 devserver local6:[00:00:59,182,SettingsFactory] INFO  Query cache factory: org.hibernate.cache.StandardQueryCacheFactory
Jan 18 00:00:59 devserver local6:[00:00:59,213,SettingsFactory] INFO  Statistics: disabled
Jan 18 00:00:59 devserver local6:[00:00:59,217,SettingsFactory] INFO  Deleted entity synthetic identifier rollback: disabled
Jan 18 00:00:59 devserver local6:[00:00:59,221,SettingsFactory] INFO  Default entity-mode: pojo
Jan 18 00:00:59 devserver local6:[00:00:59,331,SessionFactoryImpl] INFO  building session factory
Jan 18 00:01:00 devserver local6:[00:01:00,509,CacheFactory] WARN  read-only cache configured for mutable class: Address
Jan 18 00:01:00 devserver local6:[00:01:00,513,EhCacheProvider] WARN  Could not find configuration [Address]; using defaults.
Jan 18 00:01:02 devserver local6:[00:01:02,315,SessionFactoryObjectFactory] INFO  Not binding factory to JNDI, no JNDI name configured
Jan 18 00:01:02 devserver local6:[00:01:02,323,UpdateTimestampsCache] INFO  starting update timestamps cache at region: org.hibernate.cache.UpdateTimestampsCache

After a minute or two I formulated this little command-line string to give me a count of the classes that had the most errors:

grep ERROR jboss.log | sed -e "s/^.*,\(.*\)] .*$/\1/" | sort | uniq -c

This cuts through the log file, locates only errors, slices out the classname from the log message, sorts the remaining list, and then counts up the number of times each classname appears in the sorted list.
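
In case the sed expression reads like line noise, here’s roughly how the substitution breaks down, piece by piece (the comments are just for reading; the one-liner above is what I actually run):

sed -e "s/^.*,\(.*\)] .*$/\1/"
#          ^.*,     match everything up to the last comma that still lets the rest of
#                   the pattern match; on these lines, that's the comma just before the classname
#          \(.*\)]  capture the classname sitting between that comma and the closing bracket
#          .*$      swallow the remainder of the line
#          \1       replace the entire line with just the captured classname

And here’s what it spat out against the day’s log: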

 4 LegacySkimmer
 4 PersistentRef
21 ProgramStateExclusionDaoImpl
 8 QualifyLeadsDelegate
28 RDBQualifyLeadsDAO
 2 SSOLogger
 1 ShowNextLeadAction
 3 [action]

Interpreting these results can be a bit tricky, and I take them pretty informally. The counts don’t represent the number of bugs in each module, or even identify the classes with the most defects. For me, this is simply a list showing, among the classes that do log errors, which ones hit their error conditions most often. In our setup, that usually means data coming in from another system that doesn’t meet some assumption in our code. Other causes include intermittent connection problems with other systems that just happen to get noticed in one part of the code.
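
Out of habit I leave the list in classname order, but if I wanted the noisiest classes on top I could tack a numeric sort onto the end of the pipeline; something like this should do it:

grep ERROR jboss.log | sed -e "s/^.*,\(.*\)] .*$/\1/" | sort | uniq -c | sort -rn

The final sort -rn sorts the counted lines numerically on the leading count, largest first.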

I’m not certain exactly what will come out of this little discovery exercise, but I’m betting our code will end up a bit cleaner and a bit more robust for it.