Sunday, August 10, 2014

Leveraging Spring Batch to Run Your Workflows

In most enterprise applications that involve rapid data transactions and large data volumes, there is always a requirement for a batch processing framework to do the bulk data processing. Most often, this bulk processing job is an EOD, weekly or monthly job in which data is propagated from one source to another, probably tunneled through a data processing engine in the middle. Spring Batch is one of the standout frameworks today that is tailor-made for this kind of use case. It even contains many off-the-shelf components for reading from and writing to various data sources, e.g. files (plain, CSV, XML) and databases (DB2, SQL), which makes it really easy to develop a working solution very quickly. The result may even be a single bean configuration file that contains all the definitions, from data reading, tokenization and POJO mapping to data processing and writing to the end source. I will share some very good Spring Batch references for anyone who is new to Spring Batch.

However, in this blog post, I will explore Spring Batch a little further and see how it can really help an enterprise application as a workflow engine. Large enterprise applications rarely have only one use case for bulk data processing. As more applications get integrated, the core framework needs to support many different kinds of bulk processing, and in fact, until the requirement comes, you may not know what you really need.

If the aforementioned argument makes sense for your architecture as well, then Spring Batch is even better suited. In my experience with Spring Batch over the last few months, what I realized more and more is how beautifully its core architecture can be separated from its implementations. In most cases, you will be working with the implementations that fit your use cases. However, when you peel them off, what you find is a very extensible and very powerful framework that can run many different kinds of business workflows, with whatever capabilities you want it to have.

The key is to write a core component in your framework that embeds Spring Batch and leverages Inversion of Control to externalize the eventual business implementation details from the framework. I will provide a design-level idea for this in a bit.

Whatever core capabilities are required, you need to orchestrate them in your framework implementation.

I encountered the requirement to design a workflow engine using Spring Batch with the following capabilities.

  • Extensible to easily configure new workflows
This is what I meant when I spoke about externalizing the business implementation. We needed to support accelerated development to process different types of workflows as and when they come along.
  • Parallel Processing Capability
Once again, Spring Batch has this capability off the shelf with the concept of multi-threaded steps. [1]
All you have to do is specify how the work should be split across a set of workers. Using the step partitioning support in Spring Batch, you can define a master step which delegates the work to a set of slave steps that complete the task in a multi-threaded manner (see the partitioner sketch after the bean configuration below).


  • Configurable
With Spring Batch, this becomes very easy because everything can be configured using the bean configuration.
  • Recoverability
This is an important aspect of any workflow process. The data being processed is bound to contain faulty records, so it is essential that you have control over how the workflow processes the data. If something bad happens, you should be able to bail out of the workflow and manually re-invoke it later, after correcting whatever went wrong.

Spring Batch once again has an off-the-shelf feature for this called the Job Repository, which has its own schema and stores the state of each job it runs in a specified data store (e.g. a database). You can make use of this for your recoverability purposes, or if it feels overcomplicated to you (as it did to me), you can design your own solution; Spring leaves enough space in its architecture to plug yours in.
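
To make the recoverability point concrete, here is a minimal sketch of re-launching a failed job once the bad data has been corrected. It assumes a Spring context that defines a jobLauncher bean and the batchjob shown later in this post; the context file name, the job parameter name and its value are placeholders for illustration.

 import org.springframework.batch.core.Job;
 import org.springframework.batch.core.JobExecution;
 import org.springframework.batch.core.JobParameters;
 import org.springframework.batch.core.JobParametersBuilder;
 import org.springframework.batch.core.launch.JobLauncher;
 import org.springframework.context.ApplicationContext;
 import org.springframework.context.support.ClassPathXmlApplicationContext;

 public class WorkflowRestartExample {

  public static void main(String[] args) throws Exception {
   // Placeholder context file; it must define the job repository, jobLauncher and batchjob
   ApplicationContext context = new ClassPathXmlApplicationContext("batch-context.xml");
   JobLauncher jobLauncher = context.getBean("jobLauncher", JobLauncher.class);
   Job job = context.getBean("batchjob", Job.class);

   // Identifying parameters: re-running with the same parameters after a failure
   // resumes the failed execution recorded in the Job Repository.
   JobParameters parameters = new JobParametersBuilder()
     .addString("workflow.date", "2014-08-10") // placeholder parameter
     .toJobParameters();

   JobExecution execution = jobLauncher.run(job, parameters);
   System.out.println("Exit status: " + execution.getStatus());
  }
 }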

Now let me get back to the externalization piece, which helps you easily configure many workflows through a straightforward interface implementation. As good engineers, we make it a practice to separate the interface from its implementation; that is exactly what we can do here.

For a Spring Batch job, there are three key extension points to look at:
  1. ItemReader
  2. ItemProcessor
  3. ItemWriter
Externalizing Implementation from Core Architecture
The above diagram depicts a reference design which externalizes the implementation details out of the core classes and into the business layer using inversion of control. To make it more understandable, I will use a few code snippets to show how this can be achieved.

1. Sample Bean Configuration


 <!-- batch job -->
 <job id="batchjob" xmlns="http://www.springframework.org/schema/batch">
  <step id="masterStep">
   <partition step="slave" partitioner="lotPartitioner">
    <handler task-executor="taskExecutor" />
   </partition>
  </step>
 </job>

 <!-- slave step -->
 <step id="slave" xmlns="http://www.springframework.org/schema/batch">
  <tasklet>
   <!-- commit-interval is required by the chunk element; 10 is an illustrative value -->
   <chunk reader="customItemReader" writer="customItemWriter"
    processor="customItemProcessor" commit-interval="10" />
  </tasklet>
 </step>
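
The configuration above references a lotPartitioner bean, which decides how the overall work is split into "lots" that the slave steps can then process in parallel. Here is a minimal sketch of such a partitioner; how the work is actually split and the key names used are assumptions for illustration.

 import java.util.HashMap;
 import java.util.Map;

 import org.springframework.batch.core.partition.support.Partitioner;
 import org.springframework.batch.item.ExecutionContext;

 public class LotPartitioner implements Partitioner {

  @Override
  public Map<String, ExecutionContext> partition(int gridSize) {
   Map<String, ExecutionContext> partitions = new HashMap<String, ExecutionContext>();
   for (int i = 0; i < gridSize; i++) {
    ExecutionContext context = new ExecutionContext();
    // Each slave step reads its lot number from its step execution context
    context.putInt("lot.number", i);
    partitions.put("lot-" + i, context);
   }
   return partitions;
  }
 }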

We have set a customized reader, writer and processor for the slave step. Let's look at one of the three, customItemProcessor, to see how simply we can externalize the implementation using IoC.


2. Reference Implementation of Custom Item Processor


import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Autowired;

public class CustomItemProcessor implements ItemProcessor<BaseItem, BaseItem> {

 // The business-specific lifecycle implementation is injected by Spring (IoC)
 @Autowired
 private IWorkflowLifeCycleMethods lifecycleMethods;

 @Override
 public BaseItem process(BaseItem item) throws Exception {
  return lifecycleMethods.process(item);
 }

}

In this manner, we have abstracted the core classes and made sure that for each use case of the workflow engine, we only need to implement the set of methods defined in the IWorkflowLifeCycleMethods interface. Using dependency injection, we can then configure different implementations to meet the needs of different workflows.
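
As a minimal sketch of what this could look like (the single process method on the interface and the example implementation name are assumptions; in practice the interface would declare whatever lifecycle methods your workflows need):

 // Hypothetical lifecycle contract that each workflow implements
 public interface IWorkflowLifeCycleMethods {
  BaseItem process(BaseItem item) throws Exception;
 }

 // One workflow-specific implementation; wiring a different bean of this type
 // into CustomItemProcessor gives you a completely different workflow.
 class SettlementWorkflowLifeCycle implements IWorkflowLifeCycleMethods {

  @Override
  public BaseItem process(BaseItem item) throws Exception {
   // Business-specific validation/enrichment goes here
   return item;
  }
 }

Switching workflows then becomes a matter of swapping the bean definition that backs IWorkflowLifeCycleMethods in the configuration.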

For further reading, I encourage you to refer to the book "Spring Batch in Action", which has a wealth of information.

Cheers.!!

Thursday, May 1, 2014

Keeping track of my personal Technology Stack

Recently, at a project interview in my company, I was asked, "So, can you tell us about your technology stack?"

So here I will keep track of my technology stack. I will add new technologies as I come across them and include some useful references for future use.

For any techie, you cannot survive in a competitive environment without an up-to-date technology stack. Whenever the time comes, you need to pull out your trump cards in technology and tell your boss, "Hey, I have used this before, and it fits our purpose. So not to worry, I've got this covered."

Here I am keeping track of my tech-stack..!!



Programming Languages: Java, Python, JavaScript, C#, HTML (5), PHP
Concepts: OOP, Design Patterns, Distributed Systems, Concurrency, Cloud, Database, Time Complexity, State Machine Design, Batch Processing, Service Oriented Architecture
Java Application Frameworks: Spring MVC, Spring Batch
Middleware Frameworks: Servlet 3.0 Asynchronous, JBoss Netty, Google Guava (Utils)
Web Services: Apache CXF
ORM: Hibernate, MyBatis
Mobile: Android
Build Management: Maven
Testing: JUnit
Code Quality: ERA Analyzer, Sonar Analyzer, FindBugs
JavaScript: jQuery, Backbone.js, Angular.js
C#: .NET MVC, LINQ
PHP: Symfony MVC, Doctrine ORM
Cross Platform Dev.: TideSDK, Titanium API
IDEs: IntelliJ IDEA, Eclipse, Visual Studio
Source Control: SVN, Git
Database Technologies: MySQL, DB2, MongoDB
Operating Systems: Windows, Linux

Sunday, April 20, 2014

Learning MongoDB - Event Triggering with Tailable Cursors using Java Driver


I am currently working on a project in which we are designing an interactive task management tool. While working on it, I realized that we need a "push" notification mechanism from the database layer. Why? Because it provides a very nice way for the middleware to send the latest updates relevant to a particular item (a task in my context) to the client side, without the client having to keep polling for new data from time to time.


Luckily for me, we were NoSQL db fans and were using MongoDB.


Well, Google came to the rescue. I started searching for event triggers in MongoDB. Apparently Mongo does not have a straightforward event triggering mechanism like what can be achieved in MySQL [1].


However, there is a pretty nice workaround for this in MongoDB called tailable cursors. It provides a mechanism to set up a query that retrieves data dynamically as and when it is inserted into the database. Pretty cool, right?


However, here is the tricky part.

It works only with capped collections in MongoDB. What are capped collections in MongoDB?


Capped collections are fixed-size collections which provide high-speed writes, with the trade-off of being a bit slower at reading data since reads are sequential. They are circular collections, i.e. they behave like a FIFO queue when the size limit is reached. As rightly said in the documentation, these are ideal collections for logging purposes and can also be used for versioning purposes. [2]

Important: when creating a capped collection, we need to specify the size. Make sure the collection is not oversized, but can still hold values long enough for your application's requirements without them being evicted.


Now let's look at a high-level design where we can use tailable cursors in a manner similar to a pub-sub system.

High Level Design : Tailable Cursor Based Pub-Sub System

According to the above design, listening channels point tailable cursors at the Log_Collection to retrieve data of their interest whenever it becomes available.


Right, now let's see how we can write a bit of Java code to establish a tailable cursor.


1. Creating a capped collection in Java.



 DBCollection collection;
 if (db.collectionExists("item_collection")) {
     collection = db.getCollection("item_collection");
 } else {
     // Create a capped collection of roughly 100 MB
     DBObject options = BasicDBObjectBuilder.start()
             .add("capped", true)
             .add("size", 100000000L)
             .get();
     collection = db.createCollection("item_collection", options);
 }
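
The snippet above assumes an existing db handle from the MongoDB Java driver. A minimal sketch of obtaining one (the host, port and database name are placeholder values):

 import com.mongodb.DB;
 import com.mongodb.MongoClient;

 // Placeholder connection details for illustration
 MongoClient mongoClient = new MongoClient("localhost", 27017);
 DB db = mongoClient.getDB("task_db");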

2. Writing the search query



     BasicDBObjectBuilder builder = BasicDBObjectBuilder.start();
     // Only documents newer than the last revision we have already seen (the delta)
     builder.add("rev_no", new BasicDBObject("$gt", lastRevNo));
     builder.add("type", typeA);

I am basically looking at documents stored in the collection with the given typeA (assume typeA to be a predefined categorical value). I am using rev_no, a global auto-incrementing value, which helps me keep track of the last retrieved document and fetch only the delta. [3]


     DBObject searchQuery = builder.get();  
     DBObject sortBy = BasicDBObjectBuilder.start("$natural", 1).get();  

Here we are requesting the results to be retrieved in the same order they were stored in the collection (the $natural sort order).


3. Preparing a tailable cursor for the custom query.


     DBCursor cursor = collection
         .find(searchQuery)
         .sort(sortBy)
         .addOption(Bytes.QUERYOPTION_TAILABLE)   // keep the cursor open after the last result
         .addOption(Bytes.QUERYOPTION_AWAITDATA); // block for a while waiting for new data instead of returning immediately

4. Retrieving and notifying subscribers.



 while (cursor.hasNext()) {
     DBObject doc = cursor.next();
     // Convert the cursor result into a value object (application-specific mapping);
     // the Java Observer design pattern can come in useful here.
     TaskItem result = mapToValueObject(doc); // hypothetical helper and type
     // publishResults will call notifyObservers() to notify all the subscribers
     publishResults(result);
 }


Key points:

1. Cost of cursor establishment 


Remember that tailable cursors are expensive to establish. But once established, a cursor can be used pretty smoothly to get the latest data through a particular channel.


However, we need to be smart and not break this harmony by returning from the cursor and reconnecting regularly. cursor.hasNext() will block until a result becomes available. That's why I have proposed the Observer pattern in Java to notify the subscribers of a particular listening channel whenever data is available (see the sketch below).
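
A minimal sketch of that idea using java.util.Observable; the class names and the shape of publishResults are assumptions for illustration.

 import java.util.Observable;
 import java.util.Observer;

 // A listening channel that tails the capped collection and pushes
 // each new result to its subscribers.
 public class ListeningChannel extends Observable {

  // Called from the cursor loop shown above
  public void publishResults(Object result) {
   setChanged();
   notifyObservers(result); // every registered Observer receives the new value
  }
 }

 // A subscriber that reacts to each published result,
 // e.g. by pushing it to a connected client.
 class TaskUpdateSubscriber implements Observer {

  @Override
  public void update(Observable channel, Object result) {
   System.out.println("New update received: " + result);
  }
 }

A subscriber is registered once with channel.addObserver(new TaskUpdateSubscriber()) and is then notified for every result published from the cursor loop.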


2. How does the tailable cursor mechanism work in MongoDB?


This is something I still need to gather more information on. For now, my interpretation is that it relies on an underlying polling mechanism beneath our implementation. I am looking forward to learning more about it.


All in all, for applications where we need to support async requests and long polling, I think MongoDB tailable cursors do come in handy.


How would tailable cursors perform in production?
Next up, I will try to update this post with the results of load testing in my application and how the chatter with capped collections has performed.

Cheers..!!

Chinthaka

[1] https://dev.mysql.com/doc/refman/5.5/en/triggers.html
[2] http://docs.mongodb.org/manual/tutorial/use-capped-collections-for-fast-writes-and-reads/
[3] http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
