Monday, April 5, 2010

MySQL Prepared Statements to Generate Crosstab SQL

MySQL Reporting requirements sometimes require both unknown column and row values, necessitating a more powerful means of generating crosstabs. Today's article presents Prepared Statements, which dynamically generate the SQL and assign it to a variable so that we can tailor the output based on the number of data values.
During the past several weeks, we've been learning how to create crosstabs in MySQL. We've now covered the basics of fetching the row and column data, as well as how to overcome some of the challenges inherent in outputting data as columns. As we saw, organizing data into columns can be an arduous task due to the SQL language's natural tendency to append data as rows. We can transpose row data to columns easily enough, but the number of possible data values on the horizontal axis needs to be known beforehand.
Unfortunately, there will be times when your reporting requirements call for both unknown column and row values, or change often enough to invalidate previous code. In such instances, you need a more powerful means of generating crosstabs. Today's article presents just such a mechanism: Prepared Statements. By dynamically generating the SQL and assigning it to a variable, we can tailor the output based on the number of data values, thus freeing us from having to anticipate changes.


How Static Is the Data Really?

Way back in the Tips for Simplifying Crosstab Query Statements article, we were introduced to a crosstab report that displayed the number of cases by Region broken down by month, and later, by year as well. No sooner was the article published than someone asked the question "What happens when a new Region is introduced?" The answer is simple: an extra column must be added to the SELECT field list. This is straightforward enough to do and can be expected to occur very rarely, as Regions are what you would call a static data set. Other geographic entities, including continents, countries, counties, provinces, states, and cities, can also be considered static. Having said that, even fixed data sets such as time frames can vary enormously. Now I'm not referring to the elasticity of space-time, as discovered by Einstein, but rather to how the start and end points of a SELECT query depend on reporting needs. With regard to our own crosstab query, going from a single year to multiple ones necessitated many changes to the SQL.
At the other end of the spectrum is variable data, which can change drastically from one report to the other. Imagine reporting on Starbucks coffee houses in the nineties boom period! Since you could expect the number of shops to increase on an almost daily basis, you’d definitely need a more flexible approach!

Steps in Converting the Query into a Prepared Statement

Going from an SQL statement to a Prepared Statement will be done in two steps:
  • First, we'll rewrite the query to generate the Prepared Statement whose output will vary according to the number of columns.
  • Second, we'll insert the SQL generating Prepared Statement into a stored proc, so that we can create the Prepared Statement and execute it in one fell swoop.

Dynamically Generating the SQL Statement

In order to dynamically generate an SQL string, we’ll be using the CONCAT() and GROUP_CONCAT() string functions.
The CONCAT() function accepts a variable number of string parameters and returns another string composed of all the input parameters joined together in the order they were passed in. The following code would concatenate a name in last name (comma) first name format:
SELECT CONCAT(last_name, ", ", first_name) AS NAME
FROM   CLIENTS;
Produces:
NAME
_____________
Jones, Paul
McDonald, Jim
Miller, Bruce
Portman, Bess
The GROUP_CONCAT() function returns a string result with the concatenated non-NULL values from a group. Here, it's used to aggregate all the rows from the TA_CASES table and return the collection of SELECT list expressions that makes up the horizontal axis of the crosstab. The following query returns a string value that replaces the SQL statement of our previous crosstab query:
SELECT concat(
    "SELECT CASE WHEN Month_Num IS NULL", "\n", 
    "            THEN 'TOTAL'", "\n", 
    "            ELSE Month", "\n", 
    "       END        AS 'Month',", "\n",
    group_concat( DISTINCT concat("       REGION_", REGION_CODE, 
                                  "  AS 'REGION ", REGION_CODE, "',", "\n"
                           )
                  order by REGION_CODE
                  separator '' 
                ),
    "       TOTAL", "\n",
    "FROM  (     SELECT   MONTH(CREATION_DATE)\t\t\t\t\t\t\t\tAS Month_Num,", "\n",
    "\t\tMONTHNAME(CREATION_DATE)\t\t\t\t\t\t\t\t\tAS 'Month',", "\n",  
    group_concat( 
        DISTINCT concat("\t\t\t\tCOUNT(CASE WHEN REGION_CODE ='", REGION_CODE, 
                        "' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_", 
                        REGION_CODE, "',", "\n"
                 )
        order by REGION_CODE
        separator '' 
    ),
    "            COUNT(*)\t\t\t\t\t\t\t\t\t\t\t\t\tAS 'TOTAL'", "\n",
    "            FROM  TA_CASES", "\n",
    "            WHERE YEAR(CREATION_DATE)=", YEAR(CREATION_DATE), "\n",
    "            GROUP BY Month_Num WITH ROLLUP) AS CA;"
) statement
FROM TA_CASES
WHERE YEAR(CREATION_DATE)=1998;
Here is the resulting SQL code as created by our dynamic SQL generator:
SELECT CASE WHEN Month_Num IS NULL
            THEN 'TOTAL'
            ELSE Month
       END        AS 'Month',
       REGION_01  AS 'REGION 01',
       REGION_02  AS 'REGION 02',
       REGION_03  AS 'REGION 03',
       REGION_04  AS 'REGION 04',
       REGION_05  AS 'REGION 05',
       TOTAL
FROM  (SELECT MONTH(CREATION_DATE)     AS Month_Num,
              MONTHNAME(CREATION_DATE) AS 'Month',
              COUNT(CASE WHEN REGION_CODE ='01' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_01',
              COUNT(CASE WHEN REGION_CODE ='02' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_02',
              COUNT(CASE WHEN REGION_CODE ='03' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_03',
              COUNT(CASE WHEN REGION_CODE ='04' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_04',
              COUNT(CASE WHEN REGION_CODE ='05' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_05',
              COUNT(*)                 AS 'TOTAL'
       FROM  TA_CASES
       WHERE YEAR(CREATION_DATE)=1998
       GROUP BY Month_Num WITH ROLLUP) AS CA;

Moving the Prepared Statement into a Stored Procedure

Placing our code in a stored proc will make running the query a lot easier as it can generate the statement and then execute it to retrieve the results. Here is the code for the stored proc:
CREATE PROCEDURE `p_case_counts_per_region_by_month`() 
LANGUAGE SQL 
NOT DETERMINISTIC 
CONTAINS SQL 
SQL SECURITY DEFINER 
BEGIN  
  SELECT concat(
    "SELECT CASE WHEN Month_Num IS NULL", "\n", 
    "            THEN 'TOTAL'", "\n", 
    "            ELSE Month", "\n", 
    "       END        AS 'Month',", "\n",
    group_concat( DISTINCT concat("       REGION_", REGION_CODE, 
                                  "  AS 'REGION ", REGION_CODE, "',", "\n"
                           )
                  order by REGION_CODE
                  separator '' 
                ),
    "       TOTAL", "\n",
    "FROM  (     SELECT   MONTH(CREATION_DATE)\t\t\t\t\t\t\t\tAS Month_Num,", "\n",
    "\t\tMONTHNAME(CREATION_DATE)\t\t\t\t\t\t\t\t\tAS 'Month',", "\n",  
    group_concat( 
        DISTINCT concat("\t\t\t\tCOUNT(CASE WHEN REGION_CODE ='", REGION_CODE, 
                        "' THEN FEE_NUMBER ELSE NULL END) AS 'REGION_", 
                        REGION_CODE, "',", "\n"
                 )
        order by REGION_CODE
        separator '' 
    ),
    "            COUNT(*)\t\t\t\t\t\t\t\t\t\t\t\t\tAS 'TOTAL'", "\n",
    "            FROM  TA_CASES", "\n",
    "            WHERE YEAR(CREATION_DATE)=", YEAR(CREATION_DATE), "\n",
    "            GROUP BY Month_Num WITH ROLLUP) AS CA;"
  ) statement
  into @case_counts_per_region_by_month
  FROM TA_CASES
  WHERE YEAR(CREATION_DATE)=1998;

  prepare case_counts_per_region_by_month   
  from @case_counts_per_region_by_month;    
  execute case_counts_per_region_by_month;   
  deallocate prepare case_counts_per_region_by_month; 
END
Inside the procedure, we generate the SQL for the query as we did above, but within a proc we can save it to a variable using the SELECT INTO syntax. A Prepared Statement is then utilized to execute the generated code.
A SELECT INTO can only be used where the SQL returns exactly one row; yet another reason that generating the SQL statement as a string works so well!
A Prepared Statement is a combination of three separate SQL statements:
  • PREPARE prepares a statement for execution.
  • EXECUTE executes a prepared statement.
  • DEALLOCATE PREPARE releases a prepared statement.
Once the proc has been created, all we need to do is call it by entering the following command line:
mysql> call p_case_counts_per_region_by_month;
Here is the record set that is returned by our proc:
Month      REGION 01  REGION 02  REGION 03  REGION 04  REGION 05  TOTAL
April             13         33         76          2         47    171
May               17         55        209          1        143    425
June               8         63        221          1        127    420
July              13        104        240          6        123    486
August            18        121        274          9        111    533
September         25        160        239          2         88    514
October            9         88        295          2        127    521
November           2         86        292          2        120    502
December           1        128        232          6        155    522
TOTAL            106        838       2078         31       1041   4094
In the last installment of the Crosstab series, we'll make the proc more generic by moving the columns and table(s) names that we're reporting on to input parameters.
Source: http://www.databasejournal.com

Tuesday, March 30, 2010

Apache Lucene 2.9.2 and 3.0.1 Released

Here’s the announcement:
Hello Lucene users,
On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2:
Both releases fix bugs in the previous versions:
- 2.9.2 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4
- 3.0.1 has the same bug fix level but is for the Lucene Java 3.x series, based on Java 5.
New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe API.
Important improvements in these releases include:
- An increased maximum number of unique terms in each index segment.
- Fixed experimental CustomScoreQuery to respect per-segment search. This introduced an API change!
- Important fixes to IndexWriter: a commit() thread-safety issue, lost document deletes in near real-time indexing.
- Bugfixes for Contrib’s Analyzers package.
- Restoration of some public methods that were lost during deprecation removal.
- The new Attribute-based TokenStream API now works correctly with different class loaders.
Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0; and to 3.0.1 if you are using 3.0.0.

Lucene and Solr Development Have Merged

The Lucene community has recently decided to merge the development of two of its sub-projects – Lucene->Java and Lucene->Solr. Both code bases now sit under the same trunk in svn and Solr actually runs straight off the latest Lucene code at all times. This is just a merge of development though. Release artifacts will remain separate: Lucene will remain a core search engine Java library and Solr will remain a search server built on top of Lucene. From a user perspective, things will be much the same as they were – just better.
So what is with the merge?

Because of the way things worked in the past, even with many overlapping committers, many features that could benefit Lucene have been placed in Solr. They arguably “belonged” in Lucene, but due to dev issues, it benefited Solr to keep certain features that were contributed by Solr devs under Solr’s control. Moving some of this code to Lucene would mean that some Solr committers would no longer have access to it – A Solr committer that wrote and committed the code might actually lose the ability to maintain it without the assistance of a Lucene committer – and if Solr wanted to be sure to run off a stable, released version of Lucene, Solr’s release could be tied to Lucene’s latest release when some of this code needed to be updated. With Solr planning to update Lucene libs less frequently (due to the complexities of releasing with a development version of Lucene), there would be long waits for bug fixes to be available in Solr trunk.
All in all, there would be both pluses and minuses to refactoring Solr code into Lucene without the merge, but the majority felt the minuses outweighed the pluses. Attempts at this type of thing in the past have failed and resulted in similar code diverging in both code bases. With many committers overlapping both projects, this was a very odd situation: fix a bug in one place, and then go and look for the same bug in similar, but different, code in another place – perhaps only being able to commit the fix in one of the two spots.

With merged dev, there is now a single set of committers across both projects. Everyone in both communities can now drive releases – so when Solr releases, Lucene will also release – easing concerns about releasing Solr on a development version of Lucene. So now, Solr will always be on the latest trunk version of Lucene and code can be easily shared between projects – Lucene will likely benefit from Analyzers and QueryParsers that were only available to Solr users in the past. Lucene will also benefit from greater test coverage, as now you can make a single change in Lucene and run tests for both projects – getting immediate feedback on the change by testing an application that extensively uses the Lucene libraries. Both projects will also gain from a wider development community, as this change will foster more cross pollination between Lucene and Solr devs (now just Lucene/Solr devs).

All in all, I think this merge is going to be a big boon for both projects. A tremendous amount of work has already been done to get Solr working with the latest Lucene APIs and to allow for a seamless development experience with Lucene/Solr as a single code base (the Lucene/Solr tests are ridiculously faster than they were as well!). Look for some really fantastic releases from Lucene/Solr in the future.

Google App Engine: What Is It Good For?

As a developer, I'm enthusiastic about cloud computing platforms because they let me spend more time writing web applications and services and less time dealing with scalability and deployment issues. In particular, Google App Engine offers automatic scaling and potential cost savings if you design the applications to run on it with the proper discipline.
In this article, I provide an overview of the Google App Engine platform for developers. Along the way, I offer some tips for writing scalable and efficient Google App Engine applications.

Google App Engine Overview

I use Google App Engine for several of my own projects but I have not yet used it on any customer projects. Google engineers use Google App Engine to develop and deploy both internal and public web applications. As you will see, designing applications to run on Google App Engine takes some discipline.

The Datastore and App Efficiency and Scalability


The non-relational datastore for Google App Engine is based on Google's Bigtable system for storing and retrieving structured data. Bigtable can store petabyte-sized data collections, and Google uses Bigtable internally for web indexing and as data storage for user-facing applications like Google Docs, Google Finance, etc. Bigtable is built on top of the distributed Google File System (GFS). As a developer using Google App Engine, you can also create very large datastores.
The datastore uses a structured data model, and the unit of storage for this model is called an entity. The datastore is hierarchical, which provides a way to cluster data or to manage "contains" type relationships. The way this works is fairly simple: each entity has a (primary) key and an entity group. For a top-level entity, the entity group will simply be the (primary) key. For example, if I have a kind of entity (think of this as being a type or a class) called a Magazine, I might have an entity representing an issue of this magazine identified with a key value of /Magazine:programingillustrated0101 and the entity group value would be the same as the key. I might have another entity that is an article of kind Article that might have an entity group of /Magazine:programingillustrated0101 and a key of /Magazine:programingillustrated0101/Article:10234518. Thus, you know that this article belongs to this issue of the magazine.
Entity groups also define those entities that can be updated atomically in a transaction. There is no schema for entities; you might have two entities of kind Article that have different properties. As an example, a second article might have an additional property relatedarticle that the first article does not have. The datastore also naturally supports multiple values of any property.
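To make the Magazine/Article discussion concrete, here is a minimal sketch using App Engine's low-level Java datastore API; the property names and values are hypothetical and simply mirror the example keys above.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

public class MagazineExample {

    public static void storeIssueAndArticle() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Root entity: its key ("Magazine", "programingillustrated0101") also
        // names the entity group for every child entity created under it.
        Entity issue = new Entity("Magazine", "programingillustrated0101");
        issue.setProperty("title", "Programming Illustrated 01/01");
        Key issueKey = ds.put(issue);

        // Child entity: passing the parent key places the Article in the
        // Magazine's entity group, so both can be updated in one transaction.
        Entity article = new Entity("Article", "10234518", issueKey);
        article.setProperty("headline", "A sample headline");
        // Schemaless: this property exists on this particular Article only.
        article.setProperty("relatedarticle", "10234519");
        ds.put(article);
    }
}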
The primary technique for making your Google App Engine applications efficient and scalable is to rely on the datastore—rather than your application code—to sort and filter data. The next most important technique is effectively caching data for HTTP requests, which can be reused until the data becomes "stale."
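As a rough sketch of that caching pattern (not code from the article), the following uses the App Engine MemcacheService to reuse a computed list across requests; loadHeadlinesFromDatastore() is a hypothetical placeholder for a real datastore query.

import java.util.ArrayList;
import java.util.List;

import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class FrontPageCache {

    private static final String CACHE_KEY = "front-page-headlines";

    @SuppressWarnings("unchecked")
    public static List<String> getHeadlines() {
        MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

        // Serve from memcache while the entry is still "fresh".
        List<String> headlines = (List<String>) cache.get(CACHE_KEY);
        if (headlines == null) {
            headlines = loadHeadlinesFromDatastore();
            // Cache for one hour; after that the data is considered stale.
            cache.put(CACHE_KEY, headlines, Expiration.byDeltaSeconds(3600));
        }
        return headlines;
    }

    // Hypothetical placeholder for the real datastore query.
    private static List<String> loadHeadlinesFromDatastore() {
        return new ArrayList<String>();
    }
}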

Monday, March 29, 2010

The Java 7 Features Bound to Make Developers More Productive

If you've tracked each major Java release, you probably were eager to see what new packages were in each one and how you could employ them in your Java projects. Along the same lines, the next major Java SE release, Java 7, promises several new features across all packages, such as modularization, multi-language support, developer productivity tools, and performance improvement. I think programmers eventually will begin specializing in individual Java packages (i.e., java.util programmers, java.io programmers, java.lang programmers, etc.), but until then, let's explore a few of the notable new developer productivity features slated for Java 7.

New Objects Class

The new Objects class in the java.util package provides null-safe static methods for comparing two objects at runtime:
  1. The equals() method returns true when both arguments are the same reference (or both are null); otherwise it delegates to the first argument's equals() method.
  2. The deepEquals() method behaves the same way, except that when both arguments are object arrays it delegates to Arrays.deepEquals().
The Objects class also provides other null-safe static utility methods, such as hashCode(), toString(), and requireNonNull().
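A short, self-contained illustration of this null-safe behavior (expected output is shown in the comments):

import java.util.Objects;

public class ObjectsDemo {
    public static void main(String[] args) {
        String a = null;
        String b = "java";

        // Null-safe: no NullPointerException even though a is null.
        System.out.println(Objects.equals(a, b));        // false
        System.out.println(Objects.equals(null, null));  // true

        // With two object arrays, deepEquals() falls back to Arrays.deepEquals().
        String[] x = { "jdk", "7" };
        String[] y = { "jdk", "7" };
        System.out.println(Objects.deepEquals(x, y));    // true
        System.out.println(Objects.equals(x, y));        // false: arrays use reference equality in equals()
    }
}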

New Classes to Operate on File System

Java SE 7 provides classes that greatly simplify the age-old integration process of one application dropping files at a predefined shared location and another application picking them up. Java 7 provides a new class, WatchService, that reports events taking place in the part of the file system under watch.
The following steps create an asynchronous file-watcher service:
  1. Obtain the path from the File class.
    Path fPath = new File(filePath).toPath();


  2. Obtain a handle to the Watch service from the file system.
    dirWatcher = fPath.getFileSystem().newWatchService();


  3. Register which type of events you are interested in.
    fPath.register(dirWatcher,
                   StandardWatchEventKind.ENTRY_CREATE,
                   StandardWatchEventKind.ENTRY_DELETE,
                   StandardWatchEventKind.ENTRY_MODIFY);


  4. Wait for the event to happen.
    try {
        WatchKey key = dirWatcher.take();
    } catch (InterruptedException ie) {
        return;
    }


    The WatchKey class now has all the details of the event that occurred in the directory.

  5. Loop through Step 4 to continue receiving events (a complete sketch follows this list).
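For reference, here is a complete, self-contained watcher along the lines of the steps above. It is a sketch written against the java.nio.file names used in more recent Java 7 builds (StandardWatchEventKinds and friends), which may differ slightly from the preview names in the snippets; the directory to watch is taken from the command line.

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirWatcher {
    public static void main(String[] args) throws Exception {
        // Directory to watch is passed on the command line.
        Path dir = Paths.get(args[0]);
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                     StandardWatchEventKinds.ENTRY_CREATE,
                     StandardWatchEventKinds.ENTRY_DELETE,
                     StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();               // blocks until events arrive
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) {                          // directory no longer accessible
                break;
            }
        }
    }
}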

New Classes for Concurrency Package

The Java SE team added a wide variety of new classes to Java 7 to cater to various concurrency functionalities. Most notable among them are the RecursiveAction and RecursiveTask classes, which simplify new algorithm development. Understanding the difference between heavyweight and lightweight processes will help you grasp the value of these new classes.
  • A heavyweight process gets a replica of the code, stack, and data from the parent process. You create a heavyweight process by invoking fork().
  • A lightweight process gets its own stack and shares resources and data with other threads or the parent thread. The Unix Thread API standard (POSIX) provides methods to create a thread.
Java 7 defines a new abstract class called ForkJoinTask, a lightweight process that generates a distinct stream of control flow from within a process. RecursiveAction and RecursiveTask are abstract subclasses of ForkJoinTask.
To code a recursive call, you must subclass either one of these classes and define the compute() method. The getRawResult() method returns null for RecursiveAction and returns a value for RecursiveTask. The Java 7 documentation provides a simple example for each of these classes.
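As a minimal sketch of how these classes are used (not the example from the Java 7 documentation), the following RecursiveTask subclass sums a large array by splitting the work until each chunk is small enough to compute directly:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a long[] range by splitting the work across the fork/join pool.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10000;
    private final long[] data;
    private final int from, to;

    public SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {            // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;             // otherwise split in two
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                             // run the left half asynchronously
        return right.compute() + left.join();    // compute the right half, then join
    }

    public static void main(String[] args) {
        long[] numbers = new long[1000000];
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = i;
        }
        long total = new ForkJoinPool().invoke(new SumTask(numbers, 0, numbers.length));
        System.out.println("Sum = " + total);
    }
}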

Code Development Made Too Easy?

For me, the joy of being a computer scientist is spending long hours writing code for various algorithms. The problem solving keeps my mind alert, and the computations keep going in my head even in sleep. All the utilities in Java 7 take much of that joy of programming away, but they contribute to the bottom line for the companies supporting Java projects, which is what really matters for Java.

PDF and Java

I discovered a Java library for PDF from Etymon Consulting. Although it does not cover the full specification, it does provide a convenient approach for reading, changing, and writing PDF files from within Java programs. As with any Java library, the API is organized into packages. The main package is com.etymon.pj.object. Here, you'll find an object representation of all the PDF core objects: array, boolean, dictionary, name, null, number, reference, stream, and string. Where the Java language provides an equivalent object, it is used, but with a wrapper around it for consistency. So, for example, the string object is represented by PjString.

When you read a PDF file, the Java equivalents of the PDF objects are created. You can then manipulate the objects using their methods and write the result back to the PDF file. You do need knowledge of the PDF language to perform some of the manipulations effectively. The following lines, for example, create a Font object:
 
PjFontType1 font = new PjFontType1(); 
font.setBaseFont(new PjName("Helvetica-Bold")); 
font.setEncoding(new PjName("PDFDocEncoding")); 
int fontId = pdf.registerObject(font);


where pdf is the object pointer to a PDF file.

One thing I wanted to do was to change parts of the text in the PDF file to create a "customized" PDF. While I have access to the PjStream object, the byte array containing the text is compressed, and the current library does not support LZW decompression. It does support the Flate decompression algorithm.
Despite some limitations, you can still do many useful things. If you need to append a number of PDF documents programmatically, you can create a page and then append the page to the existing PDF documents, all from Java. The API also provides information about the document, such as the number of pages, author, keywords, and title. This would allow a Java servlet to dynamically create a page containing the document information with links to the actual PDF files. As new PDF files are added and old ones deleted, the servlet would update the page to reflect the latest collection.
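As a rough sketch of that servlet idea, the following lists the PDF files in a directory with their page counts; the directory path, URL mapping, and HTML layout are assumptions, and the only pj calls used are the Pdf constructor and getPageCount() from Listing 1 below.

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.etymon.pj.Pdf;

// Lists the PDF files in a directory along with their page counts.
public class PdfIndexServlet extends HttpServlet {

    private static final String PDF_DIR = "/var/www/pdfs";   // assumed location on disk
    private static final String PDF_URL = "/pdfs/";          // assumed URL mapping

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body><h1>Available documents</h1><ul>");

        File[] files = new File(PDF_DIR).listFiles();
        if (files != null) {
            for (File f : files) {
                if (!f.getName().toLowerCase().endsWith(".pdf")) {
                    continue;
                }
                try {
                    Pdf pdf = new Pdf(f.getAbsolutePath());   // parse as in Listing 1
                    out.println("<li><a href=\"" + PDF_URL + f.getName() + "\">"
                            + f.getName() + "</a> (" + pdf.getPageCount() + " pages)</li>");
                } catch (Exception ex) {                      // skip unreadable files
                    out.println("<li>" + f.getName() + " (unreadable)</li>");
                }
            }
        }
        out.println("</ul></body></html>");
    }
}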
Listing 1 shows a simple program that uses the pj library to extract information from a PDF file and print that information to the console.
 
Listing 1.
import com.etymon.pj.*;
import com.etymon.pj.object.*;

public class GetPDFInfo {
    public static void main(String args[]) {
        try {
            Pdf pdf = new Pdf(args[0]);
            System.out.println("# of pages is " + pdf.getPageCount());
            int y = pdf.getMaxObjectNumber();
            for (int x = 1; x <= y; x++) {
                PjObject obj = pdf.getObject(x);
                if (obj instanceof PjInfo) {
                    System.out.println("Author: " + ((PjInfo) obj).getAuthor());
                    System.out.println("Creator: " + ((PjInfo) obj).getCreator());
                    System.out.println("Subject: " + ((PjInfo) obj).getSubject());
                    System.out.println("Keywords: " + ((PjInfo) obj).getKeywords());
                }
            }
        }
        catch (java.io.IOException ex) {
            System.out.println(ex);
        }
        catch (com.etymon.pj.exception.PjException ex) {
            System.out.println(ex);
        }
    }
}

Before you compile the above program, you need to download the pj library and make sure the pj.jar file it includes is on your CLASSPATH.
The program reads the PDF file specified at the command-line and parses it using the following line:

Pdf pdf = new Pdf(args[0]);
It then goes through all the objects that were created as a result of parsing the PDF file and searches for a PjInfo object. That object encapsulates information such as the author, subject, and keywords, which are extracted using the appropriate methods. You can also "set" those values, which saves them permanently in the PDF file.
There are a number of sample programs that ship with the pj library, along with the standard javadoc-style documentation. The library is distributed under GNU General Public License.

Conclusion

Despite additions and advancements in HTML, PDF continues to be the most popular means for sharing rich documents. As a programming language, Java needs to be able to interact with that data. The pj library shown here is a preview of how PDF objects can be modeled in Java so that Java's familiar constructs can be used to manipulate seemingly complex PDF documents. With this type of interaction, applications that need to serve rich documents can actually "personalize" the content before sending out the document. This scenario applies, for example, to many legal forms where a hand signature is still required and the form is too complex to be rendered entirely in HTML. Java and PDF provide a nice solution for these types of applications.

Selenium: Automated Integration Testing for Java Web Apps

The value of unit tests is well established, and ideally every application would come with a full suite of them. In the real world, however, not all applications possess these ideal qualities: developers have to work with applications that are not well designed or developed and that may never have had any unit testing. This makes modifying or enhancing these applications riskier.
In such circumstances, running automated integration tests might be quicker and just as effective. The integration tests will allow you to modify/enhance the application with confidence. Integration tests also test the application as a whole, which unit tests do not. Unit tests execute only a part of the application in isolation. While integration tests can detect issues in any of the application's components, unit tests detect issues only within a particular component.

Automated integration tests can be useful particularly for legacy applications, CRUD applications and applications that have business logic tightly coupled to the environment in which they run. The Selenium web application testing system is a powerful tool for implementing automated integration testing for Java-based web applications. In his Web Developer's Virtual Library (WDVL) article, "Selenium: Automated Integration Testing for Java Web Apps," Avneet Mangat explains automated integration testing with Selenium. You will learn how to develop integration tests using the Selenium IDE, how to export the integration tests as JUnit tests, and then how to automate test execution.