Sunday, October 3, 2010

Percolator - Google's newest weapon

Google's got a new weapon to fight exponentially growing web. Recently published paper shows that they developed a distributed transaction layer and notification framework over Bigtable. Percolator helps to make incremental changes into Google's search index.
Authors of the paper dismissed Sinfonia's approach (another consistent distributed storage system) as if it is intended to use only for building infrastructure not for application.
IMHO, Sinfonia's approach is simpler, probably, faster and with minor modification can be suitable for application builders.

Wednesday, June 16, 2010

Multiple substring search in Java

My recent attempts to find a decent Rabin-Karp algorithm implementation on Java have failed. After some research I wrote one and want to share it.
Please note that it works better if text and patterns contain character with codes < 256.
Also it makes sense to use that implementation if number of patterns is more than 40-50. If less then use brute force algorithm with String.indexOf. It performs better on small number of patterns.

import java.util.HashMap;

/**
* Multiple substrings search using Rabin-Karp algorithm.
* Performs better than brute force algorithm when number of patterns is more than 50.
* */
public final class RKPatternMatcher
{
private HashMap<Integer, String> patterns;
private int minPatternLength=Integer.MAX_VALUE;

private int matchIndex=-1;
private String matchText;

private static final int M = 8300009;
private static final int B = 257;
private int E = 1;

public RKPatternMatcher(String[] patterns)
{
this.patterns = new HashMap<Integer, String>(patterns.length);
for(int i=0;i<patterns.length;++i)
{
int patternLength = patterns[i].length();
if(patternLength == 0) throw new IllegalArgumentException("Pattern cannot be an empty string");
if(patternLength < minPatternLength)
minPatternLength = patternLength;
}
for(int i=0;i<patterns.length;++i)
{
this.patterns.put(hash(patterns[i]), patterns[i]);
}
if(patterns.length > 0)
{
for (int i=0; i<minPatternLength-1; i++) // computing E = B^{m-1} mod M
E = (E * B) % M;
}
}

private int hash(String pattern)
{
// calculate the hash value of the pattern
int hp = 0;
for(int i = 0; i < minPatternLength; i++)
{
char ch = pattern.charAt(i);
hp = (hp * B + ch) % M;
}
return hp;
}
private boolean checkMatch(char[] text, int startIndex, String pattern)
{
int n = pattern.length();
if(text.length-startIndex < n) return false;
for(int i=0;i<n;++i)
{
int ch = text[startIndex+i];
if(ch != pattern.charAt(i)) return false;
}
matchIndex = startIndex;
matchText = pattern;
return true;
}

public boolean hasMatch(String text)
{
return hasMatch(text, 0);
}

public boolean hasMatch(String text, int fromIndex)
{
int n = text.length();
if(n - fromIndex < minPatternLength) return false;
char[] chars = text.toCharArray();
return hasMatch(chars, fromIndex);
}
public boolean hasMatch(char[] chars, int fromIndex)
{
matchText = null;
matchIndex = -1;
int n = chars.length;
if(n - fromIndex < minPatternLength) return false;
// calculate the hash value of the first segment
// of the text of minPatternLength
int ht = 0;
for(int i = 0; i < minPatternLength; i++)
{
int ch = chars[fromIndex+i];
ht = (ht * B + ch) % M;
}
String pattern = patterns.get(ht);
if(pattern!=null)
{
if(checkMatch(chars, fromIndex, pattern)) return true;
}
//start the "rolling hash" - for every next character in
// the text calculate the hash value of the new segment
// of minPatternLength
for(int i = minPatternLength + fromIndex; i < n; i++)
{
char ch1 = chars[i - minPatternLength];
char ch2 = chars[i];
ht = (B*(ht - ((ch1*E) % M)) + ch2) % M;
if (ht < 0) ht += M;
pattern = patterns.get(ht);
if(pattern!=null)
{
if(checkMatch(chars, 1 + i - minPatternLength, pattern)) return true;
}
}
return false;
}

public int getMatchIndex()
{
return matchIndex;
}

public String getMatchText()
{
return matchText;
}
}

Friday, March 5, 2010

OSGi and source code growth problem

Recently I've found pretty interesting OSGi DevCon keynote preview:
http://techdistrict.kirkk.com/2010/02/17/osgi-devcon-preview/
Author proposes to solve source code growth problem by using OSGi in the system architecture.
I agree that OSGi is the best modularization technology for Java and quite useful if you need one or more of the following:
- modularity framework
- usage of multiple versions of the same module
- hot module redeployment
- dependency injection

However, in present state OSGi has a major drawback. It does not make developers more productive, instead it adds more work such as metadata management and following OSGi guidelines. In many cases, benefits of OSGi may be not worth additional development overhead.

And, finally, I dont's see how OSGi can solve the source code growth problem. Instead of managing thousands of classes, developer have to manage thousands of modules. The growth problem still exist with OSGi too.

Monday, March 1, 2010

Source code as data

Von Neumann architecture opened an era of modern programming where program instructions became data. Later, computer languages, compilers and source code were invented in order to help programmers deal with the growing complexity and size of programs. Today, 65 years later, people are able to manage terabytes of data, literally build clouds out of computers, but still struggle with programs with millions of lines of code which are far beyond a human's comprehension.

In the programming world, source code is treated like sacred knowledge, stored in text files, written in programming languages and understandable mostly by authors. Most difficulties in code comprehension and research stem from the fact that code is not considered data (data as in a database). Every programming language, technology or framework has it's own syntax and
program structure. Source code search is implemented by generic text search engines, which allow searching only by terms or names of classes, methods or so. It is impossible to formulate a precise query using names, program structure and references in a generic way. For example, generic text search engine cannot find a list of all functions called from said functions, recursively called functions etc.

It’s clear that exponential growth of programs size can be managed only by applying principles of structured data organization for source code. Program artifacts must be indexed and stored in database in technology-independent, generic form. Such program database must allow querying and modifying artifact in a transactional way. Generic form of artifacts will allow transferring algorithms, business logic between programming languages and technologies.