Wednesday, August 28, 2013

Objects and functional programming

In a recent question on Stack Overflow, someone asked “when to use interfaces and when to use higher-order functions?”. In short, the question is whether to design an API so that callers pass it an interface or a function. It’s expressed in F#, but the same question and arguments apply to similar languages like C#, VB.NET or Java.

Functional programming is about programming with mathematical functions, which means no side-effects. Some people say that “functional programming” as a paradigm or concept isn’t useful, and all that really matters is being able to reason about your code. The best way I know to reason about my code is to avoid side-effects or isolate them as much as possible.

In any case, none of this says anything about objects, classes or interfaces. You can represent functions however you like. You can write pure code with objects or without them. OOP is effectively orthogonal to functional programming. In this post I'll use the terms 'objects', 'classes' and 'interfaces' somewhat interchangeably; the differences don't matter in this context. Hopefully my point still gets across.

Higher-order functions are of course a very useful tool to raise the level of abstraction. However, many perhaps don’t realize that any function receiving some object as argument is effectively a higher-order function. To quote William Cook in “On Understanding Data Abstraction, Revisited”:

Object interfaces are essentially higher-order types, in the same sense that passing functions as values is higher-order. Any time an object is passed as a value, or returned as a value, the object-oriented program is passing functions as values and returning functions as values. The fact that the functions are collected into records and called methods is irrelevant. As a result, the typical object-oriented program makes far more use of higher-order values than many functional programs.

So in principle there is little difference between passing an interface and passing a function. The only difference here is that an interface is named and has named functions, while a function is anonymous. The cost of the interface is the additional boilerplate, which also means having to keep track of one more thing.
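
To make the comparison concrete, here is a minimal sketch of the same behaviour passed first as a function and then as a one-method interface. All the names (IFormatter, greetWith, greetWithObject) are made up for illustration and don't come from the Stack Overflow question.

```fsharp
// A one-method interface: a named wrapper around a single named function.
type IFormatter =
    abstract Format: string -> string

// Higher-order function: the behaviour arrives as an anonymous function.
let greetWith (format: string -> string) (name: string) =
    "Hello, " + format name

// Same logic, but the behaviour arrives as an object.
let greetWithObject (formatter: IFormatter) (name: string) =
    "Hello, " + formatter.Format name

// An object expression implements the interface on the spot.
let upper =
    { new IFormatter with
        member x.Format (s: string) = s.ToUpperInvariant() }

let viaFunction = greetWith (fun (s: string) -> s.ToUpperInvariant()) "ada"
let viaObject = greetWithObject upper "ada"
```

Both calls compute the same string. The object version just costs one extra type declaration, which is the boilerplate mentioned above.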

Even more, since objects typically have many functions, you could say that you’re not just passing functions as values, but passing modules as values. To put it clearly: objects are first-class modules.

As an aside, the term “first-class value” doesn’t have a precise definition, but I find it useful to go with the definition given in MSDN or Wikipedia.

Objects are also parameterizable modules because constructors can take parameters. If a constructor takes some other object as a parameter, then you could say that you’re parameterizing a module by another module.
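
Here is a rough sketch of what I mean by first-class, parameterizable modules. The IDiscount interface and the classes implementing it are hypothetical, invented only for this example.

```fsharp
// The interface plays the role of a module signature with two functions.
type IDiscount =
    abstract Apply: decimal -> decimal
    abstract Name: string

// A "module" parameterized by a plain value through its constructor.
type PercentOff(percent: decimal) =
    interface IDiscount with
        member x.Apply price = price * (100m - percent) / 100m
        member x.Name = string percent + "% off"

// A "module" parameterized by two other "modules".
type Both(first: IDiscount, second: IDiscount) =
    interface IDiscount with
        member x.Apply price = second.Apply(first.Apply price)
        member x.Name = first.Name + ", then " + second.Name

// Small helper so call sites don't need explicit upcasts.
let percentOff p = PercentOff p :> IDiscount

// First-class: the objects can be stored in a list and passed around.
let discounts =
    [ percentOff 10m
      percentOff 50m
      Both(percentOff 10m, percentOff 50m) :> IDiscount ]

let prices = discounts |> List.map (fun d -> d.Apply(200m))
```

Note that everything here is still pure: each Apply is a mathematical function of the price and the constructor parameters.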

In contrast to this, F# modules (more generally, static classes in .NET) are not first-class modules. They can’t be passed as arguments, and you can’t do something like creating a list of modules. And they can’t be parameterized either.

So why do we even bother with modules if they’re not first-class? Because it’s easier to pick just one function out of a module to use or to compose. Object composition is more coarse-grained. As Joe Armstrong famously said: “You wanted a banana but you got a gorilla holding the banana”.
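
A small sketch of the difference in granularity, using a made-up Text module and a made-up ITextService interface (neither is from the question):

```fsharp
module Text =
    let trim (s: string) = s.Trim()
    let shout (s: string) = s.ToUpperInvariant() + "!"
    let words (s: string) = s.Split(' ') |> List.ofArray

// With a module, I pick exactly the two functions I need and compose them.
let announce = Text.trim >> Text.shout

// With an object, the dependency is the whole interface.
type ITextService =
    abstract Trim: string -> string
    abstract Shout: string -> string
    abstract Words: string -> string list

let announceWith (svc: ITextService) =
    fun s -> svc.Shout(svc.Trim s)

// Whoever calls announceWith has to supply Words too, even though it is
// never used there: that is the gorilla that comes with the banana.
let textService =
    { new ITextService with
        member x.Trim s = Text.trim s
        member x.Shout s = Text.shout s
        member x.Words s = Text.words s }

let viaModule = announce "  hi  "
let viaObject = announceWith textService "  hi  "
```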

Back to the Stack Overflow question: what’s the difference between

module UserService =
   let getAll memoize f =
       memoize(fun _ -> f)

   let tryGetByID id f memoize =
       memoize(fun _ -> f id)

   let add evict f name keyToEvict  =
       let result = f name
       evict keyToEvict
       result

and

type UserService(cacheProvider: ICacheProvider, db: ITable<User>) =
   member x.GetAll() = 
       cacheProvider.memoize(fun _ -> db |> List.ofSeq)

   member x.TryGetByID id = 
       cacheProvider.memoize(fun _ -> db |> Query.tryFirst <@ fun z -> z.ID = id @>)

   member x.Add name = 
       let result = db.Add name
       cacheProvider.evict <@ x.GetAll() @> []
       result

The first one has some more parameters passed in from the caller, but you can imagine what it would look like. I probably wouldn’t arrange things like either of them, but for one, both lack side-effects. To the first one, you can pass pure functions. To the second one, you can pass implementations of ICacheProvider and ITable with pure functions.
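
As a sketch of what “passing pure functions” to the first version could look like, here are getAll and tryGetByID copied from above, called with stand-ins of my own invention (users, noCache and findUser are assumptions, not part of the question):

```fsharp
// Copied from the first version so that the sketch is self-contained.
module UserService =
    let getAll memoize f =
        memoize (fun _ -> f)

    let tryGetByID id f memoize =
        memoize (fun _ -> f id)

// Hypothetical in-memory "table" of (id, name) pairs.
let users = [ 1, "ada"; 2, "grace" ]

// A trivial "memoize" that caches nothing and just runs the computation.
let noCache thunk = thunk ()

// A pure lookup function over the in-memory data.
let findUser id =
    users |> List.tryFind (fun (userId, _) -> userId = id)

let all = UserService.getAll noCache users
let found = UserService.tryGetByID 2 findUser noCache
let missing = UserService.tryGetByID 3 findUser noCache
```

Nothing in the signatures tells the reader that noCache is supposed to be a cache or that findUser is supposed to hit a database. That knowledge lives only in the parameter names.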

However if you take a good look at the second one, you’ll see that every method uses both cacheProvider and db. So in this case it’s not so bad to pass a couple of gorillas. And it gives the reader a lot more information about what’s being composed, as opposed to a signature like

add : evict:('a -> unit) -> f:('b -> 'c) -> name:'b -> keyToEvict:'a -> 'c

To summarize: The beauty of functional programming lies in being able to reason about your code. One of the easiest ways to achieve this is to write code without side-effects. Classes, interfaces, objects are not opposed to this. In object-capable languages, objects can be a useful tool. Here I talked about objects as modules, but they can model other things too, like records or algebraic data types. They can be easily overused though, especially by programmers new to functional programming. Consider carefully if you want to be juggling gorillas rather than bananas!

Saturday, August 3, 2013

Book review: Apache Solr for Indexing Data How-to

A few days ago I kindly received a copy of the book “Apache Solr for Indexing Data How-to” by Alexandre Rafalovitch for review. Here are my impressions about it.

Solr, by now a nine-year-old project, is a powerful piece of software, with lots of high-level features and facilities for text-centric data. And it builds on Lucene, itself an 11-year-old stand-alone project.

At 80 pages, “Apache Solr for Indexing Data How-to” doesn’t try to cover all the features. Instead, it focuses on indexing, that is, getting data from some source (relational databases, text files, etc.) into Solr. This is of course a major part of using Solr.

When starting out with Solr, most people first follow the official tutorial, but then feel lost when faced with real-world requirements. The official wiki docs have greatly improved in the last few years, but there’s still a large gap between the tutorial and the docs. The reference guide is also great, but for a novice it may seem daunting at first. You can see this in many questions on Stack Overflow. This book helps close that gap a bit, at least the part about getting your data into Solr.

You can read it like a cookbook, as guidance for specific indexing scenarios. As a good “how-to” book, each section starts with a short introduction, followed by step-by-step guidance on how to get to the goal, and a “how it works” section explaining everything. An additional section adds tips and further references about each subject.

Of course you can also read it like a regular book. It starts with the most basic scenario, picking up where the tutorial leaves off, and then dives into more complex scenarios. All examples are on GitHub so you can follow along on a concrete instance of Solr while reading. The book is written for Solr 4.3. As of now Solr 4.4 is already out and 4.5 is coming soon, but don’t worry: the dev team seems to follow Semantic Versioning, so there shouldn’t be any breaking changes.

One problem with this kind of book is that it often can’t focus just on the main topic (in this case, indexing) without at least touching on other topics. Indexing is related to the Solr schema, which in turn is a function of the search needs of your application. This book dabbles in faceting and searching when the scenario demands it, but otherwise acknowledges its limited scope and refers the reader to other books or the reference documentation when appropriate, so you never feel lost.

Another issue is the simplification of some scenarios in order to focus on operative topics and avoid scope creep. For example, the section on indexing data from a relational database uses an example where the database has only one table, no foreign keys. In most real-world scenarios you’ll have lots of related database tables which you’ll have to denormalize and flatten depending on your search needs.

Overall, I think “Apache Solr for Indexing Data How-to” is great for a novice in Solr. It’s a simple, concrete guide to indexing, which is one of the first things you do with Solr. Just don’t expect it to be comprehensive: it doesn’t cover all scenarios and you should read it alongside the docs to truly understand the concepts at work. It’s designed to help you move forward when, as a beginner, everything looks too complex and you have no idea what to do.

The tutorial will get you started, but this book will get you going.