Geek Thoughts

Tuesday, 12 February 2013

Green Deal or No Deal?

Green Deal Basics

As detailed in a blog post last year, my home energy use and a thermal imaging survey I had done show very clearly that my house is in need of serious energy efficiency work. I was intending to do this last year but working out how to finance the work proved to be a bit of a problem as I had depleted my savings to minimise the size of the mortgage. So when the Green Deal came along, I thought it would be the ideal opportunity to get the work done.

The way the Green Deal works is meant to be extremely simple:

Get an assessment done via an accredited Green Deal Assessor,
Find a Green Deal Provider,
Get the work done via the provider and a Green Deal Installer,
Pay back the money via your electricity bill.

As an aside on the last step, considering that the Green Deal is meant to improve energy efficiency and that most of the housing stock in the UK suffers from poor insulation, the energy use that is most likely to be affected is heating, which in turn would primarily affect gas bills. Therefore paying back via your electricity bill sounds slightly counter-intuitive.

As another aside on the last step, you pay back the money with interest (you didn't think it'd be free credit, did you?) and the interest rate on Green Deal loans is around 7.5%, as explained by Which? in their guide. This is not a great rate and, as the same guide points out, you can choose to finance the works differently, so you may have cheaper options at your disposal. The main advantage of Green Deal loans compared to any other option is that they are attached to the property rather than to yourself. The way I understand it, this means that if you sell the property, the next owners will repay the remainder of the loan on their electricity bill as they benefit from the improvements, and that the loan will not be subject to any credit checks against you. However, don't take my word as gospel, check with an accredited adviser, which brings me to the next step.

Finding an Assessor

The first step in the Green Deal is to find an assessor. The best way to do this is to peruse the search facility on the Green Deal Oversight & Registration Body's web site: you can search by name or postcode, residential, non-residential or both. If you search by postcode, results will be ordered based on proximity, which is very handy.

I did just that and am going down the list, calling each one of them in turn on the assumption that I will get a much better feel for a company if I have them on the phone than if I fill in an online form. I added a small tweak to this: I will not call British Gas, for two reasons. First, I called them last year so I know what their offering is like and will actually use them as a benchmark to compare the other guys against. Second, I am all for giving my money to smaller businesses rather than large incumbents. So far I have called five of them and got fairly varied answers, little of it promising.

When I called the first company, I got to speak to a nice lady who had an accent that was quite hard to understand. She understood what I was looking for but failed to adequately answer my questions. In particular when I asked her how they could justify charging me £150 for an assessment when British Gas was charging £99, she told me that they would not require me to go forward with them as a Provider, contrary to British Gas that would. This sounded a bit off as I seemed to remember there wasn't any condition to the British Gas assessment. Anyway, the whole conversation felt a bit dodgy and she completely failed to take any of my details or to tell me how to organise an assessment with them. In fact, I positively had the impression she wanted me off the phone.
The second company was a bit more promising as the lady I spoke to knew what I wanted and said that I had to fill in their online survey and that someone would call me back. I felt that their survey was more geared towards working out how much cashback I could get on the deal than actually taking meaningful information from me but I filled it in nonetheless. Nobody has called back yet. Note that even though the cashback option is nice, this is not the primary reason I want to go through the Green Deal so I'm not sensitive to any advertising that focuses on that. Others may be.
The third company I called was the first one that is a generic green energy company and that does not focus solely on the Green Deal. To the point that not only does their web site fail to display the Green Deal Approved logo, it also fails to mention the Green Deal altogether and the lady I had on the phone had no idea what I was talking about but promised to forward my details to the man who did know about this. He hasn't called me back yet. This is a shame because as far as I am concerned I may take the opportunity to have other improvements done that are not covered by the Green Deal so working with a company that has a wider scope could have been good.
The fourth company I called knew what I was talking about but said that they couldn't take any booking for now because they were still waiting for some software from the government. Now, I do understand that if the government promised assessors some software or other to do their work and they are late to provide it, it will put a fly in the ointment but my understanding of the assessment is that the first thing to do is a site visit that doesn't need any special software, just someone who knows what they're talking about. Besides, based on past experience, basing your ability to take on new customers on having some software delivered by the government is not a predictable way to run your business so if you don't have a plan B it doesn't inspire much confidence: conditions are never perfect in business, deal with it or you won't have a business very long.
The last company I called so far was by far the best of the lot. The lady on the phone knew what I wanted, was very knowledgeable about the Green Deal, acknowledged that I could finance the deal in different ways and that even after I had done the assessment with them I could go to a different provider. She took all my details and said that an assessor would call me to arrange a convenient time for a site visit. Nobody has called yet but it's still a lot more than any of the others were able to do.

The conclusion of all this is that if the Green Deal is to take off, the first thing would be to have companies ready to initiate new business when it comes knocking on their door. Apart from British Gas who were already eager to book me for an assessment back in November, only one of the smaller companies I called came remotely close to doing the same thing. Note to companies out there: being Green Deal Approved does not mean you can be completely devoid of business sense, get your act together and train your sales team!

Next Step

Call the next 5 companies on the list.

Sunday, 20 January 2013

Identifying spam

Refugees United

Yesterday, I participated in the Refugees United Modding Day 2013 organised by Rewired State for Refugees United. I worked on a hack that tried to implement an algorithm to identify spam and eventually also identify spammers. Our presentation was OK but not fantastic as time was short. The short of it was that by the end of the day I wasn't convinced the algorithm as coded worked, so today I decided to get some statistics out of it and tweak it a little bit.

Bayesian Filters

The idea was to implement a Bayesian filtering algorithm to detect spam. This type of algorithm requires initial training using a body of existing messages comprising both spam and non-spam, where you tell it which is which. By doing that, it will build a corpus of known good data and a corpus of known spam data. It will then use that stored information to score incoming messages. The resulting score is a probability between 0 and 1 of the message being spam (in practice the probability is between 0.01 and 0.99 because you can never be completely sure). After that, each time the algorithm gets its decision wrong, the user has the ability to correct it and it will learn from the correction. This is exactly the sort of algorithm that is implemented in popular email clients like Mozilla Thunderbird.
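To make the scoring step concrete, here is a minimal sketch of one common way to do it. It is not the code from the hack: the function name, the tab-separated training file layout and the way per-word probabilities are combined (a simple product, clamped to the 0.01 to 0.99 range) are all assumptions made for illustration.

```shell
# score_message TRAINING_FILE MESSAGE (illustrative name and file layout).
# TRAINING_FILE holds one message per line: spam or ham, a tab, then the text.
# Assumes the file contains at least one spam and one ham message.
score_message() {
    printf '%s\n' "$2" | awk -F '\t' '
    function clamp(x) { return x < 0.01 ? 0.01 : (x > 0.99 ? 0.99 : x) }
    NR == FNR {
        # Training pass: count messages per label and word uses per label.
        n[$1]++
        k = split($2, w, " ")
        for (i = 1; i <= k; i++) seen[$1, w[i]]++
        next
    }
    {
        # Scoring pass: combine the per-word spam probabilities.
        ps = 1; ph = 1
        k = split($0, w, " ")
        for (i = 1; i <= k; i++) {
            s = seen["spam", w[i]] / n["spam"]
            h = seen["ham", w[i]] / n["ham"]
            if (s + h == 0) continue      # word never seen: no evidence
            p = clamp(s / (s + h))
            ps *= p
            ph *= 1 - p
        }
        printf "%.2f\n", clamp(ps / (ps + ph))
    }' "$1" -
}

# Tiny made-up training set, just to exercise the function.
training=$(mktemp)
printf 'spam\twin cash now\nham\tsee you now\n' > "$training"
score_message "$training" "win cash"   # words only seen in spam: high score
score_message "$training" "see you"    # words only seen in ham: low score
rm -f "$training"
```

A message made only of words that appear equally in both corpora, or of words never seen before, comes out at 0.50, which is the neutral case.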

In order to train the filter, we had a file containing 5500 SMS messages, some spam, some ham. So in order to see how good the filter was, I decided to do the following:

Train the filter on the first few hundred messages in the file;
Then score the remainder of the file and see for each score whether it correctly identified the message as ham or spam.

Tokenizing

The main factor that influences how such a filter works is the way the message is broken into individual tokens. The simplest algorithm is to break the message into words. Some words will be used mainly in spam messages, others mainly in innocent messages and yet others in both. Using 5% of the file for training and the remainder for scoring and applying that simple tokenizing algorithm, I got the following statistics:

Correct guesses: 72.03%
Neutral score (doesn't know if it's spam or ham): 25.68%
False positives (identified as spam when it is ham): 1.74%
False negatives (identified as ham when it is spam): 0.55%

That's not bad for a basic algorithm! So I decided to apply three additional tokenizing techniques on top of the core one and see what would happen:

The first technique consists in adding word pairs in addition to individual words. The logic behind this is that words like fantastic and offer may not be typical of spam on their own but the combination fantastic offer may be.
The second technique consists in adding the lowercase version of every word to the list of tokens in an attempt to see if weird capitalisation would catch spammers out.
The third one is a combination of the first two.
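The tokenizing variants above can be sketched as a small shell function. The function name and the mode names (words, pairs, lower, both) are made up for this example, and skipping the lowercase copy when a word is already lowercase is my own choice to avoid counting it twice.

```shell
# tokenize MODE MESSAGE: print one token per line.
# MODE is words, pairs, lower or both (hypothetical names for this sketch).
tokenize() {
    printf '%s\n' "$2" | awk -v mode="$1" '
    {
        for (i = 1; i <= NF; i++) {
            print $i                              # the word itself
            if (mode == "lower" || mode == "both") {
                l = tolower($i)
                if (l != $i) print l              # lowercase copy, if different
            }
            if ((mode == "pairs" || mode == "both") && i > 1)
                print $(i - 1) " " $i             # pair with the previous word
        }
    }'
}

tokenize both "FANTASTIC offer now"
```

With the both mode, the sample message yields the words, the lowercase copy of FANTASTIC, and the pairs "FANTASTIC offer" and "offer now".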

The results show that word pairs seem to be better than capitalisation but both together are even better.

Algorithm comparison: overall

However, this is not the whole story. The important metric is the bad stuff that ends up in the user's inbox as well as the good stuff that gets rejected. The former is composed of two parts: messages that the filter considered good when in fact they should have been caught as spam (false negatives) and messages that had a neutral score that are in fact spam (bad neutral). The latter is messages that the filter marked as spam when in fact they are good (false positives). Looking at those metrics, the result is not great.
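For reference, this is roughly how such figures can be tallied from a list of scored messages. It is only a sketch: the input format (one "label score" pair per line) and the 0.9 and 0.1 cut-offs for deciding spam, ham or neutral are assumptions, not the values used in the hack.

```shell
# tally reads lines of the form "<label> <score>" on standard input, where
# label is spam or ham, and prints the share of each outcome.
# The hi and lo cut-offs are illustrative assumptions.
tally() {
    awk -v hi=0.9 -v lo=0.1 '
    {
        total++
        if ($2 >= hi)      guess = "spam"
        else if ($2 <= lo) guess = "ham"
        else               guess = "neutral"

        if (guess == "neutral") {
            if ($1 == "spam") bad_neutral++
            else              ok_neutral++
        }
        else if (guess == $1)     correct++
        else if (guess == "spam") false_pos++    # ham flagged as spam
        else                      false_neg++    # spam let through as ham
    }
    END {
        if (total == 0) exit
        printf "correct: %.2f%%\n", 100 * correct / total
        printf "false positives: %.2f%%\n", 100 * false_pos / total
        printf "false negatives: %.2f%%\n", 100 * false_neg / total
        printf "neutral but spam: %.2f%%\n", 100 * bad_neutral / total
        printf "neutral but ham: %.2f%%\n", 100 * ok_neutral / total
    }'
}

# Four made-up results, one per line.
printf 'spam 0.95\nham 0.95\nspam 0.05\nspam 0.50\n' | tally
```

The spam that reaches the inbox is then the sum of the "false negatives" and "neutral but spam" lines.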

Algorithm comparison: mistakes

False positives increase slightly with word pairs and quite significantly when introducing capitalisation variation. This may mean that a lot of legit users actually capitalise their messages like spammers. So not such a good algorithm after all.

Introducing Bias

One aspect of spam is that it is better to have false negatives than false positives. That is, it is better to have a bit of spam in the user's inbox but not block legit messages by mistake. So one typical way to do this is to introduce some bias into the algorithm to give a bit more weight to positive probabilities. Compared to the basic algorithm, it definitely cuts into false positives.

Algorithm bias comparison: overall

Algorithm bias comparison: mistakes

The Impact of Training

An interesting statistic is how much difference training the algorithm makes. So I decided to compare the baseline where 5% of the file is used for training to a version where 10% of the file is used for training.

Algorithm training comparison: overall

Algorithm training comparison: mistakes

And of course, a well trained biased algorithm returns some decent numbers too when compared to the baseline.

Algorithm training and bias comparison: overall

Algorithm training and bias comparison: mistakes

Conclusion

The strength of statistical algorithms like Bayesian filtering is that they learn from the mistakes they make, so it doesn't matter if spammers change their wording: the algorithm will adapt. This is demonstrated by the reasonably high rate of success reached by a very basic implementation of the technique with training on a few hundred messages.

In order to improve the efficiency of such an algorithm, there are a few options:

Include message meta-data such as sender and recipient details, in addition to the message text: this would also be a first step to identify spammers in addition to spam, which was one of the aims of the hack.
Use advanced tokenization techniques such as sparse binary polynomial hashing.
Use Markovian discrimination rather than Bayesian filtering.

All those improvements would come at the cost of increased complexity and processing power requirements so may not all be practical.

Sunday, 13 January 2013

for and while constructs in bash

The `for` construct

When you want to iterate over a list in bash, the first thing that comes to mind is to use a for loop, like this:

for f in abc def; do
    echo "$f"
done

Simple for loop

This works great when the list to iterate over is short and is composed of items that do not contain any white space. When they do, or the list is long, this construct will get into trouble. Let's demonstrate with a simple example. If I create a file with one item per line, 3 lines like this:

one
two
third line

Simple file called test

Then the first attempt at using a for loop would be:

for f in $(cat test); do
    echo $f
done

Simple for loop to read the file

The result is not quite what was expected:

one
two
third
line

Output of the for loop

You can put double quotes in different places but this will not solve the problem. This is because the list given to the for construct is split on white space and, as far as the shell is concerned, an actual space character and a newline are the same: both count as separators. Another limitation of the for construct is that the sub-command contained in $(...) needs to run to completion before for can even start. If the output is large, it can use up a lot of memory or just take a long time to get started.
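A quick way to see the effect of quoting is to count how many times the loop body runs. The function names and the temporary file below are made up for this sketch, and the file has the same three lines as the example.

```shell
# Count how many items a for loop sees when reading a file through $(cat ...).
# count_unquoted and count_quoted are illustrative names.
count_unquoted() {
    n=0
    # Unquoted: the output is split on spaces, tabs and newlines alike.
    for f in $(cat "$1"); do
        n=$((n + 1))
    done
    echo "$n"
}

count_quoted() {
    n=0
    # Quoted: the whole output becomes one single item.
    for f in "$(cat "$1")"; do
        n=$((n + 1))
    done
    echo "$n"
}

sample=$(mktemp)
printf 'one\ntwo\nthird line\n' > "$sample"
count_unquoted "$sample"   # 4 items instead of the expected 3
count_quoted "$sample"     # 1 item instead of the expected 3
rm -f "$sample"
```

Neither version gives one iteration per line.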

The `while` construct

Fortunately, bash has another construct that can bypass those limitations, the while construct. It works slightly differently and needs the help of the read command.

cat test | while IFS= read -r f; do
    echo "$f"
done

A simple while example

And the result is:

one
two
third line

Output of the while loop

This works because the read command reads a full line and does not split on white space. Therefore the value that f is set to is a complete line in the file. The other advantage is that the pipe actually streams the output of the cat command to while and read, meaning that there is no need to wait until it's finished to handle its output. One typical use of that construct is when using the find command: with modern operating systems, file names can have spaces in them and even with a tight condition, find can return hundreds of lines of output.
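As a sketch of the find use case, the function below prints one line per file found, keeping names with spaces in one piece. The function name and the sample directory are made up for the example. Setting IFS to empty and passing -r to read are extra precautions that preserve leading spaces and backslashes.

```shell
# list_files DIR: print one line per regular file found under DIR.
# list_files is an illustrative name.
list_files() {
    find "$1" -type f | while IFS= read -r f; do
        # Each value of f is a complete line of find output.
        printf 'found: %s\n' "$f"
    done
}

dir=$(mktemp -d)
touch "$dir/plain" "$dir/with space.txt"
list_files "$dir" | sort
rm -rf "$dir"
```

One caveat: a file name that itself contains a newline would still be split in two. In bash, the usual workaround is to pair find's -print0 option with read -d '' so that items are separated by a NUL character instead.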

Use the right tool for the job

So when should you use which construct?

If you are dealing with a list that can be large or where each item can contain space characters, use while;
If you are dealing with a short list where no item can contain a space character, you can use for.

Sunday, 20 May 2012

European Cookie Law

Yesterday, Andy Budd tweeted the following:

Wondering if the browsers are doing anything about the EU cookie law? Would be so much slicker if this could be handled at the brower level.

That got me thinking and, as I like to work out how things work, I started to ask Andy how he would see this being implemented. A few tweets later and it's obvious I need more than 140 characters to explain what is going through my mind, hence this post.

Cookie Law, What Cookie Law?

The Cookie Law is a UK law that derives from a European Directive and requires all site owners to disclose their use of cookies and allow visitors to opt in. The law came into force on 26th May last year and the ICO said at the time that it would not enforce it for the first 12 months. Those 12 months come to a close at the end of next week.

Andy's Idea

Andy's idea is to use the browser to handle this law. This is a good idea for the following reasons:

Every single web site has been implementing the law their own way so using the browser would be a good way to bring a bit of standardisation to it;
The browser is the agent that uses and stores the cookies created by web sites so it is the best place to enforce the choice of the user whether to opt in or not and to keep track of that choice between multiple visits.

Operational Outline

So far, so good. Then comes the question: how do you implement such a thing in the browser? At a high level, you need to do the following when visiting a web site:

Identify whether the web site falls under the jurisdiction of the Cookie Law;
If yes, then identify for each cookie presented by the web site:

What is that cookie used for,
Whether that use is covered by the exceptions detailed in paragraphs (4)(a) and (4)(b),
If not, ask the user for consent.

Let's take all those one at a time to see where we get to.

Jurisdiction

The first step is to identify whether a given web site is subject to the Cookie Law. In order to do this reliably, you would need a cryptographically secure token that can be linked back to a company identity, including a country. Extended Validation Certificates already offer something similar but do they contain a country code in a machine readable format? I simply don't know. And what about sites that use plain HTTP rather than HTTPS?

In all instances, you will have three possible outcomes to whether the site falls under the Cookie Law: yes, no or don't know. In the first case, you also need to know what variation of the European Directive to apply. European Directives being what they are, each member country is free to implement it their own way so German law will be different from British law. Conversely, in the last case, what should the browser do? Display a warning or let you go on?

To complicate matters, there is also the question of whether cookies served by a domain other than the main site's domain, such as cookies from ad networks, fall under the main site's jurisdiction or their own domain's jurisdiction. IANAL so I have no idea what the answer is.

Finally, what would prevent a multi-national company from advertising its web site to the browser as being in a non-European jurisdiction even if they do business in Europe?

What is that Cookie for?

The next step is to identify what each cookie is used for. This could take the form of a machine readable file located at a well known URL or referenced by a link tag in the page's header. This was tried before in the form of P3P and it failed to gain traction. Any such standard would have to learn from the issues faced by P3P in order to succeed.

Once this is done, it would be a case of having a number of uses recognised as falling under the exception paragraphs while any other use would require opt-in. You would then end up with three possible outcomes regarding whether user opt-in is required for any given cookie served by the web site: yes, no and don't know, the latter being the case if the web site does not provide any information for that particular cookie. This last case will be the controversial one. You can't be too stringent, otherwise web sites won't have time to implement the standard. On the other hand, you have to at least let the user know that a machine readable description of that cookie's use is missing, otherwise it gives an easy cop-out for web sites that don't want to play fair.
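Put together, the jurisdiction check and the purpose check boil down to a small decision table. The sketch below is purely illustrative: no browser works this way, the function name and the accept, ask and warn labels are invented, and showing a warning whenever either answer is "don't know" is just one possible answer to the open question of what to do in that case.

```shell
# cookie_decision JURISDICTION USE
# JURISDICTION: yes, no or unknown (does the Cookie Law apply to the site?)
# USE: exempt, other or unknown (declared purpose of the cookie)
# Prints accept, ask or warn. All names here are hypothetical.
cookie_decision() {
    case "$1:$2" in
        no:*)       echo accept ;;  # the law does not apply to this site
        yes:exempt) echo accept ;;  # use covered by the exception paragraphs
        yes:other)  echo ask ;;     # user consent is required
        *)          echo warn ;;    # jurisdiction or purpose unknown (assumption)
    esac
}

cookie_decision yes other
cookie_decision yes unknown
```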

Opt-in Management

Once a user has given or declined consent for particular cookies to be stored on their browsers, said browsers can remember such decisions and act accordingly next time the user visits the same web site. It would also be nice if the browser could notify the site of the user's decision so that web sites can avoid creating declined cookies altogether. This should then be accessible to the user in a similar way to saved passwords.

Do Not Track, etc.

A couple of parting thoughts:

How should all this interact with features like Do Not Track?
How can it be made flexible enough such that it can be extended the day other countries implement similar laws?

Answers on a postcard or in the comments below.

Monday, 2 April 2012

Energy Use

I've been using iMeasure roughly since I moved into the new house and here's what the graphs look like so far:

iMeasure Energy Usage Graph

There are two immediate observations on this graph:

electricity use is not seasonal,
gas use definitely is!

The first observation tells me that my main electricity usage is probably not lighting as it doesn't change with the amount of daylight. So it's probably down to the big electrical items such as the washing machine and the fridge. I should be able to reduce that usage the day I replace them with new efficient models. One additional tidbit of trivia: the spike at the beginning of the graph is down to the sanding machines used when I had the wooden floors of the house sanded and varnished.

The second observation tells me that I need to work on insulating the house. In fact, I had thermal imaging done recently by the excellent Sustainable Lifestyles and it showed me very clearly that I have some low hanging fruit to pick first, in particular the loft insulation (or partial lack thereof) that results in very cold spots above the bay window in the master bedroom:

Cold Spot Above Bay Window

And at the junction points between walls and roof, the fact that whoever fitted the insulation in the loft didn't bother to fit it properly at the bottom causes cold spots underneath:

Cold Spots Where Wall Meets Roof

All this should be reasonably easy to fix so that will be my project for the summer and hopefully it should shave some of that spike off the graph for next year.

Sunday, 15 January 2012

Recycling Smoke Alarms

We all know that we should have smoke alarms fitted in our homes. Those alarms can be damaged and will need replacing every ten years or so anyway. So what do you do with the old ones? Chuck them in the bin? Well, the fact that they are the subject of a best practice guide on the National Household Hazardous Waste Forum suggests that this is probably not the right solution. And indeed, looking at the back of mine, I can see why:

The back of my smoke alarm showing that it is an ionization alarm that contains a small amount of radioactive Americium 241

Ionization smoke alarms contain a small amount of radioactive material, Americium 241. Looking back at the best practice guide above, there are apparently three ways to deal with it:

By a person authorised under section 13 of the Radioactive Substances Act 1993,
By returning it to the manufacturer,
By chucking it in the bin as long as you don't chuck in other radioactive waste and you only throw away one smoke alarm per bin bag.

Option 3 doesn't sound like recycling, while I don't know anybody who can help me with option 1. So that leaves option 2. As I've got the manufacturer's details on the back of the alarm, and their address is confirmed on their web site, that smoke alarm is going to find itself put into a jiffy bag, back to where it came from.

Note that there is another type of smoke alarm: photoelectric. Photoelectric alarms do not contain any dangerous material so are probably safer to dispose of. However, they are geared to detect different types of fires so for maximum protection you should have a combination of both photoelectric and ionization alarms.

For more questions on recycling stuff, have a look at the Recycle This web site.

Update

As very sensibly pointed out by Earth Notes, there may be an even easier way to deal with them: under the WEEE Directive, you can probably just give the old one to the retailer when you buy a new one.

Friday, 13 January 2012

Yodel redefines the word Safe while John Lewis redefines Eco-Friendly

Last week-end I visited the John Lewis web site and bought a couple of Buiani folding chairs. I was advised that they would be delivered within 7 days via a standard delivery service, as opposed to the specialist delivery service you get when you buy larger items and who are very good.

So when I came back home on Wednesday night, I found a very large (more on that later) cardboard box outside my front door and in the letter box was this delivery notice:

Yodel delivery notice

You will note how they checked the "a safe place" box. They actually left the parcel outside my front door. Luckily I live in a relatively safe place so theft is unlikely. On the other hand, leaving an unprotected cardboard box outside, in London, in January, with something inside that may suffer from getting wet strikes me as a tad optimistic. Or did they check the weather forecast before leaving the box outside?

Another thing that I found rather puzzling was the size of the box. It would have made sense had it contained normal chairs. But folding ones: surely they'd be shipped folded? All was revealed when I opened the box:

The big box

You will note the green stickers on the left side of the chairs with the FSC logo advising me that those chairs are made from wood from well-managed forests. Brilliant! Unfortunately the amount of Air Pad packaging filling in the box probably offsets all eco-friendly credentials imparted by the FSC logo. On the plus side, it probably means that I now have enough air pads to send presents to my two nieces until they reach adult age (uncles are meant to spoil nieces and nephews, that's part of the job description).

Friday, 30 December 2011

Non-Exchangeable, Non-Refundable

I travelled on Eurostar today and learnt something about non-exchangeable, non-refundable tickets in the process so thought I'd share in case it can be useful to someone else. Eurostar sells several types of tickets in several classes (Standard, Standard Premier and Business Premier). The higher the class and the more flexible the ticket, the greater the price. So the cheapest tickets are non-exchangeable, non-refundable standard class tickets. Once bought, such tickets cannot be exchanged against another on a different train, cannot be refunded in case you don't want to travel anymore but they can most certainly be upgraded to the next travel class up, as I did today. Of course, it requires you paying an upgrade price, which may not be cheap. The fact that you can upgrade any ticket makes sense because:

An upgrade doesn't fall under the non-refundable rule because you're not asking for a refund, and in fact you're paying extra for the upgrade;
It doesn't fall under the non-exchangeable rule either because you're not asking for an exchange as you still want to travel on the same train at the same date: you just want to upgrade your existing ticket.

So if you ask general information staff in the station and are told that you cannot upgrade your ticket, don't take their word for it, go to the sales counter. The only reason why you would not be able to upgrade (and pay Eurostar more money) is if the travel class you want to upgrade to is already fully booked on your train.

Saturday, 28 May 2011

Crash of the Day

Received today from a colleague:

Please note that if somebody opens Build log excel in Microsoft excel 2007 and updates it while a filter put on any column, the file crashes.
So please avoid updating the build log in Microsoft excel 2007.

So we're talking about a fairly simple file created in Excel 2003 that crashes Excel 2007 if you try to update it while a filter is set on any column... Sigh...

Thursday, 21 April 2011

MS Works to MS Word: LibreOffice to the Rescue

I am at my mum's for Easter and one of the first things she asked me to look at had to do with her computer. She had this document that she wrote using Microsoft Works aeons ago that she wanted to open again. Of course, she's now using Microsoft Word and Word has no idea how to open Works files, even though both products are produced by the same software company.

What to do? The answer is very simple but rather counter-intuitive for people not used to open source software: LibreOffice. From the point of view of someone who lives in a world where closed source is the norm, how can a free office wannabe solve a problem that the mighty MS Office can't solve? Simple: as highlighted by Michael Meeks at FOSDEM earlier this year, LibreOffice wants to have the largest possible list of supported file formats so that they can support their users in reading their old documents stored in long forgotten formats and hopefully migrate them to modern and preferably open document formats. As a result, LibreOffice supports MS Works and MS Word out of the box.

So back to my mum's document, retrieving the content was then very easy: copy the document to a USB key, open it on my laptop using LibreOffice, save it again and copy the new document back to her PC. In this case, it meant saving it back as an MS Word document. In an ideal world, I would have saved it as ODF and installed LibreOffice on her computer but I'll leave that for another day.

So if you have any old document lying around that you can't open anymore, try LibreOffice first, you'll be surprised how many weird and wonderful formats it supports. If it still doesn't work, consider contributing a filter to the project or at least reporting the issue and providing sample files so that the developers can build such a filter.