Building a machine learning product: Data and features
A guide to how we’re using Features in our machine learning – picking them, visualising them and then using or discarding them to improve accuracy and performance.
Before generating Features
Machine learning is all about recognising and learning patterns. One of the hardest parts of building models that achieve a desired result (i.e. recognising patterns that matter) is manipulating and cleaning up your data so that you can apply machine learning algorithms to it. Data is never ready for machine learning as it arrives; it always needs cleaning and processing.
Turning raw data into things that patterns can be derived from is the process of creating features. A feature is simply an individual measurable property of a phenomenon being observed. Depending on your data source there are potentially dozens of different features you can create from a single block of data.
This article gives an introduction to machine learning, using the example of trying to work out whether a given home is in New York or San Francisco. For each home you have a series of data points: elevation, number of bedrooms, height, year built, bathrooms, price etc. The article also points out that comparing two sets of data can add nuance and illuminate further patterns.
The generation and selection of these features is a combination of art, science, experimentation and learning. Domain expertise can be a real help for this initial generation of features. If you know a lot about the data you’re looking at, then your starting point for features will likely be stronger than that of someone with no knowledge of what the data might show. In our previous example, knowing that San Francisco is hilly, and that more of its homes therefore sit above a certain elevation, would immediately put you in a better starting position than somebody who had never been to either New York or San Francisco.
When you start to analyse the effect of these features on the model, the model itself will tell you whether or not they are relevant. It’s one of the beauties of machine learning – it levels the playing field and looks at the data above everything else. It doesn’t care about your previous knowledge once you get started.
Also worth noting is that in the previous example features often already exist that you can apply machine learning to. Altitude, height, price; these are all data points which to an extent have already been defined and recorded. In this sense you’re only selecting features rather than generating them from raw data. Sometimes your data set is even more rudimentary and you actually have to generate new features yourself. In this instance domain expertise can become even more valuable.
So what happens if your data doesn’t have obvious features within it? Let’s take the example of a machine learning algorithm whose aim is to recognise houses in photographs. You could feed every pixel into the algorithm in an effort to distinguish between houses and non-houses, but it would likely be more effective to process the image first for shapes and other possible elements of houses (doors, windows, porches, bricks etc.) and then put these into the algorithm. This is known as feature generation.
At Nudgr we collect a huge amount of data on how people interact with online forms. But these data points arrive at our servers in the form of a list of events, representing clicks, key presses and other interactions with a form. The data looks like this:
This is a great starting point, but on their own these events cannot be used to make predictions. Machine learning tasks such as classification require that the inputs be easy to process mathematically and computationally. We therefore take these events and arrange them into types of behaviour, which become our features. Examples include interaction rate and corrections over time, and they can be tabulated like this (just for illustration purposes):
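As a rough sketch of that step, the Python below groups a list of raw events by session and derives one row of numeric features per session. The event fields (`session`, `t`, `type`, `key`), the sample values and the feature names are all assumptions made up for illustration; they are not Nudgr’s real schema or feature set.

```python
# Hypothetical event records: field names and values are invented for
# illustration and do not reflect any real data format.
events = [
    {"session": "a", "t": 0.0, "type": "focus"},
    {"session": "a", "t": 1.2, "type": "keypress", "key": "j"},
    {"session": "a", "t": 1.5, "type": "keypress", "key": "Backspace"},
    {"session": "a", "t": 4.0, "type": "click"},
    {"session": "b", "t": 0.0, "type": "focus"},
    {"session": "b", "t": 10.0, "type": "keypress", "key": "x"},
]


def session_features(events):
    """Group events by session and derive one row of numeric features each."""
    by_session = {}
    for event in events:
        by_session.setdefault(event["session"], []).append(event)

    rows = {}
    for session_id, session_events in by_session.items():
        times = [e["t"] for e in session_events]
        duration = max(times) - min(times)
        # Treat each Backspace key press as one correction (an assumption).
        corrections = sum(1 for e in session_events if e.get("key") == "Backspace")
        rows[session_id] = {
            "event_count": len(session_events),
            "duration_s": duration,
            # Events per second; 0.0 when the session has no measurable duration.
            "interaction_rate": len(session_events) / duration if duration > 0 else 0.0,
            "corrections": corrections,
        }
    return rows


features = session_features(events)
for session_id, row in features.items():
    print(session_id, row)
```

Each session becomes a fixed-length row of numbers, which is the shape most classification algorithms expect as input.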
What we also need to work out is whether or not these features are worth measuring – in other words, do they seem to correlate with abandonment or completion behaviour. In order to get an inkling for whether or not a given idea might be helpful, visualising the data and comparing abandonment and completion behaviour can help:
Visualisations such as the above help you analyse the data, especially when exploring the generation of new features. Sometimes the visualisation itself will have to be iterated upon to decide whether or not a given feature is worth keeping or discarding.
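Before (or alongside) plotting, a quick numeric comparison can give a first hint. The sketch below summarises one candidate feature separately for completed and abandoned sessions. The rows are made-up values for illustration only, and the feature (number of corrections) is just an example.

```python
from statistics import mean, median

# Made-up rows for illustration only: (corrections in session, outcome).
sessions = [
    (0, "completed"), (1, "completed"), (2, "completed"),
    (4, "abandoned"), (6, "abandoned"), (5, "abandoned"),
]


def summarise_by_outcome(rows):
    """Return count, mean and median of the feature for each outcome."""
    groups = {}
    for value, outcome in rows:
        groups.setdefault(outcome, []).append(value)
    return {
        outcome: {"n": len(values), "mean": mean(values), "median": median(values)}
        for outcome, values in groups.items()
    }


summary = summarise_by_outcome(sessions)
for outcome, stats in summary.items():
    print(outcome, stats)
```

If the two groups look clearly different, the feature is probably worth a proper visualisation; if they look identical, it may still matter in combination with other features, so this is a hint rather than a verdict.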
If we decide that a feature is worth keeping, it is included in the training and testing of the predictive model. In theory, if you’ve made a good decision about which additional features to keep, then the performance of the model should improve.
The interesting challenge for us at Nudgr (unlike the housing example above) is that we have to decide what data to collect in the first place. The data isn’t readily available, so we have to build code to collect information as a user interacts with a form.
Our domain expertise in how people interact with forms (from our Formisimo service) has been invaluable in guiding our feature building activities. It was one of the reasons we felt we could build Nudgr out to be a great product: we understand how people interact with the many types of online forms.
Finding the balance
Your initial set of features can often be too large, with some being redundant and unhelpful in making predictions. In the previous housing example, you might have hundreds of features (number of windows, latitude, size of garden, crime rate in the area, materials used to make the building) but not all of them will be useful.
The more features you have, the more computing and processing power you need to work through them. Often a second (and indeed third and fourth) step is to reduce the number of features you use in your model. The aim is to keep enough features to make reliably accurate predictions, but not so many that you can’t make predictions quickly. Certain features may also actually detract from the performance of the model, so it’s important to find the right balance. Features can be dropped programmatically too: since you can calculate which features are useful and which are not, you do not have to rely on a person to drop them.
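One simple way to drop features programmatically is a correlation filter: keep features that correlate with the outcome, and skip any that are near-duplicates of a feature already kept. The sketch below is one such filter under stated assumptions, not a description of how Nudgr does it. The column names, the toy values and both thresholds (`min_abs_corr`, `max_redundancy`) are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr=0.3, max_redundancy=0.95):
    """Keep features correlated with the target, skipping near-duplicates."""
    # Consider the features most strongly correlated with the target first.
    ranked = sorted(
        columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True
    )
    kept = []
    for name in ranked:
        if abs(pearson(columns[name], target)) < min_abs_corr:
            continue  # too weakly related to the outcome
        if any(abs(pearson(columns[name], columns[k])) > max_redundancy for k in kept):
            continue  # nearly the same information as a feature already kept
        kept.append(name)
    return kept


# Toy data, invented for illustration: 1 = abandoned, 0 = completed.
target = [1, 1, 1, 0, 0, 0]
columns = {
    "corrections": [5, 6, 4, 1, 0, 2],
    "backspaces": [5, 6, 4, 1, 0, 3],    # almost a copy of "corrections"
    "idle_seconds": [9, 7, 8, 3, 9, 2],
    "noise": [1, 2, 3, 1, 2, 3],         # unrelated to the outcome
    "constant": [3, 3, 3, 3, 3, 3],      # never varies
}

kept = select_features(columns, target)
print(kept)
```

A filter like this only looks at linear, one-at-a-time relationships, so it can miss features that are useful only in combination. Model-based approaches (for example, comparing model performance with and without a feature) are a common complement.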
A classic example where machine learning is used is that of spam detection in email. You could imagine a list of features that could be used might be the following:
- Presence of certain words
- Email structure
- Frequency of specific terms
- Absence of specific terms
- Total email length
- Time of sending
- Grammatical correctness of text
Some of these will correlate highly with an email being spam, but others might be less useful. Through analysis you might find, for instance, that the time the email was sent and the email length were not good indicators of whether or not an email is spam, and therefore cut them from your list of features. You then might discover that sentence word density is a correlating pattern and add it to the list.
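To make a few of the features in that list concrete, here is a minimal sketch that turns an email’s text into numeric and boolean features. The word list, the feature names and the choice of "unsubscribe" as a term whose absence might matter are all assumptions for illustration; a real spam filter would use far richer inputs.

```python
import re

# Hypothetical word list, chosen only for illustration.
SUSPICIOUS_WORDS = {"free", "winner", "prize"}


def email_features(text):
    """Derive a handful of simple features from the body of an email."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    hits = sum(1 for w in words if w in SUSPICIOUS_WORDS)
    return {
        # Presence of certain words
        "has_suspicious_word": hits > 0,
        # Frequency of specific terms (share of all words)
        "suspicious_word_freq": hits / total if total else 0.0,
        # Absence of specific terms (False here means the term is absent)
        "mentions_unsubscribe": "unsubscribe" in words,
        # Total email length, in characters
        "length_chars": len(text),
    }


example = "You are a WINNER! Claim your free prize now"
print(email_features(example))
```

Whether each of these is worth keeping is then a question for the kind of analysis described above.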
The process never ends
This process can be long and frustrating; it also doesn’t have a fixed end point as we will always be seeking to improve the power of our predictive model. Just as we can look for additional features within the data we already collect, we can also over time expand the number of inputs. Our CTO Doug wrote an article about this process.
The above is also a simplification of the process, but I hope it serves as an introduction to one of the most important parts of building a machine learning model. You can hear more from me about our use of features in the video below.