| [go to the TEA homepage] | [download pdf] | [go back to my homepage] |
|
On-line Adaptive Context Awareness starting from low-level
sensors
by Kristof Van Laerhoven
A thesis submitted in partial fulfillment of the requirements for the degree of Licentiaat in de Informatica Free University of Brussels
May 28, 1999 Thesis supervisor: Professor B. Manderick
Free University of Brussels
last update html document: July 6, 1999 by Kristof Van Laerhoven
Abstract On-line Adaptive Context Awareness starting from low-level sensors by Kristof Van Laerhoven Thesis supervisor: Professor B. Manderick Department of Computer Science
On-line Adaptive Context Awareness startend vanuit low-level
sensoren
door Kristof Van Laerhoven
A thesis submitted in partial fulfillment of the requirements for the degree of Licentiaat in de Informatica Vrije Universiteit Brussel
Mei 28, 1999 Thesis supervisor: Professor B. Manderick
Free University of Brussels
Abstract On-line Adaptive Context Awareness startend vanuit low-level sensoren door Kristof Van Laerhoven Thesis supervisor: Professor B. Manderick Departement Computerwetenschappen
Acknowledgments
I would like to thank the following people for their support and contributions to this thesis:
Vrije Universiteit Brussel May 1999
List of Figures
List of tables
Chapter 1
This chapter will give an introduction to the concept of context awareness followed by a short overview of related context aware projects. An explanation will also be given on how this thesis is situated in the framework of the European Commission Esprit project, TEA. In conclusion, the structure of the proposed architecture is given, along with the further composition of this thesis. 1.1. Characteristics of context awareness The notion of context awareness for devices itself can be split up into three
components (as seen in Figure 1): activity, environment and self. The activity
describes the task the user is performing at the moment, or more generally what
his or her behaviour is. This aspect of context is focused on the user of the
device, and his or her habits. The environment describes what the status is of
the physical and social surroundings of the user. The current location, the
activities in the environment and other extern properties like for instance
temperature or humidity belong to this axis. Finally, the self component
contains the status of the device itself. This third point of view on context
awareness has not been researched as much as the other two, but is a very
interesting one in the scope of cognitive sciences.
1.2. Context Awareness in the past In the past, a lot of context awareness research was often restricted to location awareness, thus providing the device knowledge about location only. The hardware was usually based on either beacons or global positioning systems (GPS) that are able to flawlessly recognise the current location or coordinates, but fail to say anything more about the context. Such an approach can be very expensive and it would be a great deal cheaper to use a lot of different low-level sensors. The combination of these sensors leads to a detailed description of the surroundings, which gives more than just an indication of the location. This alternative is less accurate, but it is superior if the application does not expect the system to be fully aware of its position and a simple hunch of location is sufficient. Context awareness is more than a mere recognition of a geographical location [26 *]; it is about recognising the current situation. Many applications exist that use location awareness to enhance the services provided by a device. Most applications use the knowledge about the location as a sort of guide through buildings, museums, cities and so on [1 *]. An uncommon application is proposed by Eriksson and Finne [8 *], which uses music or soundscapes (sounds, or harmony - landscapes) to "inject" feelings into the user based on location awareness via GPS. More concrete, if the user walks into a ‘bad’ neighbourhood, he or she hears sounds that give the appropriate feelings. Other projects focussed on specific sensors like microphones, photodiodes (light sensors) or movement sensors to derive a (usually basic) notion of context. One particular project by Sawhney at MIT [25 *] focused on solely the output of a microphone to recognise different kinds of noise that might be specific to a location. A DAT recorder was used to record both a training set and a test set of several hours in different environments. A recurrent neural network and a nearest neighbour algorithm were implemented to recognise various "situational contexts" like speech, subway, traffic and so on. The goal of the experiment was not to do adaptive real-time context recognition, but focussed more on the feasibility of context recognition based solely on sound. For some contexts like speech and subway, recognition statistics seem very good. Perhaps the best known device for supporting context-aware computing is the Smart Badge system of Olivetti (described in [3 *]). The Badge is equipped with a standard IR transceiver and a collection of sensors, which allow the Badge to sense and transmit the user's environmental information and user's ID back to the network. In the network, a Badge server is used to analyze the sensor data and provide the environment information to proper users. The Smart Badge also provides enhanced I/O capability, which makes it possible to connect displays, audio, and video devices. One of the best-known applications of the smart badge is tracking people inside a building, with the help of ‘smart doors’ that register which people enter the rooms. 1.3. Technology for Enabling Awareness The approach of this thesis on achieving context awareness fits in the framework of TEA [9] (Technology for Enabling Awareness), a project funded by the European Commission Esprit Program. Researchers from four laboratories in four different countries joined forces for this project: Starlab NV/SA in Belgium, TecO (University of Karlsruhe) in Germany, Omega Generation (University of Bologna) in Italy and Nokia Research in Finland. TEA’s objective is to research context awareness in the field of handheld and wearable devices such as mobile phones, personal digital assistants or even laptops. Low-cost and widely available sensors on physical parameters and information are the focus in this project. Starting with this information, the TEA system should be able to determine in what situation the user is at every moment. The aim is, in the words of the TEA document: "To produce an affordable add-on component to existing portable communication and computation devices that adds awareness of location and activity, and thus enables context sensitive device control." The accomplishments that have been given birth after a first year are
promising. A prototype hardware board has been created to control the sensor
outputs in a flexible way. Sensors can easily be added or removed and the
processing to be performed on the collected data can be done via a PC connected
to it with a serial cable. The presence of the serial connection, combined with
the small size of the hardware board, makes that the whole acquisition hardware
remains portable. Furthermore, experiments were prepared and executed that
merged this hardware module with existing devices such as a mobile phone and a
personal digital assistant (PDA). These experiments, demonstrating the
possibilities of context awareness in the field of mobile communications, were
both promising and encouraging towards the future of the project.
Figure 2: Approach of TEA (and this thesis) to reach context awareness.
The fact that TEA starts with the raw data of low-level sensors is not the only unusual property of the project. TEA also envisions the use of adaptive algorithms to reach adaptive context awareness, which makes it possible to learn and adapt to specific contexts for a specific user. This results in a context aware system that will adjust itself towards the user and the contexts he or she will frequently visit, thus improving the quality and speed of recognition of these contexts. Another requirement is that human supervision must be kept to a minimum, since the system should be designed to bother its user as little as possible and thus to obtain a maximum degree of user-friendliness. Consequently, augmenting the autonomy of the device increases its user-friendliness. A context aware device can be mapped onto a two-dimensional plane that
describes the amount of involvement required from the user and the level of
situational awareness. Sunglasses are an example of not very intelligent,
passive devices. They are not very intelligent because they have no clue whether
the sun is shining or not. They are very passive because they operate usually in
an autonomous way, requiring very little user-involvement – the user only has to
decide whether to put them on or off. More intelligent sunglasses exist that
adapt themselves to the brightness of the sunlight; these would be placed higher
on the situational awareness scale, but they remain as passive as their
less-intelligent versions. The objective of TEA is to make an adaptive system
for a device to make it as aware as possible and at the same time as passive as
possible towards its user.
Figure 3: Context Awareness and Personal augmentation (source: Nitin Sahwney, MIT Media Lab)
Appliances are becoming increasingly ubiquitous, increases in the use of portable computers and cellular telephones are clearly visible and the image of having a portable device is not anymore reserved to a restricted group of people. Research and development in information technology is moving away from desktop based computing towards more portable and task specific applications. Most of these portable devices should behave differently depending on what situation they and their users are in. Along with the sales-figures and the popularity of these devices, the importance of the usability and functionality is increasing as well. Appliances that know more about their environment will be able to function better and will give their users a better service. A device that knows about its own environment and that of its user could transparently adapt to the situation, leading to the idea of the invisible computer as discussed by Weiser [30 *]. This in turn is a step towards a disappearing interface as demanded by Norman [20 *]. Improving the interaction with such a device is not possible without augmenting its context awareness. The appliance will be capable of giving better defaults for the situation and could automatically make choices that the user normally would have to make. The application model is thus changed in structure as illustrated in Figure
4. The classic structure has a model of the user and a model of the context,
derived from solely the input of the application. In contrast, the context aware
application contains a context model that is now altered by the context aware
system, providing the application new and possibly convenient information.
Consequently, the context model has a higher value, resulting in a better
application.
Figure 4: The situation of Context Awareness in the application structure
Apart from the enhanced user-device interface, context awareness can also improve the quality of services that already exist, like providing context-related information, behaviour and communication. Another reason to implement (more) context awareness is the ability to add sensors that can enable the device to perform monitoring and extended sensing. A simple GSM could be turned into an appliance that observes one’s health if some biosensors were added. The apparatus would not have to stick to mere monitoring, but could also take appropriate actions if the user’s health would be endangered. Other possibilities are security- and safety- related applications. Chapter 2 describes the framework where the on-line adaptive context learning architecture is located in. Chapters 3, 4, 5 and 6 describe the layers or modules of the hybrid system (as illustrated in Figure 5) that is proposed in this thesis to transform raw sensor data to context profiles. Chapter 3 deals with various approaches and algorithms for clustering the
sensor data. Chapter 4 presents a hierarchy of clustering algorithms that is
able to solve some of the problems that emerged in Chapter 3. Chapter 5
introduces a classification layer that labels the clusters from the previous
layer with short context descriptions (context profiles). Chapter 6 describes
the final layer, which provides an element of supervision to the system.
Figure 5: The parts of the architecture discussed in chapters 3, 4, 5 and 6.
Chapter 7 evaluates some possible improvements and gives an outline of
performance issues for the system. Finally, chapters 8 and 9 will conclude with
future work and the conclusions.
Chapter 2
This chapter will focus on both the basis and the goal of the system. A concise explanation of the concepts of sensors, cues and contexts will be followed by more practical information on how the experiments were conducted and initial analysis results. An elaboration on how the adaptive system fits into the bigger framework of the TEA project will conclude this chapter. The source of the system’s input is the set S of basic sensors. In theory, a sensor can be just about anything: a light sensor, a microphone, a global positioning system (GPS), a camera or even a human. In literature, this last kind of sensor is usually referred to as a logical sensor, while the other examples are called physical sensors. The only requirement a sensor should comply with is that it should periodically produce a value that resembles a measurement of a physical phenomenon in its real world context. The formation of such a signal is a digital and unstructured sub-symbolic value. A context description resides usually on a higher level than the outputs of the sensors. This section will provide a general overview on how low-level sensor data can be transformed into a high-level context description. At every moment in time, the sensors provide the system with a set of values that give a description of the context at that time. However, this description is very difficult to interpret – and even unreadable as the number of sensors increases - for humans. This is not a problem if the application does not demand user-involvement or user supervision or if it is merely used for controlling purposes. If the user has to be kept informed on the context-profiles, however, it is necessary to build a bridge between the human and machine context description. The solution used in this thesis is to transform the sensor readings to a context profile, for instance "in a busy office" in stead of (light=90%, microphone=77%, ..., infrared=89%) and "in a meeting" in stead of (light=90%, microphone=15%, ..., infrared=8%). The output of all sensors should thus lead to a context profile that was pre-defined by the user. This profile is usually a crude generalisation that leaves out a lot of other information, so that the user can easily and quickly define a certain context and recognise it later on. Profiles like "running", for example, will probably only need the data from infrared and acceleration sensors – the contribution of all other sensors would be minimal. Another property of the context profiles is that they have to reflect the application they are used for. It is in general very common to have contexts that might overlap, like for example "running" and "at the park". If the application should demand that one context must be chosen, it will depend on the nature of this application which context will finally be elected as the winner. The use and recognition of exclusive or non-exclusive context profiles depends only on the application the system is intended for. The architecture in this thesis is able to implement both, but is restricted to non-overlapping context profiles. Overlapping contexts and hierarchies of contexts are treated in the future discussions chapter (Chapter 8). To recapitulate, the objectives of this system are:
2.2. Acquiring data from sensors A small, portable hardware device called "TEA Sensor-Board" was responsible
for acquiring the data that was used in this thesis. It was developed within the
TEA project, to provide a prototype that enables to acquire raw sensor data (see
chapter one for a more complete description). The sensor board is equipped with
many different low-level sensors and a chip that digitises the analog signals
from the sensors at a certain rate. This frequency can be different for every
sensor and depends on how quickly a value of a certain sensor can change. An
acceleration sensor for instance is probed about 100 times per second, while a
temperature sensor is probed only once a second. This means it could be
theoretically possible to create a set of input values from the sensors every
millisecond, although this probably would be too frequent for the system.
It is not necessary to update the input of the system that often, and in practical applications a second will be enough. Beside the raw sensor values, it is also possible to use the mean, the variance or another value that describes the output over that second for sensors that are highly time varying such as the accelerometers, infrared sensors or microphones. This will partially make up for any critical values that could be thrown away by enlarging the update period. This extraction of information from raw sensor data will be discussed in section 2.3. (Extracting Cues). Used sensors include light sensors, accelerometers, infrared sensors, microphones, CO gas sensors, temperature sensors and atmospheric pressure sensors. The TEA board was connected to a fast laptop computer during the experiments, rendering the whole system mobile to carry it around from context to context. On the laptop, monitor software (TEAreader, programmed by one of the TEA partners) was installed to read the raw data and write it out to a data file on the harddisk for later analysis. Before a neural network clustering layer can process the data, the sensor values have to be copied into an input table that groups the numbers by timesteps. This input table will also contain cues from the frequently probed sensors, which will be discussed later in the next section. Figure 6 shows the major steps in the acquisition process. Every sensor that is provided with the necessary power will spit out a large amount of values over time. Some of them will produce such a large stream of data that it is almost impossible to use this directly as input for the final recognition system. The values of the light sensor can be replaced (as described earlier) by the mean and the variance. Other transformations and filters can also be applied for this purpose. Another – and perhaps better - example is the microphone since it produces an even larger amount of values. Chen [7 *] has researched the output of this particular sensor and noticed a few values that can be calculated in real-time and that are able to give a very clear idea of the current context. Filters and transformations of the raw sensor data are also examples of possible cues. These values are called cues (sometimes the term features is also used in this context), because they indicate what kind of data the sensor is producing without actually giving directly the output. The cues can be interpreted as values that are the results of often small and quick calculations on data that is sent very frequently. This method does not just enable the learning system to be as near real-time as possible, it also optimises its performance since the cues usually give a better interpretation of the data. Although the values from not all the sensors are being described by cues, it is possible to state that the recognition system starts with cues instead of sensor data. Even if the values from a sensor are not pre-processed to a cue, it could be called a cue if other sensors were pre-processed. The mere fact that the choice was made to let the value be a cue of itself justifies this. 2.4. A first analysis of the data A first visualisation of the data is established by just plotting the time
series of all sensors in the same plot. This graph is essential in evaluating
the acquisition process in two ways. First, it will usually give a pretty good
clue whether the system is able to distinguish contexts from each other or not,
since the different contexts show different patterns in the sensor readings.
Secondly, it is a good method to visualise and evaluate the performance of a
particular sensor. The plot in Figure 7 shows an example of 8 sensor outputs
during 832 seconds, taken in five basic contexts: "inside a dark office",
"inside an artificially lit office", "moving inside an artificially lit office",
"outside still" and "outside moving". The TEA sensorboard was turned off after
each context was ‘recorded’ and turned back on when it was brought in a new
context. This is mainly the reason why the transitions of contexts are very
short or even invisible. The time series plot is also a good tool to predict
which contexts will be hard and which contexts will be effortless in learning:
the last two contexts will probably demand more resources (time, memory and/or
processing) than the others. In this graph is also the importance illustrated of
temporal pattern recognition: some of the sensors in the first contexts are
harmonic signals that could be recognised by specific algorithms that are
time-sensitive.
Figure 7: Example of a time series plot of sensor outputs in five different contexts.
When a sensor produces a value, this value could be considered as a component
of a vector, which has a dimensionality equal to the number of available
sensors. This leads to a second method of visualisation, in which each point in
an Euclidean space could represent this vector. Plotting this space requires
that the dimension of the space is less than - or equal to three. Thus, some
sensors have to be ignored if there are more than three sensors available. The
next plot (Figure 8) shows a three dimensional phase space, in which the input
vectors from the first three contexts of the picture above are traced. The last
two contexts were not traced, because their vectors were placed very far away,
and there was no observable distinction between them. The reason for this
correlation can be deduced from the time series plot: the two contexts have too
many values in common to be separable in this clustering visualisation.
Figure 8: Phase space plot of the data coming from three contexts.
2.5. Architecture of the TEA-system The TEA system consists of the four layers that are depicted in Figure 9. The first layer is the sensor layer, which represents the acquisition of the raw data from the low-level sensors. In the second layer, the output of every sensor is described by one or more cues, which are usually more abstract than the sensor data itself. The third layer is probably the most difficult one: mapping the cues to a context profile that was given by the user, since it involves user-interaction. Finally, the fourth layer uses the context information to change the behaviour of the application or device according to the current context. The result is four-layer system that is "context aware". This thesis will focus mainly on the third layer since it is there that the learning system is situated. This layer is probably the hardest one to implement since it is required to be:
Currently, the sensor layer is the only one that has been integrated into hardware, while the other layers are implemented in software via a portable PC, connected to the hardware component. The objective is to eventually integrate all layers into hardware. There is no sharp distinction between sensor data, cues, and contexts. In
fact, it is very well possible to define the previous context as a sensor to
obtain a recurrent structure in the architecture.
Chapter 3
As seen in the previous chapter, the values from all sensors are available to the system at any moment in time for processing, and these values can be treated as components of a –usually high dimensional- vector. If there were only three (or less) sensors present in the system, the outputs of these sensors could be plotted in a three-dimensional phase space by drawing a point with the values of the sensors as x-, y- and z-coordinates. The resulting phase space plot could be of some value for visualisation, but the limit of three sensors makes it practically not useful. It does illustrate however, the benefits and possibilities of clustering. This chapter is divided in five parts:
3.1. Clustering algorithms in general Fritzke [11 *] defines clustering as searching a partition of the data into subgroups, such that the distance of data items within the same cluster is small and the distance of data items stemming from different clusters is large. Many different kinds of clustering algorithms exist depending for instance on whether the number of clusters is pre-defined or not. Two fundamental and closely related terms from computational geometry are
important to understand in the context of clustering: Voronoi tesselation and
Delaunay Triangulation. The first concept is the partitioning of a space in
Voronoi regions (hence the name tesselation), which are the sets of all points
which are closest to the point each Voronoi region belongs to. If one connects
all pairs of points for which the respective Voronoi regions share an edge, the
Delaunay triangulation will be the result. Figure 10 represents a point set in
 ², its corresponding Voronoi tesselation and Delaunay triangulation.
These two expressions will be used later on in this chapter.
Figure 10: a) points in  2, b) corresponding Voronoi tesselation and c) corresponding Delaunay triangulation. (source: [11 *])
Clustering algorithms (and learning algorithms in general) can be divided in two classes: those who are ‘eager’ to learn, and those who are ‘lazy’. The first class wants to generalise before the user made any query, while the second one would delay the processing until a query is encountered. Lazy clustering algorithms are not very feasible for this purpose, since they generally require a lot of memory to store the training examples and more computation time when it needs to predict the target value for a new query. This could damage the real-time property of the learning system. The user query in this case is approximately every second, so it would be useless to postpone the clustering until a query is made. Another reason to choose for immediately clustering the incoming data is the high dimensionality (or the high number of sensors) of the input. The basic requirements of this thesis demand to implement an eager learning system, but this does not mean that some modules can be implemented by using lazy algorithms. Every clustering algorithm works with a measure of similarity and dissimilarity to estimate if a certain pattern belongs to a cluster or not. Usually the Euclidean distance between the two vectors is used for this purpose. One of the simplest algorithms is the sequential leader algorithm. Every cluster has a so-called prototype -or codebook vector, which is used for comparison with the current input vector, and which represents the prototype value of the weights of the cluster. If the codebook vector and the input vector are similar enough, the input vector belongs to the cluster of that codebook vector. In the case that no such cluster can be found, a new cluster is constructed with the input as its codebook vector. The leader-clustering algorithm is usually very fast: in the best case, all n inputs belong to the same cluster, resulting in time complexity n. A worst case scenario would result in every input being not similar enough to its predecessors, which will take (n!) time steps.
Nevertheless, the leader clustering algorithm is not adaptive enough to use it on data coming from the low-level sensors. This data will always contain some noise, which will result in a poor performance of the algorithm. Clustering algorithms were originally designed for handling discrete input
vectors, but they can also be extended to work on time series. Various
approaches exist that enable the clustering algorithms to compare time series,
but the one that is most used is the calculation of the Euclidean distance
between two time series:
These approaches need to define a suitable time interval however, which is not very easy to determine. It should be long enough to hold specific time-related patterns, but if this interval is too large, the performance of the adaptive system will go down rapidly. The sensor outputs have usually a graph, which is too complex to be recognised by this method. 3.2. Sensor weighting and other pre-processing techniques To optimise the system’s ability to distinguish as many contexts as possible, a large variety of sensors is usually applied. This, however, adds a few difficulties since not all sensors are equally important and they do not necessarily have the same units and the same ranges of values. One solution could be attaching weights to each value, but the next difficulty will be finding a method to fix these weights in an appropriate way. Simple applications might rely on the user to handpick the weights, but this method becomes harder and harder when the dimension of the input gets higher. An algorithm might thus be required to find a suitable set of weights and adapt it if necessary. Rosenstein and Cohen (in [23 *]) tackled this problem by first clustering all outputs of sensors individually and constructing a signature afterwards that enables unit- and scale-independent comparison. The signature stores the pattern of similarity between the new time series and the alphabet that was created in the clustering process. After a clustering algorithm is chosen, another decision is to be made: should the input values be standardized, normalized, rescaled or maybe pre-processed in any other way? The difference with this type of pre-processing and the cues from chapter 2 is that the cue is calculated from the raw data that is omitted when everything is presented to the neural network. The calculations that will be discussed here will be applied on all sensors and cues without reducing the information itself. A lot of clustering algorithms (including the SOM and RSOM discussed in the next section) depend highly on Euclidean distances. The contribution that an input makes will depend heavily on its variability relative to the other inputs. If one input has a range of zero to one, while another input has a range of zero to one million, the contribution of the first input to the total Euclidean distances will be negligible compared to the second input. In this case, it is essential to rescale the inputs so that their variability reflects their importance, or at least is not in inverse relation to their importance. It is very common to standardize each input to the same range or the same standard deviation if better prior information is not available. If some inputs are known to be more important than others, it may help to scale the inputs such that the more important ones have larger variances (see sensor weighting). The values from the sensors are already limited between 0 and 127, so it is not actually necessary to pre-process the input values for the clustering to perform better. However, in these experiments not only the normalized sensor values are taken, but also cues like the mean and variance of the rapidly changing sensor values over the period of approximately a second (see the previous chapter). Therefore, all values are rescaled to the interval [0,1] to make things less complicated. An artificial neural network refers to a (usually large) number of simple,
interconnected computational elements: neurons. In this respect, it is based on
the biological nervous systems. A typical neuron sums the weighted inputs and
passes the result of this calculation through an output link to another neuron.
This result is often initially put through a so-called activation function,
which is usually a non-linear function (e.g. a sigmoid) that limits the final
output of the neuron. A bias is also included to provide a threshold that is
subtracted from the sum of weighted inputs, although it could also be
implemented as a connected neuron with a constant value. Activation functions in
a hidden (or intermediate) layer are needed to introduce the possibility of
solving non-linear problems into the network, while they are needed for the
distribution of the target values in the output layer. Most neural networks have
some sort of training rule whereby the weights of the connections are adjusted
on the basis of input data. In other words, neural networks learn from examples
and exhibit some capability for generalization beyond the data from the training
phase. Another property of neural networks is their great potential for
parallelism, since the computations of the neurons are largely independent of
each other. Very good introductions about the theoretical aspects of artificial
neural networks are given in [13
*], [17 *], [18
*] and [19
*].
Connectionist methods are widely used in signal processing, certainly when the input signal reflects real-world data. The data from sensors have several properties that make it very suitable for artificial neural networks. One of the biggest problems in signal processing is the presence of noise, which is inherent to the real-world data and this often causes the ‘classic’ algorithms to fail. Artificial neural networks are known to perform very good under these circumstances. Apart from these benefits, some difficulties will remain when using neural networks. Especially the curse of dimensionality will be mentioned (and dealt with) later. The training phase might also be a problem if the objective of the learning system is to need as little intervention of the user as possible. The classic multiple layer perceptron with backpropagation training needs a training phase, after which the network becomes fixed. The result of this method is that the user needs to frequently update the network by re-training it every time a context changes a bit. It would be better if the user had to indicate and train the contexts once, leaving the system to adapt itself further after this phase without needing supervision of the user. The multiple layer perceptron is also very slow in training and if the training contains some errors, the performance of the network will go down significantly. An unsupervised neural network that functions better in these domains will be thoroughly discussed in the next section. 3.4. Mapping with neural networks A first approach would be to directly map the combined sensor data to a space of context profiles. Since it is initially not necessary to know any targets, an unsupervised neural network is used to cluster the input vectors. The resulting (discrete) output space then becomes a lookup table for the actual outcome. A time-sensitive variant of the Kohonen Self-Organizing Map - "the Recurrent Self-Organizing Map" - was chosen for this purpose. 3.4.1. The Kohonen Self-Organizing Map The Kohonen Self-Organizing Map [15 *] (KSOM or just SOM) is an unsupervised neural network: this means it does not need any feedback from a teacher (or the environment as a teacher), unlike many other neural networks. The Kohonen SOM is not an ordinary neural network and the implementation of its neurons differs strongly from that of the usual neurons that were previously described. The Kohonen network (1982) is based upon earlier work of Willshaw and von der Malsburg (1976). The KSOM usually has an output layer of interconnected neurons that are fully
connected to the input layer, so every neuron from the output layer is connected
to every neuron from the input layer. This connection has a certain value, which
is just like in the domain of the supervised networks called a weight. Every
neuron from the output layer has consequently as many weights as the neural
network has inputs. The output neurons are furthermore ordered in a particular
way, usually a two-dimensional grid, where each neuron has ‘neighbours’ (to the
left, the right, up and down in the case of a grid). The KSOM was inspired by
the way in which various human sensory impressions are topographically mapped
into the neurons of the brain. Figure 12 illustrates the structure of a SOM,
while Figure 13 shows a typical organisation of the output layer of the KSOM;
the colour of each square indicates during what context it was activated.
To introduce order in the output layer, this algorithm is modified in such a
way that the neighbouring neurons of the winner are also allowed to update their
weights, but not as much as the winner itself:
where a is the learning rate and h depends on the distance to the winner. This algorithm is responsible for creating and adapting neurons that are
trained to trigger for a specific input: the values from the high dimensional
input space are mapped to a discrete output space (the grid of neurons). The
weights of the output neurons are very similar to the codebook vectors of the
leader clustering algorithm, but in this case they are frequently adapted. It is
thus very well possible that the weights are different from all previous input
vectors. Another big difference between the two algorithms is that the SOM has a
fixed structure that does not grow in time, while the leader-clustering
algorithm could theoretically keep on adding clusters.
The KSOM (like most of its variants) has limits that have a bigger impact on its performance, though. The inputs need to remain between a minimal and maximum value for the clustering to be successful and reliable. If the input has no upper and lower limits, locations of the clusters could keep on changing in time. Another well-known problem of the Kohonen network is that it is possible it will unlearn previous learned clusters, certainly if the learning rate is chosen relatively high. It is very common to implement the KSOM so that the learning rate and the neighbourhood radius get gradually smaller until they reach a certain minimal value, resulting in a ‘frozen’ status. It is not a very good idea to diminish the learning rate or the neighbourhood radius to zero, however, because this means that the neural network will loose its adaptive behaviour. Using extremely small learning rate and neighbourhood values is also no solution since this may critically hinder the training and thus the adaptivity. See section 3.4.3. (Stability versus Plasticity) for an elaboration on this subject.
The algorithm in Table 2 contains the necessary steps to produce a Kohonen
Self-Organizing Map. The neighbourhood function has usually a bell-shaped curve.
A common choice for this function would be
with ri and rc the coordinate vectors of the current neuron and the winning neuron. The function plot is bell-shaped around rc with a width depending on s . Step 6 in the algorithm is needed to order the codebook vectors of the neurons topologically, and transforms a normal competitive neural network into the Kohonen SOM. Some discussion exists about the necessity of the presence of topology-preserving abilities in the algorithm, but it has benefits as will be shown later. The Kohonen SOM has on the other hand also a lot to offer. Visualisation is one of the reasons why it is used frequently. Apart from the graphs that represent the output layer, there is another kind of plot that gives more information about the network’s point of view - this method will be explained at the end of this section. 3.4.2. Curse of Dimensionality The potentially high dimensionality of the data is a problem with this approach. As more sensors are added to the system, the dimension of the data gets higher, and the clustering algorithm needs more resources to be able to map the bigger input space to the output space. Unless the neural network is improved (by for example adding a – usually substantial – amount of neurons), the consequences are that the algorithm gets slower and less fault tolerant. What causes this effect? Suppose that the SOM is able to place the weights of the output neurons equally over the input space. The average distance from a random point of the input space to the nearest weight vector could be measured as the goodness of the representation: the shorter the distance, the better is the representation of the data. The total number of units required to keep the average distance constant increases exponentially with the dimensionality of the input space. This also causes networks with lots of irrelevant inputs to behave relatively badly: the high dimension of the input space causes the network to use almost all its resources to represent irrelevant portions of the space. This problem is well known in the domain of neural networks (or even more generally in the field of machine learning) as the "Curse of Dimensionality" (see [5 *] and [19 *]). 3.4.3. Stability versus Plasticity One of the major problems with KSOMs is to keep the map adaptive without ‘overwriting’ already learned instances. If the system is created to remain adaptive (which usually means it has a fixed, nonzero learning and neighbourhood radius) its neurons could gradually change to other clusters, thereby unlearning previous stored clusters. If, on the other hand, the KSOM’s learning rate and neighbourhood radius diminish over time, the system will converge to a stable one without unlearning. This method is suggested by Kohonen [15 *], but an immediate result of this is that the system will end up in a non-adaptive state. This trade-off feature is called the stability-plasticity dilemma. A simple and common method to overcome this problem is tweaking the learning rate and neighbourhood radius so that the overwriting becomes very unlikely. In the initial phase, it is still possible to present various contexts with a larger learning rate and neighbourhood to reach an optimal spreading of the weights of the competitive layer and to avoid "dead regions" on the competitive layer. In section 3.4.6., the adaptive resonance theory will be discussed which is said to actually solve the stability-plasticity dilemma. 3.4.4. Temporal variations on Kohonen’s SOM The input data from the sensors doesn’t just consist of vectors that can be presented at random to the KSOM network, the sequence in which they appear is equally important. The Kohonen Self-Organizing Map in its standard form is not able to distinguish time series. However, a number of approaches exist to improve the KSOM in order to enable it to recognise time series. One approach is to add temporal information to the input of the network by means of exponential averaging the input time series or using tapped-delay lines. Another method is to use layered or hierarchical KSOMs where the upper map needs to capture the spatial dynamics of the previous layer. Euliano and Principe modified the KSOM to a Spatio-Temporal Self-Organizing Feature Map [10 *], which uses temporally and spatially localized neighbourhoods. Each time a node wins, it starts waves of activity that flow over the output layer and which are attenuated over time. These waves can reinforce each other if they correlate temporally. The activity wavefronts are used to enhance a node’s possibility of winning the next competition. The Temporal Kohonen Map (TKM) uses output neurons that do not immediately restore their values to zero, but that gradually lose activity. Another approach - based on the TKM - is to add a context layer to the SOM’s input architecture, enabling it to memorize past inputs, similar to the supervised Elman or Jordan neural networks. Each neuron has, apart from a weight vector, a recurrent difference vector, which keeps track of past input vectors. The map is able to store information about the change of the input vector, and it can thus cluster the time series temporally. The resulting network, the Recurrent Self-Organizing Map (RSOM) [16] [14] [29], is able to learn time-varying patterns and has proved valuable in various studies. The RSOM uses difference vectors that are associated to each neuron in
the network. The difference vector yi(t) of neuron i of the map VM
is updated as follows, with x(t) the given input:
where wi(t) is the weight vector of neuron i at time t and
a
Î ]0,1[ is the weighting factor (WF). The adaptation of
the weight vectors is changed into:
where iÎ VM, g is the learning rate and hib is the neighbourhood function. 3.4.5. Tracking the activation When keeping track of the number of activations per neuron, the result will
be a landscape-like surface with the exact same size of the competitive layer.
The activation landscape proves to be an efficient visualisation method, which
gives a lot of information about the clustering – not just the size and position
of clusters, but also the density becomes visible. The main reason why hills are
formed on this surface, is the topology-preserving ability of the Kohonen Map.
Figure 14 a: The SOM output layer. b: The corresponding activation surface.
In Figure 14a, the competitive layer of an already trained SOM is depicted, while the activation surface for the same map is plotted in Figure 14b. Four clusters are located at the edges of the SOM, one is located in the middle. The right side of the map is divided in two clusters, although these are more or less merged in one hill in the surface plot. It is also clearly visible in Figure 14b that the middle cluster is a combination of two or three hills. When the sensor outputs are mapped onto this surface, one could deduce what the probability is that the learning system will be able to classify this. If it is mapped in a valley, it will be very likely that the learning system will have difficulties in establishing correct context recognition. It is possible to mark the response of the Self-Organizing Map to the current input on this activation surface or activation landscape. A problem is that this surface will still adapt itself during the time these inputs are fed to the network. If the adaptive behaviour of the SOM is freezed (by making the learning rate equal to zero) after a first pass of the data however, the response of the SOM in a second pass can be visualized by drawing marks on the surface (as depicted in Figure 15). Since the height of every x-y position is determined by the number of times
the neuron on those coordinates has been activated, the hills or bubbles on the
activation surface plot represent clusters of neurons which all respond to a
similar input. In fact, it would be more correct to call the bubbles on the
surface clusters of clusters, since every discrete position on the landscape
represents a neuron of the KSOM’s competitive layer, which is in fact already a
cluster. The next step in the recognition process could be described as
labelling the hills on the activation plot, in stead of putting a label on every
discrete position (neuron). The latter method would generally need more
processing time and memory. This classification will be discussed in chapters 4
and 5.
Figure 15: Tracking the input while walking from the inside to the outside context.
The activation landscape is an excellent way to track the winning neurons and visualise what is happening, comparable to the phase space graph discussed in chapter 2. This time, however, there is no limit in the number of sensors (or input dimension) since the input dimension has already been reduced by the Kohonen Self-Organizing Map. 3.4.6. The Adaptive Resonance Theory Carpenter and Grossberg solve the stability-plasticity dilemma in their Adaptive Resonance Theory (ART) [6] by introducing top-down mechanisms that enable the system to stabilise itself. They designed a network which forms clusters and is trained without supervision: the Carpenter/Grossberg classifier [17]. The architecture embodies (like the SOM) a parallel search scheme, which updates itself adaptively as the learning process unfolds. After the learning self-stabilizes, the search process is automatically disengaged, and input patterns directly access their recognition codes without any search. The consequence is that recognition time does not grow as a function of code complexity. The architecture possesses a self-scaling property, which enables its codebook vectors (or ‘prototypes’ in the vocabulary of ART) to form. They detect and memorize predictive configurations that are derived from the set of all previous input patterns. To keep the system stable yet plastic, ART has some variables - priming, gain control and vigilance - build in the architecture. Priming and gain control are needed for code matching and self-stabilization, while vigilance determines how fine the learned categories will be. If the vigilance increases, then the system will automatically search for finer learning categories. Vigilance is actually a threshold, that determines how close a new input pattern must be to a stored instance to be considered similar. It is usually kept in a range between 0 and 1: a value near 1 requires a close match and smaller values accept a poorer match. ART comes in several flavours, supervised as well as unsupervised. The unsupervised ARTs are basically similar to many iterative clustering algorithms (such as the leader-clustering algorithm discussed in the beginning of this chapter) in which each case is processed by these steps:
where "nearest" and "closer" can be defined in many different ways. In ART, the framework is modified slightly by introducing the concept of "resonance" so that each case is processed by:
"Resonance" is just a matter of being within a certain threshold of a second similarity measure. A crucial feature of ART is that if no seed resonates with the case, a new cluster is created as in the leader-clustering algorithm. As a consequence, the number of clusters will grow in time, depending on both the threshold and the distance metric used to implement similarity. The feature of adding a cluster is said to solve the "stability-plasticity dilemma". The current training case is stored in a short-term memory (STM) and cluster codebook vectors are part of "long term memory". The two stages of finding the nearest seed to the input are performed by an Attentional Subsystem and an Orienting Subsystem. The latter of which performs "hypothesis testing", which refers to the comparison with the vigilance threshhold. This does not refer to hypothesis testing in the statistical sense. The ART learning method is stable, which means that the algorithm converges; it does not mean however that the clusters are insensitive to the sequence in which the training patterns are presented. The ART network can perform well with perfect input patterns, but even a small amount of noise can cause problems. Due to the noise, the level of vigilance can be set too high resulting in a high number of stored instances, which can rapidly grow until all available nodes or memory is used up. Modifications to this algorithm are necessary to perform the clustering in this application since noise will always be present in the output of the sensors. These could include adapting weights more slowly and changing the vigilance threshold during training and testing, rendering the system either very slow, or very complex, and this is why the ART network was not used as the clustering algorithm. It might be worthwhile keeping in mind how it overcomes the stability-plasticity dilemma, though. 3.4.7. Other unsupervised neural networks A large variety of unsupervised neural networks exists. A compact list of some of the most prominent kinds of unsupervised neural networks is included in this section plus the reason why they were not considered in this thesis.
In the first experiment the objective is to test the robustness of the context classification when using adaptive cue to context mapping, based on the Kohonen SOM network. In a training phase, data was collected while taking the TEA board through a sequence of contexts, related to "being inside in the hand", "being inside on the table", "being inside in a suitcase", "being outside on a table" or "being outside in the hand". Acquiring data from a sequence of contexts has the advantage that the experiment will be able to demonstrate not only the clustering of a context profile, but also the sensitivity to distinguish contexts. The KSOM clustering algorithm successfully mapped sensor readings into a two-dimensional grid in which areas of high activation correspond to these contexts. Figures 16a till 16e show the clustering during training of the self-organizing map for five passes through the training data; one pass (or epoch) is recorded data from all five contexts. This could be viewed as training the Kohonen network by putting the device in the five contexts, and this for five times in a row. Each iteration consists out of approximately five times 2 minutes per context (transitions were not counted). It should be pointed out that parameter settings in the clustering algorithm (i.e. neighbourhood radius and learning rate) sometimes lead to the creation of spurious clusters, but in general the parameters do not need time-demanding fine tuning. The initial (random) weights are very important to the topology of the KSOM, but this feature is actually not necessary for the clustering itself. For the visualisation, however, this could make a difference. The labelling of these clusters in this experiment was added by hand, but in a next chapter, this will be done in a more realistic, feasible and automated way. After this first phase, the recognition of these contexts was tested when taking the board through similar (but not identical) context changes. In informal evaluation, a robust classification can be observed when attempting to follow similar contexts as the ones in the training set. Figures 17a to 17e show the tracking of the inputs of the second phase on top
of the ‘frozen’ surface from the first phase.
Figure 16a: The output layer of the KSOM after one epoch.
Figure 16b: The output layer of the KSOM
after two epochs.
Figure 16c: The output layer of the KSOM after three epochs.
Figure 16d: The output layer of the KSOM
after four epochs.
Figure 16e: The output layer of the KSOM
after 5 epochs.
Figure 17a: Tracking the input on a frozen
activation surface during the first context.
Figure 17b: Tracking the input on a frozen
activation surface during a second context.
Figure 17c: Tracking the input on a frozen
activation surface during a third context.
Figure 17d: Tracking the input on a frozen
activation surface during the fourth context.
Figure 17e: Finished tracking the inputs on an activation surface during five contexts. Chapter 4
When the number of sensors increases, the size and complexity of the Self-Organizing Map (and a lot of other clustering algorithms) might become too large for an on-line application. In that case, it is possible to use multiple smaller SOMs for only subsets of the total set of sensors. It is trivial that the choice of the subsets will be crucial to the final performance of the clustering. 4.1. Dealing with high dimensionality To evade the curse of dimensionality, a large variety of approaches exists, but one of the most effective ones is by simply reducing the dimensionality of the problem by splitting the inputs up into smaller segments. If the subsets are chosen in an appropriate way, the combined processing resources needed for these smaller inputs will be far less than the resources needed by the initial system. This divide-and-conquer technique will usually need an additional processing layer to combine the results of processing the subsets of the input data. In this particular case, the solution is to use intermediate mappings and to exclude sensors for each mapping. Rather than using one big neural network that needs all sensor readings as input, multiple smaller neural networks are used that start from a subset of the set of all sensors. There is no limit in the number of subsets that have their own clustering algorithm, but using multiple algorithms in parallel can be inopportune in the event that the processing must be done in soft real-time. A trade-off can be observed between clustering precision and necessary resources to obtain real-time processing. This on-line specification is mainly necessary to limit the amount of user involvement during the use of the system. The possibility exists to use more than one kind of SOM or even more than one kind of clustering technique in parallel. One can for instance choose to cluster sensors that are not frequently being updated with the regular SOM, while the RSOM or a similar algorithm could process data that possibly contains time dependent patterns. Again, the choice of these algorithms can be very important to the overall performance of the system. An additional benefit of this approach is that it is not necessary to use very large output layers, since there will be generally less clusters in the output space. Reducing both the size of the input vectors and the size of the output layers should lead to a faster and more precise creation and adaptation of the clusters. These clusters are, however, numerous compared to those of the initial approach, and it might be necessary to – again – use a clustering algorithm if the output space of the first layer is too large. This method results in a change in the output of the clustering layer, since
there are more than one output layers (or two-dimensional cluster spaces). When
using just one KSOM, the result of the layer is equal to the positions of one or
more neurons in the output layer. In this new approach the output is made out of
positions in multiple output layers, possibly too many to handle directly. In
that case a second clustering algorithm could prove useful – a smaller and
simpler one, like the k-nearest neighbour, is preferred for this purpose in the
experiments.
Figure 18: An example of the architecture of the system so far.
An imaginary implementation of this approach is illustrated in Figure 18. Notice that it is possible (and feasible) to use the output of a sensor for more than one sub-clustering algorithm. The weights are only drawn for one neuron in the grid per clustering algorithm to enhance the visualisation, normally every neuron should have links to these sensors. A trade-off exists between having a slow, bulky algorithm that is able to combine all sensor outputs, and having several small and fast algorithms that omit (possibly relevant) combinations. The next experiment will demonstrate the usefulness of dividing the KSOM into smaller ones when the input space grows larger. The original clustering layer has one RSOM with an output layer of 400 neurons (20x20 grid), while the second version has two smaller ones – one KSOM and one RSOM – both with an output layer of 100 (10x10) neurons. The shrinking of the number of neurons is justified because the number of clusters in the output space will be less as explained in the previous section. The output layer is not the only element that becomes smaller, the weights attached to each neuron become smaller as well since the input space has become smaller. The output layer is reduced from 400 (20x20) neurons to 200 neurons (2x10x10). The total number of weights becomes thus 3200 (8x400) for the first architecture, while it is 800 (2x4x100) for the second. The outcome is a faster algorithm that is also less demanding on the hardware resources. Table 3 gives a summary on the comparison of both systems in terms of numbers.
The choice of the subset is done more or less randomly for all sensors but the light-related and motion-related sensors and cues were kept together in the RSOM (since they show short time-varying patterns). The WF threshold of the RSOM was kept close to one, since a longer-term memory was not really needed in this case. The next two graphs (Figures 19 and 20) show the results in recognition, with a minimum amount of training. The upper plot shows the ID given by the clustering-classification layers. Valid IDs are 1, 2, 3, 4 and 5, while zero was the ID to represent the ‘not recognized’ outputs. The fact that the classification fails frequently is the result of a minimal training (a few seconds per context). This will provide a good view on the responsibilities and qualities of the supervising layer, described in chapter 6. Both the supervising layer and the classification layer are discussed in following chapters, so they will be explained later. It is now sufficient to look at the lower plots to see the improvements when using the two smaller SOMs. The function in the lower plot depicts the context ID that was given by the supervising layer. This would be the actual result of the whole system if the ID number were labelled to a short description like "jogging in the park". The vertical lines through both plots represent the actual (added by hand) borders between the five contexts. The results of the divided input space approach are a bit better than those of the one clustering algorithm method. Especially the transition from the first to the second context becomes better, while the others are less visible. Notice that the results from the classification layer are also in general better. This of course does not prove that results become better in every case; the experiment is a very extreme example since the amount of training was intentionally kept very low. Dividing the sensors into groups results into omitting potential combinations
of sensors, that are useful in distinguishing contexts. It is thus possible that
dividing the input space has negative effects on the quality of the clustering.
The grouping of the sensors should consequently be chosen very carefully, and
overlapping (one sensor or cue present in two groups) should be considered.
Omitting combinations can have either a good effect or a bad effect, depending
on whether these combinations are respectively relevant or not.
Chapter 5
The work is not entirely done after the clustering algorithm has finished processing the input. The clustered input vector still needs to be associated with a context profile, in order to complete a classification of the original sensor readings. The goal of this chapter is to look for a good classification method that can be used in the strategy of the thesis. This chapter elaborates on some classification algorithms that could be used. Another problem that will be discussed in this chapter is how the system can be trained to recognise context profiles in an easy way without the necessity of the user being too much involved in training. Every neuron in the SOM can be linked to a context profile if the number and size(s) of SOMs is limited. In general, every cluster that was created by the clustering algorithm will be given a label. This method was determined to work best with the experiments in this thesis, since both the number of classes and the size of output layer during the experiments was rather low. A complete overview of the algorithms that can be implemented is given in this chapter to anticipate the growing amount of sensors. This brute force method could result in a slow system if the number of output neurons grows too high. One solution would be to cluster the neurons (which are in a way already clusters) again and attach a context profile to each new cluster. This method is in fact similar to the nearest neighbour algorithm in the next section. 5.2. Distance Weighted K Nearest Neighbours Once the clusters are created and linked to a context profile, new values coming from the (R) SOM have to classified, preferably by a fast but accurate algorithm. The nearest neighbour algorithm is one of the simplest and fastest algorithms: it compares the resulting output (an activated cluster in this case) to a set of training instances and classifies it to the most similar instance. Normally the training data consists of a set of instances plus the classes they should be associated with; in this case the training data is created from the clustered codebook vectors of the SOM’s neurons.
Table 4: The distance-weighted k nearest neighbour algorithm.
Similarity is implemented as the Euclidean distance, so the algorithm classifies the input vector (which represents a cluster in this case) to its nearest neighbour (hence the name). The results will usually be better if multiple neighbours are considered in stead of just one. It is possible to take a vote among the neighbours to select the winning class, but the result will get more trustworthy if the vote is distance-weighted. This means that the distance of each neighbour involved will count in the ultimate decision of classification. The final algorithm that includes both these improvements is called the distance-weighted k nearest neighbour algorithm [19]. The distance-weighted k nearest neighbours algorithm belongs to the class of local weighted regression techniques. The target output is approximated based on the data available near the query point, and the contribution of each data point is weighted by the distance from the query point (hence the name distance weighted regression). The term regression is used because it is used in the domain of statistics in approximation of a certain real-valued function. The simple nearest neighbour algorithm is very similar to the leader clustering algorithm, but in this case there is no feature that makes a new instance if the similarity is not convincing. Both algorithms depend on the Euclidean distance to introduce similarity, however the nearest neighbour classifies data while the leader algorithm is restricted to clustering. 5.3. Radial Basis Function Networks Radial Basis Function (RBF) networks are related to local weighted regression, as well as neural networks. The architecture is similar to that of a neural network with a hidden layer for which the combination function is based on the Euclidean distance between the input vector and the weight vector. The hidden neurons produce an activation, which is determined by a Gaussian function with a certain mean and variance. Therefore the activation will only be significant if the input is close enough to the mean, otherwise it will be near zero. The neurons from the output layer combine the activations linearly. RBF networks do not really have anything that is exactly the same as the bias term in a multi-layer perceptron. Some types of RBFs have a width associated with each hidden unit or with the entire hidden layer; instead of adding it in the combination function like a bias, the Euclidean distance is divided by the width. The output activation function in RBF networks is usually the identity. The identity output activation function is a computational convenience in training but it is possible and often desirable to use other output activation functions. Although there are many types of radial basis functions, Gaussian RBF’s seem to be the most popular by far in the neural network literature. There are two distinct types of Gaussian RBF architectures. The first type uses the exponent activation function, so the activation of the unit is a Gaussian as a function of the inputs. This kind of RBF is usually referred to as ordinary RBF or ORBF. The second type of Gaussian RBF architecture uses the softmax activation function, which means that the activations of all hidden units are normalized to sum one. This type of network is often called a normalized RBF or NRBF network. In this kind of architecture, the output units should not have a bias since the constant bias term would be linearly dependent on the constant sum of the hidden units. The radial basis function network provides a global approximation to the target function, represented by a linear combination of many local kernel functions. The value for any given kernel function is non-negligible only when the input falls into the region defined by its particular center and width. Thus, the network can be viewed as a smooth linear combination of multiple local approximations to the target. One main advantage to RBF networks is that they can be trained much more efficiently than the more common feedforward architectures trained with backpropagation. This follows from the fact that the input layer and the output layer of RBF networks are trained separately. To use the radial basis function for classification, it is common to cluster the data first (as is done in this thesis), and use the resulting clusters as radial centers of the Gaussians. If the Self-Organizing Map is to be used as a classification algorithm, the problem becomes a decision process. Kohonen [15] modified the KSOM into a classification algorithm, and called it Learning Vector Quantization (LVQ). The classes are predefined and the there is a set available of tagged data; every input vector is labelled with its correct class. It is very common to have several codebook vectors or prototypes per class. The original Kohonen SOM is primarily intended to approximate input signal values by codebook vectors that are placed in the input space to obtain the best clustering results. This is obtained by minimizing the quantization error function. If the inputs are to be classified into a finite set of classes or categories, the codebook vectors are linked to a class and their identity within the classes is not important anymore. Only the decisions at the borders of classes will count in that case. The winning rule from the KSOM algorithm remains unmodified, but the update rule depends on whether the class of the winner is correct or not. The correction comes from a ‘teacher’, so this algorithm is a supervised one. The first version of LVQ (LVQ1) [15] starts with a classification of some neurons, while others have no class attached to it. The class regions are defined with the nearest neighbour algorithm. The classification accuracy is enhanced if the codebook vectors are updated to the algorithm in Table 5. The idea behind the algorithm is to pull codebook vectors away from the grey zones so that the class borders become more apparent.
The second version of LVQ adds a symmetric window of nonzero width between
two codebook vectors. Figure 21 illustrates the use of this window: the two
circles depict two codebook vectors that belong to different classes and are
nearest neighbours. The two are initially in a wrong position. The curves around
them represent the according class distributions. The classification will be
adjusted only if the input falls into the window and the classification was
incorrect.
The optimal size of the window depends on the number of inputs that are presented to the network during the training phase. A narrow window guarantees the most accurate location of the border between classes. However, if the number of input samples is small and the window is very narrow, problems could occur in updating misclassifications. The input has to fall often enough in the window. The final version of the algorithm provides a remedy for two problems that remain with the LVQ2 algorithm: The update rules have the effect that the correction on the correct class is larger than that of the incorrect class, since the differences between the codebook vectors and the input are considered. This results in a decreasing distance between the two codebook vectors. LVQ2 could lead to an asymptotic equilibrium of the wrong class that is not optimal. Corrections seem needed to ensure the continuing approximation of the class distributions. The pseudo code of the LVQ3 algorithm is shown in Table 6. Kohonen found the optimal value of e to depend on the size of the window, being small for narrow windows (between 0.1 and 0.5). The second and third version of LVQ change two codebook vectors at once, whereas in the first version only one was changed.
5.5. Minimizing the user intervention Since the clustering is adaptive, it allows the further classification to be maintained and adjusted when necessary. This results in a minimal amount of user-intervention, since linking the clusters to context profiles is the only action a user has to be involved in. In contrast of the training-phase/test-phase method, where the recognition is executed with a stable but non-plastic system after it has been trained, the system stays plastic. There is in fact no distinction between a training- and test- phase of the system. This also means that the network trains itself for new contexts and the user can label the newly formed clusters afterwards. Thus, the system does not need to wait for the user to begin adapting itself to new contexts. The goal is to make the whole process as autonomous as possible, which will render it simultaneously more user-friendly. Figure 22 shows the two places where the user has to intervene in the process
of context recognition. The first one is the linking of clusters to
context-identifiers that were initialized by the system. The second interaction
is the labelling of these context identifiers with short descriptions.
Chapter 6
The system that has been discussed so far is able to associate a predefined context profile to a set of sensor values at a certain moment in time. In practice, there are no guaranties that the system that has been described up until now will always recognise the current context at exactly every moment in time. Another layer is needed to reach constant recognition of a context: the supervising layer. This top-down part of the system will also make user supervision possible and will improve the adaptive behaviour and recognition success of the learning system. Furthermore, it becomes possible to model sequences of contexts and thus a crude model of user behaviour. 6.1. A probabilistic finite state machine The probabilistic finite state machine contains two elements: states and actions. The difference with a normal finite state machine is that the actions are labelled with the probability that they could be chosen. Every user-defined context profile will be represented by a state, while transitions of states will be indicated by an action that goes from one context to another. Every transition has a label attached to it that contains an indication of how probable it is that a transition between the two contexts occurs. This way, transitions of states can be checked on probability and it is possible to keep track of the probability of entire sequences of states. This information can be used as a sort of emergency backup if the recognition system has substantial problems with the classification. The probabilistic finite state machine can pick out the most probable state, based on the previous states. The probabilistic finite state automaton is very general since it also keeps track of previous states. This is not really necessary and it is sufficient to look at the previous state for the choice of actions in this application. Reducing the probabilistic finite state machine in this fashion leads to a Markov chain model. A Markov chain is a sequence of events or states where the chance of each occurring is solely dependent on which state preceded it. In a probabilistic finite state automaton multiple previous states count in the decision process. In the diagram below (Figure 23) a Markov chain is described which consists solely of the words in the sequence ‘the cat sat on the mat’. Each word represents a state that the system can adopt, and the arrows joining them show the probabilities of shifting from one state to another one. Many names exist in literature to label this graph, although Markov chain and
Markov model are most used. Among their synonyms are first-order Markov chain or
first-order Markov model (in contrast, a higher order Markov model would be
closer to a probabilistic finite state machine) which are also used frequently.
For example, as far as the laws of grammar are concerned, the only word in this selection that can follow ‘on’ is ‘the’. To put it in a more general way, the probability of ‘the’ being the next word in the sequence is 100 percent, while the probabilities of ‘cat’, ‘sat’ or ‘mat’ are zero (these links are not shown in the graph). By contrast, ‘the’ is just as likely to be followed by ‘mat’ or ‘cat’, so these paths are given an equal 50 percent chance. In practice, a Markov model would be more complex, it would involve more states and their probabilities would not be nearly so evenly apportioned. Imagine that the Self-Organizing Map (or a similar clustering algorithm) has correctly identified the first few context profiles, but there remains some doubt on what the next will be. A Markov model could quickly decide between the two. If the previous context was for instance ‘in a train’, the Markov chain would know that it would be more likely that the next context would be ‘in a train station’. It would be highly unlikely that the next context would be ‘in a bedroom’ for example. The Markov chain is often implemented as a matrix, where each column and row represent the probabilities to go from one state to the other. Using the Markov model to add learning of long-term variations to the system is widely used in various recognition systems. The most important advantage of the Markov chain is that it is able to deal with variable length sequences. Hand sign recognition [28] and lipreading [12] using a Markov model were implemented with good results. The user can supply the data for this supervising layer, both for the context descriptions and for the probabilities of the transitions. Nevertheless, it is also possible –and perhaps preferable- to make the Markov Model adaptive towards the probabilities. If the system keeps track of the number of transitions between two certain contexts, it will become possible to update automatically the probabilities in the model. It is also possible to combine user interaction and these self-adapting probabilities. In Table 7, the simple algorithm is summarised to update the probabilities in an adaptive way. Every directed link from one context profile to another has two numbers attached to it: a variable that counts how many times the transition has been made, and the probability of that tr |