Understanding High Efficiency AI in 2021
High efficiency AI runs counter to the traditional goal of AI, which is to build more and more precise systems. An important point is that it is not precision that enables the deployment of functioning systems, it’s efficiency. We are in what is called the Zettabyte Age, experiencing a real explosion of data. In terms of number of bytes, a zettabyte is 1 followed by twenty-one zeros. This is an incredible amount of data, and the growth has been accelerating. The biggest segment at the moment seems to be human-generated files, coming in as everything from email to documents to social media. In particular, text data, the expression of people’s opinions, knowledge and needs, is increasing so rapidly that we need to automate its processing to keep up with it.
What a Piece of Text, an Email or a Message Is About
The problem we are facing, on the other hand, is that while the amount of text data increases, we are falling behind from a technical point of view, because the performance of the basic computers and processing elements is stagnating. For a while now, the improvement has not been what it used to be, when performance doubled from one generation to the next. It has become more or less stable at a certain level, and an actual increase in performance can only be achieved by adding more computational power.
There’s a consequence. What we see is that we gently but steadily lose productivity in the context of that data. There needs to be a way to work around this. One of the two reasons is the classical Von Neumann computing concept, where you have one part that is the processor and another where the actual data is stored. The bottleneck is that we have to shift and shuffle data and commands around, and this is done through a tiny little bus that can barely keep up. That’s an architectural problem, and interestingly we have been using this architecture for many decades now. The best we could come up with, other than getting more transistors into the chips, was to add more processing units, but there is always a next level of bottleneck when it comes to the communication between those clustered units. That’s a sort of built-in constraint, and we have to overcome it to be freely scalable again and to grow reasonably and efficiently with the growing amount of data and needs.
The second level is the machine learning approach that sits on top of the Von Neumann issue we just saw. This is a very generalized model with many flavors, and at this point we are looking from a very high level at the context in which those systems are embedded. When we do machine learning, we want to transform some data that flows in into some useful data that flows out, and we want to do this because the processing would be extremely complicated to do in an analytical way, that is, by trying to understand a problem and all the formulas you would need to actually generate some useful output. There are a lot of problems for which we cannot design systems that way, so we started to use data to have the computer find out by itself how the transformation has to happen. It happens based on training data, which is supposed to show the system how to transform particular examples from input to output. The efficiency of that model tells us how much training data is needed to get to a model that works reasonably well for such a transformation.
The limitations that we see here show up at all stages. You have to have enough data to do this. You have to provide (specifically for text problems) human input to generate training data, to qualify a piece of text to be used, for example a message from a customer who is angry that a product didn’t work as expected. Someone has to read these messages and associate a feature that is relevant, and you have to do this over and over until the model can make a reasonably good prediction. Even once you end up with a model that does this, there are billions of features you need to process, especially in language, which can be seen as an open system. There are so many words, and there can always be new words, so if you want to capture the characteristics and semantics, you need an enormous number of examples.
What we see here is that, in my opinion, we are facing an architectural problem. We have stagnating computing power, and the only way to train more complex models is to provide more data, which in turn requires more computing power. The question is whether this will be efficient enough to actually become useful.
I have tried to capture that problem in the metaphor of a car that runs on water. On the left side you see one implementation of that, where water is used as steam, and on the other side you see the latest principle, bringing car and water together around the hydrogen fuel cell, which is much more efficient and has a lot of advantages.
So the question I’m after is: at what fundamental level do we need to change parts of the architecture or of the approach to get efficiency back on the road again? Just building bigger and bigger steam-based cars is not the way to end up with the most efficient use of this.
The problem behind this is what I typically call statistical modelling: getting features and just multiplying your way through these features, which ends up in a huge computational effort to mass-combine all the data you have. This leads to a very large combinatorial space, hence the computing power that is needed. On the text side there is one quite famous example that came out of a scientific study in which they tried to find out how well you can possibly find a specific document in a collection, given that you create a statistical representation of the text within this document. What they did was take very large collections of documents, I think mostly patent documents, extract all the terms they found within these documents, and then generate mass queries. They basically took all the terms, created search queries with those terms, searched against the collection of patents, and measured how well they actually found certain patents, that is, which document pops up at the top of the list for which combination of extracted keywords. The very astonishing thing was that a huge blind spot was generated. The bigger the collection, the bigger the blind spot becomes, which basically means there are a large number of documents that never end up near the top of the list, whatever combination of keywords you might come up with.
In fact, the study found that in some applications you could delete 80% of the data and people wouldn’t notice, and that just shows the limitations of applying statistics to something like language. Language is statistical to some degree, but you cannot fully describe language using statistics, and that’s where you actually sense this. Just as an example, with a huge index like Google’s, because we always get answers, we have the impression that we have the power to search through all the data. In reality, there is an increasing amount of data in the background that cannot be reached by any technical means other than knowing that it’s there in the first place, which would make searching obsolete.
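The experiment described above can be illustrated with a minimal sketch. It is not the study’s actual method or data: the four toy documents, the term-frequency scoring and the function names are all assumptions made up for illustration. It generates every one-term and two-term query from the collection vocabulary, ranks the documents for each query, and counts how often each document reaches the top of the list.

```python
from collections import Counter
from itertools import combinations

# Toy collection (invented for illustration, not data from the study).
DOCS = [
    "engine engine engine valve valve piston piston",
    "engine valve",
    "piston valve engine engine",
    "gasket",
]


def term_counts(text):
    """Statistical representation of a document: a bag of term counts."""
    return Counter(text.lower().split())


def top_of_list_hits(docs, top_k=1, max_query_len=2):
    """Count how often each document reaches the top_k results over all
    queries built from 1..max_query_len terms of the collection vocabulary."""
    counts = [term_counts(d) for d in docs]
    vocab = sorted(set().union(*counts))
    hits = [0] * len(docs)
    for n in range(1, max_query_len + 1):
        for query in combinations(vocab, n):
            # Assumed scoring: sum of the query terms' frequencies in the document.
            scores = [sum(c[t] for t in query) for c in counts]
            # Highest score first; ties go to the lower document index.
            ranked = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
            for i in ranked[:top_k]:
                if scores[i] > 0:
                    hits[i] += 1
    return hits


def blind_spot(hits):
    """Indices of documents that no generated query brought to the top."""
    return [i for i, h in enumerate(hits) if h == 0]


HITS = top_of_list_hits(DOCS)
print(HITS)              # [9, 0, 0, 1]
print(blind_spot(HITS))  # [1, 2]
```

In this toy collection the longest document wins nine of the ten queries, and two of the four documents never reach the top, although each of them matches several queries. That is the blind spot in miniature; how it grows with collection size depends on the real ranking function, which this sketch does not model.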
Then there is the Von Neumann gap, which follows from the principle I described earlier. If you compare the increase in processing power with the increase in available data, you see that at some point a gap opens that doesn’t look as if it will be closed any time soon.
Another aspect related to statistical modelling is a secondary effect: if you need a lot of computing power, you need a lot of energy. I have spoken to a number of people to confirm this, and it is very often underestimated how much energy our computing sphere currently uses.
If you take together all the digital devices in the information sphere globally and their energy consumption, then already today (the number actually comes from 2018) we are using as much energy for the computing sphere as we use for global air transportation, which happens to be around four percent of global energy.
So this is not just a small part of energy consumption that is minor compared to industrial and other activities. It is really one of the top factors that we should take into account, especially if you look at the growth rate we get if we continue as we do. I would even say it’s probably not a linear development, and that what we will see is closer to exponential growth.
But even if it’s linear, we have to be prepared for a doubling to 8% of global energy being needed for the information sphere, which would correspond to all the automobile transportation on the planet. Again from a bird’s-eye view, all the industries facing the global warming issue are trying to become more energy-efficient, and the only big industry that obviously goes in the other direction is the computing industry, if you want, because we haven’t found any way of being more efficient than mass combinatorics when we do modelling, which is what leads to that amount of energy.
Last but not least, the way we currently do machine learning is to train a model that is very specific to one particular problem, and each of the models, each of the network architectures to be trained, is very local.
So you have local examples from local data, and you train a local model which is then used locally. If you extrapolate that, we will end up in what I would call a million-model multiverse.
Basically, that is a completely fragmented environment where every user of every application uses a very specific model, with all the problems of updating it, handling it, and so on. The problem is that you cannot generate network effects anymore at that level. From an economic standpoint these have been a very strong driver over the last 20 years or so, because network effects are precisely what creates the big money around those innovations.
The impact this has at the level of society, and these are just supposed to be examples of where this can lead, starts with findability: making it hard to find information also makes it hard to find true information compared to artificially propagated fake information. Fake news has a much easier time when finding is not so easy. The Von Neumann gap, of course, is very obviously a big counteractor to any effort to cope with climate change. And the whole point of creating statistics, which in the end leads to averaging to some common denominator, even if the denominator is very complex, makes it hard to be aware of or to handle cases that are special or more extreme.
A whole economy that is focused on averaging everything to the maximum to get the biggest impact might turn out to be problematic. The only data that is really there in endless abundance is data about consumers and how they behave. Those consumers are just people, they are elements of civil society, and they are the only ones who are perfectly modelled, because there is endless data about them. There is of course the danger that people with enough motivation can train those models to know more about people than they might know themselves, with all the consequences of that. So the approach we wanted to take is to go down to a very fundamental level and try to find alternatives. One area where it’s always worth having a look is biology, because biology and evolution are definitely systems and environments where everything is focused on efficiency: for a living organism, energy is the most precious resource.
Evolution basically made organisms survive and do whatever they do with the least energy. The human brain works along the same principles. Humans do what they do, using their brain and their intellectual capacities, with energy levels of something like 15 to 20 watts, which is ridiculous compared to even a single microprocessor. You see how efficient it is, even though it is the highest energy consumer in the human body relative to its weight. It’s an interesting mechanism, and the way we try to overcome the efficiency problem is precisely not to optimize and scale up the algorithms but rather to focus on the representation. What we ended up with in our initial research is a representation that has been modelled along the representation of data in the human neocortex, where thinking actually happens. I won’t go into much detail here, but what turned out to be a key issue is that computation, when it comes to concepts and language use, is actually making analogies.
Those analogies can be calculated most easily if you have two representations and can find out how similar they are by literally comparing feature with feature. We have developed a technology that is able to convert any piece of text into a semantic fingerprint. That’s one of these squares that you see here, and the rendering shows the distribution of the dots in the square. These are binary two-dimensional vectors.
The distribution always corresponds to the actual meaning of the text from which the fingerprint has been taken. If I take a fingerprint of the phrase “signed contract” and another of the phrase “done deal”, and I do this based on a language model that is specifically trained on the language business and economics people use, we find that these two phrases, although they do not share any word, have an overlap of 36%, which is quite substantial. They mean nearly the same thing. If, in contrast, I change one side to the phrase “Star Trek”, you can literally see by looking at the fingerprint representations how different they are. Just as a high-level comparison, you could imagine one of those fingerprints representing a part of the human cortex, where wherever there is a dot, the corresponding area in the cortex is activated. That looks extremely similar to actual pictures people are taking nowadays with brain imaging techniques. While exposing people to concepts, they can literally see, on fMRI for example, that there are specific activation distributions and patterns for specific meanings, so there is a very close relationship.
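To make the comparison of two fingerprints concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the 8x8 grid, the hand-picked dot positions and the definition of overlap (shared dots as a percentage of the smaller fingerprint) are not taken from the talk, and the toy fingerprints do not come from a real language model. Each fingerprint is packed into one integer, so the comparison needs only a Boolean AND and a bit count.

```python
GRID_SIDE = 8  # hypothetical 8x8 grid; the talk does not state the real size


def to_bitmask(active_positions):
    """Pack the positions of the dots (0..63) into one integer, one bit per cell."""
    mask = 0
    for pos in active_positions:
        if not 0 <= pos < GRID_SIDE * GRID_SIDE:
            raise ValueError(f"position {pos} is outside the grid")
        mask |= 1 << pos
    return mask


def popcount(mask):
    """Number of dots (set bits) in a fingerprint."""
    return bin(mask).count("1")


def overlap_percent(fp_a, fp_b):
    """Shared dots as an integer percentage of the smaller fingerprint.
    This definition of overlap is an assumption, not the talk's formula."""
    smaller = min(popcount(fp_a), popcount(fp_b))
    if smaller == 0:
        return 0
    return 100 * popcount(fp_a & fp_b) // smaller


def render(mask):
    """Draw the fingerprint as a square of '#' (dot) and '.' (empty)."""
    rows = []
    for r in range(GRID_SIDE):
        cells = range(r * GRID_SIDE, (r + 1) * GRID_SIDE)
        rows.append("".join("#" if (mask >> c) & 1 else "." for c in cells))
    return "\n".join(rows)


# Hand-made toy fingerprints, chosen only to show the mechanics.
SIGNED_CONTRACT = to_bitmask({3, 7, 12, 18, 25, 31, 40, 44, 52, 60})
DONE_DEAL = to_bitmask({3, 9, 12, 20, 25, 33, 41, 44, 55, 61})
STAR_TREK = to_bitmask({1, 6, 14, 22, 29, 36, 47, 50, 58, 63})

print(render(SIGNED_CONTRACT))
print(overlap_percent(SIGNED_CONTRACT, DONE_DEAL))  # 40
print(overlap_percent(SIGNED_CONTRACT, STAR_TREK))  # 0
```

The whole comparison runs on integers, with no floating-point arithmetic and no matrix multiplication, which is the property the sketch is meant to show.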
Nevertheless, although we are doing all of that without using any floating point, without doing any massive matrix multiplications and so on, it turns out that we can achieve the same levels of precision with much lower use of computing resources. Compared to statistical models, those levels of precision are not achieved because the algorithm we use, which is Boolean operators on those fingerprints, is more precise. In the end, all these modelling techniques have at least the potential to reach very high degrees of precision. The reality is that you can only reach the high regions of precision if the system is efficient enough to allow you to do a large enough number of iterations and refinements.
Take an algorithm that can theoretically achieve 98 percent precision but takes two and a half months and five million dollars to train, and that’s not overstated. That’s probably where the state of the art in language modelling currently is. If you then find out that to actually tune the system properly you would have to refine it 10 times, you see that with such an inefficient system it will take forever to get there. Our language models, to compare, compute in about one to two hours on a laptop. That is the actual advantage: not so much what the synthetically measured precision would be, but how well I can actually achieve that precision in the real world.
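The arithmetic behind this comparison can be sketched in a few lines. The figures are the ones quoted above. The assumptions are that every refinement costs a full training run, that runs happen one after another, that a month has 30 days, and that the laptop run takes the upper end of “one to two hours”. No dollar figure is given for the laptop runs, so none is computed. The class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


@dataclass(frozen=True)
class TrainingSetup:
    name: str
    hours_per_run: float
    dollars_per_run: Optional[float]  # None where no figure is quoted


# Figures as quoted in the talk; see the assumptions in the text above.
LARGE_MODEL = TrainingSetup("large statistical model", 2.5 * HOURS_PER_MONTH, 5_000_000)
LAPTOP_MODEL = TrainingSetup("fingerprint model on a laptop", 2.0, None)

REFINEMENTS = 10


def total_hours(setup, refinements):
    """Wall-clock hours if every refinement needs one full sequential run."""
    return setup.hours_per_run * refinements


def total_dollars(setup, refinements):
    """Total training cost, or None when no per-run cost is known."""
    if setup.dollars_per_run is None:
        return None
    return setup.dollars_per_run * refinements


for setup in (LARGE_MODEL, LAPTOP_MODEL):
    hours = total_hours(setup, REFINEMENTS)
    print(f"{setup.name}: {hours:.0f} hours (about {hours / 24:.1f} days)")
```

Under these assumptions, ten refinements of the large model take about 25 months and 50 million dollars, while ten refinements on the laptop take about 20 hours.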
To round things up, the reason I think high efficiency will play a very important role in the future is that only if we become more efficient can we easily scale up, hopefully once again in a more linear fashion, to what I would call semantic supercomputing: figuring out the meaning of all the data that’s out there. In some respects this is not even limited to language but applies to all data, because all data has a meaning. It would allow us to include that meaning with the data. The other extreme, if you have a very lightweight and powerful implementation, is that you can scale down, so you can create edge systems that already do the bulk of the processing where the data actually occurs instead of feeding it back into some cloud system. The other aspect, of course, and I’ve spoken about this in the panel already, is that it would finally allow us to separate the actual functional solution from, for example, the language in which it is applied. Currently we very often have to reinvent a certain language AI function separately for every language, which is of course cumbersome. It would be much better to develop the application once and then plug in different language models, with the system behaving exactly the same.
Last but not least, if this whole computation becomes more lightweight, it would allow us to zoom in on more specific areas. A problem that we see very much today in legal document processing, for example, is that all the documents basically use the same words. It’s very hard to differentiate, and only when we bring in semantics in addition to the statistics will we be able, like humans do by the way, to discriminate much more clearly between two pieces of text that basically use the same vocabulary but mean different things. That’s what I would call hyper-expertise: people who are highly specialized in a very specific domain, for example, can still have useful and strong search systems within that domain, differentiating at a very fine-grained level. That would be my 30,000-foot view on the topic of high efficiency AI.