Deep Learning vs. Machine Learning - Why Artificial Intelligence Is Not That Intelligent (Yet)

AI | 17 Mar 2020

You have most likely heard recent news about advancements in Artificial Intelligence (AI), or seen an ad for a TV or smartphone with ‘AI’. Or maybe you are planning to build an ‘intelligent’ home yourself? We are bombarded with these and similar buzzwords almost every day. But what do they really mean? Are ‘intelligent’ appliances and devices really intelligent, and are the breakthroughs in AI research bringing us close to an inevitable, Terminator-like machine uprising?

In this short article, I will try to answer these questions, showing why your smartphone voice assistant is not necessarily that smart, and that many ‘AI’ solutions are, in fact, well-designed mathematical models.

Machine Learning In Practice - Real Estate Pricing

Imagine you want to create an application that will automatically estimate the price of a piece of real estate in a given city. A user would fill in data about their house or apartment and then receive an estimate of its value. With this in mind, let’s think about what factors should be included in the formula for the price. We would probably start with the type of real estate, its total area, number of floors, number of bedrooms and bathrooms, name of the city district it is in, etc. We could also use additional information, like distances to important places (e.g. city center, main train station, nearest bus stop or metro station, nearest primary school), presence of parks in the neighborhood, number of parking spots/garages available, total area of the backyard, etc.

With all these factors in mind, we then need to think about how to build an algorithm that calculates the price. We would probably collect as much data about current real estate offerings as possible, and try to find mathematical relationships between the offerings’ features and prices. An obvious example of such a relationship is the correlation between a larger area and a higher price. This is simple, but how can we find the relationships between all the features and the price, particularly when some features generally increase the price and others decrease it (imagine a large apartment far from the city center, for example)? Finally, how do we find the joint, overall impact of the features on the price? This is a typical example of a Machine Learning (ML) problem.

In Machine Learning, we are usually given data that describes features (characteristics) of some entities (items/people/events), and a value we want to be able to estimate for new cases. We call this value a target variable or simply target, and the estimation process prediction, even though it doesn’t always relate to the future. Learning from data that includes the target is called supervised Machine Learning. Sometimes, we want the algorithm to give us insights about our dataset automatically, without providing it with any ‘hints’ or targets. This approach is called unsupervised Machine Learning, and an example of its application is the automatic grouping of items, called clustering.

If we want to find the relationship between the feature variables and the target (as in our app’s case), we need to train an ML model. Model and training are two crucial concepts in Machine Learning. An ML model is a function that is used to predict the target based on the feature variables. ML models range from very simple functions (like linear regression’s y = a + b*x) to extremely complex objects called artificial neural networks, which can have hundreds of millions of parameters and complicated ways of transforming and combining features.
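To make the simplest end of that range concrete, here is a minimal sketch of fitting y = a + b*x with ordinary least squares. The function name `fit_line` and the listing data are made up for illustration; a real pricing model would use many more features and far more data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical listings: area in square metres, price in thousands.
areas = [30, 50, 70, 90]
prices = [110, 170, 230, 290]

a, b = fit_line(areas, prices)
predicted = a + b * 60  # estimated price of a 60 square metre apartment
print(a, b, predicted)
```

Here the two numbers a and b are the whole model: once they are known, predicting the price of a new apartment is a single multiplication and addition.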

If we want to use an ML model for prediction, we need to optimize its general structure and parameters to fit the given data. This is done in the process of model training. During this process, special algorithms adapt the structure of the model and its parameters, either deterministically or by (often informed) trial and error, so that the model best reflects the relationships between the features and the target that are present in the data. These adaptations can mean, for example, giving a higher weight to some feature’s value in the mathematical formula for the target, omitting a feature from the calculation, or adding some special rules (e.g. the larger the real estate’s area, the higher its price) to the prediction process.
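One common form of ‘informed trial and error’ is gradient descent: start with a guess, measure the error, and nudge the parameters in the direction that reduces it. The sketch below applies it to a one-feature linear model. The function name, learning rate, step count, and data are assumptions chosen for illustration, not a recipe from any particular library.

```python
def train_by_gradient_descent(xs, ys, learning_rate=0.1, steps=5000):
    """Fit y = a + b*x by repeatedly reducing the mean squared error."""
    a, b = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training example.
        errors = [(a + b * x) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to a and b.
        grad_a = 2 * sum(errors) / n
        grad_b = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Move both parameters a small step against the gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical data: area in hundreds of square metres,
# price in hundreds of thousands (roughly price = 0.2 + 3 * area).
areas = [0.3, 0.5, 0.7, 0.9]
prices = [1.1, 1.7, 2.3, 2.9]

a, b = train_by_gradient_descent(areas, prices)
print(round(a, 3), round(b, 3))
```

The features are deliberately kept on a small scale. With raw values such as 90 square metres and prices in the hundreds of thousands, the same learning rate would make the updates overshoot, which is one reason features are usually rescaled before training.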

Our real estate price prediction problem is an example of a quite standard application of ML. First, we collect data about many instances of the object of interest (houses and apartments), including their features (total area, number of bedrooms, etc.), and target (price). Next, we gather this data in tabular form, where each row corresponds to one house or apartment, and consecutive columns contain data about the features and the price. Then we select some ML models and train them with appropriate training algorithms. This row-column approach is characteristic of the traditional (standard) applications of ML. It has been used for many years now in numerous practical applications. We can find standard ML models in credit scoring systems, churn prediction models, predictive maintenance models, and many more. They are used in all kinds of businesses - e-commerce, banking & insurance, manufacturing, energy production, automotive, and more. These solutions - although often very powerful and accurate - are very specialized in the sense that they are able to perform only one task (e.g. churn prediction) in a very narrow environment (e.g. a particular loyalty program). They are very far from what we would consider ‘intelligence’.
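The row-and-column layout described above can be sketched in a few lines. The column names and all values in this tiny table are invented for illustration; the point is only how one row per property splits into a list of features and a target.

```python
import csv
import io

# Hypothetical listings; every value here is made up.
RAW_TABLE = """area_m2,bedrooms,distance_to_center_km,price
54,2,3.5,310000
120,4,12.0,450000
38,1,1.2,295000
"""

FEATURE_COLUMNS = ["area_m2", "bedrooms", "distance_to_center_km"]
TARGET_COLUMN = "price"

# Each row of the table describes one house or apartment.
rows = list(csv.DictReader(io.StringIO(RAW_TABLE)))

# One feature vector per property, plus the matching target value.
features = [[float(row[name]) for name in FEATURE_COLUMNS] for row in rows]
targets = [float(row[TARGET_COLUMN]) for row in rows]

print(features[0], targets[0])
```

A training algorithm would receive `features` and `targets` together, while a trained model later receives only a new feature vector and returns an estimated price.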

Deep Learning In Voice Assistants

Let’s now assume we want to add a voice interface to our price prediction app. We want the users to be able to ‘talk’ with a virtual assistant that would ask questions about their property, put their answers into the model in the correct format, and then tell them the estimated value of their real estate. This problem looks much more complicated than the price prediction itself, since we need to solve some pretty hard challenges. How can we make the computer understand human speech? How can it extract specific information from text? What about answers that do not contain the information required? These problems seem like they require some kind of intelligence. But do they really?

The first problem mentioned, automatic speech recognition, has been researched since the 1950s, but solutions became widely available to the public only in the early 2010s, when the largest tech companies released their voice search options (e.g. Google) and voice assistants (e.g. Apple’s Siri) on smartphones. Speech recognition has become so accurate thanks to the application of Deep Learning.

Deep Learning is a part of Machine Learning in which models of a special kind, called Deep Neural Networks (DNNs), are used. Deep Neural Networks are very complex. They often contain millions of parameters and have a very complicated structure and data flow (the way the data is transformed and mixed). Training them takes a lot of time and computation, and requires vast amounts of data (often millions of samples). These models can be used to convert speech (in the form of a digital audio file) to text, enabling our assistant to analyze it and extract the required information.

To make our assistant extract the information it needs from the transcribed speech, we need to define the intent of the conversation. In our case, the intent will be collecting data about the features of the real estate - its total area, number of bedrooms, or the address. For each feature, we define a question the assistant will ask, and provide it with a few sample answers the user might give, with placeholders for the extracted information, e.g. ‘My house has <number_of_bedrooms> bedrooms’.
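As a rough illustration of how sample answers with placeholders can drive extraction, the sketch below turns each template into a regular expression. This is a deliberately naive stand-in, not how any particular assistant platform works: the templates, the slot name `total_area`, and both helper functions are hypothetical, and only numeric values are supported.

```python
import re

# Hypothetical sample answers in the <slot_name> placeholder style.
SAMPLE_ANSWERS = [
    "My house has <number_of_bedrooms> bedrooms",
    "There are <number_of_bedrooms> bedrooms",
    "The total area is <total_area> square meters",
]


def template_to_regex(template):
    """Turn a sample answer into a regex with one named group per placeholder."""
    # Splitting on a capturing group alternates literal text and slot names.
    parts = re.split(r"<(\w+)>", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text
        else:
            pattern += rf"(?P<{part}>\d+(?:\.\d+)?)"  # numeric slot value
    return re.compile(pattern, re.IGNORECASE)


PATTERNS = [template_to_regex(template) for template in SAMPLE_ANSWERS]


def extract_slots(answer):
    """Return the slots found by the first matching template, or {}."""
    for pattern in PATTERNS:
        match = pattern.search(answer)
        if match:
            return match.groupdict()
    return {}


print(extract_slots("my house has 3 bedrooms"))
print(extract_slots("The total area is 85.5 square meters"))
print(extract_slots("We sleep in three separate rooms"))
```

The last call returns an empty dictionary: exact templates break as soon as the user phrases the answer differently, which is the gap that a language model is meant to fill.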

The last piece of our puzzle is a language model that will allow our assistant to extract information from the user’s answers even if they do not exactly follow sample answer patterns. In a language model, each word is represented with a vector of numbers, and some context relationships between words are encoded. There are a number of language models used in practice, but the most accurate also use deep neural networks.
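The idea of representing words as vectors can be sketched with a toy example. The three-dimensional vectors below are invented by hand purely for illustration; real language models learn vectors with hundreds of dimensions from large amounts of text. Cosine similarity is one common way to compare such vectors.

```python
import math

# Toy word vectors, made up for this example.
WORD_VECTORS = {
    "flat": [0.8, 0.2, 0.1],
    "apartment": [0.9, 0.1, 0.1],
    "garage": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other known word whose vector is closest to `word`."""
    candidates = [w for w in WORD_VECTORS if w != word]
    return max(
        candidates,
        key=lambda w: cosine_similarity(WORD_VECTORS[word], WORD_VECTORS[w]),
    )


print(most_similar("flat"))
```

With vectors like these, an assistant whose sample answers mention an ‘apartment’ can still recognize that a user talking about a ‘flat’ probably means the same thing.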

Voice assistants can be of great use in the tasks they are built for, and the process of creating one from scratch requires a vast amount of resources, talent, and work. But as complicated as they may seem (and ingenious as they are), they are merely sequences of neural networks, each trained to solve a concrete problem, passing their output to the next one until the expected output is returned. Voice assistants are not able to answer all of our questions, nor do things they were not specifically designed for. Although ‘talking’ with them can sometimes be an exciting sci-fi-like experience, they are not really intelligent, and have no ‘idea’ what they are doing.

Understanding Images With Deep Learning

Another area in which Deep Learning has helped drive tremendous advancement in recent years is image recognition and object detection. These two areas are closely connected, and their revolutionary development was possible due to the application of convolutions and Graphics Processing Unit (GPU) computing. Without going into too much detail, convolution is a way of transforming pixel data from images in neural networks, in which the network ‘tries’ to capture both basic and more complex patterns in how the colors are arranged in neighboring pixels. A pattern found by a convolution can be a horizontal line, a dot, a semicircle, or something more complicated, like the number eight or a circle with two smaller circles in the middle. The network learns how basic patterns build complex ones, and how the latter constitute images of objects the network should recognize.
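To show what a single convolution does, here is a minimal sketch that slides a small grid of weights (a kernel) over a tiny grayscale image. The kernel below is picked by hand to respond to horizontal lines; in a real network, the kernel values are learned during training rather than written by a person. The image and function name are assumptions made for this example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products.

    No padding is used, so the output is smaller than the input. Like most
    deep learning libraries, this does not flip the kernel.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# 5x5 image with one bright horizontal line in the middle row.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Hand-picked kernel: positive in its middle row, negative above and below.
horizontal_line_kernel = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]

response = convolve2d(image, horizontal_line_kernel)
for row in response:
    print(row)
```

The output is largest where the kernel sits exactly on the line and negative where the line is off-center, so the resulting grid acts as a map of where the pattern occurs in the image.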

In reality, these patterns (or features as they are called in Deep Learning) can have weird shapes we would not have thought of if we tried to figure them out by ourselves. The types and sizes of convolutions are defined by the neural network’s creators, and the parameters of the transformations are optimized by the training algorithm, in order to select features that help recognize the training objects as accurately as possible.

Just as in the case of deep neural networks used in Natural Language Processing, image recognition DNNs are very complex, and training them to a high degree of accuracy requires a lot of data. Since access to data is limited, and training a usable DNN can take weeks even on GPU-enabled devices, not everyone can create these models. The problem becomes even harder if someone wants to build a very specialized solution (e.g. to recognize car models in images), but does not have a lot of labelled training data. Luckily, artificial neural networks allow for transfer learning. Transfer learning is a process in which one takes an already trained model, applies some (usually small) changes to it, and trains it further on new data of the same type (e.g. car images) to perform the same task, but in a different (e.g. more specialized) setting. Since the pre-trained model already contains some general image-recognizing features, training it further on a relatively small sample of specialized data (even a few hundred or a few thousand images) can often yield a quite accurate model. Transfer learning can be applied to all kinds of neural networks, and can also be used in language-related problems.

Creating Art and Fakes With Deep Learning

You have probably heard about deep fakes - fake videos that appear to show famous people doing or saying things they never really did or said. These real-looking fakes became possible with the emergence of Generative Adversarial Networks (GANs) in 2014. Some image-changing apps (like the one that could make you look much older) use them as well. GANs consist of two main parts, both of them neural networks. One of the networks (the Generator) tries to generate real-looking objects (e.g. images), while the other (the Discriminator) tries to tell generated objects from the real ones it is given in a training set. During training, these two networks compete: the Generator tries to ‘fool’ the Discriminator, the Discriminator tries to catch it, and each improves at its task in response to the other. As a result, with a well-designed setup and training parameters, we obtain a Generator that can achieve astonishing results, creating real-looking media or pieces of ‘art’ resembling those created by humans. Other types of GANs are able to transfer styles between paintings, transform sketches into real-looking images, or harmonize music. But they are still just algorithms designed to perform specific tasks, and they specialize in them by trying to fool the Discriminator. They do not really ‘create’ anything.

Great Opportunities, But Still Limited

Machine Learning and Deep Learning give us great opportunities to rapidly, accurately, and cheaply perform tasks that were previously reserved only for humans. ML and DL models have great potential in medicine, customer service, banking, and many more fields and businesses. They can definitely lift our civilization to the next technological level. But they are not intelligent. They are merely specialized agents trained for their tasks, and - although powerful in what they do - they are still quite limited. They cannot reason logically or generalize reliably to new situations. So we are safe - the AI apocalypse is not happening (soon)!

Editor's note: This is a guest post from Miquido. Miquido is an award-winning digital product development company that excels at building AI-driven apps and web services. It is a Deloitte Technology Fast 50 CE laureate and UK App Awards 2018 winner, certified by Google and covered by TIME & Forbes.