The rapid growth of unstructured data began in 2008. That year, Clifford Lynch, editor of Nature magazine, published a well-known article, “How will the work with big data volumes influence the future”, which is commonly credited with introducing the term ‘big data’. Business English has many collocations in which ‘big’ refers to a large-scale industry in a particular field: Big Oil, Big Ore, etc. The collocation ‘big data’ therefore looked natural in the article, and Lynch’s neologism caught on.
Today the term means more than just a collection of data. Read on to find out how Big Data has evolved.
Big Data: what it is and how to define it
Big Data is a complex notion with several definitions. Taken together, they describe Big Data as a set of methods for processing structured and unstructured information. Big Data is commonly defined by three main characteristics (the ‘three Vs’) proposed by META Group:
- volume – a large physical volume of data (roughly 100 GB and above);
- velocity – a high data growth rate and a constant need to process data faster;
- variety – the ability to process different data types simultaneously: images, photos, videos, texts.
This set of criteria is rather old. IBM later offered its own ‘four Vs’ interpretation by adding veracity, which the company used in its advertising. A ‘five Vs’ version followed, in which IDC added viability and value. Today seven criteria are in use: the list has been extended with variability and visualization.
Despite the different sets of criteria, the general idea is that big data is characterized not only by physical volume but also by properties that indicate how difficult the data is to process and analyze.
Sources of Big Data may include:
- user log files;
- social networks;
- data from automobile GPS sensors;
- sensor data from the Large Hadron Collider;
- transaction data of bank customers;
- purchase and customer data of a large retail chain, etc.
Big Data is also used for machine learning. Large amounts of unstructured data help AI systems learn on their own.
Big Data sectors: Big Data Engineering and Big Data Analytics
Work in the field of Big Data can be divided into two sectors: Big Data Engineering and Big Data Analytics. They are interdependent but distinct.
Big Data Engineering deals with developing software for collecting and storing data and making it available to consumer-facing and internal applications. Data engineers design and deploy the systems on which computations are later run.
Big Data Analytics, in turn, uses data from the ready-made systems built by Big Data Engineering. This sector covers trend analysis, the development of classification systems, and data prediction and interpretation.
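The split between the two sectors can be illustrated with a minimal sketch. It assumes a toy retail dataset and uses Python's built-in sqlite3 module as a stand-in for a real storage system; the function names `build_store` and `spend_by_category` and the `purchases` table are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_store(events):
    """Engineering side: collect raw events and store them in a queryable form.

    `events` is an iterable of (customer, category, amount) tuples
    (an assumed, illustrative schema).
    """
    conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
    conn.execute(
        "CREATE TABLE purchases (customer TEXT, category TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", events)
    conn.commit()
    return conn


def spend_by_category(conn):
    """Analytics side: use the ready-made store to summarize spending."""
    query = (
        "SELECT category, SUM(amount) AS total "
        "FROM purchases GROUP BY category ORDER BY total DESC"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    store = build_store([
        ("alice", "books", 12.5),
        ("bob", "food", 30.0),
        ("alice", "food", 7.5),
    ])
    for category, total in spend_by_category(store):
        print(category, total)
```

Here the engineering function knows nothing about the questions that will be asked, and the analytics function knows nothing about how the data was collected; the store is the only thing they share.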
Techniques and methods of Big Data Analysis
The number of data sources is growing rapidly, so processing technologies are increasingly in demand. According to McKinsey, the most popular tools for working with Big Data are:
- Data Mining – to detect new information in raw data that can be put to practical use;
- Crowdsourcing – to involve a large number of people in solving large-scale tasks;
- Blending and integration – to bring data into a single format to simplify processing (for example, converting video and audio files into text);
- Machine learning – among other things, to build self-learning neural networks for faster and higher-quality data processing;
- Predictive analytics, statistical analysis, visualization of analytical data – for further development of a ready-made information product.
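As an illustration of blending and integration from the list above, here is a minimal sketch that brings records from three differently shaped sources into one common format. The input formats (a space-separated log line, a bank record with an epoch timestamp, a comma-separated GPS row) and all function names are assumptions made for this example, not formats taken from any real system.

```python
from datetime import datetime, timezone


def from_log_line(line):
    # Assumed log format: "<ISO timestamp> <user> <action>"
    ts, user, action = line.split(" ", 2)
    return {"source": "log", "user": user,
            "time": datetime.fromisoformat(ts), "payload": action}


def from_transaction(tx):
    # Assumed bank record: dict with epoch seconds, customer, amount, currency
    ts = datetime.fromtimestamp(tx["epoch"], tz=timezone.utc)
    return {"source": "bank", "user": tx["customer"],
            "time": ts, "payload": f'{tx["amount"]} {tx["currency"]}'}


def from_gps_row(row):
    # Assumed GPS format: "<device>,<ISO timestamp>,<lat>,<lon>"
    device, ts, lat, lon = row.split(",")
    return {"source": "gps", "user": device,
            "time": datetime.fromisoformat(ts), "payload": f"{lat},{lon}"}


def blend(log_lines, transactions, gps_rows):
    """Convert every record to the common schema and order them by time."""
    records = [from_log_line(line) for line in log_lines]
    records += [from_transaction(tx) for tx in transactions]
    records += [from_gps_row(row) for row in gps_rows]
    return sorted(records, key=lambda r: r["time"])


if __name__ == "__main__":
    blended = blend(
        ["2019-04-09T10:02:00+00:00 alice login"],
        [{"epoch": 1554804060, "customer": "bob",
          "amount": 25, "currency": "EUR"}],
        ["car-7,2019-04-09T10:00:00+00:00,55.75,37.62"],
    )
    for record in blended:
        print(record["time"].isoformat(), record["source"], record["payload"])
```

Once every record shares the same fields, the downstream tools in the list (data mining, machine learning, predictive analytics) can work on a single stream instead of handling each source separately.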
Why is Big Data important?
At the World Economic Forum 2019, IBM CEO Virginia Rometty said, “All people say that big platforms like Facebook and Google own a huge array of customer data. In fact, they have only 20% of data collected globally”.
To support this, Rometty cited the company’s statistics: IBM services are currently used by almost all of the world’s banks, 90% of airlines, and 50% of telecommunications companies. However, even this reach falls short of that of Chinese companies.
The CEO of the major hardware and software producer also noted that the democratic approach of Western countries had been seriously outplayed by China when it comes to collecting citizens’ data.
Chinese companies collect all available data from fitness trackers, smartphones, and smart home systems, whether located in China or abroad. At the same time, European and American companies do not have access to Chinese Big Data, as it is protected by legislation.
It follows that democratic laws should be developed with a focus on personal data protection. However, unhampered data collection is still required for the rapid growth of artificial intelligence – the most promising technology of our time.
Learn more about artificial intelligence news at the AI Conference held in Moscow on April 9.