What is Data?
Data, as widely defined, refers to pieces of information or facts that are collected, stored, and analyzed for various purposes. It can come in many forms, such as numbers, text, images, or multimedia. Data serves as the foundation for generating insights, making decisions, and understanding trends in fields ranging from science and business to everyday life.
What is Big Data?
Big data, as the name implies, refers to data that is so large that you have to think carefully about how to handle its size. It is too large and complex to be processed with traditional data management tools such as a conventional relational database or Excel spreadsheets. To understand big data, the concept of the 5 V’s is used:
- Volume: This refers to the amount of data being generated. With the rise of the internet, social media, and other digital services, large amounts of data are constantly being produced.
- Velocity: This refers to how fast the data is being generated and processed. New data often arrives rapidly, sometimes within fractions of a second.
- Variety: This refers to the different forms the data comes in, that is, its type and nature, such as text, images, video, or audio.
- Veracity: This refers to how trustworthy the data is. With such vast amounts being generated, it is just as important that the data is accurate and reliable.
- Value: This refers to how actionable the data is. The ultimate goal of working with big data is to extract meaningful insights from it in order to make better decisions and improve processes.
So, in simple terms, big data is about dealing with large amounts of data that arrive fast and in different forms, while making sure the data is accurate and that valuable insights can be extracted from it.
Data this large and complex can be classified into three types: structured, semi-structured, and unstructured. As usual, I’d like to use an analogy to explain each one.
Structured Data:
Structured data is like a well-organized library with clearly labelled shelves, aisles, and cataloguing systems. Each book is assigned a specific location based on its genre, author, or title. Just as it’s easy to find a book in such a library by following the cataloguing system, structured data is organized in a predefined format, making it easy to locate and retrieve specific pieces of information.
Some characteristics of structured data include:
- It is easy to search and organize
- It has a consistent structure, i.e. it can be organized into rows and columns
- It has defined data types
- It can be grouped into tables, with relations formed between tables
- Most structured data is stored in relational databases
Storage Format: Structured data is organized and stored in a well-defined format with a fixed schema. Examples include relational databases (such as MySQL and PostgreSQL), spreadsheets (Excel), and structured text files (CSV).
Storage Mechanism: Structured data is usually stored in tables, with rows representing individual records and columns representing attributes or fields.
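To make the rows-and-columns idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are all made up for illustration; the point is only that a fixed schema with defined data types makes a specific record easy to retrieve.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Fixed schema: every row has the same columns with declared types.
# The "books" table and its columns are hypothetical.
conn.execute(
    "CREATE TABLE books ("
    "id INTEGER PRIMARY KEY, "
    "title TEXT NOT NULL, "
    "genre TEXT NOT NULL, "
    "year INTEGER NOT NULL)"
)

# Made-up sample rows: each tuple is one record, each position one attribute.
sample_rows = [
    (1, "Sample Novel", "fiction", 2001),
    (2, "Sample Guide", "non-fiction", 2010),
    (3, "Another Novel", "fiction", 1995),
]
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", sample_rows)

# Because the structure is known in advance, lookups are straightforward.
fiction_titles = [
    title
    for (title,) in conn.execute(
        "SELECT title FROM books WHERE genre = ? ORDER BY year", ("fiction",)
    )
]
print(fiction_titles)
```

This is the library catalogue from the analogy: the query says exactly which shelf (genre) to look on and in what order to return the books.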
Semi-Structured Data:
Semi-structured data is comparable to a library where books are organized by genre but lack a strict cataloguing system. While you can still find books based on broad categories like fiction or non-fiction, there may be variations in how books are sorted within each genre. Additionally, some books might have handwritten notes or bookmarks inserted by previous readers, adding a layer of flexibility and personalization. Similarly, semi-structured data may have some level of organization, such as tags or labels, but lacks the rigid structure of fully structured data.
Some characteristics of semi-structured data include:
- It is relatively easy to search and organize
- Its fields and data types can vary from one record to the next
- It can be grouped, but doing so takes more work than with structured data
Storage Format: Semi-structured data doesn’t adhere to a strict schema but has some organizational properties. It often includes metadata or tags to provide some structure. Examples include JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).
Storage Mechanism: Semi-structured data can be stored in various ways. For example, JSON documents can be stored in document-oriented NoSQL databases like MongoDB, which provide flexibility in data storage without requiring a fixed schema. Semi-structured data can also be stored in other kinds of NoSQL systems, such as columnar databases.
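As a rough illustration of "some structure, but no strict schema", the sketch below parses a small JSON document with Python's standard json module. The records and their field names are invented for this example. Notice that each record carries its own labels (keys), yet no two records have exactly the same fields, so the code has to cope with missing values.

```python
import json

# Made-up records: each one is self-describing through its keys,
# but the set of keys differs from record to record.
raw = """
[
  {"title": "Sample Novel", "genre": "fiction", "tags": ["classic"]},
  {"title": "Sample Guide", "genre": "non-fiction", "notes": {"reader": "handwritten note"}},
  {"title": "Untitled Draft"}
]
"""
records = json.loads(raw)

# Discover which fields exist anywhere in the data, since no schema declares them.
all_keys = sorted({key for record in records for key in record})

# Grouping still works, but missing fields need a fallback value.
genres = [record.get("genre", "unknown") for record in records]

print(all_keys)
print(genres)
```

The extra `.get(..., "unknown")` step is the "needs more work" part: the tags give you something to search on, but you cannot assume every record has them.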
Unstructured Data:
Unstructured data is similar to a library with books scattered randomly across the floor, tables, and shelves, with no organization or categorization. You might find newspapers, magazines, journals, and loose papers mixed together, each containing valuable information but without any order. While it may seem chaotic at first glance, valuable insights can still be extracted from this jumble of information with the right tools and techniques, such as text recognition and analysis.
Some characteristics of unstructured data include:
- It does not follow a particular model or format, so it cannot be neatly stored in rows and columns
- It is difficult to search and organize
Storage Format: Unstructured data lacks a predefined data model or structure. It includes data types like text documents, images, videos, audio files, emails, social media posts, etc.
Storage Mechanism: Unstructured data is typically stored in object storage systems or distributed file systems. Examples include Amazon S3, Hadoop Distributed File System (HDFS), and Azure Blob Storage. These systems store data as binary large objects (BLOBs) without imposing any structure on the data itself. Metadata may be associated with unstructured data to aid in search and retrieval.
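The idea of storing opaque objects alongside searchable metadata can be sketched with a toy in-memory store. This is not the API of Amazon S3, HDFS, or Azure Blob Storage; the `StoredObject` class, the `put_object` and `find_keys` helpers, the keys, and the sample contents are all hypothetical and exist only to show the pattern.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One stored item: raw bytes plus free-form metadata (hypothetical type)."""
    data: bytes
    metadata: dict = field(default_factory=dict)


# A plain dict stands in for the object store: key -> object.
store = {}


def put_object(key, data, **metadata):
    """Save raw bytes under a key; the store never inspects the bytes."""
    store[key] = StoredObject(data=data, metadata=metadata)


def find_keys(**criteria):
    """Return keys whose metadata matches every given name/value pair."""
    return sorted(
        key
        for key, obj in store.items()
        if all(obj.metadata.get(name) == value for name, value in criteria.items())
    )


# Made-up objects of different kinds, all stored the same way.
put_object("notes/meeting.txt", "Quarterly review notes".encode("utf-8"),
           content_type="text/plain")
put_object("images/logo.png", b"\x89PNG\r\n\x1a\n", content_type="image/png")
put_object("notes/todo.txt", b"buy milk", content_type="text/plain")

text_keys = find_keys(content_type="text/plain")
print(text_keys)
```

The store treats every object as a bag of bytes, so the only way to search without opening each one is through the metadata attached when it was saved.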
In summary, big data comes in three forms: structured, semi-structured, and unstructured. Structured data is neatly organized like a well-catalogued library, semi-structured data is somewhat organized but flexible, and unstructured data is like a messy library without any order. Despite the challenges, valuable insights can still be extracted from all of them with the right tools. Understanding these differences helps organizations make better use of big data to improve decision-making.
About the Author
You can connect with Ritmwa Bewarang on Twitter or LinkedIn.