Data Science: Distributed Systems, Data Management, and Databases

Slides from Parthenope University of Naples about Data Science (second half of the course). The PDF covers key concepts such as distributed systems, data management, the CAP theorem, and various database types, useful for university students in Computer Science and big data.


Parthenope University of Naples
Science and Technology Department
Applied Computer Science (Machine Learning and Big
Data)
Data Science 2nd Half of the Course
Professor:
Antonio Maratea
Written by:
De Angelis Davide
Di Vicino Attilio
Fiorillo Giuseppe
Stefanelli Andrea
Academic year 2023/2024
Contents
1 Distributed Systems
2 Transactions
2.1 ACID transactions
2.2 BASE transactions
3 CAP Theorem
4 Data Temperature
5 Recoverability
6 Distributed Systems
7 Cost of Data Management
8 DynamoDB
9 Consistent Hashing
10 Data Warehouse
11 NoSQL
12 In-Memory Databases (IMDBs)
13 Streaming Data
14 Columnar Databases
15 Wide Column Store
16 Graph Databases
17 Diagrams
17.1 Key Aspects of Data Management
17.2 Unified Modeling Language
17.3 C.A.S.E Tools



Distributed Systems Overview

A DBMS stores and retrieves data on distributed systems, so it is particularly important to understand the properties of a distributed system, how files are replicated across it, and what scale-out and scale-up mean.

Properties of Distributed Systems

The properties that describe a distributed system are:

  • Consistency: Ensuring all users see the same data at the same time, maintaining uniformity across the system.
  • Availability: Ensuring the system is accessible and operational whenever it is needed, minimizing downtime.
  • Partition Tolerance: The ability to continue functioning despite network failures or splits, allowing for seamless communication.
  • Reliability: Consistently delivering correct results, even in the face of failures or errors.
  • Latency: The time it takes for data to travel from one point to another within the system, affecting responsiveness.
  • Throughput: The rate at which the system can process and handle incoming requests or transactions, indicating its capacity.
  • Resilience: The ability to recover and adapt to failures, maintaining functionality and performance.
  • Access Control: Managing permissions and restrictions to ensure that only authorized users or components can access resources or perform actions.
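As a rough illustration of the two performance properties above, latency and throughput can be measured directly. The sketch below is plain Python with a toy handler standing in for a real service; the `measure` function and its names are illustrative, not part of any library:

```python
import time

def measure(requests, handler):
    """Send requests sequentially; report average latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        handler(req)                              # one request round-trip
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    # Latency: time per request; throughput: requests handled per second.
    return sum(latencies) / len(latencies), len(latencies) / elapsed

avg, tput = measure(range(1000), lambda r: r * r)
print(f"avg latency: {avg * 1e6:.2f} us, throughput: {tput:.0f} req/s")
```

Note that the two metrics are related but distinct: a system can batch work to raise throughput while each individual request still sees high latency.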

Replication in Distributed Systems

Replication, in the context of distributed systems, involves creating and maintaining multiple copies of data across different nodes or servers. This redundancy serves several purposes, including improving data availability, fault tolerance, and performance. It is possible to have:

  • Full Replication: In full replication, every piece of data is duplicated across all nodes in the system. This means that each node contains a complete copy of the entire dataset. Full replication ensures high availability and fault tolerance since any node can serve requests independently, even if other nodes fail. However, it comes with the cost of increased storage requirements and synchronization overhead, as every update must be propagated to all replicas to maintain consistency.
  • Zero Replication: Zero replication, also known as single-copy consistency, involves having only one copy of each data item in the entire distributed system. This approach simplifies data management by eliminating the need for synchronization between replicas. However, it also introduces a single point of failure since if the node containing the data fails, it becomes unavailable until it is restored. Zero replication is suitable for scenarios where data consistency is critical, and the cost of replication overhead is deemed too high. However, it sacrifices fault tolerance and availability compared to replication strategies.
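The trade-off between the two strategies can be sketched in a few lines of Python. The `Node`, `FullReplication`, and `ZeroReplication` classes below are toy stand-ins for illustration, not a real storage engine:

```python
class Node:
    """A storage node holding a local key-value store."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class FullReplication:
    """Every write goes to all nodes, so any replica can serve any read."""
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        for node in self.nodes:   # synchronization overhead: N copies per write
            node.store[key] = value

    def read(self, key):
        return self.nodes[0].store.get(key)   # any node would do

class ZeroReplication:
    """Exactly one copy per key: no sync overhead, but a single point of failure."""
    def __init__(self, nodes):
        self.nodes = nodes

    def _owner(self, key):
        return self.nodes[hash(key) % len(self.nodes)]

    def write(self, key, value):
        self._owner(key).store[key] = value   # the only copy in the system

    def read(self, key):
        return self._owner(key).store.get(key)

nodes = [Node(f"n{i}") for i in range(3)]
full = FullReplication(nodes)
full.write("x", 1)
print(sum("x" in n.store for n in nodes))  # 3: every node holds a copy
```

If the owner node in the zero-replication case fails, the key is simply gone until that node is restored, which is exactly the single point of failure described above.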

Scaling Distributed Systems

The last key point about distributed systems is the distinction between scale-up and scale-out:

  • Scale-up (vertical scaling): Scale-up refers to increasing the capacity of a single machine or node in a system. This is typically achieved by adding more resources to the existing machine, such as upgrading the CPU, adding more memory (RAM), or increasing storage capacity. Vertical scaling allows a system to handle increased workloads by making the existing hardware more powerful. However, there is a limit to how much a single machine can be scaled up, and it can become prohibitively expensive beyond a certain point.
  • Scale-out (horizontal scaling): Scale-out involves adding more machines or nodes to a distributed system to increase its capacity. Instead of making individual machines more powerful, scale-out distributes the workload across multiple machines, each handling a portion of the overall workload. Horizontal scaling offers better scalability because it allows the system to grow by adding relatively inexpensive commodity hardware. It also provides improved fault tolerance since the failure of one machine does not necessarily disrupt the entire system. However, horizontal scaling may require changes to the architecture of the system to distribute and manage the workload effectively across multiple nodes.
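A minimal sketch of scale-out: requests are spread over commodity nodes by hashing the key, and capacity grows by raising the node count. The `route` function and the key names are illustrative assumptions; real systems use consistent hashing (covered later) to avoid remapping most keys when nodes change:

```python
import zlib

def route(key, num_nodes):
    """Toy scale-out routing: hash the key to pick one of the nodes."""
    return zlib.crc32(key.encode()) % num_nodes

# With 3 nodes the workload spreads across them roughly evenly; scaling
# out to 5 nodes adds capacity without upgrading any single machine.
keys = [f"user-{i}" for i in range(10_000)]
for n in (3, 5):
    load = [0] * n
    for k in keys:
        load[route(k, n)] += 1
    print(f"{n} nodes -> per-node load: {load}")
```

The design cost hinted at in the text is visible here too: changing `num_nodes` reassigns almost every key to a different node, which is why plain modulo hashing is rarely used in practice.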

Transactions in Distributed Systems

In distributed systems, we can choose between two kinds of transactions: ACID transactions (used in relational databases) and BASE transactions (used in distributed systems and NoSQL databases).

ACID Transactions

Starting with ACID transactions, ACID stands for Atomicity, Consistency, Isolation, and Durability, which are the properties that ensure reliable and predictable transaction processing in a database system:

  • Atomicity: ensures that a transaction is treated as a single unit of work, meaning that either all its operations are successfully completed and committed, or none of them are.
  • Consistency: ensures that the database remains in a valid state before and after the execution of a transaction. In other words, transactions should only transition the database from one valid state to another.
  • Isolation: ensures that the execution of multiple transactions concurrently does not interfere with each other, maintaining data integrity and preventing concurrency-related issues such as dirty reads, non-repeatable reads, and phantom reads.
  • Durability: guarantees that once a transaction is committed, its changes are permanently saved and will not be lost, even in the event of a system failure or crash.
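These properties can be observed with SQLite, which ships with Python and provides ACID transactions. The bank-transfer scenario below is a hypothetical example, but the rollback behavior is real: a failed transaction leaves no partial update behind, which is atomicity in action.

```python
import sqlite3

# In-memory SQLite database: transfers between two accounts must be atomic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one unit of work: all or nothing."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            if conn.execute("SELECT balance FROM accounts WHERE name = ?",
                            (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the failed transaction left no partial update behind

transfer(conn, "alice", "bob", 150)   # fails: the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(balances)  # {'alice': 100, 'bob': 0} -- unchanged, not debited halfway
```

Without atomicity, the first UPDATE would have left alice at -50 with bob still at 0, an invalid intermediate state that consistency forbids.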

Figure 1: ACID transaction scheme (state diagram: an active transaction performs read/write operations and moves through a partially committed state to the committed state and the permanent store; on failure it passes through the failed state, is rolled back or aborted into the aborted state, and finally reaches the terminated state)

BASE Transactions

On the other hand, in BASE transactions, BASE stands for Basically Available, Soft state, and Eventually consistent, which are principles often used in distributed systems to achieve high availability and scalability, sacrificing some of the strict ACID properties:

  • Basically Available: the system should always be operational and accessible, providing a best-effort response even in the face of failures or network partitions.
  • Soft state: it implies that the system's state may be transient and can change over time, even without input. This allows for more flexibility and scalability, as data consistency is relaxed.
  • Eventually consistent: the system will eventually converge to a consistent state after all updates are propagated and reconciled across all nodes in the distributed system. This consistency model relaxes the immediate consistency requirements of ACID transactions in favor of improved availability and scalability.
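A toy simulation of these three properties, assuming last-writer-wins reconciliation (one common strategy; real systems may instead use vector clocks or CRDTs). The `Replica` class is a sketch, not a real database:

```python
class Replica:
    """One copy of the data, versioned for last-writer-wins reconciliation."""
    def __init__(self):
        self.value, self.version = None, 0

    def write(self, value, version):
        if version > self.version:   # ignore stale updates during reconciliation
            self.value, self.version = value, version

replicas = [Replica() for _ in range(3)]

# Basically Available: the write is accepted by one replica right away...
replicas[0].write("v1", version=1)
print([r.value for r in replicas])   # ['v1', None, None] -- soft state

# ...and propagated in the background; eventually every replica converges.
for r in replicas:
    r.write(replicas[0].value, replicas[0].version)
print([r.value for r in replicas])   # ['v1', 'v1', 'v1']
```

The middle print shows the soft state: for a while, different replicas answer reads differently, which is exactly what ACID isolation would never allow.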

CAP Theorem

The CAP theorem, also known as Brewer's theorem, is a fundamental principle in the design of distributed systems. It states that it is impossible for a distributed data store to simultaneously provide all three of the following guarantees:

  • Consistency: Every read receives the most recent write or an error. In other words, all nodes in the system see the same data at the same time.
  • Availability: Every request (read or write) receives a response (success or failure), regardless of the current state of any individual node. The system remains operational and accessible even in the presence of failures.
  • Partition Tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system. This means the system can handle network partitions where communication between some nodes is lost.

CAP Theorem Trade-offs

According to the CAP theorem, a distributed system can only provide two of these three guarantees simultaneously, but not all three. Here is a breakdown of the trade-offs:

  • Consistency and Availability (CA): These systems ensure that every reader receives the most recent write and the system remains operational, but they cannot guarantee partition tolerance. This means that in the event of a network partition, the system may become unavailable or inconsistent. Traditional relational databases often fall into this category.
  • Consistency and Partition Tolerance (CP): These systems ensure consistency and can handle network partitions but may sacrifice availability. During a partition, the system may become unavailable until the partition is resolved to maintain consistency. Examples include some distributed databases like HBase and MongoDB in certain configurations.
  • Availability and Partition Tolerance (AP): These systems ensure that the system remains operational and can handle partitions but may not always return the most recent write (eventual consistency). The system continues to function and respond to requests, even if it sacrifices immediate consistency. Examples include DynamoDB and Cassandra.
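The contrast between CP and AP behavior during a partition can be sketched as follows. `CPStore` and `APStore` are hypothetical toy classes for illustration, not models of any particular database:

```python
class Partitioned(Exception):
    pass

class CPStore:
    """CP: stays consistent by refusing writes while a partition is open."""
    def __init__(self):
        self.data, self.partitioned = {}, False

    def write(self, key, value):
        if self.partitioned:
            raise Partitioned("unavailable until the partition heals")
        self.data[key] = value

class APStore:
    """AP: both sides of a partition keep accepting writes and may diverge."""
    def __init__(self):
        self.side_a, self.side_b = {}, {}
        self.partitioned = False

    def write(self, key, value, side="a"):
        local = self.side_a if side == "a" else self.side_b
        local[key] = value
        if not self.partitioned:          # normal operation: replicate across
            (self.side_b if side == "a" else self.side_a)[key] = value

    def heal(self):
        """Reconcile after the partition (toy rule: side b wins conflicts)."""
        self.partitioned = False
        self.side_a.update(self.side_b)
        self.side_b.update(self.side_a)

cp = CPStore()
cp.partitioned = True
try:
    cp.write("x", 1)
except Partitioned as err:
    print("CP:", err)                     # sacrifices availability

ap = APStore()
ap.partitioned = True
ap.write("x", 1, side="a")
ap.write("x", 2, side="b")
print("diverged:", ap.side_a["x"], ap.side_b["x"])   # 1 2 -- stale reads possible
ap.heal()
print("converged:", ap.side_a["x"], ap.side_b["x"])  # 2 2 -- eventual consistency
```

The CP store trades availability for consistency; the AP store answers every request but briefly serves two different values for the same key, converging only after reconciliation.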

Data Temperature in Distributed Systems

Data temperature, in the context of storage management within a distributed system, refers to how frequently and how recently data is accessed. Data is typically categorized into hot, warm, and cold:

  • Hot Data: frequently accessed and often consists of real-time or near-real-time data. It is typically stored on the fastest storage media available, such as solid-state drives (SSDs), to ensure quick read and write operations.
  • Warm Data: accessed less frequently than hot data but still needs to be readily accessible. It might be stored on slightly slower but still relatively quick storage media, such as high-speed hard disk drives (HDDs) or lower-tier SSDs.
  • Cold Data: infrequently accessed data, often archived or historical. It can be stored on slower, more cost-effective storage media, such as lower-speed HDDs.

Figure 2: Categories of data (hot storage serves devices, aggregations, dashboards, and custom UIs; warm storage is scale-optimized for offline analytics and AI/ML training; cold storage is cost-optimized for the data lake, backups/archiving, and compliance)

Managing Data Temperature

Managing data temperature effectively in a distributed system involves:

  • Data Lifecycle Management: Automating the process of moving data between several types of storage as its "temperature" changes.
  • Tiered Storage Architecture: Setting up a hierarchical storage system where data can be dynamically allocated to different storage types based on access patterns.
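Both points can be sketched together: a lifecycle job classifies each record by the age of its last access and assigns it to the matching storage tier. The thresholds below (one day to cool to warm, one week to cold) are illustrative assumptions; real systems tune them per workload and cost:

```python
import time

# Hypothetical tier thresholds, in seconds.
WARM_AFTER = 24 * 3600        # untouched for a day  -> warm
COLD_AFTER = 7 * 24 * 3600    # untouched for a week -> cold

def temperature(last_access, now=None):
    """Classify a record by how recently it was accessed."""
    age = (now if now is not None else time.time()) - last_access
    if age < WARM_AFTER:
        return "hot"      # fastest media, e.g. SSD
    if age < COLD_AFTER:
        return "warm"     # high-speed HDD or lower-tier SSD
    return "cold"         # slow, cost-effective archive storage

def run_lifecycle(records, now):
    """Toy lifecycle pass: place each record in the tier it now belongs to."""
    tiers = {"hot": [], "warm": [], "cold": []}
    for key, last_access in records.items():
        tiers[temperature(last_access, now)].append(key)
    return tiers

now = 10_000_000
records = {"dashboard": now - 60,
           "last_month": now - 3 * 24 * 3600,
           "archive_2019": now - 400 * 24 * 3600}
print(run_lifecycle(records, now))
# {'hot': ['dashboard'], 'warm': ['last_month'], 'cold': ['archive_2019']}
```

Running such a pass periodically implements the automated data-movement and tiered-allocation ideas above: as a record's last access recedes into the past, successive passes migrate it to progressively cheaper storage.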
