Parthenope University of Naples
Science and Technology Department
Applied Computer Science (Machine Learning and Big
Data)
Data Science - 2nd Half of the Course
Professor:
Antonio Maratea
Written by:
De Angelis Davide
Di Vicino Attilio
Fiorillo Giuseppe
Stefanelli Andrea
Academic year 2023/2024
Contents
- Distributed Systems
- Transactions
- ACID transactions
- BASE transactions
- CAP Theorem
- Data Temperature
- Recoverability
- Distributed Systems
- Cost of Data Management
- DynamoDB
- Consistent Hashing
- Data Warehouse
- NoSQL
- In-Memory Databases (IMDBs)
- Streaming Data
- Columnar Databases
- Wide Column Store
- Graph Databases
- Diagrams
- Key Aspects of Data Management
- Unified Modeling Language
- C.A.S.E Tools
Distributed Systems Overview
A DBMS increasingly stores and recovers data on distributed systems. It is
therefore particularly important to understand the properties of a distributed
system, how files are replicated across it, and the difference between scaling
a distributed system up and scaling it out.
Properties of Distributed Systems
The properties that describe a distributed system are:
- Consistency: Ensuring all users see the same data at the same time,
maintaining uniformity across the system.
- Availability: Ensuring the system is accessible and operational when-
ever it's needed, minimizing downtime.
- Partition Tolerance: The ability to continue functioning despite
network failures or splits, allowing for seamless communication.
- Reliability: Consistently delivering correct results, even in the face
of failures or errors.
- Latency: The time it takes for data to travel from one point to an-
other within the system, affecting responsiveness.
- Throughput: The rate at which the system can process and handle
incoming requests or transactions, indicating its capacity.
- Resilience: The ability to recover and adapt to failures, maintaining
functionality and performance.
- Access Control: Managing permissions and restrictions to ensure
that only authorized users or components can access resources or per-
form actions.
Replication in Distributed Systems
Replication, in the context of distributed systems, involves creating and
maintaining multiple copies of data across different nodes or servers. This
redundancy serves several purposes, including improving data availability,
fault tolerance, and performance. It is possible to have:
- Full Replication: In full replication, every piece of data is duplicated
across all nodes in the system. This means that each node contains
a complete copy of the entire dataset. Full replication ensures high
availability and fault tolerance since any node can serve requests inde-
pendently, even if other nodes fail. However, it comes with the cost of
increased storage requirements and synchronization overhead, as every
update must be propagated to all replicas to maintain consistency.
- Zero Replication: Zero replication, also known as single-copy con-
sistency, involves having only one copy of each data item in the entire
distributed system. This approach simplifies data management by
eliminating the need for synchronization between replicas. However,
it also introduces a single point of failure since if the node containing
the data fails, it becomes unavailable until it is restored. Zero replica-
tion is suitable for scenarios where data consistency is critical, and the
cost of replication overhead is deemed too high. However, it sacrifices
fault tolerance and availability compared to replication strategies.
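The contrast between the two strategies can be sketched in a few lines of Python. This is a minimal, illustrative model (the node names and the dictionary-based store are assumptions, not a real storage engine): with full replication every write must reach all replicas, and in exchange any single node can answer a read on its own.

```python
# Minimal sketch of full replication: every write is propagated to all
# replica nodes, so any single node can serve reads independently.
class ReplicatedStore:
    def __init__(self, node_names):
        # each node holds a complete copy of the dataset
        self.replicas = {name: {} for name in node_names}

    def write(self, key, value):
        # full replication: the update must reach every replica,
        # which is exactly the synchronization overhead described above
        for store in self.replicas.values():
            store[key] = value

    def read(self, key, node):
        # any one node can answer, even if the others have failed
        return self.replicas[node].get(key)

cluster = ReplicatedStore(["node-a", "node-b", "node-c"])
cluster.write("user:42", "Alice")
print(cluster.read("user:42", "node-c"))  # Alice
```

With zero replication, `write` would touch a single node instead of looping over all of them: cheaper per update, but the loss of that one node makes the key unavailable.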
Scaling Distributed Systems
The last key aspect of distributed systems is the distinction between scale-up
and scale-out:
- Scale-up (vertical scaling): Scale-up refers to increasing the capac-
ity of a single machine or node in a system. This is typically achieved
by adding more resources to the existing machine, such as upgrading
the CPU, adding more memory (RAM), or increasing storage capac-
ity. Vertical scaling allows a system to handle increased workloads by
making the existing hardware more powerful. However, there is a limit
to how much a single machine can be scaled up, and it can become
prohibitively expensive beyond a certain point.
- Scale-out (horizontal scaling): Scale-out involves adding more ma-
chines or nodes to a distributed system to increase its capacity. Instead
of making individual machines more powerful, scale-out distributes
the workload across multiple machines, each handling a portion of the
overall workload. Horizontal scaling offers better scalability because it
allows the system to grow by adding relatively inexpensive commodity
hardware. It also provides improved fault tolerance since the failure of
one machine does not necessarily disrupt the entire system. However,
horizontal scaling may require changes to the architecture of the sys-
tem to distribute and manage the workload effectively across multiple
nodes.
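A common way to distribute work in a scale-out architecture is to route each request to one of the nodes deterministically, for example by hashing a request key. The sketch below is an assumption about one possible routing scheme (simple modulo hashing, not the consistent hashing discussed later in the course); it shows how the load spreads roughly evenly over N commodity nodes.

```python
# Sketch of scale-out load distribution: each request is routed to one
# of N nodes by hashing its key, so every node handles about 1/N of
# the total workload.
from collections import Counter

def route(key, nodes):
    # deterministic assignment: the same key always lands on the same node
    return nodes[hash(key) % len(nodes)]

nodes = ["node-0", "node-1", "node-2", "node-3"]
load = Counter(route(f"request-{i}", nodes) for i in range(10_000))
print(load)  # roughly 2500 requests per node
```

Note that with plain modulo hashing, adding or removing a node reassigns most keys; consistent hashing (covered later) exists precisely to avoid that.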
Transactions in Distributed Systems
In distributed systems, we can choose between two kinds of transactions:
ACID transactions (used in relational databases) and BASE transactions
(used in distributed systems and NoSQL databases).
ACID Transactions
ACID stands for Atomicity, Consistency, Isolation, and Durability: the
properties that ensure reliable and predictable transaction processing in a
database system:
- Atomicity: ensures that a transaction is treated as a single unit of
work, meaning that either all its operations are successfully completed
and committed, or none of them are.
- Consistency: ensures that the database remains in a valid state be-
fore and after the execution of a transaction. In other words, trans-
actions should only transition the database from one valid state to
another.
- Isolation: ensures that the execution of multiple transactions concur-
rently does not interfere with each other, maintaining data integrity
and preventing concurrency-related issues such as dirty reads, non-
repeatable reads, and phantom reads.
- Durability: guarantees that once a transaction is committed, its
changes are permanently saved and will not be lost, even in the event
of a system failure or crash.
Figure 1: ACID transaction scheme (state diagram: a transaction in the Active
state performs read/write operations and moves to the Partially Committed
state; once its changes reach the permanent store it enters the Committed
state. A failure leads to the Failed state, a rollback, and then the Aborted
state; every path ends in the Terminated state.)
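Atomicity can be demonstrated with Python's built-in sqlite3 module, whose connections act as transaction context managers. In this sketch (the table and balances are illustrative), a simulated failure inside the transaction rolls back both updates, so the database never shows a half-applied transfer.

```python
# Sketch of atomicity with sqlite3: either both updates commit
# together, or the whole transaction is rolled back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute(
            "UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# neither update persisted: the transaction behaved as a single unit
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

This mirrors the state diagram in Figure 1: the failure moves the transaction from a partially committed state to the aborted state via rollback.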
BASE Transactions
BASE, on the other hand, stands for Basically Available, Soft state, and
Eventually consistent: principles often used in distributed systems to achieve
high availability and scalability, sacrificing some of the strict ACID
properties:
- Basically Available: the system should always be operational and
accessible, providing a best-effort response even in the face of failures
or network partitions.
- Soft state: it implies that the system's state may be transient and can
change over time, even without input. This allows for more flexibility
and scalability, as data consistency is relaxed.
- Eventually consistent: the system will eventually converge to a
consistent state after all updates are propagated and reconciled across
all nodes in the distributed system. This consistency model relaxes the
immediate consistency requirements of ACID transactions in favor of
improved availability and scalability.
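Eventual consistency can be illustrated with a toy reconciliation step. The sketch below assumes a last-write-wins policy based on per-key timestamps (one of several possible reconciliation strategies, not the only one): replicas accept writes independently, and a later merge converges them to the same state.

```python
# Illustrative last-write-wins reconciliation: each key maps to a
# (timestamp, value) pair, and the newest write per key survives.
def reconcile(replicas):
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.items():
            # keep the entry with the highest timestamp for each key
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # propagate the merged state back so all replicas converge
    for replica in replicas:
        replica.clear()
        replica.update(merged)

a = {"x": (1, "old")}
b = {"x": (2, "new"), "y": (1, "only-on-b")}
reconcile([a, b])
print(a == b)   # True: both replicas converged
print(a["x"])   # (2, 'new')
```

Between writes and the reconciliation step the replicas disagree: that transient disagreement is exactly the "soft state" that BASE permits.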
CAP Theorem
The CAP theorem, also known as Brewer's theorem, is a fundamental prin-
ciple in the design of distributed systems. It states that it is impossible for
a distributed data store to simultaneously provide all three of the following
guarantees:
- Consistency: Every read receives the most recent write or an
error. In other words, all nodes in the system see the same data at the
same time.
- Availability: Every request (read or write) receives a response (suc-
cess or failure), regardless of the current state of any individual node.
The system remains operational and accessible even in the presence of
failures.
- Partition Tolerance: The system continues to operate despite ar-
bitrary message loss or failure of part of the system. This means the
system can handle network partitions where communication between
some nodes is lost.
CAP Theorem Trade-offs
According to the CAP theorem, a distributed system can only provide
two of these three guarantees simultaneously, but not all three. Here is a
breakdown of the trade-offs:
- Consistency and Availability (CA): These systems ensure that
every read receives the most recent write and the system remains op-
erational, but they cannot guarantee partition tolerance. This means
that in the event of a network partition, the system may become un-
available or inconsistent. Traditional relational databases often fall
into this category.
- Consistency and Partition Tolerance (CP): These systems en-
sure consistency and can handle network partitions but may sacrifice
availability. During a partition, the system may become unavailable
until the partition is resolved to maintain consistency. Examples in-
clude some distributed databases like HBase and MongoDB in certain
configurations.
- Availability and Partition Tolerance (AP): These systems ensure
that the system remains operational and can handle partitions but
may not always return the most recent write (eventual consistency).
The system continues to function and respond to requests, even if it
sacrifices immediate consistency. Examples include DynamoDB and
Cassandra.
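The AP trade-off can be made concrete with a toy two-replica model. This sketch is an assumption about how such a system might behave, not a model of any specific database: during a partition the replica keeps answering reads (availability), but may return a stale value until the partition heals (eventual, not immediate, consistency).

```python
# Toy AP behaviour: a replica stays readable during a partition,
# at the cost of possibly serving stale data.
class Replica:
    def __init__(self):
        self.data = {}
        self.partitioned = False

    def write(self, key, value, peer=None):
        self.data[key] = value
        # replication to the peer only succeeds while the link is up
        if peer is not None and not self.partitioned:
            peer.data[key] = value

r1, r2 = Replica(), Replica()
r1.write("k", "v1", peer=r2)   # replicated normally
r1.partitioned = True          # network partition begins
r1.write("k", "v2", peer=r2)   # update cannot reach r2
print(r2.data["k"])            # v1: still available, but stale
```

A CP system would instead refuse to answer (or refuse the write) on `r2` during the partition, trading availability for consistency.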
Data Temperature in Distributed Systems
Data temperature in the context of storage management within a dis-
tributed system refers to how frequently and how recently data is accessed.
Data is typically categorized into hot, warm, and cold:
- Hot Data: frequently accessed and often consists of real-time or near-
real-time data. It is typically stored on the fastest storage media
available, such as solid-state drives (SSDs), to ensure quick read and
write operations.
- Warm Data: accessed less frequently than hot data but still needs
to be readily accessible. It might be stored on slightly slower but still
relatively quick storage media, such as high-speed hard disk drives
(HDDs) or lower-tier SSDs.
- Cold Data: infrequently accessed data, often archived or historical.
It can be stored on slower, more cost-effective storage media, such as
lower-speed HDDs.
Figure 2: Categories of data (hot storage is IoT-optimized and serves devices,
aggregations, dashboards and custom UIs; warm storage is scale-optimized and
serves offline analytics and AI/ML training; cold storage is cost-optimized
and serves the data lake, backups/archiving and compliance)
Managing Data Temperature
Managing data temperature effectively in a distributed system involves:
- Data Lifecycle Management: Automating the process of moving
data between several types of storage as its "temperature" changes.
- Tiered Storage Architecture: Setting up a hierarchical storage
system where data can be dynamically allocated to different storage
types based on access patterns.
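A simple lifecycle policy can be expressed as a rule mapping access recency to a tier. The thresholds below (7 and 90 days) are illustrative assumptions; real tiering policies would also weigh access frequency, object size, and cost.

```python
# Sketch of a temperature-based tiering rule: the storage tier is
# chosen from how recently an item was last accessed.
def tier_for(days_since_last_access):
    if days_since_last_access <= 7:
        return "hot"   # fastest media, e.g. SSD
    if days_since_last_access <= 90:
        return "warm"  # mid-tier media, e.g. high-speed HDD
    return "cold"      # cheap archival media

print(tier_for(1))    # hot
print(tier_for(30))   # warm
print(tier_for(365))  # cold
```

Data lifecycle management then amounts to re-evaluating this rule periodically and migrating objects whose tier has changed.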