Data Management Systems

This line of work is dedicated to the design of data management systems, including traditional relational databases, distributed data stores, in-memory data stores, and software-defined storage systems.

In-memory data stores

The focus of our work is on a key application in which in-memory data stores have seen widespread use: data caching. Modern Web-scale applications attract very large numbers of users, who expect services to be responsive at all times: latency plays a crucial role in the perceived Quality of Experience (QoE), which determines to a large extent the popularity and success of competing services. While the problem of caching has received a lot of attention from the computer science and architecture communities, several problems have remained largely unaddressed:

  • Data calcification, which arises in realistic settings where the demand for data evolves over time
  • Non-uniform data access costs, which permeate most modern Web services
  • Allocation of memory resources, which calls for provably optimal caching policies

We attack these problems both from a theoretical point of view (using tools from probabilistic algorithm design and analysis) and from a practical perspective, with applications to widely used software such as Memcached. Our work has also attracted the attention of key players in the content delivery domain, in particular Akamai Technologies.
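
As an illustration of how non-uniform access costs can enter an eviction decision, below is a minimal, single-node sketch in the spirit of the classical GreedyDual policy; the class name, cost values and capacity are purely illustrative, and this is not the exact policy or Memcached modification developed in our work.

    # Minimal sketch of a cost-aware eviction policy in the spirit of GreedyDual
    # (a classical scheme; illustrative only, not the exact policy studied here).
    class CostAwareCache:
        def __init__(self, capacity):
            self.capacity = capacity   # maximum number of cached objects
            self.inflation = 0.0       # aging value, raised on every eviction
            self.store = {}            # key -> cached value
            self.priority = {}         # key -> inflation + retrieval cost

        def get(self, key, retrieval_cost, fetch):
            if key in self.store:
                # Hit: refresh the priority so costly objects stay longer.
                self.priority[key] = self.inflation + retrieval_cost
                return self.store[key]
            # Miss: evict the lowest-priority object if the cache is full.
            if len(self.store) >= self.capacity:
                victim = min(self.priority, key=self.priority.get)
                self.inflation = self.priority[victim]  # age remaining entries lazily
                del self.store[victim], self.priority[victim]
            value = fetch(key)
            self.store[key] = value
            self.priority[key] = self.inflation + retrieval_cost
            return value

    # Example: objects that are expensive to re-fetch survive eviction longer.
    cache = CostAwareCache(capacity=2)
    cache.get("a", retrieval_cost=1.0, fetch=lambda k: k.upper())
    cache.get("b", retrieval_cost=10.0, fetch=lambda k: k.upper())
    cache.get("c", retrieval_cost=1.0, fetch=lambda k: k.upper())  # evicts "a", not the costly "b"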

RAW data processing

In recent years, modern large-scale data analysis systems have flourished. The batch-oriented nature of such systems has been complemented by additional components that offer (near) real-time analytics on data streams, and a serving layer to expose processed data to end-users. The combination of these approaches is now commonly known as the Lambda or Kappa Architecture. One of the major limitations of current approaches is that they require an expensive transform/load phase to, e.g., move data from a distributed file system to an RDBMS, which might be impossible to amortize, in particular in scenarios with a narrow processing window, i.e., when working on temporary data.

Although many SQL-on-Hadoop systems have emerged recently, they are not well suited to (short-lived) ad-hoc queries, especially when the data remains in its native, uncompressed format, such as text-based CSV files. To achieve high performance, these systems prefer to convert data into their own column-based formats, e.g., ORC and Parquet. This works well when both data and analytic queries (that is, the full workload) are in their final production stage. Indeed, these self-describing, optimized data formats play an increasing role in modern data analytics, especially once data has been cleansed, queries have been well designed, and analytics algorithms have been tuned.
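
For concreteness, the conversion step that such systems favor typically looks like the following minimal PySpark sketch (paths and options are illustrative); it is precisely this extra load phase that may never be amortized on temporary data.

    # Illustrative PySpark snippet: one-off conversion from raw CSV to a
    # columnar, self-describing format (Parquet).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read the raw text data; schema inference requires an extra pass over the file.
    df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

    # Write the columnar copy; this load phase is what may be hard to amortize
    # for temporary data with a narrow processing window.
    df.write.mode("overwrite").parquet("hdfs:///data/events.parquet")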

However, when users perform data exploration tasks and algorithm tuning, that is, when the data is temporary, the original data format typically remains unchanged: premature data format optimization is avoided, and simple text-based formats such as CSV and JSON files are preferred. In this scenario, current integrated data analytics systems can under-perform. Notably, they often fail to leverage decades-old techniques for optimizing the performance of (distributed) RDBMSs, such as indexing, which is usually not supported.

We thus propose DiNoDB, an interactive-speed query engine that addresses the above issues. Our approach is based on a seamless integration of batch processing systems (e.g., Hadoop MapReduce and Apache Spark) with a distributed, fault-tolerant and scalable interactive query engine for in-situ analytics on temporary data.
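
The following minimal, single-node sketch conveys the flavor of in-situ indexing: a lightweight positional map from a key to a byte offset lets point queries seek directly into the raw CSV file, without converting it. The function names and metadata layout are illustrative and do not reflect DiNoDB's actual design.

    # Build a positional map over a raw CSV file so that point queries can seek
    # directly to the matching row, without loading or converting the data.
    def build_positional_map(path, key_column):
        """Map each key to the byte offset of its row in the raw CSV file."""
        offsets = {}
        with open(path, "rb") as f:
            header = f.readline().decode().rstrip("\r\n").split(",")
            key_idx = header.index(key_column)
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                fields = line.decode().rstrip("\r\n").split(",")
                offsets[fields[key_idx]] = offset
        return offsets

    def lookup(path, offsets, key):
        """In-situ point query: seek to the recorded offset and parse one row."""
        with open(path, "rb") as f:
            f.seek(offsets[key])
            return f.readline().decode().rstrip("\r\n").split(",")

    # offsets = build_positional_map("events.csv", key_column="user_id")
    # row = lookup("events.csv", offsets, "42")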

Data cleaning

A pervasive problem related to Big Data analytics is that of “dirty” data. Indeed, we live in a world where decisions are often based on analytics applications that process continuous streams of data. Typically, data streams are combined and summarized to obtain a succinct representation thereof: analytics applications rely on such representations to make predictions, and to create reports, dashboards and visualizations. All these applications expect the data, and its representations, to meet certain quality criteria. Data quality issues interfere with these representations and distort the data, leading to misleading analysis outcomes and potentially bad decisions. As a consequence, a range of data cleaning techniques have been proposed recently. However, most of them focus on “batch” data cleaning, processing static data stored in data warehouses, and thus neglect the important class of streaming data.

In this line of work, we focus on stream data cleaning. The challenge is that stream cleaning requires both real-time guarantees and high accuracy, requirements that are often at odds. We therefore designed a distributed stream data cleaning system, called Bleach, which achieves efficient and accurate cleaning in real time, while tackling the complications arising from the long-term and dynamic nature of data streams, in which the very definition of dirty data may change over time.
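
The toy sketch below conveys the idea of windowed, rule-based violation detection on a stream (here, a functional dependency zip → city checked within a sliding window); the rule, window size and single-node design are illustrative and much simpler than Bleach itself.

    # Windowed, rule-based stream cleaning: flag records that violate a
    # functional dependency (zip -> city) within a sliding window.
    from collections import deque

    class StreamCleaner:
        def __init__(self, window_size, lhs, rhs):
            self.window = deque(maxlen=window_size)  # recent records only
            self.lhs, self.rhs = lhs, rhs            # FD: lhs determines rhs

        def process(self, record):
            """Return True if the record is clean w.r.t. the current window."""
            key = record[self.lhs]
            clean = all(r[self.rhs] == record[self.rhs]
                        for r in self.window if r[self.lhs] == key)
            self.window.append(record)  # the window lets the rule "forget" old data
            return clean

    cleaner = StreamCleaner(window_size=1000, lhs="zip", rhs="city")
    print(cleaner.process({"zip": "06410", "city": "Biot"}))     # True
    print(cleaner.process({"zip": "06410", "city": "Antibes"}))  # False: FD violated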

Multi-query optimization

Users who interact with “big data” constantly face the problem of extracting insight and obtaining value from their data assets. Of course, humans cannot be expected to parse through terabytes of data. In fact, typical user interaction with big data happens through data summaries. A summary is obtained by grouping data along various dimensions (e.g., by location and/or time), and then showing aggregate functions of those data (e.g., count, sum, mean, etc.). Even graphical and interactive visualizations of data very often show aggregated results.
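
For illustration, a data summary is nothing more than an aggregate computed per group; the tiny sketch below (with invented field names) counts records grouped along two dimensions.

    # Tiny illustration of a data summary: group records along two dimensions
    # (location, hour) and compute an aggregate (count).
    from collections import Counter

    records = [
        {"location": "Nice", "hour": 9},
        {"location": "Nice", "hour": 9},
        {"location": "Paris", "hour": 10},
    ]
    summary = Counter((r["location"], r["hour"]) for r in records)
    # {('Nice', 9): 2, ('Paris', 10): 1} -- what a dashboard or chart would display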

On-Line Analytical Processing (OLAP) tools and techniques exist to facilitate data exploration, allowing users to perform “slicing and dicing” by grouping data along multiple dimensions. Despite the importance of data summarization, the field of data-intensive scalable computing (DISC) systems – where data can reach petabytes and be distributed on clusters of thousands of machines – has not seen much effort toward efficient implementations of OLAP primitives.

In this line of work, we tackle the general problem of optimizing data summarization: how to efficiently compute a set of multiple Group By queries. This problem is known to be NP-complete, and all state-of-the-art algorithms use heuristics to approximate the optimal solution. However, none of the prior works scales well with a large number of attributes and/or a large number of queries, which is the problem we explicitly attack in our work.
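
Echoing the toy example above, the sketch below illustrates the kind of sharing such an optimizer can exploit: the finest requested grouping is computed in a single scan, and coarser Group By results are derived from it by rolling up. This is a hand-picked plan for illustration, not our actual algorithm, which searches among many such sharing plans.

    # Shared computation across multiple Group By queries: compute the finest
    # grouping once, then derive coarser summaries by rolling up, avoiding
    # extra scans of the raw data.
    from collections import Counter

    records = [
        {"location": "Nice", "hour": 9},
        {"location": "Nice", "hour": 10},
        {"location": "Paris", "hour": 9},
    ]

    # Single scan of the data for the finest requested grouping.
    by_location_hour = Counter((r["location"], r["hour"]) for r in records)

    def roll_up(fine_summary, keep):
        """Project a fine-grained count summary onto a subset of its dimensions."""
        coarse = Counter()
        for dims, count in fine_summary.items():
            coarse[tuple(dims[i] for i in keep)] += count
        return coarse

    # The coarser summaries are derived without touching the raw data again.
    by_location = roll_up(by_location_hour, keep=[0])   # GROUP BY location
    by_hour = roll_up(by_location_hour, keep=[1])       # GROUP BY hour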

Software-defined storage systems

In this line of work, we study the I/O performance and fairness properties that private cloud deployments expose to a very specific and demanding class of applications, namely “Big Data” applications. In particular, we have developed a software framework that handles instrumentation, the execution of measurement campaigns, and the collection of raw log files, to gain a better understanding of the overheads associated with virtualization when compared to bare-metal performance. This work proceeds along two concurrent threads.

Simple storage scenario: in this case, we consider a single host and focus on low-level measurements to understand the intricate dependence between performance and system configuration. In particular, our active measurements emulate the workload imposed by a simple distributed file system, from the viewpoint of a single host. Our findings question current beliefs and best practices for configuring Big Data applications running in virtualized environments. Indeed, our main results show that virtualization overheads are often negligible from the I/O perspective, but that it is important to appropriately configure both the operating system and the applications.
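
As an illustration of the kind of low-level active measurement involved, the sketch below runs a sequential-write microbenchmark with a configurable block size and reports the achieved throughput; it is a deliberately simplified stand-in for the actual instrumentation framework (no direct I/O, no repetitions, no log collection).

    # Sequential-write microbenchmark: write a fixed amount of data in blocks
    # of a given size and report the application-level throughput.
    import os
    import time

    def sequential_write_throughput(path, total_mb=256, block_kb=1024):
        block = b"\0" * (block_kb * 1024)
        n_blocks = (total_mb * 1024) // block_kb
        start = time.monotonic()
        with open(path, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())      # make sure data actually reaches the device
        elapsed = time.monotonic() - start
        return total_mb / elapsed     # MB/s as seen by the application

    # print(sequential_write_throughput("/tmp/bench.dat"))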

Software-defined storage scenario: in this case, we consider the flexibility offered by the software-defined storage paradigm, and perform application-level performance measurements to understand the implications of a variety of storage layers, including traditional distributed file systems, object storage, and volume-based storage systems such as those commonly offered by public cloud vendors. Our findings indicate that, contrary to common belief, data locality does not seem to play a major role in determining the performance of individual Big Data applications. Instead, when a multi-tenant scenario is considered, concurrency hinders the task of allocating I/O resources; our analysis also indicates the existence of several impedance mismatches due to duplicated mechanisms (e.g., data replication) that interfere with each other, causing performance degradation.
