Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.
If not managed well, it slows down the performance of your data systems and leaves you with stale analytics. This kind of defeats the purpose, doesn’t it? HDFS stores small files inefficiently, leading to wasted NameNode memory, excessive RPC calls, degraded block-scanning throughput, and reduced application-layer performance. If you are a big data administrator on any modern data lake, you will invariably come face to face with the problem of small files. Distributed file systems are great, but let’s face it: the more files your data is split into, the greater the overhead of reading it. So the idea is to optimize the file size to best serve your use case, while also actively optimizing your data lake.
Let’s take the case of HDFS, a distributed file system that is part of the Hadoop infrastructure and is designed to handle large data sets. In HDFS, data is distributed over several machines and replicated to optimize parallel processing. Because data and metadata are stored separately, every file that is created, irrespective of its size, consumes NameNode memory for its metadata. Small files are files smaller than one HDFS block, typically 128 MB. A small file, even one as small as 1 KB, places excessive load on the NameNode (which translates file system operations into block operations on the DataNodes) and consumes as much metadata space as a 128 MB file. Smaller files also mean smaller effective clusters, because there are practical limits on the number of files (irrespective of size) that a single NameNode can manage.
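To see why size doesn’t matter to the NameNode, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file or block object; the exact overhead varies by Hadoop version, so treat the numbers as illustrative only.

```python
BYTES_PER_NAMENODE_OBJECT = 150          # rule-of-thumb heap cost per file/block object
BLOCK_SIZE = 128 * 1024 * 1024           # default HDFS block size (128 MB)
TOTAL_DATA = 1 * 1024 ** 4               # 1 TB of raw data

def namenode_heap_bytes(file_size: int, total_data: int = TOTAL_DATA) -> int:
    """Estimate NameNode heap needed to track total_data bytes stored as
    files of file_size bytes (one file object plus its block objects)."""
    num_files = total_data // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))   # ceiling division
    return num_files * (1 + blocks_per_file) * BYTES_PER_NAMENODE_OBJECT

for size, label in [(1024, "1 KB files"), (128 * 1024 * 1024, "128 MB files")]:
    heap_mb = namenode_heap_bytes(size) / 2**20
    print(f"1 TB stored as {label}: ~{heap_mb:,.0f} MB of NameNode heap")
```

Hundreds of gigabytes of heap for a terabyte of 1 KB files, versus a couple of megabytes for block-sized files, is exactly why a single NameNode hits practical limits on object counts long before it runs out of disk.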
As the admin for an HDFS system you may already understand why and how small files are created. It could be every clickstream event dumped in the data lake, large volumes of individual images (which can’t be logically merged into a larger file), the practice of storing every configuration file, or simply placeholder files (essentially empty files) used to track the output status of every job and never deleted. There are a number of tasks that Hadoop admins perform to (1) identify the number of small files, (2) identify who is creating the small files, and (3) perform general cleanup of the small files, including compaction and deletion.
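As a concrete starting point for tasks (1) and (2), here is a minimal sketch that tallies small files by owner and by parent directory. It assumes the `hdfs` CLI is on the PATH, that `/data` is a stand-in for your lake root, and that the output follows the standard eight-column `hdfs dfs -ls -R` format; it is a rough inventory tool, not a production scanner.

```python
import subprocess
from collections import Counter

BLOCK_SIZE = 128 * 1024 * 1024   # files below one block are treated as "small"

def small_file_report(root: str = "/data") -> None:
    """Tally small files by owner and parent directory using `hdfs dfs -ls -R`."""
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", root],
        capture_output=True, text=True, check=True,
    ).stdout

    by_owner, by_dir = Counter(), Counter()
    for line in out.splitlines():
        parts = line.split(None, 7)
        # Skip directories and any lines that don't match the 8-column format.
        if len(parts) != 8 or parts[0].startswith("d"):
            continue
        _perms, _repl, owner, _group, size, _date, _time, path = parts
        if int(size) < BLOCK_SIZE:
            by_owner[owner] += 1
            by_dir[path.rsplit("/", 1)[0]] += 1

    print("Small files by owner:", by_owner.most_common(10))
    print("Small files by directory:", by_dir.most_common(10))

if __name__ == "__main__":
    small_file_report()
```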
There is no easy tooling on top of HDFS to see how many files are present, what the size of each file or directory is, or how and where users are creating files. Now think of this in terms of a system that spans multiple clusters and regions, with petabytes of data taking up storage and slowing down performance. Not only are you left with wasted storage, but any jobs that you run, such as Hive and MapReduce, are also slowed down. Requirements have changed, too: with the data lake concept, users want latency brought down so that they can even serve a mobile application from their data lake platform (i.e., sub-second response times). The time windows within which people want to use this infrastructure keep shrinking. So your optimal block size may even be 64 MB, depending on the use case. An extreme case is dealing with file sizes of 1 KB, which is common with IoT or sensor data, where you might receive a file every 200 milliseconds and want to create a file for every minute; these are still very small files and won’t cross 10 KB. So while small files can’t be avoided, they can be managed so that you keep them at a lower percentage than the bigger files. It’s a continuous process and requires a maintenance cycle.
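When a directory does need compaction as part of that maintenance cycle, one common approach is to rewrite it into a number of partitions sized to land near the block size. The PySpark sketch below assumes Parquet input, hypothetical paths, and the `hdfs` CLI for sizing; because compression ratios change on rewrite, the partition count is only an approximation.

```python
import math
import subprocess
from pyspark.sql import SparkSession

TARGET_FILE_SIZE = 128 * 1024 * 1024     # aim for roughly one HDFS block per output file

# Hypothetical paths -- substitute your own lake layout and file format.
src = "/data/events/2024-01-01"
dst = "/data/events_compacted/2024-01-01"

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Total bytes under the source directory, taken from the first column of `hdfs dfs -du -s`.
du = subprocess.run(["hdfs", "dfs", "-du", "-s", src],
                    capture_output=True, text=True, check=True).stdout
total_bytes = int(du.split()[0])

# One output partition per ~128 MB of input, so each written file is roughly block-sized.
num_partitions = max(1, math.ceil(total_bytes / TARGET_FILE_SIZE))

(spark.read.parquet(src)
      .repartition(num_partitions)
      .write.mode("overwrite")
      .parquet(dst))
```

Once the compacted copy is validated, the original directory of small files can be swapped out and deleted, which is the cleanup half of the cycle.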
An ideal solution would be a system that is software agnostic and gives a cross-sectional view across a firm’s entire Big Data infrastructure. The problem companies currently face is that existing tools don’t provide this.
A company will typically run multiple single-threaded application management tools that do not collectively address the problems that scaling a Big Data system can bring. What would be ideal is a software-agnostic tool that monitors across Spark, Kafka, Hive, Hadoop, Presto, etc. and offers a cross-sectional, multi-dimensional, real-time view.
To answer the question we started with: yes, small files can crush big data, but there are steps you can take to manage them, including timely identification, compaction, compression, and deletion. Some approaches are manual, time-consuming, and rely on makeshift software, while others require you to make an investment in data observability, and in your future.
We recently worked with a customer that had over 40 PB of data with Spark and MapReduce workloads. That volume caused reads through the catalog to take many seconds, roughly 100 times longer than they should ideally take. They also had a resource-intensive maintenance cycle that involved going through each folder, figuring out what files were there and in which locations they might need to compress them. Just identifying which files to compress was taking a lot of time. Data observability made identifying small files dead simple, and the maintenance cycle went from 12 hours to under 15 minutes.
Maintenance comes with a cost, and that cost is very high when the cycle runs every seven days; in some cases it is carried out every day. With Pulse, even once you factor in licensing and compute, the cost of this particular feature comes down to a very small fraction of that.
Another client wanted to send regular reports about their maintenance cycles. Previously, log files were collected, collated, and sent to the modeling team for analysis. With a data observability application you get all of this context in a single place, with multi-dimensional visibility, so an Ops resource can see a problem as it comes in, not hours or months later. Managing small files in your data lake offers significant benefits, from reduced cost to faster problem resolution, and these cascade down to the rest of your business. To realize these benefits, or to learn more about optimizing your data lakes, explore how data observability can help.