
Mastering Data Lake Federation with Open-Source Tools: a case study developed by Beni Gal-Janko, Lead Systems Engineer at EPAM Romania

Updated: Jul 1


In today’s data landscape, organizations face a complex challenge: how to unify access and governance across sprawling, multi-cloud data lakes. Beni Gal-Janko, Lead Software Engineer at EPAM Romania, elaborated on data lake federation, a strategy that brings structure to chaos using proven open-source tools.



At the core of this approach is Apiary, a Terraform-powered framework designed to deploy a federated data lake with enterprise-grade scalability. It orchestrates key services such as Hive Metastore and Apache Ranger, and integrates with tools such as Waggle Dance and Beekeeper. Its modular architecture and support for containerized deployments make it a go-to foundation for handling petabyte-scale datasets.



Beekeeper steps in as your lake’s cleanup crew. This event-driven service monitors changes in the Hive Metastore and automates the deletion of orphaned data files and expired metadata. It comprises multiple components—including a scheduler, path cleanup, and metadata cleanup services—that work together to keep your storage lean, performant, and compliant. Beekeeper is designed for high availability and integrates seamlessly with Apiary, ensuring that cleanup operations scale with your data.
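
The scheduling-then-cleanup flow described above can be illustrated with a minimal Python sketch. It is not Beekeeper's actual code or API: the `CleanupRecord` class, the `schedule` and `run_cleanup` functions, and the example paths are all hypothetical, and the real service deletes files from storage where this sketch only flips a flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CleanupRecord:
    """A path scheduled for deletion (hypothetical stand-in for a scheduler entry)."""
    path: str
    cleanup_after: datetime
    deleted: bool = False


def schedule(path: str, event_time: datetime, delay: timedelta) -> CleanupRecord:
    """Turn a metastore change event into a delayed cleanup record."""
    return CleanupRecord(path=path, cleanup_after=event_time + delay)


def run_cleanup(records: List[CleanupRecord], now: datetime) -> List[str]:
    """Mark every due record as deleted and return the paths that were cleaned."""
    cleaned = []
    for record in records:
        if not record.deleted and record.cleanup_after <= now:
            # A real service would delete the files from storage here.
            record.deleted = True
            cleaned.append(record.path)
    return cleaned
```

Separating the scheduling step from the cleanup step means a path is only removed after its delay has passed, and running the cleanup twice does not process the same path again.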



Waggle Dance simplifies metadata access across multiple Hive deployments. Acting as a request-routing proxy, it provides a unified Thrift endpoint that allows clients to interact with multiple metastores as if they were a single system. It dynamically maps virtual databases to their corresponding backends, making it easier to query federated data without needing to manage the complexity underneath. With support for platforms like EMR, Qubole, and Databricks, Waggle Dance improves collaboration, eliminates duplication, and brings consistency to distributed analytics environments.
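
To make the routing idea concrete, here is a small Python sketch of how a virtual database name might be mapped to a backing metastore. It assumes a prefix-based naming scheme; the metastore names, Thrift URIs, prefixes, and the `resolve` function are hypothetical and do not reproduce Waggle Dance's implementation or configuration format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Metastore:
    """One backing Hive Metastore (names and URIs here are hypothetical)."""
    name: str
    thrift_uri: str


# Hypothetical federation table: database-name prefix -> backing metastore.
PRIMARY = Metastore("primary", "thrift://primary-metastore:9083")
FEDERATED = {
    "sales_": Metastore("sales", "thrift://sales-metastore:9083"),
    "ops_": Metastore("ops", "thrift://ops-metastore:9083"),
}


def resolve(virtual_db: str) -> Tuple[Metastore, str]:
    """Map a virtual database name to (backing metastore, local database name)."""
    # Check longest prefixes first so overlapping prefixes resolve predictably.
    for prefix in sorted(FEDERATED, key=len, reverse=True):
        if virtual_db.startswith(prefix):
            return FEDERATED[prefix], virtual_db[len(prefix):]
    # Anything without a known prefix is served by the primary metastore.
    return PRIMARY, virtual_db
```

In this sketch a client asking for `sales_orders` is routed to the `sales` metastore and sees the database `orders` there, while an unprefixed name such as `analytics` falls through to the primary metastore.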



And for when data needs to move? Circus Train handles safe, selective replication of Hive tables between clusters and clouds. With features like snapshotting, encryption and partition-level control, it’s an essential piece for disaster recovery (DR), migration, or hybrid deployments.
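
Partition-level control can be pictured as choosing which partitions of a table to copy before any data moves. The Python sketch below shows that selection step only; the `select_partitions` function, the `ds` partition key, and the optional `limit` parameter are hypothetical illustrations, not Circus Train's configuration or API.

```python
from typing import Callable, Dict, List, Optional

# A partition is modelled as its key/value spec, e.g. {"ds": "2024-01-02"}.
Partition = Dict[str, str]


def select_partitions(
    partitions: List[Partition],
    keep: Callable[[Partition], bool],
    limit: Optional[int] = None,
) -> List[Partition]:
    """Return the partitions that pass the filter, optionally capped at `limit`."""
    selected = [p for p in partitions if keep(p)]
    # A cap keeps a single replication run bounded on very large tables.
    return selected if limit is None else selected[:limit]
```

A replication job built this way would copy only the selected partitions, for example everything with `ds` on or after a given date, rather than the whole table.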


Together, these tools transform fragmented infrastructure into a cohesive, scalable and efficient data lake ecosystem—open-source style. They empower teams to move faster, reduce operational overhead, and focus on extracting value from data instead of wrestling with it. Whether you're starting your cloud journey or modernizing an enterprise-scale platform, mastering these open-source technologies can be a game-changer.


Razvan Podovei, Head of Cloud Practice at EPAM Romania, comments on Beni’s work:


“I am proud to endorse the work of one of our AWS Senior Engineers who has put the effort into explaining exactly how we here, in EPAM Romania, not only contribute to the growth and success of our customers but also provide valuable deliverables that are now open source materials and used across the world in Big Data scenarios. This is above everything else a testimony to the commitment of EPAM towards pushing for world class quality in its project delivery strategies but also making the Dev domain a better place through contributions that make a difference in real-world scenarios.”

 

Since 1993, EPAM Systems, Inc. (NYSE: EPAM) has used its software engineering expertise to become a leading global provider of digital engineering, cloud and AI-enabled transformation services and a leading business and experience consulting partner for global enterprises and ambitious startups. We deliver globally, but engage locally with our expert teams of consultants, architects, designers and engineers, making the future real for our clients, our partners and our people around the world. EPAM extended its competencies in Romania in 2020, and for the past 5 years we have been operating remotely from several cities in Romania, with offices in Bucharest, Iasi and Cluj. Here we collaborate with many of the world’s leading global brands, while having unique opportunities for growth and development.


Added to the S&P 500 and the Forbes Global 2000 in 2021 and recognized by Glassdoor and Newsweek as Most Loved Workplace, our multidisciplinary teams serve customers across six continents. We are proud to be among the top 15 companies in Information Technology Services in the Fortune 1000 and to be recognized as a leader in the IDC MarketScapes for Worldwide Experience Build Services, Worldwide Experience Design Services and Worldwide Software Engineering Services.

 
 
 
