Mastering Data Lake Federation with Open-Source Tools: a case study developed by Beni Gal-Janko, Lead Systems Engineer at EPAM Romania


Updated: 3 days ago


In today’s data landscape, organizations face a complex challenge: how to unify access and governance across sprawling, multi-cloud data lakes. Beni Gal-Janko, Lead Systems Engineer at EPAM Romania, elaborated on data lake federation, a strategy that brings structure to chaos using proven open-source tools.



At the core of this approach is Apiary, a Terraform-powered framework designed to deploy a federated data lake with enterprise-grade scalability. It orchestrates key services such as Hive Metastore and Apache Ranger, and integrates with tools such as Waggle Dance and Beekeeper. Its modular architecture and support for containerized deployments make it a go-to foundation for handling petabyte-scale datasets.



Beekeeper steps in as your lake’s cleanup crew. This event-driven service monitors changes in the Hive Metastore and automates the deletion of orphaned data files and expired metadata. It comprises multiple components—including a scheduler, path cleanup, and metadata cleanup services—that work together to keep your storage lean, performant, and compliant. Beekeeper is designed for high availability and integrates seamlessly with Apiary, ensuring that cleanup operations scale with your data.
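The split between scheduling and cleanup described above can be pictured with a small sketch. This is a conceptual illustration only, not Beekeeper's code or API: the `CleanupQueue` class, the in-memory list, the three-day delay, and the example path are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class ScheduledPath:
    path: str
    cleanup_at: datetime
    deleted: bool = False


class CleanupQueue:
    """Hypothetical in-memory stand-in for the state shared by a scheduler and a cleanup job."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.entries: List[ScheduledPath] = []

    def schedule(self, path: str, event_time: datetime) -> None:
        # Scheduler side: record the orphaned path together with a grace period.
        self.entries.append(ScheduledPath(path, event_time + self.delay))

    def run_cleanup(self, now: datetime, delete: Callable[[str], None]) -> List[str]:
        # Cleanup side: delete only paths whose grace period has elapsed.
        removed = []
        for entry in self.entries:
            if not entry.deleted and entry.cleanup_at <= now:
                delete(entry.path)
                entry.deleted = True
                removed.append(entry.path)
        return removed


# Usage: a metastore event orphans a location; it is removed only after the delay.
queue = CleanupQueue(delay=timedelta(days=3))
event_time = datetime(2024, 1, 1)
queue.schedule("s3://example-bucket/db/table/old-location", event_time)

deleted_paths: List[str] = []
too_early = queue.run_cleanup(event_time + timedelta(days=1), deleted_paths.append)
after_delay = queue.run_cleanup(event_time + timedelta(days=4), deleted_paths.append)
```

Keeping a delay between the event and the deletion gives in-flight queries time to finish reading the old location before it disappears.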



Waggle Dance simplifies metadata access across multiple Hive deployments. Acting as a request-routing proxy, it provides a unified Thrift endpoint that allows clients to interact with multiple metastores as if they were a single system. It dynamically maps virtual databases to their corresponding backends, making it easier to query federated data without needing to manage the complexity underneath. With support for platforms like EMR, Qubole, and Databricks, Waggle Dance improves collaboration, eliminates duplication, and brings consistency to distributed analytics environments.
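One way to picture the mapping from virtual databases to backends is routing by database-name prefix. The sketch below is a simplified illustration under that assumption, not Waggle Dance's implementation or configuration format; the `MetastoreRouter` class, the prefixes, and the Thrift URIs are hypothetical.

```python
from typing import Dict, Tuple


class MetastoreRouter:
    """Hypothetical router: maps a virtual database name to (backend URI, real database name)."""

    def __init__(self, primary_uri: str, federated: Dict[str, str]):
        self.primary_uri = primary_uri
        # Check longer prefixes first so "sales_eu_" wins over "sales_".
        self.federated = sorted(
            federated.items(), key=lambda item: len(item[0]), reverse=True
        )

    def resolve(self, virtual_db: str) -> Tuple[str, str]:
        for prefix, uri in self.federated:
            if virtual_db.startswith(prefix):
                # Strip the prefix before forwarding to the backend metastore.
                return uri, virtual_db[len(prefix):]
        # Databases without a known prefix go to the primary metastore.
        return self.primary_uri, virtual_db


router = MetastoreRouter(
    primary_uri="thrift://primary-metastore:9083",
    federated={
        "emea_": "thrift://emea-metastore:9083",
        "apac_": "thrift://apac-metastore:9083",
    },
)

emea_target = router.resolve("emea_sales")
primary_target = router.resolve("default")
```

A client only ever sees the single endpoint and the virtual names; the routing table is the one place that knows which backend owns which database.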



And for when data needs to move? Circus Train handles safe, selective replication of Hive tables between clusters and clouds. With features like snapshotting, encryption and partition-level control, it’s an essential piece for disaster recovery (DR), migration, or hybrid deployments.
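Partition-level control combined with snapshotting can be sketched as a planning step: pick the partitions that pass a filter and give each a destination under a fresh snapshot directory. This is an illustrative sketch only and does not reflect Circus Train's configuration or API; `plan_replication`, the filter, the snapshot identifier, and the bucket names are assumptions.

```python
from typing import Callable, Dict, Tuple


def plan_replication(
    source_partitions: Dict[str, str],
    keep: Callable[[str], bool],
    replica_base: str,
    snapshot_id: str,
) -> Dict[str, Tuple[str, str]]:
    """Return {partition spec: (source location, replica location)} for selected partitions."""
    plan = {}
    for spec, source_location in sorted(source_partitions.items()):
        if keep(spec):
            # Each run writes under its own snapshot directory, so data from
            # earlier runs is never overwritten in place.
            plan[spec] = (source_location, f"{replica_base}/{snapshot_id}/{spec}")
    return plan


source = {
    "dt=2024-01-01": "s3://source-bucket/sales/dt=2024-01-01",
    "dt=2024-01-02": "s3://source-bucket/sales/dt=2024-01-02",
    "dt=2024-01-03": "s3://source-bucket/sales/dt=2024-01-03",
}

# Replicate only partitions from 2 January onwards (hypothetical filter).
plan = plan_replication(
    source,
    keep=lambda spec: spec >= "dt=2024-01-02",
    replica_base="s3://replica-bucket/sales",
    snapshot_id="snapshot-0001",
)
```

Once the selected data has been copied to the new locations, the replica table's metadata can be pointed at them in a single step, so readers never see a half-copied table.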


Together, these tools transform fragmented infrastructure into a cohesive, scalable and efficient data lake ecosystem—open-source style. They empower teams to move faster, reduce operational overhead, and focus on extracting value from data instead of wrestling with it. Whether you're starting your cloud journey or modernizing an enterprise-scale platform, mastering these open-source technologies can be a game-changer.


Razvan Podovei, Head of Cloud Practice at EPAM Romania, comments on Beni’s work:


“I am proud to endorse the work of one of our AWS Senior Engineers who has put the effort into explaining exactly how we here, in EPAM Romania, not only contribute to the growth and success of our customers but also provide valuable deliverables that are now open source materials and used across the world in Big Data scenarios. This is above everything else a testimony to the commitment of EPAM towards pushing for world class quality in its project delivery strategies but also making the Dev domain a better place through contributions that make a difference in real-world scenarios.”

 

Since 1993, EPAM Systems, Inc. (NYSE: EPAM) has used its software engineering expertise to become a leading global provider of digital engineering, cloud and AI-enabled transformation services and a leading business and experience consulting partner for global enterprises and ambitious startups. We deliver globally, but engage locally with our expert teams of consultants, architects, designers and engineers, making the future real for our clients, our partners and our people around the world. EPAM extended its competencies in Romania in 2020 and for the past 5 years we have been operating remotely from several cities in Romania, having offices in Bucharest, Iasi and Cluj. Here we collaborate with many of the world’s leading global brands, while having unique opportunities for growth and development.


Added to the S&P 500 and the Forbes Global 2000 in 2021 and recognized by Glassdoor and Newsweek as a Most Loved Workplace, our multidisciplinary teams serve customers across six continents. We are proud to be among the top 15 companies in Information Technology Services in the Fortune 1000 and to be recognized as a leader in the IDC MarketScapes for Worldwide Experience Build Services, Worldwide Experience Design Services and Worldwide Software Engineering Services.

 
 
 