LinkedIn has announced it is open sourcing its control plane for managing tables in data lakehouse deployments.
The tool, called OpenHouse, has been in use at LinkedIn for the past year. The company currently has 3,500 OpenHouse tables in production.
It was designed to offer self-service management of tables in open data lakehouses. According to LinkedIn, the company ran into challenges internally because it lacked a good managed experience for running data lakehouses. As a result, end users often had to deal with low-level infrastructure concerns, which took time away from working on their products.
“Overall, since rolling out OpenHouse, we’ve seen drastic reduction in operational toil for data infra teams, improved developer experience for data infra customers, and enhanced governance for LinkedIn’s data,” Sumedh Sakdeo, senior staff software engineer at LinkedIn and creator of OpenHouse, wrote in a blog post.
OpenHouse consists of a declarative catalog and a suite of data services. The catalog includes definitions of tables, their schemas, and associated metadata, and it integrates with Apache Spark. It supports standard syntax such as SHOW DATABASES, SHOW TABLES, CREATE TABLE, ALTER TABLE, SELECT FROM, INSERT INTO, and DROP TABLE. The catalog is also where users can specify retention, replication, and sharing policies for each table.
Another key element of OpenHouse is that it reconciles a table’s observed state with its desired state, and it does so by invoking data services. Data services are responsible for orchestrating table maintenance jobs.
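To make the reconciliation idea concrete, here is a minimal Python sketch of how a control plane might compare desired and observed state and decide which maintenance jobs to run. It is illustrative only and is not OpenHouse's actual code or API: the TableState class, its fields, the plan_maintenance function, and the job names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TableState:
    """Hypothetical slice of a table's policy state (not an OpenHouse type)."""
    retention_days: Optional[int] = None  # None means no retention policy
    replicated: bool = False


def plan_maintenance(desired: TableState, observed: TableState) -> List[str]:
    """Return hypothetical job names needed to move observed toward desired."""
    jobs: List[str] = []
    # Schedule a retention job only when a policy is declared and not yet in effect.
    if desired.retention_days is not None and desired.retention_days != observed.retention_days:
        jobs.append(f"retention(days={desired.retention_days})")
    # Schedule replication only when it is wanted but not yet in place.
    if desired.replicated and not observed.replicated:
        jobs.append("replication")
    return jobs


# The user declares a 30-day retention policy; replication is already in place.
desired = TableState(retention_days=30, replicated=True)
observed = TableState(retention_days=None, replicated=True)
print(plan_maintenance(desired, observed))  # ['retention(days=30)']
```

In this sketch the user only declares the end state, and the planner works out the difference. When the two states already match, the function returns an empty list and no jobs are scheduled.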
According to LinkedIn, the goal was always to open source the project at some point, and therefore it was designed to provide pluggability with storage, authentication, authorization, database, and job submission services.
“Now that we’ve reached the open sourcing milestone, we invite you to explore OpenHouse and provide us with your valuable feedback. We’re keen on collaborating with users to understand how OpenHouse performs within different environments, whether it’s integrated into cloud infrastructures or adapted to preferred table formats,” Sakdeo wrote.