13 Problem Areas in Data Infrastructure to build NewCos

2021-03-07

open-source , infrastructure , ideas

A year ago, we decided to start open-sourcing our thinking around problem areas of interest, starting with 15 Problem Spaces in Developer Tools & Infrastructure We’re Excited About at Venrock, with the hope that it would spark conversations and bring teams together to go after exciting opportunities. With so much having changed in the world, we decided to follow up almost a year later with our next iteration.

13 problem areas

The last 12 months have been a technology tipping point for businesses in the wake of remote work, the increased need to leverage data for faster decision making, increased pressure to move workloads to the cloud, and the realization that technology investments are a competitive advantage in a digital world. The digital transformation we’ve seen play out over the last few years just compressed the next five years of progress into one.

Companies such as Astronomer, DBT, Decodable, Imply, Materialize, Superconductive, and more — built on core open source projects — have seen meteoric rises as a result of increased focus on data engineering and delivering business value through unlocking data velocity.

Private valuations soared for companies such as Databricks, Confluent, and DataRobot, ushering in massive infrastructure transitions at legacy incumbents such as Cloudera, Talend, Oracle, and Informatica as they modernize their enterprise capabilities.

Public companies such as Snowflake, Mongo, Cloudflare* and Twilio are seeing historic 20x to 60x EV/revenue multiples as ‘digital transformation’ shifts into second gear, with a concerted focus on modernizing the infrastructure and data planes in order to unlock data as a competitive edge and reduce operational overhead in order to move faster. We’ve previously written about this as the evolution to an ‘everything as code’ model and the era of programmatic infrastructure.

While a lot has changed in the last 12 months, much also remains the same, with continued opportunities to evolve how organizations build services, deploy infrastructure, distribute resources, increase data velocity, secure applications, and begin to leverage machine learning for workload-specific optimizations.

If the 2010s represented a renaissance for what we can build and deliver, the 2020s have begun to clearly represent a shift to how we build and deliver, with a focused intensity on infrastructure, data, and operational productivity.

As we look forward over the next 12–24 months, here are 13 more problem areas we’re excited about. If any of them resonate with you, or if you have comments or thoughts, please reach out.

1. Persistence layer replication is still an unsolved problem in true multi-cloud deployments

The evolution of multi-cloud is allowing applications to become more cloud-agnostic: you can now deploy wherever there is capacity or specialized services available. While you can elastically scale up your application servers, there is no comparable way to auto-scale your persistence layer. As soon as disk storage is involved, cross-data-center communication latency becomes untenable. The bigger your persistence layer footprint and the more sharded your data becomes, the more replication becomes an architecturally limiting problem.

2. The AWS console is the worst

Not data infra specific, but a challenge that plagues the entire infrastructure workflow.

3. Streaming data continues to require significant resources to ingest, build, and manage new pipelines.

Streaming data requires a very different approach than batch: higher data velocity, near real-time processing, and unpredictable data loads, with limited out-of-the-box infrastructure available. Companies such as Decodable* are making this easier, but it’s still early days.

4. Kubernetes management still relies on YAML.

For all the advancements containers and orchestration enable for the modern application stack, we are still managing and configuring them like we’re in 2008. Imagine a world without YAML. Companies such as Porter* are making this easier, but YAML continues to be the leading cause of insomnia for SREs and DevOps teams.

5. Snowflake is becoming the new Oracle (the good and the bad).

The move from hardware to a fully managed cloud data warehouse was a huge leap in capability, flexibility, and cost (we thought so at least), and has reinvigorated a ‘SQL renaissance’. At the same time, it has left a lot to be desired in an attempt to be the ‘catch-all’ data warehouse. Concurrency is often limited to 8–15 queries before needing to spin up a new node. No indexes exist so you must rely on system compression strategies and metadata options. Some queries can be painfully slow if you rely on any type of joins or scans. Setup requires explicit knowledge of what the data is, how it will be used, and when to separate ingestion from reporting, when to split deployments, etc. Migrating from a traditional database is riddled with data quality issues due to the lack of indexes, unique or primary keys, etc.

6. The Data Lakehouse idea of bringing the best of both Data Warehouses and Data Lakes together (engineers are terrific at naming).

The lakehouse has been a meaningful step forward but continues to be too complex to deploy and manage. Data ingestion differs vastly between streaming and batch data. Compute and storage need to be decoupled. Storage layers need to be purpose-designed based on the data (document, graph, relational, blob, etc.). Active cataloging is required to keep track of sources, data lineage, metadata, etc. Do we need or want more one-size-fits-all solutions, or more vertical, specialized ones?

7. Machine learning in infrastructure management and operations is still incredibly nascent.

While we significantly overestimated the likely number of machine learning models in production powering business-critical use cases in user applications, applying machine learning to both stateful and stateless infrastructure would be a no-brainer. Available structured log data for training, known downstream and upstream dependencies, available time-series event data, and bounded capacity constraints make for a perfect use case for supervised and unsupervised learning to build management planes that take the reactive blocking & tackling out of infrastructure management.

8. The underlying theory behind database indexes hasn’t changed in decades.

We continue to rely on one-size-fits-all, general-purpose indexes such as B-Trees, hash maps, and others that take black-box views of our data, assuming nothing about the data or the common patterns prevalent in our data sets. Designing indexes to be data-aware using neural nets, cumulative distribution functions (CDFs), and other techniques could yield significant performance benefits: smaller indexes, faster lookups, increased parallelism, and reduced CPU usage. Whether for multi-dimensional or high-volume transaction data systems, memory-based or on-disk, data-aware systems are already demonstrating step-function benefits over the current state-of-the-art systems.
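
To make the data-aware idea concrete, here is a minimal sketch of a toy "learned index" in Python. It assumes numeric keys that fit in memory, approximates the key CDF with a single least-squares line (real systems use richer models such as staged models or neural nets), and records the worst prediction error so that a lookup only has to search a small window. The class name `LearnedIndex` and its methods are hypothetical and exist only for illustration.

```python
import bisect


class LearnedIndex:
    """Toy learned index: a linear fit of the key CDF plus a bounded local search."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of: position ~ slope * key + intercept
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error, kept so lookups can bound their search window
        self.max_err = max(abs(i - self._predict(k)) for i, k in enumerate(self.keys))

    def _predict(self, key):
        # Predicted position, clamped to the valid index range
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted key list, or None if absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Binary search only inside the error window around the prediction
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The closer the keys are to the fitted line, the smaller `max_err` becomes and the less searching each lookup does, which is the intuition behind indexes that exploit the distribution of the data rather than ignoring it.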

9. There is little-to-no machine learning used to improve database performance.

From improved cardinality estimation, query optimization, cost modeling, workload forecasting, to data partitioning, leveraging machine learning can have a substantial impact on query cost and resource utilization, especially in multi-tenant environments where disk, RAM, and CPU time are scarce resources. Gone can be the days of nested loop joins and merge joins + index scans!
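
As one small illustration of the cardinality estimation point, the sketch below learns a correction from past query feedback: pairs of the optimizer's estimated row count and the row count actually observed. It fits a straight line in log space and applies it to new estimates. This is an assumption-laden toy (the function names `fit_log_correction` and `corrected_estimate` are hypothetical, all counts must be positive, and production-grade learned estimators use far richer features than a single prior estimate), not a description of any particular database.

```python
import math


def fit_log_correction(history):
    """Fit log(actual) ~ a * log(estimated) + b from (estimated, actual) row counts."""
    xs = [math.log(est) for est, _ in history]
    ys = [math.log(act) for _, act in history]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # With no spread in the estimates, fall back to a pure scaling correction
    a = cov / var_x if var_x else 1.0
    b = mean_y - a * mean_x
    return a, b


def corrected_estimate(estimated_rows, model):
    """Apply the fitted correction to a new optimizer estimate."""
    a, b = model
    return math.exp(a * math.log(estimated_rows) + b)
```

If the optimizer has consistently underestimated a join by a factor of ten, the fitted model scales new estimates up by roughly that factor, which can be enough to change which join strategy looks cheapest.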

10. Data quality and lineage are still mostly unsolved, despite many attempts at pipeline testing and similar solutions.

Unlike in software development, there is limited ‘develop locally, test in staging, push to production’ for data. How do business users and analysts know when to feel confident in a certain dataset or dashboard? Can we apply tiers or ratings to certain data sources or pipelines as a way to determine confidence in uptime/lineage/quality/freshness? And how can engineering or ops track and remediate issues once models/workloads are in production?
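
To picture what tiers or ratings might look like, here is a minimal sketch that scores a dataset on freshness, test pass rate, and documented lineage and maps the score to a tier. Everything here is an assumption for illustration: the `DatasetHealth` fields, the 24-hour and 99% thresholds, and the gold/silver/bronze labels are hypothetical, and a real scheme would need thresholds agreed per dataset.

```python
from dataclasses import dataclass


@dataclass
class DatasetHealth:
    hours_since_refresh: float  # freshness signal
    test_pass_rate: float       # fraction of pipeline tests passing, 0.0 to 1.0
    lineage_documented: bool    # whether upstream sources are recorded


def confidence_tier(health: DatasetHealth) -> str:
    """Map three simple health signals to a coarse confidence tier."""
    score = 0
    if health.hours_since_refresh <= 24:   # illustrative freshness threshold
        score += 1
    if health.test_pass_rate >= 0.99:      # illustrative quality threshold
        score += 1
    if health.lineage_documented:
        score += 1
    # All three signals healthy -> gold, two -> silver, otherwise bronze
    return {3: "gold", 2: "silver"}.get(score, "bronze")
```

A label like this could sit next to a dashboard so that an analyst sees at a glance whether the numbers come from a fresh, tested, traceable source.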

11. Our modern data stacks have been overbuilt for analytics and underbuilt for operations.

While the current analytics-centric approach provides a strong foundation, the shift to powering more complex operational use cases is still in its infancy. Enterprise executives are beginning to ask how these data infrastructure investments can help speed up supply chain fulfillment, connect demand forecasting to capacity planning, improve preventative maintenance, respond to user engagement and problems faster, make use of clickstream data, and more. Where are the modern equivalents of Snowflake, DBT, Fivetran, etc. for operational business needs?

12. Where does application serving fit into the modern data stack?

Application serving is where high concurrency and low latency are required (the opposite of a data warehouse). While there are solutions for read-only workloads, they usually mean copying data over to Redis, Memcached, etc. Check out Netflix’s Bulldozer for an idea of how this can be done at production scale: a self-serve data platform that moves data efficiently from data warehouse tables to key-value stores in batches, making the data warehouse tables more accessible to different microservices and applications as needed. The enterprise ‘Bulldozer’ could be a massive hit.
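
The basic pattern of moving warehouse tables into a key-value store in batches can be sketched in a few lines. This is not Netflix's implementation; it is a toy under stated assumptions: an in-memory SQLite database stands in for the warehouse, a plain Python dict stands in for Redis or Memcached, the table name is trusted input, and `bulk_load_to_kv` is a hypothetical helper.

```python
import json
import sqlite3


def bulk_load_to_kv(conn, table, key_column, kv_store, batch_size=1000):
    """Copy rows from a warehouse table into a key-value store in batches."""
    # The table name is interpolated directly, so it must come from trusted config
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    key_idx = columns.index(key_column)
    loaded = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        # One JSON document per row, keyed as "<table>:<key value>"
        batch = {
            f"{table}:{row[key_idx]}": json.dumps(dict(zip(columns, row)))
            for row in rows
        }
        kv_store.update(batch)  # a real store would use a pipelined or bulk write here
        loaded += len(rows)
    return loaded


# Usage: load a small table, then serve a point lookup from the key-value side
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, plan TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "free"), (2, "pro"), (3, "free")])
cache = {}
rows_loaded = bulk_load_to_kv(conn, "users", "id", cache, batch_size=2)
user_2 = json.loads(cache["users:2"])
```

The serving path then never touches the warehouse: a microservice reads `users:2` from the key-value store at low latency, while the batch job refreshes it on whatever schedule the data allows.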

13. “Excel is production” is the unfortunate standard for many critical business workloads.

The reason is often that data engineering is a bottleneck in moving business-critical workloads into what should be production services. The challenge is multifold. Data ingestion and processing are often managed through a series of highly sequenced and brittle scripts. Excel or Google Sheets is used as the data warehouse. Complex, 500-line queries are driving business processes. Migrating to a production-quality service is untenable without a complete rewrite by data engineering. How can we build services that enable data analysts and business users to create production-grade workloads from the start?