Building a Streaming Data Lake with RSocket
A deep dive into implementing reactive, bidirectional streaming data pipelines using RSocket and Spring WebFlux for modern data lake architectures.
Modern data architectures demand more than traditional request-response patterns. When building data lakes that ingest millions of events per second, you need a protocol designed from the ground up for reactive, bidirectional streaming. Enter RSocket - the application protocol that transforms how we think about data movement.
The Challenge: Real-Time Data Lake Ingestion
Traditional data lake architectures often rely on batch-oriented approaches or polling mechanisms that introduce latency and inefficiency. Consider these common pain points:
- Backpressure blindness: HTTP-based ingestion cannot signal upstream systems to slow down
- Connection overhead: Establishing new connections for each batch adds latency
- Unidirectional flow: Request-response patterns limit streaming capabilities
- Resource exhaustion: Without flow control, fast producers overwhelm slow consumers
The streaming-datalake-using-rsocket project addresses these challenges head-on by leveraging RSocket's native reactive semantics.
Why RSocket for Data Lake Streaming?
RSocket is a binary protocol that provides four interaction models, each suited for different data lake scenarios:
| Interaction Model | Use Case in Data Lake |
|---|---|
| Fire-and-Forget | High-volume telemetry ingestion where acknowledgment is unnecessary |
| Request-Response | Metadata queries, schema lookups |
| Request-Stream | Continuous data consumption from lake partitions |
| Channel | Bidirectional sync between lake nodes, real-time analytics feedback |
The Bidirectional Streaming Architecture
*Figure: Medallion Data Architecture*
Reactive Streams: The Foundation
RSocket implements the Reactive Streams specification, providing a standard for asynchronous stream processing with non-blocking backpressure. This is crucial for data lake scenarios where:
- Publishers (data sources) can produce at varying rates
- Subscribers (data lake writers) have finite processing capacity
- Backpressure must flow upstream to prevent memory exhaustion
The Reactive Streams Contract
public interface Publisher<T> {
    void subscribe(Subscriber<? super T> subscriber);
}

public interface Subscriber<T> {
    void onSubscribe(Subscription subscription);
    void onNext(T item);
    void onError(Throwable error);
    void onComplete();
}

public interface Subscription {
    void request(long n); // Backpressure signal
    void cancel();
}
The request(long n) method is where the magic happens - subscribers explicitly request the number of elements they can handle, creating a pull-based flow control mechanism.
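To make the pull-based contract concrete, here is a minimal, self-contained sketch using the JDK's built-in `java.util.concurrent.Flow` API, which mirrors the Reactive Streams interfaces. `PullDemo` and `RangePublisher` are illustrative names of my own; real Reactor and RSocket publishers also handle re-entrancy, concurrency, and cancellation races that this toy omits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

public class PullDemo {

    /** A toy synchronous Publisher: emits 1..max, but only as demand is signalled. */
    static class RangePublisher implements Flow.Publisher<Integer> {
        private final int max;

        RangePublisher(int max) { this.max = max; }

        @Override
        public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
            subscriber.onSubscribe(new Flow.Subscription() {
                private int next = 1;

                @Override
                public void request(long n) {
                    // Emit at most n items, never more than the subscriber asked for
                    for (long i = 0; i < n && next <= max; i++) {
                        subscriber.onNext(next++);
                    }
                    if (next > max) {
                        subscriber.onComplete();
                    }
                }

                @Override
                public void cancel() { next = max + 1; }
            });
        }
    }

    public static void main(String[] args) {
        List<Integer> received = new ArrayList<>();
        Flow.Subscription[] sub = new Flow.Subscription[1];

        new RangePublisher(5).subscribe(new Flow.Subscriber<Integer>() {
            @Override public void onSubscribe(Flow.Subscription s) {
                sub[0] = s;
                s.request(2); // initial demand: only two items may be emitted
            }
            @Override public void onNext(Integer item) { received.add(item); }
            @Override public void onError(Throwable t) { t.printStackTrace(); }
            @Override public void onComplete() { }
        });

        System.out.println(received); // [1, 2] because emission stopped at the demand boundary
        sub[0].request(3);            // signal more demand; the publisher resumes
        System.out.println(received); // [1, 2, 3, 4, 5]
    }
}
```

The key observation: the publisher is *unable* to outrun the subscriber, because emission only happens inside `request(n)`. This is the same mechanism RSocket carries across the network as explicit REQUEST_N frames.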
Implementing the RSocket Data Lake Server
Using Spring Boot and WebFlux, we create a reactive RSocket server that handles streaming ingestion:
@Controller
public class DataLakeController {

    private static final Logger log = LoggerFactory.getLogger(DataLakeController.class);

    private final DataLakeWriter dataLakeWriter;

    public DataLakeController(DataLakeWriter dataLakeWriter) {
        this.dataLakeWriter = dataLakeWriter;
    }

    // Fire-and-Forget: high-throughput ingestion
    @MessageMapping("ingest.fire-forget")
    public Mono<Void> ingestFireAndForget(DataEvent event) {
        return dataLakeWriter.writeAsync(event);
    }

    // Request-Stream: continuous data consumption
    @MessageMapping("stream.partition")
    public Flux<DataRecord> streamPartition(PartitionRequest request) {
        return dataLakeWriter.readPartition(request.getPartitionId())
                .delayElements(Duration.ofMillis(10)); // Simulate processing
    }

    // Channel: bidirectional streaming with feedback
    @MessageMapping("sync.channel")
    public Flux<SyncResponse> bidirectionalSync(Flux<SyncRequest> requests) {
        return requests
                .flatMap(this::processSyncRequest)
                .onBackpressureBuffer(1000,
                        dropped -> log.warn("Dropped: {}", dropped),
                        BufferOverflowStrategy.DROP_OLDEST);
    }
}
Configuration for Optimal Streaming
spring:
  rsocket:
    server:
      port: 7000
      transport: tcp

# Reactive buffer tuning
reactor:
  buffer-size:
    small: 256
    medium: 1024
    large: 4096
Backpressure Handling Strategies
Backpressure is not just a feature - it is the cornerstone of reliable streaming systems. Project Reactor, which underpins Spring's RSocket support, provides multiple handling strategies:
1. Buffer with Overflow Strategy
flux.onBackpressureBuffer(
    maxSize,                            // Maximum buffer capacity
    overflowHandler,                    // Called when the buffer overflows
    BufferOverflowStrategy.DROP_OLDEST  // Or DROP_LATEST, ERROR
);
2. Latest Value Strategy
When only the most recent value matters (e.g., sensor readings):
flux.onBackpressureLatest(); // Keep only the latest item
3. Drop Strategy
For scenarios where data loss is acceptable:
flux.onBackpressureDrop(dropped ->
    metrics.incrementDroppedEvents()
);
4. Request-Based Flow Control
The most precise approach - explicitly controlling demand:
flux.subscribe(new BaseSubscriber<DataEvent>() {
    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        request(100); // Initial demand
    }

    @Override
    protected void hookOnNext(DataEvent event) {
        processEvent(event);
        request(1); // Request the next item after processing
    }
});
The Data Flow Architecture
*Figure: Medallion Data Architecture*
Client Implementation
Creating a reactive client that respects backpressure:
@Service
public class DataLakeClient {

    private final RSocketRequester requester;

    public DataLakeClient(RSocketRequester.Builder builder) {
        this.requester = builder
                .rsocketConnector(connector -> connector
                        .reconnect(Retry.backoff(10, Duration.ofSeconds(1))))
                .tcp("localhost", 7000);
    }

    // Fire-and-Forget ingestion
    public Mono<Void> ingest(DataEvent event) {
        return requester
                .route("ingest.fire-forget")
                .data(event)
                .send();
    }

    // Request-Stream consumption with backpressure
    public Flux<DataRecord> streamWithBackpressure(String partitionId) {
        return requester
                .route("stream.partition")
                .data(new PartitionRequest(partitionId))
                .retrieveFlux(DataRecord.class)
                .limitRate(100); // Request 100 at a time
    }

    // Bidirectional channel
    public Flux<SyncResponse> syncChannel(Flux<SyncRequest> requests) {
        return requester
                .route("sync.channel")
                .data(requests)
                .retrieveFlux(SyncResponse.class);
    }
}
Performance Considerations
When building streaming data lakes with RSocket, consider these optimizations:
1. Connection Multiplexing
RSocket multiplexes multiple streams over a single connection, reducing connection overhead:
// Single connection, multiple concurrent streams
Flux.range(1, 10)
    .flatMap(i -> client.streamWithBackpressure("partition-" + i), 10)
    .subscribe();
2. Binary Payload Encoding
Use efficient serialization for high-throughput scenarios:
RSocketStrategies strategies = RSocketStrategies.builder()
    .encoder(new Jackson2CborEncoder())
    .decoder(new Jackson2CborDecoder())
    .build();
3. Resume Capability
RSocket supports session resumption for fault tolerance:
RSocketConnector.create()
    .resume(new Resume()
        .sessionDuration(Duration.ofMinutes(5))
        .retry(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1))))
    .connect(transport);
Real-World Architecture
*Figure: Medallion Data Architecture*
Key Takeaways
- RSocket is purpose-built for streaming: Unlike HTTP, RSocket provides native support for bidirectional streams with backpressure.
- Reactive Streams ensure stability: The demand-driven model prevents fast producers from overwhelming slow consumers.
- Multiple interaction models: Choose the right model for each use case - fire-and-forget for telemetry, channels for bidirectional sync.
- Spring WebFlux integration: Seamless integration with the Spring ecosystem makes RSocket accessible to Java developers.
- Backpressure is essential: In data lake scenarios, proper backpressure handling is the difference between a stable system and cascading failures.
Conclusion
The streaming-datalake-using-rsocket project demonstrates how modern reactive protocols can transform data lake architectures. By embracing RSocket's bidirectional streaming and backpressure capabilities, we can build data pipelines that are not only faster but fundamentally more resilient.
The shift from request-response to reactive streaming is not merely an optimization - it is a paradigm shift in how we think about data movement. As data volumes continue to grow exponentially, protocols like RSocket will become essential tools in the data engineer's arsenal.
Explore the full implementation at github.com/mgorav/streaming-datalake-using-rsocket