Building a Streaming Data Lake with RSocket

A deep dive into implementing reactive, bidirectional streaming data pipelines using RSocket and Spring WebFlux for modern data lake architectures.

Gonnect Team · January 15, 2024 · 12 min read
Tags: RSocket, Spring Boot, WebFlux

Modern data architectures demand more than traditional request-response patterns. When building data lakes that ingest millions of events per second, you need a protocol designed from the ground up for reactive, bidirectional streaming. Enter RSocket - the application protocol that transforms how we think about data movement.

The Challenge: Real-Time Data Lake Ingestion

Traditional data lake architectures often rely on batch-oriented approaches or polling mechanisms that introduce latency and inefficiency. Consider these common pain points:

  • Backpressure blindness: HTTP-based ingestion cannot signal upstream systems to slow down
  • Connection overhead: Establishing new connections for each batch adds latency
  • Unidirectional flow: Request-response patterns limit streaming capabilities
  • Resource exhaustion: Without flow control, fast producers overwhelm slow consumers

The streaming-datalake-using-rsocket project addresses these challenges head-on by leveraging RSocket's native reactive semantics.

Why RSocket for Data Lake Streaming?

RSocket is a binary protocol that provides four interaction models, each suited for different data lake scenarios:

Interaction Model     Use Case in Data Lake
Fire-and-Forget       High-volume telemetry ingestion where acknowledgment is unnecessary
Request-Response      Metadata queries, schema lookups
Request-Stream        Continuous data consumption from lake partitions
Channel               Bidirectional sync between lake nodes, real-time analytics feedback
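
To make the mapping concrete, here is a minimal sketch of how each model looks through Spring's RSocketRequester. The routes and payload types below ("telemetry", "schema.lookup", and so on) are illustrative placeholders, not routes defined by the project.

// Illustrative sketch only: routes and payload types are hypothetical.
import java.time.Duration;
import org.springframework.messaging.rsocket.RSocketRequester;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class InteractionModelExamples {

    void demo(RSocketRequester requester) {
        // Fire-and-Forget: send a payload and never wait for a reply
        Mono<Void> fireAndForget = requester.route("telemetry")
            .data("cpu=42")
            .send();

        // Request-Response: one request, one reply (e.g. a schema lookup)
        Mono<String> schema = requester.route("schema.lookup")
            .data("orders")
            .retrieveMono(String.class);

        // Request-Stream: one request, a stream of replies
        Flux<String> records = requester.route("partition.read")
            .data("partition-1")
            .retrieveFlux(String.class);

        // Channel: a stream of requests in, a stream of responses out
        Flux<String> outbound = Flux.interval(Duration.ofSeconds(1)).map(i -> "sync-" + i);
        Flux<String> inbound = requester.route("sync")
            .data(outbound, String.class)
            .retrieveFlux(String.class);
    }
}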

The Bidirectional Streaming Architecture

[Diagram: Medallion Data Architecture]

Reactive Streams: The Foundation

RSocket implements the Reactive Streams specification, providing a standard for asynchronous stream processing with non-blocking backpressure. This is crucial for data lake scenarios where:

  1. Publishers (data sources) can produce at varying rates
  2. Subscribers (data lake writers) have finite processing capacity
  3. Backpressure must flow upstream to prevent memory exhaustion

The Reactive Streams Contract

public interface Publisher<T> {
    void subscribe(Subscriber<? super T> subscriber);
}

public interface Subscriber<T> {
    void onSubscribe(Subscription subscription);
    void onNext(T item);
    void onError(Throwable error);
    void onComplete();
}

public interface Subscription {
    void request(long n);  // Backpressure signal
    void cancel();
}

The request(long n) method is where the magic happens - subscribers explicitly request the number of elements they can handle, creating a pull-based flow control mechanism.

Implementing the RSocket Data Lake Server

Using Spring Boot and WebFlux, we create a reactive RSocket server that handles streaming ingestion:

@Controller
public class DataLakeController {

    private static final Logger log = LoggerFactory.getLogger(DataLakeController.class);

    private final DataLakeWriter dataLakeWriter;

    public DataLakeController(DataLakeWriter dataLakeWriter) {
        this.dataLakeWriter = dataLakeWriter;
    }

    // Fire-and-Forget: High-throughput ingestion
    @MessageMapping("ingest.fire-forget")
    public Mono<Void> ingestFireAndForget(DataEvent event) {
        return dataLakeWriter.writeAsync(event);
    }

    // Request-Stream: Continuous data consumption
    @MessageMapping("stream.partition")
    public Flux<DataRecord> streamPartition(PartitionRequest request) {
        return dataLakeWriter.readPartition(request.getPartitionId())
            .delayElements(Duration.ofMillis(10)); // Simulate downstream processing time
    }

    // Channel: Bidirectional streaming with feedback
    @MessageMapping("sync.channel")
    public Flux<SyncResponse> bidirectionalSync(Flux<SyncRequest> requests) {
        return requests
            .flatMap(this::processSyncRequest)   // application-specific sync logic (elided)
            .onBackpressureBuffer(1000,
                dropped -> log.warn("Dropped: {}", dropped),
                BufferOverflowStrategy.DROP_OLDEST);
    }
}
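
The controller leans on a handful of domain types that the snippet does not show. The following is only a sketch of what they might look like; the field names and record shapes are assumptions made for illustration, not the project's actual model.

// Hypothetical shapes for the types referenced by DataLakeController.
public record DataEvent(String source, String payload, long timestamp) {}

public record DataRecord(String partitionId, String payload) {}

public record PartitionRequest(String partitionId) {
    public String getPartitionId() { return partitionId; }
}

public interface DataLakeWriter {
    Mono<Void> writeAsync(DataEvent event);              // non-blocking append into the lake
    Flux<DataRecord> readPartition(String partitionId);  // stream records from one partition
}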

Configuration for Optimal Streaming

spring:
  rsocket:
    server:
      port: 7000
      transport: tcp

  # Reactive buffer tuning (application-defined keys read by the project's own
  # configuration; these are not built-in Spring Boot properties)
  reactor:
    buffer-size:
      small: 256
      medium: 1024
      large: 4096
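
Port and transport cover the basics; finer-grained settings such as frame fragmentation can be applied programmatically. The sketch below uses Spring Boot's RSocketServerCustomizer; the 16 KiB fragment size is an arbitrary example, not a value taken from the project.

// Sketch: programmatic tuning of the embedded RSocket server.
import org.springframework.boot.rsocket.server.RSocketServerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RSocketServerTuning {

    @Bean
    public RSocketServerCustomizer fragmentationCustomizer() {
        // Fragment frames larger than 16 KiB so a single large payload
        // cannot monopolize the shared connection.
        return rSocketServer -> rSocketServer.fragment(16 * 1024);
    }
}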

Backpressure Handling Strategies

Backpressure is not just a feature - it is the cornerstone of reliable streaming systems. RSocket provides multiple strategies:

1. Buffer with Overflow Strategy

flux.onBackpressureBuffer(
    maxSize,           // Maximum buffer capacity
    overflowHandler,   // Called when buffer overflows
    BufferOverflowStrategy.DROP_OLDEST  // Or DROP_LATEST, ERROR
);

2. Latest Value Strategy

When only the most recent value matters (e.g., sensor readings):

flux.onBackpressureLatest();  // Keep only the latest item

3. Drop Strategy

For scenarios where data loss is acceptable:

flux.onBackpressureDrop(dropped ->
    metrics.incrementDroppedEvents()
);

4. Request-Based Flow Control

The most precise approach - explicitly controlling demand:

flux.subscribe(new BaseSubscriber<DataEvent>() {
    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        request(100);  // Initial demand
    }

    @Override
    protected void hookOnNext(DataEvent event) {
        processEvent(event);
        request(1);  // Request next item after processing
    }
});

The Data Flow Architecture

[Diagram: Medallion Data Architecture]

Client Implementation

Creating a reactive client that respects backpressure:

@Service
public class DataLakeClient {

    private final RSocketRequester requester;

    public DataLakeClient(RSocketRequester.Builder builder) {
        this.requester = builder
            .rsocketConnector(connector -> connector
                .reconnect(Retry.backoff(10, Duration.ofSeconds(1))))
            .tcp("localhost", 7000);
    }

    // Fire-and-Forget ingestion
    public Mono<Void> ingest(DataEvent event) {
        return requester
            .route("ingest.fire-forget")
            .data(event)
            .send();
    }

    // Request-Stream consumption with backpressure
    public Flux<DataRecord> streamWithBackpressure(String partitionId) {
        return requester
            .route("stream.partition")
            .data(new PartitionRequest(partitionId))
            .retrieveFlux(DataRecord.class)
            .limitRate(100);  // Request 100 at a time
    }

    // Bidirectional channel
    public Flux<SyncResponse> syncChannel(Flux<SyncRequest> requests) {
        return requester
            .route("sync.channel")
            .data(requests, SyncRequest.class)  // element type hint needed for Publisher payloads
            .retrieveFlux(SyncResponse.class);
    }
}
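
Wiring the client into a consumer could look like the sketch below. The ReportingService and its logging are made up for illustration; only the DataLakeClient calls come from the article.

// Hypothetical consumer of the DataLakeClient defined above.
@Service
public class ReportingService {

    private static final Logger log = LoggerFactory.getLogger(ReportingService.class);

    private final DataLakeClient client;

    public ReportingService(DataLakeClient client) {
        this.client = client;
    }

    @PostConstruct
    public void consumePartition() {
        // limitRate(100) inside streamWithBackpressure keeps demand bounded:
        // at most 100 records are requested from the server at a time.
        client.streamWithBackpressure("partition-7")
            .doOnNext(record -> log.info("Received record: {}", record))
            .doOnError(error -> log.error("Stream failed", error))
            .subscribe();
    }
}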

Performance Considerations

When building streaming data lakes with RSocket, consider these optimizations:

1. Connection Multiplexing

RSocket multiplexes multiple streams over a single connection, reducing connection overhead:

// Single connection, multiple concurrent streams
Flux.range(1, 10)
    .flatMap(i -> client.streamWithBackpressure("partition-" + i), 10)
    .subscribe();

2. Binary Payload Encoding

Use efficient serialization for high-throughput scenarios:

RSocketStrategies strategies = RSocketStrategies.builder()
    .encoder(new Jackson2CborEncoder())
    .decoder(new Jackson2CborDecoder())
    .build();
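
These strategies still need to be plugged into the requester (and, server-side, into the RSocket message handler). A minimal client-side sketch, assuming the strategies variable built above:

// Sketch: attaching the CBOR strategies to a client-side requester.
RSocketRequester requester = RSocketRequester.builder()
    .rsocketStrategies(strategies)
    .tcp("localhost", 7000);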

3. Resume Capability

RSocket supports session resumption for fault tolerance:

RSocketConnector.create()
    .resume(new Resume()
        .sessionDuration(Duration.ofMinutes(5))
        .retry(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1))))
    .connect(transport);
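
When the connection is created through Spring's RSocketRequester.Builder instead of the raw connector, the same resume configuration can be attached via the connector callback; a sketch under the same assumptions as above:

// Sketch: enabling resumption on a Spring-managed requester.
RSocketRequester requester = RSocketRequester.builder()
    .rsocketConnector(connector -> connector
        .resume(new Resume()
            .sessionDuration(Duration.ofMinutes(5))
            .retry(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1)))))
    .tcp("localhost", 7000);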

Real-World Architecture

[Diagram: Medallion Data Architecture]

Key Takeaways

  1. RSocket is purpose-built for streaming: Unlike HTTP, RSocket provides native support for bidirectional streams with backpressure.

  2. Reactive Streams ensure stability: The demand-driven model prevents fast producers from overwhelming slow consumers.

  3. Multiple interaction models: Choose the right model for each use case - fire-and-forget for telemetry, channels for bidirectional sync.

  4. Spring WebFlux integration: Seamless integration with the Spring ecosystem makes RSocket accessible to Java developers.

  5. Backpressure is essential: In data lake scenarios, proper backpressure handling is the difference between a stable system and cascading failures.

Conclusion

The streaming-datalake-using-rsocket project demonstrates how modern reactive protocols can transform data lake architectures. By embracing RSocket's bidirectional streaming and backpressure capabilities, we can build data pipelines that are not only faster but fundamentally more resilient.

The shift from request-response to reactive streaming is not merely an optimization - it is a paradigm shift in how we think about data movement. As data volumes continue to grow exponentially, protocols like RSocket will become essential tools in the data engineer's arsenal.


Explore the full implementation at github.com/mgorav/streaming-datalake-using-rsocket