Building A Continuous Delivery Pipeline for StreamSets

Posted by Ahmed Huwait on 07 July 2018

StreamSets Data Collector (SDC) is a data streaming tool that helps you move data from where it is generated or collected to where you need it for analysis or processing. I personally found SDC interesting as it sits somewhere between systems integration, which is my background, and big data, which was new to me at the time.

Coming from a Java development background, I was looking for the familiar delivery tools I am used to. As part of its out-of-the-box IDE, SDC provides a nice visual testing tool that helps developers test their pipelines and see sample data (snapshots) as it flows through the pipelines. This is very helpful for manual verification but not so helpful for building an automated continuous delivery (CD) pipeline.

In this post I discuss our team’s attempt to complement SDC with a CD pipeline the dev team can rely on. The post assumes some familiarity with CD frameworks, hence it focuses more on the conceptual side, where most of the challenges lay for us.

Pipeline? … which pipeline?!

First things first, I know it can get a little confusing as I use the term “pipeline” to refer to two different things but bear with me! I will try to clarify it a little bit here …

SDC Pipeline

The unit of work in SDC is called a pipeline: a flow, or piece of code, with an origin (the data source) and one or more destinations, where the data gets delivered, optionally with some transformers and processors in between.

CD Pipeline

“Pipeline” is also used to refer to the unit of work in continuous delivery tools, like Jenkins or Bamboo, where a number of steps, normally around building, unit-testing and deploying a code base, are linked together in a sequence to achieve the required delivery goals.

These two pipeline terms are used in this article to refer to the above two concepts.

The Problem

For a development team in an agile environment, automated building, unit-testing and deployment tools are necessary to maintain the team’s agility, so it can build and deliver its code with confidence.

While SDC provides visual debugging/testing tools out of the box, it doesn’t support standard unit testing and automated deployment natively, and this is where our framework below tried to help.

Building a CD Pipeline around SDC


In our quest to work around this limitation in SDC, we came across a set of very useful APIs that come out of the box with any SDC installation. You can see their Swagger documentation at: http://<hostname:port>/collector/restapi.

In fact, the whole GUI you get on http://<hostname:port>/ uses this same API for all its functions; hence, theoretically, you should be able to automate all the out-of-the-box functionality through these APIs. Hint: with the developer tools open in your browser, you can see each API call the GUI components make. That becomes very handy for mimicking the GUI functions in your code.
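To give a feel for what mimicking the GUI calls can look like from Java, here is a minimal sketch that only builds the HTTP requests, without sending them. It needs Java 11 or later for java.net.http. The class name SdcRequests is made up for illustration, and two details are assumptions you should verify against your own instance’s Swagger page: that the REST paths live under a /rest/v1 prefix, and that SDC expects an X-Requested-By header on calls that change state. Authentication is left out entirely.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds requests against an SDC REST API. Nothing is sent from here. */
final class SdcRequests {
    private final String baseUrl;

    /** @param baseUrl e.g. "http://localhost:18630/rest/v1" (host and prefix are assumptions) */
    SdcRequests(String baseUrl) {
        // Drop a trailing slash so every path can start with "/".
        this.baseUrl = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
    }

    private HttpRequest.Builder builder(String path) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Accept", "application/json")
                // Assumption: SDC rejects state-changing calls without this header.
                .header("X-Requested-By", "sdc");
    }

    HttpRequest get(String path) {
        return builder(path).GET().build();
    }

    HttpRequest delete(String path) {
        return builder(path).DELETE().build();
    }

    HttpRequest postJson(String path, String jsonBody) {
        return builder(path)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

A request built this way can be handed to java.net.http.HttpClient#send together with HttpResponse.BodyHandlers.ofString() to get the JSON response back as a string. Keeping request construction separate from sending makes it easy to check the URLs and headers against what the browser’s developer tools show.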

How the CD Pipeline Works for SDC

To achieve our CD pipeline target, the main missing piece in the puzzle was the unit test, which is a concept that SDC, so far (Jun 2018), doesn’t cater for.

A Recap of How the SDC Out-of-the-box Test, Preview, Works

Before going into the details of how we added our unit-test bit, let’s do a quick recap on the steps to run a test on your SDC pipeline:

  1. Build a pipeline
  2. Run it against its configured origin
  3. While running, capture a snapshot
  4. Once you have captured a snapshot, you can run the pipeline again in preview mode with the Preview Source set to Snapshot Data, then select the snapshot you want from the drop-down list.
  5. Run your preview; you can then step through your pipeline, stage by stage, and see the input and output of each stage.

A Few Challenges

Before we could automate the preview process, we had to work through a few challenges that you would not expect when building a standard unit-testing framework.

Not-so-portable Code

SDC saves each pipeline as a number of json files in different directories under its home directory, which is not so portable.

To get that portable file, you have to export the pipeline, so you get a single json file that you can use somewhere else.

The exported pipeline, the-one-json-file-to-rule-them-all, can then be checked in to your source control system.

Snapshots vs. Normal Test Files

Snapshots, as the name suggests, are a full snapshot of what was going through the pipeline when they were captured. That makes them a little bit more than a sample input message, which you would normally use for unit testing.

A snapshot json file is a big file with all pipeline stages represented with headers and message body, which seems to be carried over in each stage. So, long story short, they are not the easiest json to understand.

Why is this a problem? Because you will need to understand it and use it as a reference for your unit test. We'll come to that later!

Unit-testing and Deployment, Which Comes First?!

When building a unit test framework, you would normally look to unit-test your code first, using libraries on the machine that runs it, and then, once it passes all test cases, deploy that code to the target server.

With SDC, to run a pipeline preview, the pipeline must already be deployed to some instance, which is the opposite of the order above.

To work around this we had a dedicated SDC instance on which our pipelines-to-be-tested are deployed, tested and then (upon success) they get deployed to the target server.

Importing Pipelines to SDC

SDC allows duplicate pipeline names, as it allocates each pipeline a unique ID. That basically means that you cannot update the deployed pipeline with a newer copy; you have to delete the old copy and then deploy the new one, otherwise you will end up with duplicates.
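The "delete before import" check can be sketched as a small helper. This assumes you have already listed the deployed pipelines through the REST API and reduced the response to a map of pipeline ID to name; how that map is obtained, and the names StaleCopies and idsToDelete, are illustrative and not part of SDC.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class StaleCopies {
    /**
     * @param deployed pipeline ID -> pipeline name, as read from the server
     * @param name     name of the pipeline that is about to be imported
     * @return IDs of every deployed pipeline that already uses that name
     */
    static List<String> idsToDelete(Map<String, String> deployed, String name) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : deployed.entrySet()) {
            if (name.equals(entry.getValue())) {
                ids.add(entry.getKey());
            }
        }
        // Sort so the deletion order does not depend on map iteration order.
        Collections.sort(ids);
        return ids;
    }
}
```

Because names are not unique, the helper returns a list: if earlier runs already left several copies behind, all of them are reported and can be deleted before the new version is imported.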

Our CD Framework

Our CD framework utilised JUnit tests and a Jenkinsfile in a Maven project. It expected developers to export their pipelines and snapshots, which should cover all their unit test cases, as json files under the project’s src/main/resources.
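Since the import step described further down takes a single archive, the exported json files have to be bundled at some point. A minimal sketch of that bundling, using only java.util.zip, might look like the following. The class and method names are made up for illustration, and the files are passed in as a map of entry name to json text so the example stays independent of where they are read from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class PipelineArchive {
    /** Zips the given files (entry name -> json text) into an in-memory archive. */
    static byte[] zip(Map<String, String> files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            // TreeMap gives a stable, alphabetical entry order.
            for (Map.Entry<String, String> file : new TreeMap<>(files).entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        // The stream is closed by now, so the archive is complete.
        return out.toByteArray();
    }

    /** Reads the entry names back, handy for a sanity check before uploading. */
    static List<String> entryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }
}
```

Building the archive in memory avoids temporary files on the build agent; for a large number of pipelines, writing to a file under target/ would be the more conventional Maven choice.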

The Jenkinsfile went like this:

  • checks out the project from the nominated source control
  • mvn -B -DskipTests clean package to compile and package the project, skipping its unit tests.
  • mvn test to run the unit tests (we'll come to this in a bit).
  • publishes the generated archive, tagged with the build number, to Nexus, our artifact repository.
  • deletes the old version of the pipelines from the SDC server
  • deploys the pipelines to the SDC server.

To unit-test the pipelines, our single JUnit test method went like this:

  • zip up all pipelines in one archive
  • using the SDC API (calling POST:/pipelines/import), import the zip file into the SDC server (i.e. deploy the pipelines).
  • for each imported pipeline:
    • run it in preview mode by calling the API’s POST:/pipeline/{pipelineId}/preview with its snapshot json as body
    • retrieve the preview’s response with GET:/pipeline/{pipelineId}/preview/{previewerId}
    • compare the response with the snapshot for each stage, where the pipeline’s stages are represented as nodes in both the response and the snapshot
    • stop preview, by calling DELETE:/pipeline/{pipelineId}/preview/{previewerId}
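The per-pipeline part of the steps above can be sketched roughly as follows. The two path helpers simply mirror the endpoints listed in the bullets. The comparison assumes that both the snapshot and the preview response have already been parsed, with a JSON library of your choice, into a map from stage name to that stage’s output records serialised as strings. That reduced shape, the stage names in the usage note, and the names PreviewCheck and mismatchedStages are all illustrative assumptions, not SDC’s actual response format.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

final class PreviewCheck {
    /** Path for starting a preview (POST, with the snapshot json as body). */
    static String previewPath(String pipelineId) {
        return "/pipeline/" + pipelineId + "/preview";
    }

    /** Path for fetching (GET) or stopping (DELETE) a running preview. */
    static String previewerPath(String pipelineId, String previewerId) {
        return previewPath(pipelineId) + "/" + previewerId;
    }

    /**
     * Compares stage outputs from the snapshot with those from the preview.
     *
     * @return sorted names of stages whose output differs, or that are
     *         present on one side only; empty when everything matches
     */
    static List<String> mismatchedStages(Map<String, List<String>> snapshot,
                                         Map<String, List<String>> preview) {
        // Start from the union of stage names seen on either side.
        TreeSet<String> stages = new TreeSet<>(snapshot.keySet());
        stages.addAll(preview.keySet());
        // Keep only the stages whose record lists are not equal.
        stages.removeIf(stage -> Objects.equals(snapshot.get(stage), preview.get(stage)));
        return List.copyOf(stages);
    }
}
```

In a JUnit test, asserting that mismatchedStages(...) is empty gives a failure message that names the offending stages, which is more useful than a plain true/false comparison of two large json documents.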

Note that the actual deployment to the StreamSets server happens twice in this process: the first is to run the previews, and the second is the actual deployment, which requires deleting the currently deployed version and importing the pipelines that just passed the unit test.

These two deployments don’t have to be on the same StreamSets server. They can use two different ones, where the first happens on a dedicated unit-test instance, while the second happens on the environment-specific instance, e.g. dev/sit/uat, etc.


Using StreamSets Data Collector (SDC)’s handy REST API, we were able to implement some core Continuous Delivery practices around automated testing and deployment.

It might make your life easier if you have a dedicated SDC instance for unit testing. If this is separate from your deployment SDC, you won’t need to deploy the pipelines there just for unit testing, only to delete them later.
