Spring Cloud Data Flow
Getting Started with Spring Cloud Data Flow
This guide walks you through an overview of Spring Cloud Data Flow and the process of orchestrating event-driven streaming and short-lived batch data pipelines.
What is Spring Cloud Data Flow?
Spring Cloud Data Flow provides microservice-based streaming and batch data processing for Cloud Foundry and Kubernetes. You can learn more about Spring Cloud Data Flow from the microsite, documentation, and samples.
Furthermore, you can read about Spring Cloud Data Flow's architecture and its building blocks to familiarize yourself with the project.
Spring Cloud Data Flow and Pre-built Streaming Applications
In the following guide, you will create and deploy a streaming data pipeline using pre-built applications.
Spring Cloud Data Flow and Custom Streaming Applications
In the following guide, you will create and deploy a streaming data pipeline using custom applications.
Spring Cloud Data Flow and Simple Task
In the following guide, we demonstrate how to register a Spring Cloud Task application with Data Flow, create a Task definition, and launch the Task definition on Cloud Foundry, Kubernetes, and your local machine.
Spring Cloud Data Flow and Batch Jobs
In the following guide, we demonstrate how to register a Spring Batch application with Data Flow, create a task definition, and launch the task definition on Cloud Foundry, Kubernetes, and your local machine.
Spring Cloud Data Flow and Composed Task
In the following guide, we demonstrate how to orchestrate a Composed Task with a combination of both a simple task and a batch job application.
Summary
Congratulations! You have completed the high-level overview of Spring Cloud Data Flow, and you were able to build, deploy, and launch streaming and batch data pipelines on Cloud Foundry, Kubernetes, and your local machine.
ETL with Spring Cloud Data Flow
Learn how to implement ETL with Spring Cloud Data Flow.
1. Overview
In this tutorial, we'll explore an example of real-time Extract, Transform and Load (ETL) using a stream pipeline that extracts data from a JDBC database, transforms it into simple POJOs, and loads it into MongoDB.
2. ETL and Event-Stream Processing
ETL – extract, transform and load – traditionally referred to the process of batch-loading data from several databases and systems into a common data warehouse, where heavy data-analysis processing can be done without compromising the overall performance of the system.
However, new trends are changing the way this is done. ETL still has a role in transferring data into data warehouses and data lakes.
3. Spring Cloud Data Flow
With Spring Cloud Data Flow (SCDF), developers can create data pipelines in two flavors: long-lived real-time stream applications built with Spring Cloud Stream, and short-lived batched task applications built with Spring Cloud Task.
In this article, we'll cover the first kind: a long-lived streaming application based on Spring Cloud Stream.
3.1. Spring Cloud Stream Applications
SCDF stream pipelines are composed of steps, where each step is an application built in Spring Boot style using the Spring Cloud Stream micro-framework. These applications are integrated by messaging middleware such as Apache Kafka or RabbitMQ.
The applications are classified into sources, processors, and sinks. Comparing this to the ETL process, we could say that the source is the "extract", the processor is the "transform", and the sink is the "load" part.
In some cases, we can use an application starter for one or more steps of the pipeline. This means that we don't need to implement a new application for that step; instead, we configure an existing application starter that is already implemented.
3.2. Spring Cloud Data Flow Server
The architectural component that coordinates these applications is the Spring Cloud Data Flow Server, which uses Spring Cloud Deployer to deploy the pipelines onto runtimes such as Cloud Foundry or Kubernetes. Alternatively, we can also run the stream as a local deployment.
4. Environment Setup
Now, let's check the system requirements for running this server.
4.1. System Requirements
To run the SCDF server, we have to define and set up two dependencies: the messaging middleware and the RDBMS.
For the messaging middleware, we'll work with RabbitMQ, and we'll choose PostgreSQL as the RDBMS for storing our pipeline stream definitions.
To run RabbitMQ, download the latest version here and start a RabbitMQ instance using the default configuration, or run the following Docker command:
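A minimal sketch, assuming Docker is available locally (the image tag is illustrative):

    docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management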
As the last setup step, install and run the PostgreSQL RDBMS on the default port 5432. Then create a database where SCDF can store its stream definitions, using the following script:
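A minimal sketch; the database name "dataflow" is only an example, and the statement can be run from psql as a superuser:

    CREATE DATABASE dataflow;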
4.2. The Local Spring Cloud Data Flow Server
Here, we'll run the local SCDF server as a Java application. To configure the application, we define the configuration as Java application arguments. We'll need Java 8 in the system path.
Finally, let's start the local SCDF server:
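A sketch of the launch command, assuming RabbitMQ and PostgreSQL are running on their default ports; the jar version, database name, and credentials are placeholders (in more recent releases the artifact is named spring-cloud-dataflow-server rather than spring-cloud-dataflow-server-local):

    java -jar spring-cloud-dataflow-server-local-1.7.4.RELEASE.jar \
        --spring.datasource.url=jdbc:postgresql://localhost:5432/dataflow \
        --spring.datasource.username=postgres \
        --spring.datasource.password=postgres \
        --spring.datasource.driver-class-name=org.postgresql.Driver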
We can check whether it's running by opening the dashboard URL (by default, http://localhost:9393/dashboard).
4.3. The Spring Cloud Data Flow Shell
Download the latest version of the Data Flow Shell jar into the SCDF home folder (available here). Once that's done, run it with the following command (updating the version as needed):
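For example (the version is illustrative):

    java -jar spring-cloud-dataflow-shell-1.7.4.RELEASE.jar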
If instead of "dataflow:>" you see "server-unknown:>" on the last line, the SCDF server is not running on localhost. In that case, run the following command to connect to another host:
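A sketch of the shell command; replace the host placeholder with the address of your SCDF server:

    server-unknown:>dataflow config server http://<scdf-host>:9393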
Now the shell is connected to the SCDF server, and we can run our commands.
The first thing we need to do in the shell is to import the application starters. Find the latest version here for RabbitMQ + Maven in Spring Boot 2.0.x, and run the following command (again, update the version, here "Darwin-SR1", as needed):
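A sketch of the bulk-import command; the URI below follows the naming pattern used for the Darwin release train of the app starters and may need to be adjusted for the release you are using:

    app import --uri https://bit.ly/Darwin-SR1-stream-applications-rabbit-maven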
To verify the installed applications, run the following shell command:
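For example:

    app list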
As a result, we should see a table containing all of the installed applications.
5. Composing an ETL Pipeline
Now, let's create our stream pipeline. To do so, we'll use the JDBC source application starter to extract information from our relational database.
We'll also create a custom processor for transforming the information structure and a custom sink to load our data into MongoDB.
5.1. Extract – Preparing the Relational Database for Extraction
Let's create a database named crm and a table named customer:
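A sketch in PostgreSQL syntax; the exact column layout is illustrative, but the imported flag is what the JDBC source will later use to mark rows that have already been read:

    CREATE DATABASE crm;

    -- connect to the new database (e.g. \c crm in psql) before creating the table
    CREATE TABLE customer (
        id SERIAL PRIMARY KEY,
        customer_name VARCHAR(50),
        imported BOOLEAN DEFAULT FALSE
    );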
Now let's insert some data:
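For example, with some sample rows:

    INSERT INTO customer (customer_name, imported) VALUES ('John Doe', false);
    INSERT INTO customer (customer_name, imported) VALUES ('Jane Doe', false);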
5.2. Transform – Mapping JDBC Fields to the MongoDB Field Structure
For the transformation step, we'll create a new project named customer-transform. The easiest way to do this is to use the Spring Initializr site. After reaching the website, choose a Group and an Artifact name; we'll use com.customer and customer-transform, respectively.
Once this is done, click the "Generate Project" button to download the project. Then unzip it, import it into your favorite IDE, and add the following dependency to the pom.xml:
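A sketch, assuming the RabbitMQ binder for Spring Cloud Stream and that the Spring Cloud BOM manages the version:

    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-stream-binder-rabbit</artifactId>
    </dependency>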
The @JsonProperty annotations will perform the transformation while deserializing from JSON to Java:
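A sketch of the POJO, assuming the customer_name column from the source table should be exposed as name in MongoDB:

    import com.fasterxml.jackson.annotation.JsonProperty;

    public class Customer {

        private Long id;
        private String name;

        // maps the "customer_name" field of the incoming JDBC payload onto this POJO
        @JsonProperty("customer_name")
        public void setName(String name) {
            this.name = name;
        }

        // serializes the field as "name" on the way out to the sink
        @JsonProperty("name")
        public String getName() {
            return name;
        }

        public Long getId() {
            return id;
        }

        public void setId(Long id) {
            this.id = id;
        }
    }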
The processor needs to receive the data from an input, apply the transformation, and bind the outcome to an output channel. Let's create a class to do that:
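A sketch, assuming the annotation-based binding model of Spring Cloud Stream 2.x (@EnableBinding and the Processor binding interface were deprecated in later releases in favor of the functional model):

    import org.springframework.cloud.stream.annotation.EnableBinding;
    import org.springframework.cloud.stream.messaging.Processor;
    import org.springframework.integration.annotation.Transformer;

    @EnableBinding(Processor.class)
    public class CustomerProcessorConfiguration {

        // Jackson converts the incoming JSON payload to a Customer (applying the
        // @JsonProperty mapping); the returned object is sent to the output channel.
        @Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
        public Customer convertToPojo(Customer payload) {
            return payload;
        }
    }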
5.3. Load – Sink in MongoDB
Similarly to the transformation step, we'll create another project for the sink (here called customer-mongodb-sink), this time adding the MongoDB dependency. Then unzip it and import it into your favorite IDE.
Next, we'll create another Customer class, this one for receiving the input in this step:
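A sketch of the sink-side classes, assuming the annotation-based Spring Cloud Stream model and Spring Data MongoDB; class and collection names are illustrative:

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.cloud.stream.annotation.EnableBinding;
    import org.springframework.cloud.stream.annotation.StreamListener;
    import org.springframework.cloud.stream.messaging.Sink;
    import org.springframework.data.annotation.Id;
    import org.springframework.data.mongodb.core.mapping.Document;
    import org.springframework.data.mongodb.repository.MongoRepository;

    // Customer.java - the document stored in MongoDB; note the field is now "name"
    @Document(collection = "customer")
    public class Customer {

        @Id
        private Long id;
        private String name;

        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    // CustomerRepository.java - Spring Data repository used to persist customers
    interface CustomerRepository extends MongoRepository<Customer, Long> {
    }

    // CustomerListener.java - saves every Customer message arriving on the sink input
    @EnableBinding(Sink.class)
    class CustomerListener {

        @Autowired
        private CustomerRepository repository;

        @StreamListener(Sink.INPUT)
        public void save(Customer customer) {
            repository.save(customer);
        }
    }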
5.4. Stream Definition
Once both custom applications are built and available in the local Maven repository, we register them with the Spring Cloud Data Flow Shell:
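A sketch of the registration commands; the Maven coordinates assume the Group and Artifact names chosen earlier and the default 0.0.1-SNAPSHOT version generated by Spring Initializr:

    app register --name customer-transform --type processor --uri maven://com.customer:customer-transform:0.0.1-SNAPSHOT
    app register --name customer-mongodb-sink --type sink --uri maven://com.customer:customer-mongodb-sink:0.0.1-SNAPSHOT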
Finally, let's check that the applications are stored in SCDF by running the application list command in the shell:
5.4.1. The Stream Pipeline Domain-Specific Language – DSL
A DSL defines the configuration and the data flow between the applications. The SCDF DSL is simple: the first word defines the name of the application, followed by its configuration properties.
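For example, a sketch using the pre-built http source and log sink (depending on the starter release, the port property may be spelled --server.port instead of --port):

    http --port=8181 | log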
This creates an HTTP application served on port 8181 that sends any received body payload to a log.
Now, let's see how to create the DSL stream definition for the JDBC source.
5.4.2. JDBC Source Stream Definition
The key configurations for the JDBC source are query and update: query selects the unread records, while update changes a flag so that the current records are not read again. We'll also configure the JDBC source to poll at a fixed delay of 30 seconds, fetching a maximum of 1000 rows per poll. Finally, we'll define the connection configuration, such as the driver, username, password, and connection URL:
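A sketch of the source part of the stream definition, with line breaks added for readability; the property names follow the Darwin-era jdbc source starter, and the credentials and URL are placeholders:

    jdbc --query='SELECT id, customer_name FROM public.customer WHERE imported=false'
         --update='UPDATE public.customer SET imported=true WHERE id in (:id)'
         --max-rows-per-poll=1000 --fixed-delay=30 --time-unit=SECONDS
         --driver-class-name=org.postgresql.Driver
         --url=jdbc:postgresql://localhost:5432/crm
         --username=postgres --password=postgres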
5.4.3. Customer MongoDB Sink Stream Definition
Our sink application relies entirely on MongoDataAutoConfiguration. You can check other possible configurations here. Basically, we just define the spring.data.mongodb.uri:
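A sketch of the sink part of the definition; the database name is illustrative:

    customer-mongodb-sink --spring.data.mongodb.uri=mongodb://localhost/main_database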
5.4.4. Create and Deploy the Stream
First, to create the final stream definition, go back to the shell and run the following command (shown here with line breaks only for readability):
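A sketch that combines the fragments above; the app names, properties, and credentials are the ones assumed earlier:

    stream create --name jdbc-to-mongodb --definition "jdbc
        --query='SELECT id, customer_name FROM public.customer WHERE imported=false'
        --update='UPDATE public.customer SET imported=true WHERE id in (:id)'
        --max-rows-per-poll=1000 --fixed-delay=30 --time-unit=SECONDS
        --driver-class-name=org.postgresql.Driver
        --url=jdbc:postgresql://localhost:5432/crm
        --username=postgres --password=postgres
        | customer-transform
        | customer-mongodb-sink --spring.data.mongodb.uri=mongodb://localhost/main_database"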
This stream DSL defines a stream named jdbc-to-mongodb. Next, we deploy the stream by its name:
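For example:

    stream deploy --name jdbc-to-mongodb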
Finally, we should see the locations of all the available logs in the log output.
6. Conclusion
In this article, we've seen a full example of an ETL data pipeline using Spring Cloud Data Flow.
Most noteworthy, we saw the configuration of an application starter, created an ETL stream pipeline using the Spring Cloud Data Flow Shell, and implemented custom applications for reading, transforming, and writing our data.
As always, the example code can be found in the GitHub project.
Spring Cloud Data Flow
Microservice-based Streaming and Batch data processing for Cloud Foundry and Kubernetes.
Spring Cloud Data Flow provides tools to create complex topologies for streaming and batch data pipelines. The data pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks.
Spring Cloud Data Flow supports a range of data processing use cases, from ETL to import/export, event streaming, and predictive analytics.
Features
The Spring Cloud Data Flow server uses Spring Cloud Deployer to deploy data pipelines made of Spring Cloud Stream or Spring Cloud Task applications onto modern platforms such as Cloud Foundry and Kubernetes.
A selection of pre-built stream and task/batch starter apps for various data integration and processing scenarios facilitates learning and experimentation.
Custom stream and task applications, targeting different middleware or data services, can be built using the familiar Spring Boot style programming model.
A simple stream pipeline DSL makes it easy to specify which apps to deploy and how to connect outputs and inputs. The composed task DSL is useful when a series of task apps needs to be run as a directed graph.
The dashboard offers a graphical editor for building data pipelines interactively, as well as views of deployable apps, and supports monitoring them with metrics via Wavefront, Prometheus, InfluxDB, or other monitoring systems.
The Spring Cloud Data Flow server exposes a REST API for composing and deploying data pipelines. A separate shell makes it easy to work with the API from the command line.
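As an illustration (a sketch assuming the server runs on its default port 9393), the registered applications can be listed straight from the REST API:

    curl http://localhost:9393/apps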
Getting Started
The Spring Cloud Data Flow microsite is the best place to get started.
Documentation
Each listed release provides a Reference Doc. and an API Doc.:
2.9.1 (CURRENT GA)
2.9.2-SNAPSHOT (SNAPSHOT)
2.8.3 (GA)
2.7.2 (GA)
2.6.4 (GA)
2.5.3.RELEASE (GA)
2.4.2.RELEASE (GA)
2.3.1.RELEASE (GA)
OSS support
Free security updates and bugfixes with support from the Spring community. See VMware Tanzu OSS support policy.
Commercial support
Business support from Spring experts during the OSS timeline, plus extended support after OSS End-Of-Life.
Publicly available releases for critical bugfixes and security issues when requested by customers.
Future release
Generation not yet released; the timeline is subject to change.
Spring Cloud Data Flow Architecture
This guide explains the main concepts of Data Flow's architecture. Other sections of the guide cover these concepts in more detail.
Data Flow has two main components: the Data Flow Server and the Skipper Server.
The main entry point to access Data Flow is through the RESTful API of the Data Flow Server. The Web Dashboard is served from the Data Flow Server. The Data Flow Server and the Data Flow Shell application both communicate through the web API.
The servers can be run on several platforms: Cloud Foundry, Kubernetes, or on your Local machine. Each server stores its state in a relational database.
The following image shows a high-level view of the architecture and the paths of communication:
The Data Flow Server is responsible for parsing stream and batch job definitions written in the DSL, validating and persisting those definitions, registering applications (such as jar files and Docker images), deploying batch jobs and launching tasks on one or more platforms, querying task and job execution history, and delegating stream deployment to the Skipper Server.
The Skipper Server is responsible for deploying streams to one or more platforms and for upgrading and rolling back the applications in a stream at runtime.
The Data Flow Server and Skipper Server need to have an RDBMS installed. By default, the servers use an embedded H2 database. You can configure the servers to use external databases. The supported databases are H2, HSQLDB, MySQL, Oracle, PostgreSQL, DB2, and SQL Server. The schemas are automatically created when each server starts.
The Data Flow and Skipper Server executable jars use OAuth 2.0 authentication to secure the relevant REST endpoints. You can access these either by using basic authentication or by using OAuth2 access tokens. For an OAuth provider, we recommend the CloudFoundry User Account and Authentication (UAA) Server, which also provides comprehensive LDAP support. See the Security Section in the reference guide for more information on configuring security features to your needs.
By default, the REST endpoints (administration, management, and health) as well as the Dashboard UI do not require authenticated access.
Applications come in two flavors:
Long-lived applications. There are two types of long-lived applications: message-driven applications in which a single input or output (or both) connects to messaging middleware (sources, processors, and sinks), and message-driven applications that can have multiple inputs and outputs.
Short-lived applications that process a finite set of data and then terminate. There are two variations of short-lived applications: tasks, which record the life cycle of a run by using the Spring Cloud Task framework, and batch jobs, built with the Spring Batch framework.
It is common to write long-lived applications based on the Spring Cloud Stream framework and short-lived applications based on the Spring Cloud Task or Spring Batch frameworks. There are many guides in the documentation that show you how to use these frameworks in developing data pipelines. However, you can also write long-lived and short-lived applications that do not use Spring. They can also be written in other programming languages.
Depending on the runtime, you can package applications in two ways: as a Spring Boot uber-jar (resolvable from a Maven repository, a file location, or over HTTP) or as a Docker image.
Long-lived applications are expected to run continuously. If the application stops, the platform is responsible for restarting it.
The Spring Cloud Stream framework provides a programming model to simplify the writing of message-driven microservice applications that are connected to a common messaging system. You can write core business logic that is agnostic to the specific middleware. The middleware to use is determined by adding a Spring Cloud Stream Binder library as a dependency to the application. Binder libraries are available for messaging middleware products such as RabbitMQ and Apache Kafka, among others.
The Data Flow server delegates to the Skipper server to deploy long-lived applications.
Streams with Sources, Processors, and Sinks
Spring Cloud Stream defines the concept of a binding interface that encapsulates in code a message exchange pattern, namely what the application's inputs and outputs are. Spring Cloud Stream provides several binding interfaces that correspond to common message exchange contracts: a Source (a producer of messages), a Sink (a consumer of messages), and a Processor (both a consumer and a producer).
The following example shows the shell syntax for registration of an http source (an application that listens for HTTP requests and sends HTTP payload to a destination) and a log sink (an application that consumes from a destination and logs the received message):
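A sketch of the registration commands; the Maven coordinates and versions are illustrative and assume the RabbitMQ variants of the pre-built applications:

    app register --name http --type source --uri maven://org.springframework.cloud.stream.app:http-source-rabbit:1.2.0.RELEASE
    app register --name log --type sink --uri maven://org.springframework.cloud.stream.app:log-sink-rabbit:1.2.0.RELEASE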
With http and log registered with Data Flow, you can create a stream definition by using the Stream Pipeline DSL, which uses a pipes and filters syntax, as the following example shows:
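For example (the stream name is illustrative):

    stream create --name http-ingest --definition "http | log"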
The pipe symbol in http | log represents the connection of the source output to the sink input. Data Flow sets the appropriate properties when deploying the stream so that the source can communicate with the sink over the messaging middleware.
Streams with Multiple Inputs and Outputs
Sources, sinks, and processors all have a single output, a single input, or both. This is what makes it possible for Data Flow to set application properties that pair an output destination to an input destination. However, a message processing application could have more than one input or output destination. Spring Cloud Stream supports this by letting you define a custom binding interface.
The following example shows a fictional orderStream:
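A sketch using the Stream Application DSL; the application names are illustrative placeholders registered as type app:

    stream create --name orderStream --definition "orderGeneratorApp || baristaApp || hotDrinkDeliveryApp || coldDrinkDeliveryApp"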
When you define a stream by using the | symbol, Data Flow can configure each application in the stream to communicate with its neighboring application in the DSL, since there is always one output paired to one input. When you use the || symbol, you must provide configuration properties that pair together the multiple output and input destinations.
You can also create a stream with a single application by using the Stream Application DSL as well as deploying an application that does not use messaging middleware.
These examples give you a general sense of the long-lived application types. Additional guides go into more detail on how to develop, test, and register long-lived applications and how to deploy them.
The next major section discusses the runtime architecture of the deployed stream.
Short-lived applications run for a period of time (often minutes to hours) and then terminate. Their runs may be based on a schedule (for example, 6 a.m. every weekday) or in response to an event (for example, a file being put in an FTP server).
The Spring Cloud Task framework lets you develop a short-lived microservice that records the life cycle events (such as the start time, the end time, and the exit code) of a short-lived application.
A task application is registered with Data Flow by using the name task to describe the type of application.
The following example shows the shell syntax for registering a timestamp task (an application that prints the current time and exits):
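A sketch of the registration command; the Maven coordinates and version are illustrative:

    app register --name timestamp --type task --uri maven://org.springframework.cloud.task.app:timestamp-task:2.1.0.RELEASE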
The task definition is created by referencing the name of the task, as the following example shows:
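For example (the definition name is illustrative):

    task create tsTask --definition "timestamp"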
The Spring Batch framework is probably what comes to mind for Spring developers who write short-lived applications. Spring Batch provides a much richer set of functionality than Spring Cloud Task and is recommended when processing large volumes of data. A use case might be to read many CSV files, transform each row of data, and write each transformed row to a database. Spring Batch provides its own database schema with much richer information about the execution of a Spring Batch job. Spring Cloud Task is integrated with Spring Batch so that, if a Spring Cloud Task application defines a Spring Batch job, a link between the Spring Cloud Task and Spring Batch run tables is created.
Tasks that use Spring Batch are registered and created in the same way as shown previously.
The Spring Cloud Data Flow server launches the task to the platform.
Spring Cloud Data Flow lets a user create a directed graph, where each node of the graph is a task application.
This is done by using the Composed Task Domain Specific Language for composed tasks. There are several symbols in the Composed Task DSL that determine the overall flow. The reference guide goes into detail. The following example shows how the double ampersand symbol ( && ) is used for conditional execution:
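A sketch; task1 and task2 stand for previously created task definitions:

    task create simple-composed-task --definition "task1 && task2"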
The DSL expression ( task1 && task2 ) means that task2 is launched only if task1 has run successfully. The graph of tasks is run through a task application called the Composed Task Runner.
Additional guides will go into more detail on how to develop, test, and register short-lived applications and how to deploy them.
The long-lived and the short-lived applications can provide metadata about the supported configuration properties. The metadata is used by Shell and UI tools to offer contextual help and code completion when building data pipelines. You can find more about how to generate and use Application Metadata in this detailed guide.
To kickstart your development, you can use many pre-built applications to integrate with common data sources and sinks. For example, you can use a cassandra sink that writes data to Cassandra and a groovy-transform processor that transforms the incoming data by using a Groovy script.
The installation instructions show how to register these applications with Spring Cloud Data Flow.
You can find more information on pre-built applications in the Applications guide.
Microservice Architectural Style
The Data Flow and Skipper servers deploy streams and composed batch jobs to the platform as a collection of microservice applications, each running in their own process. Each microservice application can be scaled up or down independently of the other, and each has its own versioning lifecycle. Skipper lets you independently upgrade or roll back each application in a stream at runtime.
When using Spring Cloud Stream and Spring Cloud Task, each microservice application builds upon Spring Boot as the foundational library. This gives all microservice applications functionality, such as health checks, security, configurable logging, monitoring, and management functionality, as well as executable JAR packaging.
In addition to passing the appropriate application properties to each application, the Data Flow and Skipper servers are responsible for preparing the target platform's infrastructure. For example, on Cloud Foundry, they bind the specified services to the applications; on Kubernetes, they create the deployment and service resources.
The Data Flow Server helps simplify the deployment of multiple related applications onto a target runtime, setting up necessary input and output topics, partitions, and metrics functionality. However, you can also opt to deploy each of the microservice applications manually and not use Data Flow or Skipper at all. This approach might be more appropriate to start out with for small scale deployments, gradually adopting the convenience and consistency of Data Flow as you develop more applications. Manual deployment of stream and task-based microservices is also a useful educational exercise that can help you better understand some of the automatic application configuration and platform targeting steps that the Data Flow Server provides. The stream and batch developer guides follow this approach.
Comparison to Other Architectures
Spring Cloud Data Flow’s architectural style is different than other stream and batch processing platforms. For example, in Apache Spark, Apache Flink, and Google Cloud Dataflow, applications run on a dedicated compute engine cluster. The nature of the compute engine gives these platforms a richer environment for performing complex calculations on the data as compared to Spring Cloud Data Flow, but it introduces the complexity of another execution environment that is often not needed when creating data-centric applications. That does not mean that you cannot do real-time data computations when you use Spring Cloud Data Flow. For example, you can develop applications that use the Kafka Streams API time-sliding-window and moving-average functionality as well as joins of the incoming messages against sets of reference data.
A benefit of this approach is that we can delegate to popular platforms at runtime. Data Flow can benefit from their feature set (resilience and scalability) as well as the knowledge you may already have about those platforms as you may be using them for other purposes. This reduces the cognitive distance for creating and managing data-centric applications, as many of the same skills used for deploying other end-user/web applications are applicable.
The following image shows the runtime architecture of a simple stream:
The Stream DSL is sent by POST to the Data Flow Server. Based on the mapping of DSL application names to Maven and Docker artifacts, the http source and jdbc sink applications are deployed by Skipper to the target platform. Data that is posted to the HTTP application is then stored in a database.
The http source and jdbc sink applications are running on the specified platform and have no connection to the Data Flow or Skipper server.
The following image shows the runtime architecture of a stream consisting of applications that can have multiple inputs and outputs:
Tasks and Batch Jobs
The following image shows the runtime architecture for a Task and a Spring Batch job:
The following image shows the runtime architecture for a composed task:
You can deploy the Spring Cloud Data Flow Server and the Skipper Server on Cloud Foundry, Kubernetes, and your local machine.
You can also deploy the applications that are deployed by these servers to multiple platforms: Local, Cloud Foundry, and Kubernetes.
The most common architecture is to install the Data Flow and Skipper servers on the same platform where you deploy your applications. You can also deploy to multiple Cloud Foundry orgs, spaces, and foundations, as well as to multiple Kubernetes clusters.
There are community implementations that let you deploy to other platforms, namely HashiCorp Nomad, Red Hat OpenShift, and Apache Mesos.
The local server is supported in production for task deployment as a replacement for the Spring Batch admin project. The local server is not supported in production for stream deployments.