Imagine you're working on a critical project, pouring hours of effort into writing code, only to accidentally delete a crucial file. Panic sets in as you realize there's no way to retrieve the previous version. But wait! Introducing Git, the superhero of version control systems. With Git, you can effortlessly track changes, revert to previous versions, collaborate seamlessly with teammates, and even branch out to experiment without fear of irreversible consequences. Git empowers developers to confidently navigate the complex world of software development, ensuring smooth workflows and protecting precious code from the clutches of accidental deletions.

Git is the most popular version control system. It tracks changes in computer files and coordinates work on those files among multiple people. It is primarily used for source-code management in software development, but it can be used to keep track of changes in any set of files.

How It Works

Creating a Repository: Navigate to your project folder and enter the command `git init` to initialize a Git repository for your project on the local system.

Making Changes: Once the directory has been initialized, you can check whether the files are being tracked by Git using the command `git status`. Since no files are being tracked yet, let us stage them with the command `git add .`, which will track all the files in the project folder. Once the files or changes have been staged, we are ready to commit them to our repository using the command `git commit -m "custom message"`.

Syncing Repositories: Once everything is ready locally, we can start pushing our changes to the remote repository. Copy your repository link and paste it into the command `git remote add origin <URL to repository>`. To push the changes to your repository, enter the command `git push origin <branch-name>`. In our case the branch is master, hence `git push origin master`. This command will then prompt for a username and password; enter the values and hit enter. Your local repository is now synced with the remote repository on GitHub.

Similarly, if we want to download a remote repository to our local system, we can use the command `git clone <URL>`. This command will create a folder with the repository name and download all the contents of the repository inside this folder.

The `git pull` command is also used for pulling the latest changes from the repository; unlike `git clone`, it only works inside an initialized Git repository. This command is used when you are already working in a cloned repository and want to pull the latest changes that others might have pushed to the remote: `git pull <URL>`.

Until now, you have seen how we can work with Git on our own. But now, imagine multiple developers working on the same repository or project. To handle the workspaces of multiple developers, we use branches. To create a branch from an existing branch, use the command `git branch <new-branch-name>`; to delete a branch, use `git branch -D <branch-name>`. To switch to the new branch, use the command `git checkout <branch-name>`.

Want to check the log for every commit detail in your repository? You can accomplish that using the command `git log`.

Want to save your work without committing the code?
Git has got you covered. Stashing can be helpful when you want to switch branches but do not want to commit your changes to the repository. To stash your staged files without committing, just type `git stash`. If you want to stash your untracked files as well, type `git stash -u`. Once you are back and want to retrieve your work, type `git stash pop`.

The `git revert` command helps you revert a commit: `git revert <commit-id>`. The `<commit-id>` can be obtained from the output of `git log`.

The `git diff` command helps us check the differences between two versions of a file: `git diff <commit-id of version x> <commit-id of version y>`.

In conclusion, Git is an essential tool for developers and teams, providing efficient and reliable version control capabilities that facilitate collaboration, track changes, and simplify the management of code and files throughout a project's lifecycle. Its versatility and robust features make it a valuable asset in software development and other industries where version control is crucial.
One of Apache Kafka's best-known mantras is "it preserves the message ordering per topic-partition," but is it always true? In this blog post, we'll analyze a few real scenarios where accepting the dogma without questioning it could result in unexpected and erroneous sequences of messages.

Basic Scenario: Single Producer

We can start our journey with a basic scenario: a single producer sending messages to an Apache Kafka topic with a single partition, in sequence, one after the other. In this basic situation, as per the known mantra, we should always expect correct ordering. But is it true? Well... it depends!

The Network Is Not Equal

In an ideal world, the single-producer scenario should always result in correct ordering. But our world isn't perfect! Different network paths, errors, and delays could mean that a message gets delayed or lost. Let's imagine the situation below: a single producer sending three messages to a topic:

- Message 1, for some reason, finds a long network route to Apache Kafka.
- Message 2 finds the quickest network route to Apache Kafka.
- Message 3 gets lost in the network.

Even in this basic scenario, with only one producer, we could get an unexpected series of messages on the topic. The end result on the Kafka topic will show only two events being stored, with the unexpected ordering 2, 1. If you think about it, it's the correct ordering from the Apache Kafka point of view: a topic is only a log of information, and Apache Kafka will write the messages to the log depending on when it "senses" the arrival of a new event. It's based on Kafka ingestion time and not on when the message was created (event time).

Acks and Retries

But not all is lost! If we look into the producing libraries (aiokafka being an example), we have ways to ensure that messages are delivered properly. First of all, to avoid the problem with message 3 in the above scenario, we could define a proper acknowledgment mechanism. The acks producer parameter allows us to define what confirmation of message reception we want to have from Apache Kafka. Setting this parameter to 1 will ensure that we receive an acknowledgment from the primary broker responsible for the topic (and partition). Setting it to all will ensure that we receive the ack only if both the primary and the replicas correctly store the message, thus saving us from problems when only the primary receives the message and then fails before propagating it to the replicas.

Once we set a sensible ack, we should set the possibility to retry sending the message if we don't receive a proper acknowledgment. Differently from other libraries (kafka-python being one of them), aiokafka will retry sending the message automatically until the timeout (set by the request_timeout_ms parameter) has been exceeded. With acknowledgment and automatic retries, we should solve the problem of message 3: the first time it is sent, the producer will not receive the ack. Therefore, after the retry_backoff_ms interval, it will send message 3 again.
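To make this concrete, here is a minimal sketch of such a producer using aiokafka; the broker address and topic name are assumptions for illustration:

```python
import asyncio
from aiokafka import AIOKafkaProducer

async def produce():
    # Assumed broker address and topic name, for illustration only.
    producer = AIOKafkaProducer(
        bootstrap_servers="localhost:9092",
        acks="all",                # ack only after primary and replicas stored the message
        request_timeout_ms=40000,  # window in which aiokafka keeps retrying automatically
        retry_backoff_ms=100,      # pause between retries
    )
    await producer.start()
    try:
        for i in (1, 2, 3):
            # send_and_wait returns only once the ack for this message arrived
            await producer.send_and_wait("test-topic", f"message {i}".encode())
    finally:
        await producer.stop()

asyncio.run(produce())
```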
Max In-Flight Requests

However, if you watch the end result in the Apache Kafka topic closely, the resulting ordering is still not correct: we sent 1, 2, 3 and got 2, 1, 3 in the topic... how do we fix that? The old method (available in kafka-python) was to set the maximum in-flight requests per connection: the number of messages we allow to be "in the air" at the same time without acknowledgment. The more messages we allow in the air at the same time, the higher the risk of getting out-of-order messages.

When using kafka-python, if we absolutely needed to have a specific ordering in the topic, we were forced to limit max_in_flight_requests_per_connection to 1. Basically, supposing that we set the acks parameter to at least 1, we were waiting for an acknowledgment of every single message (or batch of messages, if the message size is less than the batch size) before sending the following one. The absolute correctness of ordering, acknowledgment, and retries comes at the cost of throughput. The fewer messages we allow to be "in the air" at the same time, the more acks we need to receive, and the fewer overall messages we can deliver to Kafka in a defined timeframe.

Idempotent Producers

To overcome the strict serialization of sending one message at a time and waiting for acknowledgment, we can define idempotent producers. With an idempotent producer, each message gets labeled with both a producer ID and a serial number (a sequence maintained for each partition). This composed ID is then sent to the broker alongside the message. The broker keeps track of the serial number per producer and topic/partition. Whenever a new message arrives, the broker checks the composed ID, and if, within the same producer, the value is equal to the previous number + 1, then the new message is acknowledged; otherwise, it is rejected. This provides a guarantee of the global ordering of messages while allowing a higher number of in-flight requests per connection (a maximum of 5 for the Java client).

Increase Complexity With Multiple Producers

So far, we imagined a basic scenario with only one producer, but the reality in Apache Kafka is that there will often be multiple producers. What are the little details to be aware of if we want to be sure about the end ordering result?

Different Locations, Different Latency

Again, the network is not equal, and with several producers located in possibly very remote positions, the different latency means that the Kafka ordering could differ from the one based on event time. Unfortunately, the different latencies between different locations on Earth can't be fixed. Therefore, we will need to accept this scenario.

Batching, an Additional Variable

To achieve higher throughput, we might want to batch messages. With batching, we send messages in "groups," minimizing the overall number of calls and increasing the payload-to-overall-message-size ratio. But, in doing so, we can again alter the ordering of events. The messages in Apache Kafka will be stored per batch, depending on the batch ingestion time. Therefore, the ordering of messages will be correct per batch, but different batches could contain differently ordered messages.

Now, with both different latencies and batching in place, it seems that our global ordering premise is completely lost... So, why are we claiming that we can manage the events in order?

The Savior: Event Time

We understood that the original premise about Kafka keeping the message ordering is not 100% true. The ordering of the messages depends on the Kafka ingestion time and not on the event generation time. But what if the ordering based on event time is important? Well, we can't fix the problem on the production side, but we can do it on the consumer side. All the most common tools that work with Apache Kafka have the ability to define which field to use as event time, including Kafka Streams, Kafka Connect with the dedicated Timestamp extractor single message transformation (SMT), and Apache Flink®.
Consumers, when properly defined, will be able to reshuffle the ordering of messages coming from a particular Apache Kafka topic. Let's analyze the Apache Flink example below:

```sql
CREATE TABLE CPU_IN (
    hostname STRING,
    cpu STRING,
    usage DOUBLE,
    occurred_at BIGINT,
    time_ltz AS TO_TIMESTAMP_LTZ(occurred_at, 3),
    WATERMARK FOR time_ltz AS time_ltz - INTERVAL '10' SECOND
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = '',
    'topic' = 'cpu_load_stats_real',
    'value.format' = 'json',
    'scan.startup.mode' = 'earliest-offset'
)
```

In the above Apache Flink table definition, we can notice:

- occurred_at: the field is defined in the source Apache Kafka topic in unix time (datatype is BIGINT).
- time_ltz AS TO_TIMESTAMP_LTZ(occurred_at, 3): transforms the unix time into the Flink timestamp.
- WATERMARK FOR time_ltz AS time_ltz - INTERVAL '10' SECOND: defines the new time_ltz field (calculated from occurred_at) as the event time and defines a threshold for late arrival of events, with a maximum of 10 seconds of delay.

Once the above table is defined, the time_ltz field can then be used to correctly order events and define aggregation windows, making sure that all events within the accepted latency are included in the calculations. The - INTERVAL '10' SECOND defines the latency of the data pipeline and is the penalty we need to accept to allow the correct ingestion of late-arriving events. Please note, however, that the throughput is not impacted. We can have as many messages flowing in our pipeline as we want, but we're "waiting 10 seconds" before calculating any final KPI in order to make sure we include all the events in a specific timeframe.

An alternative approach, which works only if the events contain the full state, is to keep, for a certain key (hostname and cpu in the above example), the maximum event time reached so far, and only accept changes where the new event time is greater than that maximum.

Wrapping Up

The concept of ordering in Kafka can be tricky, even if we only consider a single topic with a single partition. This post shared a few common situations that could result in an unexpected series of events. Luckily, options like limiting the number of messages in flight, or using idempotent producers, can help achieve an ordering in line with expectations. In the case of multiple producers and the unpredictability of network latency, the option available is to fix the overall ordering on the consumer side by properly handling the event time, which needs to be specified in the payload.

Some further reading:

- Kafka Streams event time
- Check out the Timestamp router SMT in Kafka Connect
Managing Kubernetes add-ons can be a challenging task, especially when dealing with complex deployments and frequent configuration changes. In this article, we will explore how Sveltos and Carvel ytt can work together to simplify Kubernetes resource management. Sveltos is a powerful Kubernetes add-on management tool, while Carvel ytt is a templating and patching tool for YAML files. We will delve into the integration of Carvel ytt with Sveltos using the ytt controller, enabling seamless deployment and configuration management.

Introducing Sveltos

Sveltos is an open-source tool that simplifies the process of managing and deploying add-ons to Kubernetes clusters. It provides a comprehensive solution for installing, configuring, and managing add-ons, making it easier to enhance the functionality and capabilities of Kubernetes. Sveltos provides support for Helm charts, Kustomize, and resource YAMLs. To know more about Sveltos, this article delves into the management of Kubernetes add-ons using Sveltos. This other article focuses on deploying add-ons as a result of events.

An Overview of Carvel ytt

Carvel ytt is a tool that is part of the Carvel suite. Its main purpose is to facilitate the generation and management of YAML files based on templates. With ytt, you can easily create and modify YAML files by leveraging templates and data values. This enables a flexible and dynamic approach to configuration management within Kubernetes environments. Unlike Helm and other similar templating tools that treat YAML templates purely as text templates, ytt takes advantage of the inherent language structure of YAML. This means that ytt understands the underlying structure of YAML configurations and utilizes comments to annotate those structures. As a result, ytt goes beyond traditional text templating and becomes a YAML structure-aware templating solution. This unique feature alleviates the need for developers to ensure the structural validity of their generated YAML configurations and makes the process of writing templates much more straightforward.

Integrating Carvel ytt With Sveltos via the ytt Controller

To harness the capabilities of Carvel ytt with Sveltos, we have developed the ytt controller. The ytt controller acts as a bridge between Sveltos and Carvel ytt, enabling the processing of ytt files and making the output accessible to Sveltos. In order to utilize the ytt controller, a Kubernetes Custom Resource Definition (CRD) called YttSource was introduced. By creating instances of YttSource, you can specify the sources of ytt files through various options such as Flux Sources (GitRepository/OCIRepository/Bucket), ConfigMap, or Secret. The integration process involves the following steps:

1) Install the ytt controller:

```shell
kubectl apply -f https://raw.githubusercontent.com/gianlucam76/ytt-controller/main/manifest/manifest.yaml
```

2) Use a GitRepository as a source:

```yaml
apiVersion: extension.projectsveltos.io/v1alpha1
kind: YttSource
metadata:
  name: yttsource-flux
spec:
  namespace: flux-system
  name: flux-system
  kind: GitRepository
  path: ./deployment/
```

Flux is utilized to synchronize the ytt-examples GitHub repository, which contains the ytt files. The YttSource instructs the ytt controller to get ytt files from the Flux GitRepository. The ytt controller automatically detects changes in the repository and invokes the ytt module to process the files.
The resulting output is stored in the Status section of the YttSource instance.

3) Sveltos can then utilize its template feature to deploy the generated Kubernetes resources to the managed cluster:

```yaml
apiVersion: config.projectsveltos.io/v1alpha1
kind: ClusterProfile
metadata:
  name: deploy-resources
spec:
  clusterSelector: env=fv
  templateResourceRefs:
  - resource:
      apiVersion: extension.projectsveltos.io/v1alpha1
      kind: YttSource
      name: yttsource-flux
      namespace: default
    identifier: YttSource
  policyRefs:
  - kind: ConfigMap
    name: info
    namespace: default
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: info
  namespace: default
  annotations:
    projectsveltos.io/template: "true" # add annotation to indicate Sveltos content is a template
data:
  resource.yaml: |
    {{ (index .MgtmResources "YttSource").status.resources }}
```

```shell
kubectl exec -it -n projectsveltos sveltosctl-0 -- ./sveltosctl show addons
+-------------------------------------+-----------------+-----------+----------------------+---------+-------------------------------+------------------+
|               CLUSTER               |  RESOURCE TYPE  | NAMESPACE |         NAME         | VERSION |             TIME              | CLUSTER PROFILES |
+-------------------------------------+-----------------+-----------+----------------------+---------+-------------------------------+------------------+
| default/sveltos-management-workload | :Service        | staging   | sample-app           | N/A     | 2023-05-22 08:00:28 -0700 PDT | deploy-resources |
| default/sveltos-management-workload | apps:Deployment | staging   | sample-app           | N/A     | 2023-05-22 08:00:28 -0700 PDT | deploy-resources |
| default/sveltos-management-workload | :Secret         | staging   | application-settings | N/A     | 2023-05-22 08:00:28 -0700 PDT | deploy-resources |
+-------------------------------------+-----------------+-----------+----------------------+---------+-------------------------------+------------------+
```

For detailed information on the ytt controller and its usage with ConfigMap/Secret, please refer to the Sveltos documentation. It provides comprehensive insights into the ytt controller and offers guidance on integrating it with ConfigMap and Secret resources.

Conclusion

By integrating Carvel ytt with Sveltos using the ytt controller, we can greatly simplify Kubernetes resource management. This powerful combination enables clean and efficient configuration management, seamless deployment of resources, and effortless synchronization of changes. Sveltos empowers DevOps teams to focus on their core tasks while providing a unified and intuitive interface for managing Kubernetes infrastructure effectively. Carvel ytt enhances the deployment process by enabling declarative configuration management and ensuring consistency across deployments. Together, Sveltos and Carvel ytt create a robust solution for managing Kubernetes resources with ease and efficiency.
To get more clarity about ISR in Apache Kafka, we should first carefully examine the replication process in the Kafka broker. In short, replication means having multiple copies of our data spread across multiple brokers. Maintaining the same copies of data on different brokers makes high availability possible: if one or more brokers go down or become unreachable in a multi-node Kafka cluster, the remaining brokers can still serve requests. For this reason, it is mandatory to specify how many copies of data we want to maintain in the multi-node Kafka cluster while creating a topic. This number is termed the replication factor, and that's why it can't be more than one when creating a topic on a single-node Kafka cluster. The number of replicas specified while creating a topic can be changed in the future based on node availability in the cluster.

On a single-node Kafka cluster, however, we can have more than one partition in the broker, because each topic can have one or more partitions. Partitions are nothing but sub-divisions of a topic spread across the brokers in the cluster, and each partition holds the actual data (messages). Internally, each partition is a single log file upon which records are written in an append-only fashion. Based on the provided number, the topic is internally split into that many partitions at creation time. Thanks to partitioning, messages can be distributed in parallel among several brokers in the cluster, and Kafka scales to accommodate several consumers and producers at once by employing this parallelism technique. This partitioning technique enables linear scaling for both consumers and producers. Even though more partitions in a Kafka cluster provide higher throughput, there are pitfalls too: more file handles are created as we increase the number of partitions, because each partition maps to a directory in the file system of the broker.

Now it is easier to understand ISR, having discussed the replication and partitioning of Apache Kafka above. The ISRs are simply a partition's replicas that are "in sync" with the leader, where the leader is the replica that all requests from clients and other Kafka brokers go to. Other replicas that are not the leader are termed followers, and a follower that is in sync with the leader is called an ISR (in-sync replica). For example, if we set a topic's replication factor to 3, Kafka will store the topic-partition log in three different places and will only consider a record to be committed once all three of these replicas have verified that they have written the record to disk successfully and have sent back an acknowledgment to the leader.

In a multi-broker (multi-node) Kafka cluster (please click here to read how a multi-node Kafka cluster can be created), one broker is selected as the leader of each partition, and this leader broker is responsible for handling all the read and write requests for that partition, while the followers (on other brokers) passively replicate the leader to achieve data consistency. Each partition can have only one leader at a time, which handles all reads and writes of records for that partition. The followers replicate the leader and take over if the leader dies. By leveraging Apache ZooKeeper, Kafka internally selects a replica of one broker's partition, and if the leader of that partition fails (due to an outage of that broker), Kafka chooses a new ISR (in-sync replica) as the new leader.
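As an illustration, here is a minimal sketch that creates a replicated topic with kafka-python; the broker address, topic name, and configuration values are assumptions for this example:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a three-broker cluster reachable via localhost:9092.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topic = NewTopic(
    name="orders",
    num_partitions=3,
    replication_factor=3,  # keep three copies of every partition
    # minimum replicas that must acknowledge a write (discussed below)
    topic_configs={"min.insync.replicas": "2"},
)
admin.create_topics(new_topics=[topic])
admin.close()
```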
When all of the ISRs for a partition have written a record to their log, the record is said to be "committed," and consumers can only read committed records. The minimum in-sync replica count specifies the minimum number of replicas that must be available for the producer to successfully send records to a partition. Even though a high number of minimum in-sync replicas gives higher durability, it can have a repulsive effect in terms of availability: data availability is automatically reduced if the minimum number of in-sync replicas is not available at publishing time. For example, if we have a three-node operational Kafka cluster with the minimum in-sync replicas configuration set to three, and subsequently one node goes down or becomes unreachable, then the remaining two nodes will not be able to receive any data/messages from the producers, because only two in-sync replicas are active/available across the brokers. The third replica, which existed on the dead or unavailable broker, cannot send the acknowledgment to the leader that it is synced with the latest data, as the other two live replicas on the available brokers do.

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.
Building a cluster of single-board mini-computers is an excellent way to explore and learn about distributed computing. With the scarcity of Raspberry Pi boards, and prices starting to get prohibitive for some projects, alternatives such as Orange Pi have gained popularity. In this article, I'll show you how to build a (surprisingly cheap) 4-node cluster packed with 16 cores and 4GB RAM to deploy a MariaDB replicated topology that includes three database servers and a database proxy, all running on a Docker Swarm cluster and automated with Ansible. This article was inspired by a member of the audience who asked my opinion about Orange Pi during a talk I gave in Colombia. I hope this completes the answer I gave you.

What Is a Cluster?

A cluster is a group of computers that work together to achieve a common goal. In the context of distributed computing, a cluster typically refers to a group of computers that are connected to each other and work together to perform computation tasks. Building a cluster allows you to harness the power of multiple computers to solve problems that a single computer cannot handle. For example, a database can be replicated in multiple nodes to achieve high availability—if one node fails, other nodes can take over. It can also be used to implement read/write splitting to make one node handle writes, and another reads, in order to achieve horizontal scalability.

What Is Orange Pi Zero2?

The Orange Pi Zero2 is a small single-board computer that runs on the ARM Cortex-A53 quad-core processor. It has 512MB or 1GB of DDR3 RAM, 100Mbps Ethernet, Wi-Fi, and Bluetooth connectivity. The Orange Pi Zero2 is an excellent choice for building a cluster due to its low cost, small size, and good performance. The only downside I found was that the Wi-Fi connection didn't seem to perform as well as with other single-board computers. From time to time, the boards disconnect from the network, so I had to place them close to a Wi-Fi repeater. This could be a problem with my setup or with the boards. I'm not entirely sure. Having said that, this is not a production environment, so it worked pretty well for my purposes.

What You Need

Here are the ingredients:

- Orange Pi Zero2: I recommend the 1GB RAM variant, and try to get at least 4 of them. I recently bought 4 of them for €30 each. Not bad at all! Give it a try!
- MicroSD cards: One per board. Try to use fast ones — it will make quite a difference in performance! I recommend at least 16GB. For reference, I used SanDisk Extreme Pro Micro/SDXC with 32GB, which offers a write speed of 90 MB/s and reads at 170 MB/s.
- A USB power hub: To power the devices, I recommend a dedicated USB power supply. You could also just use individual chargers, but the setup will be messier and require a power strip with as many outlets as devices you have. It's better to use a USB multi-port power supply. I used an Anker PowerPort 6, but there are also good and cheaper alternatives. You'll have to Google this too. Check that each port can supply 5V and at least 2.4A.
- USB cables: Each board needs to be powered via a USB-C port. You need a cable with one end of type USB-C and the other of the type your power hub accepts.
- Bolts and nuts: To stack up the boards.
- Heat sinks (optional): These boards can get hot. I recommend getting heat sinks to help with heat dissipation.
Materials needed for building an Orange Pi Zero2 cluster

Assembling the Cluster

One of the fun parts of building this cluster is the physical assembly of the boards on a case or some kind of structure that makes them look like a single manageable unit. Since my objective here is to keep the budget as low as possible, I used cheap bolts and nuts to stack the boards one on top of the other. I didn't find any ready-to-use cluster cases for the Orange Pi Zero2. One alternative is to 3D-print your own case. When stacking the boards together, keep an eye on the antenna placement. Avoid crushing the cable, especially if you installed heat sinks.

An assembled Orange Pi Zero2 cluster with 4 nodes

Installing the Operating System

The second step is to install the operating system on each microSD card. I used Armbian bullseye legacy 4.9.318. Download the file and use a tool like balenaEtcher to make bootable microSD cards. Download and install this tool on your computer. Select the Armbian image file and the drive that corresponds to the microSD card. Flash the image and repeat the process for each microSD card.

Configuring the Orange Pi Wi-Fi Connection (Headless)

To configure the Wi-Fi connection, Armbian includes the /boot/armbian_first_run.txt.template file, which allows you to configure the operating system when it runs for the first time. The template includes instructions, so it's worth checking. You have to rename this file to armbian_first_run.txt. Here's what I used:

```
FR_general_delete_this_file_after_completion=1
FR_net_change_defaults=1
FR_net_ethernet_enabled=0
FR_net_wifi_enabled=1
FR_net_wifi_ssid='<my_connection_id>'
FR_net_wifi_key='<my_password>'
FR_net_wifi_countrycode='FI'
FR_net_use_static=1
FR_net_static_gateway='192.168.1.1'
FR_net_static_mask='255.255.255.0'
FR_net_static_dns='192.168.1.1 8.8.8.8'
FR_net_static_ip='192.168.1.181'
```

Use your own Wi-Fi details, including connection name, password, country code, gateway, mask, and DNS. I wasn't able to read the SD card from macOS, so I had to use another laptop with Linux on it to make the changes to the configuration file on each SD card. To mount the SD card on Linux, run the following command before and after inserting the SD card and see what changes:

```shell
sudo fdisk -l
```

I created a Bash script to automate the process. The script accepts the IP address to set as a parameter. For example:

```shell
sudo ./armbian-setup.sh 192.168.1.181
```

I ran this command on each of the four SD cards, changing the IP address from 192.168.1.181 to 192.168.1.184.

Connecting Through SSH

Insert the flashed and configured microSD cards into each board and turn the power supply on. Be patient! Give the small devices time to boot. It can take several minutes the first time you boot them.

An Orange Pi cluster running Armbian

Use the ping command to check whether the devices are ready and connected to the network:

```shell
ping 192.168.1.181
```

Once they respond, connect to the mini-computers through SSH using the root user and the IP address that you configured. For example:

```shell
ssh root@192.168.1.181
```

The default password is 1234. You'll be presented with a wizard-like tool to complete the installation. Follow the steps to finish the configuration and repeat the process for each board.

Installing Ansible

Imagine you want to update the operating system on each machine. You'd have to log into a machine, run the update command, and end the remote session. Then repeat for each machine in the cluster. A tedious job even if you have only 4 nodes.
Ansible is an automation tool that allows you to run a command on multiple machines using a single call. You can also create a playbook, a file that contains commands to be executed on a set of machines defined in an inventory. Install Ansible on your working computer and generate a configuration file:

```shell
sudo su
ansible-config init --disabled -t all > /etc/ansible/ansible.cfg
exit
```

In the /etc/ansible/ansible.cfg file, set the following properties (enable them by removing the semicolon):

```
host_key_checking=False
become_allow_same_user=True
ask_pass=True
```

This will make the whole process easier. Never do this in a production environment! You also need an inventory. Edit the /etc/ansible/hosts file and add the Orange Pi nodes as follows:

```
##############################################################################
# 4-node Orange Pi Zero 2 cluster
##############################################################################

[opiesz]
192.168.1.181 ansible_user=orangepi hostname=opiz01
192.168.1.182 ansible_user=orangepi hostname=opiz02
192.168.1.183 ansible_user=orangepi hostname=opiz03
192.168.1.184 ansible_user=orangepi hostname=opiz04

[opiesz_manager]
opiz01.local ansible_user=orangepi

[opiesz_workers]
opiz[02:04].local ansible_user=orangepi
```

In the ansible_user variable, specify the username that you created during the installation of Armbian. Also, change the IP addresses if you used something different.

Setting Up a Cluster With Ansible Playbooks

A key feature of a computer cluster is that the nodes should be logically interconnected in some way. Docker Swarm is a container orchestration tool that will convert your arrangement of Orange Pi devices into a real cluster. You can later deploy any kind of server software, and Docker Swarm will automatically pick one of the machines to host it. To make the process easier, I have created a set of Ansible playbooks to further configure the boards, update the packages, reboot or power off the machines, install Docker, set up Docker Swarm, and even install a MariaDB database with replication and a database proxy. Clone or download this GitHub repository:

```shell
git clone https://github.com/alejandro-du/orange-pi-zero-cluster-ansible-playbooks.git
```

Let's start by upgrading the Linux packages on all the boards:

```shell
ansible-playbook upgrade.yml --ask-become-pass
```

Now configure the nodes to have an easy-to-remember hostname with the help of Avahi, and configure the LED activity (the red LED activates on SD card activity):

```shell
ansible-playbook configure-hosts.yml --ask-become-pass
```

Reboot all the boards:

```shell
ansible-playbook reboot.yml --ask-become-pass
```

Install Docker:

```shell
ansible-playbook docker.yml --ask-become-pass
```

Set up Docker Swarm:

```shell
ansible-playbook docker-swarm.yml --ask-become-pass
```

Done! You have an Orange Pi cluster ready for fun!

Deploying MariaDB on Docker Swarm

I have to warn you here: I don't recommend running a database on container orchestration software (that's Docker Swarm, Kubernetes, and others) unless you are willing to put a lot of effort into it. This article is a lab, a learning exercise. Don't do this in production! Now let's get back to the fun... Run the following to deploy one MariaDB primary server, two MariaDB replica servers, and one MaxScale proxy:

```shell
ansible-playbook mariadb-stack.yml --ask-become-pass
```

The first time you do this, it will take some time. Be patient.
SSH into the manager node:

```shell
ssh orangepi@opiz01.local
```

Inspect the nodes in the Docker Swarm cluster:

```shell
docker node ls
```

Inspect the MariaDB stack:

```shell
docker stack ps mariadb
```

A cooler way to inspect the containers in the cluster is by using the Docker Swarm Visualizer. Deploy it as follows:

```shell
docker service create --name=viz --publish=9000:8080 --constraint=node.role==manager --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock alexellis2/visualizer-arm:latest
```

On your working computer, open a web browser and go to this URL. You should see all the nodes in the cluster and the deployed containers.

Docker Swarm Visualizer showing MariaDB deployed

MaxScale is an intelligent database proxy with tons of features. For now, let's see how to connect to the MariaDB cluster through this proxy. Use a tool like DBeaver, DbGate, or even a database extension for your favorite IDE. Create a new database connection using the following connection details:

- Host: opiz01.local
- Port: 4000
- Username: user
- Password: password

Create a new table:

```sql
USE demo;

CREATE TABLE messages(
    id INT PRIMARY KEY AUTO_INCREMENT,
    content TEXT NOT NULL
);
```

Insert some data:

```sql
INSERT INTO messages(content)
VALUES ("It works!"), ("Hello, MariaDB"), ("Hello, Orange Pi");
```

When you execute this command, MaxScale sends it to the primary server. Now read the data:

```sql
SELECT * FROM messages;
```

When you execute this command, MaxScale sends it to one of the replicas. This division of reads and writes is called read-write splitting.

The MaxScale UI showing a MariaDB cluster with replication and read-write splitting

You can also access the MaxScale UI. Use the following credentials:

- Username: admin
- Password: mariadb

Watch the following video if you want to learn more about MaxScale and its features. You won't regret it!
Developing and releasing new software versions is an ongoing process that demands careful attention to detail. The ability to monitor and analyze the entire process is critical for identifying any potential issues and implementing effective corrective measures. This is where the concept of continuous integration becomes relevant. By adopting a continuous integration approach, software development teams can carefully monitor each stage of the development process and conduct an in-depth analysis of the outcomes. This facilitates the early detection and diagnosis of potential issues, enabling developers to make necessary adjustments and improve the overall development process. In other words, continuous integration provides a systematic way of identifying problems and continuously enhancing software quality, ultimately leading to a better end product.

The focus of this post is on exploring the benefits of continuous integration in software development. Specifically, we will delve into the practical aspects of implementing continuous integration using Jenkins, a popular automation tool, and share valuable insights on how this approach can help optimize and streamline your software development process. By the end of this post, you will have a better understanding of how continuous integration can improve your workflow and help you build better software more efficiently.

From the Development Team's Perspective, What Initiates Continuous Integration?

With continuous integration, the development team initiates the process by pushing code changes to the repository, which triggers an automated pipeline to build, test, and deploy the updated software version. This streamlines the development cycle, leading to faster feedback and higher-quality software. A structured workflow aims to establish a standardized order of operations for developers, ensuring that subsequent versions of the software are built according to the software development life cycle defined by management. Here are some primary benefits of continuous integration:

- Version control: With continuous integration, developers can easily track production versions and compare the performance of different versions during development. In addition, the ability to roll back to a previous version is also available, should any production issues arise.
- Quality assurance: Developers can test their versions in a staging environment, demonstrating how the new version performs in an environment similar to production. Instead of running the version on their local machine, which may not be comparable to the real environment, developers can define a set of tests, including unit tests and integration tests, among others, that will take the new version through a predefined workflow. This testing process serves as their signature, ensuring the new version is safe to deploy in a production environment.
- Scheduled triggering: Developers no longer need to manually trigger their pipeline or define a new pipeline for each new project. As a DevOps team, it is our responsibility to create a robust system that attaches to each project its own pipeline. Whether it is a common pipeline with slight changes to match the project or the same pipeline, developers can focus on writing code while continuous integration takes care of the rest. Scheduling an automatic trigger (for example, every morning or evening) ensures that the current code in GitHub is always ready for release.
Jenkins in the Era of Continuous Integration

To establish the desired pipeline workflow, we will deploy Jenkins and design a comprehensive pipeline that emphasizes version control, automated testing, and triggers.

Prerequisite

A virtual machine with a Docker engine.

Containerizing Jenkins

To simplify the deployment of our CI/CD pipelines, we will deploy Jenkins in a Docker container. Deploy Jenkins:

```shell
docker run -d \
  --name jenkins -p 8080:8080 -u root -p 50000:50000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  naturalett/jenkins:2.387-jdk11-hello-world
```

Validate the Jenkins container:

```shell
docker ps | grep -i jenkins
```

Retrieve the Jenkins initial password:

```shell
docker exec jenkins bash -c -- 'cat /var/jenkins_home/secrets/initialAdminPassword'
```

Connect to Jenkins on the localhost (http://localhost:8080/).

Building a Continuous Integration Pipeline

I chose to utilize Groovy in Jenkins pipelines due to its numerous benefits:

- Groovy is a scripting language that is straightforward to learn and utilize.
- Groovy offers features that enable developers to write code that is concise, readable, and maintainable.
- Groovy's syntax is similar to Java, making it easier for Java developers to adopt.
- Groovy has excellent support for working with data formats commonly used in software development.
- Groovy provides an efficient and effective way to build robust and flexible CI/CD pipelines in Jenkins.

The Four Phases of Our Pipeline

Phase 1: The Agent

To ensure that our code is built with no incompatible dependencies, each pipeline requires a virtual environment. In the following phase, we create an agent (virtual environment) in a Docker container. As Jenkins is also running in a Docker container, we'll mount the Docker socket to enable agent execution.

```groovy
pipeline {
    agent {
        docker {
            image 'docker:19.03.12'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }
    ...
}
```

Phase 2: The History of Versions

We recognize the importance of versioning in software development, which allows developers to monitor code changes and evaluate software performance to make informed decisions about rolling back to a previous version or releasing a new one. In the subsequent phase, we generate a Docker image from our code and assign it a tag based on our predetermined set of definitions, for example: Date — Jenkins Build Number — Commit Hash.

```groovy
pipeline {
    agent { ... }
    stages {
        stage('Build') {
            steps {
                script {
                    def currentDate = new java.text.SimpleDateFormat("MM-dd-yyyy").format(new Date())
                    def shortCommit = sh(returnStdout: true, script: "git log -n 1 --pretty=format:'%h'").trim()
                    customImage = docker.build("naturalett/hello-world:${currentDate}-${env.BUILD_ID}-${shortCommit}")
                }
            }
        }
    }
}
```

Upon completion of the previous phase, a Docker image of our code has been successfully created and is now available in our local environment:

```shell
docker images | grep -i hello-world
```

Phase 3: The Test

Testing is a critical step in ensuring that a new release version meets all functional and requirements tests. In the following stage, we execute tests against the Docker image that was generated in the previous stage and contains the potential next release.

```groovy
pipeline {
    agent { ... }
    stages {
        stage('Test') {
            steps {
                script {
                    customImage.inside {
                        sh """#!/bin/bash
                            cd /app
                            pytest test_*.py -v --junitxml='test-results.xml'"""
                    }
                }
            }
        }
    }
}
```
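For context, the pytest command above assumes the image ships test files matching test_*.py. A hypothetical minimal example of such a test (the app module and its greet function are assumptions for illustration):

```python
# test_app.py -- collected by `pytest test_*.py` inside the container
from app import greet  # hypothetical module baked into the Docker image


def test_greet():
    # The Test stage fails, and the pipeline stops, if this assertion fails.
    assert greet() == "Hello, World!"
```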
Phase 4: The Scheduling Trigger

Automating the pipeline trigger is crucial in allowing developers to concentrate on writing code while ensuring the stability and readiness of the next release. We accomplish this by setting up a morning schedule that automatically triggers the pipeline as the development team begins their workday.

```groovy
pipeline {
    agent { ... }
    triggers {
        // https://crontab.guru
        cron '00 7 * * *'
    }
    stages { ... }
}
```

An End-to-End Pipeline of the Process

The pipeline execution process has been made simple by incorporating a pre-defined pipeline into Jenkins. You can get started by initiating the "my-first-pipeline" Jenkins job.

- The Agent stage creates the virtual environment used for the pipeline.
- The Trigger stage is responsible for automatic scheduling of the pipeline.
- The Clone stage is responsible for cloning the project repository.
- The Build stage creates a Docker image for the project. (To access the latest commit and other Git features, we install the Git package.)
- The Test stage performs tests on our Docker image.

```groovy
pipeline {
    agent {
        docker {
            image 'docker:19.03.12'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }
    triggers {
        // https://crontab.guru
        cron '00 7 * * *'
    }
    stages {
        stage('Clone') {
            steps {
                git branch: 'main', url: 'https://github.com/naturalett/hello-world.git'
            }
        }
        stage('Build') {
            steps {
                script {
                    sh 'apk add git'
                    def currentDate = new java.text.SimpleDateFormat("MM-dd-yyyy").format(new Date())
                    def shortCommit = sh(returnStdout: true, script: "git log -n 1 --pretty=format:'%h'").trim()
                    customImage = docker.build("naturalett/hello-world:${currentDate}-${env.BUILD_ID}-${shortCommit}")
                }
            }
        }
        stage('Test') {
            steps {
                script {
                    customImage.inside {
                        sh """#!/bin/bash
                            cd /app
                            pytest test_*.py -v --junitxml='test-results.xml'"""
                    }
                }
            }
        }
    }
}
```

Summary

We have gained a deeper understanding of how continuous integration (CI) fits into our daily work and have obtained practical experience with essential pipeline workflows.
Apache Kafka and Apache Flink are increasingly joining forces to build innovative real-time stream processing applications. This blog post explores the benefits of combining both open-source frameworks, shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

The Tremendous Adoption of Apache Kafka and Apache Flink

Apache Kafka became the de facto standard for data streaming. The core of Kafka is messaging at any scale in combination with a distributed storage (= commit log) for reliable durability, decoupling of applications, and replayability of historical data. Kafka also includes a stream processing engine with Kafka Streams. And KSQL is another successful Kafka-native streaming SQL engine built on top of Kafka Streams. Both are fantastic tools.

In parallel, Apache Flink became a very successful stream processing engine. The first prominent Kafka + Flink case study I remember is the fraud detection use case of ING Bank. The first publications came up in 2017, i.e., over five years ago: "StreamING Machine Learning Models: How ING Adds Fraud Detection Models at Runtime with Apache Kafka and Apache Flink." This is just one of many Kafka fraud detection case studies. One of the last case studies I blogged about goes in the same direction: "Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink." The adoption of Kafka is already outstanding, and Flink gets into enterprises more and more, very often in combination with Kafka.

This article is no introduction to Apache Kafka or Apache Flink. Instead, I explore why these two technologies are a perfect match for many use cases and when other Kafka-native tools are the appropriate choice instead of Flink.

Top Reasons Apache Flink Is a Perfect Complementary Technology for Kafka

Stream processing is a paradigm that continuously correlates events of one or more data sources. Data is processed in motion, in contrast to traditional processing at rest with a database and request-response API (e.g., a web service or a SQL query). Stream processing is either stateless (e.g., filter or transform a single message) or stateful (e.g., an aggregation or sliding window). State management in particular is very challenging in a distributed stream processing application.

A vital advantage of the Apache Flink engine is its efficiency in stateful applications. Flink has expressive APIs, advanced operators, and low-level control. But Flink is also scalable in stateful applications, even for relatively complex streaming JOIN queries. Flink's scalable and flexible engine is fundamental to providing a tremendous stream processing framework for big data workloads. But there is more. The following aspects are my favorite features and design principles of Apache Flink:

- Unified streaming and batch APIs
- Connectivity to one or multiple Kafka clusters
- Transactions across Kafka and Flink
- Complex event processing
- Standard SQL support
- Machine learning with Kafka, Flink, and Python

But keep in mind that every design approach has pros and cons: while these aspects are mostly advantages, some can also be a drawback in certain scenarios.

Unified Streaming and Batch APIs

Apache Flink's DataStream API unifies batch and streaming APIs. It supports different runtime execution modes for stream processing and batch processing, from which you can choose the right one for your use case and the characteristics of your job.
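The same holds for the Table API, where the execution mode can be pinned explicitly when the environment is created. Here is a minimal PyFlink sketch (assuming PyFlink is installed); the automatic selection described in the next paragraph makes even this optional:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# The same Table API / SQL code can run in either mode;
# only the environment settings differ.
streaming_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
batch_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
```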
In the case of the SQL/Table API, the switch happens automatically based on the characteristics of the sources: if all sources are bounded, the job runs in batch execution mode; at least one unbounded source means streaming execution mode. The unification of streaming and batch brings a lot of advantages:

- Reuse of logic/code for real-time and historical processing
- Consistent semantics across stream and batch processing
- A single system to operate
- Applications mixing historical and real-time data processing

This sounds similar to Apache Spark, but there is a significant difference: contrary to Spark, the foundation of Flink is data streaming, not batch processing. Hence, streaming is the default execution runtime mode in Apache Flink. Continuous stateless or stateful processing enables real-time streaming analytics using an unbounded stream of events. Batch execution is more efficient for bounded jobs (i.e., a bounded subset of a stream) for which you have a known fixed input and which do not run continuously. This executes jobs in a way that is more reminiscent of batch processing frameworks, such as MapReduce in the Hadoop and Spark ecosystems.

Apache Flink makes moving from a Lambda to a Kappa enterprise architecture easier. The foundation of the architecture is real-time, with Kafka as its heart. But batch processing is still possible out-of-the-box with Kafka and Flink using consistent semantics. Though, this combination will likely not (try to) replace traditional ETL batch tools, e.g., for a one-time lift-and-shift migration of large workloads.

Connectivity to One or Multiple Kafka Clusters

Apache Flink is a separate infrastructure from the Kafka cluster. This has various pros and cons. First, I often emphasize the vast benefit of Kafka-native applications: you only need to operate, scale, and support one infrastructure for end-to-end data processing. A second infrastructure adds additional complexity, cost, and risk. However, imagine a cloud vendor taking over that burden, so you consume the end-to-end pipeline as a single cloud service. With that in mind, let's look at a few benefits of separate clusters for the data hub (Kafka) and the stream processing engine (Flink):

- Focus on data processing in a separate infrastructure with dedicated APIs and features, independent of the data streaming platform.
- More efficient streaming pipelines before hitting the Kafka topics again; the data exchange happens directly between the Flink workers.
- Data processing across different Kafka topics of independent Kafka clusters of different business units (illustrated in the sketch below).
- If it makes sense from a technical and organizational perspective, you can connect directly to non-Kafka sources and sinks. But be careful: this can quickly become an anti-pattern in the enterprise architecture and create complex and unmanageable "spaghetti integrations."
- Implement new fail-over strategies for applications.

I emphasize that Flink is usually NOT the recommended choice for implementing your aggregation, migration, or hybrid integration scenario. Multiple Kafka clusters for hybrid and global architectures are the norm, not an exception. Flink does not change these architectures. Kafka-native replication tools like MirrorMaker 2 or Confluent Cluster Linking are still the right choice for disaster recovery. It is still easier to do such a scenario with just one technology. Tools like Cluster Linking solve challenges like offset management out-of-the-box.
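As promised above, here is a hedged sketch of what consuming from one Kafka cluster and producing to another can look like with Flink SQL through PyFlink. The cluster addresses, topic names, and schema are assumptions for illustration, and the Kafka SQL connector JAR must be on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table on a (hypothetical) first Kafka cluster.
t_env.execute_sql("""
    CREATE TABLE orders_in (order_id STRING, amount DOUBLE)
    WITH ('connector' = 'kafka',
          'properties.bootstrap.servers' = 'cluster-a:9092',
          'topic' = 'orders',
          'value.format' = 'json',
          'scan.startup.mode' = 'earliest-offset')
""")

# Sink table on a (hypothetical) second Kafka cluster.
t_env.execute_sql("""
    CREATE TABLE large_orders_out (order_id STRING, amount DOUBLE)
    WITH ('connector' = 'kafka',
          'properties.bootstrap.servers' = 'cluster-b:9092',
          'topic' = 'large_orders',
          'value.format' = 'json')
""")

# A continuous pipeline bridging the two clusters.
t_env.execute_sql("INSERT INTO large_orders_out SELECT * FROM orders_in WHERE amount > 100")
```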
Transactions Across Kafka and Flink

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly, and the SLAs are very different, too. Many people think that data streaming is not built for transactions and should only be used for big data analytics. However, Apache Kafka and Apache Flink are deployed in many resilient, mission-critical architectures. The concept of exactly-once semantics (EOS) allows stream processing applications to process data through Kafka without loss or duplication. This ensures that computed results are always accurate.

Transactions are possible across Kafka and Flink. The feature is mature and battle-tested in production. Operating separate clusters is still challenging for transactional workloads. However, a cloud service can take over this risk and burden. Many companies already use EOS in production with Kafka Streams. But EOS can even be used if you combine Kafka and Flink. That is a massive benefit if you choose Flink for transactional workloads. So, to be clear: EOS is not a differentiator of Flink (vs. Kafka Streams), but it is an excellent option to use EOS across Kafka and Flink, too.

Complex Event Processing With FlinkCEP

The goal of complex event processing (CEP) is to identify meaningful events in real-time situations and respond to them as quickly as possible. CEP usually does not send continuous events to other systems but detects when something significant occurs. A common use case for CEP is handling late-arriving events or the non-occurrence of events.

The big difference between CEP and event stream processing (ESP) is that CEP generates new events to trigger actions based on situations it detects across multiple event streams with events of different types (situations that build up over time and space). ESP detects patterns over event streams with homogenous events (i.e., patterns over time). Pattern matching is a technique to implement either approach, but the features look different.

FlinkCEP is an add-on for Flink for complex event processing. The powerful pattern API of FlinkCEP allows you to define complex pattern sequences you want to extract from your input stream. After specifying the pattern sequence, you apply it to the input stream to detect potential matches. This is also possible with SQL via the MATCH_RECOGNIZE clause.

Standard SQL Support

Structured Query Language (SQL) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). However, it is so predominant that other technologies, like non-relational databases (NoSQL) and streaming platforms, adopt it, too. SQL became a standard of the American National Standards Institute (ANSI) in 1986 and the International Organization for Standardization (ISO) in 1987. Hence, if a tool supports ANSI SQL, it ensures that any third-party tool can easily integrate using standard SQL queries (at least in theory).

Apache Flink supports ANSI SQL, including the Data Definition Language (DDL), Data Manipulation Language (DML), and Query Language. Flink's SQL support is based on Apache Calcite, which implements the SQL standard. This is great because many personas, including developers, architects, and business analysts, already use SQL in their daily job.
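For a feel of what the MATCH_RECOGNIZE clause mentioned above can look like, here is a hedged sketch submitted through PyFlink; the events table and its columns (user_id, action, and the event-time attribute ts) are assumptions for illustration:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Detect three consecutive failed logins per user in a hypothetical
# 'events' table (user_id, action, ts), where ts is the event-time attribute.
t_env.execute_sql("""
    SELECT * FROM events MATCH_RECOGNIZE (
        PARTITION BY user_id
        ORDER BY ts
        MEASURES FIRST(A.ts) AS first_failure, LAST(A.ts) AS last_failure
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A{3})
        DEFINE A AS A.action = 'login_failed'
    )
""")
```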
Through that REST API, user applications (e.g., a Java, Python, or shell program, or Postman) can submit queries, cancel jobs, retrieve results, etc. This enables integration of Flink SQL with traditional business intelligence tools like Tableau, Microsoft Power BI, or Qlik. However, to be clear, ANSI SQL was not built for stream processing. Incorporating streaming SQL functionality into the official SQL standard is still in the works. The Streaming SQL working group includes database vendors like Microsoft, Oracle, and IBM, cloud vendors like Google and Alibaba, and data streaming vendors like Confluent. More details: "The History and Future of SQL: Databases Meet Stream Processing". Having said this, Flink already supports continuous sliding windows and various streaming joins via ANSI SQL. Some operations require additional non-standard SQL keywords, but sliding windows and streaming joins are, in general, possible.

Machine Learning with Kafka, Flink, and Python

In conjunction with data streaming, machine learning overcomes the impedance mismatch of reliably bringing analytic models into production for real-time scoring at any scale. I explored ML deployments within Kafka applications in various blog posts, e.g., embedded models in Kafka Streams applications or using a machine learning model server with streaming capabilities like Seldon. PyFlink is a Python API for Apache Flink that allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, machine learning (ML) pipelines, and ETL processes. If you're already familiar with Python and libraries such as Pandas, PyFlink makes it simpler to leverage the full capabilities of the Flink ecosystem (see the short PyFlink sketch later in this article). PyFlink is the missing piece for an ML-powered data streaming infrastructure, as almost every data engineer uses Python. The combination of Tiered Storage in Kafka and data streaming with Flink in Python is excellent for model training without the need for a separate data lake.

When To Use Kafka Streams Instead of Apache Flink

Don't underestimate the power and use cases of Kafka-native stream processing with Kafka Streams. The adoption rate is massive, as Kafka Streams is easy to use, and it is part of Apache Kafka. To be clear: Kafka Streams is already included if you download Kafka from the Apache website.

Kafka Streams Is a Library, Apache Flink Is a Cluster

The most significant difference between Kafka Streams and Apache Flink is that Kafka Streams is a Java library, while Flink is a separate cluster infrastructure. Developers can deploy the Flink infrastructure in session mode (e.g., for many small, homogeneous workloads like SQL queries) or in application mode (e.g., for fewer but bigger, heterogeneous data processing tasks, such as isolated applications running in a Kubernetes cluster). No matter your deployment option, you still need to operate a complex cluster infrastructure for Flink (including separate metadata management on a ZooKeeper cluster or an etcd cluster in a Kubernetes environment). TL;DR: Apache Flink is a fantastic stream processing framework and a top-five Apache open-source project, but it is also complex to deploy and difficult to manage.
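Before turning to the benefits of the lightweight Kafka Streams library, here is the PyFlink sketch promised earlier: a tiny, self-contained Table API job. The sensor readings are purely illustrative; a production pipeline would read from Kafka instead of an in-memory collection:

Python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

# Unified entry point for batch and streaming Table API programs
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Illustrative bounded input; real jobs would register a Kafka source via DDL
readings = t_env.from_elements(
    [("sensor-1", 20.5), ("sensor-2", 30.1), ("sensor-1", 25.0)],
    ["sensor", "temperature"],
)

# A Pandas-like aggregation expressed with the Table API
result = readings.group_by(col("sensor")).select(
    col("sensor"), col("temperature").avg.alias("avg_temp")
)

result.execute().print()

The same logic could be expressed in SQL or the DataStream API; which feels more natural is largely a matter of team background.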
Benefits of Using the Lightweight Library of Kafka Streams

Kafka Streams is a single Java library. This brings a few benefits:

Kafka-native integration supports critical SLAs and low latency for end-to-end data pipelines and applications with a single cluster infrastructure, instead of operating separate messaging and processing engines with Kafka and Flink. Kafka Streams apps still run in their own VMs or Kubernetes containers, but high availability and persistence are guaranteed via Kafka topics.
Very lightweight, with no other dependencies (Flink needs S3 or similar storage as the state backend).
Easy integration into testing/CI/DevOps pipelines.
Embedded stream processing in any existing JVM application, like a lightweight Spring Boot app or a legacy monolith built with old Java EE technologies like EJB.
Interactive Queries allow leveraging the state of your application from outside your application; the Kafka Streams API makes your applications queryable. (Flink's similar feature, "queryable state," is approaching the end of its life due to a lack of maintainers.)

Kafka Streams is well-known for building independent, decoupled, lightweight microservices. This differs from submitting a processing job to the Flink (or Spark) cluster: each data product team controls its own destiny (e.g., it does not depend on a central Flink team for upgrades or get forced to upgrade). Flink's application mode enables a similar deployment style for microservices.

But: Kafka Streams and Apache Flink Live in Different Parts of a Company

Today, Kafka Streams and Flink are usually used for different applications. While Flink provides an application mode to build microservices, most people use Kafka Streams for this today. Interactive queries are available in Kafka Streams and Flink, but the feature was deprecated in Flink because there is not much demand for it in the community. These two examples show that there is no clear winner: sometimes Flink is the better choice, and sometimes Kafka Streams makes more sense. "In summary, while there certainly is an overlap between the Streams API in Kafka and Flink, they live in different parts of a company, largely due to differences in their architecture and thus we see them as complementary systems." That quote comes from a "Kafka Streams vs. Flink" comparison article written in 2016 (!) by Stephan Ewen, former CTO of Data Artisans, and Neha Narkhede, former CTO of Confluent. While some details have changed over time, the old blog post is still pretty accurate today and a good read for a more technical audience. The domain-specific language (DSL) of Kafka Streams differs from Flink's but is also very similar. How can both be true? It depends on whom you ask. This (legitimate) subject for debate often divides the Kafka Streams and Flink communities. Kafka Streams has Stream and Table APIs; Flink has DataStream, Table, and SQL APIs. I guess 95% of use cases can be built with both technologies. APIs, infrastructure, experience, history, and many other factors are relevant when choosing the proper stream processing framework. Some architectural aspects are very different in Kafka Streams and Flink; these need to be understood and can be a pro or a con for your use case. For instance, Flink's checkpointing has the advantage of providing a consistent snapshot, but the disadvantage is that every local error stops the whole job, and everything has to be rolled back to the last checkpoint. Kafka Streams does not have this concept: local errors can be recovered locally by moving the corresponding tasks somewhere else, while the tasks and threads without errors continue normally. Another example is Kafka Streams' hot standby for high availability versus Flink's fault-tolerant checkpointing system.
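Since checkpointing is such a central architectural difference, here is a minimal PyFlink sketch of turning it on; the interval value is illustrative:

Python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Take a consistent snapshot of all operator state every 10 seconds.
# On failure, the whole job rolls back to the last completed checkpoint.
env.enable_checkpointing(10000)  # interval in milliseconds

In production, checkpoints are written to durable storage such as S3 or HDFS, which is exactly the extra dependency the Kafka Streams benefits list above calls out.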
Kafka + Flink = A Powerful Combination for Stream Processing

Apache Kafka is the de facto standard for data streaming, and it includes Kafka Streams, a widely used Java library for stream processing. Apache Flink is an independent and successful open-source project offering a stream processing engine for real-time and batch workloads. The combination of Kafka (including Kafka Streams) and Flink is already widespread in enterprises across all industries. Both Kafka Streams and Flink have benefits and tradeoffs for stream processing. The freedom to choose between these two leading open-source technologies, together with the tight integration of Kafka with Flink, enables any kind of stream processing use case, including hybrid, global, and multi-cloud deployments, mission-critical transactional workloads, and powerful analytics with embedded machine learning. As always, understand the different options and choose the right tool for your use case and requirements. What is your favorite for stream processing: Kafka Streams, Apache Flink, or another open-source or proprietary engine? In which use cases do you leverage stream processing? Let's connect on LinkedIn and discuss it!
What Is Terraform?

Terraform is an open-source "Infrastructure as Code" tool created by HashiCorp. A declarative coding tool, Terraform enables developers to use a high-level configuration language called HCL (HashiCorp Configuration Language) to describe the desired "end state" of the cloud or on-premises infrastructure for running an application. It then generates a plan for reaching that end state and executes the plan to provision the infrastructure (a short sketch of this workflow appears at the end of this article). Because Terraform uses a simple syntax, can provision infrastructure across multiple clouds and on-premises data centers, and can safely and efficiently re-provision infrastructure in response to configuration changes, it is currently one of the most popular infrastructure automation tools available. If your organization plans to deploy a hybrid cloud or multi-cloud environment, you'll likely want or need to get to know Terraform.

Why Infrastructure as Code (IaC)?

To better understand the advantages of Terraform, it helps to first understand the benefits of Infrastructure as Code (IaC). IaC allows developers to codify infrastructure in a way that makes provisioning automated, faster, and repeatable. It's a key component of Agile and DevOps practices such as version control, continuous integration, and continuous deployment. Infrastructure as Code can help with the following:

Improve speed: Automation is faster than manually navigating an interface when you need to deploy and/or connect resources.
Improve reliability: If your infrastructure is large, it becomes easy to misconfigure a resource or provision services in the wrong order. With IaC, resources are always provisioned and configured exactly as declared.
Prevent configuration drift: Configuration drift occurs when the configuration that provisioned your environment no longer matches the actual environment. (See "Immutable infrastructure" below.)
Support experimentation, testing, and optimization: Because Infrastructure as Code makes provisioning new infrastructure so much faster and easier, you can make and test experimental changes without investing lots of time and resources; if you like the results, you can quickly scale up the new infrastructure for production.

Why Terraform?

There are a few key reasons developers choose to use Terraform over other Infrastructure as Code tools:

Open source: Terraform is backed by large communities of contributors who build plugins for the platform. Regardless of which cloud provider you use, it's easy to find plugins, extensions, and professional support. This also means Terraform evolves quickly, with new benefits and improvements added consistently.
Platform agnostic: You can use it with any cloud services provider, whereas most other IaC tools are designed to work with a single cloud provider.
Immutable infrastructure: Most Infrastructure as Code tools create mutable infrastructure, meaning the infrastructure can change to accommodate changes such as a middleware upgrade or a new storage server. The danger with mutable infrastructure is configuration drift: as the changes pile up, the actual provisioning of different servers or other infrastructure elements "drifts" further from the original configuration, making bugs or performance issues difficult to diagnose and correct. Terraform provisions immutable infrastructure, which means that with each change to the environment, the current configuration is replaced with a new one that accounts for the change, and the infrastructure is reprovisioned.
Even better, previous configurations can be retained as versions to enable rollbacks if necessary or desired.

Terraform Modules

Terraform modules are small, reusable Terraform configurations for multiple infrastructure resources that are used together. Terraform modules are useful because they allow complex resources to be automated with reusable, configurable constructs. Writing even a very simple Terraform file results in a module. A module can call other modules (called child modules), which can make assembling a configuration faster and more concise. Modules can also be called multiple times, either within the same configuration or in separate configurations.

Terraform Providers

Terraform providers are plugins that implement resource types. Providers contain all the code needed to authenticate and connect to a service (typically from a public cloud provider) on behalf of the user. You can find providers for the cloud platforms and services you use, add them to your configuration, and then use their resources to provision infrastructure. Providers are available for nearly every major cloud provider, SaaS offering, and more, developed and/or supported by the Terraform community or individual organizations. Refer to the Terraform documentation for a detailed list.

Terraform vs. Kubernetes

Sometimes there is confusion between Terraform and Kubernetes and what they actually do. The truth is that they are not alternatives; they actually work effectively together. Kubernetes is an open-source container orchestration system that lets developers schedule deployments onto nodes in a compute cluster and actively manages containerized workloads to ensure that their state matches the users' intentions. Terraform, on the other hand, is an Infrastructure as Code tool with a much broader reach, letting developers automate complete infrastructure that spans multiple public and private clouds. Terraform can automate and manage Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), or even Software-as-a-Service (SaaS) level capabilities and build all these resources across all those providers in parallel. You can use Terraform to automate the provisioning of Kubernetes (particularly managed Kubernetes clusters on cloud platforms) and to automate the deployment of applications into a cluster.

Terraform vs. Ansible

Terraform and Ansible are both Infrastructure as Code tools, but there are a couple of significant differences between the two: While Terraform is a purely declarative tool (see above), Ansible combines both declarative and procedural configuration. In a procedural configuration, you specify the steps, or the precise manner, in which you want to provision infrastructure to reach the desired state. Procedural configuration is more work, but it provides more control. Also, Terraform is open source, while Ansible is developed and sold by Red Hat.

IBM and Terraform

IBM Cloud Schematics is IBM's free cloud automation tool based on Terraform. IBM Cloud Schematics allows you to fully manage your Terraform-based infrastructure automation so you can spend more time building applications and less time building environments.
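As promised above, here is a minimal sketch of Terraform's plan-and-apply workflow, scripted from Python. It assumes the Terraform CLI is installed and that an HCL configuration already exists in the current directory; the plan file name is illustrative:

Python
import subprocess

def run(cmd):
    # Run a Terraform CLI command and fail loudly on errors
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Download the providers/modules declared in the configuration
run(["terraform", "init"])

# Generate an execution plan for reaching the declared end state
run(["terraform", "plan", "-out=tfplan"])

# Execute exactly the saved plan to provision the infrastructure
run(["terraform", "apply", "tfplan"])

Managed offerings such as IBM Cloud Schematics or Terraform Cloud wrap this same workflow in a hosted service.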
This article explains how to install the following in video tutorials:

CMAK (Cluster Manager for Apache Kafka), also known as Kafka Manager
Apache Kafka on Windows or Windows 10/11
Java 18 on Windows 10/11 (JDK installation)
Java 19 on Windows 10/11 (JDK installation)
Java JDK 19 on an Amazon EC2 instance or Linux operating system
Apache Kafka on an Amazon EC2 instance or Linux operating system

How to Install and Use CMAK (Cluster Manager for Apache Kafka) or Kafka Manager

Learn how to install and use CMAK, a tool for managing Apache Kafka clusters that lets us view topics, partitions, offsets, and their assignments.

Apache Kafka on Windows or Windows 10/11

This video tutorial covers the following topics:

How to install Apache Kafka on the Windows operating system
Hands-on Kafka using the CLI
Installing and running Kafka on Windows
Kafka installation
How to download and set up Kafka on Windows

Java 18 on Windows

The following shows how to install Java 18 on the Windows operating system.

Java 19 on Windows

Here, learn how to download and install Java 19 on the Windows operating system.

Java JDK 19 on Amazon EC2 or Linux Operating Systems

This video tutorial explains how to install Java JDK 19 on Amazon EC2 or Linux operating systems.

Apache Kafka on an Amazon EC2 Instance or Linux Operating System

This video tutorial explains the following:

How to install Apache Kafka on an Amazon EC2 instance or Linux operating system
Hands-on Kafka using the CLI
Installing and running Kafka on the Linux operating system
Kafka installation
How to download and set up Kafka on the Linux operating system
Flask is a popular web framework for building web applications in Python. Docker is a platform that allows developers to package and deploy applications in containers. In this tutorial, we'll walk through the steps to build a Flask web application using Docker.

Prerequisites

Before we begin, you must have Docker installed on your machine. You can download the appropriate version for your operating system from the official Docker website. Additionally, you should have a basic understanding of Flask and Python.

Creating a Flask Application

The first step is to create a Flask application. We'll create a simple "Hello, World!" application for this tutorial. Create a new file called app.py and add the following code (note that app.run() binds to 0.0.0.0 so the server is reachable from outside the container; without the run block, python app.py would exit immediately and never serve requests):

Python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello, World!'

if __name__ == '__main__':
    # Listen on all interfaces so the container's published port works
    app.run(host='0.0.0.0', port=5000)

Also create a requirements.txt file next to app.py listing the only dependency:

flask

Save the files and navigate to their directory in a terminal.

Creating a Dockerfile

The next step is to create a Dockerfile. A Dockerfile is a script that describes the environment in which the application will run. We'll use the official Python 3.8 image as the base image for our Docker container. Create a new file called Dockerfile and add the following code:

Dockerfile
FROM python:3.8-slim-buster

# Set the working directory
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Run the application
CMD [ "python", "app.py" ]

Line by line:

FROM python:3.8-slim-buster: This sets the base image for our Docker container to the official (slim) Python 3.8 image.
WORKDIR /app: This sets the working directory inside the container to /app.
COPY requirements.txt .: This copies the requirements.txt file from our local machine to the /app directory inside the container.
RUN pip install --no-cache-dir -r requirements.txt: This installs the dependencies listed in requirements.txt.
COPY . .: This copies the entire local directory to the /app directory inside the container.
CMD [ "python", "app.py" ]: This sets the command that runs when the container starts to python app.py.

Save the Dockerfile and navigate to its directory in a terminal.

Building the Docker Image

The next step is to build a Docker image from the Dockerfile. Run the following command to build the image:

Shell
docker build -t my-flask-app .

This command builds an image named my-flask-app from the Dockerfile in the current directory. The . at the end of the command specifies that the build context is the current directory.

Starting the Docker Container

Now that we have a Docker image, we can start a container from it. Run the following command to start a new container from the my-flask-app image and map port 5000 on the host to port 5000 in the container:

Shell
docker run -p 5000:5000 my-flask-app

Testing the Flask Application

Finally, open your web browser and navigate to http://localhost:5000. You should see the "Hello, World!" message displayed in your browser, indicating that the Flask application is running inside the Docker container.
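Beyond the browser, you can also verify the endpoint programmatically. Here is a tiny, optional smoke test using only the Python standard library; it assumes the container from the previous step is still running with the 5000:5000 port mapping:

Python
import urllib.request

# Fetch the root route served by the containerized Flask app
with urllib.request.urlopen("http://localhost:5000/") as response:
    print(response.status, response.read().decode("utf-8"))
    # Expected output: 200 Hello, World!

If the request fails, run docker ps to confirm the container is running and the port mapping is correct.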
Customizing the Flask Application

You can customize the Flask application by modifying the app.py file and rebuilding the Docker image. For example, you could modify the hello function to return a different message:

Python
@app.route('/')
def hello():
    return 'Welcome to my Flask application!'

Save the app.py file and rebuild the Docker image using the docker build command from earlier. Once the image is built, start a new container using the docker run command from earlier. When you navigate to http://localhost:5000, you should see the updated message displayed in your browser.

Advantages

Docker simplifies the process of building and deploying Flask applications by providing a consistent, reproducible environment across different machines and operating systems.
Docker allows for easy management of dependencies and versions, as everything needed to run the application is contained within the Docker image.
Docker facilitates scaling and deployment of the Flask application, allowing for the quick and easy creation of new containers.

Disadvantages

Docker adds a layer of complexity to the development and deployment process, which may require additional time and effort to learn and configure.
Docker may not be necessary for small or simple Flask applications, as the benefits may not outweigh the additional overhead and configuration.
Docker images and containers can take up significant disk space, which may be a concern for applications with large dependencies or machines with limited storage capacity.

Conclusion

In this tutorial, we've walked through the steps to build a Flask web application using Docker. We've created a simple Flask application, written a Dockerfile to describe the environment in which the application will run, built a Docker image from the Dockerfile, started a Docker container from the image, and tested the Flask application inside the container. With Docker, you can easily package and deploy your Flask application in a consistent and reproducible manner, making it easier to manage and scale your application.
Bartłomiej Żyliński
Software Engineer,
SoftwareMill
Vishnu Vasudevan
Head of Product Engineering & Management,
Opsera
Abhishek Gupta
Principal Developer Advocate,
AWS
Yitaek Hwang
Software Engineer,
NYDIG