Technical Terminologies for New Age Tech Entrepreneurs
They say the only constant is change, and the adage describes the startup technology world aptly. Technology grows exponentially and evolves before our eyes at an ever faster rate, often without our noticing.
If you are a tech entrepreneur working on your startup, you should at least be aware of the latest technologies being adopted by software developers around the world. As techies, when we start working on a startup idea, most of us want to get the first version (the minimum viable product) out quickly to check its market feasibility. We do this with the technical know-how from our previous experience. At this point, few of us think about long-term concerns like product scaling, response time, and performance, unless we have a solid technical background.
I have gone through this while building my previous startups. Based on my experience, I have come up with a list of new-age technologies that the best startups and organisations are using to make their products highly scalable and performant while ensuring an awesome user experience.
Just being aware of these terminologies and technologies will help any technical founder/developer make better decisions about what technology to choose when starting up.
Jenkins is a powerful application that allows continuous integration and continuous delivery of projects, regardless of the platform. It is free and open source, and can handle any kind of build or continuous integration job. You can integrate Jenkins with a number of testing and deployment technologies.
Kafka is a distributed publish-subscribe messaging system designed to be fast, scalable, and durable. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organisation. It can be elastically and transparently expanded without downtime. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
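To make the publish-subscribe idea concrete, here is a minimal in-memory sketch of an append-only topic log with per-group read offsets. It is a toy model only, not the Kafka client API: the `TinyLog` class, its methods, and the topic and group names are all hypothetical, and it ignores brokers, partitions, replication, and disk persistence.

```python
from collections import defaultdict


class TinyLog:
    """Toy append-only topic log; NOT the Kafka client API."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> ordered messages
        self._offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this consumer group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


log = TinyLog()
log.publish("orders", "order-1")
log.publish("orders", "order-2")

# Each consumer group keeps its own position, so both see every message.
billing_batch = log.poll("billing", "orders")
shipping_batch = log.poll("shipping", "orders")
billing_again = log.poll("billing", "orders")   # nothing new for billing
```

Because messages stay in the log after being read, a new consumer group can be added later and still replay everything from offset zero.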
- Chef:
Chef is both the name of a company and the name of a configuration management tool. Chef is used to streamline the task of configuring and maintaining a company's servers, and can integrate with cloud-based platforms such as Internap, Amazon EC2, Google Cloud Platform, OpenStack, SoftLayer, Microsoft Azure, and Rackspace to automatically provision and configure new machines.
Docker enables users to package any application in a lightweight, portable container so that installing a server-side Linux app becomes as easy as installing a mobile app from the command line. It packages an application with all of its dependencies into a single unit. Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, and system libraries that can be installed on a server.
ZooKeeper is a distributed coordination service for distributed systems. It provides centralised infrastructure and services that enable synchronisation across a cluster. ZooKeeper maintains common objects needed in large cluster environments. It is a centralised service for maintaining configuration information, naming, providing distributed synchronisation, and providing group services.
Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your data set and then perform data analysis queries. Regardless of the size of the data set, Amazon Redshift offers fast query performance.
- S3:
Amazon S3 (Simple Storage Service) is an online file storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces (REST, SOAP, and BitTorrent).
Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop consists of different parts:
- HDFS - Hadoop Distributed File System
- YARN - Yet Another Resource Negotiator (or Resource Manager)
- MapReduce - The batch processing Framework of Hadoop
- Hive:
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarisation, query, analysis and managing large datasets residing in distributed storage.
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. According to the project, Spark enables applications in Hadoop clusters to run up to 100 times faster than Hadoop MapReduce in memory and 10 times faster when running on disk.
HBase is a non-relational (NoSQL) database that runs on top of HDFS (Hadoop Distributed File System). It is most suited for real-time read/write access to large datasets. HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Scala is a general-purpose programming language. Scala has full support for functional programming and a very strong static-type system. Scala source code is intended to be compiled to Java bytecode, so that the resulting executable code runs on a Java virtual machine.
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.
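The classic illustration of this model is a word count. The sketch below runs on a single machine in plain Python, so it only shows the shape of the map, shuffle, and reduce steps, not Hadoop's distributed execution; the function names and the sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarise all values collected for one key."""
    return key, sum(values)


documents = ["big data big ideas", "data pipelines"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster, the map calls would run in parallel on the machines that hold each document, and the shuffle would move pairs across the network so that each reducer sees all values for its keys.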
Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity of ANSI SQL. Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments.
Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
Lambda Architecture is a generic, scalable, and fault-tolerant data processing architecture. It is an approach to building stream processing applications on top of MapReduce and Storm or similar systems.
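As commonly described, the architecture combines a batch layer that recomputes views from all historical data, a speed layer that covers recent events, and a serving layer that merges the two at query time. The following single-process sketch illustrates that split using page-view counts; the function names and event shape are assumptions for illustration, and a real deployment would run the layers on systems such as MapReduce and Storm.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full historical data."""
    return Counter(event["page"] for event in master_dataset)


def update_speed_view(speed_view, event):
    """Speed layer: incrementally count events the batch view has not seen yet."""
    speed_view[event["page"]] += 1


def query(page, batch_view, speed_view):
    """Serving layer: merge both views to answer a query."""
    return batch_view[page] + speed_view[page]


history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = build_batch_view(history)

speed_view = Counter()
update_speed_view(speed_view, {"page": "/home"})   # arrived after the batch run

home_views = query("/home", batch_view, speed_view)
```

When the next batch run absorbs the recent events, the speed view can be discarded and rebuilt from empty, which is what keeps errors in the real-time path from accumulating.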
- AMQP:
The Advanced Message Queuing Protocol (AMQP) is an open standard for passing business messages between applications or organisations. It connects systems, feeds business processes with the information they need, and reliably transmits onward the instructions that achieve their goals.
Memcached is a general-purpose distributed memory caching system. It is used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source must be read.
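A typical way to use such a cache is the cache-aside pattern: check the cache first and only go to the database on a miss. In this sketch a plain dictionary stands in for the Memcached server, and `load_user_from_db` is a hypothetical slow lookup, so it shows the access pattern rather than any real Memcached client API.

```python
cache = {}                    # stands in for a Memcached server (assumption)
stats = {"db_reads": 0}       # counts how often the slow source is hit


def load_user_from_db(user_id):
    """Hypothetical slow lookup against the database."""
    stats["db_reads"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    user = load_user_from_db(user_id)
    cache[key] = user         # store the result for later requests
    return user


first = get_user(42)    # miss: reads the database
second = get_user(42)   # hit: served from the cache
```

A production setup would also set an expiry time on each entry and invalidate or update the cached value when the underlying record changes.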
- ETL:
Extract, Transform and Load (ETL) refers to the data warehousing process of extracting data from source systems, transforming it into a consistent format, and loading it into the data warehouse.
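The three stages can be sketched with nothing but the Python standard library. Here a CSV string plays the source system and an in-memory SQLite database plays the warehouse; the column names, sample rows, and cleaning rules are invented for the example.

```python
import csv
import io
import sqlite3

RAW_CSV = "order_id,amount,currency\n1,10.50,usd\n2,,usd\n3,7.25,eur\n"


def extract(text):
    """Extract: read rows from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and normalise types."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
        for r in rows
        if r["amount"]
    ]


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT order_id, currency FROM orders ORDER BY order_id"
).fetchall()
```

Keeping the stages as separate functions makes it easy to test the transformation rules on their own, without a source system or a warehouse attached.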
- JMS:
The Java Message Service (JMS) API is a Java Message Oriented Middleware API for sending messages between two or more clients. It is a messaging standard that allows application components to create, send, receive, and read messages between different components of a distributed application.
HAProxy is a free, very fast, and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. Its most common use is to improve the performance and reliability of a server environment by distributing the workload across multiple servers (web, application, and database).
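The simplest way to distribute that workload is round-robin: hand each new request to the next server in turn, skipping any that are down. The class below is a toy illustration of that idea in Python, not HAProxy configuration or code; the class name and server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._ring = cycle(self._servers)

    def mark_down(self, server):
        """Record that a server failed its health check."""
        self._healthy.discard(server)

    def next_server(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
first_round = [balancer.next_server() for _ in range(3)]

balancer.mark_down("web2")
after_failure = [balancer.next_server() for _ in range(4)]
```

Real load balancers add active health checks, connection counting, and weighting on top of this, but the rotation at the core is the same.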
RabbitMQ is a complete and highly reliable enterprise messaging system based on the emerging AMQP standard. RabbitMQ is a message broker: its principal job is to accept and forward messages. It can be thought of as a post office: when we put mail in the post box, we are pretty sure that the postman will eventually deliver it to our recipient. In this metaphor, RabbitMQ is the post box, the post office, and the postman.
Apache Lucene is an extremely rich and powerful full-text search library written in Java. It is used to provide full-text indexing across both database objects and documents in various formats.
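Full-text search libraries are generally built around an inverted index, which maps each term to the documents that contain it. The sketch below shows that data structure in a few lines of Python; it is not Lucene's API, and it leaves out the tokenisation, stemming, and relevance scoring that a real library provides. The sample documents are made up.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Startups ship fast",
    2: "Fast search for startups",
    3: "Search engines index documents",
}
index = build_index(docs)
hits = search(index, "fast startups")
```

Looking up a term is a single dictionary access regardless of how many documents exist, which is why this structure scales so much better than scanning every document for every query.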
- Solr:
Solr is an open-source enterprise search platform, written in Java, built on top of Apache Lucene. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration and NoSQL features, while providing distributed search and index replication. Solr is designed for scalability and fault tolerance. Solr is the second-most popular enterprise search engine after ElasticSearch.
ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine, built on top of Apache Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. ElasticSearch is developed in Java and is released as open source under the terms of the Apache License.
Redis is an open source in-memory data structure store, used as database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, and geospatial indexes with radius queries. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
Apache Cassandra, a top-level Apache project, is a distributed database for managing large amounts of structured data across many servers, while providing highly available service and no single point of failure. It offers continuous availability, linear scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones.
Pingdom is a service that tracks the uptime, downtime, and performance of websites. Based in Sweden, Pingdom monitors websites from multiple locations globally so that it can distinguish genuine downtime from routing and access problems.
IBM Watson is a technology platform that uses natural language processing and machine learning to reveal insights from large amounts of unstructured data. Watson Analytics combines visualisation with data tagging, machine learning, and cloud storage.
Kubernetes is an open-source system, originally developed by Google, for automating the deployment, operation, and scaling of containerised applications in a clustered environment. It aims to provide better ways of managing related, distributed components across varied infrastructure. It schedules containers to run across a cluster of machines, deploying them individually or in tightly coupled groups called pods, and keeping resource needs in mind as it distributes the work.
Apache Mesos abstracts computational resources such as CPU, memory, and storage away from machines (physical or virtual), enabling distributed systems to be built and run effectively. It basically acts like an operating system for the datacenter, distributing work across multiple machines without your having to manage and monitor resources on those machines yourself.
Splunk enables searching, monitoring, and analysing machine-generated big data, via a web-style interface. Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualisations.
OpenShift is a cloud Platform-as-a-Service (PaaS) developed by Red Hat and built on Docker and Kubernetes. It lets developers quickly develop, deploy, and run applications in a cloud environment.
AWS Lambda is a compute service that runs code you upload on your behalf using AWS infrastructure. After you upload your code and create what is called a Lambda function, AWS Lambda takes care of provisioning and managing the servers that run the code.
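For a sense of what such a function looks like, here is a minimal sketch of a handler for Lambda's Python runtime, which calls a function taking an event and a context argument. The `name` field in the event and the shape of the returned dictionary are assumptions chosen for the example; the actual event format depends on what triggers the function.

```python
import json


def lambda_handler(event, context):
    """Entry point using the (event, context) signature of the Python runtime."""
    name = event.get("name", "world")   # "name" is a hypothetical event field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation for testing; on AWS the service supplies both arguments.
response = lambda_handler({"name": "founder"}, None)
```

Because the handler is an ordinary function, it can be exercised locally like this before any code is uploaded.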
- Gulp: Gulp is a fast and intuitive streaming build tool built on Node.js that helps you automate time-consuming tasks in your development workflow.
- NGINX:
NGINX (pronounced "engine x") is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption. It can also act as a load balancer and an HTTP cache.
Feel free to suggest more technologies that should be added to the list.
(Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the views of YourStory.)