FATA #1 / Big Data - Elastic Stack
[FATA] - From test automation to architecture article series
Elasticsearch is a distributed, near real-time search and analytics engine for all types of data.
Elasticsearch is also a distributed document store. Instead of storing information as rows of columnar data, it stores complex data structures that have been serialized as JSON documents.
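To make "serialized as JSON" concrete, here is a minimal sketch of what a document could look like before it is sent to Elasticsearch. The event and its field names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical log event; the field names are illustrative, not a required schema.
event = {
    "timestamp": "2024-01-15T10:32:00Z",
    "level": "ERROR",
    "service": {"name": "checkout", "version": "1.4.2"},  # nested object
    "tags": ["payment", "timeout"],                        # array of values
    "duration_ms": 5230,
}

# Serialize to JSON text, the form in which a document body is sent over the REST API.
body = json.dumps(event, sort_keys=True)
print(body)

# Round trip: nested objects and arrays survive serialization intact.
restored = json.loads(body)
```

Unlike a flat table row, the document keeps its nested object and its array as they are.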
Used for:
- Logging & Log analytics
- Complex search
- Security analysis
- Marketing & Operations
- Business analytics
Features:
- Distributed: runs on multiple nodes within a cluster and can scale to very large clusters (on the order of 1k nodes), so search performance can scale roughly linearly with the number of nodes.
- Highly available and fault-tolerant: multiple copies of data are stored within the cluster, because each index's shards can be replicated to other nodes.
- REST API: can be used for CRUD operations.
- Schema-less: documents can be indexed without explicitly providing a schema. Lookups are served from an inverted index.
- Near real-time operations: newly indexed documents typically become searchable within about a second.
- Complementary tooling and plugins: Kibana, Logstash, Beats.
- Easy application development: client libraries for Java, Python, PHP, JavaScript, Node.js, Ruby…
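The inverted index mentioned in the feature list can be sketched in a few lines. This is a toy model, assuming whitespace tokenization and AND semantics; it is not how Lucene stores its index on disk, and the function names are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "disk full on node one",
    2: "node two restarted",
    3: "disk replaced on node two",
}
index = build_inverted_index(docs)
print(sorted(search(index, "disk node")))  # [1, 3]
print(sorted(search(index, "node two")))   # [2, 3]
```

The lookup never scans the documents themselves: it only intersects the ID sets stored per term, which is why term search stays fast as the data grows.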
ELK Stack: Elasticsearch, Logstash, Kibana
- Elasticsearch is a search and analytics engine.
- Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch.
- Kibana lets users visualize data with charts and graphs in Elasticsearch.
- Beats are lightweight data shippers that send data from machines to Logstash (if you need transformation and parsing) or directly to Elasticsearch.
Cluster and nodes
Node: a single running instance of Elasticsearch that stores data.
Cluster: a collection of related nodes that share the same cluster.name attribute. Clusters are independent of each other. Cross-cluster search is possible, but it is not common.
Major components
- Indices: the largest unit of data in Elasticsearch. An index is a logical partition of documents and can be compared to a database in the world of relational databases.
- Documents: JSON objects that are stored within an Elasticsearch index and are considered the base unit of storage. In the world of relational databases, a document can be compared to a row in a table. Data in documents is defined with fields composed of keys and values.
Each document is also associated with metadata, the most important items being:
_index — The index where the document is stored
_id — The unique ID which identifies the document in the index
- Fields: the key-value pairs that make up a document.
- Mapping: defines the fields of the documents in an index, that is, each field's data type (such as keyword or integer) and how it should be indexed and stored in Elasticsearch.
- Shards: an index is split into one or more shards, each holding a part of the index's data, which is what lets an index scale across nodes. When you create an index you can define how many shards you want.
- Replicas: a fail-safe mechanism that keeps copies of your index's shards.
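The components above come together when an index is created. The sketch below builds a hypothetical index definition (the field names and the shard and replica counts are assumptions) and shows a toy version of how a document ID could be routed to a shard. Elasticsearch uses its own hash function for routing; CRC32 here is only a stand-in.

```python
import json
import zlib

# Hypothetical index definition: 3 primary shards, 1 replica of each,
# plus an explicit mapping for two illustrative fields.
index_body = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "level": {"type": "keyword"},
            "duration_ms": {"type": "integer"},
        }
    },
}

def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Toy routing: hash the document ID, then take it modulo the shard count.

    CRC32 is a stand-in; it is not the hash function Elasticsearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

settings = index_body["settings"]
primaries = settings["number_of_shards"]

# Each primary shard has `number_of_replicas` copies, so the cluster holds:
total_shard_copies = primaries * (1 + settings["number_of_replicas"])

for doc_id in ["a1", "b2", "c3", "d4"]:
    print(doc_id, "-> shard", pick_shard(doc_id, primaries))

print("total shard copies:", total_shard_copies)
print(json.dumps(index_body, indent=2))
```

Because routing depends on the shard count, the same ID always lands on the same shard only while the number of primary shards stays the same.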
Analysis and Analyzers
An analyzer contains three lower-level building blocks: character filters, a tokenizer, and token filters. They run in that order when text is analyzed.
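The three building blocks can be sketched as a small pipeline. This is a simplified imitation and not one of the built-in Elasticsearch analyzers: the stop-word list, the regular expressions, and the function names are all assumptions made for the example.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "on"}  # small illustrative stop list

def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    """Run character filter, tokenizer, then token filters, in that order."""
    text = strip_html(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<p>The Disk is FULL on node-1</p>"))
# ['disk', 'full', 'node', '1']
```

The order matters: the stop-word filter only matches because the lowercase filter runs before it.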
Manage Data in Elasticsearch
- cat indices
- cat plugins
- cat templates
- cat health
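The items above correspond to the `_cat` REST endpoints. As a rough sketch, the helper below only builds the request URLs and sends nothing. The base URL assumes a local node on the default port with no security, and `cat_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: local node, default port, no auth

def cat_url(resource, **params):
    """Build a _cat API URL; v=true asks for column headers in the text output."""
    query = {"v": "true", **params}
    return f"{BASE_URL}/_cat/{resource}?{urlencode(query)}"

for resource in ("indices", "plugins", "templates", "health"):
    print(cat_url(resource))

# The same endpoints can return JSON instead of a text table.
print(cat_url("indices", format="json"))
```

Each printed URL can be requested with any HTTP client, for example `curl`, against a running cluster.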
Analyze & Query your data
Bucket aggregations:
- Histogram: a multi-bucket, values-source-based aggregation that can be applied to numeric values or numeric range values extracted from the documents. Buckets have a fixed width (the interval).
- Terms: a multi-bucket, values-source-based aggregation where buckets are built dynamically, one per unique value.
- Range: a multi-bucket, values-source-based aggregation that lets the user define a set of ranges, each representing a bucket.
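To see what these three aggregations compute, here is a plain-Python imitation over a handful of hypothetical documents. It shows only the bucketing logic; it is not the Elasticsearch query DSL, and the bucket key format is simplified.

```python
from collections import Counter

# Hypothetical documents, for illustration only.
docs = [
    {"level": "INFO", "duration_ms": 120},
    {"level": "ERROR", "duration_ms": 5230},
    {"level": "INFO", "duration_ms": 80},
    {"level": "WARN", "duration_ms": 950},
    {"level": "INFO", "duration_ms": 1040},
]

def terms_agg(docs, field):
    """Terms: one bucket per unique value, with a document count."""
    return dict(Counter(d[field] for d in docs))

def histogram_agg(docs, field, interval):
    """Histogram: fixed-width buckets keyed by their lower bound.

    Empty buckets are simply omitted in this toy version.
    """
    counts = Counter((d[field] // interval) * interval for d in docs)
    return dict(sorted(counts.items()))

def range_agg(docs, field, ranges):
    """Range: user-defined buckets; 'from' is inclusive, 'to' is exclusive.

    None means the bucket is open on that side.
    """
    result = {}
    for lo, hi in ranges:
        key = f"{'*' if lo is None else lo}-{'*' if hi is None else hi}"
        result[key] = sum(
            1
            for d in docs
            if (lo is None or d[field] >= lo) and (hi is None or d[field] < hi)
        )
    return result

print(terms_agg(docs, "level"))
# {'INFO': 3, 'ERROR': 1, 'WARN': 1}
print(histogram_agg(docs, "duration_ms", 1000))
# {0: 3, 1000: 1, 5000: 1}
print(range_agg(docs, "duration_ms", [(None, 100), (100, 1000), (1000, None)]))
# {'*-100': 1, '100-1000': 2, '1000-*': 2}
```

Terms derives its buckets from the data, histogram from a fixed interval, and range from boundaries the user supplies.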
Metrics aggregations: Cardinality (an approximate count of distinct values) and Percentiles.
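As a small illustration of what these two metrics mean, the sketch below computes them exactly on a tiny in-memory list. Elasticsearch itself uses approximate algorithms for both so that they scale to large data sets; the sample values and the nearest-rank percentile definition used here are assumptions for the example.

```python
import math

def cardinality(values):
    """Exact count of distinct values (Elasticsearch approximates this)."""
    return len(set(values))

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical sample data.
users = ["ann", "bob", "ann", "cid", "bob"]
durations = [80, 120, 950, 1040, 5230]

print(cardinality(users))         # 3
print(percentile(durations, 50))  # 950
print(percentile(durations, 95))  # 5230
```

Unlike the bucket aggregations, each of these returns a single number (or one number per requested percentile) rather than a set of buckets.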
Top interview question references:
- https://www.guru99.com/elasticsearch-interview-questions.html
- https://facingissuesonit.com/elasticsearch-interview-questions-and-answers/
- https://logit.io/blog/post/the-top-50-elk-stack-and-elasticsearch-interview-questions