Elasticsearch & OpenSearch
Elasticsearch is a distributed, JSON-document search and analytics engine built on Apache Lucene, originally released by Shay Banon in 2010. OpenSearch is the open-source fork created in 2021 by AWS after Elastic relicensed Elasticsearch under the SSPL/Elastic License. OpenSearch is licensed under Apache 2.0; it was forked from Elasticsearch 7.10.2 and remains largely compatible with the 7.x APIs, although the two projects have diverged since the fork. Both engines are simultaneously a NoSQL document store, a distributed full-text search engine, and a real-time analytics platform.
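To make the three roles concrete, here is a minimal sketch of a single search request body that both retrieves documents by full-text match and computes an analytic breakdown. The `match` query, the `terms` aggregation, and the `_search` endpoint are part of the query DSL shared by both engines; the index name `products` and the fields `title` and `category` are hypothetical and chosen only for illustration. The sketch builds and prints the JSON, it does not contact a cluster.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX = "products"

# One request that combines full-text search with an aggregation.
search_body = {
    "size": 3,  # return the top 3 matching documents
    "query": {
        # Full-text match against an analyzed text field (assumed name).
        "match": {"title": "wireless headphones"}
    },
    "aggs": {
        # Count matching documents per category; this assumes `category`
        # is mapped as a keyword field.
        "by_category": {"terms": {"field": "category", "size": 5}}
    },
}

request_line = f"POST /{INDEX}/_search"
payload = json.dumps(search_body, indent=2)

print(request_line)
print(payload)
```

The same body can be sent with any HTTP client or an official client library; only the transport differs.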
Key Features:
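- Inverted Index. Lucene's posting lists map each term to the documents that contain it, so a query touches only matching documents instead of scanning the corpus. This keeps full-text search fast even on very large indexes.
- BM25 Relevance Scoring. The default similarity is BM25, a probabilistic ranking function; its k1 and b parameters can be tuned per field through similarity settings.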
- JSON Documents. Schema-flexible — each document is a JSON blob; mappings are inferred or declared.
- Aggregations. Bucketing, metrics, percentiles, histograms, and geo-aggregations: the analytical engine that powers Kibana / OpenSearch Dashboards.
- Distributed Sharding & Replication. Each index is split into primary shards, and each shard has a configurable number of replica copies on other nodes; the cluster rebalances shards automatically as nodes join or leave.
- Vector Search. Both engines added native dense-vector ANN (HNSW) for semantic search and RAG use cases.
- Ingest Pipelines. Server-side document transformations (parsing, enrichment, geo-IP) before indexing.
- SQL. Limited SQL surface (SELECT ... FROM index WHERE ...) on top of the underlying query DSL.
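The inverted index and BM25 bullets above can be illustrated with a toy, in-memory sketch. It is not how Lucene stores or scores data: the tokenizer is a naive lowercase split, the class name `TinyIndex` is hypothetical, and the scoring uses the textbook BM25 formula with the commonly cited defaults k1 = 1.2 and b = 0.75. Lucene's real implementation differs in details.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive analyzer: lowercase and split on whitespace.
    # Real analyzers also handle stemming, stop words, punctuation, etc.
    return text.lower().split()


class TinyIndex:
    """Toy inverted index with textbook BM25 scoring (illustrative only)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in tokenize(query):
            posting = self.postings.get(term)
            if not posting:
                continue  # term appears in no document
            df = len(posting)
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for doc_id, tf in posting.items():
                # Length normalization: long documents are penalized.
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avg_len)
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        # Highest score first.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = TinyIndex()
index.add(1, "distributed search engine")
index.add(2, "relational database engine")
index.add(3, "search search search analytics")

for doc_id, score in index.search("search"):
    print(doc_id, round(score, 3))
```

Only the posting list for `search` is visited, so document 2 is never scored. Document 3 ranks above document 1 because its higher term frequency outweighs the length penalty.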
Elasticsearch vs. OpenSearch:
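- Elasticsearch. Ahead on commercial features (machine-learning anomaly detection, graph exploration, security tier). SSPL / Elastic License, with AGPLv3 added as a further option in 2024.
- OpenSearch. Apache 2.0 licensed. Started by AWS and still heavily backed by it, with governance moved to the Linux Foundation in 2024; the engine behind Amazon OpenSearch Service. Increasingly feature-comparable.
- For new self-hosted deployments that do not need Elastic's commercial features, OpenSearch is now the lower-risk choice from a licensing standpoint.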
Use Cases:
- Application search. Product catalogs, documentation, knowledge bases.
- Log analytics. The ELK stack (Logstash / Beats + Elasticsearch + Kibana) and its OpenSearch equivalent (OpenSearch + OpenSearch Dashboards) are among the most widely deployed open observability platforms.
- Security analytics. SIEM workloads (high-volume log ingest with correlation rules).
- Hybrid search for RAG. BM25 and dense-vector ANN results are combined and reranked, a common production retrieval pattern for RAG.
- Geospatial search. Location-based discovery for marketplaces and mapping apps.
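The hybrid search item above leaves open how the two result lists are combined. One widely used option is reciprocal rank fusion (RRF), sketched below in plain Python. This is an assumption for illustration: the source does not say which fusion method is meant, the function name and document ids are hypothetical, and k = 60 is simply the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Documents ranked well by several retrievers
    accumulate the highest totals.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists: one from a BM25 query, one from a vector (ANN) query.
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores and vector similarities onto a common scale. Documents found by both retrievers (`a` and `c`) rise above those found by only one.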