How to Index Data in Elasticsearch

Nov 6, 2025 - 10:42

Elasticsearch is a powerful, distributed search and analytics engine built on Apache Lucene. It enables real-time indexing, searching, and analyzing of large volumes of structured and unstructured data. At the heart of Elasticsearch's functionality lies the process of indexing data: the act of storing and organizing documents so they can be efficiently retrieved and queried. Whether you're logging application events, storing product catalogs, or analyzing user behavior, mastering how to index data in Elasticsearch is essential for building scalable, high-performance search applications.

Indexing is not merely about inserting data; it involves understanding document structure, mapping types, batch operations, error handling, and performance tuning. A poorly indexed dataset can lead to slow queries, high resource consumption, and inaccurate search results. Conversely, a well-indexed system delivers sub-second response times, supports complex aggregations, and scales seamlessly across clusters.

This comprehensive guide walks you through every aspect of indexing data in Elasticsearch, from basic document insertion to advanced optimization techniques. By the end, you'll have the knowledge to confidently index data in production environments, avoid common pitfalls, and leverage Elasticsearch's full potential.

Step-by-Step Guide

Prerequisites

Before you begin indexing data, ensure you have the following:

  • A running Elasticsearch cluster (version 7.x or 8.x recommended)
  • Access to the Elasticsearch REST API via HTTP (default port: 9200)
  • A tool to send HTTP requests (e.g., curl, Postman, Kibana Dev Tools, or a programming language client like Python's elasticsearch-py)
  • Basic understanding of JSON format

You can verify your cluster is running by sending a GET request to http://localhost:9200. A successful response includes cluster name, version, and node information.

Step 1: Understand the Index Concept

In Elasticsearch, an index is a collection of documents that share similar characteristics. Think of it as a database table in a relational system, but with key differences: documents within an index are schema-flexible, and each document has a unique ID.

Before indexing, you must decide whether to create an index explicitly or allow Elasticsearch to auto-create it. While auto-creation is convenient for development, production systems benefit from explicit index creation to define mappings, settings, and replicas upfront.

Step 2: Create an Index with Custom Settings and Mappings

Auto-created indices use dynamic mapping, which may not always align with your data structure. For example, Elasticsearch might infer a string field as a text type (analyzed) when you intended it as a keyword type (not analyzed) for filtering.

Use the PUT method to create an index with explicit mappings:

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "standard" },
      "description": { "type": "text", "analyzer": "english" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "tags": { "type": "keyword" }
    }
  }
}

Key elements explained:

  • number_of_shards: Determines how the index is split across nodes. More shards allow horizontal scaling but increase overhead.
  • number_of_replicas: Defines copies of each shard for fault tolerance. Set to 1 in production for redundancy.
  • refresh_interval: Controls how often new documents become searchable. Default is 1s; increase to 30s for bulk indexing to improve performance.
  • keyword: Used for exact matches, aggregations, and sorting. Ideal for IDs, categories, and status fields.
  • text: Used for full-text search. Analyzed using language-specific analyzers (e.g., English stemmer).
  • date: Supports multiple formats. Always define explicit formats to avoid parsing errors.

Step 3: Index a Single Document

Once the index is created, you can insert individual documents using the PUT or POST method.

Use PUT when you know the document ID:

PUT /products/_doc/1001
{
  "product_id": "SKU-1001",
  "name": "Wireless Bluetooth Headphones",
  "description": "High-fidelity sound with noise cancellation and 30-hour battery life.",
  "price": 129.99,
  "category": "Electronics",
  "created_at": "2024-03-15 10:30:00",
  "tags": ["audio", "wireless", "premium"]
}

Use POST when you want Elasticsearch to auto-generate the ID:

POST /products/_doc
{
  "product_id": "SKU-1002",
  "name": "Smart Fitness Watch",
  "description": "Tracks heart rate, sleep, and GPS location with water resistance.",
  "price": 199.5,
  "category": "Wearables",
  "created_at": "2024-03-16 08:15:00",
  "tags": ["fitness", "smartwatch", "health"]
}

Response includes metadata such as _index, _id, _version, and result (e.g., "created" or "updated").
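For reference, a successful response to the first PUT above looks roughly like this (exact shard counts and sequence numbers will vary by cluster):

{
  "_index": "products",
  "_id": "1001",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "_seq_no": 0,
  "_primary_term": 1
}

Repeating the same PUT would return "result": "updated" with an incremented _version.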

Step 4: Bulk Index Multiple Documents

Indexing documents one at a time is inefficient for large datasets. Use the Bulk API to index multiple documents in a single request, reducing network overhead and improving throughput.

The Bulk API requires a newline-delimited JSON (NDJSON) format. Each document is preceded by a metadata line specifying the action and target index:

POST /products/_bulk
{"index":{"_id":"1003"}}
{"product_id":"SKU-1003","name":"Smart Thermostat","price":249.99,"category":"Home Automation","created_at":"2024-03-16 12:45:00","tags":["smart","energy","IoT"]}
{"index":{"_id":"1004"}}
{"product_id":"SKU-1004","name":"4K Ultra HD TV","price":899.0,"category":"Electronics","created_at":"2024-03-15 14:20:00","tags":["tv","4k","media"]}
{"delete":{"_id":"1001"}}

Each line is processed independently. You can mix actions: index, create, update, and delete.

Important: The last line must end with a newline character. Failure to do so results in parsing errors.
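When generating the bulk body programmatically, the trailing newline is easy to forget. A minimal sketch of a hypothetical Python helper that serializes action/document pairs into a valid NDJSON payload:

```python
import json

def build_bulk_body(actions):
    """Serialize (metadata, document) pairs into an NDJSON bulk body.

    `actions` is an iterable of (meta, doc) tuples; `doc` may be None
    for actions such as delete that carry no document body.
    """
    lines = []
    for meta, doc in actions:
        lines.append(json.dumps(meta))
        if doc is not None:
            lines.append(json.dumps(doc))
    # The Bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ({"index": {"_id": "1003"}}, {"product_id": "SKU-1003", "price": 249.99}),
    ({"delete": {"_id": "1001"}}, None),
])
```

The resulting string can be sent as the body of POST /products/_bulk with the Content-Type header set to application/x-ndjson.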

Response returns a JSON object with errors (true/false) and a list of results for each action, including status codes and error messages if any.

Step 5: Verify Indexing Success

After indexing, confirm your data is stored and searchable:

  • Use GET /products/_count to check total document count.
  • Use GET /products/_search to retrieve all documents.
  • Use GET /products/_search?q=Bluetooth for simple keyword search.
  • Use GET /products/_mapping to inspect the current mapping.

For detailed insights, enable explain=true in search queries to see how scoring works:

GET /products/_search?explain=true
{
  "query": {
    "match": { "name": "Smart" }
  }
}

Step 6: Handle Errors and Retries

Indexing can fail due to:

  • Invalid JSON format
  • Mapping conflicts (e.g., field type mismatch)
  • Network timeouts
  • Cluster overload

Always validate your JSON before sending requests. Use tools like JSONLint or your IDE's validator.

For bulk operations, inspect the response for "error" fields. Example error response:

{
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "products",
        "_id": "1005",
        "error": {
          "type": "mapper_parsing_exception",
          "reason": "failed to parse field [price] of type [float] in document with id '1005'. Preview of field's value: 'invalid'"
        }
      }
    }
  ]
}

Implement retry logic with exponential backoff in your ingestion pipeline. For example, if a bulk request fails, split it into smaller batches and retry individually.
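The retry-with-backoff idea can be sketched generically. Here `send_bulk` is a hypothetical placeholder for whatever function performs the bulk request (e.g., a wrapper around the client's bulk call) and is assumed to raise an exception on failure:

```python
import time

def bulk_with_retry(send_bulk, batch, max_retries=5, base_delay=1.0):
    """Retry a bulk operation with exponential backoff.

    Waits base_delay * 2**attempt seconds between attempts and
    re-raises the last exception once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return send_bulk(batch)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

On repeated failure of a large batch, a common refinement is to split the batch in half and retry each half independently, isolating the offending documents.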

Step 7: Monitor Index Health and Performance

Use the following APIs to monitor your indices:

  • GET /_cat/indices?v: lists all indices with health, docs, size, and status.
  • GET /_cat/shards?v: shows shard distribution across nodes.
  • GET /_cluster/health?pretty: cluster-wide health status (green, yellow, red).
  • GET /products/_stats: index-level statistics (indexing rate, query latency, memory usage).

Green = all shards allocated. Yellow = primary shards allocated, replicas not. Red = some primary shards missing; requires immediate attention.

Best Practices

1. Define Mappings Explicitly

Never rely on dynamic mapping in production. Auto-generated mappings can lead to:

  • Incorrect field types (e.g., string as text instead of keyword)
  • Unintended tokenization (e.g., "New York" split into "new" and "york")
  • Mapping explosions from unstructured data

Always define mappings for critical fields: IDs, dates, enums, and numeric values. Use keyword for filtering and aggregation; use text only for full-text search.

2. Use Appropriate Shard Count

Shards are the unit of distribution and parallelization in Elasticsearch. Too few shards limit scalability; too many increase overhead.

Guidelines:

  • Start with 1–5 shards per index for small datasets (< 50GB).
  • For large indices (> 50GB), use 10–20 shards.
  • Aim for shard sizes between 10–50GB.
  • Never exceed 1,000 shards per node.

Shard count is fixed at index creation. Plan ahead.

3. Optimize Bulk Indexing

For high-volume ingestion:

  • Use bulk requests with 5–15MB per batch (not per document).
  • Disable refresh during bulk load: "refresh_interval": "-1"
  • Increase the node-level indexing buffer (indices.memory.index_buffer_size) if needed.
  • Use multiple threads (but avoid overloading the cluster).
  • Re-enable refresh and replica sync after bulk load: PUT /index/_settings { "refresh_interval": "30s", "number_of_replicas": 1 }
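Put together, a common pattern around a bulk load looks like this in the REST API (shown against the products index from earlier; dropping replicas to 0 during the load is optional but speeds up ingestion):

PUT /products/_settings
{ "refresh_interval": "-1", "number_of_replicas": 0 }

// ... run the bulk load ...

PUT /products/_settings
{ "refresh_interval": "30s", "number_of_replicas": 1 }

POST /products/_refresh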

4. Avoid Large Documents

Documents larger than 100MB can cause memory pressure and slow down indexing. Split large records into smaller, logically related documents.

Example: Instead of storing an entire product catalog with 100 variants in one document, create 100 separate documents with a common product_group_id.

5. Use Index Lifecycle Management (ILM)

For time-series data (logs, metrics), use ILM to automate index rollover, cold storage, and deletion.

Example ILM policy:

  • Hot phase: Index new data, high replicas, fast storage.
  • Warm phase: Reduce replicas, move to slower storage.
  • Cold phase: Read-only, archived.
  • Delete: Remove after 1 year.

ILM reduces operational overhead and storage costs.
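A policy implementing the phases above might look like the following (the policy name and phase timings are illustrative, not prescriptive):

PUT /_ilm/policy/example_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "allocate": { "number_of_replicas": 1 } }
      },
      "cold": {
        "min_age": "30d",
        "actions": { "readonly": {} }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}

The policy is then attached to indices via an index template setting (index.lifecycle.name).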

6. Secure Your Data

Enable Elasticsearch security features (X-Pack/Security):

  • Use HTTPS for all API calls.
  • Apply role-based access control (RBAC).
  • Restrict index creation to authorized users.
  • Log all indexing operations for audit.

7. Monitor and Alert on Indexing Latency

Set up monitoring for:

  • Indexing rate (docs/sec)
  • Queue size in thread pools
  • Slow log entries

Use Prometheus + Grafana or Elastic Observability to visualize metrics and trigger alerts when indexing slows below thresholds.

8. Test Mappings Before Production

Use the _analyze API to test how text is tokenized:

POST /products/_analyze

{

"text": "The quick brown fox jumps over the lazy dog",

"analyzer": "english"

}

Verify that stop words are removed, stems are correct, and no unwanted tokens are created.

Tools and Resources

Official Elasticsearch Tools

  • Kibana Dev Tools: Built-in console for executing API requests, testing queries, and visualizing data.
  • Elasticsearch Head (deprecated): Browser-based UI for managing clusters (use Kibana instead).
  • Elasticsearch-Curator: Python tool for managing indices (rollover, deletion, optimization).
  • Elastic Agent: Unified data collection agent for logs, metrics, and traces.

Third-Party Tools

  • Postman: For manual API testing and automation.
  • curl: Lightweight command-line tool for quick requests.
  • Logstash: Data processing pipeline for ingesting logs and transforming them before indexing.
  • Filebeat: Lightweight shipper for forwarding logs to Elasticsearch.
  • Apache NiFi: Data flow automation tool with Elasticsearch connectors.

Programming Language Clients

Use official Elasticsearch clients for seamless integration:

  • Python: elasticsearch-py (https://github.com/elastic/elasticsearch-py)
  • Java: elasticsearch-java (official Java client)
  • Node.js: @elastic/elasticsearch
  • .NET: Elastic.Clients.Elasticsearch
  • Go: github.com/elastic/go-elasticsearch

Learning Resources

Sample Data Sets for Practice

  • GitHub Archive: Public event logs (JSON format).
  • Movie Dataset: Popular JSON dataset with titles, genres, ratings.
  • Log Files: Nginx or Apache logs converted to JSON.
  • Elastic's Sample E-Commerce Data: Available in Kibana's sample data feature.

Real Examples

Example 1: Indexing Server Logs

Scenario: You're collecting application logs from 10 servers and want to index them in Elasticsearch for real-time monitoring.

Step 1: Define index template for logs:

PUT /_index_template/log_template
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s"
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "message": { "type": "text" },
        "host": { "type": "keyword" },
        "duration_ms": { "type": "long" }
      }
    }
  }
}

Step 2: Use Filebeat to ship logs to Elasticsearch:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

Step 3: Query logs in Kibana:

  • Find all ERROR logs: level: ERROR
  • Group by service: Use Lens visualization → Aggregation: Terms on service
  • Identify slow requests: duration_ms: > 5000
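The first of these, expressed as a REST query against the daily indices, would look like:

GET /app-logs-*/_search
{
  "query": { "term": { "level": "ERROR" } },
  "sort": [{ "timestamp": "desc" }]
}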

Example 2: E-Commerce Product Catalog

Scenario: You have a product database with 500,000 SKUs and want to enable fast search by name, category, and price range.

Step 1: Create index with optimized mappings (as shown in Step 2).

Step 2: Use Python to bulk index from a CSV:

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def load_products_from_csv(filename):
    with open(filename, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield {
                "_index": "products",
                "_id": row["product_id"],
                "_source": {
                    "product_id": row["product_id"],
                    "name": row["name"],
                    "description": row["description"],
                    "price": float(row["price"]),
                    "category": row["category"],
                    "created_at": row["created_at"],
                    "tags": row["tags"].split(",")
                }
            }

# Bulk index
helpers.bulk(es, load_products_from_csv("products.csv"))

Step 3: Implement search with filters:

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "wireless headphones" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 50, "lte": 200 } } },
        { "term": { "category": "Electronics" } }
      ]
    }
  },
  "sort": [{ "price": "asc" }]
}

Result: Sub-100ms response time with accurate filtering and sorting.

Example 3: Real-Time User Activity Tracking

Scenario: Track user clicks, page views, and session duration on a website.

Use a time-series index pattern: user-activity-2024.03.15

Each document:

{
  "user_id": "u12345",
  "session_id": "s67890",
  "event_type": "page_view",
  "url": "/products/1001",
  "timestamp": "2024-03-15T14:23:45Z",
  "duration": 120
}

Use ILM to roll over daily:

  • Index name: user-activity-{now/d}
  • Roll over when index size > 50GB or age > 24h
  • After 7 days: move to warm tier
  • After 90 days: delete

Benefits: Efficient storage, fast queries on recent data, automated cleanup.

FAQs

What is the difference between index and document in Elasticsearch?

An index is a collection of related documents, similar to a table in a relational database. A document is a single JSON record within that index, analogous to a row. Each document has a unique ID and can have a different structure (schema-less).

Can I change the mapping of an existing index?

No, you cannot modify field mappings after an index is created. To change a mapping, you must:

  1. Create a new index with the correct mapping.
  2. Reindex data from the old index to the new one using the _reindex API.
  3. Update aliases to point to the new index.
  4. Delete the old index.
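Steps 2 and 3 can be sketched with the REST API (products_v2 and the alias name are hypothetical; adapt them to your naming scheme):

POST /_reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products_v2" }
}

POST /_aliases
{
  "actions": [
    { "add": { "index": "products_v2", "alias": "products_current" } }
  ]
}

If clients query through the alias from the start, this swap is invisible to them.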

How do I handle duplicate documents during indexing?

Use the create action instead of index in bulk requests. If a document with the same ID already exists, Elasticsearch returns a 409 Conflict error. Alternatively, use op_type=create in PUT requests to enforce uniqueness.
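For example, this request succeeds only if no document with ID 1001 exists; otherwise it returns 409 Conflict:

PUT /products/_doc/1001?op_type=create
{ "product_id": "SKU-1001", "name": "Wireless Bluetooth Headphones" }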

Why is my indexing slow?

Common causes:

  • Too many shards (overhead)
  • Too many replicas during bulk load
  • Small bulk request sizes
  • Too-frequent refreshes (refresh_interval at the default 1s)
  • Insufficient heap memory or CPU
  • Network latency between client and cluster

Solutions: Increase bulk size, disable replicas temporarily, raise refresh interval, monitor resource usage.

Do I need to refresh after every index operation?

No. Elasticsearch refreshes indices automatically every second by default. For bulk operations, disable refresh ("refresh_interval": "-1") and manually trigger it once with POST /index/_refresh when done.

What happens if I index a document with a field not in the mapping?

If dynamic mapping is enabled (default), Elasticsearch adds the field automatically, inferring its type. This can lead to mapping conflicts later. Disable dynamic mapping with "dynamic": "strict" to reject unknown fields.
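A strict mapping can be declared at index creation (products_strict is an illustrative index name):

PUT /products_strict
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "product_id": { "type": "keyword" }
    }
  }
}

Indexing a document with any field other than product_id into this index is then rejected with a strict_dynamic_mapping_exception.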

Can I index data from a database?

Yes. Use tools like Logstash with JDBC input, or write a custom script (Python, Java) to query your database and bulk index results. Avoid direct database-to-Elasticsearch replication without transformation; ensure data consistency and handle updates/deletes properly.

Is indexing in Elasticsearch transactional?

No. Elasticsearch is eventually consistent. A document may not be immediately searchable after indexing due to refresh intervals. For strong consistency, use the ?refresh=true parameter, but this impacts performance.
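For example, refresh=wait_for (which blocks the request until the next scheduled refresh makes the document visible) is usually a gentler compromise than refresh=true:

PUT /products/_doc/1001?refresh=wait_for
{ "product_id": "SKU-1001", "name": "Wireless Bluetooth Headphones" }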

How do I delete an index?

Use: DELETE /index_name. To delete multiple indices: DELETE /index_*. Be cautious: this action is irreversible.

Conclusion

Indexing data in Elasticsearch is a foundational skill for anyone building search-driven applications, analytics platforms, or real-time monitoring systems. This guide has walked you through the entire lifecycle, from defining structured mappings and creating indices, to bulk-ingesting millions of records and optimizing performance for production workloads.

Remember: Indexing is not a one-time task. It requires thoughtful design, continuous monitoring, and iterative refinement. The choices you make today (shard count, field types, refresh intervals, and security policies) will directly impact scalability, speed, and reliability for years to come.

By following best practices, leveraging the right tools, and learning from real-world examples, you'll transform Elasticsearch from a black-box search engine into a powerful, predictable data backbone. Whether you're indexing logs, products, or user behavior, mastering indexing ensures your data is not just stored, but truly usable.

Start small. Test thoroughly. Scale intentionally. And never underestimate the power of a well-indexed dataset.