How to Aggregate Data in MongoDB
MongoDB is a powerful, document-oriented NoSQL database that excels in handling unstructured and semi-structured data at scale. One of its most robust features is the Aggregation Pipeline, a framework designed to process and transform data through a series of stages, enabling complex analytics, data cleaning, grouping, filtering, and reporting directly within the database. Unlike traditional SQL databases that rely heavily on JOINs and external tools for complex queries, MongoDB's aggregation framework allows developers and data analysts to perform sophisticated data operations natively, with high performance and minimal latency.
Aggregating data in MongoDB is essential for businesses that need to derive insights from vast collections of documents, whether it's analyzing user behavior, generating real-time dashboards, calculating sales trends, or auditing system logs. Without aggregation, extracting meaningful patterns from raw document data would require exporting data to external systems, increasing complexity, bandwidth usage, and response time. By mastering MongoDB aggregation, you unlock the ability to turn raw data into actionable intelligence without leaving the database layer.
This comprehensive guide walks you through every aspect of aggregating data in MongoDB, from foundational concepts to advanced pipeline construction, best practices, real-world use cases, and essential tools. Whether you're a developer building analytics features into your application or a data engineer optimizing reporting workflows, this tutorial will equip you with the knowledge to harness MongoDB's full aggregation potential.
Step-by-Step Guide
Understanding the Aggregation Pipeline
The MongoDB Aggregation Pipeline is a sequence of stages, where each stage processes documents and passes the results to the next. Each stage performs a specific operation such as filtering, grouping, sorting, or projecting fields. The pipeline operates on a collection of documents and returns a new set of documents as output.
Each stage is defined as an object in an array. The syntax is straightforward:
db.collection.aggregate([
{ $stage1: { parameters } },
{ $stage2: { parameters } },
...
])
For example, a basic pipeline that filters documents and then groups them might look like:
db.orders.aggregate([
{ $match: { status: "completed" } },
{ $group: { _id: "$customer_id", total: { $sum: "$amount" } } }
])
This pipeline first filters all orders with a status of completed, then groups them by customer ID and sums the total amount spent per customer.
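To build intuition for what this pipeline computes, here is the same filter-and-sum logic sketched in plain JavaScript; the sample orders array is made up for illustration:

```javascript
// Made-up sample data mirroring the orders collection above.
const orders = [
  { customer_id: "c1", status: "completed", amount: 100 },
  { customer_id: "c1", status: "completed", amount: 50 },
  { customer_id: "c2", status: "pending", amount: 75 },
  { customer_id: "c2", status: "completed", amount: 200 },
];

// $match: keep only completed orders.
const completed = orders.filter(o => o.status === "completed");

// $group: sum amount per customer_id.
const totals = {};
for (const o of completed) {
  totals[o.customer_id] = (totals[o.customer_id] || 0) + o.amount;
}
console.log(totals); // { c1: 150, c2: 200 }
```

The pending order is dropped by the filter, so it never contributes to a total, exactly as $match placed before $group keeps it out of the grouping stage.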
Core Aggregation Stages
There are over 30 aggregation stages in MongoDB, but mastering the most commonly used ones is key to building effective pipelines. Below are the essential stages you'll use daily.
$match
The $match stage filters documents based on specified conditions, similar to a WHERE clause in SQL. It should be placed as early as possible in the pipeline to reduce the number of documents processed downstream, improving performance.
db.products.aggregate([
{ $match: { category: "Electronics", price: { $gt: 100 } } }
])
This returns only products in the Electronics category with a price greater than $100.
$group
The $group stage aggregates documents by a specified identifier (typically _id) and calculates aggregated values such as sums, averages, counts, or maximum/minimum values.
db.sales.aggregate([
{ $group: {
_id: "$region",
totalSales: { $sum: "$amount" },
avgSale: { $avg: "$amount" },
count: { $sum: 1 }
}}
])
This groups sales data by region and calculates the total sales, average sale amount, and number of transactions per region.
$project
The $project stage reshapes each document in the stream: adding, removing, or renaming fields. It's useful for selecting only the data you need, reducing payload size, and preparing documents for subsequent stages.
db.users.aggregate([
{ $project: {
name: 1,
email: 1,
age: { $subtract: [ { $year: new Date() }, { $year: "$birthDate" } ] },
_id: 0
}}
])
This returns only the name, email, and calculated age of users, excluding the _id field.
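One caveat worth noting: subtracting years gives a calendar-year difference, not an exact age. A quick sketch in plain JavaScript shows the discrepancy (approxAge is a hypothetical helper mirroring the $subtract expression above, not a MongoDB function):

```javascript
// Mirrors the $subtract of two $year values in the $project stage:
// a calendar-year difference that ignores month and day.
function approxAge(birthDate, now) {
  return now.getUTCFullYear() - birthDate.getUTCFullYear();
}

// Someone born in December 2000 is reported as 24 throughout 2024,
// even before their birthday.
console.log(approxAge(new Date("2000-12-01"), new Date("2024-03-15"))); // 24
```

If exact age matters, compare full dates rather than years alone.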
$sort
The $sort stage orders documents by one or more fields. It's often used after $group to present results in a logical order.
db.sales.aggregate([
{ $group: {
_id: "$region",
totalSales: { $sum: "$amount" }
}},
{ $sort: { totalSales: -1 } }
])
This sorts regions by total sales in descending order, so the highest-performing region appears first.
$limit and $skip
The $limit stage restricts the number of documents passed to the next stage. $skip ignores the first N documents. Together, they enable pagination.
db.products.aggregate([
{ $sort: { price: 1 } },
{ $skip: 10 },
{ $limit: 5 }
])
This skips the first 10 cheapest products and returns the next 5.
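In practice, the skip value is usually derived from a page number: skip = (page - 1) * pageSize. A small helper makes the arithmetic explicit (the function name is illustrative, not a MongoDB API):

```javascript
// Builds the pagination stages for a 1-based page number.
// The returned objects are plain pipeline stages, ready to splice
// into db.products.aggregate([...]).
function paginationStages(page, pageSize) {
  return [
    { $skip: (page - 1) * pageSize },
    { $limit: pageSize },
  ];
}

const stages = paginationStages(3, 5); // page 3, 5 items per page
// stages[0] is { $skip: 10 }, stages[1] is { $limit: 5 }
```

Keep in mind that skip-based pagination scans all skipped documents; for deep pages, range queries on an indexed sort key scale better.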
$lookup
The $lookup stage performs a left outer join between two collections, similar to SQL JOINs. It's invaluable when you need to enrich documents with related data from another collection.
db.orders.aggregate([
{
$lookup: {
from: "customers",
localField: "customer_id",
foreignField: "_id",
as: "customerInfo"
}
},
{ $unwind: "$customerInfo" },
{ $project: {
orderDate: 1,
amount: 1,
customerName: "$customerInfo.name",
email: "$customerInfo.email"
}}
])
This joins orders with customer data, unwinds the resulting array (since $lookup returns an array), and projects only the desired fields.
$unwind
The $unwind stage deconstructs an array field from each input document, outputting one document per array element. This is often used after $lookup or when storing arrays of values (e.g., tags, categories, or items in an order).
db.articles.aggregate([
{ $unwind: "$tags" },
{ $group: {
_id: "$tags",
count: { $sum: 1 }
}}
])
This counts how many articles are tagged with each tag by exploding the tags array and grouping by each unique tag.
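The same explode-then-count logic, sketched over an in-memory array (the sample articles are invented for illustration):

```javascript
// Made-up articles; only the tags field matters for this sketch.
const articles = [
  { title: "A", tags: ["mongodb", "nosql"] },
  { title: "B", tags: ["mongodb"] },
  { title: "C", tags: ["nosql", "database"] },
];

// $unwind: one entry per array element; $group: count per distinct tag.
const tagCounts = {};
for (const tag of articles.flatMap(a => a.tags)) {
  tagCounts[tag] = (tagCounts[tag] || 0) + 1;
}
console.log(tagCounts); // { mongodb: 2, nosql: 2, database: 1 }
```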
$addFields and $set
The $addFields stage adds new fields to documents without removing existing ones. $set is an alias for $addFields introduced in MongoDB 4.2 and is functionally identical.
db.products.aggregate([
{ $addFields: {
discountedPrice: { $multiply: ["$price", 0.9] },
isExpensive: { $gt: ["$price", 500] }
}}
])
This adds two computed fields: a 10% discounted price and a boolean indicating whether the product is expensive.
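Conceptually, $addFields behaves like mapping over documents and spreading in new fields, as this plain JavaScript sketch shows (sample products invented for illustration):

```javascript
const products = [
  { name: "Laptop", price: 1000 },
  { name: "Mouse", price: 20 },
];

// Existing fields are kept; computed fields are appended, matching
// $addFields semantics (unlike $project, which drops unlisted fields).
const enriched = products.map(p => ({
  ...p,
  discountedPrice: p.price * 0.9,
  isExpensive: p.price > 500,
}));
console.log(enriched[0]);
// { name: 'Laptop', price: 1000, discountedPrice: 900, isExpensive: true }
```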
$bucket and $bucketAuto
These stages group documents into ranges or buckets. $bucket requires explicit boundaries; $bucketAuto automatically determines optimal ranges based on the number of buckets you specify.
db.sales.aggregate([
{
$bucketAuto: {
groupBy: "$amount",
buckets: 5
}
}
])
This divides sales amounts into 5 automatically determined ranges (e.g., $0 to $200, $201 to $400, and so on) and counts the documents in each range.
Building a Complete Aggregation Pipeline
Let's walk through building a realistic pipeline from scratch. Suppose you have a collection named transactions with the following schema:
{
"_id": ObjectId("..."),
"userId": "u123",
"amount": 250,
"currency": "USD",
"category": "Groceries",
"date": ISODate("2024-03-15T10:30:00Z"),
"merchant": "Walmart"
}
You want to generate a monthly spending report per user, showing total spent, average transaction, and top merchant, for transactions in 2024.
Here's the complete pipeline:
db.transactions.aggregate([
// 1. Filter for year 2024
{
$match: {
date: {
$gte: new Date("2024-01-01"),
$lt: new Date("2025-01-01")
}
}
},
// 2. Extract month and year from date
{
$addFields: {
month: { $month: "$date" },
year: { $year: "$date" }
}
},
// 3. Group by user and month
{
$group: {
_id: { userId: "$userId", month: "$month" },
totalSpent: { $sum: "$amount" },
avgTransaction: { $avg: "$amount" },
transactionCount: { $sum: 1 },
merchants: { $push: "$merchant" }
}
},
// 4. Find most frequent merchant per user-month ($sortArray requires MongoDB 5.2+)
{
$addFields: {
topMerchant: {
$arrayElemAt: [
{
$sortArray: {
input: {
$map: {
input: { $setUnion: ["$merchants"] },
as: "m",
in: { merchant: "$$m", count: { $size: { $filter: { input: "$merchants", cond: { $eq: ["$$this", "$$m"] } } } } }
}
},
sortBy: { count: -1 }
}
},
0
]
}
}
},
// 5. Project final output
{
$project: {
_id: 0,
userId: "$_id.userId",
month: "$_id.month",
totalSpent: 1,
avgTransaction: 1,
transactionCount: 1,
topMerchant: "$topMerchant.merchant"
}
},
// 6. Sort by user and month
{
$sort: { userId: 1, month: 1 }
}
])
This pipeline demonstrates several advanced techniques:
- Using $match to reduce dataset size early
- Extracting date components with $month and $year
- Grouping by compound keys (userId and month)
- Using $push to collect all merchants
- Calculating the most frequent merchant using $map, $filter, and $sortArray
- Final projection and sorting for clean output
While complex, this pipeline is efficient because it avoids multiple queries and external processing. All logic is handled in the database, minimizing network overhead and maximizing performance.
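The least obvious step above is step 4. Stripped of operator syntax, it is just a frequency count followed by a sort; here is the same idea in plain JavaScript (topMerchant is a hypothetical helper, not part of any driver):

```javascript
// Given the merchants array that $push collects per user-month,
// return the most frequent entry (ties broken arbitrarily,
// matching the pipeline's behavior).
function topMerchant(merchants) {
  const counts = {};
  for (const m of merchants) counts[m] = (counts[m] || 0) + 1;
  // Sort [merchant, count] pairs by count descending; take the first.
  return Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
}

console.log(topMerchant(["Walmart", "Target", "Walmart"])); // "Walmart"
```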
Using the Aggregation Pipeline in Different Environments
MongoDB aggregation isn't limited to the MongoDB Shell. You can execute pipelines in multiple environments:
- MongoDB Shell (mongosh): Ideal for testing and ad-hoc queries. Use db.collection.aggregate([...]).
- MongoDB Compass: A GUI tool with a visual aggregation pipeline builder. Drag and drop stages, preview results in real time, and export the pipeline as code.
- Node.js (MongoDB Driver): Use collection.aggregate(pipeline).toArray().
- Python (PyMongo): Use collection.aggregate(pipeline).
- Java, .NET, Go, etc.: All official MongoDB drivers support aggregation pipelines with the same syntax.
For production applications, always use your application's driver to execute pipelines. Never expose raw aggregation code to end users; validate inputs and sanitize parameters to prevent injection attacks.
Best Practices
Order Stages for Maximum Efficiency
The order of stages in your pipeline dramatically impacts performance. Follow these principles:
- Use $match early: Filter documents as soon as possible to reduce the number of documents flowing through subsequent stages.
- Use $project early: Remove unnecessary fields to reduce memory and network usage.
- Avoid $unwind before $match if possible: Unwinding arrays increases document count. If you can filter before unwinding, do so.
- Place $sort after $group: Sorting after grouping avoids sorting large intermediate datasets.
- Use $limit to cap results: If you only need the top 10 results, apply $limit early to reduce downstream processing.
Use Indexes Strategically
Indexes can dramatically speed up $match and $sort stages. MongoDB can use indexes for:
- $match conditions
- $sort fields (if the sort matches the index order)
- Fields used in $group _id expressions
For example, if you frequently group by category and sort by date, create a compound index:
db.collection.createIndex({ category: 1, date: -1 })
Use explain("executionStats") to verify whether your pipeline is using indexes effectively:
db.collection.aggregate([...]).explain("executionStats")
Look for stage: "IXSCAN" in the output to confirm index usage.
Avoid Memory Limits
By default, MongoDB limits aggregation memory usage to 100MB per stage. If your pipeline exceeds this, you'll get an error along the lines of "Exceeded memory limit for $group, but didn't allow external sort." To handle larger datasets:
- Use $limit and $match to reduce document volume.
- Use $out or $merge to write intermediate results to a collection.
- Set allowDiskUse: true in your aggregation call to enable temporary disk storage:
db.collection.aggregate(pipeline, { allowDiskUse: true })
Enable this only when necessary, as disk-based aggregation is slower than in-memory processing.
Use $out and $merge for Persistent Results
If you need to store aggregation results for reuse (e.g., for dashboards or scheduled reports), use $out or $merge:
- $out: Replaces the entire target collection with the aggregation results.
- $merge: Merges results into an existing collection, updating or inserting documents based on a specified key.
db.transactions.aggregate([
{ $group: { _id: "$userId", total: { $sum: "$amount" } } },
{ $merge: { into: "user_totals", on: "_id" } }
])
This updates the user_totals collection with new totals, preserving existing documents not matched by the pipeline.
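The semantics of $merge's defaults (whenMatched: "merge", whenNotMatched: "insert") can be sketched as an upsert into a Map keyed by _id; the data here is invented for illustration:

```javascript
// Existing target collection (user_totals), keyed by _id.
const existing = new Map([
  ["u1", { _id: "u1", total: 100 }],
  ["u2", { _id: "u2", total: 50 }],
]);

// Fresh aggregation results to merge in.
const results = [
  { _id: "u1", total: 300 }, // matched: fields merged/overwritten
  { _id: "u3", total: 75 },  // not matched: inserted
];

for (const doc of results) {
  const prev = existing.get(doc._id) || {};
  existing.set(doc._id, { ...prev, ...doc });
}
// u1 is updated, u3 is inserted, and u2 (unmatched) is untouched.
```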
Use Pipeline Variables and Let for Readability
For complex expressions, use $let to define variables within stages:
{ $addFields: {
discount: {
$let: {
vars: { basePrice: "$price", discountRate: 0.1 },
in: { $multiply: ["$$basePrice", "$$discountRate"] }
}
}
}}
This improves readability and avoids repeating complex expressions.
Test with Small Datasets First
Always test your aggregation pipeline on a small subset of data before running it on production collections. Use $sample to extract a random subset:
db.collection.aggregate([
{ $sample: { size: 100 } },
{ $match: { ... } },
// ... rest of pipeline
])
This prevents performance issues and helps you debug logic before scaling.
Tools and Resources
MongoDB Compass
MongoDB Compass is the official GUI for MongoDB. Its visual aggregation pipeline builder lets you drag and drop stages, preview results in real time, and auto-generate the corresponding JavaScript code. It's ideal for learning, debugging, and prototyping pipelines without writing code.
MongoDB Atlas
MongoDB Atlas, the cloud-hosted version of MongoDB, provides built-in analytics features, including charting tools that auto-generate aggregation pipelines for visualizations. You can create dashboards based on real-time aggregations and export the underlying pipeline for use in applications.
Studio 3T
Studio 3T is a popular third-party MongoDB client with advanced aggregation pipeline tools, including a pipeline builder, debugger, and performance analyzer. It supports syntax highlighting, auto-completion, and execution history.
VS Code with MongoDB Extension
Install the MongoDB extension for VS Code to write, test, and format aggregation pipelines directly in your editor. It provides syntax highlighting, code snippets, and connection management.
Online Aggregation Playground
Use MongoPlayground.net to share and test aggregation pipelines with sample data. It's perfect for asking questions on forums or demonstrating solutions to colleagues.
Official Documentation
Always refer to the MongoDB Aggregation Documentation for the most accurate, up-to-date information on stages, operators, and behavior changes across versions.
Community Resources
- MongoDB Developer Community: forums.mongodb.com
- Stack Overflow: Search for the [mongodb-aggregation] tag
- GitHub Repositories: Many open-source projects use aggregation pipelines; study their implementations.
Real Examples
Example 1: E-Commerce Sales Dashboard
Scenario: You run an e-commerce platform and need a daily sales summary by product category.
Collection: orders
{
"_id": ObjectId("..."),
"orderId": "ORD-2024-001",
"items": [
{ "productId": "P100", "quantity": 2, "price": 50 },
{ "productId": "P101", "quantity": 1, "price": 120 }
],
"orderDate": ISODate("2024-03-15T14:22:00Z"),
"status": "completed"
}
Pipeline:
db.orders.aggregate([
{ $match: { status: "completed", orderDate: { $gte: new Date("2024-03-15"), $lt: new Date("2024-03-16") } } },
{ $unwind: "$items" },
{
$group: {
_id: "$items.productId",
totalRevenue: { $sum: { $multiply: ["$items.quantity", "$items.price"] } },
totalUnitsSold: { $sum: "$items.quantity" },
orderCount: { $sum: 1 }
}
},
{
$lookup: {
from: "products",
localField: "_id",
foreignField: "_id",
as: "productInfo"
}
},
{ $unwind: "$productInfo" },
{
$project: {
_id: 0,
category: "$productInfo.category",
totalRevenue: 1,
totalUnitsSold: 1,
orderCount: 1
}
},
{
$group: {
_id: "$category",
totalRevenue: { $sum: "$totalRevenue" },
totalUnitsSold: { $sum: "$totalUnitsSold" },
totalOrders: { $sum: "$orderCount" }
}
},
{ $sort: { totalRevenue: -1 } }
])
Output:
[
{ "_id": "Electronics", "totalRevenue": 4500, "totalUnitsSold": 85, "totalOrders": 42 },
{ "_id": "Books", "totalRevenue": 1200, "totalUnitsSold": 30, "totalOrders": 25 },
{ "_id": "Clothing", "totalRevenue": 890, "totalUnitsSold": 23, "totalOrders": 18 }
]
Example 2: User Activity Analytics
Scenario: Track daily active users (DAU) and session duration for a mobile app.
Collection: sessions
{
"userId": "u789",
"sessionId": "sess_123",
"start": ISODate("2024-03-15T08:00:00Z"),
"end": ISODate("2024-03-15T08:15:00Z"),
"platform": "iOS"
}
Pipeline:
db.sessions.aggregate([
{
$addFields: {
date: { $dateToString: { format: "%Y-%m-%d", date: "$start" } },
duration: { $subtract: ["$end", "$start"] }
}
},
{
$group: {
_id: { date: "$date", platform: "$platform" },
dau: { $sum: 1 },
avgDuration: { $avg: "$duration" },
totalDuration: { $sum: "$duration" }
}
},
{
$project: {
_id: 0,
date: "$_id.date",
platform: "$_id.platform",
dau: 1,
avgDuration: { $divide: ["$avgDuration", 60000] }, // Convert ms to minutes
totalDurationMinutes: { $divide: ["$totalDuration", 60000] }
}
},
{ $sort: { date: 1, platform: 1 } }
])
Output:
[
{ "date": "2024-03-15", "platform": "iOS", "dau": 1250, "avgDuration": 15.2, "totalDurationMinutes": 18998 },
{ "date": "2024-03-15", "platform": "Android", "dau": 2100, "avgDuration": 12.8, "totalDurationMinutes": 26880 }
]
Example 3: Log Analysis and Error Tracking
Scenario: Monitor application logs to detect frequent error types and their occurrence rate.
Collection: logs
{
"timestamp": ISODate("2024-03-15T10:05:00Z"),
"level": "ERROR",
"message": "Database connection timeout",
"service": "payment-service"
}
Pipeline:
db.logs.aggregate([
{
$match: {
level: "ERROR",
timestamp: { $gte: new Date(Date.now() - 86400000) } // Last 24 hours
}
},
{
$group: {
_id: { service: "$service", errorType: "$message" },
count: { $sum: 1 }
}
},
{
$sort: { count: -1 }
},
{
$limit: 10
},
{
$project: {
_id: 0,
service: "$_id.service",
errorType: "$_id.errorType",
occurrences: "$count"
}
}
])
Output:
[
{ "service": "payment-service", "errorType": "Database connection timeout", "occurrences": 87 },
{ "service": "user-service", "errorType": "Invalid token", "occurrences": 65 },
{ "service": "notification-service", "errorType": "SMTP server unreachable", "occurrences": 42 }
]
FAQs
What is the difference between find() and aggregate() in MongoDB?
find() retrieves documents that match a query and returns them as-is. It's simple and fast for basic filtering. aggregate() processes documents through multiple stages to transform, group, calculate, or join data. It's used for complex analytics and data manipulation beyond simple queries.
Can I use aggregation with sharded collections?
Yes, MongoDB supports aggregation on sharded collections. The query router (mongos) coordinates the pipeline across shards, collects results, and returns a unified response. However, stages like $group and $sort may require more resources, as data from multiple shards must be merged.
How do I debug a slow aggregation pipeline?
Use .explain("executionStats") to analyze performance. Look for:
- High number of documents scanned
- Missing index usage (no IXSCAN)
- Stages with high memory usage
- Long execution times in specific stages
Optimize by adding indexes, moving $match earlier, or reducing data volume with $project.
Is aggregation faster than doing the same logic in application code?
Generally, yes. Aggregation runs inside the database, eliminating network round trips and serialization overhead. It leverages MongoDB's optimized C++ engine and can utilize indexes. Application-level processing requires transferring large datasets, which is slower and consumes more bandwidth.
Can I update documents using aggregation?
Aggregation itself doesn't update documents. However, you can use $out or $merge to write results to a collection, effectively replacing or updating data. Since MongoDB 4.2, updateOne() and updateMany() can also accept an aggregation pipeline as the update document, letting you compute new field values in place.
What happens if an aggregation stage fails?
If any stage in the pipeline throws an error (e.g., invalid operator, missing field), the entire pipeline aborts and returns an error. Always validate your data schema and test with edge cases before deploying to production.
Are there limits to the number of stages in a pipeline?
MongoDB limits aggregation pipelines to 1,000 stages (enforced since MongoDB 5.0). Long before you approach that limit, though, it's best practice to keep pipelines under 10 to 15 stages for readability and maintainability.
Can I use aggregation to create new collections?
Yes. The $out stage writes the entire result set to a new or existing collection, replacing it. The $merge stage allows more flexible updates: inserting, updating, or replacing documents based on matching keys.
Conclusion
Aggregating data in MongoDB is not just a feature; it's a shift in how you think about data processing. Instead of extracting, transforming, and loading (ETL) data into external systems, you can perform complex analytics directly within the database, reducing latency, minimizing data movement, and improving scalability. From simple filtering and grouping to advanced joins, array manipulations, and dynamic field calculations, MongoDB's aggregation framework offers remarkable flexibility for modern data applications.
Mastering aggregation requires practice, but the payoff is immense: faster applications, cleaner code, and deeper insights from your data. By following the best practices outlined here (ordering stages efficiently, leveraging indexes, using $out and $merge for persistence, and testing rigorously) you'll build pipelines that are not only powerful but also performant and maintainable.
As data volumes continue to grow and real-time analytics become table stakes, the ability to aggregate data natively in MongoDB will remain a critical skill for developers, data engineers, and analysts alike. Start small: experiment with $match and $group. Then gradually incorporate $lookup, $unwind, and $project. With time, you'll be crafting sophisticated pipelines that turn raw documents into intelligent, actionable insights.
Remember: the best aggregation pipeline is the one that delivers the right answer, quickly, reliably, and with minimal resource usage. Keep testing, keep optimizing, and let your data speak.