
Create a data lake of clickstream data


In this tutorial, you will learn how to build a data lake of website interaction events (clickstream data) using Pipelines.

Data lakes are a way to store large volumes of raw data in an object storage service such as R2. You can run queries over a data lake to analyze the raw events and generate product insights.

For this tutorial, you will build a landing page for an e-commerce website. Users can click around the website to view products or add them to the cart. As the user clicks on the page, events are sent to a pipeline. These events are "client-side": they are sent directly from a user's browser to your pipeline. Your pipeline automatically batches the ingested data, builds output files, and delivers them to an R2 bucket to build your data lake.

Prerequisites

  1. Install Node.js ↗.

Node.js version manager

Use a Node version manager like Volta ↗ or nvm ↗ to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.

1. Create a new Workers project

You will create a new Worker project that will use Static Assets to serve the HTML file. While you can use any front-end framework, this tutorial uses plain HTML and JavaScript to keep things simple. If you are interested in learning how to build and deploy a web application on Workers with Static Assets, you can refer to the Frameworks documentation.

Create a new Worker project by running the following commands:

Terminal window
npm create cloudflare@latest -- e-commerce-pipelines-client-side

For setup, select the following options:

  • For What would you like to start with?, choose Hello World example.
  • For Which template would you like to use?, choose Worker only.
  • For Which language do you want to use?, choose TypeScript.
  • For Do you want to use git for version control?, choose Yes.
  • For Do you want to deploy your application?, choose No (we will be making some changes before deploying).

Navigate to the e-commerce-pipelines-client-side directory:

Terminal window
cd e-commerce-pipelines-client-side

2. Create the website frontend

Using Workers Static Assets, you can serve the frontend of your application from your Worker. To enable Static Assets, add the following configuration to your wrangler.toml file:

[assets]
directory = "public"

Next, create a public directory and add an index.html file. The index.html file should contain the following HTML code:

public/index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>E-commerce Store</title>
<script src="https://cdn.tailwindcss.com"></script>
</head>
<body>
<nav class="bg-gray-800 text-white p-4">
<div class="container mx-auto flex justify-between items-center">
<a href="/" class="text-xl font-bold"> E-Commerce Demo </a>
<div class="space-x-4 text-gray-800">
<a href="#">
<button class="border border-input bg-white h-10 px-4 py-2 rounded-md">Cart</button>
</a>
<a href="#">
<button class="border border-input bg-white h-10 px-4 py-2 rounded-md">Login</button>
</a>
<a href="#">
<button class="border border-input bg-white h-10 px-4 py-2 rounded-md">Signup</button>
</a>
</div>
</div>
</nav>
<div class="container mx-auto px-4 py-8">
<h1 class="text-3xl font-bold mb-6">Our Products</h1>
<div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6" id="products">
<!-- This section repeats for each product -->
<!-- End of product section -->
</div>
</div>
<script>
// demo products
const products = [
{
id: 1,
name: 'Smartphone X',
desc: 'Latest model with advanced features',
cost: 799,
},
{
id: 2,
name: 'Laptop Pro',
desc: 'High-performance laptop for professionals',
cost: 1299,
},
{
id: 3,
name: 'Wireless Earbuds',
desc: 'True wireless earbuds with noise cancellation',
cost: 149,
},
{
id: 4,
name: 'Smart Watch',
desc: 'Fitness tracker and smartwatch combo',
cost: 199,
},
{
id: 5,
name: '4K TV',
desc: 'Ultra HD smart TV with HDR',
cost: 599,
},
{
id: 6,
name: 'Gaming Console',
desc: 'Next-gen gaming system',
cost: 499,
},
];
// function to render products
function renderProducts() {
console.log('Rendering products...');
const productContainer = document.getElementById('products');
productContainer.innerHTML = ''; // Clear existing content
products.forEach((product) => {
const productElement = document.createElement('div');
productElement.classList.add('rounded-lg', 'border', 'bg-card', 'text-card-foreground', 'shadow-sm');
productElement.innerHTML = `
<div class="flex flex-col space-y-1.5 p-6">
<h2 class="text-2xl font-semibold leading-none tracking-tight">${product.name}</h2>
</div>
<div class="p-6 pt-0">
<p>${product.desc}</p>
<p class="font-bold mt-2">$${product.cost}</p>
</div>
<div class="flex items-center p-6 pt-0 flex justify-between">
<button class="border px-4 py-2 rounded-md" onclick="handleClick('product_view', ${product.id})">View Details</button>
<button class="border px-4 py-2 rounded-md" onclick="handleClick('add_to_cart', ${product.id})">Add to Cart</button>
</div>
`;
productContainer.appendChild(productElement);
});
}
renderProducts();
// function to handle click events
async function handleClick(action, id) {
console.log(`Clicked ${action} for product with id ${id}`);
}
</script>
</body>
</html>

The above code does the following:

  • Uses Tailwind CSS to style the page.
  • Renders a list of products.
  • Adds a button to view the details of a product.
  • Adds a button to add a product to the cart.
  • Contains a handleClick function to handle the click events. This function logs the action and the product ID. In the next steps, you will create a pipeline and add the logic to send the click events to this pipeline.

3. Create an R2 Bucket

Create a new R2 bucket named clickstream-bucket to use as the sink for your pipeline. Open a terminal window, and run the following Wrangler CLI command:

Terminal window
npx wrangler r2 bucket create clickstream-bucket

4. Create a pipeline

You need to create a new pipeline and connect it to your R2 bucket.

Create a new pipeline clickstream-pipeline-client using the Wrangler CLI. Open a terminal window, and run the following command:

Terminal window
npx wrangler pipelines create clickstream-pipeline-client --r2-bucket clickstream-bucket --compression none --batch-max-seconds 5

When you run the command, you will be prompted to authorize Cloudflare Workers Pipelines to create R2 API tokens on your behalf. Your pipeline uses these tokens when loading data into your bucket. You can approve the request through the browser window that opens automatically.

✅ Successfully created Pipeline "clickstream-pipeline-client" with ID <PIPELINE_ID>
Id: <PIPELINE_ID>
Name: clickstream-pipeline-client
Sources:
HTTP:
Endpoint: https://<PIPELINE_ID>.pipelines.cloudflare.com
Authentication: off
Format: JSON
Worker:
Format: JSON
Destination:
Type: R2
Bucket: clickstream-bucket
Format: newline-delimited JSON
Compression: NONE
Batch hints:
Max bytes: 100 MB
Max duration: 5 seconds
Max records: 10,000,000
🎉 You can now send data to your Pipeline!
Send data to your Pipeline's HTTP endpoint:
curl "https://<PIPELINE_ID>.pipelines.cloudflare.com" -d '[{"foo": "bar"}]'

Make a note of your pipeline's URL. You will use this URL to send the clickstream data from the client side.

5. Generate clickstream data

You need to send clickstream data such as the timestamp, user_id, session_id, and device_info to your pipeline. You can generate this data on the client side. Add the following function inside the <script> tag in your public/index.html. This function extracts the device information from the browser's user agent string:

public/index.html
function extractDeviceInfo(userAgent) {
let browser = "Unknown";
let os = "Unknown";
let device = "Unknown";
// Extract browser
if (userAgent.includes("Firefox")) {
browser = "Firefox";
} else if (userAgent.includes("Chrome")) {
browser = "Chrome";
} else if (userAgent.includes("Safari")) {
browser = "Safari";
} else if (userAgent.includes("Opera") || userAgent.includes("OPR")) {
browser = "Opera";
} else if (userAgent.includes("Edge")) {
browser = "Edge";
} else if (userAgent.includes("MSIE") || userAgent.includes("Trident/")) {
browser = "Internet Explorer";
}
// Extract OS
if (userAgent.includes("Win")) {
os = "Windows";
} else if (userAgent.includes("Mac")) {
os = "MacOS";
} else if (userAgent.includes("Linux")) {
os = "Linux";
} else if (userAgent.includes("Android")) {
os = "Android";
} else if (userAgent.includes("iOS")) {
os = "iOS";
}
// Extract device
const mobileKeywords = [
"Android",
"webOS",
"iPhone",
"iPad",
"iPod",
"BlackBerry",
"Windows Phone",
];
device = mobileKeywords.some((keyword) => userAgent.includes(keyword))
? "Mobile"
: "Desktop";
return { browser, os, device };
}
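
If you want to check what this helper returns, you can temporarily log its output. The following snippet is purely illustrative and uses a hard-coded user agent string; in the next step, the real user agent comes from window.navigator.userAgent.

public/index.html
// Illustrative only: log what extractDeviceInfo returns for a sample user agent.
const sampleUserAgent =
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36";
console.log(extractDeviceInfo(sampleUserAgent));
// With the detection logic above, this logs: { browser: "Chrome", os: "Linux", device: "Mobile" }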

6. Send clickstream data to your pipeline

You will send the clickstream data to the pipeline from the client side. To do that, update the handleClick function to make a POST request to the pipeline URL with the event data. Replace <PIPELINE_URL> with the URL of your pipeline.

public/index.html
async function handleClick(action, id) {
console.log(`Clicked ${action} for product with id ${id}`);
const userAgent = window.navigator.userAgent;
const timestamp = new Date().toISOString();
const { browser, os, device } = extractDeviceInfo(userAgent);
const data = {
timestamp,
session_id: "1234567890abcdef", // For production use a unique session ID
user_id: "user" + Math.floor(Math.random() * 1000), // For production use a unique user ID
event_data: {
event_id: Math.floor(Math.random() * 1000),
event_type: action,
page_url: window.location.href,
timestamp,
product_id: id,
},
device_info: {
browser,
os,
device,
userAgent,
},
referrer: document.referrer,
};
try {
const response = await fetch("<PIPELINE_URL>", {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify([data]),
});
if (!response.ok) {
throw new Error("Failed to send data to pipeline");
}
} catch (error) {
console.error("Error sending data to pipeline:", error);
}
}

The handleClick function does the following:

  • Gets the device information using the extractDeviceInfo function.
  • Makes a POST request to the pipeline with the data.
  • Logs any errors that occur.

If you start the development server and open the application in your browser, you can see that the handleClick function is executed when you click the View Details or Add to Cart button.

Terminal window
npm run dev

However, no data is sent to the pipeline. Inspect the browser console to view the error message. The error is caused by CORS ↗. In the next step, you will update the CORS settings to allow the client-side JavaScript to send data to the pipeline.

7. Update CORS settings

By default, the HTTP ingestion endpoint for your pipeline does not allow cross-origin requests. You need to update the CORS settings to allow the client-side JavaScript to send data to the pipeline. To update the CORS settings, execute the following command:

Terminal window
npx wrangler pipelines update clickstream-pipeline-client --cors-origins http://localhost:8787

Now, when you run the development server locally and open the website in a browser, clickstream data is sent to the pipeline successfully. You can learn more about the CORS settings in the Specifying CORS settings documentation.
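
If you want to verify the CORS configuration without the browser, you can simulate the browser's preflight request with curl. This is an optional sanity check and assumes the ingestion endpoint answers standard CORS preflight (OPTIONS) requests; look for an Access-Control-Allow-Origin header echoing http://localhost:8787 in the response.

Terminal window
curl -i -X OPTIONS "https://<PIPELINE_ID>.pipelines.cloudflare.com" \
  -H "Origin: http://localhost:8787" \
  -H "Access-Control-Request-Method: POST"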

8. Deploy the application

To deploy the application, run the following command:

Terminal window
npm run deploy

This will deploy the application to the Cloudflare Workers platform.

🌀 Building list of assets...
🌀 Starting asset upload...
🌀 Found 1 new or modified static asset to upload. Proceeding with upload...
+ /index.html
Uploaded 1 of 1 assets
✨ Success! Uploaded 1 file (2.37 sec)
Total Upload: 25.73 KiB / gzip: 6.17 KiB
Worker Startup Time: 15 ms
Uploaded e-commerce-pipelines-client-side (11.79 sec)
Deployed e-commerce-pipelines-client-side triggers (7.60 sec)
https://<URL>.workers.dev
Current Version ID: <VERSION_ID>

You now need to update the pipeline's CORS settings again, this time including the URL of the newly deployed application. Run the command below, replacing <URL> with the URL of your application.

Terminal window
npx wrangler pipelines update clickstream-pipeline-client --cors-origins http://localhost:8787 https://<URL>.workers.dev

Now, you can access the application at the deployed URL. When you click on the View Details or Add to Cart button, the clickstream data will be sent to your pipeline.

9. View the data in R2

You can view the data in the R2 bucket. If you are not signed in to the Cloudflare dashboard, sign in and navigate to the R2 overview ↗ page.

Open the bucket you created for your pipeline in step 3. You will see files containing the clickstream data. These files are newline-delimited JSON, and each row in a file represents one click event. Download one of the files and open it in your preferred text editor to see the output:

{"timestamp":"2025-04-06T16:24:29.213Z","session_id":"1234567890abcdef","user_id":"user965","event_data":{"event_id":673,"event_type":"product_view","page_url":"https://<URL>.workers.dev/","timestamp":"2025-04-06T16:24:29.213Z","product_id":2},"device_info":{"browser":"Chrome","os":"Linux","device":"Mobile","userAgent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36"},"referrer":""}
{"timestamp":"2025-04-06T16:24:30.436Z","session_id":"1234567890abcdef","user_id":"user998","event_data":{"event_id":787,"event_type":"product_view","page_url":"https://<URL>.workers.dev/","timestamp":"2025-04-06T16:24:30.436Z","product_id":4},"device_info":{"browser":"Chrome","os":"Linux","device":"Mobile","userAgent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36"},"referrer":""}
{"timestamp":"2025-04-06T16:24:31.330Z","session_id":"1234567890abcdef","user_id":"user22","event_data":{"event_id":529,"event_type":"product_view","page_url":"https://<URL>.workers.dev/","timestamp":"2025-04-06T16:24:31.330Z","product_id":4},"device_info":{"browser":"Chrome","os":"Linux","device":"Mobile","userAgent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36"},"referrer":""}
{"timestamp":"2025-04-06T16:24:31.879Z","session_id":"1234567890abcdef","user_id":"user750","event_data":{"event_id":756,"event_type":"product_view","page_url":"https://<URL>.workers.dev/","timestamp":"2025-04-06T16:24:31.879Z","product_id":4},"device_info":{"browser":"Chrome","os":"Linux","device":"Mobile","userAgent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36"},"referrer":""}
{"timestamp":"2025-04-06T16:24:33.978Z","session_id":"1234567890abcdef","user_id":"user333","event_data":{"event_id":467,"event_type":"product_view","page_url":"https://<URL>.workers.dev/","timestamp":"2025-04-06T16:24:33.978Z","product_id":6},"device_info":{"browser":"Chrome","os":"Linux","device":"Mobile","userAgent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36"},"referrer":""}

10. Optional: Connect a query engine to your R2 bucket and query the data

Once you have collected the raw events in R2, you might want to query them to answer questions such as "how many product_view events occurred?". You can connect a query engine, such as MotherDuck, to your R2 bucket.

You can connect the bucket to MotherDuck in several ways, which you can learn about from the MotherDuck documentation ↗. In this tutorial, you will connect the bucket to MotherDuck using the MotherDuck dashboard.

Connect your bucket to MotherDuck

Before connecting the bucket to MotherDuck, you need to obtain the Access Key ID and Secret Access Key for the R2 bucket. You can find the instructions to obtain the keys in the R2 API documentation.

  1. Log in to the MotherDuck dashboard and select your profile.

  2. Navigate to the Secrets page.

  3. Select the Add Secret button and enter the following information:

    • Secret Name: Clickstream pipeline
    • Secret Type: Cloudflare R2
    • Access Key ID: ACCESS_KEY_ID (replace with the Access Key ID)
    • Secret Access Key: SECRET_ACCESS_KEY (replace with the Secret Access Key)
  4. Select the Add Secret button to save the secret.

Query the data

In this step, you will query the data stored in the R2 bucket using MotherDuck.

  1. Navigate back to the MotherDuck dashboard and select the + icon to add a new Notebook.

  2. Select the Add Cell button to add a new cell to the notebook.

  3. In the cell, enter the following query and select the Run button to execute the query:

SELECT count(*) FROM read_json_auto('r2://clickstream-bucket/**/*');

The query will return a count of all the events received.
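
Because each event is stored as a JSON object with nested fields, you can also break the counts down by field. The following query is a sketch that assumes DuckDB infers event_data as a struct, so its fields can be accessed with dot notation:

SELECT event_data.event_type AS event_type, count(*) AS events
FROM read_json_auto('r2://clickstream-bucket/**/*')
GROUP BY event_data.event_type
ORDER BY events DESC;

This would return one row per event type (for example, product_view and add_to_cart) with the number of events recorded for each.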

Conclusion

You have successfully created a Pipeline and used it to send clickstream data from the client. Through this tutorial, you've gained hands-on experience in:

  1. Creating a Workers project, using static assets
  2. Generating and capturing clickstream data
  3. Setting up a pipeline to ingest data into R2
  4. Deploying the application to Workers
  5. Using MotherDuck to query the data

You can find the source code of the application in the GitHub repository ↗.