SAP Data Pipeline Automation - ETL Infrastructure for Bee360 Integration
Situation
Processing SAP flat files before importing them into the Bee360 database required manual data quality validation and file processing workflows. The SAP data contained problematic characters (human input errors such as stray ! and " marks) that would break the Bee360 import interface, causing import failures and requiring manual intervention. Without automated data quality infrastructure, each file import was a manual, error-prone process.
The challenge was compounded by the need to handle multiple file types (OPEX, CAPEX, and CAPEX depreciation) with different headers and requirements, to automate daily processing workflows that combine daily files with previous month data, and to provide tools for comparing files and flagging problematic characters that appear in new files but were absent from earlier ones. The infrastructure also needed to be replicable and version-controlled to ensure consistent data processing across environments.
Additionally, the system needed to handle Windows path separators (backslashes), preserve data integrity during cleanup, comply with Bee360 import requirements (even empty files must carry headers), and support both compressed (.gz) and uncompressed files.
Task
The goal was to engineer Ruby-based ETL (Extract, Transform, Load) infrastructure automation for processing SAP flat files. The solution needed to:
- Implement automated data pipeline with data quality validation
- Handle problematic characters in SAP data that break Bee360 import interface
- Support multiple file types (OPEX, CAPEX, CAPEX depreciation) with different headers
- Automate file processing workflows (import, cleanup, combine, archive)
- Provide tools for file comparison and character analysis
- Automate daily processing workflows combining daily files with previous month data
- Version-control all automation scripts in Git repository for replicability
Success criteria included: automated data quality validation preventing import failures, support for 3 file types with standardized processing, automated daily job orchestration, proactive issue detection through character analysis, and complete infrastructure replicability from the Git repository.
Action
Data Quality Infrastructure
Implemented automated data quality validation to handle problematic characters:
- clean_sap_files Function: Removes specific problematic characters from CSV files as part of the automated ETL pipeline (a sketch follows this list)
- Character Analysis Tool: analyze_problematic_characters identifies issues before processing, enabling proactive data quality validation
- CSV Structure Preservation: Preserves CSV structure during cleanup, ensuring data integrity
- Multi-Format Support: Processes both compressed (.gz) and uncompressed files
- Version-Controlled Functions: All data quality functions version-controlled for replicability
- Result: Prevents Bee360 import failures by removing problematic characters while preserving data integrity
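A minimal sketch of how the cleanup and analysis steps could look, assuming the problematic characters are kept in a single configurable constant and that files may arrive either gzipped or plain. The constant name, the per-file clean_sap_file variant, and the helper structure are illustrative, not the exact implementation:

```ruby
require "zlib"

# Characters known to break the Bee360 import (illustrative list).
PROBLEMATIC_CHARS = ['!', '"'].freeze

# Read a file's raw text, transparently handling .gz and plain files.
def read_raw(path)
  if path.end_with?(".gz")
    Zlib::GzipReader.open(path, &:read)
  else
    File.read(path)
  end
end

# Report which problematic characters actually appear in a file, so issues
# can be flagged before the import rather than after it fails.
def analyze_problematic_characters(path)
  raw = read_raw(path)
  PROBLEMATIC_CHARS.select { |ch| raw.include?(ch) }
end

# Strip problematic characters line by line. Delimiters and line breaks are
# left untouched, so the CSV column structure is preserved.
def clean_sap_file(input_path, output_path)
  cleaned = read_raw(input_path).each_line.map do |line|
    line.delete(PROBLEMATIC_CHARS.join)
  end
  File.write(output_path, cleaned.join)
end
```

Because the cleanup works on raw lines rather than parsed fields, a stray quote mark cannot derail a CSV parser before it has been removed.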
Automated ETL Pipeline Infrastructure
Built comprehensive ETL pipeline with pattern-based file classification:
- Pattern-Based Classification: Fast file type identification for automated processing
- Type-Specific Header Definitions:
  - OPEX: 36 columns
  - CAPEX: 23 columns
  - CAPEX depreciation: 19 columns
- Standardized Data Processing: Type-specific headers enable consistent processing across file types
- create_empty_csv Function: Adds proper headers to empty files, preventing Bee360 import failures (see the sketch after this list)
- Automated Archive Creation: Timestamped archives for data retention
- Modular Design: Concerns separated into distinct stages (import, cleanup, combine, archive)
- Daily Job Automation: Enables scheduled processing workflows
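A hedged sketch of the pattern-based classification and empty-file header handling. The filename patterns and the truncated header arrays are placeholders; the real definitions carry 36 (OPEX), 23 (CAPEX), and 19 (CAPEX depreciation) columns:

```ruby
require "csv"

# Filename patterns used to classify incoming SAP files (patterns are illustrative).
FILE_PATTERNS = {
  opex:       /OPEX/i,
  capex:      /CAPEX(?!.*DEPR)/i,
  capex_depr: /CAPEX.*DEPR/i
}.freeze

# Type-specific headers, truncated here for illustration.
HEADERS = {
  opex:       %w[company_code cost_center gl_account amount],
  capex:      %w[company_code asset_number amount],
  capex_depr: %w[company_code asset_number depreciation_amount]
}.freeze

# Classify a file by its name; returns :opex, :capex, :capex_depr, or nil.
def classify_file(path)
  name = File.basename(path)
  FILE_PATTERNS.find { |_type, pattern| name.match?(pattern) }&.first
end

# Bee360 rejects empty files without headers, so write the header row
# for the detected type into an otherwise empty CSV.
def create_empty_csv(path, type)
  CSV.open(path, "w") { |csv| csv << HEADERS.fetch(type) }
end
```

Classification by filename pattern keeps the hot path cheap: no file needs to be opened just to decide which headers and processing rules apply.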
File Comparison and Character Analysis Tools
Created tools for proactive issue detection:
- find_new_chars_missing_on_previous_file: Single-file comparison identifying new problematic characters (a sketch follows this list)
- compare_files_between_two_folders_and_find_new_unique_chars: Folder-level comparison for comprehensive analysis
- Character Analysis Output: Clear output showing problematic characters, enabling proactive detection before import failures
- Result: Enables proactive issue detection, preventing import failures before they occur
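A sketch of how these comparison functions might be built, assuming a plain character-set difference between a new file and its previous counterpart; gzip handling is omitted here for brevity, and the function bodies are illustrative rather than the exact implementation:

```ruby
# Report characters that appear in the new file but not in the previous one --
# the likely candidates to break the next Bee360 import.
def find_new_chars_missing_on_previous_file(new_path, previous_path)
  new_chars      = File.read(new_path).chars.uniq
  previous_chars = File.read(previous_path).chars.uniq
  new_chars - previous_chars
end

# Folder-level variant: compare files that share a name across two folders
# and collect any new characters per file.
def compare_files_between_two_folders_and_find_new_unique_chars(new_dir, previous_dir)
  Dir.children(new_dir).sort.each_with_object({}) do |name, report|
    previous = File.join(previous_dir, name)
    next unless File.exist?(previous)

    diff = find_new_chars_missing_on_previous_file(File.join(new_dir, name), previous)
    report[name] = diff unless diff.empty?
  end
end
```

Running the folder-level comparison ahead of the daily import surfaces any newly introduced characters while there is still time to extend the cleanup list.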
Technical Implementation Details
Architecture Decisions:
- ETL pipeline automation (Extract from SAP SFTP, Transform with cleanup/validation/merge, Load to Bee360 import)
- Multi-stage processing (file import, data cleanup, file classification, file merging, header management, archive creation)
- Menu-driven interface for manual operations
- Automated daily job orchestration
- Version-controlled automation scripts in Git repository
Implementation Strategy:
- Pattern-based file classification for fast file type identification
- Type-specific header definitions for standardized data processing
- Automated data quality validation with character cleanup
- Timestamped archive directories for data retention
- Environment variable configuration for flexible directory paths (see the sketch after this list)
- Modular design with separated concerns (import, cleanup, combine, archive)
- Utility library for reusable data processing functions
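A small sketch of the environment-variable configuration and timestamped archiving described above; the variable names and default Windows paths are assumptions for illustration:

```ruby
require "fileutils"

# Directory locations come from environment variables so the same scripts run
# unchanged on different machines; variable names here are illustrative.
INPUT_DIR   = ENV.fetch("SAP_INPUT_DIR",   'C:\sap\incoming')
ARCHIVE_DIR = ENV.fetch("SAP_ARCHIVE_DIR", 'C:\sap\archive')

# Create a timestamped archive directory and copy processed files into it,
# so every run leaves a dated snapshot for data retention.
def archive_processed_files(files)
  stamp  = Time.now.strftime("%Y%m%d_%H%M%S")
  target = File.join(ARCHIVE_DIR, "archive_#{stamp}")
  FileUtils.mkdir_p(target)
  FileUtils.cp(files, target)
  target
end
```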
Core Scripts:
- menu.rb: Menu-driven interface with 7 options for manual and automated operations
- combine_and_archive.rb: File combination and archive creation automation
- daily_job.rb: Automated daily processing combining daily files with previous month data (a sketch of the combining step follows below)
Utility Functions: 20+ reusable utility functions for file operations, CSV processing, and character analysis
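A minimal sketch of the combining step at the heart of daily_job.rb, assuming the previous-month file already carries the header row; the function name, file layout, and paths are illustrative:

```ruby
require "csv"

# Append the daily rows to the month-to-date combined file, keeping a single
# header row, to produce the full-month file that Bee360 imports.
def combine_daily_with_previous_month(daily_path, previous_month_path, output_path)
  previous_rows = CSV.read(previous_month_path)    # header + rows so far this month
  daily_rows    = CSV.read(daily_path)[1..] || []  # drop the daily file's header

  CSV.open(output_path, "w") do |out|
    (previous_rows + daily_rows).each { |row| out << row }
  end
end
```

Scheduling this step (for example via Windows Task Scheduler, per the technical stack) turns the month-to-date file into a rolling artifact that is always ready for import.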
Observability/Operations:
- Console output for all operations (file counts, processing status)
- File listing before operations (transparency)
- Error messages for failed operations
- User confirmation prompts for destructive operations (see the sketch below)
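A small sketch of the listing-and-confirmation pattern for destructive operations; the prompt wording and helper name are illustrative:

```ruby
# List the files an operation is about to touch and ask for confirmation
# before anything destructive runs.
def confirm_destructive_operation(description, files)
  puts "#{description} will affect #{files.size} file(s):"
  files.each { |f| puts "  #{f}" }
  print "Proceed? (y/n): "
  answer = $stdin.gets&.strip&.downcase
  answer == "y"
end
```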
Result
ETL Infrastructure Achievement
The automated data pipeline from SAP to Bee360 includes data quality validation, eliminating manual file processing and reducing import failures. The pipeline handles the Extract (from SAP SFTP), Transform (cleanup/validation/merge), and Load (to Bee360 import) stages automatically.
Data Quality Automation Success
Character analysis and file comparison tools enable proactive issue detection, and the clean_sap_files function removes problematic characters before they can break the Bee360 import. The automated data quality validation preserves data integrity while preventing import failures.
Processing Automation Impact
3 core scripts (menu.rb, combine_and_archive.rb, daily_job.rb) orchestrate the ETL pipeline through 7 menu options for manual and automated operations, supporting 3 file types (OPEX, CAPEX, and CAPEX depreciation) with standardized processing. Automated daily processing combines daily files with previous month data to create full-month combined files.
Infrastructure Replicability Excellence
All automation scripts are version-controlled in a Git repository (15+ commits showing the evolution), with 20+ reusable utility functions for file operations, CSV processing, and character analysis, enabling consistent data processing across environments. The version-controlled approach ensures complete infrastructure replicability.
Daily Job Automation Success
Automated daily processing combines daily files with previous month data to create full-month combined files, enabling scheduled processing workflows. The automation eliminates manual daily processing tasks and reduces operational overhead.
Technical Stack
Ruby, CSV processing, Windows file system, Git, gzip compression, Windows Task Scheduler
Conclusion
The SAP Data Pipeline Automation project demonstrates replicable and resilient systems engineering through comprehensive ETL infrastructure automation. By implementing automated data quality validation, pattern-based file classification, and standardized processing workflows, the project eliminates manual file processing and reduces import failures. The holistic "full-stack" infrastructure view is evident in the integration of data extraction, transformation, and loading stages, while proactive issue detection through the character analysis tools reflects systematic problem-solving. Version-controlled automation scripts ensure complete infrastructure replicability, enabling consistent data processing across environments. This infrastructure investment (doing it right the first time) eliminates technical debt and reduces operational overhead, transforming data processing from manual, error-prone workflows into automated, reliable ETL infrastructure. The project showcases how systematic infrastructure engineering can deliver reliable data pipeline automation while maintaining data integrity and enabling proactive issue detection.