SAP Data Pipeline Automation - ETL Infrastructure for Bee360 Integration
Situation
Processing SAP flat files before importing them into the Bee360 database required manual data quality validation and file processing workflows. The SAP data contained problematic characters (human input errors such as stray ! and " marks) that would break the Bee360 import interface, causing import failures and requiring manual intervention. Without automated data quality infrastructure, each file import was a manual, error-prone process.
The challenge was compounded by the need to handle multiple file types (OPEX, CAPEX, and CAPEX depreciation) with different headers and requirements, to automate daily processing workflows that combine daily files with previous month data, and to provide tools for comparing files and flagging problematic characters that appear in new files but were absent from earlier ones. The infrastructure also needed to be replicable and version-controlled to ensure consistent data processing across environments.
Additionally, the system needed to handle Windows path separators (backslashes), preserve data integrity during cleanup, comply with Bee360 import requirements (even empty files must carry headers), and support both compressed (.gz) and uncompressed files.
Task
The goal was to engineer Ruby-based ETL (Extract, Transform, Load) infrastructure automation for processing SAP flat files. The solution needed to:
- Implement automated data pipeline with data quality validation
- Handle problematic characters in SAP data that break Bee360 import interface
- Support multiple file types (OPEX, CAPEX, CAPEX depreciation) with different headers
- Automate file processing workflows (import, cleanup, combine, archive)
- Provide tools for file comparison and character analysis
- Automate daily processing workflows combining daily files with previous month data
- Version-control all automation scripts in Git repository for replicability
Success criteria included: automated data quality validation preventing import failures, support for 3 file types with standardized processing, automated daily job orchestration, proactive issue detection through character analysis, and complete infrastructure replicability from the Git repository.
Action
Data Quality Infrastructure
Implemented automated data quality validation to handle problematic characters:
- clean_sap_files Function: Removes specific problematic characters from CSV files as part of the automated ETL pipeline (a sketch follows this list)
- Character Analysis Tool: analyze_problematic_characters identifies issues before processing, enabling proactive data quality validation
- CSV Structure Preservation: Preserves CSV structure during cleanup, ensuring data integrity
- Multi-Format Support: Processes both compressed (.gz) and uncompressed files
- Version-Controlled Functions: All data quality functions version-controlled for replicability
- Result: Prevents Bee360 import failures by removing problematic characters while preserving data integrity
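A minimal sketch of how the cleanup and analysis steps could look, assuming the problematic characters are kept in a single configurable constant and that files may arrive either gzipped or plain. The constant name, the per-file clean_sap_file variant, and the helper structure are illustrative, not the exact implementation:

```ruby
require "zlib"

# Characters known to break the Bee360 import (illustrative list).
PROBLEMATIC_CHARS = ['!', '"'].freeze

# Read a file's raw text, transparently handling .gz and plain files.
def read_raw(path)
  if path.end_with?(".gz")
    Zlib::GzipReader.open(path, &:read)
  else
    File.read(path)
  end
end

# Report which problematic characters actually appear in a file, so issues
# can be flagged before the import rather than after it fails.
def analyze_problematic_characters(path)
  raw = read_raw(path)
  PROBLEMATIC_CHARS.select { |ch| raw.include?(ch) }
end

# Strip problematic characters line by line. Delimiters and line breaks are
# left untouched, so the CSV column structure is preserved.
def clean_sap_file(input_path, output_path)
  cleaned = read_raw(input_path).each_line.map do |line|
    line.delete(PROBLEMATIC_CHARS.join)
  end
  File.write(output_path, cleaned.join)
end
```

Because the cleanup works on raw lines rather than parsed fields, a stray quote mark cannot derail a CSV parser before it has been removed.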
Automated ETL Pipeline Infrastructure
Built comprehensive ETL pipeline with pattern-based file classification:
- Pattern-Based Classification: Fast file type identification for automated processing
- Type-Specific Header Definitions:
  - OPEX: 36 columns
  - CAPEX: 23 columns
  - CAPEX depreciation: 19 columns
- Standardized Data Processing: Type-specific headers enable consistent processing across file types
- create_empty_csv Function: Adds proper headers to empty files, preventing Bee360 import failures (see the sketch after this list)
- Automated Archive Creation: Timestamped archives for data retention
- Modular Design: Concerns separated into distinct stages (import, cleanup, combine, archive)
- Daily Job Automation: Enables scheduled processing workflows
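A hedged sketch of the pattern-based classification and empty-file header handling. The filename patterns and the truncated header arrays are placeholders; the real definitions carry 36 (OPEX), 23 (CAPEX), and 19 (CAPEX depreciation) columns:

```ruby
require "csv"

# Filename patterns used to classify incoming SAP files (patterns are illustrative).
FILE_PATTERNS = {
  opex:       /OPEX/i,
  capex:      /CAPEX(?!.*DEPR)/i,
  capex_depr: /CAPEX.*DEPR/i
}.freeze

# Type-specific headers, truncated here for illustration.
HEADERS = {
  opex:       %w[company_code cost_center gl_account amount],
  capex:      %w[company_code asset_number amount],
  capex_depr: %w[company_code asset_number depreciation_amount]
}.freeze

# Classify a file by its name; returns :opex, :capex, :capex_depr, or nil.
def classify_file(path)
  name = File.basename(path)
  FILE_PATTERNS.find { |_type, pattern| name.match?(pattern) }&.first
end

# Bee360 rejects empty files without headers, so write the header row
# for the detected type into an otherwise empty CSV.
def create_empty_csv(path, type)
  CSV.open(path, "w") { |csv| csv << HEADERS.fetch(type) }
end
```

Classification by filename pattern keeps the hot path cheap: no file needs to be opened just to decide which headers and processing rules apply.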
File Comparison and Character Analysis Tools
Created tools for proactive issue detection:
- find_new_chars_missing_on_previous_file: Single-file comparison identifying new problematic characters (a sketch follows this list)
- compare_files_between_two_folders_and_find_new_unique_chars: Folder-level comparison for comprehensive analysis
- Character Analysis Output: Clear output showing problematic characters, enabling proactive detection before import failures
- Result: Enables proactive issue detection, preventing import failures before they occur
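A sketch of how these comparison functions might be built, assuming a plain character-set difference between a new file and its previous counterpart; gzip handling is omitted here for brevity, and the function bodies are illustrative rather than the exact implementation:

```ruby
# Report characters that appear in the new file but not in the previous one --
# the likely candidates to break the next Bee360 import.
def find_new_chars_missing_on_previous_file(new_path, previous_path)
  new_chars      = File.read(new_path).chars.uniq
  previous_chars = File.read(previous_path).chars.uniq
  new_chars - previous_chars
end

# Folder-level variant: compare files that share a name across two folders
# and collect any new characters per file.
def compare_files_between_two_folders_and_find_new_unique_chars(new_dir, previous_dir)
  Dir.children(new_dir).sort.each_with_object({}) do |name, report|
    previous = File.join(previous_dir, name)
    next unless File.exist?(previous)

    diff = find_new_chars_missing_on_previous_file(File.join(new_dir, name), previous)
    report[name] = diff unless diff.empty?
  end
end
```

Running the folder-level comparison ahead of the daily import surfaces any newly introduced characters while there is still time to extend the cleanup list.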
Technical Implementation Details
Architecture Decisions:
- ETL pipeline automation (Extract from SAP SFTP, Transform with cleanup/validation/merge, Load to Bee360 import)
- Multi-stage processing (file import, data cleanup, file classification, file merging, header management, archive creation)
- Menu-driven interface for manual operations
- Automated daily job orchestration
- Version-controlled automation scripts in Git repository
Implementation Strategy:
- Pattern-based file classification for fast file type identification
- Type-specific header definitions for standardized data processing
- Automated data quality validation with character cleanup
- Timestamped archive directories for data retention
- Environment variable configuration for flexible directory paths (see the sketch after this list)
- Modular design with separated concerns (import, cleanup, combine, archive)
- Utility library for reusable data processing functions
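A small sketch of the environment-variable configuration and timestamped archiving described above; the variable names and default Windows paths are assumptions for illustration:

```ruby
require "fileutils"

# Directory locations come from environment variables so the same scripts run
# unchanged on different machines; variable names here are illustrative.
INPUT_DIR   = ENV.fetch("SAP_INPUT_DIR",   'C:\sap\incoming')
ARCHIVE_DIR = ENV.fetch("SAP_ARCHIVE_DIR", 'C:\sap\archive')

# Create a timestamped archive directory and copy processed files into it,
# so every run leaves a dated snapshot for data retention.
def archive_processed_files(files)
  stamp  = Time.now.strftime("%Y%m%d_%H%M%S")
  target = File.join(ARCHIVE_DIR, "archive_#{stamp}")
  FileUtils.mkdir_p(target)
  FileUtils.cp(files, target)
  target
end
```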
Core Scripts:
- menu.rb: Menu-driven interface with 7 options for manual and automated operations
- combine_and_archive.rb: File combination and archive creation automation
- daily_job.rb: Automated daily processing combining daily files with previous month data (a sketch of the combining step follows below)
Utility Functions: 20+ reusable utility functions for file operations, CSV processing, and character analysis
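A minimal sketch of the combining step at the heart of daily_job.rb, assuming the previous-month file already carries the header row; the function name, file layout, and paths are illustrative:

```ruby
require "csv"

# Append the daily rows to the month-to-date combined file, keeping a single
# header row, to produce the full-month file that Bee360 imports.
def combine_daily_with_previous_month(daily_path, previous_month_path, output_path)
  previous_rows = CSV.read(previous_month_path)    # header + rows so far this month
  daily_rows    = CSV.read(daily_path)[1..] || []  # drop the daily file's header

  CSV.open(output_path, "w") do |out|
    (previous_rows + daily_rows).each { |row| out << row }
  end
end
```

Scheduling this step (for example via Windows Task Scheduler, per the technical stack) turns the month-to-date file into a rolling artifact that is always ready for import.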
Observability/Operations:
- Console output for all operations (file counts, processing status)
- File listing before operations (transparency)
- Error messages for failed operations
- User confirmation prompts for destructive operations (see the sketch below)
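A small sketch of the listing-and-confirmation pattern for destructive operations; the prompt wording and helper name are illustrative:

```ruby
# List the files an operation is about to touch and ask for confirmation
# before anything destructive runs.
def confirm_destructive_operation(description, files)
  puts "#{description} will affect #{files.size} file(s):"
  files.each { |f| puts "  #{f}" }
  print "Proceed? (y/n): "
  answer = $stdin.gets&.strip&.downcase
  answer == "y"
end
```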
Result
ETL Infrastructure Achievement
The automated data pipeline from SAP to Bee360 includes data quality validation, eliminating manual file processing and reducing import failures. The pipeline handles the Extract (from SAP SFTP), Transform (cleanup/validation/merge), and Load (to Bee360 import) stages automatically.
Data Quality Automation Success
Character analysis and file comparison tools enable proactive issue detection, and the clean_sap_files function removes problematic characters before they can break the Bee360 import. The automated data quality validation preserves data integrity while preventing import failures.
Processing Automation Impact
3 core scripts (menu.rb, combine_and_archive.rb, daily_job.rb) orchestrate the ETL pipeline through 7 menu options for manual and automated operations, supporting 3 file types (OPEX, CAPEX, and CAPEX depreciation) with standardized processing. Automated daily processing combines daily files with previous month data to create full-month combined files.
Infrastructure Replicability Excellence
All automation scripts are version-controlled in a Git repository (15+ commits showing the evolution), with 20+ reusable utility functions for file operations, CSV processing, and character analysis, enabling consistent data processing across environments. The version-controlled approach ensures complete infrastructure replicability.
Daily Job Automation Success
Automated daily processing combines daily files with previous month data to create full-month combined files, enabling scheduled processing workflows. The automation eliminates manual daily processing tasks and reduces operational overhead.
Technical Stack
Ruby, CSV processing, Windows file system, Git, gzip compression, Windows Task Scheduler
Conclusion
The SAP Data Pipeline Automation project demonstrates replicable and resilient systems engineering through comprehensive ETL infrastructure automation. By implementing automated data quality validation, pattern-based file classification, and standardized processing workflows, the project eliminates manual file processing and reduces import failures. The holistic "full-stack" infrastructure view is evident in the integration of data extraction, transformation, and loading stages, while proactive issue detection through the character analysis tools reflects systematic problem-solving. Version-controlled automation scripts ensure complete infrastructure replicability, enabling consistent data processing across environments. This infrastructure investment (doing it right the first time) eliminates technical debt and reduces operational overhead, transforming data processing from manual, error-prone workflows into automated, reliable ETL infrastructure. The project showcases how systematic infrastructure engineering can deliver reliable data pipeline automation while maintaining data integrity and enabling proactive issue detection.