A typical speech synthesis model first generates a mel-spectrogram from text and then requires a vocoder to convert that mel-spectrogram into an audible waveform. In recent years, the trend has shifted toward studies that generate the waveform from text end-to-end rather than relying on a pre-trained vocoder at inference time, both to exploit richer variance information in the raw speech waveform and to infer in a single step instead of two. Meanwhile, research on speech synthesis that generates waveforms containing emotion information is ongoing, and more robust performance can be achieved by incorporating emotion information during training. In this work, we propose a model that directly generates speech waveforms with more lively emotional expression using our proposed technique, Speech-SECat, together with conditional adversarial training.
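To make the term concrete: in conditional adversarial training, the discriminator scores a waveform jointly with its conditioning label (here, the target emotion), so the generator is pushed to produce speech whose emotional character matches that label. The sketch below is a minimal, generic LSGAN-style illustration of this idea, not the paper's Speech-SECat architecture; all names and hyperparameters (ConditionalDiscriminator, num_emotions, kernel sizes, etc.) are hypothetical choices for demonstration.

```python
import torch
import torch.nn as nn


class ConditionalDiscriminator(nn.Module):
    """Toy emotion-conditioned waveform discriminator (illustrative only,
    not the architecture proposed in this paper)."""

    def __init__(self, num_emotions: int = 5, emb_dim: int = 64):
        super().__init__()
        self.emotion_emb = nn.Embedding(num_emotions, emb_dim)
        self.conv = nn.Sequential(
            # Input channels: 1 waveform channel + emb_dim emotion channels
            nn.Conv1d(1 + emb_dim, 64, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            # Patch-wise real/fake scores over time
            nn.Conv1d(128, 1, kernel_size=3, padding=1),
        )

    def forward(self, wav: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # wav: (B, 1, T); broadcast the emotion embedding along the time axis
        emb = self.emotion_emb(emotion_id)                     # (B, emb_dim)
        emb = emb.unsqueeze(-1).expand(-1, -1, wav.size(-1))   # (B, emb_dim, T)
        return self.conv(torch.cat([wav, emb], dim=1))


def lsgan_d_loss(d, real, fake, emotion_id):
    """Least-squares GAN discriminator loss, conditioned on the emotion label."""
    return ((d(real, emotion_id) - 1) ** 2).mean() + (d(fake.detach(), emotion_id) ** 2).mean()


def lsgan_g_loss(d, fake, emotion_id):
    """Generator adversarial loss: fool the emotion-conditioned discriminator."""
    return ((d(fake, emotion_id) - 1) ** 2).mean()


if __name__ == "__main__":
    # Smoke test with random tensors standing in for real/generated waveforms
    d = ConditionalDiscriminator()
    real = torch.randn(2, 1, 8192)
    fake = torch.randn(2, 1, 8192)
    emo = torch.tensor([0, 3])
    print(lsgan_d_loss(d, real, fake, emo).item(), lsgan_g_loss(d, fake, emo).item())
```

Because the emotion label enters the discriminator's input, a generated waveform is only rewarded when it is both realistic and consistent with the requested emotion, which is the mechanism the abstract appeals to for livelier emotional expression.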